OMOP Embeddings
omop-emb generates and retrieves vector embeddings for OMOP CDM concepts. It
works standalone out of the box (sqlite-vec, no external database required) and
optionally scales to PostgreSQL via the pgvector extension.
The package supports:
- Dynamic embedding model registration: multiple models per backend, tracked in the embedding database
- Embedding and lookup for OMOP concepts across configurable storage backends
- Two storage backends:
  - `sqlite-vec` (default): zero-config, file-based or in-memory; no external service required
  - `pgvector`: PostgreSQL with the pgvector extension (FLAT sequential scan or HNSW SQL index)
- FAISS sidecar on top of the `sqlite-vec` backend for approximate nearest-neighbour search
- CLI scripts to ingest OMOP CDM concepts and manage registered models
Installation
Install the backend you want to use:
```bash
pip install omop-emb                        # sqlite-vec only (default backend)
pip install "omop-emb[pgvector]"            # adds PostgreSQL/pgvector support
pip install "omop-emb[faiss-cpu]"           # adds FAISS sidecar support
pip install "omop-emb[pgvector,faiss-cpu]"  # everything
```
Environment Variables
Backend selector
| Variable | Default | Description |
|---|---|---|
| `OMOP_EMB_BACKEND` | `sqlitevec` | Backend to use: `sqlitevec` or `pgvector`. |
sqlite-vec connection
| Variable | Description |
|---|---|
| `OMOP_EMB_SQLITE_PATH` | Path to the sqlite-vec database file. Use `:memory:` for an in-memory database. |
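
As an illustration of how the selector and the sqlite-vec path fit together, the sketch below reads both from a mapping of environment variables. `read_backend_settings` is a hypothetical helper written for this example and is not part of omop-emb; how the package itself parses these variables may differ.

```python
import os
from typing import Mapping, Optional, Tuple

# Values documented for OMOP_EMB_BACKEND.
VALID_BACKENDS = ("sqlitevec", "pgvector")


def read_backend_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (backend, sqlite_path) from the environment."""
    backend = env.get("OMOP_EMB_BACKEND", "sqlitevec")  # documented default
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"OMOP_EMB_BACKEND must be one of {VALID_BACKENDS}, got {backend!r}"
        )
    # The path only matters for sqlite-vec; ":memory:" selects an in-memory database.
    # No default path is documented, so this sketch returns None when it is unset.
    sqlite_path = env.get("OMOP_EMB_SQLITE_PATH") if backend == "sqlitevec" else None
    return backend, sqlite_path


# Example: an in-memory sqlite-vec configuration.
settings = read_backend_settings({"OMOP_EMB_SQLITE_PATH": ":memory:"})
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test without touching the real process environment.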
pgvector connection (individual components)
| Variable | Default | Description |
|---|---|---|
| `OMOP_EMB_DB_HOST` | (none) | PostgreSQL host. |
| `OMOP_EMB_DB_PORT` | `5432` | PostgreSQL port. |
| `OMOP_EMB_DB_USER` | (none) | PostgreSQL user. |
| `OMOP_EMB_DB_PASSWORD` | (none) | PostgreSQL password. |
| `OMOP_EMB_DB_NAME` | (none) | PostgreSQL database name. |
| `OMOP_EMB_DB_DRIVER` | `postgresql+psycopg` | SQLAlchemy driver string. Override to use e.g. psycopg2. |
| `OMOP_EMB_DB_URL` | (none) | Full SQLAlchemy connection URL. Overrides all individual components above when set. |
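
The precedence between `OMOP_EMB_DB_URL` and the individual components can be sketched as follows. `resolve_pg_url` is a hypothetical helper for illustration only, not the package's own resolution code. It assumes that the variables without a documented default are required, and it URL-encodes the user and password so that special characters do not break the URL.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus

# Variables with no documented default; this sketch treats them as required.
REQUIRED = (
    "OMOP_EMB_DB_HOST",
    "OMOP_EMB_DB_USER",
    "OMOP_EMB_DB_PASSWORD",
    "OMOP_EMB_DB_NAME",
)


def resolve_pg_url(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: build a SQLAlchemy URL from OMOP_EMB_DB_* variables."""
    # A full URL, when set, takes precedence over the individual components.
    full_url = env.get("OMOP_EMB_DB_URL")
    if full_url:
        return full_url

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    driver = env.get("OMOP_EMB_DB_DRIVER", "postgresql+psycopg")  # documented default
    port = env.get("OMOP_EMB_DB_PORT", "5432")  # documented default
    user = quote_plus(env["OMOP_EMB_DB_USER"])
    password = quote_plus(env["OMOP_EMB_DB_PASSWORD"])
    host = env["OMOP_EMB_DB_HOST"]
    name = env["OMOP_EMB_DB_NAME"]
    return f"{driver}://{user}:{password}@{host}:{port}/{name}"


# Example with placeholder credentials (not real values).
example_url = resolve_pg_url(
    {
        "OMOP_EMB_DB_HOST": "localhost",
        "OMOP_EMB_DB_USER": "omop",
        "OMOP_EMB_DB_PASSWORD": "p@ss",
        "OMOP_EMB_DB_NAME": "embeddings",
    }
)
```

Setting `OMOP_EMB_DB_URL` is usually the simpler option when the connection string is already available, for instance from a secrets manager.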
Embedding API (CLI concept ingestion)
| Variable | Description |
|---|---|
| `OMOP_CDM_DB_URL` | SQLAlchemy URL for the OMOP CDM database. Required only for concept ingestion commands. |
| `OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX` | Task prefix prepended to concept texts at index time. |
| `OMOP_EMB_QUERY_EMBEDDING_PREFIX` | Task prefix prepended to search queries at query time. |
The prefix variables are optional and default to the empty string (`""`). They
are only needed for asymmetric embedding models (e.g. nomic-embed-text, E5, BGE)
that require different task prefixes for indexing versus querying.
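
To make the effect of the two prefixes concrete, here is a minimal sketch of how a prefix might be applied to a text before it is embedded. `with_task_prefix` is a hypothetical function for this example, not omop-emb's API, and plain string concatenation is an assumption about how the prefix is joined to the text. The example values follow the nomic-embed-text convention of `search_document: ` and `search_query: `.

```python
import os
from typing import Mapping


def with_task_prefix(
    text: str, *, is_query: bool, env: Mapping[str, str] = os.environ
) -> str:
    """Hypothetical helper: prepend the configured task prefix to a text."""
    name = (
        "OMOP_EMB_QUERY_EMBEDDING_PREFIX"
        if is_query
        else "OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX"
    )
    # Both variables default to "", so symmetric models see the text unchanged.
    return env.get(name, "") + text


# Example values in the style used by nomic-embed-text.
nomic_env = {
    "OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX": "search_document: ",
    "OMOP_EMB_QUERY_EMBEDDING_PREFIX": "search_query: ",
}

indexed_text = with_task_prefix("Type 2 diabetes mellitus", is_query=False, env=nomic_env)
query_text = with_task_prefix("high blood sugar", is_query=True, env=nomic_env)
```

Because the sketch concatenates without adding a separator, any trailing space has to be part of the prefix value itself.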