OMOP Embeddings

omop-emb generates and retrieves vector embeddings for OMOP CDM concepts. It works standalone out of the box (sqlite-vec, no external database required) and optionally scales to PostgreSQL via the pgvector extension.

The package supports:

  • Dynamic embedding model registration: multiple models per backend, tracked in the embedding database
  • Embedding and lookup for OMOP concepts across configurable storage backends
  • Two storage backends:
    • sqlite-vec (default): zero-config, file-based or in-memory, with no external service required
    • pgvector: PostgreSQL with the pgvector extension (FLAT sequential scan or HNSW SQL index)
  • An optional FAISS sidecar on top of the sqlite-vec backend for approximate nearest-neighbour search
  • CLI scripts to ingest OMOP CDM concepts and manage registered models

Installation

Install the package with the extras for the backend you want to use:

pip install omop-emb                       # sqlite-vec only (default backend)
pip install "omop-emb[pgvector]"           # adds PostgreSQL/pgvector support
pip install "omop-emb[faiss-cpu]"          # adds FAISS sidecar support
pip install "omop-emb[pgvector,faiss-cpu]" # everything

Environment Variables

Backend selector

| Variable | Default | Description |
| --- | --- | --- |
| OMOP_EMB_BACKEND | sqlitevec | Backend to use: sqlitevec or pgvector. |

sqlite-vec connection

| Variable | Description |
| --- | --- |
| OMOP_EMB_SQLITE_PATH | Path to the sqlite-vec database file. Use :memory: for an in-memory database. |
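
As an illustration of how these two variables interact, the sketch below reads them from a mapping and applies the documented default for the backend. The helper name `read_backend_settings` is hypothetical and is not part of omop-emb. No default is documented for OMOP_EMB_SQLITE_PATH, so the sketch returns None when it is unset.

```python
import os

# The two backend names documented for OMOP_EMB_BACKEND.
VALID_BACKENDS = {"sqlitevec", "pgvector"}


def read_backend_settings(env=None):
    """Hypothetical helper: return (backend, sqlite_path) from a mapping.

    Uses os.environ when no mapping is given. This mirrors the documented
    variables only; it is not omop-emb's own configuration code.
    """
    env = os.environ if env is None else env

    # Documented default backend is sqlitevec.
    backend = env.get("OMOP_EMB_BACKEND", "sqlitevec")
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"OMOP_EMB_BACKEND must be one of {sorted(VALID_BACKENDS)}, got {backend!r}"
        )

    # No default is documented for the path, so None means "not set".
    sqlite_path = env.get("OMOP_EMB_SQLITE_PATH")
    return backend, sqlite_path


if __name__ == "__main__":
    # Example: an in-memory sqlite-vec database.
    print(read_backend_settings({"OMOP_EMB_SQLITE_PATH": ":memory:"}))
```

Passing a plain dictionary instead of relying on os.environ keeps the sketch easy to test without touching the real process environment.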

pgvector connection (individual components)

| Variable | Default | Description |
| --- | --- | --- |
| OMOP_EMB_DB_HOST | | PostgreSQL host. |
| OMOP_EMB_DB_PORT | 5432 | PostgreSQL port. |
| OMOP_EMB_DB_USER | | PostgreSQL user. |
| OMOP_EMB_DB_PASSWORD | | PostgreSQL password. |
| OMOP_EMB_DB_NAME | | PostgreSQL database name. |
| OMOP_EMB_DB_DRIVER | postgresql+psycopg | SQLAlchemy driver string. Override to use e.g. psycopg2. |
| OMOP_EMB_DB_URL | | Full SQLAlchemy connection URL. Overrides all individual components above when set. |
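
The precedence between OMOP_EMB_DB_URL and the individual components can be sketched as follows. This is a minimal illustration, not omop-emb's actual resolution code: the function name `resolve_pg_url` is hypothetical, and treating host, user, password, and database name as required when no full URL is given is an assumption, since the documentation lists no defaults for them.

```python
import os
from urllib.parse import quote


def resolve_pg_url(env=None):
    """Hypothetical helper: build a SQLAlchemy-style PostgreSQL URL.

    OMOP_EMB_DB_URL wins when set; otherwise the URL is assembled from the
    individual OMOP_EMB_DB_* components using the documented defaults.
    """
    env = os.environ if env is None else env

    # A full URL overrides every individual component.
    full_url = env.get("OMOP_EMB_DB_URL")
    if full_url:
        return full_url

    driver = env.get("OMOP_EMB_DB_DRIVER", "postgresql+psycopg")
    port = env.get("OMOP_EMB_DB_PORT", "5432")

    # Assumption: these four have no default and must be provided.
    required = ("OMOP_EMB_DB_HOST", "OMOP_EMB_DB_USER",
                "OMOP_EMB_DB_PASSWORD", "OMOP_EMB_DB_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")

    # Percent-encode credentials so characters like '@' or '/' stay valid in a URL.
    user = quote(env["OMOP_EMB_DB_USER"], safe="")
    password = quote(env["OMOP_EMB_DB_PASSWORD"], safe="")
    host = env["OMOP_EMB_DB_HOST"]
    name = env["OMOP_EMB_DB_NAME"]
    return f"{driver}://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    print(resolve_pg_url({
        "OMOP_EMB_DB_HOST": "localhost",
        "OMOP_EMB_DB_USER": "omop",
        "OMOP_EMB_DB_PASSWORD": "p@ss/word",
        "OMOP_EMB_DB_NAME": "emb",
    }))
```

Setting OMOP_EMB_DB_DRIVER to postgresql+psycopg2 in the same mapping would yield a URL for the psycopg2 driver instead.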

Embedding API (CLI concept ingestion)

| Variable | Description |
| --- | --- |
| OMOP_CDM_DB_URL | SQLAlchemy URL for the OMOP CDM database. Required only for concept ingestion commands. |
| OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX | Task prefix prepended to concept texts at index time. |
| OMOP_EMB_QUERY_EMBEDDING_PREFIX | Task prefix prepended to search queries at query time. |

The prefix variables are optional and default to "". They are only needed for asymmetric embedding models (e.g. nomic-embed-text, E5, BGE) that require different task prefixes for indexing versus querying.
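
To make the effect of the two prefixes concrete, here is a minimal sketch of what prepending them looks like. The helpers `load_prefixes` and `with_prefix` are hypothetical and not part of omop-emb. The example values `search_document: ` and `search_query: ` are the task prefixes commonly documented for nomic-embed-text; check your model's documentation for the exact strings it expects.

```python
import os


def load_prefixes(env=None):
    """Hypothetical helper: return (document_prefix, query_prefix).

    Both default to the empty string, as documented.
    """
    env = os.environ if env is None else env
    return (
        env.get("OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX", ""),
        env.get("OMOP_EMB_QUERY_EMBEDDING_PREFIX", ""),
    )


def with_prefix(texts, prefix):
    """Prepend the task prefix to each text before it is sent to the model."""
    return [f"{prefix}{text}" for text in texts]


if __name__ == "__main__":
    # Example values for an asymmetric model (assumed here for illustration).
    env = {
        "OMOP_EMB_DOCUMENT_EMBEDDING_PREFIX": "search_document: ",
        "OMOP_EMB_QUERY_EMBEDDING_PREFIX": "search_query: ",
    }
    doc_prefix, query_prefix = load_prefixes(env)

    # Index time: concept texts get the document prefix.
    print(with_prefix(["Type 2 diabetes mellitus"], doc_prefix))
    # Query time: search strings get the query prefix.
    print(with_prefix(["high blood sugar"], query_prefix))
```

With both variables unset, the texts pass through unchanged, which is the appropriate behaviour for symmetric models that use no task prefix.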

Documentation overview