CLI Reference

omop-emb provides a CLI for concept ingestion, similarity search, index management, and diagnostics. All commands load a .env file from the working directory automatically.

Commands are organised into four subcommand groups:

Group Purpose
embeddings Ingestion, search, index creation
maintenance Model management, FAISS export/import
diagnostics Health checks
legacy Import pre-built embeddings from HDF5 files

Run omop-emb <group> --help to list commands within a group.

Prerequisites

  • Backend installed: pip install omop-emb (sqlite-vec) or pip install "omop-emb[pgvector]".
  • Backend configured: set OMOP_EMB_BACKEND and the matching connection variables (see Installation).
  • Embedding API: an OpenAI-compatible embeddings endpoint. Required for ingestion and search commands.
  • OMOP CDM (OMOP_CDM_DB_URL): required only for concept ingestion (add-embeddings, add-embeddings-with-index). Not required for search, list-models, rebuild-index, delete-model, or diagnostics.

Environment Variables

Backend

Variable Default Description
OMOP_EMB_BACKEND sqlitevec Backend selector: sqlitevec or pgvector.

sqlite-vec

Variable Description
OMOP_EMB_SQLITE_PATH Path to the sqlite-vec database file (or :memory:).

pgvector (individual components)

Variable Default Description
OMOP_EMB_DB_HOST PostgreSQL host.
OMOP_EMB_DB_PORT 5432 PostgreSQL port.
OMOP_EMB_DB_USER PostgreSQL user.
OMOP_EMB_DB_PASSWORD PostgreSQL password.
OMOP_EMB_DB_NAME PostgreSQL database name.
OMOP_EMB_DB_DRIVER postgresql+psycopg SQLAlchemy driver string.
OMOP_EMB_DB_URL Full connection URL. Overrides individual components.
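The precedence between the full URL and the individual components can be sketched in Python. This is an illustration only: the function name resolve_pg_url is hypothetical, and the way omop-emb assembles the URL internally is an assumption based on the standard SQLAlchemy URL layout, not a description of its actual code.

```python
from urllib.parse import quote


def resolve_pg_url(env):
    """Return a SQLAlchemy URL from an environment-like mapping (hypothetical helper)."""
    # A full URL, when set, takes precedence over the individual components.
    if env.get("OMOP_EMB_DB_URL"):
        return env["OMOP_EMB_DB_URL"]
    # Defaults taken from the table above.
    driver = env.get("OMOP_EMB_DB_DRIVER", "postgresql+psycopg")
    port = env.get("OMOP_EMB_DB_PORT", "5432")
    # Percent-encode credentials so characters such as "@" do not break the URL.
    user = quote(env["OMOP_EMB_DB_USER"], safe="")
    password = quote(env["OMOP_EMB_DB_PASSWORD"], safe="")
    host = env["OMOP_EMB_DB_HOST"]
    name = env["OMOP_EMB_DB_NAME"]
    return f"{driver}://{user}:{password}@{host}:{port}/{name}"
```

Passing os.environ as the mapping gives the URL that the component variables would imply, which can be handy when checking a .env file by hand.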

Ingestion (CDM access)

Variable Description
OMOP_CDM_DB_URL SQLAlchemy URL for the OMOP CDM database.

FAISS sidecar

Variable Description
OMOP_EMB_FAISS_CACHE_DIR Default FAISS cache directory. Used by EmbeddingReaderInterface when faiss_cache_dir is not passed explicitly. Equivalent to the --faiss-cache-dir CLI option.

embeddings group

add-embeddings

Bulk-generate and store embeddings for OMOP concepts that do not yet have embeddings. Models are registered with a FLAT index; use maintenance rebuild-index afterwards to build an HNSW index.

omop-emb embeddings add-embeddings --api-base <URL> --api-key <KEY> [OPTIONS]

Embedding API Options

Option Short Default Description
--api-base required Base URL of the embedding API.
--api-key required API key for the embedding API.
--model -m text-embedding-3-small Embedding model name.
--batch-size -b 100 Concepts per API batch.

Concept Filters

Option Short Default Description
--standard-only False Embed only standard concepts (standard_concept = 'S').
--vocabulary None Restrict to specific OMOP vocabularies (repeatable).
--domain None Restrict to specific OMOP domains (repeatable).
--num-embeddings -n None Cap on total concepts processed (useful for testing).

CDM Fetch Options

Option Default Description
--cdm-batch-size 50000 Batch size for fetching concept metadata from the CDM.

Option Short Description
--verbose -v Increase log verbosity (pass twice for DEBUG).
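
When add-embeddings is driven from a script, the options above can be assembled into an argument list. The sketch below is a hypothetical helper, not part of omop-emb. It assumes that repeatable options are passed once per value and that --standard-only is a switch that takes no value. The endpoint URL and key are placeholders.

```python
import shlex


def add_embeddings_argv(api_base, api_key, *, model=None, vocabularies=(),
                        domains=(), standard_only=False, num_embeddings=None):
    """Build the argument list for `omop-emb embeddings add-embeddings`."""
    argv = ["omop-emb", "embeddings", "add-embeddings",
            "--api-base", api_base, "--api-key", api_key]
    if model is not None:
        argv += ["--model", model]
    if standard_only:
        # Assumed to be a boolean switch without a value.
        argv.append("--standard-only")
    # Assumed: a repeatable option is given once per value.
    for vocabulary in vocabularies:
        argv += ["--vocabulary", vocabulary]
    for domain in domains:
        argv += ["--domain", domain]
    if num_embeddings is not None:
        argv += ["--num-embeddings", str(num_embeddings)]
    return argv


if __name__ == "__main__":
    # Placeholder endpoint and key; substitute real values before running.
    argv = add_embeddings_argv(
        "http://localhost:8000/v1", "PLACEHOLDER_KEY",
        vocabularies=["SNOMED", "RxNorm"], standard_only=True, num_embeddings=100,
    )
    # Print a shell-safe command line for inspection.
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run directly, which avoids shell quoting problems with API keys.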

add-embeddings-with-index

Ingest embeddings and immediately build an index in one step. Equivalent to running add-embeddings followed by create-index.

omop-emb embeddings add-embeddings-with-index --api-base <URL> --api-key <KEY> [OPTIONS]

Accepts all options from add-embeddings, plus:

Index Options

Option Default Description
--index-type flat Index to build after ingestion (flat or hnsw).
--metric-type cosine Distance metric. Required for hnsw, where it is fixed once the index is built.
--index-hnsw-num-neighbors None HNSW graph connectivity (M).
--index-hnsw-ef-search None HNSW query recall parameter.
--index-ef-construction None HNSW build quality parameter.

create-index

Build or rebuild the index for a model that already has embeddings stored. --api-base and --api-key are used only to resolve the canonical model name.

omop-emb embeddings create-index --api-base <URL> --api-key <KEY> --model <NAME> [OPTIONS]

Embedding API Options

Option Short Default Description
--api-base required Base URL of the embedding API.
--api-key required API key.
--model -m text-embedding-3-small Embedding model name.

Index Options

Option Default Description
--index-type flat flat or hnsw.
--metric-type cosine Distance metric. Required for hnsw, where it is fixed once the index is built.
--index-hnsw-num-neighbors None HNSW M parameter.
--index-hnsw-ef-search None HNSW query recall parameter.
--index-ef-construction None HNSW build quality parameter.

Option Short Description
--verbose -v Increase log verbosity.

search

Query stored embeddings for nearest OMOP concepts. Outputs tab-separated rows: query_id, query_text, rank, concept_id, similarity, concept_name.

If OMOP_CDM_DB_URL is set, results are enriched with concept names from the CDM. Without it, the concept_name column is left empty.

omop-emb embeddings search --api-base <URL> --api-key <KEY> --query "hypertension" [OPTIONS]

Embedding API Options

Option Short Default Description
--api-base required Base URL of the embedding API.
--api-key required API key.
--model -m text-embedding-3-small Embedding model name.
--batch-size -b 100 Batch size for embedding generation.

Search Options

Option Default Description
--query None Query text (repeatable). At least one of --query or --queries-file is required.
--queries-file None Path to a .txt file with one query per line.
--metric-type cosine Distance metric for search.
--k 10 Number of nearest concepts to return per query.
--faiss-cache-dir None Use a FAISS sidecar index instead of the primary backend. Requires omop-emb[faiss-cpu]. Also readable from OMOP_EMB_FAISS_CACHE_DIR.

Concept Filters

Option Default Description
--standard-only False Return only standard OMOP concepts.
--vocabulary None Filter results to specific vocabularies (repeatable).
--domain None Filter results to specific domains (repeatable).

Option Short Description
--verbose -v Increase log verbosity.
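
The tab-separated search output can be post-processed with a few lines of Python. The sketch below assumes six columns in the documented order and skips any line that does not parse, such as a header row if one is emitted. The Hit type and parse_search_output function are hypothetical names, and the sample rows in the usage note are illustrative rather than real output.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query_id: str
    query_text: str
    rank: int
    concept_id: int
    similarity: float
    concept_name: str  # empty when OMOP_CDM_DB_URL is not set


def parse_search_output(text: str) -> List[Hit]:
    """Parse tab-separated rows as printed by `omop-emb embeddings search`."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # not a result row
        try:
            hits.append(Hit(fields[0], fields[1], int(fields[2]),
                            int(fields[3]), float(fields[4]), fields[5]))
        except ValueError:
            continue  # e.g. a header line, if the CLI prints one
    return hits


if __name__ == "__main__":
    # Illustrative rows only; real values depend on the model and data.
    sample = ("0\thypertension\t1\t320128\t0.91\tEssential hypertension\n"
              "0\thypertension\t2\t316866\t0.88\t\n")
    for hit in parse_search_output(sample):
        print(hit.rank, hit.concept_id, hit.similarity, hit.concept_name or "(no name)")
```

In practice the text would come from the captured standard output of the search command.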

maintenance group

list-models

List all registered embedding models in the configured backend.

omop-emb maintenance list-models [OPTIONS]

Option Short Default Description
--model -m None Filter by model name.
--provider-type None Filter by provider.
--verbose -v Increase log verbosity.

rebuild-index

Build or rebuild the storage index for an already-registered model. Use this to switch between FLAT and HNSW without re-ingesting. The canonical model name is passed directly via --model; supply --provider-type to canonicalize a raw name if needed.

omop-emb maintenance rebuild-index --model <CANONICAL_NAME> [OPTIONS]

Option Short Default Description
--model -m required Canonical model name.
--provider-type None Provider used to canonicalize the model name when needed.

Index Options

Option Default Description
--index-type flat flat or hnsw.
--metric-type cosine Distance metric. Required for hnsw, where it is fixed once the index is built.
--index-hnsw-num-neighbors None HNSW M parameter.
--index-hnsw-ef-search None HNSW query recall parameter.
--index-ef-construction None HNSW build quality parameter.

Option Short Description
--verbose -v Increase log verbosity.

delete-model

Permanently delete a registered model and all its stored embeddings. This operation is irreversible.

omop-emb maintenance delete-model --model <NAME> [OPTIONS]

Option Short Default Description
--model -m required Canonical model name.
--provider-type None Provider used to canonicalize the model name when needed.
--yes -y False Skip confirmation prompt.
--verbose -v Increase log verbosity.

export-faiss-cache

Export all embeddings from the primary backend into a FAISS index on disk. Requires pip install "omop-emb[faiss-cpu]".

omop-emb maintenance export-faiss-cache --model <NAME> --cache-dir <DIR> [OPTIONS]

Option Short Default Description
--model -m required Canonical model name.
--cache-dir required Root directory for FAISS index files.
--provider-type None Provider used to canonicalize the model name when needed.
--batch-size -b 100000 Embeddings fetched per backend round-trip.

Index Options

Option Default Description
--metric-type cosine Distance metric for the FAISS index (cosine or l2).
--index-type flat FAISS index type: flat (exact) or hnsw (approximate).
--hnsw-m 32 HNSW number of neighbours. Only used when --index-type=hnsw.

Option Short Description
--verbose -v Increase log verbosity.

check-faiss-cache

Check whether the FAISS index on disk is fresh relative to the primary backend. Exits with code 0 if fresh, 1 if stale or missing.

omop-emb maintenance check-faiss-cache --model <NAME> --cache-dir <DIR> [OPTIONS]

Option Short Default Description
--model -m required Canonical model name.
--cache-dir required Root cache directory.
--provider-type None Provider used to canonicalize the model name when needed.

Index Options

Option Default Description
--metric-type cosine Metric of the index to check.
--index-type flat Index type to check (flat or hnsw).

Option Short Description
--verbose -v Increase log verbosity.
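
Because check-faiss-cache reports freshness through its exit code, it can gate a re-export in a scheduled job. The following is a minimal sketch under stated assumptions: refresh_faiss_cache is a hypothetical helper, both commands run with their default --metric-type and --index-type, and any exit code other than 0 or 1 is treated as an unexpected failure.

```python
import subprocess


def refresh_faiss_cache(model, cache_dir, run=subprocess.run):
    """Re-export the FAISS cache only when the check reports it stale or missing.

    Returns True if an export was triggered, False if the cache was fresh.
    The `run` parameter exists so the helper can be exercised without the CLI.
    """
    common = ["--model", model, "--cache-dir", cache_dir]
    check = run(["omop-emb", "maintenance", "check-faiss-cache", *common])
    if check.returncode == 0:
        return False  # documented: 0 means the cache is fresh
    if check.returncode != 1:
        # Only 0 and 1 are documented; anything else is treated as an error here.
        raise RuntimeError(f"check-faiss-cache failed with code {check.returncode}")
    # Documented: 1 means stale or missing, so rebuild the cache.
    run(["omop-emb", "maintenance", "export-faiss-cache", *common], check=True)
    return True
```

Calling refresh_faiss_cache("my-model", "/path/to/cache") before a batch of FAISS-backed searches keeps the sidecar in step with the primary backend. Both arguments here are placeholders.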

import-faiss-cache

Import embeddings from an on-disk FAISS index back into the primary backend. Reconstructs raw vectors from the .faiss file (exact reconstruction requires flat or hnsw index types; IVF/PQ indices are lossy and unsupported).

omop-emb maintenance import-faiss-cache --model <NAME> --cache-dir <DIR> --provider-type <TYPE> [OPTIONS]

Option Short Default Description
--model -m required Canonical model name.
--cache-dir required Root cache directory containing the FAISS index files.
--provider-type required Embedding provider. Used to register the model if not already present.
--batch-size -b 10000 Vectors upserted per backend call.
--force False Overwrite existing embeddings without prompting.

Index Options

Option Default Description
--metric-type cosine Metric of the index to import from.
--index-type flat Index type to import from (flat or hnsw).

Option Short Description
--verbose -v Increase log verbosity.

diagnostics group

health-check

Verify backend connectivity and list registered models with embedding counts.

omop-emb diagnostics health-check [--verbose]

legacy group

add-embeddings-from-h5

Ingest pre-built embeddings from an HDF5 file into the configured backend. Use this to import embeddings generated outside of omop-emb.

The HDF5 file must contain two datasets:

  • concept_ids: 1-D integer array of OMOP concept IDs
  • embeddings: 2-D float array of shape (N, dimensions)

Concept metadata (domain, vocabulary, standard status) is fetched from the OMOP CDM per batch. Requires pip install h5py.
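
A file with the two datasets described above could be produced as follows. This is a sketch, not the tool that generated any existing file: write_embeddings_h5 is a hypothetical helper, the int64 and float32 dtypes are assumptions, and numpy and h5py must be installed for the write step.

```python
def write_embeddings_h5(path, concept_ids, embeddings):
    """Write `concept_ids` (N,) and `embeddings` (N, dimensions) to an HDF5 file."""
    # Validate shapes first so mistakes surface before anything is written.
    if len(concept_ids) != len(embeddings):
        raise ValueError("concept_ids and embeddings must have the same length")
    if len({len(vector) for vector in embeddings}) != 1:
        raise ValueError("all embeddings must share one dimensionality")

    # Imported lazily so the validation above works without these packages.
    import h5py
    import numpy as np

    with h5py.File(path, "w") as h5:
        # Dataset names follow the layout required by add-embeddings-from-h5.
        h5.create_dataset("concept_ids", data=np.asarray(concept_ids, dtype=np.int64))
        h5.create_dataset("embeddings", data=np.asarray(embeddings, dtype=np.float32))
```

The validation runs before the file is opened, so a length mismatch never leaves a partial file on disk.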

omop-emb legacy add-embeddings-from-h5 --h5-file <PATH> --model <NAME> --omop-cdm-db-url <URL> [OPTIONS]

Option Short Default Description
--h5-file required Path to the HDF5 file.
--model -m required Canonical model name to register the embeddings under.
--omop-cdm-db-url required SQLAlchemy URL for the OMOP CDM (used to populate concept metadata).
--provider-type ollama Embedding provider that produced these embeddings.
--metric-type cosine Distance metric to use when storing.
--batch-size -b 10000 Embeddings written per backend call.
--cdm-batch-size 50000 Batch size for fetching concept metadata from the CDM.
--verbose -v Increase log verbosity.