Loader Implementations

This page documents the concrete loader implementations provided by orm_loader.

All loaders implement the same interface and differ only in how data is read and processed.


LoaderInterface

LoaderInterface defines the contract for all loaders.

Required methods

  • orm_file_load(ctx)
  • dedupe(data, ctx)
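
A minimal sketch of what this contract might look like, assuming the interface is expressed as an abstract base class; the signatures here are illustrative, not the library's exact definitions:

    from abc import ABC, abstractmethod


    class LoaderInterface(ABC):
        """Illustrative contract; the real class ships with orm_loader."""

        @abstractmethod
        def orm_file_load(self, ctx):
            """Read the file described by ctx and load it into a staging
            table, returning the number of rows loaded."""

        @abstractmethod
        def dedupe(self, data, ctx):
            """Remove duplicate rows from data according to ctx settings
            and return the deduplicated data."""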

Shared behaviour

All loaders:

  • load into staging tables only
  • respect LoaderContext flags
  • return row counts
  • avoid implicit commits
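
A hedged usage sketch of those guarantees; the LoaderContext fields, loader construction, and connection object below are assumptions made for illustration only:

    # Hypothetical field names; the real LoaderContext may differ.
    ctx = LoaderContext(
        file_path="orders.csv",
        staging_table="stage_orders",   # loaders write to staging tables only
        dedupe_against_db=True,         # behaviour is driven by ctx flags
    )

    loader = PandasLoader()
    rows_loaded = loader.orm_file_load(ctx)   # a row count is returned

    # Loaders avoid implicit commits, so the caller decides when to commit.
    connection.commit()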

PandasLoader

PandasLoader uses pandas to read and process files.

Characteristics

  • Supports CSV and TSV inputs
  • Easy to debug and inspect
  • Supports chunked loading
  • Flexible transformation pipeline

Trade-offs

  • Slower for very large datasets
  • Higher memory overhead than columnar approaches

Best suited for

  • initial data exploration
  • messy or inconsistent files
  • pipelines requiring heavy cleaning or inspection
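
As an illustration of the chunked, pandas-based workflow described above, here is a minimal sketch; the staging table name, connection string, and chunk size are assumptions, and the real PandasLoader pipeline is more involved:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN

    def load_csv_in_chunks(path, staging_table="stage_orders", chunk_size=50_000):
        """Stream a CSV into a staging table one chunk at a time."""
        total = 0
        for chunk in pd.read_csv(path, chunksize=chunk_size):
            # Row-wise cleaning slots in easily here, which is why the pandas
            # loader suits messy or inconsistent files.
            chunk.columns = [c.strip().lower() for c in chunk.columns]
            chunk.to_sql(staging_table, engine, if_exists="append", index=False)
            total += len(chunk)
        return total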

ParquetLoader

ParquetLoader uses PyArrow for columnar ingestion.

Characteristics

  • Efficient for very large datasets
  • Supports Parquet and CSV inputs
  • Batch-oriented processing
  • Lower memory overhead

Trade-offs

  • More complex pipeline
  • Less flexible row-wise transformations
  • DB-level deduplication not yet implemented

Best suited for

  • high-volume ingestion
  • repeated production loads
  • columnar data sources
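
A minimal sketch of the batch-oriented, columnar style the ParquetLoader is built around, assuming PyArrow; the batch size and the downstream write step are illustrative:

    import pyarrow.parquet as pq

    def iter_parquet_batches(path, batch_size=100_000):
        """Yield RecordBatches so a large file never has to sit in memory whole."""
        parquet_file = pq.ParquetFile(path)
        for batch in parquet_file.iter_batches(batch_size=batch_size):
            # Each batch is a pyarrow.RecordBatch: column-wise work stays cheap,
            # but arbitrary row-wise transformations are less convenient than
            # with pandas.
            yield batch

    for batch in iter_parquet_batches("orders.parquet"):
        stage_batch(batch)  # hypothetical helper that appends to the staging table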

Deduplication behaviour

Deduplication occurs in two phases:

  1. Internal deduplication
    Removes rows with duplicate primary keys within the incoming data.

  2. Database-level deduplication (optional)
    Removes incoming rows that already exist in the database.

Database-level deduplication is currently implemented only for pandas-based loads; the ParquetLoader does not yet support it.
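
A hedged pandas sketch of both phases, assuming the primary key columns are known and the existing keys can be read back from the target table; table and column names are illustrative:

    import pandas as pd

    def dedupe(df, pk_cols, engine, target_table=None):
        # Phase 1: internal deduplication of the incoming data.
        df = df.drop_duplicates(subset=pk_cols, keep="first")

        # Phase 2: database-level deduplication (pandas-based loads only).
        if target_table is not None:
            existing = pd.read_sql(
                f"SELECT {', '.join(pk_cols)} FROM {target_table}", engine
            )
            merged = df.merge(existing, on=pk_cols, how="left", indicator=True)
            df = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

        return df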


Normalisation behaviour

When normalisation is enabled, loaders:

  • cast values to ORM column types
  • drop rows violating required constraints
  • log casting failures with examples

No schema changes are performed.
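
A minimal pandas sketch of that normalisation pass, assuming the target types are known per column and a list of required columns is available; the real casting rules follow the ORM column definitions:

    import logging
    import pandas as pd

    log = logging.getLogger("orm_loader")  # illustrative logger name

    def normalise(df, numeric_cols, datetime_cols, required_cols):
        for col in numeric_cols:
            original = df[col].copy()
            df[col] = pd.to_numeric(df[col], errors="coerce")
            _log_failures(col, original, df[col])

        for col in datetime_cols:
            original = df[col].copy()
            df[col] = pd.to_datetime(df[col], errors="coerce")
            _log_failures(col, original, df[col])

        # Rows violating required constraints are dropped; the schema itself
        # is never altered.
        return df.dropna(subset=required_cols)

    def _log_failures(col, before, after):
        """Log how many values failed to cast, with one example value."""
        failed = before[before.notna() & after.isna()]
        if not failed.empty:
            log.warning("column %s: %d values failed casting, e.g. %r",
                        col, len(failed), failed.iloc[0])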