Loader Implementations¶
This page documents the concrete loader implementations provided by
orm_loader.
All loaders implement the same interface and differ only in how data is read and processed.
LoaderInterface¶
LoaderInterface defines the contract for all loaders.
Required methods¶
- orm_file_load(ctx)
- dedupe(data, ctx)
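A minimal sketch of what this contract might look like; the method names come from the list above, but the exact signatures, docstrings, and return types are assumptions, not the library's documented API.

```python
from abc import ABC, abstractmethod


class LoaderInterface(ABC):
    """Contract shared by all loaders (illustrative sketch only)."""

    @abstractmethod
    def orm_file_load(self, ctx):
        """Read the file described by ctx and load it into staging tables.

        Assumed to return the number of rows loaded.
        """

    @abstractmethod
    def dedupe(self, data, ctx):
        """Remove duplicate rows from data according to the flags on ctx.

        Assumed to return the deduplicated data.
        """
```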
Shared behaviour¶
All loaders:
- load into staging tables only
- respect LoaderContext flags
- return row counts
- avoid implicit commits
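To illustrate how these guarantees surface to a caller, a hypothetical driver might look like the sketch below; the import path, the LoaderContext fields, and the loader constructor shown here are assumptions for illustration only.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Import path, constructor shape, and LoaderContext fields are assumed.
from orm_loader import LoaderContext, PandasLoader

engine = create_engine("sqlite:///example.db")

with Session(engine) as session:
    ctx = LoaderContext(
        file_path="data/customers.csv",
        normalise=True,   # opt in to normalisation (see below)
        db_dedupe=True,   # opt in to database-level deduplication
    )
    loader = PandasLoader(session)
    rows_loaded = loader.orm_file_load(ctx)  # writes to staging tables only
    print(f"staged {rows_loaded} rows")
    session.commit()  # loaders never commit implicitly; the caller decides
```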
PandasLoader¶
PandasLoader uses pandas to read and process files.
Characteristics¶
- Supports CSV and TSV inputs
- Easy to debug and inspect
- Supports chunked loading
- Flexible transformation pipeline
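As a rough illustration of the chunked loading mentioned above, the pattern below uses plain pandas and SQLAlchemy rather than the loader's internal API; the file, table, and column names are made up.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")

# Read the file in chunks so memory stays bounded, cleaning each chunk
# before appending it to a staging table. Use sep="\t" for TSV inputs.
for chunk in pd.read_csv("data/customers.csv", sep=",", chunksize=50_000):
    chunk.columns = [c.strip().lower() for c in chunk.columns]  # example cleanup
    chunk = chunk.dropna(subset=["customer_id"])                # drop rows missing the key
    chunk.to_sql("stg_customers", engine, if_exists="append", index=False)
```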
Trade-offs¶
- Slower for very large datasets
- Higher memory overhead than columnar approaches
Best suited for¶
- initial data exploration
- messy or inconsistent files
- pipelines requiring heavy cleaning or inspection
ParquetLoader¶
ParquetLoader uses PyArrow for columnar ingestion.
Characteristics¶
- Efficient for very large datasets
- Supports Parquet and CSV inputs
- Batch-oriented processing
- Lower memory overhead
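A rough sketch of the batch-oriented PyArrow approach described above, independent of the loader's internal API; the file, table, and batch size are illustrative.

```python
import pyarrow.parquet as pq
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")
parquet_file = pq.ParquetFile("data/orders.parquet")

# Stream the file in fixed-size record batches instead of materialising
# the whole dataset, keeping memory overhead low.
for batch in parquet_file.iter_batches(batch_size=100_000):
    frame = batch.to_pandas()  # convert one batch at a time for loading
    frame.to_sql("stg_orders", engine, if_exists="append", index=False)
```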
Trade-offs¶
- More complex pipeline
- Less flexible row-wise transformations
- DB-level deduplication not yet implemented
Best suited for¶
- high-volume ingestion
- repeated production loads
- columnar data sources
Deduplication behaviour¶
Deduplication occurs in two phases:
- Internal deduplication: removes duplicate primary key rows within the incoming data.
- Database-level deduplication (optional): removes rows that already exist in the database.
Database-level deduplication is currently implemented only for pandas-based loads.
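A simplified pandas-style sketch of the two phases; the staging file, target table, and primary key column are illustrative only, not the loader's actual implementation.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")
incoming = pd.read_csv("data/customers.csv")

# Phase 1: internal deduplication - keep one row per primary key
# within the incoming data itself.
incoming = incoming.drop_duplicates(subset=["customer_id"], keep="first")

# Phase 2 (optional): database-level deduplication - drop rows whose
# primary key already exists in the target table.
existing = pd.read_sql("SELECT customer_id FROM customers", engine)
incoming = incoming[~incoming["customer_id"].isin(existing["customer_id"])]
```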
Normalisation behaviour¶
When enabled, loaders:
- cast values to ORM column types
- drop rows violating required constraints
- log casting failures with examples
No schema changes are performed.
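A rough illustration of what this normalisation could look like for a single required integer column; the helper name, logging, and casting strategy are assumptions rather than the loader's actual code.

```python
import logging
import pandas as pd

log = logging.getLogger(__name__)


def cast_required_int(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Cast one required integer column, dropping rows that cannot be cast."""
    original = frame[column]
    frame[column] = pd.to_numeric(original, errors="coerce")  # invalid values become NaN

    failed = frame[column].isna() & original.notna()
    if failed.any():
        # Log casting failures with an example value, as described above.
        log.warning("%d value(s) in %r failed casting, e.g. %r",
                    int(failed.sum()), column, original[failed].iloc[0])

    # Drop rows that violate the required (NOT NULL) constraint.
    return frame[frame[column].notna()]
```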