Loader Implementations

This page documents the concrete loader implementations provided by orm_loader.

All loaders implement the same interface. The difference is in how they read data and how much work they do before rows reach the staging table.


LoaderInterface

LoaderInterface defines the contract for all loaders.

Required methods

  • orm_file_load(ctx)
  • dedupe(data, ctx)

Shared behaviour

All loaders:

  • load into staging tables only
  • respect LoaderContext flags
  • return row counts
  • leave final merge behaviour to the table layer
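As a rough illustration of the contract described above, the sketch below models the two required methods with an abstract base class. Only the method names orm_file_load and dedupe come from this page. The class names ContextSketch, LoaderInterfaceSketch and InMemoryLoader, the context fields, and the list standing in for a staging table are all hypothetical and are not the real orm_loader classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ContextSketch:
    """Hypothetical stand-in for LoaderContext; the real fields may differ."""
    path: str
    dedupe: bool = True  # assumed flag name, for illustration only


class LoaderInterfaceSketch(ABC):
    """Illustrative contract: both methods must be provided by a loader."""

    @abstractmethod
    def orm_file_load(self, ctx: ContextSketch) -> int:
        """Load the data described by ctx into staging and return the row count."""

    @abstractmethod
    def dedupe(self, data: Any, ctx: ContextSketch) -> Any:
        """Remove duplicates within the incoming data."""


class InMemoryLoader(LoaderInterfaceSketch):
    """Toy loader: rows come from a list and 'staging' is another list."""

    def __init__(self, rows):
        self.rows = rows
        self.staging = []

    def dedupe(self, data, ctx):
        seen = set()
        unique = []
        for row in data:
            if row not in seen:
                seen.add(row)
                unique.append(row)
        return unique

    def orm_file_load(self, ctx):
        data = self.rows
        if ctx.dedupe:  # respect the context flag
            data = self.dedupe(data, ctx)
        self.staging.extend(data)  # staging only; no merge happens here
        return len(data)  # loaders return row counts


loader = InMemoryLoader([(1, "a"), (1, "a"), (2, "b")])
loaded = loader.orm_file_load(ContextSketch(path="example.csv"))
```

The toy loader writes only to its staging list and returns a count, leaving any merge decision to a separate layer, which mirrors the shared behaviour listed above.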

PandasLoader

PandasLoader uses pandas to read and process files.

Characteristics

  • Works well with CSV and TSV inputs
  • Easy to debug and inspect
  • Supports chunked loading
  • Flexible transformation pipeline

Trade-offs

  • Slower for very large datasets
  • Higher memory overhead than columnar approaches

Best suited for

  • initial data exploration
  • messy or inconsistent files
  • pipelines requiring heavy cleaning or inspection
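To give a feel for chunked loading with pandas, here is a minimal sketch that reads a TSV in fixed-size chunks, applies a small cleaning step, and counts rows. It uses only standard pandas calls and is not the PandasLoader implementation. The function name load_in_chunks, the whitespace-stripping step, and the sample data are assumptions made for illustration.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize=2, sep=","):
    """Read a delimited file chunk by chunk and return the number of rows seen.

    Hypothetical helper: a real loader would insert each chunk into staging.
    """
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize, dtype=str):
        # Example cleaning step: strip stray whitespace from every column.
        chunk = chunk.apply(lambda col: col.str.strip())
        total += len(chunk)
    return total


SAMPLE_TSV = "id\tname\n1\t alice \n2\tbob\n3\tcarol\n"
row_count = load_in_chunks(io.StringIO(SAMPLE_TSV), chunksize=2, sep="\t")
```

Because each chunk is an ordinary DataFrame, it can be printed or inspected mid-pipeline, which is where the ease of debugging comes from.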

ParquetLoader

ParquetLoader uses PyArrow for columnar ingestion.

Characteristics

  • Efficient for very large datasets
  • Supports Parquet and CSV inputs
  • Batch-oriented processing
  • Lower memory overhead

Trade-offs

  • More complex pipeline
  • Less flexible row-wise transformations

Best suited for

  • high-volume ingestion
  • repeated production loads
  • columnar data sources

Deduplication behaviour

At the loader layer, deduplication removes duplicate rows within the incoming data before it is inserted into staging. It does not compare incoming rows against existing target rows; the merge step decides what happens when the two overlap.
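A minimal sketch of deduplicating within the incoming data is shown below. It assumes rows are dictionaries, that duplicates are identified by a set of key columns, and that the first occurrence wins. None of these choices is confirmed by this page, and dedupe_incoming is a hypothetical name.

```python
def dedupe_incoming(rows, key_columns):
    """Drop later rows whose key columns repeat an earlier row in the same input.

    Assumption: the first occurrence is kept. Rows already in the target
    table are not consulted here; that is left to the merge step.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


incoming = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 1, "name": "alice (duplicate)"},
]
deduped = dedupe_incoming(incoming, key_columns=["id"])
```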


Normalisation behaviour

When enabled, loaders:

  • cast values to ORM column types
  • drop rows violating required constraints
  • log casting failures with examples

No schema changes are performed at the loader layer.
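The normalisation steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: column types are modelled as simple callables such as int, a "required constraint" is taken to mean a column that must not be empty after casting, and the names normalise, column_types and required are hypothetical.

```python
import logging

logger = logging.getLogger("loader_sketch")


def normalise(rows, column_types, required, max_examples=3):
    """Cast values, drop rows missing required columns, and log cast failures.

    Returns (kept_rows, failures), where failures maps a column name to the
    raw values that could not be cast.
    """
    kept = []
    failures = {}
    for row in rows:
        cast_row = {}
        for col, caster in column_types.items():
            raw = row.get(col)
            if raw is None or raw == "":
                cast_row[col] = None
                continue
            try:
                cast_row[col] = caster(raw)
            except (TypeError, ValueError):
                # Remember the offending value so it can be logged as an example.
                failures.setdefault(col, []).append(raw)
                cast_row[col] = None
        # Assumed rule: a row is dropped if any required column ended up empty.
        if all(cast_row[col] is not None for col in required):
            kept.append(cast_row)
    for col, bad in failures.items():
        logger.warning(
            "%d casting failure(s) in column %r, e.g. %r",
            len(bad), col, bad[:max_examples],
        )
    return kept, failures


raw_rows = [
    {"id": "1", "age": "30"},
    {"id": "x", "age": "41"},   # id cannot be cast, so the row is dropped
    {"id": "3", "age": "n/a"},  # age fails to cast but is not required
]
kept, failures = normalise(raw_rows, {"id": int, "age": int}, required=["id"])
```

The function only transforms and filters rows; it never alters column definitions, in line with the note that no schema changes happen at the loader layer.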