Skip to content

Loader Helper Utilities

This page documents low-level helper functions used by loaders.

These utilities are stateless and intentionally conservative.


Encoding and delimiter detection

infer_encoding(path)

Detects file encoding using chardet.

  • ASCII is normalised to UTF-8
  • Only a small prefix of the file is inspected

infer_delim(path)

Detects CSV delimiter by inspecting the first line.

  • Prefers tabs over commas if more frequent

Duplicate detection (PyArrow)

arrow_drop_duplicates(table, pk_names)

Removes duplicate rows from a PyArrow table based on primary key columns.

  • Sorts by PKs
  • Keeps the first occurrence
  • Handles chunked arrays safely

Used by ParquetLoader for internal deduplication.


Conservative CSV parsing

conservative_load_parquet(...)

Reads CSV files using PyArrow with:

  • strict column inclusion
  • malformed row skipping
  • chunked batch iteration

This is used when loading CSVs via the Parquet pipeline.


PostgreSQL fast-path loading

quick_load_pg(...)

Loads CSV files into PostgreSQL staging tables using COPY.

Characteristics

  • Extremely fast
  • Bypasses ORM
  • Sensitive to data quality issues

Failure handling

  • Errors trigger rollback
  • Loader falls back to ORM-based loading
  • No partial silent loads

This helper is only used when explicitly supported by the database.