# Ingest Module (Draft)

This page introduces the scaffolding for the forthcoming `nemora.ingest` module.
It covers the core abstractions (`DatasetSource`, `TransformPipeline`) that new
connectors will extend to transform raw forest inventory releases (BC FAIB, FIA,
etc.) into the tidy stand tables consumed by `nemora.fit`, `nemora.sampling`,
and other modules.
## DatasetSource

`DatasetSource` captures enough metadata for the toolkit to locate and download
raw files. Provide a `fetcher` callable when remote retrieval is required:
```python
from pathlib import Path

from nemora.ingest import DatasetSource


def fetch_bc_faib(source: DatasetSource) -> list[Path]:
    output_dir = Path("data/external") / source.name
    output_dir.mkdir(parents=True, exist_ok=True)
    # TODO: integrate with the FAIB portal API
    # (https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/) to download
    # PSP/CMI/NFI/YSM extracts. For now, drop a placeholder.
    (output_dir / "README.txt").write_text("FAIB data placeholder\n", encoding="utf-8")
    return [output_dir]


BC_FAIB_SOURCE = DatasetSource(
    name="bc-faib",
    description="BC FAIB ground sample plots (PSP, CMI, NFI, YSM)",
    uri="https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/",
    metadata={
        "notes": (
            "Public FAIB portal; subsample by BAF/prism size as needed. "
            "Bulk downloads are also available via FTP under "
            "ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/psp/ "
            "and the companion web interface at "
            "https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/."
        )
    },
    fetcher=fetch_bc_faib,
)
```
When `BC_FAIB_SOURCE.fetch()` is invoked, it delegates to `fetch_bc_faib`. Future
connectors will implement authenticated fetchers and cache management.
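The delegation pattern can be sketched with a minimal stand-in class. This is illustrative only, not the real `DatasetSource` (which carries more fields, such as `description` and `uri`):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class MiniSource:
    """Illustrative stand-in showing how fetch() delegates to a fetcher."""

    name: str
    fetcher: Optional[Callable[["MiniSource"], list[Path]]] = None
    metadata: dict = field(default_factory=dict)

    def fetch(self) -> list[Path]:
        # Delegate to the configured callable, mirroring BC_FAIB_SOURCE.fetch().
        if self.fetcher is None:
            raise RuntimeError(f"no fetcher configured for {self.name!r}")
        return self.fetcher(self)


demo = MiniSource(name="demo", fetcher=lambda src: [Path("data/external") / src.name])
paths = demo.fetch()  # delegates to the lambda above
```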
Nemora now provides first-class helpers for building these sources:
```python
from nemora.ingest.faib import build_faib_dataset_source
from nemora.ingest.fia import build_fia_dataset_source

faib_source = build_faib_dataset_source(
    "psp",
    destination="data/external/faib/raw",
    overwrite=False,
)
fia_source = build_fia_dataset_source(
    "HI",
    destination="data/external/fia/raw",
    tables=("TREE", "PLOT", "COND"),
)

# Trigger the downloads when required
faib_files = list(faib_source.fetch())
fia_files = list(fia_source.fetch())
```
Both helpers capture cache metadata (destination, filenames) in the resulting
DatasetSource.metadata, making it easier to surface provenance in logs or CLI
output.
## TransformPipeline

`TransformPipeline` holds an ordered list of callables that accept and return
`pandas.DataFrame` objects:
```python
import pandas as pd

from nemora.ingest import TransformPipeline


def convert_units(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(dbh_cm=frame["dbh_mm"] / 10.0)


def compute_stand_table(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(stand_table=frame["tally"] * frame["expansion_factor"])


pipeline = TransformPipeline(
    name="bc-faib-hps",
    metadata={"description": "Convert FAIB tallies to Nemora stand table format"},
)
pipeline.add_step(convert_units)
pipeline.add_step(compute_stand_table)
```
Ingest workflows can compose these pipelines with reusable helpers. For example, the FAIB stand-table implementation now exposes a dedicated pipeline builder:
```python
import pandas as pd

from nemora.ingest.faib import build_faib_stand_table_pipeline

tree_detail = pd.read_csv("data/external/faib/raw/faib_tree_detail.csv")
plot_header = pd.read_csv("data/external/faib/raw/faib_plot_header.csv")

pipeline = build_faib_stand_table_pipeline(
    plot_header,
    baf=12.0,
    dbh_col="DBH_CM",
    expansion_col="TREE_EXP",
    baf_col="BLOWUP_MAIN",
)
stand_table = pipeline.run(tree_detail)
```
This mirrors the logic used by both the CLI and generate_faib_manifest, so
tests and notebooks can share the same transformation sequence.
## HPS tallies

PSP-derived HPS tallies can now be generated without the standalone helper
script. The ingest module exposes a convenience wrapper that streams the tree
detail CSV, filters plot visits, and returns both tallies and manifest data:
```python
from pathlib import Path

from nemora.ingest.hps import (
    SelectionCriteria,
    export_hps_outputs,
    load_plot_selections,
    run_hps_pipeline,
)

root = Path("data/external/faib")
plot_header = root / "faib_plot_header.csv"
sample_byvisit = root / "faib_sample_byvisit.csv"
tree_detail = root / "faib_tree_detail.csv"

criteria = SelectionCriteria(first_visit_only=True, max_plots=5)
selections = load_plot_selections(plot_header, sample_byvisit, baf=12.0, criteria=criteria)
result = run_hps_pipeline(tree_detail, selections, live_status=("L",), bin_width=1.0)
export_hps_outputs(
    result.tallies,
    result.manifest,
    output_dir=Path("data/examples/hps_baf12"),
    manifest_path=Path("data/examples/hps_baf12/manifest.csv"),
)
```
`run_hps_pipeline` returns an `HPSPipelineResult` containing the per-plot
tallies (grouped DataFrames), a combined manifest, and a flattened tallies
DataFrame. `export_hps_outputs` mirrors the historical script behaviour when
writing files.
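The binning behind `bin_width` can be sketched in plain pandas. The column names below (`dbh_cm`, `tree_exp`) are assumptions for illustration, not the FAIB schema:

```python
import pandas as pd


def bin_tallies(trees: pd.DataFrame, bin_width: float = 1.0) -> pd.DataFrame:
    # Assign each tree to a DBH class of the given width, then sum the
    # per-tree expansion factors within each class to form a tally.
    classes = (trees["dbh_cm"] // bin_width) * bin_width
    return (
        trees.assign(dbh_class=classes)
        .groupby("dbh_class", as_index=False)["tree_exp"]
        .sum()
        .rename(columns={"tree_exp": "tally"})
    )


trees = pd.DataFrame({"dbh_cm": [12.3, 12.8, 14.1], "tree_exp": [5.0, 5.0, 2.5]})
tallies = bin_tallies(trees, bin_width=1.0)
```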
## Data dictionaries
FAIB publishes companion Excel data dictionaries alongside each compilation.
For example, the PSP release exposes PSP_data_dictionary_20250514.xlsx under
the FTP path above, and the non-PSP directory mirrors the structure (see
ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/non_psp/
and non_PSP_data_dictionary_20250514.xlsx). These spreadsheets map column
codes to descriptions; keep a local copy alongside any downloads so analysts
can interpret FAIB variable names (faib_plot_header.csv, faib_tree_detail.csv,
etc.) when building pipelines.
.. note::
   The FAIB team confirmed the portal data is fully public and can be
   redistributed. For bulk processing, the FTP endpoints above are faster and
   expose the complete PSP, CMI, NFI, and YSM compilations (hundreds of
   megabytes per table). Nemora stores fetched CSVs under data/external/faib/,
   which is already .gitignore-d; treat that directory as a local cache and
   avoid committing the raw extracts.

   During rapid iteration you can limit downloads to specific files by passing
   filenames=["faib_plot_header.csv"] to :func:`nemora.ingest.faib.download_faib_csvs`
   so that small metadata tables can be fetched without transferring the
   multi-hundred-megabyte tree detail extracts.
Running `pipeline.run(raw_frame)` applies the configured steps sequentially,
which makes pipelines well suited to cleaning CSV extracts, building stand
tables, and harmonising column names. Pipelines will be orchestrated by future
CLI commands.
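Sequential application amounts to folding the frame through the step list. A minimal sketch of the idea (not the actual `TransformPipeline` implementation):

```python
from functools import reduce

import pandas as pd


def run_steps(frame: pd.DataFrame, steps) -> pd.DataFrame:
    # Fold the frame through each step in order; every step accepts and
    # returns a DataFrame, so steps compose freely.
    return reduce(lambda df, step: step(df), steps, frame)


raw = pd.DataFrame({"dbh_mm": [250, 310], "tally": [4, 2], "expansion_factor": [5.0, 5.0]})
steps = [
    lambda df: df.assign(dbh_cm=df["dbh_mm"] / 10.0),
    lambda df: df.assign(stand_table=df["tally"] * df["expansion_factor"]),
]
tidy = run_steps(raw, steps)
```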
See `nemora.ingest.faib` for utilities (`load_psp_dictionary`,
`load_non_psp_dictionary`, `aggregate_stand_table`) that download schemas and
collapse tree detail tables into Nemora-ready stand-table summaries.
.. todo::
   Flesh out end-to-end ingestion workflows (including CLI usage and caching
   guidelines) once dataset connectors are implemented.
## CLI helper
Nemora exposes an early CLI stub for PSP stand tables:
```bash
nemora ingest-faib tests/fixtures/faib --baf 12 --output stand_table.csv

# Fetch PSP extracts and write output
nemora ingest-faib data/external/faib --baf 12 --fetch --dataset psp --output stand_table.csv

# Force a fresh download (overwrite cached files) before building the stand table
nemora ingest-faib data/external/faib --baf 12 --fetch --overwrite --output stand_table.csv

# Preview suggested BAF values and exit without generating a table
nemora ingest-faib data/external/faib --auto-bafs --fetch --dataset psp

# `faib-manifest` writes both CSV and Parquet by default; pass --no-parquet to emit CSV only.
# Fetch extracts, auto-select BAFs, and generate manifests + stand tables (CSV + Parquet)
nemora faib-manifest data/external/faib/manifest_psp --auto-bafs --auto-count 3

# Reuse an existing download, skip fetch, limit rows, and emit CSV + Parquet manifests
nemora faib-manifest examples/faib_manifest --source tests/fixtures/faib --no-fetch --baf 12 --max-rows 200

# CSV-only regeneration example (details in docs/examples/faib_manifest_parquet.md)
nemora faib-manifest data/external/faib/manifest_psp --overwrite --no-parquet

# Prepare HPS tallies and manifest (no download, reusing cached CSVs)
nemora ingest-faib-hps data/external/faib --no-fetch --output data/examples/hps_baf12

# Download PSP extracts to a cache directory and write outputs to the examples folder
nemora ingest-faib-hps data/external/faib --cache-dir data/external/psp/raw --output data/examples/hps_baf12 --fetch

# Benchmark the HPS pipeline (timing only, no output)
nemora ingest-benchmark data/external/faib --no-fetch --iterations 5

# Benchmark and capture metrics (JSONL) for later trend analysis
nemora ingest-benchmark data/external/faib --no-fetch --iterations 3 --report-path logs/ingest_benchmark.jsonl

# Generate trimmed fixtures + manifest (used in tests)
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp

# Auto-select representative BAF values before generating the manifest
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto

# Limit stand tables to the first 200 rows when exporting the manifest samples
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto --max-rows 200

# Aggregate an FIA stand table (prototype) using local CSV extracts
python - <<'PY'
from nemora.ingest.fia import build_stand_table_from_csvs

table = build_stand_table_from_csvs(
    "data/external/fia/raw",
    plot_cn=47825253010497,
)
print(table.head())
PY

# Aggregate FIA stand tables via CLI (trimmed fixtures example)
nemora ingest-fia tests/fixtures/fia --tree-file tree_small.csv --cond-file cond_small.csv \
  --plot-file plot_small.csv --plot-cn 47825261010497 --plot-cn 47825253010497 --output fia_sample.csv
```
Unless `--fetch` (or `--fetch-state`) is supplied, these commands expect
pre-downloaded CSV extracts; future versions will bundle more complete fetch
and caching logic.
## Caching guidelines

- Use directories under `data/external/` for raw downloads (`faib/raw`,
  `fia/raw`, etc.). They are already ignored by Git.
- Prefer invoking `build_faib_dataset_source(...).fetch()` or
  `build_fia_dataset_source(...).fetch()` from notebooks/scripts instead of
  reimplementing download logic. The helpers enforce overwrite-safe `.part`
  files and capture provenance in `DatasetSource.metadata`.
- CLI commands pass through these helpers when `--fetch` or `--fetch-state` is
  supplied; cached files are reused unless `--overwrite` is specified.
- Document licences and terms of use alongside cached datasets (see
  `tests/fixtures/faib/README.md` for an example template).
## Repository sample
The repository contains a trimmed PSP example generated with
scripts/generate_faib_manifest.py under examples/faib_manifest/.
The manifest (faib_manifest.csv) lists each stand-table CSV (e.g.,
stand_table_baf12.csv) alongside the BAF, row count, and a truncated flag so
tests and documentation can reference a lightweight sample of the full FAIB
release. Re-run the script with --max-rows to regenerate the samples from a
larger local cache without bloating the repository.
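Working with the manifest is ordinary pandas. The column names below are assumptions for illustration; inspect faib_manifest.csv to confirm the actual schema:

```python
import pandas as pd

# Toy manifest mirroring the documented fields (path, BAF, row count,
# truncated flag); the exact column names are assumptions.
manifest = pd.DataFrame({
    "path": ["stand_table_baf12.csv", "stand_table_baf20.csv"],
    "baf": [12.0, 20.0],
    "rows": [200, 145],
    "truncated": [True, False],
})

# Surface which samples were clipped by --max-rows so docs can note the sampling.
clipped = manifest.loc[manifest["truncated"], "path"].tolist()
```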
The CLI and script both call :func:`nemora.ingest.faib.generate_faib_manifest`,
which orchestrates downloads, BAF selection, stand-table aggregation, and
manifest creation. The helper returns the manifest path, generated table paths,
and any files downloaded so automated workflows can inspect the output.
## FIA prototype

Nemora includes early helpers for USDA FIA CSV extracts
(:mod:`nemora.ingest.fia`). The :func:`nemora.ingest.fia.build_stand_table_from_csvs`
function joins TREE/COND/PLOT tables, filters live trees and conditions,
converts DBH to centimetres, and aggregates stand tables weighted by
TPA_UNADJ and condition proportions. These utilities are the first step toward
a full FIA ingest pipeline; use them to validate schema joins on downloaded
samples while additional ETL automation is planned.
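The weighting just described can be sketched with toy frames using standard FIA column names (STATUSCD, DIA, TPA_UNADJ, CONDPROP_UNADJ). This is a simplified illustration of the join and aggregation, not the real helper, which handles more columns and filters:

```python
import pandas as pd

# Toy FIA-style extracts; real files carry many more columns.
tree = pd.DataFrame({
    "PLT_CN": [1, 1, 1],
    "CONDID": [1, 1, 2],
    "STATUSCD": [1, 1, 2],      # 1 = live tree in FIA coding
    "DIA": [10.0, 12.0, 9.0],   # DBH in inches
    "TPA_UNADJ": [6.0, 6.0, 6.0],
})
cond = pd.DataFrame({
    "PLT_CN": [1, 1],
    "CONDID": [1, 2],
    "CONDPROP_UNADJ": [0.75, 0.25],
})

# Filter live trees, join their condition record, convert units, and weight
# trees-per-acre by the condition proportion before aggregating.
live = tree[tree["STATUSCD"] == 1].merge(cond, on=["PLT_CN", "CONDID"])
live = live.assign(
    dbh_cm=live["DIA"] * 2.54,
    weight=live["TPA_UNADJ"] * live["CONDPROP_UNADJ"],
)
stand_table = live.groupby("dbh_cm", as_index=False)["weight"].sum()
```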
The CLI supports automatic downloads via --fetch-state; Nemora maps state
codes to the public FIA DataMart URLs (for example, nemora ingest-fia data/fia
--fetch-state hi will retrieve HI_TREE.csv, HI_PLOT.csv, and HI_COND.csv
before aggregating). Downloads are optional: pass custom
--tree-file/--cond-file/--plot-file arguments when working with pre-existing
extracts or trimmed fixtures.
Licensing note: FIA data are public domain but attribution is appreciated;
refer to the USDA legal notice at https://www.fia.fs.usda.gov/contact/legal.php.
When redistributing trimmed fixtures (e.g., under tests/fixtures/fia) include
the citation and acquisition date so downstream users understand the provenance.