# Ingest Module (Draft)

This page introduces the scaffolding for the forthcoming `nemora.ingest` module.
It covers the core abstractions (`DatasetSource`, `TransformPipeline`) that new
connectors will extend to transform raw forest inventory releases (BC FAIB, FIA,
etc.) into the tidy stand tables consumed by `nemora.fit`, `nemora.sampling`,
and other modules.
## DatasetSource

`DatasetSource` captures enough metadata for the toolkit to locate and download
raw files. Provide a `fetcher` callable when remote retrieval is required:
```python
from pathlib import Path

from nemora.ingest import DatasetSource


def fetch_bc_faib(source: DatasetSource) -> list[Path]:
    output_dir = Path("data/external") / source.name
    output_dir.mkdir(parents=True, exist_ok=True)
    # TODO: integrate with the FAIB portal API
    # (https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/) to download
    # PSP/CMI/NFI/YSM extracts. For now, drop a placeholder.
    (output_dir / "README.txt").write_text("FAIB data placeholder\n", encoding="utf-8")
    return [output_dir]


BC_FAIB_SOURCE = DatasetSource(
    name="bc-faib",
    description="BC FAIB ground sample plots (PSP, CMI, NFI, YSM)",
    uri="https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/",
    metadata={
        "notes": (
            "Public FAIB portal; subsample by BAF/prism size as needed. "
            "Bulk downloads are also available via FTP under "
            "ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/psp/ "
            "and the companion web interface at "
            "https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/."
        )
    },
    fetcher=fetch_bc_faib,
)
```
When `BC_FAIB_SOURCE.fetch()` is invoked, it delegates to `fetch_bc_faib`. Future
connectors will implement authenticated fetchers and cache management.
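The delegation pattern can be sketched with a minimal stand-in class. This is illustrative only, not the real `DatasetSource` (which carries more fields, such as `description` and `uri`):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class MiniSource:
    """Illustrative stand-in showing how fetch() delegates to a fetcher."""

    name: str
    fetcher: Optional[Callable[["MiniSource"], list[Path]]] = None
    metadata: dict = field(default_factory=dict)

    def fetch(self) -> list[Path]:
        # Delegate to the configured callable, mirroring BC_FAIB_SOURCE.fetch().
        if self.fetcher is None:
            raise RuntimeError(f"no fetcher configured for {self.name!r}")
        return self.fetcher(self)


demo = MiniSource(name="demo", fetcher=lambda src: [Path("data/external") / src.name])
paths = demo.fetch()  # delegates to the lambda above
```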
Nemora now provides first-class helpers for building these sources:
```python
from nemora.ingest.faib import build_faib_dataset_source
from nemora.ingest.fia import build_fia_dataset_source

faib_source = build_faib_dataset_source(
    "psp",
    destination="data/external/faib/raw",
    overwrite=False,
)
fia_source = build_fia_dataset_source(
    "HI",
    destination="data/external/fia/raw",
    tables=("TREE", "PLOT", "COND"),
)

# Trigger the downloads when required
faib_files = list(faib_source.fetch())
fia_files = list(fia_source.fetch())
```
Both helpers capture cache metadata (destination, filenames) in the resulting
DatasetSource.metadata, making it easier to surface provenance in logs or CLI
output.
## TransformPipeline

`TransformPipeline` holds an ordered list of callables that accept and return
`pandas.DataFrame` objects:
```python
import pandas as pd

from nemora.ingest import TransformPipeline


def convert_units(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(dbh_cm=frame["dbh_mm"] / 10.0)


def compute_stand_table(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(stand_table=frame["tally"] * frame["expansion_factor"])


pipeline = TransformPipeline(
    name="bc-faib-hps",
    metadata={"description": "Convert FAIB tallies to Nemora stand table format"},
)
pipeline.add_step(convert_units)
pipeline.add_step(compute_stand_table)
```
Ingest workflows can compose these pipelines with reusable helpers. For example, the FAIB stand-table implementation now exposes a dedicated pipeline builder:
```python
import pandas as pd

from nemora.ingest.faib import build_faib_stand_table_pipeline

tree_detail = pd.read_csv("data/external/faib/raw/faib_tree_detail.csv")
plot_header = pd.read_csv("data/external/faib/raw/faib_plot_header.csv")

pipeline = build_faib_stand_table_pipeline(
    plot_header,
    baf=12.0,
    dbh_col="DBH_CM",
    expansion_col="TREE_EXP",
    baf_col="BLOWUP_MAIN",
)
stand_table = pipeline.run(tree_detail)
```
This mirrors the logic used by both the CLI and generate_faib_manifest, so
tests and notebooks can share the same transformation sequence.
## HPS tallies

PSP-derived HPS tallies can now be generated without the standalone helper
script. The ingest module exposes a convenience wrapper that streams the tree
detail CSV, filters plot visits, and returns both tallies and manifest data:
```python
from pathlib import Path

from nemora.ingest.hps import (
    SelectionCriteria,
    export_hps_outputs,
    load_plot_selections,
    run_hps_pipeline,
)

root = Path("data/external/faib")
plot_header = root / "faib_plot_header.csv"
sample_byvisit = root / "faib_sample_byvisit.csv"
tree_detail = root / "faib_tree_detail.csv"

criteria = SelectionCriteria(first_visit_only=True, max_plots=5)
selections = load_plot_selections(plot_header, sample_byvisit, baf=12.0, criteria=criteria)
result = run_hps_pipeline(tree_detail, selections, live_status=("L",), bin_width=1.0)
export_hps_outputs(
    result.tallies,
    result.manifest,
    output_dir=Path("data/examples/hps_baf12"),
    manifest_path=Path("data/examples/hps_baf12/manifest.csv"),
)
```
`run_hps_pipeline` returns an `HPSPipelineResult` containing the per-plot
tallies (grouped DataFrames), a combined manifest, and a flattened tallies
DataFrame. `export_hps_outputs` mirrors the historical script behaviour when
writing files.
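The binning behind `bin_width` can be sketched in plain pandas. The column names below (`dbh_cm`, `tree_exp`) are assumptions for illustration, not the FAIB schema:

```python
import pandas as pd


def bin_tallies(trees: pd.DataFrame, bin_width: float = 1.0) -> pd.DataFrame:
    # Assign each tree to a DBH class of the given width, then sum the
    # per-tree expansion factors within each class to form a tally.
    classes = (trees["dbh_cm"] // bin_width) * bin_width
    return (
        trees.assign(dbh_class=classes)
        .groupby("dbh_class", as_index=False)["tree_exp"]
        .sum()
        .rename(columns={"tree_exp": "tally"})
    )


trees = pd.DataFrame({"dbh_cm": [12.3, 12.8, 14.1], "tree_exp": [5.0, 5.0, 2.5]})
tallies = bin_tallies(trees, bin_width=1.0)
```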
## Data dictionaries
FAIB publishes companion Excel data dictionaries alongside each compilation.
For example, the PSP release exposes PSP_data_dictionary_20250514.xlsx under
the FTP path above, and the non-PSP directory mirrors the structure (see
ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/non_psp/
and non_PSP_data_dictionary_20250514.xlsx). These spreadsheets map column
codes to descriptions; keep a local copy alongside any downloads so analysts
can interpret FAIB variable names (faib_plot_header.csv, faib_tree_detail.csv,
etc.) when building pipelines.
.. note::
   The FAIB team confirmed the portal data is fully public and can be
   redistributed. For bulk processing, the FTP endpoints above are faster and
   expose the complete PSP, CMI, NFI, and YSM compilations (hundreds of
   megabytes per table). Nemora stores fetched CSVs under data/external/faib/,
   which is already .gitignore-d; treat that directory as a local cache and
   avoid committing the raw extracts.

   During rapid iteration you can limit downloads to specific files by passing
   filenames=["faib_plot_header.csv"] to :func:`nemora.ingest.faib.download_faib_csvs`
   so that small metadata tables can be fetched without transferring the
   multi-hundred-megabyte tree detail extracts.
Running `pipeline.run(raw_frame)` applies the configured steps sequentially,
which makes pipelines well suited to cleaning CSV extracts, building stand
tables, and harmonising column names. Pipelines will be orchestrated by future
CLI commands.
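Sequential application amounts to folding the frame through the step list. A minimal sketch of the idea (not the actual `TransformPipeline` implementation):

```python
from functools import reduce

import pandas as pd


def run_steps(frame: pd.DataFrame, steps) -> pd.DataFrame:
    # Fold the frame through each step in order; every step accepts and
    # returns a DataFrame, so steps compose freely.
    return reduce(lambda df, step: step(df), steps, frame)


raw = pd.DataFrame({"dbh_mm": [250, 310], "tally": [4, 2], "expansion_factor": [5.0, 5.0]})
steps = [
    lambda df: df.assign(dbh_cm=df["dbh_mm"] / 10.0),
    lambda df: df.assign(stand_table=df["tally"] * df["expansion_factor"]),
]
tidy = run_steps(raw, steps)
```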
See `nemora.ingest.faib` for utilities (`load_psp_dictionary`,
`load_non_psp_dictionary`, `aggregate_stand_table`) that download schemas and
collapse tree detail tables into Nemora-ready stand-table summaries.
.. todo::
   Flesh out end-to-end ingestion workflows (including CLI usage and caching
   guidelines) once dataset connectors are implemented.
## CLI helper
Nemora exposes an early CLI stub for PSP stand tables:
```bash
nemora ingest-faib tests/fixtures/faib --baf 12 --output stand_table.csv

# Fetch PSP extracts and write output
nemora ingest-faib data/external/faib --baf 12 --fetch --dataset psp --output stand_table.csv

# Force a fresh download (overwrite cached files) before building the stand table
nemora ingest-faib data/external/faib --baf 12 --fetch --overwrite --output stand_table.csv

# Preview suggested BAF values and exit without generating a table
nemora ingest-faib data/external/faib --auto-bafs --fetch --dataset psp

# `faib-manifest` writes both CSV and Parquet by default; pass --no-parquet to emit CSV only.
# Fetch extracts, auto-select BAFs, and generate manifests + stand tables (CSV + Parquet)
nemora faib-manifest data/external/faib/manifest_psp --auto-bafs --auto-count 3

# Reuse an existing download, skip fetch, limit rows, and emit CSV + Parquet manifests
nemora faib-manifest examples/faib_manifest --source tests/fixtures/faib --no-fetch --baf 12 --max-rows 200

# CSV-only regeneration example (details in docs/examples/faib_manifest_parquet.md)
nemora faib-manifest data/external/faib/manifest_psp --overwrite --no-parquet

# Prepare HPS tallies and manifest (no download, reusing cached CSVs)
nemora ingest-faib-hps data/external/faib --no-fetch --output data/examples/hps_baf12

# Download PSP extracts to a cache directory and write outputs to the examples folder
nemora ingest-faib-hps data/external/faib --cache-dir data/external/psp/raw --output data/examples/hps_baf12 --fetch

# Benchmark the HPS pipeline (timing only, no output)
nemora ingest-benchmark data/external/faib --no-fetch --iterations 5

# Benchmark and capture metrics (JSONL) for later trend analysis
nemora ingest-benchmark data/external/faib --no-fetch --iterations 3 --report-path logs/ingest_benchmark.jsonl

# Generate trimmed fixtures + manifest (used in tests)
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp

# Auto-select representative BAF values before generating the manifest
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto

# Limit stand tables to the first 200 rows when exporting the manifest samples
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto --max-rows 200

# Aggregate an FIA stand table (prototype) using local CSV extracts
python - <<'PY'
from nemora.ingest.fia import build_stand_table_from_csvs

table = build_stand_table_from_csvs(
    "data/external/fia/raw",
    plot_cn=47825253010497,
)
print(table.head())
PY

# Aggregate FIA stand tables via CLI (trimmed fixtures example)
nemora ingest-fia tests/fixtures/fia --tree-file tree_small.csv --cond-file cond_small.csv \
  --plot-file plot_small.csv --plot-cn 47825261010497 --plot-cn 47825253010497 --output fia_sample.csv
```
Unless `--fetch` (or `--fetch-state`) is supplied, these commands expect
pre-downloaded CSV extracts; future versions will bundle more complete fetch
and caching logic.
## Caching guidelines

- Use directories under `data/external/` for raw downloads (`faib/raw`,
  `fia/raw`, etc.). They are already ignored by Git.
- Prefer invoking `build_faib_dataset_source(...).fetch()` or
  `build_fia_dataset_source(...).fetch()` from notebooks/scripts instead of
  reimplementing download logic. The helpers enforce overwrite-safe `.part`
  files and capture provenance in `DatasetSource.metadata`.
- CLI commands pass through these helpers when `--fetch` or `--fetch-state` is
  supplied; cached files are reused unless `--overwrite` is specified.
- Document licences and terms of use alongside cached datasets (see
  `tests/fixtures/faib/README.md` for an example template).
## Repository sample
The repository contains a trimmed PSP example generated with
scripts/generate_faib_manifest.py under examples/faib_manifest/.
The manifest (faib_manifest.csv) lists each stand-table CSV (e.g.,
stand_table_baf12.csv) alongside the BAF, row count, and a truncated flag so
tests and documentation can reference a lightweight sample of the full FAIB
release. Re-run the script with --max-rows to regenerate the samples from a
larger local cache without bloating the repository.
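Working with the manifest is ordinary pandas. The column names below are assumptions for illustration; inspect faib_manifest.csv to confirm the actual schema:

```python
import pandas as pd

# Toy manifest mirroring the documented fields (path, BAF, row count,
# truncated flag); the exact column names are assumptions.
manifest = pd.DataFrame({
    "path": ["stand_table_baf12.csv", "stand_table_baf20.csv"],
    "baf": [12.0, 20.0],
    "rows": [200, 145],
    "truncated": [True, False],
})

# Surface which samples were clipped by --max-rows so docs can note the sampling.
clipped = manifest.loc[manifest["truncated"], "path"].tolist()
```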
The CLI and script both call :func:`nemora.ingest.faib.generate_faib_manifest`,
which orchestrates downloads, BAF selection, stand-table aggregation, and
manifest creation. The helper returns the manifest path, generated table paths,
and any files downloaded so automated workflows can inspect the output.
## FIA prototype

Nemora includes early helpers for USDA FIA CSV extracts
(:mod:`nemora.ingest.fia`). The :func:`nemora.ingest.fia.build_stand_table_from_csvs`
function joins TREE/COND/PLOT tables, filters live trees and conditions,
converts DBH to centimetres, and aggregates stand tables weighted by
TPA_UNADJ and condition proportions. These utilities are the first step toward
a full FIA ingest pipeline; use them to validate schema joins on downloaded
samples while additional ETL automation is planned.
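The weighting just described can be sketched with toy frames using standard FIA column names (STATUSCD, DIA, TPA_UNADJ, CONDPROP_UNADJ). This is a simplified illustration of the join and aggregation, not the real helper, which handles more columns and filters:

```python
import pandas as pd

# Toy FIA-style extracts; real files carry many more columns.
tree = pd.DataFrame({
    "PLT_CN": [1, 1, 1],
    "CONDID": [1, 1, 2],
    "STATUSCD": [1, 1, 2],      # 1 = live tree in FIA coding
    "DIA": [10.0, 12.0, 9.0],   # DBH in inches
    "TPA_UNADJ": [6.0, 6.0, 6.0],
})
cond = pd.DataFrame({
    "PLT_CN": [1, 1],
    "CONDID": [1, 2],
    "CONDPROP_UNADJ": [0.75, 0.25],
})

# Filter live trees, join their condition record, convert units, and weight
# trees-per-acre by the condition proportion before aggregating.
live = tree[tree["STATUSCD"] == 1].merge(cond, on=["PLT_CN", "CONDID"])
live = live.assign(
    dbh_cm=live["DIA"] * 2.54,
    weight=live["TPA_UNADJ"] * live["CONDPROP_UNADJ"],
)
stand_table = live.groupby("dbh_cm", as_index=False)["weight"].sum()
```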
The CLI supports automatic downloads via --fetch-state; Nemora maps state
codes to the public FIA DataMart URLs (for example, nemora ingest-fia data/fia
--fetch-state hi will retrieve HI_TREE.csv, HI_PLOT.csv, and HI_COND.csv
before aggregating). Downloads are optional: pass custom
--tree-file/--cond-file/--plot-file arguments when working with pre-existing
extracts or trimmed fixtures.
Licensing note: FIA data are public domain but attribution is appreciated;
refer to the USDA legal notice at https://www.fia.fs.usda.gov/contact/legal.php.
When redistributing trimmed fixtures (e.g., under tests/fixtures/fia) include
the citation and acquisition date so downstream users understand the provenance.