# Ingest Module (Draft)

This page introduces the scaffolding for the forthcoming `nemora.ingest` module.
It covers the core abstractions (`DatasetSource`, `TransformPipeline`) that new
connectors will extend to transform raw forest inventory releases (BC FAIB, FIA,
etc.) into the tidy stand tables consumed by `nemora.fit`, `nemora.sampling`,
and other modules.

## DatasetSource

`DatasetSource` captures enough metadata for the toolkit to locate and download
raw files. Provide a `fetcher` callable when remote retrieval is required:

```python
from pathlib import Path

from nemora.ingest import DatasetSource


def fetch_bc_faib(source: DatasetSource) -> list[Path]:
    output_dir = Path("data/external") / source.name
    output_dir.mkdir(parents=True, exist_ok=True)
    # TODO: integrate with the FAIB portal API (https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/)
    # to download PSP/CMI/NFI/YSM extracts. For now, drop a placeholder.
    (output_dir / "README.txt").write_text("FAIB data placeholder\n", encoding="utf-8")
    return [output_dir]


BC_FAIB_SOURCE = DatasetSource(
    name="bc-faib",
    description="BC FAIB ground sample plots (PSP, CMI, NFI, YSM)",
    uri="https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/",
    metadata={
        "notes": (
            "Public FAIB portal; subsample by BAF/prism size as needed. "
            "Bulk downloads also available via FTP under "
            "ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/psp/"
            " and the companion web interface at "
            "https://bcgov-env.shinyapps.io/FAIB_GROUND_SAMPLE/."
        )
    },
    fetcher=fetch_bc_faib,
)
```

When `BC_FAIB_SOURCE.fetch()` is invoked it delegates to `fetch_bc_faib`. Future
connectors will implement authenticated fetchers and cache management.
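
The delegation described above can be pictured with a small stand-in class. This is a minimal sketch, not Nemora's implementation: `SketchSource`, `fake_fetcher`, and the error raised when no fetcher is set are assumptions for illustration. Only the field names (`name`, `uri`, `metadata`, `fetcher`) follow the example above.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class SketchSource:
    """Hypothetical stand-in for DatasetSource (illustration only)."""

    name: str
    uri: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
    fetcher: Callable[[SketchSource], Iterable[Path]] | None = None

    def fetch(self) -> Iterable[Path]:
        # Delegate to the user-supplied callable, passing the source itself
        # so the fetcher can read name, uri, and metadata.
        if self.fetcher is None:
            raise RuntimeError(f"No fetcher configured for {self.name!r}")
        return self.fetcher(self)


def fake_fetcher(source: SketchSource) -> list[Path]:
    # A real fetcher would download files; this one only reports a path.
    return [Path("data/external") / source.name / "README.txt"]


source = SketchSource(name="bc-faib", fetcher=fake_fetcher)
paths = list(source.fetch())
```

Passing the source into the fetcher keeps the callable stateless, so one function can serve several sources that differ only in metadata.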

Nemora now provides first-class helpers for building these sources:

```python
from nemora.ingest.faib import build_faib_dataset_source
from nemora.ingest.fia import build_fia_dataset_source

faib_source = build_faib_dataset_source(
    "psp",
    destination="data/external/faib/raw",
    overwrite=False,
)
fia_source = build_fia_dataset_source(
    "HI",
    destination="data/external/fia/raw",
    tables=("TREE", "PLOT", "COND"),
)

# Trigger the downloads when required
faib_files = list(faib_source.fetch())
fia_files = list(fia_source.fetch())
```

Both helpers capture cache metadata (destination, filenames) in the resulting
`DatasetSource.metadata`, making it easier to surface provenance in logs or CLI
output.
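
As one way to surface that provenance, the sketch below formats a log line from a metadata mapping. The keys `destination` and `filenames` are assumed from the description above and may not match the actual keys in `DatasetSource.metadata`; `describe_cache` is a hypothetical helper, not part of Nemora.

```python
import logging
from collections.abc import Mapping
from typing import Any

logger = logging.getLogger("ingest-sketch")


def describe_cache(name: str, metadata: Mapping[str, Any]) -> str:
    """Build a one-line provenance summary (hypothetical helper)."""
    # Assumed keys; fall back to placeholders when they are absent.
    destination = metadata.get("destination", "<unknown>")
    filenames = metadata.get("filenames") or []
    listing = ", ".join(sorted(filenames)) or "<none recorded>"
    return f"{name}: cached under {destination} ({listing})"


summary = describe_cache(
    "faib-psp",
    {
        "destination": "data/external/faib/raw",
        "filenames": ["faib_tree_detail.csv", "faib_plot_header.csv"],
    },
)
logger.info(summary)
```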

## TransformPipeline

`TransformPipeline` holds an ordered list of callables, each of which accepts
and returns a `pandas.DataFrame`:

```python
import pandas as pd

from nemora.ingest import TransformPipeline


def convert_units(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(dbh_cm=frame["dbh_mm"] / 10.0)


def compute_stand_table(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(stand_table=frame["tally"] * frame["expansion_factor"])


pipeline = TransformPipeline(
    name="bc-faib-hps",
    metadata={"description": "Convert FAIB tallies to Nemora stand table format"},
)
pipeline.add_step(convert_units)
pipeline.add_step(compute_stand_table)
```
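
To show why step order matters, here is a minimal sketch of how an ordered list of callables might be applied. `SketchPipeline` is a hypothetical stand-in, not the real `TransformPipeline`; only the `add_step` and `run` method names come from this page, and `add_dbh_class` is an illustrative step.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

import pandas as pd

Step = Callable[[pd.DataFrame], pd.DataFrame]


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for TransformPipeline (illustration only)."""

    name: str
    steps: list[Step] = field(default_factory=list)

    def add_step(self, step: Step) -> None:
        self.steps.append(step)

    def run(self, frame: pd.DataFrame) -> pd.DataFrame:
        # Feed the output of each step into the next, in insertion order.
        for step in self.steps:
            frame = step(frame)
        return frame


def convert_units(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(dbh_cm=frame["dbh_mm"] / 10.0)


def add_dbh_class(frame: pd.DataFrame) -> pd.DataFrame:
    # Needs dbh_cm, so it must be added after convert_units.
    return frame.assign(dbh_class=frame["dbh_cm"].astype(int))


sketch = SketchPipeline(name="demo")
sketch.add_step(convert_units)
sketch.add_step(add_dbh_class)

raw = pd.DataFrame({"dbh_mm": [250.0, 400.0]})
result = sketch.run(raw)
```

Because each step uses `DataFrame.assign`, the input frame is left unmodified and every step works on a fresh copy.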

Ingest workflows can compose these pipelines with reusable helpers. For example,
the FAIB stand-table implementation now exposes a dedicated pipeline builder:

```python
import pandas as pd

from nemora.ingest.faib import build_faib_stand_table_pipeline

tree_detail = pd.read_csv("data/external/faib/raw/faib_tree_detail.csv")
plot_header = pd.read_csv("data/external/faib/raw/faib_plot_header.csv")

pipeline = build_faib_stand_table_pipeline(
    plot_header,
    baf=12.0,
    dbh_col="DBH_CM",
    expansion_col="TREE_EXP",
    baf_col="BLOWUP_MAIN",
)
stand_table = pipeline.run(tree_detail)
```

This mirrors the logic used by both the CLI and `generate_faib_manifest`, so
tests and notebooks can share the same transformation sequence.

### HPS tallies

PSP-derived HPS tallies can now be generated without the standalone helper
script. The ingest module exposes a convenience wrapper that streams the tree
detail CSV, filters plot visits, and returns both tallies and manifest data:

```python
from pathlib import Path

from nemora.ingest.hps import (
    SelectionCriteria,
    export_hps_outputs,
    load_plot_selections,
    run_hps_pipeline,
)

root = Path("data/external/faib")
plot_header = root / "faib_plot_header.csv"
sample_byvisit = root / "faib_sample_byvisit.csv"
tree_detail = root / "faib_tree_detail.csv"

criteria = SelectionCriteria(first_visit_only=True, max_plots=5)
selections = load_plot_selections(plot_header, sample_byvisit, baf=12.0, criteria=criteria)
result = run_hps_pipeline(tree_detail, selections, live_status=("L",), bin_width=1.0)
export_hps_outputs(
    result.tallies,
    result.manifest,
    output_dir=Path("data/examples/hps_baf12"),
    manifest_path=Path("data/examples/hps_baf12/manifest.csv"),
)
```

`run_hps_pipeline` returns an `HPSPipelineResult` containing the per-plot tallies
(grouped DataFrames), a combined manifest, and a flattened tallies DataFrame.
`export_hps_outputs` mirrors the historical script behaviour when writing files.
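
To make the tally step concrete, the following sketch bins live-tree diameters into fixed-width classes and counts trees per class for a single plot. It is a simplified illustration, not the logic inside `run_hps_pipeline`: the `(status, dbh_cm)` record layout, the function name `tally_by_dbh_class`, and the convention of labelling each class by its lower edge are assumptions. The `"L"` status code and `bin_width=1.0` mirror the arguments shown above.

```python
import math
from collections import Counter
from collections.abc import Iterable


def tally_by_dbh_class(
    trees: Iterable[tuple[str, float]],
    live_status: tuple[str, ...] = ("L",),
    bin_width: float = 1.0,
) -> dict[float, int]:
    """Count live trees per DBH class, keyed by the class lower edge (cm)."""
    counts: Counter[float] = Counter()
    for status, dbh_cm in trees:
        if status not in live_status:
            continue  # skip dead or otherwise excluded stems
        lower_edge = math.floor(dbh_cm / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Four stems on one hypothetical plot; the "D" stem is ignored.
plot_trees = [("L", 12.4), ("L", 12.9), ("D", 30.1), ("L", 25.0)]
tallies = tally_by_dbh_class(plot_trees)
wide_tallies = tally_by_dbh_class(plot_trees, bin_width=5.0)
```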

### Data dictionaries

FAIB publishes companion Excel data dictionaries alongside each compilation.
For example, the PSP release exposes `PSP_data_dictionary_20250514.xlsx` under
the FTP path above, and the non-PSP directory mirrors that structure (see
`ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/non_psp/`
and `non_PSP_data_dictionary_20250514.xlsx`). These spreadsheets map column
codes to descriptions. Reference them in ingest documentation and keep a local
copy alongside any downloads so analysts can interpret the FAIB variable names
in `faib_plot_header.csv`, `faib_tree_detail.csv`, and related tables when
building pipelines.

.. note::

   The FAIB team confirmed the portal data is fully public and can be
   redistributed. For bulk processing the FTP endpoints above are faster and
   expose the complete PSP, CMI, NFI, and YSM compilations (hundreds of megabytes
   per table). Nemora stores fetched CSVs under `data/external/faib/`, which is
   already `.gitignore`-d; treat that directory as a local cache and avoid
   committing the raw extracts.

   During rapid iteration you can limit downloads to specific files by passing
   `filenames=["faib_plot_header.csv"]` to :func:`nemora.ingest.faib.download_faib_csvs`
   so that small metadata tables can be fetched without transferring the
   multi-hundred-megabyte tree detail extracts.

Running `pipeline.run(raw_frame)` on a `TransformPipeline` applies the configured
steps sequentially, which suits cleaning CSV extracts, building stand tables,
and harmonising column names. Future CLI commands will orchestrate these
pipelines.

See `nemora.ingest.faib` for utilities (`load_psp_dictionary`,
`load_non_psp_dictionary`, `aggregate_stand_table`) that download schemas and
collapse tree detail tables into Nemora-ready stand-table summaries.

.. todo:: Flesh out end-to-end ingestion workflows (including CLI usage and
          caching guidelines) once dataset connectors are implemented.

## CLI helper

Nemora exposes early CLI commands for FAIB and FIA ingest, starting with a stub
for PSP stand tables:

```bash
nemora ingest-faib tests/fixtures/faib --baf 12 --output stand_table.csv

# Fetch PSP extracts and write output
nemora ingest-faib data/external/faib --baf 12 --fetch --dataset psp --output stand_table.csv
# Force a fresh download (overwrite cached files) before building the stand table
nemora ingest-faib data/external/faib --baf 12 --fetch --overwrite --output stand_table.csv
# Preview suggested BAF values and exit without generating a table
nemora ingest-faib data/external/faib --auto-bafs --fetch --dataset psp

# `faib-manifest` writes both CSV and Parquet by default; pass --no-parquet to emit CSV only.
# Fetch extracts, auto-select BAFs, and generate manifests + stand tables (CSV+Parquet)
nemora faib-manifest data/external/faib/manifest_psp --auto-bafs --auto-count 3
# Reuse an existing download, skip fetch, limit rows, and emit CSV + Parquet manifests
nemora faib-manifest examples/faib_manifest --source tests/fixtures/faib --no-fetch --baf 12 --max-rows 200
# CSV-only regeneration example (details in docs/examples/faib_manifest_parquet.md)
nemora faib-manifest data/external/faib/manifest_psp --overwrite --no-parquet

# Prepare HPS tallies and manifest (no download, reusing cached CSVs)
nemora ingest-faib-hps data/external/faib --no-fetch --output data/examples/hps_baf12
# Download PSP extracts to a cache directory and write outputs to the examples folder
nemora ingest-faib-hps data/external/faib --cache-dir data/external/psp/raw --output data/examples/hps_baf12 --fetch
# Benchmark the HPS pipeline (timing only, no output)
nemora ingest-benchmark data/external/faib --no-fetch --iterations 5
# Benchmark and capture metrics (JSONL) for later trend analysis
nemora ingest-benchmark data/external/faib --no-fetch --iterations 3 --report-path logs/ingest_benchmark.jsonl

# Generate trimmed fixtures + manifest (used in tests)
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp
# Auto-select representative BAF values before generating the manifest
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto
# Limit stand tables to the first 200 rows when exporting the manifest samples
python scripts/generate_faib_manifest.py examples/faib_manifest --dataset psp --auto --max-rows 200

# Aggregate an FIA stand table (prototype) using local CSV extracts
python - <<'PY'
from nemora.ingest.fia import build_stand_table_from_csvs

table = build_stand_table_from_csvs(
    "data/external/fia/raw",
    plot_cn=47825253010497,
)
print(table.head())
PY

# Aggregate FIA stand tables via CLI (trimmed fixtures example)
nemora ingest-fia tests/fixtures/fia --tree-file tree_small.csv --cond-file cond_small.csv \
  --plot-file plot_small.csv --plot-cn 47825261010497 --plot-cn 47825253010497 --output fia_sample.csv
```

Unless `--fetch` is supplied, `nemora ingest-faib` expects pre-downloaded FAIB
CSV extracts in the target directory. With `--fetch`, cached files are reused
unless `--overwrite` is also given.

### Caching guidelines

- Use directories under `data/external/` for raw downloads (`faib/raw`, `fia/raw`,
  etc.). They are already ignored by Git.
- Prefer invoking `build_faib_dataset_source(...).fetch()` or
  `build_fia_dataset_source(...).fetch()` from notebooks/scripts instead of
  reimplementing download logic. The helpers enforce overwrite-safe `.part`
  files and capture provenance in `DatasetSource.metadata`.
- CLI commands pass through these helpers when `--fetch` or `--fetch-state` is
  supplied; cached files are reused unless `--overwrite` is specified.
- Document licences and terms of use alongside cached datasets (see
  `tests/fixtures/faib/README.md` for an example template).
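
The overwrite-safe `.part` pattern mentioned in the list above can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the code used by the Nemora helpers: `write_via_part_file` is a hypothetical name, and the payload is passed in as bytes rather than streamed from a download.

```python
import os
import tempfile
from pathlib import Path


def write_via_part_file(destination: Path, payload: bytes, overwrite: bool = False) -> Path:
    """Write payload to destination through a temporary ``.part`` file."""
    if destination.exists() and not overwrite:
        # Reuse the cached copy instead of rewriting it.
        return destination
    destination.parent.mkdir(parents=True, exist_ok=True)
    part = destination.with_name(destination.name + ".part")
    try:
        part.write_bytes(payload)
        # Swap the finished file into place with a single rename, so readers
        # never see a half-written destination.
        os.replace(part, destination)
    finally:
        # Clean up a leftover partial file if the write or rename failed.
        part.unlink(missing_ok=True)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "faib" / "raw" / "faib_plot_header.csv"
    write_via_part_file(target, b"first\n")
    write_via_part_file(target, b"second\n")  # cached copy is kept
    kept = target.read_bytes()
    write_via_part_file(target, b"second\n", overwrite=True)
    replaced = target.read_bytes()
    leftovers = list(target.parent.glob("*.part"))
```

Writing to a sibling path keeps the rename on the same filesystem, which is what allows `os.replace` to swap the file in one step.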

## Repository sample

The repository contains a trimmed PSP example generated with
`scripts/generate_faib_manifest.py` under `examples/faib_manifest/`.
The manifest (`faib_manifest.csv`) lists each stand-table CSV (e.g.,
`stand_table_baf12.csv`) alongside the BAF, row count, and a `truncated` flag so
tests and documentation can reference a lightweight sample of the full FAIB
release. Re-run the script with `--max-rows` to regenerate the samples from a
larger local cache without bloating the repository.
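
The relationship between `--max-rows`, the row count, and the `truncated` flag might look like the sketch below. The column names (`path`, `baf`, `rows`, `truncated`), the second file name, and the helper `manifest_row` are hypothetical; the actual layout of `faib_manifest.csv` may differ.

```python
import csv
import io


def manifest_row(filename, baf, rows, max_rows=None):
    """Describe one stand-table file for the manifest (hypothetical layout)."""
    # A table is flagged as truncated only when rows were actually dropped.
    truncated = max_rows is not None and len(rows) > max_rows
    kept = rows[:max_rows] if truncated else rows
    return {"path": filename, "baf": baf, "rows": len(kept), "truncated": truncated}


# Five illustrative stand-table rows.
stand_table = [{"dbh_cm": dbh, "tally": 1} for dbh in range(10, 15)]

entries = [
    manifest_row("stand_table_baf12.csv", 12.0, stand_table, max_rows=3),
    manifest_row("stand_table_baf8.csv", 8.0, stand_table),
]

# Serialise the manifest to CSV text in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["path", "baf", "rows", "truncated"])
writer.writeheader()
writer.writerows(entries)
manifest_text = buffer.getvalue()
```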

The CLI and script both call :func:`nemora.ingest.faib.generate_faib_manifest`,
which orchestrates downloads, BAF selection, stand-table aggregation, and
manifest creation. The helper returns the manifest path, generated table paths,
and any files downloaded so automated workflows can inspect the output.

## FIA prototype

Nemora includes early helpers for USDA FIA CSV extracts
(:mod:`nemora.ingest.fia`). The :func:`nemora.ingest.fia.build_stand_table_from_csvs`
function joins ``TREE``/``COND``/``PLOT`` tables, filters live
trees/conditions, converts DBH to centimetres, and aggregates stand tables
weighted by ``TPA_UNADJ`` and condition proportions. These utilities are the
first step toward a full FIA ingest pipeline; use them to validate schema joins
on downloaded samples while additional ETL automation is being planned.
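
The join-filter-aggregate sequence described above can be sketched with pandas on toy data. This is an assumption-laden illustration rather than the body of ``build_stand_table_from_csvs``: it omits the ``PLOT`` join, treats ``STATUSCD == 1`` as live trees and ``COND_STATUS_CD == 1`` as the conditions to keep, and multiplies ``TPA_UNADJ`` by ``CONDPROP_UNADJ`` as one plausible reading of "weighted by"; the real weighting may differ.

```python
import pandas as pd

INCH_TO_CM = 2.54  # FIA records DIA in inches

# Toy extracts for a single plot; values are illustrative only.
tree = pd.DataFrame(
    {
        "PLT_CN": [1, 1, 1, 1],
        "CONDID": [1, 1, 2, 1],
        "STATUSCD": [1, 1, 1, 2],  # 1 = live, 2 = dead
        "DIA": [10.0, 10.2, 20.0, 15.0],
        "TPA_UNADJ": [6.0, 6.0, 6.0, 6.0],
    }
)
cond = pd.DataFrame(
    {
        "PLT_CN": [1, 1],
        "CONDID": [1, 2],
        "COND_STATUS_CD": [1, 1],
        "CONDPROP_UNADJ": [0.75, 0.25],
    }
)

# Keep live trees and the assumed set of valid conditions, then join on keys.
live = tree[tree["STATUSCD"] == 1]
kept_cond = cond[cond["COND_STATUS_CD"] == 1]
joined = live.merge(kept_cond, on=["PLT_CN", "CONDID"], how="inner")

# Convert DBH to centimetres and compute the assumed per-tree weight.
joined = joined.assign(
    dbh_cm=joined["DIA"] * INCH_TO_CM,
    weight=joined["TPA_UNADJ"] * joined["CONDPROP_UNADJ"],
)
joined["dbh_class"] = (joined["dbh_cm"] // 1).astype(int)  # 1 cm classes

# Sum the weights within each diameter class.
stand_table = (
    joined.groupby("dbh_class", as_index=False)["weight"]
    .sum()
    .rename(columns={"weight": "tally"})
)
```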

The CLI supports automatic downloads via ``--fetch-state``; Nemora maps state
codes to the public FIA Datamart URLs. For example,
``nemora ingest-fia data/fia --fetch-state hi`` retrieves ``HI_TREE.csv``,
``HI_PLOT.csv``, and ``HI_COND.csv`` before aggregating. Downloads are optional:
pass custom ``--tree-file``/``--cond-file``/``--plot-file`` arguments when
working with pre-existing extracts or trimmed fixtures.
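
The file-naming pattern implied by that example can be sketched as below. ``expected_fia_filenames`` is a hypothetical helper, the two-letter validation is an assumption, and the Datamart base URL is deliberately left out because this page does not state it.

```python
def expected_fia_filenames(state, tables=("TREE", "PLOT", "COND")):
    """Return the CSV names expected for one state (hypothetical helper)."""
    code = state.strip().upper()
    # Assumed validation: state codes are two-letter abbreviations.
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"Expected a two-letter state code, got {state!r}")
    return [f"{code}_{table.upper()}.csv" for table in tables]


names = expected_fia_filenames("hi")
```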

**Licensing note:** FIA data are public domain but attribution is appreciated;
refer to the USDA legal notice at <https://www.fia.fs.usda.gov/contact/legal.php>.
When redistributing trimmed fixtures (e.g., under ``tests/fixtures/fia``) include
the citation and acquisition date so downstream users understand the provenance.