# BC PSP HPS Data

This note outlines how to assemble publicly available horizontal point sampling (HPS) datasets from the BC Forest Analysis and Inventory Branch (FAIB) compilations. The goal is a clean, reproducible subset that mirrors the BAF 12 HPS workflow used in the Vegetation Resources Inventory (VRI).

## Source

- FTP: `ftp://ftp.for.gov.bc.ca/HTS/external/!publish/ground_plot_compilations/`
    - `psp/` – Provincial Vegetation Resources Inventory permanent sample plots.
    - `non_psp/` – related compilations for non-PSP programmes.
- Metadata: `PSP_data_dictionary_20250514.xlsx`, `non_PSP_data_dictionary_20250514.xlsx` (download and store checksums alongside scripts).

## Relevant Tables

| File | Purpose | Key Fields |
| ---- | ------- | ---------- |
| `faib_plot_header.csv` | Plot descriptors (one per plot/visit) | `CLSTR_ID`, `VISIT_NUMBER`, `PLOT`, `SITE_IDENTIFIER` |
| `faib_sample_byvisit.csv` | Plot visit metadata | `CLSTR_ID`, `VISIT_NUMBER`, `FIRST_MSMT`, `MEAS_DT`, `SAMP_TYP` |
| `faib_tree_detail.csv` | Per-tree measurements (large; download in chunks) | `CLSTR_ID`, `VISIT_NUMBER`, `PLOT`, `DBH`, `LV_D`, `TREE_NO`, `SP0` |

Additional summary tables (`faib_compiled_*`) provide aggregated basal areas and heights but are optional for the initial HPS tally pipeline.

## Extraction Recipe

1. **Mirror metadata**: save the data dictionaries and record SHA256 hashes in `data/external/psp/CHECKSUMS`.
2. **Filter plots**: load `faib_plot_header.csv` and retain rows corresponding to the desired PSP visit(s). The compilations do not store the BAF explicitly, so the workflow records the assumed value (BAF 12) alongside each plot.
3. **Join visit context**: merge `faib_sample_byvisit` on `(CLSTR_ID, VISIT_NUMBER)` to identify active measurement cycles (e.g., `FIRST_MSMT == "Y"` for baseline PSP visits).
4. **Build tallies**: stream `faib_tree_detail.csv` with `pandas.read_csv(..., chunksize=...)`, selecting the columns above; filter to the plots discovered in step 2, keep live trees (`STATUS_CD == "L"`), and bin DBH to centimetre midpoints. Output per plot:
    - `dbh_cm` bin centre,
    - `tally` counts,
    - `baf` (12),
    - optional species/stratum attributes for future use.

    Store the per-plot CSVs under `data/examples/hps_baf12/`.
5. **Document lineage**: create `data/examples/hps_baf12/README.md` summarising the selection criteria, transformation script, and citation requirements.

## Command-line helper

Use `scripts/prepare_hps_dataset.py` to automate the recipe above. The script downloads (or reuses cached) PSP CSVs, filters to first-measurement BAF 12 plots, and writes per-plot tallies plus a manifest, following the data-preparation steps documented in the EarthArXiv preprint by Paradis (2025).

```bash
python scripts/prepare_hps_dataset.py \
    --output-dir data/examples/hps_baf12 \
    --cache-dir data/external/psp/raw \
    --baf 12 \
    --max-plots 25
```

Key options:

- `--include-all-visits`: keep every measurement instead of first-measurement plots only.
- `--sample-type F`: restrict to specific `SAMP_TYP` codes if required.
- `--status L --status I`: define which tree status codes count as “live”.
- `--dry-run`: report how many plots would be produced without writing files.

### DataLad shortcut

If you prefer to mirror the manuscript dataset directly, the CLI exposes a helper that prints the DataLad commands required to install the reference data:

```bash
nemora fetch-reference-data --dry-run
```

Run with `--no-dry-run` (and a working DataLad installation) to install the dataset automatically. If DataLad is not present:

- From a source checkout, use `pip install -e ".[data]"` to pull in the optional extra.
- From PyPI, use `pip install --upgrade "nemora[data]"` (which installs `datalad[full]`).

The command also attempts to enable the `arbutus-s3` sibling by default.
Pass `--enable-remote ""` to skip this step, or pass a different sibling name if your configuration differs.

#### Installing with DataLad

```bash
pip install "nemora[data]"
nemora fetch-reference-data --no-dry-run
# if the remote requires enabling manually:
cd reference-data
datalad siblings
datalad siblings --name arbutus-s3 --action enable
datalad get -r .
```

The dataset is a standard git-annex repository. The top-level tree contains the `examples/data` artifacts used by the parity notebooks (e.g. `reference_hps/binned_meta_plots.csv`, the meta-plot table referenced below). Once the files are present locally, you can point the notebooks (or scripted workflows) directly at them under `reference-data/`.

### Sample bundle

The repository ships a small bundle generated with:

```bash
PYTHONPATH=src python scripts/prepare_hps_dataset.py \
    --output-dir examples/hps_baf12 \
    --manifest examples/hps_baf12_manifest.csv \
    --cache-dir data/external/psp/raw \
    --baf 12 \
    --max-plots 5
```

Outputs:

- Tallies: `examples/hps_baf12/*.csv`
- Manifest: `examples/hps_baf12_manifest.csv`
- Raw downloads cached (gitignored) under `data/external/psp/raw`.

## Worked censored workflow

The censored/two-stage regression in `tests/test_censored_workflow.py` loads the `binned_meta_plots.csv` file shipped with the DataLad dataset (or the copy committed in `examples/`).
Reuse that test as a template for exploratory analysis:

```python
import pandas as pd

from nemora.workflows.censoring import fit_censored_inventory

full_meta = pd.read_csv("examples/data/reference_hps/binned_meta_plots.csv")
censored = (
    full_meta[full_meta["dbh_cm"] >= 20.0]
    .groupby("dbh_cm", as_index=False)
    .agg({"tally": "sum", "expansion_factor": "mean"})
)

dbh = censored["dbh_cm"].to_numpy()
stand_table = censored["tally"].to_numpy() * censored["expansion_factor"].to_numpy()
results = fit_censored_inventory(dbh, stand_table, support=(20.0, float("inf")))
```

The resulting `FitResult` objects expose the same GOF metrics and residual summaries used in the PSP examples. Combine them with the reporting pattern described in the [programmatic HPS analysis guide](hps_api.md) or the parity notebook to regenerate the manuscript figures.

## Automation Status

- [x] Scripted pipeline (`scripts/prepare_hps_dataset.py`) with caching and binning controls.
- [x] Pytest fixtures covering selection + tally logic (`tests/fixtures/hps`).
- [x] PSP sample bundle committed under `examples/hps_baf12` with manifest and provenance notes.
- [x] Regression guard for the reference Weibull fit (`tests/test_hps_parity.py`).
- [x] Censored meta-plot fixture + regression (`tests/fixtures/hps/meta_censored.csv`, `tests/test_censored_workflow.py`).

.. todo:: Update this section once the nemora.ingest / sampling / synthesis modules land to reflect the broader workflow.
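As a companion to the extraction recipe, the chunked tally construction in step 4 can be sketched as below. This is a minimal, self-contained illustration, not the actual `scripts/prepare_hps_dataset.py` implementation: the inline CSV rows are invented stand-ins for `faib_tree_detail.csv`, and only the live-tree filter (`STATUS_CD == "L"`) and 1 cm midpoint binning are shown.

```python
import io

import pandas as pd

# Toy stand-in for faib_tree_detail.csv (columns follow the table above;
# the rows themselves are fabricated for illustration only).
CHUNKED_SOURCE = """CLSTR_ID,VISIT_NUMBER,PLOT,TREE_NO,SP0,DBH,STATUS_CD
A001,1,1,1,FD,20.4,L
A001,1,1,2,CW,20.9,L
A001,1,1,3,HW,31.2,D
A001,1,1,4,FD,30.8,L
"""

BAF = 12  # assumed basal area factor; not stored in the compilations


def tally_chunk(chunk: pd.DataFrame, keep_plots: set) -> pd.DataFrame:
    """Filter one chunk to live trees on the selected plots and bin DBH."""
    live = chunk[(chunk["STATUS_CD"] == "L") & chunk["CLSTR_ID"].isin(keep_plots)].copy()
    # Bin DBH to 1 cm classes and report the class midpoint (x.5 cm).
    live["dbh_cm"] = live["DBH"].astype(float).apply(lambda d: int(d) + 0.5)
    out = live.groupby("dbh_cm", as_index=False).size().rename(columns={"size": "tally"})
    out["baf"] = BAF
    return out


# Stream the source in chunks (chunksize=2 just to exercise the chunked path),
# tally each chunk, then merge the partial tallies per DBH bin.
tallies = (
    pd.concat(
        tally_chunk(chunk, keep_plots={"A001"})
        for chunk in pd.read_csv(io.StringIO(CHUNKED_SOURCE), chunksize=2)
    )
    .groupby("dbh_cm", as_index=False)
    .agg({"tally": "sum", "baf": "first"})
)
print(tallies)
```

The real pipeline applies the same pattern to `pandas.read_csv("faib_tree_detail.csv", chunksize=...)`, restricted to the plot set discovered in step 2, and writes one tally CSV per plot.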