FAIB Manifest Parquet Workflow

Nemora emits FAIB manifest summaries as both CSV and Parquet by default. Parquet provides columnar storage and faster downstream analytics, making it the recommended format for notebook or Spark pipelines. Pass --no-parquet if you only need CSV outputs.

CLI examples

  • Fetch PSP extracts, auto-select BAFs, and generate manifests/stats:

    nemora faib-manifest data/external/faib/manifest_psp --auto-bafs --auto-count 3

  • Reuse cached downloads, limit rows, and emit Parquet alongside CSV (default):

    nemora faib-manifest examples/faib_manifest --source tests/fixtures/faib --no-fetch --baf 12 --max-rows 200

  • Produce CSV only when downstream tooling cannot read Parquet:

    nemora faib-manifest examples/faib_manifest --source tests/fixtures/faib --no-fetch --baf 12 --max-rows 200 --no-parquet

Loading the Parquet manifest

import pandas as pd

manifest = pd.read_parquet("examples/faib_manifest/faib_manifest.parquet")
print(manifest.head())

The Parquet file mirrors the CSV schema (dataset, baf, rows, path, truncated). Use --no-parquet if you need to skip the columnar output or keep storage requirements minimal.
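Because the schema is fixed, quick sanity checks on a loaded manifest are one-liners in pandas. A small sketch using illustrative rows (the file paths and values below are made up, but the column names follow the schema above):

```python
import pandas as pd

# Illustrative manifest rows; the column names mirror the documented schema
# (dataset, baf, rows, path, truncated). Paths and values are hypothetical.
manifest = pd.DataFrame(
    {
        "dataset": ["psp", "psp", "vri"],
        "baf": [12, 12, 8],
        "rows": [200, 150, 75],
        "path": [
            "examples/faib_manifest/psp_baf12_a.csv",
            "examples/faib_manifest/psp_baf12_b.csv",
            "examples/faib_manifest/vri_baf8.csv",
        ],
        "truncated": [True, False, False],
    }
)

# Entries clipped by --max-rows, and total stand-table rows per BAF.
clipped = manifest[manifest["truncated"]]
rows_per_baf = manifest.groupby("baf")["rows"].sum()
print(clipped["path"].tolist())
print(rows_per_baf)
```

The same two checks work unchanged on a real manifest loaded with pd.read_parquet.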

Feed manifest entries into sampling workflows

Once a manifest exists you can select an individual stand table, fit a distribution, and draw samples while tuning the numeric integration settings:

from pathlib import Path

import pandas as pd

from nemora.core import InventorySpec
from nemora.fit import fit_inventory
from nemora.sampling import SamplingConfig, sample_distribution

manifest = pd.read_parquet("examples/faib_manifest/faib_manifest.parquet")
stand_csv = Path(manifest.loc[0, "path"])  # CSV path captured in the manifest
stand_table = pd.read_csv(stand_csv)

bins = stand_table["dbh_cm"].to_numpy()
tallies = stand_table["tally"].to_numpy(dtype=float)

inventory = InventorySpec(
    name=stand_csv.stem,
    sampling="hps",
    bins=bins,
    tallies=tallies,
    metadata={"grouped": True},
)

fit = fit_inventory(inventory, ["weibull"], configs={})[0]
config = SamplingConfig(
    grid_points=4096,
    support_multiplier=12.0,
    integration_method="quad",
    quad_abs_tol=1e-9,
    quad_rel_tol=1e-8,
    cache_numeric_cdf=True,
)

draws = sample_distribution(
    fit.distribution,
    fit.parameters,
    size=500,
    random_state=123,
    config=config,
)
print(draws[:5])

This script loads the Parquet manifest, pinpoints the original stand-table CSV, fits a Weibull distribution, and samples DBH draws using a high-resolution CDF grid. Adjust SamplingConfig to benchmark how different grid densities or integration methods trade accuracy against performance, and swap in nemora.sampling.bootstrap_inventory when you need the richer metadata tracked by BootstrapResult.
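The grid_points and support_multiplier knobs suggest that, under the hood, sampling works by inverse-transform sampling over a tabulated CDF. A numpy-only sketch of that general technique, independent of Nemora (the Weibull shape/scale values and the helper names here are arbitrary illustrations, not Nemora internals):

```python
import numpy as np

def weibull_cdf(x, shape, scale):
    """Closed-form Weibull CDF: F(x) = 1 - exp(-(x / scale) ** shape)."""
    return 1.0 - np.exp(-((x / scale) ** shape))

def sample_from_cdf_grid(shape, scale, size, grid_points=4096,
                         support_multiplier=12.0, random_state=None):
    """Inverse-transform sampling on a tabulated CDF grid.

    grid_points sets the resolution of the tabulated CDF, and
    support_multiplier caps the grid at a multiple of the scale
    parameter (loosely mirroring the SamplingConfig fields above).
    """
    rng = np.random.default_rng(random_state)
    grid = np.linspace(0.0, support_multiplier * scale, grid_points)
    cdf = weibull_cdf(grid, shape, scale)
    # Push uniforms through the inverted (interpolated) CDF.
    u = rng.uniform(size=size)
    return np.interp(u, cdf, grid)

draws = sample_from_cdf_grid(shape=2.3, scale=28.0, size=500, random_state=123)
print(draws.mean())
```

Coarser grids make the interpolation step cheaper but quantize the tails, which is exactly the accuracy/performance trade-off worth benchmarking with SamplingConfig.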

Export bootstrap DBH vectors

After calling bootstrap_inventory(..., return_result=True) you can convert the result into per-resample DBH vectors (plus an optional long-form table) using nemora.sampling.bootstrap_dbh_vectors. The Typer CLI wraps this workflow so you can export JSON + Parquet artifacts without writing code:

nemora sampling-export-bootstrap-dbh "$(python - <<'PY'
import pandas as pd
manifest = pd.read_parquet('examples/faib_manifest/faib_manifest.parquet')
print(manifest.loc[0, 'path'])
PY
)" \
  --stand-id faib-demo-001 \
  --output examples/faib_manifest/faib_demo_dbh.json \
  --table-output examples/faib_manifest/faib_demo_dbh.parquet \
  --resamples 3 \
  --sample-size 25

The JSON file captures metadata (distribution, fitted parameters, bins/tallies, RNG seed) alongside per-resample DBH arrays, while the Parquet export stores every (resample, bin, dbh) row plus tally-derived weights. Feed either artifact directly into upcoming synthesis/simulation tooling, or archive it with your sampling experiment logs.
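The long-form Parquet layout makes per-resample summaries a straightforward groupby. A hedged sketch using an illustrative frame with the (resample, bin, dbh) layout plus a weight column (the exact column names in the export may differ):

```python
import pandas as pd

# Illustrative rows mimicking the long-form export's (resample, bin, dbh)
# layout with tally-derived weights; column names are assumptions.
table = pd.DataFrame(
    {
        "resample": [0, 0, 0, 1, 1, 1],
        "bin": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
        "dbh": [11.2, 19.5, 31.0, 9.8, 21.3, 29.4],
        "weight": [0.5, 0.3, 0.2, 0.5, 0.3, 0.2],
    }
)

# Tally-weighted mean DBH for each bootstrap resample.
grouped = table.assign(w_dbh=table["dbh"] * table["weight"]).groupby("resample")
weighted_mean = grouped["w_dbh"].sum() / grouped["weight"].sum()
print(weighted_mean)
```

On the real export you would load the frame with pd.read_parquet instead of constructing it inline; the groupby is unchanged.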