# Sampling Utilities (Draft)

The `nemora.sampling` module provides helpers for converting registered PDFs into CDFs, drawing random variates, sampling fitted mixtures, and bootstrapping stand tables from `nemora.fit` results.

## Convert a PDF to a CDF

```python
import numpy as np

from nemora.sampling import pdf_to_cdf

cdf = pdf_to_cdf("weibull", {"a": 2.5, "beta": 12.0, "s": 1.0}, method="numeric")
x = np.linspace(0.0, 40.0, 100)
y = cdf(x)
```

When a distribution exposes an analytic CDF, `pdf_to_cdf(..., method="analytic")` delegates to it; otherwise the helper falls back to numeric integration.

### Configuring numeric integration

`pdf_to_cdf` accepts a `SamplingConfig` so you can control grid density, integration backend, and quadrature tolerances:

```python
from nemora.sampling import SamplingConfig, pdf_to_cdf

cfg = SamplingConfig(grid_points=2048, integration_method="quad", quad_rel_tol=1e-7)
cdf = pdf_to_cdf(
    "gamma",
    {"beta": 4.0, "p": 3.0, "s": 1.0},
    method="numeric",
    config=cfg,
)
```

The default uses a trapezoid grid; switching to `"quad"` delegates to `scipy.integrate.quad` with the tolerances above. You can also set `integration_method="simpson"` to integrate via Simpson's rule.

Set `cache_numeric_cdf=True` if you plan to evaluate the same numeric CDF repeatedly: Nemora caches the computed grid/integral pair in memory so sampling calls avoid rerunning the integrator.

## Sample from a distribution

```python
from nemora.sampling import sample_distribution

draws = sample_distribution("gamma", {"beta": 4.0, "p": 3.0, "s": 1.0}, size=500)
```

Distributions with closed-form inverse CDFs (Weibull, exponential, Pareto, uniform, lognormal) use analytic inversion internally for improved accuracy. Logistic/Fisk currently fall back to the numeric pathway described above; this is documented in `notes/sampling_inverse_matrix.md` and will only change once synthesis requires a closed-form helper.
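The numeric pathway boils down to inverse-transform sampling on a tabulated CDF: integrate the pdf on a grid, normalise, then invert at uniform draws. A minimal numpy sketch of the idea, assuming a hand-rolled Weibull pdf (the function name and parameterization here are illustrative, not Nemora's API):

```python
import numpy as np

def numeric_inverse_sample(pdf, lo, hi, size, grid_points=2048, rng=None):
    """Draw variates from `pdf` on [lo, hi] by inverting a grid-based CDF."""
    rng = np.random.default_rng(rng)
    x = np.linspace(lo, hi, grid_points)
    density = pdf(x)
    # Cumulative trapezoid integral, normalised so the CDF ends at 1.
    cdf = np.concatenate(
        [[0.0], np.cumsum((density[1:] + density[:-1]) / 2.0 * np.diff(x))]
    )
    cdf /= cdf[-1]
    # Invert the CDF at uniform draws via linear interpolation.
    u = rng.uniform(size=size)
    return np.interp(u, cdf, x)

# Weibull-shaped pdf (shape 2.5, scale 12), matching the first example above.
pdf = lambda x: (2.5 / 12.0) * (x / 12.0) ** 1.5 * np.exp(-((x / 12.0) ** 2.5))
draws = numeric_inverse_sample(pdf, 0.0, 40.0, size=1000, rng=42)
```

A denser grid tightens the interpolation error, which is why `SamplingConfig(grid_points=...)` matters for the numeric pathway.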
## Sample from a mixture fit

```python
import numpy as np

from nemora.fit import MixtureComponentFit, MixtureFitResult
from nemora.sampling import sample_mixture_fit

components = [
    MixtureComponentFit(name="gamma", weight=0.6, parameters={"beta": 3.0, "p": 2.0}),
    MixtureComponentFit(name="gamma", weight=0.4, parameters={"beta": 8.0, "p": 5.0}),
]
mixture = MixtureFitResult(
    distribution="mixture",
    components=components,
    log_likelihood=-100.0,
    iterations=10,
    converged=True,
)
draws = sample_mixture_fit(mixture, size=1000, random_state=np.random.default_rng(42))
```

Pass a `numpy.random.Generator` (or integer seed) via `random_state` to obtain reproducible mixture draws. Use the optional `lower`/`upper` parameters to truncate the draws (rejection sampling ensures all values fall inside the interval), and `weight_overrides=[...]` when you need to re-weight components dynamically (e.g., mixture-of-experts routing):

```python
draws = sample_mixture_fit(
    mixture,
    size=250,
    random_state=123,
    lower=5.0,
    upper=25.0,
    weight_overrides=[1.0, 0.0],  # ignore the second component for this scenario
)
```

Invalid overrides (negative weights, wrong lengths, or zero totals) raise `ValueError` so calling code can fall back gracefully.

## Bootstrap a fitted inventory

```python
import numpy as np

from nemora.core import FitResult
from nemora.sampling import BootstrapResult, bootstrap_inventory

fit = FitResult(distribution="gamma", parameters={"beta": 5.0, "p": 2.5, "s": 1.0})
bins = np.array([10.0, 20.0, 30.0])
tallies = np.array([5, 3, 2], dtype=float)

result: BootstrapResult = bootstrap_inventory(
    fit,
    bins,
    tallies,
    resamples=5,
    sample_size=25,
    return_result=True,
)
samples = result.samples
stacked = result.stacked()
```

Passing `return_result=True` yields a `BootstrapResult` containing the sampled arrays and metadata (distribution, parameters, bins, tallies, RNG seed).

> **Warning:** These APIs are experimental.
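The truncation and override semantics described for `sample_mixture_fit` can be sketched in plain numpy: pick a component per draw from the (overridden) weights, then rejection-sample until every value lands inside `[lower, upper]`. This is a sketch of the technique, not Nemora's implementation; the function and sampler callables are hypothetical:

```python
import numpy as np

def sample_truncated_mixture(weights, samplers, size, lower, upper, rng=None):
    """Rejection-sample a mixture: choose a component per draw, then keep
    only values inside [lower, upper], redrawing until `size` are collected."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)
    if weights.min() < 0 or weights.sum() == 0 or len(weights) != len(samplers):
        raise ValueError("weights must be non-negative, non-zero, and match samplers")
    probs = weights / weights.sum()
    out = np.empty(size)
    filled = 0
    while filled < size:
        # Draw component indices for the remaining slots, then sample each.
        comp = rng.choice(len(samplers), size=size - filled, p=probs)
        batch = np.array([samplers[c](rng) for c in comp])
        keep = batch[(batch >= lower) & (batch <= upper)]
        out[filled:filled + keep.size] = keep
        filled += keep.size
    return out

# Two gamma components; the [1.0, 0.0] override mutes the second one,
# mirroring the weight_overrides example above.
samplers = [
    lambda rng: rng.gamma(shape=2.0, scale=3.0),
    lambda rng: rng.gamma(shape=5.0, scale=8.0),
]
draws = sample_truncated_mixture(
    [1.0, 0.0], samplers, size=250, lower=5.0, upper=25.0, rng=123
)
```

Rejection sampling keeps the conditional distribution inside the interval exact, at the cost of extra draws when the truncation window holds little probability mass.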
> Expect refinements (additional configuration, performance tuning) as we integrate them with downstream modules.

## Bootstrap result metadata

`bootstrap_inventory(..., return_result=True)` returns a `BootstrapResult` containing:

- `samples`: list of `(bin, draw)` arrays
- `distribution` / `parameters`: metadata from the fit
- `bins`, `tallies`, `resamples`, `sample_size`, `rng_seed`
- Convenience helpers: `stacked()` to concatenate samples and `to_dataframe()` to return a DataFrame with `resample`, `bin`, and `draw` columns.

Use the metadata when passing bootstrap outputs into synthesis or the simulation module.

## Using bootstrap outputs downstream

- **Synthesis (`nemora.synthesis`)** expects paired `(bin, draw)` arrays plus the originating fit metadata so stem or stand generators can report provenance. Use `nemora.synthesis.helpers.bootstrap_payload(result)` to obtain both the stacked array and a `pandas.DataFrame` with metadata attached in `frame.attrs["nemora_bootstrap"]`. Prefer the CLI helper (`nemora sampling-describe-bootstrap`) when you want to preview metadata from an existing stand table or emit JSON for automation.
- **Simulation (`nemora.simulation`)** can persist the entire `BootstrapResult` (including the RNG seed) to regenerate uncertainty studies or re-run Monte Carlo workflows deterministically.
- **DBH helpers** use `nemora.sampling.helpers.bootstrap_dbh_vectors(...)` (or the CLI shown below) to produce per-stand DBH arrays plus metadata suitable for synthesis/simulation. See `docs/examples/faib_manifest_parquet.md` for an end-to-end manifest walkthrough.
- **Ingest and benchmarking notebooks** should write the stacked output to Parquet so future steps can slice by distribution, BAF, or inventory metadata without re-running the bootstrap sampling step.

Downstream modules should rely on the metadata provided here rather than reconstructing provenance manually.
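The long-form layout that `to_dataframe()` produces can be reconstructed by hand from a list of per-resample `(bin, draw)` arrays, which is useful when persisting stacked output to Parquet. A pandas sketch with synthetic data standing in for `BootstrapResult.samples` (column names follow the docs above; the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic per-resample (bin, draw) pairs, two resamples of three draws each.
samples = [
    (np.array([10.0, 10.0, 20.0]), np.array([8.4, 11.2, 17.9])),
    (np.array([20.0, 30.0, 30.0]), np.array([22.1, 27.5, 31.0])),
]

# One row per draw, tagged with its resample index: the long-form table
# downstream modules can slice by resample, bin, or distribution metadata.
frame = pd.concat(
    [
        pd.DataFrame({"resample": i, "bin": bins, "draw": draws})
        for i, (bins, draws) in enumerate(samples)
    ],
    ignore_index=True,
)
```

Writing `frame` to Parquet preserves this shape, so later steps can filter without re-running the bootstrap.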
## Export DBH vectors for synthesis

Turn a `BootstrapResult` into per-resample DBH vectors (and an optional long-form table) via the helper:

```python
from nemora.sampling import bootstrap_dbh_vectors

payload = bootstrap_dbh_vectors(result, stand_id="psp-stand-001")
print(payload.dbh_vectors[0][:5])
payload.frame.head()
```

The helper preserves metadata (`distribution`, `parameters`, `bins`, `tallies`, RNG seed) and annotates each row with the stand identifier plus the original tally-derived weights.

When you prefer a no-code path, the CLI mirrors this workflow:

```bash
nemora sampling-export-bootstrap-dbh tests/fixtures/hps_psp_stand_table.csv \
  --stand-id psp-stand-001 \
  --output tmp/bootstrap_dbh.json \
  --table-output tmp/bootstrap_dbh.parquet \
  --resamples 3 \
  --sample-size 25 \
  --seed 2025
```

The JSON file captures per-resample DBH arrays and metadata, while the table export (Parquet or CSV) stores every `(resample, bin, dbh)` row for downstream analysis.

## Sampling directly from ingest-created manifests

After generating manifests (CSV + Parquet by default) via `nemora faib-manifest`, select an entry, fit a distribution, and draw samples while tuning the numeric integration settings. Use `--no-parquet` if you prefer CSV-only outputs. `docs/examples/faib_manifest_parquet.md` contains a complete example that:

1. Loads the Parquet manifest and resolves the stand-table path.
2. Wraps the DBH bins/tallies in an `InventorySpec`.
3. Fits a distribution via `fit_inventory`.
4. Calls `sample_distribution(..., config=SamplingConfig(...))` to test trapezoid/Simpson/quad integration modes, grid densities, and cache settings.

Use this pattern when validating ingest outputs or when you need to benchmark how numeric integration tolerances impact downstream sampling accuracy.
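To build intuition for the integration modes being benchmarked in step 4, you can compare the trapezoid rule at different grid densities against `scipy.integrate.quad` on a known pdf, checking that the total probability mass approaches 1. A short standalone sketch (the gamma parameters and integration bounds are illustrative):

```python
import numpy as np
from scipy import integrate, stats

pdf = stats.gamma(a=3.0, scale=4.0).pdf  # gamma with shape p=3, scale beta=4

# Trapezoid rule on fixed grids: accuracy depends on grid density.
masses = {}
for n in (64, 2048):
    x = np.linspace(0.0, 80.0, n)
    masses[n] = integrate.trapezoid(pdf(x), x)
    print(f"grid_points={n}: total mass = {masses[n]:.6f}")

# Adaptive quadrature with an explicit relative tolerance.
mass_quad, _ = integrate.quad(pdf, 0.0, 80.0, epsrel=1e-7)
print(f"quad: total mass = {mass_quad:.6f}")
```

The same comparison applies to CDF values at individual points, which is what ultimately drives sampling accuracy in the numeric pathway.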