Aggregate Detection Results
===========================

This how-to demonstrates the post-inference workflow: convert HawkEars JSON payloads into the
canonical detection schema, persist a Parquet file, and summarize detections with DuckDB plus the
``badc report`` helpers.

Prerequisites
-------------

* ``badc infer run`` completed and wrote JSON files under ``/artifacts/infer``.
* DuckDB is available (installed via the package dependencies or ``pip install duckdb``) so
  ``--parquet`` exports and report commands succeed.
* ``badc`` is available on ``PATH`` (editable install or packaged release).

Step 1 — Aggregate JSON to CSV/Parquet
--------------------------------------

1. Point ``badc infer aggregate`` at the inference output directory. Capture both CSV (easy diff)
   and Parquet (columnar analytics) targets::

      badc infer aggregate data/datalad/bogus/artifacts/infer \
          --manifest data/datalad/bogus/manifests/GNWT-290.csv \
          --output data/datalad/bogus/artifacts/aggregate/summary.csv \
          --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet

2. The command crawls all JSON files and injects chunk metadata (start/end offsets, hashes,
   dataset root). When the manifest path is supplied, any missing chunk metadata is retrieved
   directly from the CSV, so custom runners do not need to embed it in their JSON payloads. Each
   detection row carries both relative/absolute start **and** end timestamps, HawkEars label
   codes/names, confidence, runner label, and the HawkEars ``model_version`` extracted from the
   submodule. The command writes the canonical schema described in :mod:`badc.aggregate`.
3. Commit outputs with ``datalad save`` or ``git add`` as appropriate so the provenance of each
   inference batch is preserved.

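
The manifest fallback described in item 2 can be pictured with a short, standalone sketch. It is
not BADC's implementation: the ``chunk_id`` key, the column names, and the
``fill_chunk_metadata`` helper are all hypothetical and only illustrate filling gaps in a JSON
payload from the matching manifest row.

```python
import csv
import io


def fill_chunk_metadata(detection, manifest_rows, key="chunk_id"):
    """Return a copy of `detection` with missing fields filled from the manifest.

    Hypothetical helper: field names are assumptions, not the canonical schema.
    """
    row = manifest_rows.get(detection.get(key), {})
    merged = dict(detection)  # never mutate the caller's payload
    for field, value in row.items():
        # Only fill fields that are absent or empty in the JSON payload.
        if merged.get(field) in (None, ""):
            merged[field] = value
    return merged


# A tiny in-memory manifest standing in for the CSV passed via --manifest.
manifest_csv = io.StringIO(
    "chunk_id,start_ms,end_ms,sha256\n"
    "chunk_a,0,60000,abc123\n"
)
manifest_rows = {row["chunk_id"]: row for row in csv.DictReader(manifest_csv)}

# A payload from a custom runner that omitted most chunk metadata.
payload = {"chunk_id": "chunk_a", "label": "HYPO", "confidence": 0.91, "start_ms": None}
record = fill_chunk_metadata(payload, manifest_rows)
```

Values taken from the CSV stay strings in this sketch; a real pipeline would presumably cast them
to the types the schema expects before writing CSV or Parquet.
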
If you are aggregating older inference runs that predate the embedded JSON detections, BADC
re-parses ``HawkEars_labels.csv`` directly from the ``hawkears_output`` directory inside each
chunk, so the resulting CSV/Parquet files still carry real label codes, names, and confidences
without rerunning HawkEars.

Step 2 — Summaries via ``badc report summary``
----------------------------------------------

1. Use the Parquet export directly from the CLI::

      badc report summary \
          --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
          --group-by label,recording_id \
          --output data/datalad/bogus/artifacts/aggregate/summary_by_label.csv

2. The command runs a DuckDB query (``COUNT(*)`` plus ``AVG(confidence)``) and renders a Rich
   table. Adjust ``--group-by`` to focus on labels, recordings, or both.
3. The optional ``--output`` path mirrors the on-screen table so collaborators without DuckDB can
   review the same summary.

Python API shortcut
-------------------

When you prefer to stay inside a notebook or script, import :mod:`badc.aggregate_api` instead of
shelling out to the CLI. The helpers wrap the same canonical schema and DuckDB tooling::

   from badc import aggregate_api

   records = aggregate_api.aggregate_inference_outputs(
       "data/datalad/bogus/artifacts/infer",
       manifest="data/datalad/bogus/manifests/GNWT-114_20230509_094500.csv",
       summary_csv="artifacts/aggregate/GNWT-114_summary.csv",
       parquet="artifacts/aggregate/GNWT-114.parquet",
   )
   detections_df = aggregate_api.load_detection_dataframe("data/datalad/bogus/artifacts/infer")
   duckdb_views = aggregate_api.load_bundle_views(
       "data/datalad/bogus/artifacts/aggregate/GNWT-114.duckdb", limit_labels=10
   )

Behind the scenes the module reuses :mod:`badc.aggregate` and :mod:`badc.duckdb_helpers`, so
CSV/Parquet exports and DuckDB views stay consistent with the CLI.

Step 3 — Quicklook dashboards via ``badc report quicklook``
-----------------------------------------------------------

1. Run the quicklook command to capture label/recording highlights plus a per-chunk timeline::

      badc report quicklook \
          --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
          --top-labels 12 \
          --top-recordings 5 \
          --output-dir data/datalad/bogus/artifacts/aggregate/quicklook

2. The CLI prints Rich tables and ASCII sparklines so you can scan activity bursts directly in the
   terminal. When ``--output-dir`` is set, CSV snapshots land alongside the detections and can be
   imported into notebooks or attached to CHANGE_LOG entries for asynchronous reviews. The
   :doc:`/notebooks/aggregate_analysis` example shows how to load the CSVs with pandas to build
   plots.

Step 4 — Detailed parquet report
--------------------------------

1. Generate CSV/JSON artifacts for Erin using the new DuckDB-backed helper::

      badc report parquet \
          --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
          --bucket-minutes 30 \
          --output-dir data/datalad/bogus/artifacts/aggregate/parquet_report

2. The CLI prints overall stats, richer label/recording tables, and a bucketed timeline
   (detections per N-minute window). The ``--output-dir`` captures ``labels.csv``,
   ``recordings.csv``, ``timeline.csv``, and ``summary.json`` so Erin can drop them straight into
   her thesis figures or notebooks without running DuckDB herself.

Step 4b — Run the bundle helper (optional)
------------------------------------------

When you want the quicklook CSVs, parquet bundle, and DuckDB database in one pass, use::

   badc report bundle \
       --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
       --output-dir data/datalad/bogus/artifacts/aggregate \
       --bucket-minutes 30

The command derives ``detections_quicklook/``, ``detections_parquet_report/``,
``detections_duckdb_exports/``, and ``detections.duckdb`` automatically. Toggle individual stages
with ``--no-quicklook`` / ``--no-parquet-report`` / ``--no-duckdb-report`` or override specific
paths (e.g., ``--duckdb-database``) when needed.
This is the fastest way to package Phase 2 review artifacts for Erin after each inference run.

Tip: when running ``badc infer orchestrate --apply`` you can pass ``--bundle`` (plus the optional
``--bundle-*`` overrides) so these aggregation/report steps run automatically after every
recording. You only need to invoke the commands manually when you want custom tweaks.

Step 4c — Roll up a directory of detections
-------------------------------------------

When the aggregate directory holds multiple per-recording Parquet files (e.g., after running
``badc infer orchestrate --apply --bundle`` across a dataset), use the new helper to get a quick
cross-run summary::

   badc report aggregate-dir data/datalad/bogus/artifacts/aggregate \
       --limit 20 \
       --export-dir data/datalad/bogus/artifacts/aggregate/summary_exports

The command scans for ``*_detections.parquet`` (falling back to ``*.parquet`` when bundle outputs
use plain run-prefix names), loads the matches via DuckDB, prints consolidated label/recording
leaderboards, and optionally writes ``label_summary.csv`` / ``recording_summary.csv`` under the
export directory. This is the fastest sanity check to confirm that the refreshed bogus dataset
(now five GNWT recordings) still contains the expected mix of vocalizations and background noise.

Tip: pass ``--bundle-rollup`` to ``badc infer orchestrate`` (enabled automatically in
``badc pipeline run``) to run this helper as soon as the queue drains. By default the rollup
exports land in ``artifacts/aggregate/aggregate_summary/``, so Erin always has a dataset-wide CSV
ready alongside the per-recording bundles.

Step 5 — Materialize a DuckDB database
--------------------------------------

1. Turn the Parquet export into a DuckDB database (views + CSV snapshots) for Erin::

      badc report duckdb \
          --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
          --database data/datalad/bogus/artifacts/aggregate/detections.duckdb \
          --bucket-minutes 30 \
          --export-dir data/datalad/bogus/artifacts/aggregate/duckdb_exports

   This creates the ``detections`` table plus three convenience views (``label_summary``,
   ``recording_summary``, ``timeline_summary``), prints the same Rich tables shown in
   ``badc report parquet``, and writes ``label_summary.csv`` / ``recording_summary.csv`` /
   ``timeline.csv`` when ``--export-dir`` is provided.

2. Open the database interactively::

      duckdb data/datalad/bogus/artifacts/aggregate/detections.duckdb
      -- Loading resources from /home/.../.duckdbrc
      D SELECT * FROM label_summary LIMIT 5;

   Or issue one-off queries::

      duckdb -c "SELECT recording_id, SUM(detections) AS calls \
                 FROM recording_summary GROUP BY recording_id \
                 ORDER BY calls DESC LIMIT 5" \
          data/datalad/bogus/artifacts/aggregate/detections.duckdb

   The same database can be mounted in notebooks via ``duckdb.connect(".../detections.duckdb")``
   for richer charts without re-importing the Parquet file. Prefer the helper
   :func:`badc.duckdb_helpers.load_duckdb_views` when you want ready-made pandas DataFrames for
   the ``label_summary`` / ``recording_summary`` / ``timeline_summary`` views::

      from badc.duckdb_helpers import load_duckdb_views

      views = load_duckdb_views(
          "data/datalad/bogus/artifacts/aggregate/detections.duckdb", limit_labels=10
      )
      views.label_summary.head()

Step 6 — Notebook/SQL exploration
---------------------------------

1. Open the Parquet file with DuckDB for ad-hoc SQL::

      duckdb -c "SELECT label, COUNT(*) FROM 'data/.../detections.parquet' GROUP BY 1"

2. Or load it from Python::

      import duckdb

      con = duckdb.connect()
      con.execute(
          """
          SELECT recording_id, label, COUNT(*) AS detections
          FROM read_parquet('data/.../detections.parquet')
          GROUP BY 1, 2
          ORDER BY detections DESC
          """
      ).df()

3. Incorporate telemetry (``badc infer monitor`` / ``badc telemetry``) to correlate GPU usage with
   detection density; telemetry logs live alongside the aggregate files when the manifest resides
   within a DataLad dataset.

See also
--------

* :doc:`/cli/infer` for detailed aggregation options and telemetry monitoring.
* :doc:`/cli/report` for additional reporting helpers that will grow alongside Phase 2 analytics.