Aggregate Detection Results¶
This how-to demonstrates the post-inference workflow: convert HawkEars JSON payloads into the
canonical detection schema, persist a Parquet file, and summarize detections with DuckDB plus the
badc report helpers.
Prerequisites¶
badc infer run completed and wrote JSON files under <dataset>/artifacts/infer.
DuckDB is available (installed via the package dependencies or pip install duckdb) so --parquet exports and report commands succeed.
badc is available on PATH (editable install or packaged release).
Step 1 — Aggregate JSON to CSV/Parquet¶
Point badc infer aggregate at the inference output directory. Capture both CSV (easy diff) and Parquet (columnar analytics) targets:
badc infer aggregate data/datalad/bogus/artifacts/infer \
    --manifest data/datalad/bogus/manifests/GNWT-290.csv \
    --output data/datalad/bogus/artifacts/aggregate/summary.csv \
    --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet
The command crawls all JSON files and injects chunk metadata (start/end offsets, hashes, dataset root). When the manifest path is supplied, any missing chunk metadata is retrieved directly from the CSV so custom runners do not need to embed it into their JSON payloads. Each detection row now carries both relative and absolute start and end timestamps, HawkEars label codes/names, confidence, runner label, and the HawkEars model_version extracted from the submodule. The command writes the canonical schema described in badc.aggregate.
Commit outputs with datalad save or git add as appropriate so the provenance of each inference batch is preserved.
If you are aggregating older inference runs that predate the embedded JSON detections, BADC now
re-parses HawkEars_labels.csv directly from the hawkears_output directory inside each chunk so
the resulting CSV/Parquet files still carry real label codes, names, and confidences without
rerunning HawkEars.
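The flattening that aggregation performs can be pictured with a small standard-library sketch. This is not the badc implementation: the JSON layout (a `chunk` object plus a `detections` list), the millisecond units, and every field name below are assumptions made for illustration only. The real canonical schema is the one documented in badc.aggregate.

```python
import csv
import json
from pathlib import Path

# Hypothetical column names; the canonical schema lives in badc.aggregate.
FIELDNAMES = [
    "recording_id",
    "chunk_id",
    "label",
    "confidence",
    "start_rel_ms",
    "start_abs_ms",
]


def flatten_payloads(infer_dir, output_csv):
    """Flatten assumed per-chunk JSON payloads into one CSV of detection rows."""
    rows = []
    for json_path in sorted(Path(infer_dir).rglob("*.json")):
        payload = json.loads(json_path.read_text())
        chunk = payload.get("chunk", {})  # assumed key, not the real schema
        chunk_start = float(chunk.get("start_ms", 0))
        for det in payload.get("detections", []):
            rows.append(
                {
                    "recording_id": payload.get("recording_id", ""),
                    "chunk_id": chunk.get("id", json_path.stem),
                    "label": det["label"],
                    "confidence": det["confidence"],
                    # Relative offsets are within the chunk; absolute offsets
                    # add the chunk's position in the full recording.
                    "start_rel_ms": det["start_ms"],
                    "start_abs_ms": chunk_start + det["start_ms"],
                }
            )
    with open(output_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The point of the sketch is the relative-to-absolute conversion: each detection keeps its in-chunk offset and gains a recording-level offset derived from the chunk metadata.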
Step 2 — Summaries via badc report summary¶
Use the Parquet export directly from the CLI:
badc report summary \
    --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
    --group-by label,recording_id \
    --output data/datalad/bogus/artifacts/aggregate/summary_by_label.csv
The command runs a DuckDB query (COUNT(*) plus AVG(confidence)) and renders a Rich table. Adjust --group-by to focus on labels, recordings, or the combination thereof.
The optional --output path mirrors the on-screen table so collaborators without DuckDB can review the same summary.
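To make the shape of the grouped query concrete, here is a minimal sketch that runs the same COUNT(*) plus AVG(confidence) aggregation over a handful of in-memory rows. It swaps in the standard-library sqlite3 module as a stand-in for DuckDB so it runs without extra dependencies. The table layout, the `summarize` function, and the allowed column set are assumptions for illustration, not the query that badc actually issues.

```python
import sqlite3

# Columns a caller may group by in this sketch (assumed, mirrors --group-by).
ALLOWED_GROUP_COLUMNS = {"label", "recording_id"}


def summarize(rows, group_by):
    """Count detections and average confidence per group.

    rows: iterable of (recording_id, label, confidence) tuples.
    group_by: comma-separated column names, e.g. "label,recording_id".
    """
    columns = [name.strip() for name in group_by.split(",")]
    unknown = [name for name in columns if name not in ALLOWED_GROUP_COLUMNS]
    if unknown:
        # Validate before interpolating names into SQL text.
        raise ValueError(f"unsupported group-by columns: {unknown}")
    column_sql = ", ".join(columns)

    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE detections (recording_id TEXT, label TEXT, confidence REAL)"
    )
    con.executemany("INSERT INTO detections VALUES (?, ?, ?)", rows)
    query = (
        f"SELECT {column_sql}, COUNT(*) AS n_detections, "
        f"AVG(confidence) AS avg_confidence "
        f"FROM detections GROUP BY {column_sql} "
        f"ORDER BY n_detections DESC, {column_sql}"
    )
    result = con.execute(query).fetchall()
    con.close()
    return result
```

Calling `summarize(rows, "label")` returns one tuple per label, while `summarize(rows, "label,recording_id")` returns one tuple per label and recording pair, which is the effect of widening --group-by.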
Python API shortcut¶
When you prefer to stay inside a notebook or script, import badc.aggregate_api instead of
shelling out to the CLI. The helpers wrap the same canonical schema and DuckDB tooling:
from badc import aggregate_api

records = aggregate_api.aggregate_inference_outputs(
    "data/datalad/bogus/artifacts/infer",
    manifest="data/datalad/bogus/manifests/GNWT-114_20230509_094500.csv",
    summary_csv="artifacts/aggregate/GNWT-114_summary.csv",
    parquet="artifacts/aggregate/GNWT-114.parquet",
)
detections_df = aggregate_api.load_detection_dataframe("data/datalad/bogus/artifacts/infer")
duckdb_views = aggregate_api.load_bundle_views(
    "data/datalad/bogus/artifacts/aggregate/GNWT-114.duckdb", limit_labels=10
)
Behind the scenes the module reuses badc.aggregate and
badc.duckdb_helpers, so CSV/Parquet exports and DuckDB views stay consistent with the CLI.
Step 3 — Quicklook dashboards via badc report quicklook¶
Run the quicklook command to capture label/recording highlights plus a per-chunk timeline:
badc report quicklook \
    --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
    --top-labels 12 \
    --top-recordings 5 \
    --output-dir data/datalad/bogus/artifacts/aggregate/quicklook
The CLI prints Rich tables and ASCII sparklines so you can scan activity bursts directly in the terminal. When --output-dir is set, CSV snapshots land alongside the detections and can be imported into notebooks or attached to CHANGE_LOG entries for asynchronous reviews. The Aggregate analysis walkthrough example shows how to load the CSVs with pandas to build plots.
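For readers curious how per-chunk counts turn into a one-line sparkline, the following is a minimal sketch of the general technique. The character ramp and the `sparkline` function are hypothetical; badc's own rendering may use different glyphs or scaling.

```python
# Eight ASCII intensity levels, lowest to highest (an arbitrary choice here).
SPARK_LEVELS = " .:-=+*#"


def sparkline(counts):
    """Render non-negative integer counts as one ASCII character per value."""
    if not counts:
        return ""
    peak = max(counts)
    if peak == 0:
        # No activity anywhere: render the lowest level for every chunk.
        return SPARK_LEVELS[0] * len(counts)
    top = len(SPARK_LEVELS) - 1
    # Scale each count to 0..top with integer round-half-up arithmetic.
    return "".join(SPARK_LEVELS[(c * top + peak // 2) // peak] for c in counts)
```

Because every value is scaled against the busiest chunk, the tallest glyph always marks the peak and quieter chunks fade toward blank space, which is what makes bursts easy to spot in a terminal.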
Step 4 — Detailed parquet report¶
Generate CSV/JSON artifacts for Erin using the new DuckDB-backed helper:
badc report parquet \
    --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
    --bucket-minutes 30 \
    --output-dir data/datalad/bogus/artifacts/aggregate/parquet_report
The CLI prints overall stats, richer label/recording tables, and a bucketed timeline (detections per N-minute window). The --output-dir captures labels.csv, recordings.csv, timeline.csv, and summary.json so Erin can drop them straight into her thesis figures or notebooks without running DuckDB herself.
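The bucketed timeline is simply a count of detections per fixed-width window. A minimal sketch of that idea follows, assuming detection start times are available as absolute offsets in seconds. The unit, the `bucket_timeline` name, and the output shape are illustrative assumptions rather than the report's actual implementation.

```python
from collections import Counter


def bucket_timeline(start_seconds, bucket_minutes=30):
    """Count detections per fixed-width window.

    start_seconds: iterable of absolute detection start offsets, in seconds
        (assumed unit for this sketch).
    Returns a sorted list of (bucket_start_seconds, detection_count) pairs.
    """
    if bucket_minutes <= 0:
        raise ValueError("bucket_minutes must be positive")
    width = bucket_minutes * 60
    counts = Counter()
    for start in start_seconds:
        # Floor each timestamp to the start of the window that contains it.
        bucket_start = int(start // width) * width
        counts[bucket_start] += 1
    return sorted(counts.items())
```

Windows with no detections are omitted in this sketch; a plotting step that needs a continuous axis would have to fill those gaps with zeros.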
Step 4b — Run the bundle helper (optional)¶
When you want the quicklook CSVs, parquet bundle, and DuckDB database in one pass, use:
badc report bundle \
--parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
--output-dir data/datalad/bogus/artifacts/aggregate \
--bucket-minutes 30
The command derives detections_quicklook/, detections_parquet_report/,
detections_duckdb_exports/, and detections.duckdb automatically. Toggle individual stages
with --no-quicklook / --no-parquet-report / --no-duckdb-report or override specific
paths (e.g., --duckdb-database) when needed. This is the fastest way to package Phase 2 review
artifacts for Erin after each inference run.
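The derived names listed above appear to follow the Parquet file's stem. The sketch below reproduces that pattern with pathlib so scripts can predict where bundle outputs will land. It is an inference from the names in this guide, not the badc source, and `derive_bundle_paths` plus its dictionary keys are hypothetical.

```python
from pathlib import Path


def derive_bundle_paths(parquet_path, output_dir):
    """Predict bundle output locations from the Parquet file stem (assumed rule)."""
    stem = Path(parquet_path).stem  # "detections" for detections.parquet
    out = Path(output_dir)
    return {
        "quicklook_dir": out / f"{stem}_quicklook",
        "parquet_report_dir": out / f"{stem}_parquet_report",
        "duckdb_exports_dir": out / f"{stem}_duckdb_exports",
        "duckdb_database": out / f"{stem}.duckdb",
    }
```

If an override such as --duckdb-database is passed, the corresponding entry would no longer match, so treat the result as the default layout only.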
Tip: when running badc infer orchestrate --apply you can pass --bundle (plus the optional
--bundle-* overrides) so these aggregation/report steps run automatically after every recording —
no need to invoke the commands manually unless you want custom tweaks.
Step 4c — Roll up a directory of detections¶
When the aggregate directory holds multiple per-recording Parquet files (e.g., after running
badc infer orchestrate --apply --bundle across a dataset), use the new helper to get a quick
cross-run summary:
badc report aggregate-dir data/datalad/bogus/artifacts/aggregate \
--limit 20 \
--export-dir data/datalad/bogus/artifacts/aggregate/summary_exports
The command scans for *_detections.parquet (falls back to *.parquet when bundle outputs use
plain run-prefix names), loads the matches via DuckDB, prints consolidated label/recording
leaderboards, and optionally writes label_summary.csv / recording_summary.csv under the
export directory. This is the fastest sanity check to confirm the refreshed bogus dataset (now five
GNWT recordings) still contains the expected vocalizations vs. background noise mix.
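The file discovery described above (prefer *_detections.parquet, otherwise fall back to *.parquet) can be sketched in a few lines of pathlib. Whether the real helper searches subdirectories is not stated here, so this version assumes a non-recursive scan, and `find_detection_parquets` is a hypothetical name.

```python
from pathlib import Path


def find_detection_parquets(aggregate_dir):
    """Return detection Parquet files, preferring the *_detections.parquet pattern."""
    root = Path(aggregate_dir)
    # Non-recursive scan: an assumption, the real helper may walk subfolders.
    matches = sorted(root.glob("*_detections.parquet"))
    if not matches:
        # Fall back to any Parquet file when run-prefix names are used.
        matches = sorted(root.glob("*.parquet"))
    return matches
```

Note that the fallback only triggers when no file matches the preferred pattern, so a directory mixing both naming styles would yield only the *_detections.parquet files in this sketch.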
Tip: pass --bundle-rollup to badc infer orchestrate (enabled automatically in
badc pipeline run) to run this helper as soon as the queue drains. By default the rollup exports
land in artifacts/aggregate/aggregate_summary/, so Erin always has a dataset-wide CSV ready
alongside the per-recording bundles.
Step 5 — Materialize a DuckDB database¶
Turn the Parquet export into a DuckDB database (views + CSV snapshots) for Erin:
badc report duckdb \
    --parquet data/datalad/bogus/artifacts/aggregate/detections.parquet \
    --database data/datalad/bogus/artifacts/aggregate/detections.duckdb \
    --bucket-minutes 30 \
    --export-dir data/datalad/bogus/artifacts/aggregate/duckdb_exports
This creates the detections table plus three convenience views (label_summary, recording_summary, timeline_summary), prints the same Rich tables shown in badc report parquet, and writes label_summary.csv / recording_summary.csv / timeline.csv when --export-dir is provided.
Open the database interactively:
duckdb data/datalad/bogus/artifacts/aggregate/detections.duckdb
-- Loading resources from /home/.../.duckdbrc
D SELECT * FROM label_summary LIMIT 5;
Or issue one-off queries:
duckdb -c "SELECT recording_id, SUM(detections) AS calls \
    FROM recording_summary ORDER BY calls DESC LIMIT 5" \
    data/datalad/bogus/artifacts/aggregate/detections.duckdb
The same database can be mounted in notebooks via duckdb.connect("…/detections.duckdb") for richer charts without re-importing the Parquet file. Prefer the helper badc.duckdb_helpers.load_duckdb_views() when you want ready-made pandas DataFrames for the label_summary / recording_summary / timeline_summary views:
from badc.duckdb_helpers import load_duckdb_views

views = load_duckdb_views(
    "data/datalad/bogus/artifacts/aggregate/detections.duckdb", limit_labels=10
)
views.label_summary.head()
Step 6 — Notebook/SQL exploration¶
Open the Parquet file with DuckDB for ad-hoc SQL:
duckdb -c "SELECT label, COUNT(*) FROM 'data/.../detections.parquet' GROUP BY 1"
Or load it from Python:
import duckdb

con = duckdb.connect()
con.execute(
    """
    SELECT recording_id, label, COUNT(*) AS detections
    FROM read_parquet('data/.../detections.parquet')
    GROUP BY 1, 2
    ORDER BY detections DESC
    """
).df()
Incorporate telemetry (badc infer monitor / badc telemetry) to correlate GPU usage with detection density; telemetry logs live alongside the aggregate files when the manifest resides within a DataLad dataset.
See also¶
Infer Commands for detailed aggregation options and telemetry monitoring.
Report Commands for additional reporting helpers that will grow alongside Phase 2 analytics.