Aggregate analysis walkthrough¶
Run badc infer aggregate followed by badc report parquet --output-dir artifacts/aggregate/<run>_parquet_report to capture canonical detections plus ready-to-plot CSV/JSON bundles. This notebook loads those exports directly so reviewers can inspect stats without re-running DuckDB queries.
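Before loading anything, a quick existence check on the report directory gives a clearer failure than a raw FileNotFoundError. This is a minimal sketch (the helper name and the example path are illustrative; the export file names match the loading cell below):

```python
from pathlib import Path

# Export names written by `badc report parquet`, per the loading cell below.
EXPECTED_EXPORTS = ["labels.csv", "recordings.csv", "timeline.csv", "summary.json"]


def missing_exports(report_dir: Path) -> list[str]:
    """Return the expected export files absent from a parquet report directory."""
    return [name for name in EXPECTED_EXPORTS if not (report_dir / name).exists()]


# Example path; point this at your own parquet report directory.
print(missing_exports(Path("artifacts/aggregate/example_parquet_report")))
```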
[ ]:
import json
from pathlib import Path
import pandas as pd
from badc.duckdb_helpers import load_duckdb_views
RUN_ID = "GNWT-114_20230509_094500" # update to any recording in artifacts/aggregate
DATASET_ROOT = Path("..") / "data" / "datalad" / "bogus"
AGGREGATE_DIR = DATASET_ROOT / "artifacts" / "aggregate"
PARQUET_REPORT_DIR = AGGREGATE_DIR / f"{RUN_ID}_parquet_report"
labels_df = pd.read_csv(PARQUET_REPORT_DIR / "labels.csv")
recordings_df = pd.read_csv(PARQUET_REPORT_DIR / "recordings.csv")
timeline_df = pd.read_csv(PARQUET_REPORT_DIR / "timeline.csv")
summary_metrics = json.loads((PARQUET_REPORT_DIR / "summary.json").read_text())
labels_df.head()
Parquet summary metrics¶
These values mirror summary.json from badc report parquet (total detections, unique labels/recordings, and the bucket duration used for timelines).
[ ]:
pd.DataFrame([summary_metrics])
Detections per label¶
Use the CLI-generated labels.csv (counts + optional average confidence) to review the busiest species.
[ ]:
labels_df.loc[:, ["label", "label_name", "detections", "avg_confidence"]].sort_values(
    "detections", ascending=False
).head(10)
Plot detections per label¶
The Parquet bundle already contains aggregated counts, so plotting does not require any additional DuckDB queries.
[ ]:
labels_df.sort_values("detections", ascending=False).plot(
    kind="bar", x="label", y="detections", legend=False, title="Detections per label"
)
Top recordings¶
recordings.csv highlights which files contributed the most detections; this is useful for QC before diving into the raw chunks.
[ ]:
recordings_df.loc[:, ["recording_id", "detections", "avg_confidence"]].sort_values(
    "detections", ascending=False
).head(10)
Timeline buckets¶
Timeline CSV rows correspond to the --bucket-minutes window from badc report parquet. Plotting them surfaces bursty activity over the recording.
[ ]:
timeline_df.sort_values("bucket_start_ms").assign(
    bucket_minutes=lambda df: df["bucket_start_ms"] / 60000
).plot(
    kind="line",
    x="bucket_minutes",
    y="detections",
    marker="o",
    title="Detections per bucket",
    xlabel="Bucket start (minutes)",
    ylabel="Detections",
)
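The bucket arithmetic above is worth spelling out: bucket_start_ms is milliseconds from the start of the recording, so dividing by 60,000 yields the bucket start in minutes. A toy frame (values invented for illustration) makes the conversion explicit:

```python
import pandas as pd

# Three 30-minute buckets; the detection counts are made up for illustration.
toy = pd.DataFrame(
    {"bucket_start_ms": [0, 1_800_000, 3_600_000], "detections": [4, 9, 2]}
)
toy["bucket_minutes"] = toy["bucket_start_ms"] / 60_000
print(toy[["bucket_minutes", "detections"]])
```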
Quicklook CSVs (optional)¶
badc report quicklook --output-dir artifacts/aggregate/<run>_quicklook writes lighter-weight CSVs (labels/recordings/chunks) plus ASCII sparklines. Load them here when you want to sanity-check the same tables the CLI printed without regenerating the Parquet bundle.
[ ]:
QUICKLOOK_DIR = AGGREGATE_DIR / f"{RUN_ID}_quicklook"
if QUICKLOOK_DIR.exists():
    quicklook_labels = pd.read_csv(QUICKLOOK_DIR / "labels.csv")
    quicklook_chunks = pd.read_csv(QUICKLOOK_DIR / "chunks.csv")
    display(quicklook_labels.head())
    quicklook_chunks.sort_values("chunk_start_ms").plot(
        kind="line",
        x="chunk_start_ms",
        y="detections",
        marker="o",
        title="Quicklook detections per chunk",
        xlabel="Chunk start (ms)",
        ylabel="Detections",
    )
else:
    print(
        f"Quicklook directory {QUICKLOOK_DIR} not found; "
        f"run `badc report quicklook --output-dir {QUICKLOOK_DIR}` first."
    )
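The ASCII sparklines mentioned above are easy to approximate in plain Python if you want one inline in the notebook. This sketch uses Unicode block characters and is not the exact renderer badc uses:

```python
def sparkline(values):
    """Render a sequence of counts as a one-line Unicode sparkline.

    A rough stand-in for the CLI's ASCII sparklines; badc's exact glyphs
    and scaling may differ.
    """
    blocks = "▁▂▃▄▅▆▇█"
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return "".join(
        blocks[int((v - lo) / span * (len(blocks) - 1))] for v in values
    )


print(sparkline([1, 3, 8, 4, 2]))
```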
Runtime vs confidence join (placeholder)¶
Once the telemetry schema is finalized, we will join the per-chunk runtime data with detection confidence to spot underperforming GPUs.
[ ]:
telemetry_path = DATASET_ROOT / "data" / "telemetry" / "infer" / "log.jsonl"
print(f"Telemetry log present: {telemetry_path.exists()}")
print("Add join logic here once telemetry schema is finalized.")
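Until the schema lands, here is a hedged sketch of what the join could look like. The telemetry field names (chunk_index, runtime_ms) and all of the toy data are assumptions, not the finalized schema:

```python
import json

import pandas as pd

# Hypothetical JSONL telemetry: one object per chunk. Field names are guesses
# and will need updating once the real schema is finalized.
telemetry_records = [
    json.loads(line)
    for line in [
        '{"chunk_index": 0, "runtime_ms": 812}',
        '{"chunk_index": 1, "runtime_ms": 955}',
    ]
]
telemetry_df = pd.DataFrame(telemetry_records)

# Toy per-chunk confidence frame standing in for the real detections export.
detections_df = pd.DataFrame({"chunk_index": [0, 1], "avg_confidence": [0.91, 0.42]})

# Left join keeps every detection chunk even when telemetry is missing for it.
joined = detections_df.merge(telemetry_df, on="chunk_index", how="left")
print(joined)
```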
Load bundle summaries via the helper¶
Run badc report bundle (or badc report duckdb) to materialize artifacts/aggregate/<RUN_ID>.duckdb. Use badc.duckdb_helpers.load_duckdb_views to load the per-run label_summary, recording_summary, and timeline_summary views directly into pandas DataFrames for plotting.
[ ]:
duckdb_views = load_duckdb_views(AGGREGATE_DIR / f"{RUN_ID}.duckdb", limit_labels=10)
duck_label_df = duckdb_views.label_summary
duck_recording_df = duckdb_views.recording_summary
duck_label_df.head()
[ ]:
import matplotlib.pyplot as plt
duck_label_df.plot(kind="bar", x="label", y="detections", legend=False, figsize=(8, 4))
plt.ylabel("Detections")
plt.title("Top DuckDB labels (detections)")
plt.tight_layout()
Timeline buckets from DuckDB¶
Use the timeline_summary view produced by badc report bundle to plot detections per 30-minute bucket directly from the .duckdb file. This mirrors the CLI timeline table but keeps everything inside the notebook.
[ ]:
duck_timeline = duckdb_views.timeline_summary
if duck_timeline.empty:
    print("No timeline rows available in the DuckDB view; rerun badc report bundle if needed.")
else:
    duck_timeline.assign(bucket_min=duck_timeline["bucket_start_ms"] / 60000).plot(
        kind="line",
        x="bucket_min",
        y="detections",
        marker="o",
        ylabel="Detections",
        xlabel="Bucket start (min)",
        title="DuckDB detections per 30-minute bucket",
        figsize=(8, 3),
    )
    plt.tight_layout()
This quick bar chart uses the DuckDB label_summary view loaded above. Replace duck_label_df with the other views (e.g., recording_summary, timeline_summary) to build thesis-ready figures without re-ingesting the Parquet file.
Dataset-wide rollup¶
If you ran badc infer orchestrate --bundle --bundle-rollup (or the pipeline wrapper which enables it by default), the aggregate directory now contains aggregate_summary/label_summary.csv and aggregate_summary/recording_summary.csv. Load them to verify the refreshed bogus dataset still mixes bird vocalizations and background noise.
[ ]:
summary_dir = AGGREGATE_DIR / "aggregate_summary"
if summary_dir.exists():
    dataset_label_summary = pd.read_csv(summary_dir / "label_summary.csv")
    dataset_recording_summary = pd.read_csv(summary_dir / "recording_summary.csv")
    display(dataset_label_summary.head())
    display(dataset_recording_summary.head())
else:
    print(f"No aggregate summary found at {summary_dir}")