Aggregate analysis walkthrough

Run badc infer aggregate followed by badc report parquet --output-dir artifacts/aggregate/<run>_parquet_report to capture canonical detections plus ready-to-plot CSV/JSON bundles. This notebook loads those exports directly so reviewers can inspect stats without re-running DuckDB queries.

[ ]:
import json
from pathlib import Path

import pandas as pd

from badc.duckdb_helpers import load_duckdb_views

RUN_ID = "GNWT-114_20230509_094500"  # update to any recording in artifacts/aggregate
DATASET_ROOT = Path("..") / "data" / "datalad" / "bogus"
AGGREGATE_DIR = DATASET_ROOT / "artifacts" / "aggregate"
PARQUET_REPORT_DIR = AGGREGATE_DIR / f"{RUN_ID}_parquet_report"

labels_df = pd.read_csv(PARQUET_REPORT_DIR / "labels.csv")
recordings_df = pd.read_csv(PARQUET_REPORT_DIR / "recordings.csv")
timeline_df = pd.read_csv(PARQUET_REPORT_DIR / "timeline.csv")
summary_metrics = json.loads((PARQUET_REPORT_DIR / "summary.json").read_text())

labels_df.head()

Parquet summary metrics

These values mirror summary.json from badc report parquet (total detections, unique labels/recordings, and the bucket duration used for timelines).

[ ]:
pd.DataFrame([summary_metrics])

Detections per label

Use the CLI-generated labels.csv (counts + optional average confidence) to review the busiest species.

[ ]:
labels_df.loc[:, ["label", "label_name", "detections", "avg_confidence"]].sort_values(
    "detections", ascending=False
).head(10)

Plot detections per label

The Parquet bundle already contains aggregated counts, so plotting does not require any additional DuckDB queries.

[ ]:
labels_df.sort_values("detections", ascending=False).plot(
    kind="bar", x="label", y="detections", legend=False, title="Detections per label"
)

Top recordings

recordings.csv highlights which files contributed the most detections; this is useful for QC before diving into the raw chunks.

[ ]:
recordings_df.loc[:, ["recording_id", "detections", "avg_confidence"]].sort_values(
    "detections", ascending=False
).head(10)

Timeline buckets

Timeline CSV rows correspond to the --bucket-minutes window from badc report parquet. Plotting them surfaces bursty activity over the recording.

[ ]:
timeline_df.sort_values("bucket_start_ms").assign(
    bucket_minutes=lambda df: df["bucket_start_ms"] / 60000
).plot(
    kind="line",
    x="bucket_minutes",
    y="detections",
    marker="o",
    title="Detections per bucket",
    xlabel="Bucket start (minutes)",
    ylabel="Detections",
)

Quicklook CSVs (optional)

badc report quicklook --output-dir artifacts/aggregate/<run>_quicklook writes lighter-weight CSVs (labels/recordings/chunks) plus ASCII sparklines. Load them here when you want to sanity-check the same tables the CLI printed without regenerating the Parquet bundle.

[ ]:
QUICKLOOK_DIR = AGGREGATE_DIR / f"{RUN_ID}_quicklook"
if QUICKLOOK_DIR.exists():
    quicklook_labels = pd.read_csv(QUICKLOOK_DIR / "labels.csv")
    quicklook_chunks = pd.read_csv(QUICKLOOK_DIR / "chunks.csv")
    display(quicklook_labels.head())
    quicklook_chunks.sort_values("chunk_start_ms").plot(
        kind="line",
        x="chunk_start_ms",
        y="detections",
        marker="o",
        title="Quicklook detections per chunk",
        xlabel="Chunk start (ms)",
        ylabel="Detections",
    )
else:
    print(
        f"Quicklook directory {QUICKLOOK_DIR} not found; run `badc report quicklook --output-dir {QUICKLOOK_DIR}` first."
    )

Runtime vs confidence join (placeholder)

Once the telemetry schema is finalized, we will join the per-chunk runtime data with detection confidence to spot underperforming GPUs.

[ ]:
# Expected telemetry location; the per-chunk record schema is still being finalized.
telemetry_path = DATASET_ROOT / "data" / "telemetry" / "infer" / "log.jsonl"
print("Add join logic here once telemetry schema is finalized.")
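Until the schema lands, the intended join can be sketched with synthetic frames. Note this is a placeholder, not the final implementation: the telemetry field names (chunk_start_ms, runtime_ms, gpu) and the per-chunk detection columns are assumptions, not the finalized schema.

```python
import pandas as pd

# Hypothetical telemetry rows; field names are placeholders for the
# not-yet-final schema that will land in log.jsonl.
telemetry = pd.DataFrame(
    {
        "chunk_start_ms": [0, 30000, 60000],
        "runtime_ms": [420, 510, 395],
        "gpu": ["gpu0", "gpu0", "gpu1"],
    }
)

# Stand-in for per-chunk detection stats (e.g. from chunks.csv).
detections = pd.DataFrame(
    {
        "chunk_start_ms": [0, 30000, 60000],
        "detections": [12, 3, 7],
        "avg_confidence": [0.81, 0.64, 0.77],
    }
)

# Join runtime and confidence on the shared chunk offset.
joined = detections.merge(telemetry, on="chunk_start_ms", how="inner")

# Flag slow chunks with low confidence: candidate underperforming GPUs.
suspect = joined[(joined["runtime_ms"] > 500) & (joined["avg_confidence"] < 0.7)]
suspect[["chunk_start_ms", "gpu", "runtime_ms", "avg_confidence"]]
```

Once real telemetry exists, the first frame would instead come from `pd.read_json(telemetry_path, lines=True)` and the thresholds would need tuning against observed runtimes.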

Load bundle summaries via the helper

Run badc report bundle (or badc report duckdb) to materialize artifacts/aggregate/<RUN_ID>.duckdb. Use badc.duckdb_helpers.load_duckdb_views to load the per-run label_summary, recording_summary, and timeline_summary views directly into pandas DataFrames for plotting.

[ ]:
duckdb_views = load_duckdb_views(AGGREGATE_DIR / f"{RUN_ID}.duckdb", limit_labels=10)
duck_label_df = duckdb_views.label_summary
duck_recording_df = duckdb_views.recording_summary
duck_label_df.head()

[ ]:
import matplotlib.pyplot as plt

duck_label_df.plot(kind="bar", x="label", y="detections", legend=False, figsize=(8, 4))
plt.ylabel("Detections")
plt.title("Top DuckDB labels (detections)")
plt.tight_layout()

Timeline buckets from DuckDB

Use the timeline_summary view produced by badc report bundle to plot detections per 30-minute bucket directly from the .duckdb file. This mirrors the CLI timeline table but keeps everything inside the notebook.

[ ]:
duck_timeline = duckdb_views.timeline_summary
if duck_timeline.empty:
    print("No timeline rows available in the DuckDB view; rerun badc report bundle if needed.")
else:
    (
        duck_timeline.assign(bucket_min=duck_timeline["bucket_start_ms"] / 60000).plot(
            kind="line",
            x="bucket_min",
            y="detections",
            marker="o",
            ylabel="Detections",
            xlabel="Bucket start (min)",
            title="DuckDB detections per 30-minute bucket",
            figsize=(8, 3),
        )
    )
    plt.tight_layout()

This quick bar chart uses the DuckDB label_summary view loaded above. Replace duck_label_df with the other views (e.g., recording_summary, timeline_summary) to build thesis-ready figures without re-ingesting the Parquet bundle.
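As a sketch of that swap, the same sort-and-plot pattern applies to the recording view. The synthetic frame below stands in for duckdb_views.recording_summary; the column names (recording_id, detections, avg_confidence) are assumed to match the recordings.csv export shown earlier.

```python
import pandas as pd

# Synthetic stand-in for duckdb_views.recording_summary so the pattern
# can be tried without a .duckdb file on hand.
duck_recording_df = pd.DataFrame(
    {
        "recording_id": ["rec_a", "rec_b", "rec_c"],
        "detections": [40, 12, 25],
        "avg_confidence": [0.72, 0.55, 0.80],
    }
)

# Same shape as the label chart: sort descending, then plot.
top_recordings = duck_recording_df.sort_values("detections", ascending=False)
top_recordings.head(10)
```

Pass top_recordings to .plot(kind="bar", x="recording_id", y="detections", legend=False) to mirror the label chart above.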

Dataset-wide rollup

If you ran badc infer orchestrate --bundle --bundle-rollup (or the pipeline wrapper, which enables it by default), the aggregate directory now contains aggregate_summary/label_summary.csv and aggregate_summary/recording_summary.csv. Load them to verify the refreshed bogus dataset still mixes bird vocalizations and background noise.

[ ]:
summary_dir = AGGREGATE_DIR / "aggregate_summary"
if summary_dir.exists():
    dataset_label_summary = pd.read_csv(summary_dir / "label_summary.csv")
    dataset_recording_summary = pd.read_csv(summary_dir / "recording_summary.csv")
    display(dataset_label_summary.head())
    display(dataset_recording_summary.head())
else:
    print(f"No aggregate summary found at {summary_dir}")