{ "cells": [ { "cell_type": "markdown", "id": "99863d82", "metadata": {}, "source": [ "# Aggregate analysis walkthrough\n", "\n", "Run `badc infer aggregate` followed by `badc report parquet --output-dir artifacts/aggregate/<run_id>_parquet_report` to capture canonical detections plus ready-to-plot CSV/JSON bundles. This notebook loads those exports directly so reviewers can inspect stats without re-running DuckDB queries.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "19930793", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "\n", "from badc.duckdb_helpers import load_duckdb_views\n", "\n", "RUN_ID = \"GNWT-114_20230509_094500\" # update to any recording in artifacts/aggregate\n", "DATASET_ROOT = Path(\"..\") / \"data\" / \"datalad\" / \"bogus\"\n", "AGGREGATE_DIR = DATASET_ROOT / \"artifacts\" / \"aggregate\"\n", "PARQUET_REPORT_DIR = AGGREGATE_DIR / f\"{RUN_ID}_parquet_report\"\n", "\n", "labels_df = pd.read_csv(PARQUET_REPORT_DIR / \"labels.csv\")\n", "recordings_df = pd.read_csv(PARQUET_REPORT_DIR / \"recordings.csv\")\n", "timeline_df = pd.read_csv(PARQUET_REPORT_DIR / \"timeline.csv\")\n", "summary_metrics = json.loads((PARQUET_REPORT_DIR / \"summary.json\").read_text())\n", "\n", "labels_df.head()" ] }, { "cell_type": "markdown", "id": "d9a0a987", "metadata": {}, "source": [ "## Parquet summary metrics\n", "\n", "These values mirror `summary.json` from `badc report parquet` (total detections, unique labels/recordings, and the bucket duration used for timelines).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "326284de", "metadata": {}, "outputs": [], "source": [ "pd.DataFrame([summary_metrics])" ] }, { "cell_type": "markdown", "id": "7fb27b941602401d91542211134fc71a", "metadata": {}, "source": [ "## Detections per label\n", "\n", "Use the CLI-generated `labels.csv` (counts + optional average confidence) to review the busiest species.\n" ] }, { "cell_type": 
"code", "execution_count": null, "id": "acae54e37e7d407bbb7b55eff062a284", "metadata": {}, "outputs": [], "source": [ "labels_df.loc[:, [\"label\", \"label_name\", \"detections\", \"avg_confidence\"]].sort_values(\n", " \"detections\", ascending=False\n", ").head(10)" ] }, { "cell_type": "markdown", "id": "9a63283cbaf04dbcab1f6479b197f3a8", "metadata": {}, "source": [ "### Plot detections per label\n", "\n", "The Parquet bundle already contains aggregated counts, so plotting does not require any additional DuckDB queries.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8dd0d8092fe74a7c96281538738b07e2", "metadata": {}, "outputs": [], "source": [ "labels_df.sort_values(\"detections\", ascending=False).plot(\n", " kind=\"bar\", x=\"label\", y=\"detections\", legend=False, title=\"Detections per label\"\n", ")" ] }, { "cell_type": "markdown", "id": "72eea5119410473aa328ad9291626812", "metadata": {}, "source": [ "### Top recordings\n", "\n", "`recordings.csv` highlights which files contributed the most detections; this is useful for QC before diving into the raw chunks.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8edb47106e1a46a883d545849b8ab81b", "metadata": {}, "outputs": [], "source": [ "recordings_df.loc[:, [\"recording_id\", \"detections\", \"avg_confidence\"]].sort_values(\n", " \"detections\", ascending=False\n", ").head(10)" ] }, { "cell_type": "markdown", "id": "06745590", "metadata": {}, "source": [ "### Timeline buckets\n", "\n", "Timeline CSV rows correspond to the `--bucket-minutes` window from `badc report parquet`. 
Plotting them surfaces bursty activity over the recording.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "af2f7495", "metadata": {}, "outputs": [], "source": [ "timeline_df.sort_values(\"bucket_start_ms\").assign(\n", " bucket_minutes=lambda df: df[\"bucket_start_ms\"] / 60000\n", ").plot(\n", " kind=\"line\",\n", " x=\"bucket_minutes\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " title=\"Detections per bucket\",\n", " xlabel=\"Bucket start (minutes)\",\n", " ylabel=\"Detections\",\n", ")" ] }, { "cell_type": "markdown", "id": "quicklook-md", "metadata": {}, "source": [ "## Quicklook CSVs (optional)\n", "\n", "`badc report quicklook --output-dir artifacts/aggregate/<run_id>_quicklook` writes lighter-weight CSVs (labels/recordings/chunks) plus ASCII sparklines. Load them here when you want to sanity-check the same tables the CLI printed without regenerating the parquet bundle.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "quicklook-code", "metadata": {}, "outputs": [], "source": [ "QUICKLOOK_DIR = AGGREGATE_DIR / f\"{RUN_ID}_quicklook\"\n", "if QUICKLOOK_DIR.exists():\n", " quicklook_labels = pd.read_csv(QUICKLOOK_DIR / \"labels.csv\")\n", " quicklook_chunks = pd.read_csv(QUICKLOOK_DIR / \"chunks.csv\")\n", " display(quicklook_labels.head())\n", " quicklook_chunks.sort_values(\"chunk_start_ms\").plot(\n", " kind=\"line\",\n", " x=\"chunk_start_ms\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " title=\"Quicklook detections per chunk\",\n", " xlabel=\"Chunk start (ms)\",\n", " ylabel=\"Detections\",\n", " )\n", "else:\n", " print(\n", " f\"Quicklook directory {QUICKLOOK_DIR} not found; run `badc report quicklook --output-dir {QUICKLOOK_DIR}` first.\"\n", " )" ] }, { "cell_type": "markdown", "id": "746994e940bc4bdca10c85c435d697d2", "metadata": {}, "source": [ "## Runtime vs confidence join (placeholder)\n", "\n", "Once telemetry schemas finalize we will join the per-chunk runtime data with detection confidence to spot 
underperforming GPUs. Until then, this cell simply peeks at the raw telemetry JSONL (if present) so reviewers can see which fields are logged today.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6babec56729f492ba2c122ac0bb5d104", "metadata": {}, "outputs": [], "source": [ "telemetry_path = DATASET_ROOT / \"data\" / \"telemetry\" / \"infer\" / \"log.jsonl\"\n", "if telemetry_path.exists():\n", "    # Telemetry schema is not finalized; for now just inspect the raw JSONL records.\n", "    telemetry_df = pd.read_json(telemetry_path, lines=True)\n", "    display(telemetry_df.head())\n", "else:\n", "    print(f\"No telemetry log at {telemetry_path}; add the runtime/confidence join here once the schema is finalized.\")" ] }, { "cell_type": "markdown", "id": "10185d26023b46108eb7d9f57d49d2b3", "metadata": {}, "source": [ "### Load bundle summaries via the helper\n", "Run `badc report bundle` (or `badc report duckdb`) to materialize `artifacts/aggregate/<run_id>.duckdb`.\n", "Use `badc.duckdb_helpers.load_duckdb_views` to load the per-run `label_summary`, `recording_summary`, and `timeline_summary` views directly into pandas DataFrames for plotting.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8763a12b2bbd4a93a75aff182afb95dc", "metadata": {}, "outputs": [], "source": [ "duckdb_views = load_duckdb_views(AGGREGATE_DIR / f\"{RUN_ID}.duckdb\", limit_labels=10)\n", "duck_label_df = duckdb_views.label_summary\n", "duck_recording_df = duckdb_views.recording_summary\n", "duck_label_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "7623eae2785240b9bd12b16a66d81610", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "duck_label_df.plot(kind=\"bar\", x=\"label\", y=\"detections\", legend=False, figsize=(8, 4))\n", "plt.ylabel(\"Detections\")\n", "plt.title(\"Top DuckDB labels (detections)\")\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "c0d6cbaa", "metadata": {}, "source": [ "### Timeline buckets from DuckDB\n", "Use the `timeline_summary` view produced by `badc report bundle` to plot detections per 30-minute bucket directly from the `.duckdb` file. 
This mirrors the CLI timeline table but keeps everything inside the notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "13b9363e", "metadata": {}, "outputs": [], "source": [ "duck_timeline = duckdb_views.timeline_summary\n", "if duck_timeline.empty:\n", " print(\"No timeline rows available in the DuckDB view; rerun `badc report bundle` if needed.\")\n", "else:\n", " (\n", " duck_timeline.assign(bucket_min=duck_timeline[\"bucket_start_ms\"] / 60000).plot(\n", " kind=\"line\",\n", " x=\"bucket_min\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " ylabel=\"Detections\",\n", " xlabel=\"Bucket start (min)\",\n", " title=\"DuckDB detections per 30-minute bucket\",\n", " figsize=(8, 3),\n", " )\n", " )\n", " plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "7cdc8c89c7104fffa095e18ddfef8986", "metadata": {}, "source": [ "The bar chart above uses the DuckDB `label_summary` view loaded into `duck_label_df`. Replace `duck_label_df` with the other views (e.g., `recording_summary`, `timeline_summary`) to build thesis-ready figures without re-ingesting the Parquet file." ] }, { "cell_type": "markdown", "id": "b118ea5561624da68c537baed56e602f", "metadata": {}, "source": [ "## Dataset-wide rollup\n", "If you ran `badc infer orchestrate --bundle --bundle-rollup` (or the pipeline wrapper, which enables it by default), the aggregate directory now contains `aggregate_summary/label_summary.csv` and `aggregate_summary/recording_summary.csv`. 
Load them to verify the refreshed bogus dataset still mixes bird vocalizations and background noise.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "938c804e27f84196a10c8828c723f798", "metadata": {}, "outputs": [], "source": [ "summary_dir = AGGREGATE_DIR / \"aggregate_summary\"\n", "if summary_dir.exists():\n", " dataset_label_summary = pd.read_csv(summary_dir / \"label_summary.csv\")\n", " dataset_recording_summary = pd.read_csv(summary_dir / \"recording_summary.csv\")\n", " display(dataset_label_summary.head())\n", " display(dataset_recording_summary.head())\n", "else:\n", " print(f\"No aggregate summary found at {summary_dir}\")" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }