{ "cells": [ { "cell_type": "markdown", "id": "99863d82", "metadata": {}, "source": [ "# Aggregate analysis walkthrough\n", "\n", "Run `badc infer aggregate` followed by `badc report parquet --output-dir artifacts/aggregate/<run_id>_parquet_report` to capture canonical detections plus ready-to-plot CSV/JSON bundles. This notebook loads those exports directly so reviewers can inspect stats without re-running DuckDB queries.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "19930793", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "\n", "from badc.duckdb_helpers import load_duckdb_views\n", "\n", "RUN_ID = \"GNWT-114_20230509_094500\" # update to any recording in artifacts/aggregate\n", "DATASET_ROOT = Path(\"..\") / \"data\" / \"datalad\" / \"bogus\"\n", "AGGREGATE_DIR = DATASET_ROOT / \"artifacts\" / \"aggregate\"\n", "PARQUET_REPORT_DIR = AGGREGATE_DIR / f\"{RUN_ID}_parquet_report\"\n", "\n", "labels_df = pd.read_csv(PARQUET_REPORT_DIR / \"labels.csv\")\n", "recordings_df = pd.read_csv(PARQUET_REPORT_DIR / \"recordings.csv\")\n", "timeline_df = pd.read_csv(PARQUET_REPORT_DIR / \"timeline.csv\")\n", "summary_metrics = json.loads((PARQUET_REPORT_DIR / \"summary.json\").read_text())\n", "\n", "labels_df.head()" ] }, { "cell_type": "markdown", "id": "d9a0a987", "metadata": {}, "source": [ "## Parquet summary metrics\n", "\n", "These values mirror `summary.json` from `badc report parquet` (total detections, unique labels/recordings, and the bucket duration used for timelines).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "326284de", "metadata": {}, "outputs": [], "source": [ "pd.DataFrame([summary_metrics])" ] }, { "cell_type": "markdown", "id": "7fb27b941602401d91542211134fc71a", "metadata": {}, "source": [ "## Detections per label\n", "\n", "Use the CLI-generated `labels.csv` (counts + optional average confidence) to review the busiest species.\n" ] }, { "cell_type": 
"code", "execution_count": null, "id": "acae54e37e7d407bbb7b55eff062a284", "metadata": {}, "outputs": [], "source": [ "labels_df.loc[:, [\"label\", \"label_name\", \"detections\", \"avg_confidence\"]].sort_values(\n", " \"detections\", ascending=False\n", ").head(10)" ] }, { "cell_type": "markdown", "id": "9a63283cbaf04dbcab1f6479b197f3a8", "metadata": {}, "source": [ "### Plot detections per label\n", "\n", "The Parquet bundle already contains aggregated counts, so plotting does not require any additional DuckDB queries.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8dd0d8092fe74a7c96281538738b07e2", "metadata": {}, "outputs": [], "source": [ "labels_df.sort_values(\"detections\", ascending=False).plot(\n", " kind=\"bar\", x=\"label\", y=\"detections\", legend=False, title=\"Detections per label\"\n", ")" ] }, { "cell_type": "markdown", "id": "72eea5119410473aa328ad9291626812", "metadata": {}, "source": [ "### Top recordings\n", "\n", "`recordings.csv` highlights which files contributed the most detections; this is useful for QC before diving into the raw chunks.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8edb47106e1a46a883d545849b8ab81b", "metadata": {}, "outputs": [], "source": [ "recordings_df.loc[:, [\"recording_id\", \"detections\", \"avg_confidence\"]].sort_values(\n", " \"detections\", ascending=False\n", ").head(10)" ] }, { "cell_type": "markdown", "id": "06745590", "metadata": {}, "source": [ "### Timeline buckets\n", "\n", "Timeline CSV rows correspond to the `--bucket-minutes` window from `badc report parquet`. 
Plotting them surfaces bursty activity over the recording.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "af2f7495", "metadata": {}, "outputs": [], "source": [ "timeline_df.sort_values(\"bucket_start_ms\").assign(\n", " bucket_minutes=lambda df: df[\"bucket_start_ms\"] / 60000\n", ").plot(\n", " kind=\"line\",\n", " x=\"bucket_minutes\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " title=\"Detections per bucket\",\n", " xlabel=\"Bucket start (minutes)\",\n", " ylabel=\"Detections\",\n", ")" ] }, { "cell_type": "markdown", "id": "quicklook-md", "metadata": {}, "source": [ "## Quicklook CSVs (optional)\n", "\n", "`badc report quicklook --output-dir artifacts/aggregate/<run_id>_quicklook` writes lighter-weight CSVs (labels/recordings/chunks) plus ASCII sparklines. Load them here when you want to sanity-check the same tables the CLI printed without regenerating the parquet bundle.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "quicklook-code", "metadata": {}, "outputs": [], "source": [ "QUICKLOOK_DIR = AGGREGATE_DIR / f\"{RUN_ID}_quicklook\"\n", "if QUICKLOOK_DIR.exists():\n", " quicklook_labels = pd.read_csv(QUICKLOOK_DIR / \"labels.csv\")\n", " quicklook_chunks = pd.read_csv(QUICKLOOK_DIR / \"chunks.csv\")\n", " display(quicklook_labels.head())\n", " quicklook_chunks.sort_values(\"chunk_start_ms\").plot(\n", " kind=\"line\",\n", " x=\"chunk_start_ms\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " title=\"Quicklook detections per chunk\",\n", " xlabel=\"Chunk start (ms)\",\n", " ylabel=\"Detections\",\n", " )\n", "else:\n", " print(\n", " f\"Quicklook directory {QUICKLOOK_DIR} not found; run `badc report quicklook --output-dir {QUICKLOOK_DIR}` first.\"\n", " )" ] }, { "cell_type": "markdown", "id": "746994e940bc4bdca10c85c435d697d2", "metadata": {}, "source": [ "## Runtime vs confidence join (placeholder)\n", "\n", "Once telemetry schemas finalize we will join the per-chunk runtime data with detection confidence to spot 
underperforming GPUs. Until then, this cell simply peeks at the raw telemetry JSONL (if present) so reviewers can see which fields are logged today.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6babec56729f492ba2c122ac0bb5d104", "metadata": {}, "outputs": [], "source": [ "telemetry_path = DATASET_ROOT / \"data\" / \"telemetry\" / \"infer\" / \"log.jsonl\"\n", "if telemetry_path.exists():\n", "    # Telemetry schema is not finalized; for now just inspect the raw JSONL records.\n", "    telemetry_df = pd.read_json(telemetry_path, lines=True)\n", "    display(telemetry_df.head())\n", "else:\n", "    print(f\"No telemetry log at {telemetry_path}; add the runtime/confidence join here once the schema is finalized.\")" ] }, { "cell_type": "markdown", "id": "10185d26023b46108eb7d9f57d49d2b3", "metadata": {}, "source": [ "### Load bundle summaries via the helper\n", "Run `badc report bundle` (or `badc report duckdb`) to materialize `artifacts/aggregate/<run_id>.duckdb`.\n", "Use `badc.duckdb_helpers.load_duckdb_views` to load the per-run `label_summary`, `recording_summary`, and `timeline_summary` views directly into pandas DataFrames for plotting.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8763a12b2bbd4a93a75aff182afb95dc", "metadata": {}, "outputs": [], "source": [ "duckdb_views = load_duckdb_views(AGGREGATE_DIR / f\"{RUN_ID}.duckdb\", limit_labels=10)\n", "duck_label_df = duckdb_views.label_summary\n", "duck_recording_df = duckdb_views.recording_summary\n", "duck_label_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "7623eae2785240b9bd12b16a66d81610", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "duck_label_df.plot(kind=\"bar\", x=\"label\", y=\"detections\", legend=False, figsize=(8, 4))\n", "plt.ylabel(\"Detections\")\n", "plt.title(\"Top DuckDB labels (detections)\")\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "c0d6cbaa", "metadata": {}, "source": [ "### Timeline buckets from DuckDB\n", "Use the `timeline_summary` view produced by `badc report bundle` to plot detections per 30-minute bucket directly from the `.duckdb` file. 
This mirrors the CLI timeline table but keeps everything inside the notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "13b9363e", "metadata": {}, "outputs": [], "source": [ "duck_timeline = duckdb_views.timeline_summary\n", "if duck_timeline.empty:\n", " print(\"No timeline rows available in the DuckDB view; rerun `badc report bundle` if needed.\")\n", "else:\n", " (\n", " duck_timeline.assign(bucket_min=duck_timeline[\"bucket_start_ms\"] / 60000).plot(\n", " kind=\"line\",\n", " x=\"bucket_min\",\n", " y=\"detections\",\n", " marker=\"o\",\n", " ylabel=\"Detections\",\n", " xlabel=\"Bucket start (min)\",\n", " title=\"DuckDB detections per 30-minute bucket\",\n", " figsize=(8, 3),\n", " )\n", " )\n", " plt.tight_layout()" ] }, { "cell_type": "markdown", "id": "7cdc8c89c7104fffa095e18ddfef8986", "metadata": {}, "source": [ "The bar chart above uses the DuckDB `label_summary` view loaded into `duck_label_df`. Replace `duck_label_df` with the other views (e.g., `recording_summary`, `timeline_summary`) to build thesis-ready figures without re-ingesting the Parquet file." ] }, { "cell_type": "markdown", "id": "b118ea5561624da68c537baed56e602f", "metadata": {}, "source": [ "## Dataset-wide rollup\n", "If you ran `badc infer orchestrate --bundle --bundle-rollup` (or the pipeline wrapper, which enables it by default), the aggregate directory now contains `aggregate_summary/label_summary.csv` and `aggregate_summary/recording_summary.csv`. 
Load them to verify the refreshed bogus dataset still mixes bird vocalizations and background noise.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "938c804e27f84196a10c8828c723f798", "metadata": {}, "outputs": [], "source": [ "summary_dir = AGGREGATE_DIR / \"aggregate_summary\"\n", "if summary_dir.exists():\n", " dataset_label_summary = pd.read_csv(summary_dir / \"label_summary.csv\")\n", " dataset_recording_summary = pd.read_csv(summary_dir / \"recording_summary.csv\")\n", " display(dataset_label_summary.head())\n", " display(dataset_recording_summary.head())\n", "else:\n", " print(f\"No aggregate summary found at {summary_dir}\")" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }