End-to-End CLI Pipeline

Quick start

Run the entire chunk → infer → aggregate/report loop in one shot with the new wrapper:

$ badc pipeline run data/datalad/bogus \
    --chunk-plan plans/pipeline.json \
    --chunk-duration 60 \
    --bundle \
    --bundle-aggregate-dir artifacts/aggregate

The command saves the chunk plan JSON (for reruns/HPC scripts), enforces chunk-status completion before inference, and optionally runs badc infer aggregate + badc report bundle so every recording leaves behind quicklook CSVs, Parquet exports, and DuckDB bundles. The sections below break down the same workflow when you want to call each stage manually.

This guide stitches the chunking, inference, aggregation, and reporting CLIs into a single, reproducible workflow that produces DataLad-tracked artifacts ready for Erin’s Phase 2 analytics review.

Prerequisites

  • badc installed in an activated virtual environment.

  • Target DataLad dataset cloned and connected via badc data connect (e.g., data/datalad/bogus).

  • HawkEars vendor repo fetched via git submodule update --init --recursive (only needed when running with --use-hawkears).

  • GPUs visible via badc gpus (or provide --cpu-workers for stub runs).

Pipeline map

badc pipeline run simply strings together the commands below. When you need to reason about an interrupted run (or explain the workflow to HPC operators), keep this flow handy:

data/datalad/<dataset> (git/datalad clone)
                 |
                 |  badc chunk orchestrate --plan-json plans/chunks.json --apply
                 v
artifacts/chunks/<recording>/.chunk_status.json + manifests/*.csv
                 |
                 |  badc infer orchestrate --chunk-plan plans/chunks.json --apply --bundle
                 v
artifacts/infer/<recording>/*.{json,csv} + telemetry/*.jsonl + *.summary.json
                 |
                 |  badc infer aggregate + badc report bundle/aggregate-dir
                 v
artifacts/aggregate/<RUN_ID>_{summary,parquet,duckdb,*quicklook}/
                 |
                 |  notebooks/docs/notebooks/aggregate_analysis.ipynb
                 v
figures + tables for Erin (label_summary, recording_summary, timeline_summary)

Each arrow enforces a guardrail:

  • Chunk orchestrator writes .chunk_status.json per recording; inference refuses to start when the status is missing or not completed.

  • Inference orchestrator creates telemetry JSONL logs and resumable *.summary.json files so --resume-summary / --resume-completed can skip finished chunks.

  • --bundle consolidates aggregation/report helpers so every recording leaves behind Parquet, quicklook CSVs, DuckDB bundles, and rollups under artifacts/aggregate. --bundle-rollup triggers badc report aggregate-dir for dataset-wide leaderboards automatically.
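The telemetry JSONL logs mentioned above are plain line-delimited JSON, so the most recent event for a recording can be pulled out with the standard library. A minimal sketch (one JSON object per line is the only assumption here; the event fields are whatever badc emits and are not assumed):

```python
import json
from pathlib import Path

def last_event(telemetry_path: Path):
    """Return the final JSON record from a telemetry JSONL log, or None if empty.

    Assumes one JSON object per line, the usual JSONL convention; no particular
    event schema is assumed.
    """
    last = None
    with telemetry_path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last
```

Handy when deciding whether an interrupted run finished its last chunk before reaching for --resume-summary.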

Step 1 — Chunk the dataset

Generate manifests and chunk WAVs for every recording, capturing the plan for downstream commands and ensuring each chunk directory records .chunk_status.json:

$ badc chunk orchestrate data/datalad/bogus \
    --pattern "*.wav" \
    --chunk-duration 60 \
    --overlap 0 \
    --plan-json plans/chunks.json \
    --apply \
    --include-existing \
    --workers 4

Notes:

  • plans/chunks.json captures every manifest/chunk directory so inference can reference the exact recordings just processed.

  • The orchestrator writes artifacts/chunks/<recording>/.chunk_status.json with status="completed" whenever chunking succeeds. badc infer orchestrate refuses to run when this status is missing or not completed, keeping GPU time aligned with finished chunk jobs.
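The guard described above is easy to reproduce when scripting around the CLI. A minimal sketch, assuming the status file is a JSON object with a status field set to "completed" on success (as described above); chunk_status_ok is an illustrative helper, not part of badc:

```python
import json
from pathlib import Path

def chunk_status_ok(recording_dir: Path) -> bool:
    """Return True only when .chunk_status.json exists and reports completion.

    Mirrors the guard described in the guide: missing, unreadable, or
    non-"completed" status files all fail the check.
    """
    status_file = recording_dir / ".chunk_status.json"
    if not status_file.exists():
        return False
    try:
        status = json.loads(status_file.read_text())
    except json.JSONDecodeError:
        return False
    return status.get("status") == "completed"
```

Run a check like this before launching inference by hand to stay aligned with what badc infer orchestrate enforces.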

Step 2 — Run inference (plus aggregation bundle)

Feed the saved chunk plan into the inference orchestrator, reuse completed telemetry summaries when available, and attach the aggregation/report bundle so Phase 2 artifacts land under artifacts/aggregate automatically:

$ badc infer orchestrate data/datalad/bogus \
    --chunk-plan plans/chunks.json \
    --include-existing \
    --resume-completed \
    --apply \
    --bundle \
    --bundle-aggregate-dir artifacts/aggregate \
    --bundle-bucket-minutes 30 \
    --bundle-rollup \
    --stub-runner \
    --no-record-datalad

Tips:

  • Drop --stub-runner and add --use-hawkears when you are ready to call the vendor runner.

  • For Sockeye submissions, append --sockeye-script artifacts/sockeye/badc.sh plus the --sockeye-* overrides; the generated script now validates chunk status before each array task.

  • --bundle-rollup automatically calls badc report aggregate-dir once all manifests finish, writing label_summary.csv / recording_summary.csv to artifacts/aggregate/aggregate_summary/ (override with --bundle-rollup-export-dir). The pipeline wrapper flips this flag on by default so dataset-scale runs always leave behind a cross-run leaderboard for Erin.
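The resume behaviour can be approximated outside the CLI when building custom launchers. A sketch under a stated assumption: the resume summary is treated here as JSON with a top-level "completed" list of chunk IDs, which is a hypothetical layout; check a real *.summary.json before relying on it:

```python
import json
from pathlib import Path

def pending_chunks(all_chunks: list[str], summary_path: Path) -> list[str]:
    """Filter out chunks already recorded as completed in a resume summary.

    The "completed" list-of-IDs layout is an assumption for illustration;
    the actual *.summary.json schema may differ.
    """
    completed: set = set()
    if summary_path.exists():
        summary = json.loads(summary_path.read_text())
        completed = set(summary.get("completed", []))
    return [chunk for chunk in all_chunks if chunk not in completed]
```

The same skip-what-is-done idea is what --resume-summary / --resume-completed provide natively.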

Step 3 — Review + save artifacts

Each inference run generates:

  • artifacts/infer/<recording>/ detection JSON/CSV files.

  • artifacts/telemetry/infer/<recording>.jsonl telemetry logs plus resumable summaries.

  • artifacts/aggregate/<RUN_ID>* quicklook CSVs, Parquet exports, and DuckDB bundles from --bundle.

Capture the results in DataLad and sync upstream:

$ cd data/datalad/bogus
$ datalad save artifacts -m "End-to-end chunk+infer bundle"
$ datalad push --to origin

Step 4 — Analyze

Open docs/notebooks/aggregate_analysis.ipynb (or your own notebooks) and point it at the new artifacts/aggregate/<RUN_ID>.duckdb / Parquet bundle. Python helpers in badc.duckdb_helpers provide ready-to-use pandas DataFrames for label_summary, recording_summary, and timeline_summary views.
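If you want a first look without the notebook stack, the quicklook CSVs can be rolled up with the standard library alone. A sketch assuming hypothetical column names label and count (verify against your actual quicklook headers before using):

```python
import csv
from collections import Counter
from pathlib import Path

def label_totals(quicklook_dir: Path) -> Counter:
    """Sum per-label detection counts across every quicklook CSV in a directory.

    The 'label' and 'count' column names are assumptions for illustration;
    adjust them to match the real quicklook exports.
    """
    totals: Counter = Counter()
    for csv_path in sorted(quicklook_dir.glob("*.csv")):
        with csv_path.open(newline="") as fh:
            for row in csv.DictReader(fh):
                totals[row["label"]] += int(row["count"])
    return totals
```

For anything heavier than a sanity check, prefer the DuckDB/Parquet bundles and badc.duckdb_helpers as described above.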

Troubleshooting checklist

  • Chunk guard trip — badc infer orchestrate prints “missing chunk status” when a recording lacks artifacts/chunks/<recording>/.chunk_status.json or the status differs from completed. Re-run badc chunk orchestrate --plan-json --apply --include-existing for only the affected recordings, verify the status flips to completed, then re-run inference.

  • Telemetry resume missing — Sockeye jobs print warnings when the resume summary listed in the generated script does not exist. Run badc infer run --resume-summary <path> manually to confirm the path, or delete the entry from plans/chunks.json and regenerate the script after a fresh badc infer orchestrate --chunk-plan --apply run.

  • Dataset not connected — badc pipeline run expects the dataset to be registered via badc data connect. Run badc data status to confirm the path, or manually cd into data/datalad/<dataset> and re-run the command from there so DataLad can materialise files.

  • DataLad dirty tree — datalad run refuses to wrap commands when the dataset already has uncommitted files. Save or drop the pending work (datalad status, then datalad save or datalad drop) before rerunning the orchestrator with --record-datalad.

  • Bundle rollup missing CSVs — badc infer orchestrate --bundle-rollup writes artifacts/aggregate/aggregate_summary/*.csv only after every recording finishes. If the directory is empty, inspect telemetry logs for failed chunks, rerun the orchestrator with --resume-completed, and confirm badc report aggregate-dir succeeds manually before saving artifacts.
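For the chunk-guard trip in particular, a quick scan identifies every recording that needs re-chunking in one pass. A sketch under the same .chunk_status.json assumption as Step 1 (a JSON file with status="completed" on success):

```python
import json
from pathlib import Path

def incomplete_recordings(chunks_root: Path) -> list:
    """List recording directories whose chunk status is missing or not 'completed'.

    chunks_root is the artifacts/chunks/ directory; any recording with an
    absent, unreadable, or non-completed status file is flagged for re-chunking.
    """
    affected = []
    for rec_dir in sorted(p for p in chunks_root.iterdir() if p.is_dir()):
        try:
            status = json.loads((rec_dir / ".chunk_status.json").read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            affected.append(rec_dir.name)
            continue
        if status.get("status") != "completed":
            affected.append(rec_dir.name)
    return affected
```

Feed the resulting names back into badc chunk orchestrate rather than re-chunking the whole dataset.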

Next steps

  • For HPC runs, rely on badc infer orchestrate --sockeye-script with --sockeye-resume-completed / --sockeye-bundle plus the chunk-status guard described above.

  • When chunking or inference needs to resume partially, re-run the commands with --include-existing / --allow-partial-chunks while trusting the status files to skip work that already completed.