End-to-End CLI Pipeline¶
Quick start¶
Run the entire chunk → infer → aggregate/report loop in one shot with the new wrapper:
$ badc pipeline run data/datalad/bogus \
--chunk-plan plans/pipeline.json \
--chunk-duration 60 \
--bundle \
--bundle-aggregate-dir artifacts/aggregate
The command saves the chunk plan JSON (for reruns/HPC scripts), enforces chunk-status completion
before inference, and optionally runs badc infer aggregate + badc report bundle so every
recording leaves behind quicklook CSVs, Parquet exports, and DuckDB bundles. The sections below break
down the same workflow when you want to call each stage manually.
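When the wrapper needs to run from a scheduler or a rerun script, a thin subprocess call keeps the invocation reproducible. A minimal sketch using only the flags shown above; the `run_pipeline` helper itself is illustrative, not part of `badc`:

```python
import subprocess
from pathlib import Path

def run_pipeline(dataset: Path, plan: Path, aggregate_dir: Path) -> None:
    """Invoke `badc pipeline run` with the quick-start flags (helper is illustrative)."""
    cmd = [
        "badc", "pipeline", "run", str(dataset),
        "--chunk-plan", str(plan),        # plan JSON saved for reruns/HPC scripts
        "--chunk-duration", "60",
        "--bundle",                       # attach the aggregate + report bundle
        "--bundle-aggregate-dir", str(aggregate_dir),
    ]
    # check=True turns any non-zero exit (e.g., a tripped chunk guard) into an exception.
    subprocess.run(cmd, check=True)

run_pipeline(Path("data/datalad/bogus"), Path("plans/pipeline.json"),
             Path("artifacts/aggregate"))
```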
This guide stitches the chunking, inference, aggregation, and reporting CLIs into a single, reproducible workflow that produces DataLad-tracked artifacts ready for Erin’s Phase 2 analytics review.
Prerequisites¶
- `badc` installed in an activated virtual environment.
- Target DataLad dataset cloned and connected via `badc data connect` (e.g., `data/datalad/bogus`).
- HawkEars vendor repo fetched via `git submodule update --init --recursive` (only needed when running with `--use-hawkears`).
- GPUs visible via `badc gpus` (or provide `--cpu-workers` for stub runs).
Pipeline map¶
badc pipeline run simply strings together the commands below. When you need to reason about an
interrupted run (or explain the workflow to HPC operators), keep this flow handy:
data/datalad/<dataset> (git/datalad clone)
|
| badc chunk orchestrate --plan-json plans/chunks.json --apply
v
artifacts/chunks/<recording>/.chunk_status.json + manifests/*.csv
|
| badc infer orchestrate --chunk-plan plans/chunks.json --apply --bundle
v
artifacts/infer/<recording>/*.{json,csv} + telemetry/*.jsonl + *.summary.json
|
| badc infer aggregate + badc report bundle/aggregate-dir
v
artifacts/aggregate/<RUN_ID>_{summary,parquet,duckdb,*quicklook}/
|
| docs/notebooks/aggregate_analysis.ipynb
v
figures + tables for Erin (label_summary, recording_summary, timeline_summary)
Each arrow enforces a guardrail:
- The chunk orchestrator writes `.chunk_status.json` per recording; inference refuses to start when the status is missing or not `completed` (a pre-flight version of this check is sketched below).
- The inference orchestrator creates telemetry JSONL logs and resumable `*.summary.json` files so `--resume-summary`/`--resume-completed` can skip finished chunks.
- `--bundle` consolidates the aggregation/report helpers so every recording leaves behind Parquet, quicklook CSVs, DuckDB bundles, and rollups under `artifacts/aggregate`.
- `--bundle-rollup` triggers `badc report aggregate-dir` for dataset-wide leaderboards automatically.
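That first guardrail is cheap to reproduce as a pre-flight check before committing GPU time. A minimal sketch, assuming only the layout from the map above (`artifacts/chunks/<recording>/.chunk_status.json` carrying a `status` field, per Step 1); the helper name is illustrative:

```python
import json
from pathlib import Path

def unfinished_recordings(chunks_root: Path) -> list[str]:
    """List recordings whose chunk status is missing or not 'completed'."""
    problems = []
    for rec_dir in sorted(p for p in chunks_root.iterdir() if p.is_dir()):
        status_file = rec_dir / ".chunk_status.json"
        if not status_file.exists():
            problems.append(f"{rec_dir.name}: missing chunk status")
        elif (status := json.loads(status_file.read_text()).get("status")) != "completed":
            problems.append(f"{rec_dir.name}: status={status!r}")
    return problems

# Fail fast, the same way `badc infer orchestrate` would.
if issues := unfinished_recordings(Path("artifacts/chunks")):
    raise SystemExit("chunk guard would trip:\n" + "\n".join(issues))
```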
Step 1 — Chunk the dataset¶
Generate manifests and chunk WAVs for every recording, capturing the plan for downstream commands
and ensuring each chunk directory records .chunk_status.json:
$ badc chunk orchestrate data/datalad/bogus \
--pattern "*.wav" \
--chunk-duration 60 \
--overlap 0 \
--plan-json plans/chunks.json \
--apply \
--include-existing \
--workers 4
Notes:
- `plans/chunks.json` captures every manifest/chunk directory so inference can reference the exact recordings just processed.
- The orchestrator writes `artifacts/chunks/<recording>/.chunk_status.json` with `status="completed"` whenever chunking succeeds.
- `badc infer orchestrate` refuses to run when this status is missing or not `completed`, keeping GPU time aligned with finished chunk jobs.
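Because `--chunk-duration 60` with `--overlap 0` yields `ceil(duration / 60)` chunks per recording, a quick standard-library pass over the source WAVs predicts what the manifests should contain. A minimal sketch that assumes PCM WAVs readable by the `wave` module:

```python
import math
import wave
from pathlib import Path

CHUNK_SECONDS = 60  # matches --chunk-duration 60; no overlap, per --overlap 0

for wav_path in sorted(Path("data/datalad/bogus").rglob("*.wav")):
    with wave.open(str(wav_path)) as handle:
        duration = handle.getnframes() / handle.getframerate()
    # With zero overlap, the chunker should emit ceil(duration / 60) chunks.
    expected = math.ceil(duration / CHUNK_SECONDS)
    print(f"{wav_path.name}: {duration:.1f}s -> {expected} chunk(s)")
```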
Step 2 — Run inference (plus aggregation bundle)¶
Feed the saved chunk plan into the inference orchestrator, reuse completed telemetry summaries when
available, and attach the aggregation/report bundle so Phase 2 artifacts land under
artifacts/aggregate automatically:
$ badc infer orchestrate data/datalad/bogus \
--chunk-plan plans/chunks.json \
--include-existing \
--resume-completed \
--apply \
--bundle \
--bundle-aggregate-dir artifacts/aggregate \
--bundle-bucket-minutes 30 \
--bundle-rollup \
--stub-runner \
--no-record-datalad
Tips:
- Drop `--stub-runner` and add `--use-hawkears` when you are ready to call the vendor runner.
- For Sockeye submissions, append `--sockeye-script artifacts/sockeye/badc.sh` plus the `--sockeye-*` overrides; the generated script now validates chunk status before each array task.
- `--bundle-rollup` automatically calls `badc report aggregate-dir` once all manifests finish, writing `label_summary.csv`/`recording_summary.csv` to `artifacts/aggregate/aggregate_summary/` (override with `--bundle-rollup-export-dir`). The pipeline wrapper enables this flag by default so dataset-scale runs always leave behind a cross-run leaderboard for Erin.
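To see which recordings already have resumable summaries before passing `--resume-completed`, list the summary files. A minimal sketch, assuming summaries land next to the detections as `artifacts/infer/<recording>/*.summary.json` (the layout shown in the pipeline map); nothing about their internal schema is assumed:

```python
from pathlib import Path

infer_root = Path("artifacts/infer")
for rec_dir in sorted(p for p in infer_root.iterdir() if p.is_dir()):
    summaries = list(rec_dir.glob("*.summary.json"))
    state = "resumable" if summaries else "no summary yet"
    print(f"{rec_dir.name}: {state} ({len(summaries)} summary file(s))")
```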
Step 3 — Review + save artifacts¶
Each inference run generates:
- `artifacts/infer/<recording>/` detection JSON/CSV files.
- `artifacts/telemetry/infer/<recording>.jsonl` telemetry logs plus resumable summaries.
- `artifacts/aggregate/<RUN_ID>*` quicklook CSVs, Parquet exports, and DuckDB bundles from `--bundle`.
Capture the results in DataLad and sync upstream:
$ cd data/datalad/bogus
$ datalad save artifacts -m "End-to-end chunk+infer bundle"
$ datalad push --to origin
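If the pipeline is already driven from Python, the same save-and-push can go through DataLad's Python API instead of the shell; the commit message below mirrors the shell example:

```python
import datalad.api as dl

ds = dl.Dataset("data/datalad/bogus")
ds.save(path="artifacts", message="End-to-end chunk+infer bundle")  # same as `datalad save`
ds.push(to="origin")                                                # same as `datalad push`
```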
Step 4 — Analyze¶
Open docs/notebooks/aggregate_analysis.ipynb (or your own notebooks) and point it at the new
artifacts/aggregate/<RUN_ID>.duckdb / Parquet bundle. Python helpers in
badc.duckdb_helpers provide ready-to-use pandas DataFrames for label_summary,
recording_summary, and timeline_summary views.
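Outside the notebook, the same views can be queried directly with the `duckdb` package. A minimal sketch, assuming the bundle exposes `label_summary`, `recording_summary`, and `timeline_summary` as queryable relations; replace `<RUN_ID>` with your run's identifier:

```python
import duckdb

# Read-only connection so an open notebook session is not disturbed.
con = duckdb.connect("artifacts/aggregate/<RUN_ID>.duckdb", read_only=True)
label_summary = con.execute("SELECT * FROM label_summary").fetch_df()
recording_summary = con.execute("SELECT * FROM recording_summary").fetch_df()
timeline_summary = con.execute("SELECT * FROM timeline_summary").fetch_df()
print(label_summary.head())
```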
Troubleshooting checklist¶
- Chunk guard trip — `badc infer orchestrate` prints “missing chunk status” when a recording lacks `artifacts/chunks/<recording>/.chunk_status.json` or the status differs from `completed`. Re-run `badc chunk orchestrate --plan-json … --apply --include-existing` for only the affected recordings, verify the status flips to `completed`, then re-run inference.
- Telemetry resume missing — Sockeye jobs print warnings when the resume summary listed in the generated script does not exist. Run `badc infer run --resume-summary <path>` manually to confirm the path, or delete the row from `plans/chunks.json` and regenerate the script after a fresh `badc infer orchestrate --chunk-plan … --apply` run.
- Dataset not connected — `badc pipeline run` expects the dataset to be registered via `badc data connect`. Run `badc data status` to confirm the path, or manually `cd` into `data/datalad/<dataset>` and re-run the command from there so DataLad can materialise files.
- DataLad dirty tree — `datalad run` refuses to wrap commands when the dataset already has uncommitted files. Save or drop the pending work (`datalad status`, then `datalad save` or `datalad drop`) before rerunning the orchestrator with `--record-datalad`.
- Bundle rollup missing CSVs — `badc infer orchestrate --bundle-rollup` writes `artifacts/aggregate/aggregate_summary/*.csv` only after every recording finishes. If the directory is empty, inspect telemetry logs for failed chunks, rerun the orchestrator with `--resume-completed`, and confirm `badc report aggregate-dir` succeeds manually before saving artifacts.
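When triaging an interrupted run against this checklist, it helps to split recordings that need re-chunking from those that only need inference re-run. A minimal sketch combining the first two checks, again assuming the artifact layout from the pipeline map and nothing beyond the `status` field inside the JSON:

```python
import json
from pathlib import Path

chunks_root, infer_root = Path("artifacts/chunks"), Path("artifacts/infer")
rechunk, reinfer = [], []

for rec_dir in sorted(p for p in chunks_root.iterdir() if p.is_dir()):
    status_file = rec_dir / ".chunk_status.json"
    completed = (status_file.exists()
                 and json.loads(status_file.read_text()).get("status") == "completed")
    if not completed:
        rechunk.append(rec_dir.name)   # chunk guard would trip for these
    elif not any((infer_root / rec_dir.name).glob("*.summary.json")):
        reinfer.append(rec_dir.name)   # chunked, but no resumable summary yet

print("re-run `badc chunk orchestrate` for:", rechunk or "nothing")
print("re-run `badc infer orchestrate` for:", reinfer or "nothing")
```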
Next steps¶
- For HPC runs, rely on `badc infer orchestrate --sockeye-script` with `--sockeye-resume-completed`/`--sockeye-bundle` plus the chunk-status guard described above.
- When chunking or inference needs to resume partially, re-run the commands with `--include-existing`/`--allow-partial-chunks` and trust the status files to skip work that already completed.