Usage Overview
The snippets below show the canonical BADC workflow end to end. Each section links back to the corresponding CLI reference page so you can dive deeper when needed.
Bootstrap a checkout
The CLI entry point is badc. After cloning the repo, initialise submodules and connect the bogus
DataLad dataset so sample audio is available locally:
$ git submodule update --init --recursive
$ badc data connect bogus --pull
badc data connect records the dataset path in ~/.config/badc/data.toml (see
Data Repository Commands). You can confirm the registry at any time:
$ badc data status
Tracked datasets:
- bogus: connected (/home/gep/projects/badc/data/datalad/bogus)
To detach the dataset (and optionally drop annexed content):
$ badc data disconnect bogus --drop-content
Dataset bogus marked as disconnected; data removed.
Chunk audio and build manifests
Refer to Chunk Commands for option details. A typical sequence:
Probe a file to estimate viable chunk durations:
$ badc chunk probe data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
    --initial-duration 120 --max-duration 600 --tolerance 10
Recommended chunk duration: 248.00 s (strategy: memory_estimator_v1)
Notes: GPU 0 (Quadro RTX 4000) limit 6554 MiB
Telemetry log: artifacts/telemetry/chunk_probe/XXXX-000_20251001_093000_20251208T210945Z.jsonl
Recent attempts:
  • 120.00s -> 1050.5 MiB fits (fits memory budget)
  • 360.00s -> 3151.6 MiB fits (fits memory budget)
  • 480.00s -> 4202.1 MiB fits (fits memory budget)
Generate a manifest without writing audio (hashes optional):
$ badc chunk manifest data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
    --chunk-duration 60 --hash-chunks \
    --output manifests/XXXX-000_20251001_093000.csv
Wrote manifest with chunk duration 60s to manifests/XXXX-000_20251001_093000.csv (with hashes)
Split chunks to disk and emit a manifest in one pass:
$ badc chunk run data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
    --chunk-duration 60 --overlap 5 --output-dir artifacts/chunks \
    --manifest manifests/XXXX-000_20251001_093000.csv
Chunks written to artifacts/chunks; manifest at manifests/XXXX-000_20251001_093000.csv
When planning large batch jobs, pair badc chunk run with datalad run (see
Chunk Audio Recordings) so provenance is captured alongside the generated WAVs.
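To make the `--chunk-duration`/`--overlap` pair concrete, here is a small sketch of overlapped chunk-boundary computation. The semantics assumed here (each chunk repeats the final `overlap` seconds of its predecessor, i.e. a stride of `duration - overlap`) are a plausible reading of the flags, not badc's documented behaviour:

```python
def chunk_spans(total_s, chunk_s, overlap_s=0.0):
    """Yield (start, end) spans covering total_s seconds of audio.

    Assumes overlap means each chunk re-reads the last `overlap_s` seconds
    of the previous chunk; badc's exact semantics may differ.
    """
    stride = chunk_s - overlap_s
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk duration")
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += stride

# A 170 s recording split with --chunk-duration 60 --overlap 5:
print(list(chunk_spans(170, 60, 5)))
```

Note the final span is shorter than a full chunk; whether badc emits, pads, or drops such tails is an implementation detail worth checking against a real manifest.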
Run inference
The Infer Commands page covers every option; the quick hits below show common patterns.
Stub/local runs (no HawkEars, great for CI):
$ badc infer run manifests/XXXX-000_20251001_093000.csv --runner-cmd "echo hawkears-stub"
Processed 3 jobs; outputs stored in artifacts/infer
Leverage HawkEars directly (requires CUDA + vendor checkout):
$ badc infer run manifests/XXXX-000_20251001_093000.csv --use-hawkears --hawkears-arg --min_score --hawkears-arg 0.7
Processed 3 jobs; outputs stored in artifacts/infer
Reuse the shared config file instead of hand-writing flags:
$ badc infer run-config configs/hawkears-local.toml
Processed 3 jobs; outputs stored in artifacts/infer
CPU-heavy fallback (e.g., developers without GPUs or teams that want CPU assist threads):
$ badc infer run manifests/XXXX-000_20251001_093000.csv --cpu-workers 4
Processed 3 jobs; outputs stored in artifacts/infer
Preview a datalad run command without executing jobs:
$ badc infer run manifests/XXXX-000_20251001_093000.csv --print-datalad-run
Run the following from the dataset root (/home/gep/projects/badc/data/datalad/bogus):
datalad run -m "badc infer ..." --input manifests/... --output artifacts/infer -- badc infer run ...
When chunk inputs live inside a DataLad dataset (for example data/datalad/bogus), inference
outputs default to <dataset>/artifacts/infer so you can immediately datalad save and push.
GPU planning helpers:
$ badc gpus
nvidia-smi reported 'Insufficient Permissions'. GPU inventory usually requires NVML access—try running `sudo nvidia-smi` to confirm the driver works or ask the cluster admin to grant your user access to the NVIDIA devices.
No GPUs detected via nvidia-smi.
Use --max-gpus to cap the GPU pool or --cpu-workers to append CPU threads (BADC still adds
one CPU worker automatically when no GPUs are detected). When detection succeeds the utility lists
each GPU (index, name, memory). If you see a permissions warning, ask the system
administrator for access to the NVIDIA devices; until then BADC falls back to CPU workers.
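The worker-pool sizing described above can be sketched as follows. This is an assumed reconstruction of the stated rules (`--max-gpus` caps the GPU pool, `--cpu-workers` appends CPU threads, and a single CPU worker is added when no GPUs are usable), not badc's actual scheduler code:

```python
def plan_workers(gpu_indices, max_gpus=None, cpu_workers=0):
    """Sketch of BADC's assumed worker-pool sizing rules."""
    gpus = list(gpu_indices)
    if max_gpus is not None:
        gpus = gpus[:max_gpus]          # --max-gpus caps the GPU pool
    cpus = cpu_workers                  # --cpu-workers appends CPU threads
    if not gpus and cpus == 0:
        cpus = 1                        # fallback so inference can still run
    return {"gpu_workers": gpus, "cpu_workers": cpus}

print(plan_workers([0, 1, 2], max_gpus=2, cpu_workers=4))
print(plan_workers([]))  # no GPUs detected -> one CPU worker
```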
Aggregate detections and monitor telemetry
Summarise detections via Infer Commands, optionally emitting a Parquet file for DuckDB and pulling chunk metadata from the original manifest:
$ badc infer aggregate artifacts/infer \
--manifest manifests/XXXX-000_20251001_093000.csv \
--output artifacts/aggregate/summary.csv \
--parquet artifacts/aggregate/detections.parquet
Wrote detection summary to artifacts/aggregate/summary.csv
Wrote Parquet export to artifacts/aggregate/detections.parquet
Hand the Parquet file to Report Commands for quick pivots. Each row now carries the chunk-relative
start/end offsets, absolute timestamps, HawkEars label code/name pairs, confidence, runner label, and
the detected HawkEars model_version so downstream DuckDB queries have everything needed for Phase 2
aggregation:
$ badc report summary --parquet artifacts/aggregate/detections.parquet --group-by label
+---------+------------+----------------+
| label   | detections | avg_confidence |
+---------+------------+----------------+
| grouse  |         42 |           0.87 |
+---------+------------+----------------+
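The same pivot is easy to reproduce in a notebook once you have detection rows in hand. The sketch below assumes each detection row carries `label` and `confidence` fields (consistent with the schema described above) and mirrors the assumed behaviour of `badc report summary --group-by label`: count rows and average confidence per group:

```python
from collections import defaultdict

def summarize(detections, key="label"):
    """Group detection rows and report count + mean confidence per group."""
    groups = defaultdict(list)
    for row in detections:
        groups[row[key]].append(row["confidence"])
    return {
        k: {"detections": len(v), "avg_confidence": round(sum(v) / len(v), 2)}
        for k, v in groups.items()
    }

rows = [
    {"label": "grouse", "confidence": 0.90},
    {"label": "grouse", "confidence": 0.84},
]
print(summarize(rows))  # grouse: 2 detections, avg_confidence 0.87
```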
Need more color than a single pivot? badc report quicklook hits the same Parquet file but emits
multiple Rich tables (top labels, top recordings, chunk timeline) and ASCII sparklines while also
writing optional CSV snapshots for notebooks:
$ badc report quicklook --parquet artifacts/aggregate/detections.parquet --output-dir artifacts/aggregate/quicklook
$ ls artifacts/aggregate/quicklook
chunks.csv labels.csv recordings.csv
Telemetry logs are unique per manifest/timestamp (the default path is printed by badc infer run).
Tail them with the monitor view from Infer Commands (per-GPU utilization/memory trends, rolling
ASCII sparklines, success/failure counts, and a live event tail) or with the lightweight
Miscellaneous Commands helper:
$ badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
$ badc telemetry --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
Telemetry records (4):
[success] GNWT-290_chunk_1 (GPU 0) 2025-12-06T18:22:11 runtime=12.4
Telemetry logs are JSONL files, so you can also ingest them into notebooks or log shippers for dashboards. See Aggregate Detection Results for a guided walkthrough that ties together aggregation, Parquet exports, DuckDB summaries, and telemetry monitoring.
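Because each telemetry line is a standalone JSON object, ingestion needs nothing beyond the standard library. The field names below (`status`, `job`, `gpu`, `runtime_s`) are assumptions inferred from the `badc telemetry` output above, not a documented schema:

```python
import json

# Hypothetical telemetry records -- field names are inferred from the
# CLI output shown above and may not match badc's real schema.
LINES = """\
{"status": "success", "job": "GNWT-290_chunk_1", "gpu": 0, "runtime_s": 12.4}
{"status": "failure", "job": "GNWT-290_chunk_2", "gpu": 0, "runtime_s": 3.1}
"""

records = [json.loads(line) for line in LINES.splitlines()]
ok = sum(r["status"] == "success" for r in records)
print(f"{ok}/{len(records)} jobs succeeded")
```

The same loop works on a real log file by swapping `LINES.splitlines()` for `open(path)`.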
See also
Data Repository Commands – dataset helpers.
Chunk Commands – chunk probe/split/manifest/run parameters.
Infer Commands – inference, aggregation, and telemetry options.
Run Inference on Sockeye – SLURM-ready instructions for Sockeye/Chinook deployments.