Usage Overview

The snippets below show the canonical BADC workflow end to end. Each section links back to the corresponding CLI reference page so you can dive deeper when needed.

Bootstrap a checkout

The CLI entry point is badc. After cloning the repo, initialise submodules and connect the bogus DataLad dataset so sample audio is available locally:

$ git submodule update --init --recursive
$ badc data connect bogus --pull

badc data connect records the dataset path in ~/.config/badc/data.toml (see Data Repository Commands). You can confirm the registry at any time:

$ badc data status
Tracked datasets:
 - bogus: connected (/home/gep/projects/badc/data/datalad/bogus)

To detach the dataset (and optionally drop annexed content):

$ badc data disconnect bogus --drop-content
Dataset bogus marked as disconnected; data removed.

Chunk audio and build manifests

Refer to Chunk Commands for option details. A typical sequence:

  1. Probe a file to estimate viable chunk durations:

    $ badc chunk probe data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
        --initial-duration 120 --max-duration 600 --tolerance 10
    Recommended chunk duration: 248.00 s (strategy: memory_estimator_v1)
    Notes: GPU 0 (Quadro RTX 4000) limit 6554 MiB
    Telemetry log: artifacts/telemetry/chunk_probe/XXXX-000_20251001_093000_20251208T210945Z.jsonl
    Recent attempts:
     • 120.00s -> 1050.5 MiB fits (fits memory budget)
     • 360.00s -> 3151.6 MiB fits (fits memory budget)
     • 480.00s -> 4202.1 MiB fits (fits memory budget)
    
  2. Generate a manifest without writing audio (hashes optional):

    $ badc chunk manifest data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
        --chunk-duration 60 --hash-chunks \
        --output manifests/XXXX-000_20251001_093000.csv
    Wrote manifest with chunk duration 60s to manifests/XXXX-000_20251001_093000.csv (with hashes)
    
  3. Split chunks to disk and emit a manifest in one pass:

    $ badc chunk run data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
        --chunk-duration 60 \
        --overlap 5 \
        --output-dir artifacts/chunks \
        --manifest manifests/XXXX-000_20251001_093000.csv
    Chunks written to artifacts/chunks; manifest at manifests/XXXX-000_20251001_093000.csv
    
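The --chunk-duration/--overlap pair above implies a hop of duration minus overlap between chunk starts. A minimal sketch of that arithmetic, assuming simple fixed-hop chunking (the helper name is invented; this is not BADC's actual implementation):

```python
def chunk_spans(total_s: float, duration_s: float, overlap_s: float):
    """Yield (start, end) chunk boundaries; hypothetical helper, not BADC's API."""
    hop = duration_s - overlap_s  # consecutive chunks share `overlap_s` seconds
    spans = []
    start = 0.0
    while start < total_s:
        spans.append((start, min(start + duration_s, total_s)))
        start += hop
    return spans

# A 3-minute file with 60 s chunks and 5 s overlap starts chunks at 0, 55, 110, 165.
print(chunk_spans(180, 60, 5))
```

With these flags each chunk repeats the last 5 seconds of its predecessor, so detections near chunk boundaries are not lost.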

When planning large batch jobs, pair badc chunk run with datalad run (see Chunk Audio Recordings) so provenance is captured alongside the generated WAVs.

Run inference

The Infer Commands page covers every option; the quick hits below show common patterns.

Stub/local runs (no HawkEars, great for CI):

$ badc infer run manifests/XXXX-000_20251001_093000.csv \
    --runner-cmd "echo hawkears-stub"
Processed 3 jobs; outputs stored in artifacts/infer

Run HawkEars directly (requires CUDA and the vendor checkout):

$ badc infer run manifests/XXXX-000_20251001_093000.csv \
    --use-hawkears \
    --hawkears-arg --min_score \
    --hawkears-arg 0.7
Processed 3 jobs; outputs stored in artifacts/infer

Reuse the shared config file instead of hand-writing flags:

$ badc infer run-config configs/hawkears-local.toml
Processed 3 jobs; outputs stored in artifacts/infer

CPU-heavy fallback (e.g., developers without GPUs or teams that want CPU assist threads):

$ badc infer run manifests/XXXX-000_20251001_093000.csv --cpu-workers 4
Processed 3 jobs; outputs stored in artifacts/infer

Preview a datalad run command without executing jobs:

$ badc infer run manifests/XXXX-000_20251001_093000.csv --print-datalad-run
Run the following from the dataset root (/home/gep/projects/badc/data/datalad/bogus):
  datalad run -m "badc infer ..." --input manifests/... --output artifacts/infer -- badc infer run ...

When chunk inputs live inside a DataLad dataset (for example data/datalad/bogus), inference outputs default to <dataset>/artifacts/infer so you can immediately datalad save and push.

GPU planning helpers:

$ badc gpus
nvidia-smi reported 'Insufficient Permissions'. GPU inventory usually requires NVML access—try running `sudo nvidia-smi` to confirm the driver works or ask the cluster admin to grant your user access to the NVIDIA devices.
No GPUs detected via nvidia-smi.

Use --max-gpus to cap the GPU pool or --cpu-workers to append CPU threads (BADC still adds one CPU worker automatically when no GPUs are detected). When detection succeeds the utility lists each GPU (index, name, memory). If you see a permissions warning, ask the system administrator to grant access to the NVIDIA devices; until then BADC falls back to CPU workers.
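The pool-sizing rule described above can be sketched in a few lines. This mirrors the documented behaviour (GPU pool capped by --max-gpus, CPU workers appended, one CPU worker guaranteed when no GPUs exist), not BADC's actual code, and it assumes the automatic CPU worker is only added when none were requested:

```python
def plan_workers(gpu_indices, max_gpus=None, cpu_workers=0):
    """Hypothetical sketch of the documented worker-pool rule."""
    gpus = list(gpu_indices)[:max_gpus] if max_gpus is not None else list(gpu_indices)
    workers = [f"gpu:{i}" for i in gpus] + [f"cpu:{n}" for n in range(cpu_workers)]
    if not workers:
        workers = ["cpu:0"]  # documented fallback: one CPU worker when no GPUs detected
    return workers

print(plan_workers([0, 1], max_gpus=1, cpu_workers=2))  # -> ['gpu:0', 'cpu:0', 'cpu:1']
print(plan_workers([]))                                 # -> ['cpu:0']
```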

Aggregate detections and monitor telemetry

Summarise detections with badc infer aggregate (see Infer Commands), optionally emitting a Parquet file for DuckDB and pulling chunk metadata from the original manifest:

$ badc infer aggregate artifacts/infer \
    --manifest manifests/XXXX-000_20251001_093000.csv \
    --output artifacts/aggregate/summary.csv \
    --parquet artifacts/aggregate/detections.parquet
Wrote detection summary to artifacts/aggregate/summary.csv
Wrote Parquet export to artifacts/aggregate/detections.parquet

Hand the Parquet file to Report Commands for quick pivots. Each row now carries the chunk-relative start/end offsets, absolute timestamps, HawkEars label code/name pairs, confidence, runner label, and the detected HawkEars model_version so downstream DuckDB queries have everything needed for Phase 2 aggregation:

$ badc report summary --parquet artifacts/aggregate/detections.parquet --group-by label
+---------+------------+----------------+
| label   | detections | avg_confidence |
+---------+------------+----------------+
| grouse  | 42         | 0.87           |
+---------+------------+----------------+
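The same pivot is easy to reproduce outside the CLI. A stdlib-only sketch of the group-by (in practice you would query the Parquet file with DuckDB or pandas; the detection rows below are invented sample data, not real output):

```python
from collections import defaultdict

# Invented rows standing in for the Parquet export's label/confidence columns.
rows = [
    {"label": "grouse", "confidence": 0.91},
    {"label": "grouse", "confidence": 0.83},
    {"label": "owl", "confidence": 0.70},
]

groups = defaultdict(list)
for row in rows:
    groups[row["label"]].append(row["confidence"])

summary = {
    label: {"detections": len(confs), "avg_confidence": round(sum(confs) / len(confs), 2)}
    for label, confs in groups.items()
}
print(summary["grouse"])  # -> {'detections': 2, 'avg_confidence': 0.87}
```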

Need more color than a single pivot? badc report quicklook hits the same Parquet file but emits multiple Rich tables (top labels, top recordings, chunk timeline) and ASCII sparklines while also writing optional CSV snapshots for notebooks:

$ badc report quicklook --parquet artifacts/aggregate/detections.parquet --output-dir artifacts/aggregate/quicklook
$ ls artifacts/aggregate/quicklook
chunks.csv  labels.csv  recordings.csv

Telemetry logs are unique per manifest/timestamp (the default path is printed by badc infer run). Tail them with the monitor view on the Infer Commands page (per-GPU utilization/memory trends, rolling ASCII sparklines, success/failure counts, and a live event tail) or with the lightweight badc telemetry command (see Miscellaneous Commands):

$ badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
$ badc telemetry --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
Telemetry records (4):
[success] GNWT-290_chunk_1 (GPU 0) 2025-12-06T18:22:11 runtime=12.4

Telemetry logs are JSONL files, so you can also ingest them into notebooks or log shippers for dashboards. See Aggregate Detection Results for a guided walkthrough tying together aggregation, Parquet exports, DuckDB summaries, and telemetry monitoring.
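Because each JSONL line is a standalone JSON object, ingestion needs nothing beyond the standard library. A sketch that tallies statuses from a telemetry log (the record fields here are guesses based on the `badc telemetry` output above, not a documented schema):

```python
import json
from collections import Counter

# Invented records mimicking fields visible in the telemetry output above.
log_lines = [
    '{"status": "success", "chunk": "GNWT-290_chunk_1", "gpu": 0, "runtime": 12.4}',
    '{"status": "failure", "chunk": "GNWT-290_chunk_2", "gpu": 0, "runtime": 3.1}',
]

counts = Counter(json.loads(line)["status"] for line in log_lines)
print(dict(counts))  # -> {'success': 1, 'failure': 1}
```

Swapping `log_lines` for `open(path)` over a real log file gives a quick success/failure dashboard without any extra dependencies.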

See also