Infer Commands¶
The badc infer namespace schedules HawkEars (or a stub runner) across local GPUs/CPUs, writes
per-chunk JSON detections, and aggregates results for downstream analysis.
Overview¶
- Input is always a chunk manifest CSV produced by `badc chunk`.
- Outputs default to `artifacts/infer` (or `<dataset>/artifacts/infer` when the manifest lives inside a DataLad dataset).
- Telemetry is logged per run (default `data/telemetry/infer/<manifest>_<timestamp>.jsonl`, or `<dataset>/artifacts/telemetry/…` when the manifest lives inside a dataset) and the CLI prints the path so `badc infer monitor` / `badc telemetry` can tail it.
- `--print-datalad-run` exposes a ready-to-use command for provenance-friendly workflows.
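For example, once a run prints its telemetry path, either viewer can tail it right away (the manifest/timestamp in this path is illustrative):

badc telemetry --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl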
badc infer run¶
Run HawkEars against every chunk listed in the manifest.
Usage:
badc infer run MANIFEST.csv [--max-gpus N] [--cpu-workers N]
[--output-dir PATH] [--runner-cmd CMD | --use-hawkears]
[--hawkears-arg ARG ...] [--max-retries N]
[--resume-summary PATH] [--print-datalad-run]
Key options:
- `--max-gpus`: Limit how many detected GPUs are used. Defaults to "all GPUs reported by `nvidia-smi`".
- `--cpu-workers`: Additional CPU worker threads to append to the GPU pool. When no GPUs exist, BADC still runs at least one CPU worker even if this is left at `0`.
- `--runner-cmd`: Custom executable to run per chunk (e.g., a container wrapper). Mutually exclusive with `--use-hawkears`.
- `--use-hawkears`: Invoke the vendored HawkEars `analyze.py` script directly. BADC injects chunk/audio arguments and parses `HawkEars_labels.csv` into JSON detections.
- `--hawkears-arg`: Repeatable passthrough argument (e.g., `--hawkears-arg --config --hawkears-arg config.yaml`).
- `--max-retries`: Number of automatic retries per chunk when the runner exits non-zero (default 2).
- `--output-dir`: Override the destination for JSON outputs. When omitted and chunks live in a DataLad dataset, BADC writes under `<dataset>/artifacts/infer` so the files remain inside the dataset boundary.
- `--telemetry-log`: Override the telemetry log path (JSONL) that records scheduler events. Defaults to a unique file per manifest/timestamp under `data/telemetry/infer` or `<dataset>/artifacts/telemetry`. Each run also writes a `*.summary.json` next to the log capturing per-worker/per-chunk outcomes for resumable workflows.
- `--resume-summary`: Provide a previously written `*.summary.json` (usually next to the telemetry log) to skip chunks already marked `success`. Helpful when resuming an interrupted run or iterating on a subset of failures.
- `--print-datalad-run`: Instead of running inference, emit a `datalad run` command tailored to the manifest/output pair.
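A minimal resume sketch combining these options. The summary filename below is illustrative (each run prints the exact `*.summary.json` path next to its telemetry log):

# First attempt writes the telemetry log plus a sibling summary
badc infer run manifests/REC.csv --use-hawkears \
  --telemetry-log artifacts/telemetry/infer/REC.jsonl

# Resume after an interruption: chunks already marked success are skipped
badc infer run manifests/REC.csv --use-hawkears \
  --telemetry-log artifacts/telemetry/infer/REC.jsonl \
  --resume-summary artifacts/telemetry/infer/REC.summary.json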
Option reference¶
| Option / Argument | Description | Default |
|---|---|---|
| `MANIFEST` | Chunk manifest generated by `badc chunk`. | Required |
| `--max-gpus` | Upper bound on detected GPUs to enlist. | All GPUs |
| `--cpu-workers` | Additional CPU worker threads (at least one CPU worker is added automatically when no GPUs exist). | `0` |
| `--runner-cmd` | Custom executable invoked per chunk. | Stub runner |
| `--use-hawkears` | Call vendored HawkEars `analyze.py` directly. | Disabled |
| `--hawkears-arg` | Repeatable passthrough flag forwarded to HawkEars. | None |
| `--max-retries` | Retry budget for failed chunks. | `2` |
| `--output-dir` | Destination folder for JSON detections. | `artifacts/infer` |
| `--telemetry-log` | Telemetry log file capturing scheduler events. | Derived from manifest name |
| `--resume-summary` | Scheduler summary JSON for resuming interrupted runs. | Disabled |
| `--print-datalad-run` | Emit provenance-friendly command instead of executing jobs. | Disabled |
Help excerpt¶
$ badc infer run --help
Usage: badc infer run [OPTIONS] MANIFEST
Run HawkEars (or a custom runner) for every chunk in a manifest.
Arguments:
MANIFEST Path to chunk manifest CSV. [required]
Options:
--max-gpus INTEGER Limit number of GPUs to use.
--cpu-workers INTEGER Extra CPU worker threads to append to the GPU pool.
--output-dir PATH Directory for inference outputs.
--runner-cmd TEXT Command used to invoke HawkEars (default stub).
--use-hawkears / --stub-runner Invoke the embedded HawkEars analyzer.
--hawkears-arg TEXT Extra argument to pass to HawkEars (repeatable).
--max-retries INTEGER Maximum retries per chunk.
--telemetry-log PATH Telemetry log path (JSONL).
--resume-summary PATH Skip chunks marked success in this scheduler summary JSON.
--print-datalad-run Show a ready-to-run `datalad run` command.
--help Show this message and exit.
Workflow notes:
- Worker pool: BADC pairs each chunk with a `GPUWorker` (index + UUID) derived from `nvidia-smi`. `--cpu-workers` adds CPU threads on top of the GPU pool, and when no GPUs are found the CLI still spins up at least one CPU worker.
- Telemetry: every chunk emits a JSON record with timestamps, runtime, GPU index/name, and (when available) GPU utilization/memory snapshots. The CLI prints the log path; monitor progress via `badc infer monitor --log <file>` (rich GPU summary) or `badc telemetry --log <file>` (plain tail). A sibling `*.summary.json` file captures the per-worker/per-chunk outcomes so interrupted runs can resume without repeating successful chunks; pass that path to `--resume-summary <telemetry.summary.json>` on the next invocation to skip completed jobs.
- Worker summary: once all jobs finish, `badc infer run` prints a per-worker table (GPU/CPU label, total jobs, failures, successful retry counts, and failed attempts) so long runs surface retry hot spots without diving into telemetry logs.
- Failure handling: if any worker raises an exception, the scheduler stops submitting new jobs and re-raises the first error after threads finish.
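To preview the GPU inventory BADC will derive its workers from, you can query `nvidia-smi` directly (a standard NVML query, not a BADC command):

$ nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader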
Example:
badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
--use-hawkears --max-gpus 2 --hawkears-arg --min_score --hawkears-arg 0.7
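The same manifest can be dispatched through a custom wrapper instead; the container command below is purely illustrative of the `--runner-cmd` hook:

# Wrapper name is illustrative; BADC invokes the command once per chunk
badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
  --runner-cmd "apptainer exec hawkears.sif hawkears-analyze"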
See Run inference for more end-to-end command snippets (stub, GPU, and datalad-run).
badc infer run-config¶
Load a TOML configuration (see configs/hawkears-local.toml) and delegate to badc infer
run so teams can share presets without copying long command lines.
Usage:
badc infer run-config configs/hawkears-local.toml
Behavior:
- Parses the `[runner]` table for manifest path, GPU/CPU limits, telemetry log, and optional `runner_cmd` overrides.
- Forwards `[hawkears].extra_args` directly to `vendor/HawkEars/analyze.py` when `runner.use_hawkears` is `true`.
- Supports `--print-datalad-run` to preview the exact command that would be executed from inside the dataset.
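A minimal config sketch covering those fields. Apart from `use_hawkears`, `runner_cmd`, and `extra_args`, the key names below are assumptions; treat `configs/hawkears-local.toml` as authoritative:

$ cat configs/hawkears-local.toml
[runner]
manifest = "manifests/GNWT-290.csv"      # assumed key name for the manifest path
max_gpus = 2                             # assumed key name for the GPU limit
cpu_workers = 0                          # assumed key name for extra CPU workers
telemetry_log = "artifacts/telemetry/infer/GNWT-290.jsonl"  # assumed key name
use_hawkears = true                      # documented: run the vendored analyze.py
# runner_cmd = "..."                     # documented optional override

[hawkears]
extra_args = ["--min_score", "0.7"]      # documented: forwarded to analyze.py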
Options:
| Option | Description |
|---|---|
| `--print-datalad-run` | Show the generated `datalad run` command instead of executing. |
badc infer aggregate¶
Summarize detection JSON into a CSV that analysts can ingest into notebooks, DuckDB, or dashboards.
Usage:
badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv \
--parquet artifacts/aggregate/detections.parquet
Behavior:
- Walks the `detections_dir` and parses each JSON file via `badc.aggregate` helpers.
- When `--manifest` is supplied, missing chunk metadata (start/end offsets, hashes, recording IDs) is filled from the manifest, so custom runners that omit chunk metadata still aggregate cleanly.
- Emits a CSV with canonical detection columns (chunk start/end offsets, detection start/end relative to the chunk, absolute timestamps, label code/name, confidence, runner + model metadata).
- Optionally writes a Parquet file (requires the `duckdb` package) suitable for DuckDB queries or downstream analytics notebooks.
- Skips empty directories with a warning, so it is safe to run even before inference completes.
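For example, enriching detections from a custom runner with manifest metadata (paths reuse earlier examples):

badc infer aggregate artifacts/infer \
  --manifest data/datalad/bogus/manifests/GNWT-290.csv \
  --output artifacts/aggregate/summary.csv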
Option reference¶
| Option / Argument | Description | Default |
|---|---|---|
| `DETECTIONS_DIR` | Folder containing JSON outputs from `badc infer run`. | Required |
| `--output` | Summary CSV destination. | |
| `--manifest` | Optional chunk manifest CSV for metadata enrichment. | Disabled |
| `--parquet` | Optional Parquet export (requires `duckdb`). | Disabled |
Help excerpt¶
$ badc infer aggregate --help
Usage: badc infer aggregate [OPTIONS] DETECTIONS_DIR
Aggregate per-chunk detection JSON files into canonical summaries.
Arguments:
DETECTIONS_DIR Directory containing inference outputs (JSON). [required]
Options:
--output PATH Summary CSV path.
--parquet PATH Optional Parquet export (requires duckdb).
--help Show this message and exit.
Common pattern:
badc infer aggregate <dataset>/artifacts/infer --output <dataset>/artifacts/aggregate/summary.csv
Combine with --manifest so chunk metadata survives even when custom runners omit per-chunk JSON
fields. Each detection row includes chunk-relative start/end times, absolute timestamps, label
codes/names, confidence, runner, and model_version whenever --use-hawkears is active. Pair
the command with datalad run or git annex metadata to track how raw detections feed downstream
reports. When --parquet is enabled you can open the file directly in DuckDB:
duckdb -c "SELECT label, count(*) FROM 'artifacts/aggregate/detections.parquet' GROUP BY 1"
Hand the Parquet export to Report Commands for richer console summaries (group-by label, recording, or both) or to the Aggregate Detection Results workflow for notebook analysis.
badc infer monitor¶
Stream GPU utilization and per-chunk telemetry directly from the JSONL logs produced by
badc infer run. The view renders two rich tables: a per-GPU summary with success/failure
counts, retry attempts, failed-attempt totals, average runtimes, utilization trends (min/avg/max),
peak VRAM usage, and ASCII sparklines showing rolling utilization/VRAM/retry history, plus a live
tail of recent chunk events (status, runtime, attempt counter, GPU, utilization/memory snapshot).
Usage:
badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl --tail 20
Options:
- `--log`: Telemetry log path. Defaults to `data/telemetry/infer/log.jsonl`, but the run command prints the exact location for each manifest/timestamp.
- `--tail`: Number of recent events to display in the lower table.
- `--follow`: Refresh the tables every `--interval` seconds (Ctrl+C to stop).
Use this view during long HawkEars jobs to confirm GPUs remain busy (sustained utilization, stable
VRAM headroom) and to spot retry spikes immediately via the per-GPU retry counters, retry sparkline,
and the event tail’s attempt column. The sparkline columns update every refresh when --follow is
enabled, exposing rolling trends without leaving the CLI.
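For example, to keep the tables refreshing during a long run (the interval value here is arbitrary):

badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl \
  --follow --interval 5 --tail 30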
badc infer orchestrate¶
Plan inference runs across an entire dataset (or a saved chunk plan) without executing HawkEars immediately.
Usage:
badc infer orchestrate data/datalad/bogus \
--manifest-dir manifests \
--output-dir artifacts/infer \
--telemetry-dir artifacts/telemetry \
--plan-csv plans/infer.csv \
--print-datalad-run
Highlights:
- Loads manifests from `<dataset>/manifests` (or a supplied chunk plan CSV/JSON).
- Builds per-recording plans that capture manifest path, output directory, telemetry log, HawkEars settings, CPU/GPU worker overrides, and the chunk status located under `<dataset>/<chunks-dir>/<recording>/.chunk_status.json` (defaults to `artifacts/chunks`). Status must be `completed` unless you opt into `--allow-partial-chunks`; this prevents Sockeye scripts and local `--apply` runs from launching inference against half-finished chunk jobs.
- `--plan-csv` / `--plan-json` save the plan for HPC submission scripts or future re-runs.
- `--print-datalad-run` emits commands such as:

  datalad run -m "Infer REC" --input manifests/REC.csv \
    --output artifacts/infer/REC \
    -- badc infer run manifests/REC.csv \
    --use-hawkears \
    --output-dir artifacts/infer/REC \
    --telemetry-log artifacts/telemetry/infer/REC.jsonl

- `--apply` executes `badc infer run` for each plan entry using the saved settings. When the dataset has `.datalad` and the CLI is available, runs are wrapped in `datalad run` by default (disable via `--no-record-datalad`) so provenance is captured automatically.
- `--sockeye-script` (plus the optional `--sockeye-*` overrides) writes a SLURM job-array script so Sockeye submissions no longer require hand-written sbatch files. Each array task maps to a manifest/output pair from the generated plan. Pair it with `--sockeye-resume-completed` to have the script automatically append `--resume-summary` whenever a telemetry `*.summary.json` already exists, so reruns skip completed chunks. Add `--sockeye-bundle` to chain `badc infer aggregate` and `badc report bundle` right after each inference run so Phase 2 quicklook/parquet artifacts land alongside the detections. The emitted script also validates the chunk status file before running HawkEars; array tasks exit early with a descriptive error if the status file is missing or reports anything other than `completed`.
- `--resume-completed` tells `--apply` runs to look for the telemetry `*.summary.json` that the prior run produced and to pass `--resume-summary` automatically so only unfinished chunks are retried.
- `--bundle` mirrors the Sockeye automation locally: after each `--apply` recording finishes, BADC aggregates detections into `artifacts/aggregate/` and runs `badc report bundle` so the quicklook CSVs, parquet report, and DuckDB database live alongside the dataset without extra commands. Override paths via `--bundle-aggregate-dir` and adjust the timeline window via `--bundle-bucket-minutes`. Append `--bundle-rollup` (plus its `--bundle-rollup-limit` and `--bundle-rollup-export-dir` knobs) to run `badc report aggregate-dir` after the queue drains; the rollup emits dataset-wide label/recording leaderboards to `<aggregate_dir>/aggregate_summary` by default so Erin immediately sees cross-run coverage.
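A representative local follow-up once the saved plan looks right, combining the flags above (dataset and plan paths reuse the usage example):

# Execute every plan entry, skip chunks already recorded as success,
# and aggregate/bundle reports after each recording finishes
badc infer orchestrate data/datalad/bogus \
  --plan-csv plans/infer.csv \
  --apply --resume-completed --bundle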
Chunk status paths default to artifacts/chunks/<recording>/.chunk_status.json; customize the
root with --chunks-dir when your dataset stores chunk WAVs elsewhere, or pass
--allow-partial-chunks if you intentionally want to run inference on manifests whose chunk status
is missing or still marked failed/in_progress.
Combine this with badc chunk orchestrate to move from chunk plans to inference runs in a single workflow.