Infer Commands

The badc infer namespace schedules HawkEars (or a stub runner) across local GPUs/CPUs, writes per-chunk JSON detections, and aggregates results for downstream analysis.

Overview

  • Input is always a chunk manifest CSV produced by badc chunk.

  • Outputs default to artifacts/infer (or <dataset>/artifacts/infer when the manifest lives inside a DataLad dataset).

  • Telemetry is logged per run (default data/telemetry/infer/<manifest>_<timestamp>.jsonl or <dataset>/artifacts/telemetry/… when the manifest lives inside a dataset) and the CLI prints the path so badc infer monitor / badc telemetry can tail it.

  • --print-datalad-run exposes a ready-to-use command for provenance-friendly workflows.

badc infer run

Run HawkEars against every chunk listed in the manifest.

Usage:

badc infer run MANIFEST.csv [--max-gpus N] [--cpu-workers N]
    [--output-dir PATH] [--runner-cmd CMD | --use-hawkears]
    [--hawkears-arg ARG ...] [--max-retries N]
    [--resume-summary PATH] [--print-datalad-run]

Key options:

--max-gpus

Limit how many detected GPUs are used. Defaults to “all GPUs reported by nvidia-smi”.

--cpu-workers

Additional CPU worker threads to append to the GPU pool. When no GPUs exist, BADC still runs at least one CPU worker even if this is left at 0.

--runner-cmd

Custom executable to run per chunk (e.g., a container wrapper). Mutually exclusive with --use-hawkears.

--use-hawkears

Invoke the vendored HawkEars analyze.py script directly. BADC injects chunk/audio arguments and parses HawkEars_labels.csv into JSON detections.

--hawkears-arg

Repeatable passthrough argument (e.g., --hawkears-arg --config --hawkears-arg config.yaml).

--max-retries

Number of automatic retries per chunk when the runner exits non-zero (default 2).

--output-dir

Override the destination for JSON outputs. When omitted and chunks live in a DataLad dataset, BADC writes under <dataset>/artifacts/infer so the files remain inside the dataset boundary.

--telemetry-log

Override the telemetry log path (JSONL) that records scheduler events. Defaults to a unique file per manifest/timestamp under data/telemetry/infer or <dataset>/artifacts/telemetry. Each run also writes a *.summary.json next to the log capturing per-worker/per-chunk outcomes for resumable workflows.

--resume-summary

Provide a previously written *.summary.json (usually next to the telemetry log) to skip chunks already marked success. Helpful when resuming an interrupted run or iterating on a subset of failures (see the examples after this option list).

--print-datalad-run

Instead of running inference, emit a datalad run command tailored to the manifest/output pair.
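
For example, a few common invocations (paths and the summary filename are illustrative; badc infer run prints the exact telemetry paths for each run):

# Stub runner on CPU only, handy for smoke tests
badc infer run manifests/REC.csv --cpu-workers 4

# Resume an interrupted HawkEars run, skipping chunks already marked success
badc infer run manifests/REC.csv --use-hawkears \
    --resume-summary data/telemetry/infer/REC_<timestamp>.summary.json

# Preview the provenance-friendly command instead of executing jobs
badc infer run manifests/REC.csv --print-datalad-run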

Option reference

| Option / Argument | Description | Default |
| --- | --- | --- |
| MANIFEST.csv | Chunk manifest generated by badc chunk commands. | Required |
| --max-gpus N | Upper bound on detected GPUs to enlist. | All GPUs |
| --cpu-workers N | Additional CPU worker threads (at least one CPU worker is added automatically when no GPUs exist). | 0 |
| --runner-cmd CMD | Custom executable invoked per chunk. | Stub runner |
| --use-hawkears | Call vendored HawkEars analyze.py instead of a custom runner. | Disabled |
| --hawkears-arg ARG | Repeatable passthrough flag forwarded to HawkEars. | None |
| --max-retries N | Retry budget for failed chunks. | 2 |
| --output-dir PATH | Destination folder for JSON detections. | artifacts/infer (dataset-aware) |
| --telemetry-log PATH | Telemetry log file capturing scheduler events. | Derived from manifest name |
| --resume-summary PATH | Scheduler summary JSON for resuming interrupted runs. | Disabled |
| --print-datalad-run | Emit provenance-friendly command instead of executing jobs. | Disabled |

Help excerpt

$ badc infer run --help
Usage: badc infer run [OPTIONS] MANIFEST
  Run HawkEars (or a custom runner) for every chunk in a manifest.
Arguments:
  MANIFEST  Path to chunk manifest CSV.  [required]
Options:
  --max-gpus INTEGER       Limit number of GPUs to use.
  --cpu-workers INTEGER    Extra CPU worker threads to append to the GPU pool.
  --output-dir PATH        Directory for inference outputs.
  --runner-cmd TEXT        Command used to invoke HawkEars (default stub).
  --use-hawkears / --stub-runner  Invoke the embedded HawkEars analyzer.
  --hawkears-arg TEXT      Extra argument to pass to HawkEars (repeatable).
  --max-retries INTEGER    Maximum retries per chunk.
  --telemetry-log PATH     Telemetry log path (JSONL).
  --resume-summary PATH    Skip chunks marked success in this scheduler summary JSON.
  --print-datalad-run      Show a ready-to-run `datalad run` command.
  --help                   Show this message and exit.

Workflow notes:

  • Worker pool: BADC pairs each chunk with a GPUWorker (index + UUID) derived from nvidia-smi. --cpu-workers adds CPU threads on top of the GPU pool, and when no GPUs are found the CLI still spins up at least one CPU worker.

  • Telemetry: every chunk emits a JSON record with timestamps, runtime, GPU index/name, and (when available) GPU utilization/memory snapshots. The CLI prints the log path; monitor progress via badc infer monitor --log <file> (rich GPU summary) or badc telemetry --log <file> (plain tail). A sibling *.summary.json file captures per-worker/per-chunk outcomes so interrupted runs can resume without repeating successful chunks: pass that path to --resume-summary <telemetry.summary.json> on the next invocation to skip completed jobs. Both files are plain JSON/JSONL, so standard tools can inspect them (see the sketch after these notes).

  • Worker summary: once all jobs finish, badc infer run prints a per-worker table (GPU/CPU label, total jobs, failures, successful retry counts, and failed attempts) so long runs surface retry hot spots without diving into telemetry logs.

  • Failure handling: if any worker raises an exception, the scheduler stops submitting new jobs and re-raises the first error after threads finish.
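
Because the telemetry log is plain JSONL, standard tools can slice it too; a minimal sketch with jq (the status field name is an assumption inferred from the event descriptions above, so inspect a record first):

# count chunk events per status; ".status" is an assumed field name
jq -r '.status' data/telemetry/infer/REC_<timestamp>.jsonl | sort | uniq -c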

Example:

badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
    --use-hawkears --max-gpus 2 --hawkears-arg --min_score --hawkears-arg 0.7

See Run inference for more end-to-end command snippets (stub, GPU, and datalad-run).

badc infer run-config

Load a TOML configuration (see configs/hawkears-local.toml) and delegate to badc infer run so teams can share presets without copying long command lines.

Usage:

badc infer run-config configs/hawkears-local.toml

Behavior:

  • Parses the [runner] table for manifest path, GPU/CPU limits, telemetry log, and optional runner_cmd overrides.

  • Forwards [hawkears].extra_args directly to vendor/HawkEars/analyze.py when runner.use_hawkears is true.

  • Supports --print-datalad-run to preview the exact command that would be executed from inside the dataset.
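
A minimal configuration sketch, assuming key names that mirror the CLI flags ([runner], [hawkears], runner_cmd, use_hawkears, and extra_args are confirmed above; the remaining key names are guesses to verify against configs/hawkears-local.toml):

[runner]
manifest = "manifests/REC.csv"        # assumed key name (mirrors the MANIFEST argument)
max_gpus = 2                          # assumed key name (mirrors --max-gpus)
cpu_workers = 0                       # assumed key name (mirrors --cpu-workers)
telemetry_log = "artifacts/telemetry/infer/REC.jsonl"  # assumed key name
use_hawkears = true                   # confirmed: runner.use_hawkears
# runner_cmd = "./my-wrapper.sh"      # confirmed key; mutually exclusive with use_hawkears

[hawkears]
extra_args = ["--min_score", "0.7"]   # confirmed: forwarded to vendor/HawkEars/analyze.py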

Options:

| Option | Description |
| --- | --- |
| --print-datalad-run | Show the generated datalad run command instead of executing inference. |
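
To preview the exact command before committing to a run:

badc infer run-config configs/hawkears-local.toml --print-datalad-run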

badc infer aggregate

Summarize detection JSON into a CSV that analysts can ingest into notebooks, DuckDB, or dashboards.

Usage:

badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv \
    --parquet artifacts/aggregate/detections.parquet

Behavior:

  • Walks the detections_dir and parses each JSON file via badc.aggregate helpers.

  • When --manifest is supplied, missing chunk metadata (start/end offsets, hashes, recording IDs) is filled from the manifest so custom runners that omit chunk metadata still aggregate cleanly.

  • Emits a CSV with canonical detection columns (chunk start/end offsets, detection start/end relative to the chunk, absolute timestamps, label code/name, confidence, runner + model metadata).

  • Optionally writes a Parquet file (requires the duckdb package) suitable for DuckDB queries or downstream analytics notebooks.

  • Skips empty directories with a warning so it is safe to run even before inference completes.
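
For example, enriching output from a custom runner with manifest metadata (paths are illustrative):

badc infer aggregate artifacts/infer \
    --manifest data/datalad/bogus/manifests/GNWT-290.csv \
    --output artifacts/aggregate/summary.csv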

Option reference

| Option / Argument | Description | Default |
| --- | --- | --- |
| DETECTIONS_DIR | Folder containing JSON outputs from badc infer run. | Required |
| --output PATH | Summary CSV destination. | artifacts/aggregate/summary.csv |
| --manifest PATH | Optional chunk manifest CSV for metadata enrichment. | Disabled |
| --parquet PATH | Optional Parquet export (requires duckdb). | Disabled |

Help excerpt

$ badc infer aggregate --help
Usage: badc infer aggregate [OPTIONS] DETECTIONS_DIR
  Aggregate per-chunk detection JSON files into canonical summaries.
Arguments:
  DETECTIONS_DIR  Directory containing inference outputs (JSON).  [required]
Options:
  --output PATH   Summary CSV path.
  --parquet PATH  Optional Parquet export (requires duckdb).
  --help          Show this message and exit.

Common pattern:

badc infer aggregate <dataset>/artifacts/infer --output <dataset>/artifacts/aggregate/summary.csv

Combine with --manifest so chunk metadata survives even when custom runners omit per-chunk JSON fields. Each detection row now includes chunk-relative start/end times, absolute timestamps, label codes/names, confidence, runner, and model_version whenever --use-hawkears is active. Pair the command with datalad run or git annex metadata to track how raw detections feed downstream reports. When --parquet is enabled you can open the file directly in DuckDB:

duckdb -c "SELECT label, count(*) FROM 'artifacts/aggregate/detections.parquet' GROUP BY 1"

Hand the Parquet export to Report Commands for richer console summaries (group-by label, recording, or both) or to the Aggregate Detection Results workflow for notebook analysis.

badc infer monitor

Stream GPU utilization and per-chunk telemetry directly from the JSONL logs produced by badc infer run. The view renders two rich tables: a per-GPU summary with success/failure counts, retry attempts, failed-attempt totals, average runtimes, utilization trends (min/avg/max), peak VRAM usage, and ASCII sparklines of rolling utilization/VRAM/retry history; and a live tail of recent chunk events showing status, runtime, attempt counter, GPU, and a utilization/memory snapshot.

Usage:

badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl --tail 20

Options:

--log

Telemetry log path. Defaults to data/telemetry/infer/log.jsonl but the run command prints the exact location for each manifest/timestamp.

--tail

Number of recent events to display in the lower table.

--follow

Refresh the tables every --interval seconds (Ctrl+C to stop).
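
For a live dashboard during a long run (the interval value is illustrative):

badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl \
    --tail 30 --follow --interval 5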

Use this view during long HawkEars jobs to confirm GPUs remain busy (sustained utilization, stable VRAM headroom) and to spot retry spikes immediately via the per-GPU retry counters, the retry sparkline, and the event tail’s attempt column. When --follow is enabled, the sparkline columns update on every refresh, exposing rolling trends without leaving the CLI.

badc infer orchestrate

Plan inference runs across an entire dataset (or a saved chunk plan) without executing HawkEars immediately.

Usage:

badc infer orchestrate data/datalad/bogus \
    --manifest-dir manifests \
    --output-dir artifacts/infer \
    --telemetry-dir artifacts/telemetry \
    --plan-csv plans/infer.csv \
    --print-datalad-run

Highlights:

  • Loads manifests from <dataset>/manifests (or a supplied chunk plan CSV/JSON).

  • Builds per-recording plans that capture manifest path, output directory, telemetry log, HawkEars settings, CPU/GPU worker overrides, and the chunk status located under <dataset>/<chunks-dir>/<recording>/.chunk_status.json (defaults to artifacts/chunks). Status must be completed unless you opt into --allow-partial-chunks; this prevents Sockeye scripts and local --apply runs from launching inference against half-finished chunk jobs.

  • --plan-csv / --plan-json save the plan for HPC submission scripts or future re-runs.

  • --print-datalad-run emits commands such as:

    datalad run -m "Infer REC" --input manifests/REC.csv \
      --output artifacts/infer/REC \
      -- badc infer run manifests/REC.csv \
           --use-hawkears \
           --output-dir artifacts/infer/REC \
           --telemetry-log artifacts/telemetry/infer/REC.jsonl
    
  • --apply executes badc infer run for each plan entry using the saved settings. When the dataset has .datalad and the CLI is available, runs are wrapped in datalad run by default (disable via --no-record-datalad) so provenance is captured automatically.

  • --sockeye-script (plus the optional --sockeye-* overrides) writes a SLURM job-array script so Sockeye submissions no longer require hand-written sbatch files. Each array task maps to a manifest/output pair from the generated plan. Pair it with --sockeye-resume-completed to have the script automatically append --resume-summary whenever a telemetry *.summary.json already exists, so reruns skip completed chunks. Add --sockeye-bundle to chain badc infer aggregate and badc report bundle right after each inference run so Phase 2 quicklook/parquet artifacts land alongside the detections. The emitted script now also validates the chunk status file before running HawkEars; array tasks exit early with a descriptive error if the status file is missing or reports anything other than completed.

  • --resume-completed tells --apply runs to look for the telemetry *.summary.json that the prior run produced and pass --resume-summary automatically so only unfinished chunks are retried.

  • --bundle mirrors the Sockeye automation locally: after each --apply recording finishes, BADC aggregates detections into artifacts/aggregate/ and runs badc report bundle so the quicklook CSVs, parquet report, and DuckDB database live alongside the dataset without extra commands. Override paths via --bundle-aggregate-dir and adjust the timeline window via --bundle-bucket-minutes. Append --bundle-rollup (plus its --bundle-rollup-limit and --bundle-rollup-export-dir knobs) to run badc report aggregate-dir after the queue drains; the rollup emits dataset-wide label/recording leaderboards to <aggregate_dir>/aggregate_summary by default so Erin immediately sees cross-run coverage.
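
Two end-to-end sketches using only the flags above (the script path is illustrative):

# Emit a Sockeye job-array script that resumes and bundles automatically
badc infer orchestrate data/datalad/bogus \
    --plan-csv plans/infer.csv \
    --sockeye-script scripts/infer_array.sh \
    --sockeye-resume-completed --sockeye-bundle

# Or execute locally with provenance capture, resume, and bundling
badc infer orchestrate data/datalad/bogus \
    --apply --resume-completed --bundle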

Chunk status paths default to artifacts/chunks/<recording>/.chunk_status.json; customize the root with --chunks-dir when your dataset stores chunk WAVs elsewhere, or pass --allow-partial-chunks if you intentionally want to run inference on manifests whose chunk status is missing or still marked failed/in_progress.

Combine this with badc chunk orchestrate to move from chunk plans to inference runs in a single workflow.