Infer Commands
==============

The ``badc infer`` namespace schedules HawkEars (or a stub runner) across
local GPUs/CPUs, writes per-chunk JSON detections, and aggregates results for
downstream analysis.

.. contents:: On this page
   :local:
   :depth: 1

Overview
--------

* Input is always a chunk manifest CSV produced by ``badc chunk``.
* Outputs default to ``artifacts/infer`` (or ``/artifacts/infer`` when the
  manifest lives inside a DataLad dataset).
* Telemetry is logged per run (default ``data/telemetry/infer/_.jsonl`` or
  ``/artifacts/telemetry/…`` when the manifest lives inside a dataset) and
  the CLI prints the path so ``badc infer monitor`` / ``badc telemetry`` can
  tail it.
* ``--print-datalad-run`` exposes a ready-to-use command for
  provenance-friendly workflows.

``badc infer run``
------------------

Run HawkEars against every chunk listed in the manifest.

Usage::

   badc infer run MANIFEST.csv [--max-gpus N] [--cpu-workers N]
       [--output-dir PATH] [--runner-cmd CMD | --use-hawkears]
       [--hawkears-arg ARG ...] [--max-retries N]
       [--resume-summary PATH] [--print-datalad-run]

Key options:

``--max-gpus``
   Limit how many detected GPUs are used. Defaults to all GPUs reported by
   ``nvidia-smi``.
``--cpu-workers``
   Additional CPU worker threads to append to the GPU pool. When no GPUs
   exist, BADC still runs at least one CPU worker even if this is left at
   ``0``.
``--runner-cmd``
   Custom executable to run per chunk (e.g., a container wrapper). Mutually
   exclusive with ``--use-hawkears``.
``--use-hawkears``
   Invoke the vendored HawkEars ``analyze.py`` script directly. BADC injects
   chunk/audio arguments and parses ``HawkEars_labels.csv`` into JSON
   detections.
``--hawkears-arg``
   Repeatable passthrough argument (e.g., ``--hawkears-arg --config
   --hawkears-arg config.yaml``).
``--max-retries``
   Number of automatic retries per chunk when the runner exits non-zero
   (default 2).
``--output-dir``
   Override the destination for JSON outputs.
   When omitted *and* chunks live in a DataLad dataset, BADC writes under
   ``/artifacts/infer`` so the files remain inside the dataset boundary.
``--telemetry-log``
   Override the telemetry log path (JSONL) that records scheduler events.
   Defaults to a unique file per manifest/timestamp under
   ``data/telemetry/infer`` or ``/artifacts/telemetry``. Each run also writes
   a ``*.summary.json`` next to the log capturing per-worker/per-chunk
   outcomes for resumable workflows.
``--resume-summary``
   Provide a previously written ``*.summary.json`` (usually next to the
   telemetry log) to skip chunks already marked ``success``. Helpful when
   resuming an interrupted run or iterating on a subset of failures.
``--print-datalad-run``
   Instead of running inference, emit a ``datalad run`` command tailored to
   the manifest/output pair.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``MANIFEST.csv``
     - Chunk manifest generated by ``badc chunk`` commands.
     - Required
   * - ``--max-gpus N``
     - Upper bound on detected GPUs to enlist.
     - All GPUs
   * - ``--cpu-workers N``
     - Additional CPU worker threads (at least one CPU worker is added
       automatically when no GPUs exist).
     - ``0``
   * - ``--runner-cmd CMD``
     - Custom executable invoked per chunk.
     - Stub runner
   * - ``--use-hawkears``
     - Call vendored HawkEars ``analyze.py`` instead of a custom runner.
     - Disabled
   * - ``--hawkears-arg ARG``
     - Repeatable passthrough flag forwarded to HawkEars.
     - None
   * - ``--max-retries N``
     - Retry budget for failed chunks.
     - ``2``
   * - ``--output-dir PATH``
     - Destination folder for JSON detections.
     - ``artifacts/infer`` (dataset-aware)
   * - ``--telemetry-log PATH``
     - Telemetry log file capturing scheduler events.
     - Derived from manifest name
   * - ``--resume-summary PATH``
     - Scheduler summary JSON for resuming interrupted runs.
     - Disabled
   * - ``--print-datalad-run``
     - Emit provenance-friendly command instead of executing jobs.
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc infer run --help
   Usage: badc infer run [OPTIONS] MANIFEST

     Run HawkEars (or a custom runner) for every chunk in a manifest.

   Arguments:
     MANIFEST  Path to chunk manifest CSV.  [required]

   Options:
     --max-gpus INTEGER     Limit number of GPUs to use.
     --cpu-workers INTEGER  Extra CPU worker threads to append to the GPU pool.
     --output-dir PATH      Directory for inference outputs.
     --runner-cmd TEXT      Command used to invoke HawkEars (default stub).
     --use-hawkears / --stub-runner
                            Invoke the embedded HawkEars analyzer.
     --hawkears-arg TEXT    Extra argument to pass to HawkEars (repeatable).
     --max-retries INTEGER  Maximum retries per chunk.
     --telemetry-log PATH   Telemetry log path (JSONL).
     --resume-summary PATH  Skip chunks marked success in this scheduler
                            summary JSON.
     --print-datalad-run    Show a ready-to-run `datalad run` command.
     --help                 Show this message and exit.

Workflow notes:

* Worker pool: BADC pairs each chunk with a ``GPUWorker`` (index + UUID)
  derived from ``nvidia-smi``. ``--cpu-workers`` adds CPU threads on top of
  the GPU pool, and when no GPUs are found the CLI still spins up at least
  one CPU worker.
* Telemetry: every chunk emits a JSON record with timestamps, runtime, GPU
  index/name, and (when available) GPU utilization/memory snapshots. The CLI
  prints the log path; monitor progress via ``badc infer monitor --log``
  (rich GPU summary) or ``badc telemetry --log`` (plain tail). A sibling
  ``*.summary.json`` file captures the per-worker/per-chunk outcomes so
  interrupted runs can resume without repeating successful chunks; pass that
  path to ``--resume-summary`` on the next invocation to skip completed jobs.
* Worker summary: once all jobs finish, ``badc infer run`` prints a
  per-worker table (GPU/CPU label, total jobs, failures, successful retry
  counts, and failed attempts) so long runs surface retry hot spots without
  diving into telemetry logs.
* Failure handling: if any worker raises an exception, the scheduler stops
  submitting new jobs and re-raises the first error after threads finish.

Example::

   badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
       --use-hawkears --max-gpus 2 --hawkears-arg --min_score --hawkears-arg 0.7

See :ref:`usage-infer-examples` for more end-to-end command snippets (stub,
GPU, and datalad-run).

``badc infer run-config``
-------------------------

Load a TOML configuration (see :file:`configs/hawkears-local.toml`) and
delegate to ``badc infer run`` so teams can share presets without copying
long command lines.

Usage::

   badc infer run-config configs/hawkears-local.toml

Behavior:

* Parses the ``[runner]`` table for manifest path, GPU/CPU limits, telemetry
  log, and optional ``runner_cmd`` overrides.
* Forwards ``[hawkears].extra_args`` directly to
  ``vendor/HawkEars/analyze.py`` when ``runner.use_hawkears`` is ``true``.
* Supports ``--print-datalad-run`` to preview the exact command that would be
  executed from inside the dataset.

Options:

.. list-table::
   :header-rows: 1

   * - Option
     - Description
   * - ``--print-datalad-run``
     - Show the generated ``datalad run`` command instead of executing
       inference.

``badc infer aggregate``
------------------------

Summarize detection JSON into a CSV that analysts can ingest into notebooks,
DuckDB, or dashboards.

Usage::

   badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv \
       --parquet artifacts/aggregate/detections.parquet

Behavior:

* Walks the ``detections_dir`` and parses each JSON file via
  ``badc.aggregate`` helpers.
* When ``--manifest`` is supplied, missing chunk metadata (start/end offsets,
  hashes, recording IDs) is filled from the manifest so custom runners that
  omit chunk metadata still aggregate cleanly.
* Emits a CSV with canonical detection columns (chunk start/end offsets,
  detection start/end relative to the chunk, absolute timestamps, label
  code/name, confidence, runner + model metadata).
* Optionally writes a Parquet file (requires the ``duckdb`` package) suitable
  for DuckDB queries or downstream analytics notebooks.
* Skips empty directories with a warning so it is safe to run even before
  inference completes.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``DETECTIONS_DIR``
     - Folder containing JSON outputs from ``badc infer run``.
     - Required
   * - ``--output PATH``
     - Summary CSV destination.
     - ``artifacts/aggregate/summary.csv``
   * - ``--manifest PATH``
     - Optional chunk manifest CSV for metadata enrichment.
     - Disabled
   * - ``--parquet PATH``
     - Optional Parquet export (requires ``duckdb``).
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc infer aggregate --help
   Usage: badc infer aggregate [OPTIONS] DETECTIONS_DIR

     Aggregate per-chunk detection JSON files into canonical summaries.

   Arguments:
     DETECTIONS_DIR  Directory containing inference outputs (JSON).  [required]

   Options:
     --output PATH   Summary CSV path.
     --parquet PATH  Optional Parquet export (requires duckdb).
     --help          Show this message and exit.

Common pattern::

   badc infer aggregate /artifacts/infer --output /artifacts/aggregate/summary.csv

Combine with ``--manifest`` so chunk metadata survives even when custom
runners omit per-chunk JSON fields. Each detection row now includes
chunk-relative start/end times, absolute timestamps, label codes/names,
confidence, runner, and ``model_version`` whenever ``--use-hawkears`` is
active. Pair the command with ``datalad run`` or ``git annex`` metadata to
track how raw detections feed downstream reports.

When ``--parquet`` is enabled you can open the file directly in DuckDB::

   duckdb -c "SELECT label, count(*) FROM 'artifacts/aggregate/detections.parquet' GROUP BY 1"

Hand the Parquet export to :doc:`/cli/report` for richer console summaries
(group-by label, recording, or both) or to the
:doc:`/howto/aggregate-results` workflow for notebook analysis.
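The aggregation pass described above is, at its core, a walk-and-flatten over
per-chunk JSON files. The sketch below shows that shape with only the
standard library; the JSON layout it assumes (a top-level ``detections`` list
whose entries carry ``label``/``start``/``end``/``confidence`` keys) is a
hypothetical stand-in, not the canonical schema ``badc infer run`` writes, so
treat it as an illustration rather than a drop-in replacement for
``badc infer aggregate``:

```python
import csv
import json
from pathlib import Path


def aggregate_detections(detections_dir, output_csv):
    """Flatten per-chunk detection JSON files into a single CSV.

    NOTE: the field names used here ("detections", "label", "start",
    "end", "confidence") are illustrative assumptions; inspect the JSON
    your runner actually emits for the real schema.
    """
    fieldnames = ["chunk", "label", "start", "end", "confidence"]
    rows = []
    # Sort so output order is deterministic across runs.
    for path in sorted(Path(detections_dir).glob("*.json")):
        payload = json.loads(path.read_text())
        for det in payload.get("detections", []):
            rows.append({
                "chunk": path.stem,
                "label": det.get("label"),
                "start": det.get("start"),
                "end": det.get("end"),
                "confidence": det.get("confidence"),
            })
    with open(output_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because every row carries its source chunk ID, the resulting CSV joins
cleanly against a chunk manifest, which is the same reason the real command
offers ``--manifest`` enrichment.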
``badc infer monitor``
----------------------

Stream GPU utilization and per-chunk telemetry directly from the JSONL logs
produced by ``badc infer run``. The view renders two ``rich`` tables: a
per-GPU summary with success/failure counts, retry attempts, failed-attempt
totals, average runtimes, utilization trends (min/avg/max), peak VRAM usage,
and ASCII sparklines showing rolling utilization/VRAM/retry history, plus a
live tail of recent chunk events (status, runtime, attempt counter, GPU,
utilization/memory snapshot).

Usage::

   badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl --tail 20

Options:

``--log``
   Telemetry log path. Defaults to ``data/telemetry/infer/log.jsonl`` but
   the run command prints the exact location for each manifest/timestamp.
``--tail``
   Number of recent events to display in the lower table.
``--follow``
   Refresh the tables every ``--interval`` seconds (Ctrl+C to stop).

Use this view during long HawkEars jobs to confirm GPUs remain busy
(sustained utilization, stable VRAM headroom) and to spot retry spikes
immediately via the per-GPU retry counters, retry sparkline, and the event
tail's attempt column. The sparkline columns update every refresh when
``--follow`` is enabled, exposing rolling trends without leaving the CLI.

``badc infer orchestrate``
--------------------------

Plan inference runs across an entire dataset (or a saved chunk plan) without
executing HawkEars immediately.

Usage::

   badc infer orchestrate data/datalad/bogus \
       --manifest-dir manifests \
       --output-dir artifacts/infer \
       --telemetry-dir artifacts/telemetry \
       --plan-csv plans/infer.csv \
       --print-datalad-run

Highlights:

* Loads manifests from ``/manifests`` (or a supplied chunk plan CSV/JSON).
* Builds per-recording plans that capture manifest path, output directory,
  telemetry log, HawkEars settings, CPU/GPU worker overrides, and the chunk
  status located under ``///.chunk_status.json`` (defaults to
  ``artifacts/chunks``). Status must be ``completed`` unless you opt into
  ``--allow-partial-chunks``; this prevents Sockeye scripts and local
  ``--apply`` runs from launching inference against half-finished chunk jobs.
* ``--plan-csv`` / ``--plan-json`` save the plan for HPC submission scripts
  or future re-runs.
* ``--print-datalad-run`` emits commands such as::

     datalad run -m "Infer REC" --input manifests/REC.csv \
         --output artifacts/infer/REC \
         -- badc infer run manifests/REC.csv \
         --use-hawkears \
         --output-dir artifacts/infer/REC \
         --telemetry-log artifacts/telemetry/infer/REC.jsonl

* ``--apply`` executes :command:`badc infer run` for each plan entry using
  the saved settings. When the dataset has ``.datalad`` and the CLI is
  available, runs are wrapped in ``datalad run`` by default (disable via
  ``--no-record-datalad``) so provenance is captured automatically.
* ``--sockeye-script`` (plus the optional ``--sockeye-*`` overrides) writes a
  SLURM job-array script so Sockeye submissions no longer require
  hand-written ``sbatch`` files. Each array task maps to a manifest/output
  pair from the generated plan. Pair it with ``--sockeye-resume-completed``
  to have the script automatically append ``--resume-summary`` whenever a
  telemetry ``*.summary.json`` already exists, so reruns skip completed
  chunks. Add ``--sockeye-bundle`` to chain ``badc infer aggregate`` and
  ``badc report bundle`` right after each inference run so Phase 2
  quicklook/parquet artifacts land alongside the detections. The emitted
  script also validates the chunk status file before running HawkEars; array
  tasks exit early with a descriptive error if the status file is missing or
  reports anything other than ``completed``.
* ``--resume-completed`` tells ``--apply`` runs to look for the telemetry
  ``*.summary.json`` that the prior run produced and pass
  ``--resume-summary`` automatically so only unfinished chunks are retried.
* ``--bundle`` mirrors the Sockeye automation locally: after each ``--apply``
  recording finishes, BADC aggregates detections into
  ``artifacts/aggregate/`` and runs ``badc report bundle`` so the quicklook
  CSVs, parquet report, and DuckDB database live alongside the dataset
  without extra commands. Override paths via ``--bundle-aggregate-dir`` and
  adjust the timeline window via ``--bundle-bucket-minutes``. Append
  ``--bundle-rollup`` (plus its ``--bundle-rollup-limit`` and
  ``--bundle-rollup-export-dir`` knobs) to run
  :command:`badc report aggregate-dir` after the queue drains; the rollup
  emits dataset-wide label/recording leaderboards to ``/aggregate_summary``
  by default so Erin immediately sees cross-run coverage.

Chunk status paths default to ``artifacts/chunks//.chunk_status.json``;
customize the root with ``--chunks-dir`` when your dataset stores chunk WAVs
elsewhere, or pass ``--allow-partial-chunks`` if you intentionally want to
run inference on manifests whose chunk status is missing or still marked
``failed``/``in_progress``.

Combine this with :command:`badc chunk orchestrate` to move from chunk plans
to inference runs in a single workflow.
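The resume options above (``--resume-summary`` on ``badc infer run``,
``--resume-completed`` on orchestrated ``--apply`` runs) boil down to one
filter: drop any chunk a prior run already marked successful. A minimal
sketch of that filter follows; the summary layout it assumes, a ``chunks``
mapping of chunk IDs to a ``status`` string, is a hypothetical stand-in for
the real ``*.summary.json`` schema:

```python
import json
from pathlib import Path


def chunks_to_run(manifest_chunk_ids, summary_path):
    """Drop chunk IDs already marked success in a prior run's summary.

    NOTE: the layout assumed here, {"chunks": {"<id>": {"status": ...}}},
    is a hypothetical stand-in for the real *.summary.json written by
    `badc infer run`. A missing summary file means nothing is skipped,
    matching a fresh run.
    """
    completed = set()
    summary = Path(summary_path)
    if summary.exists():
        data = json.loads(summary.read_text())
        completed = {
            chunk_id
            for chunk_id, info in data.get("chunks", {}).items()
            if info.get("status") == "success"
        }
    # Preserve manifest order so retries run in the original sequence.
    return [c for c in manifest_chunk_ids if c not in completed]
```

Treating an absent summary as "run everything" is what makes the flag safe to
pass unconditionally, which is how the generated Sockeye scripts use
``--sockeye-resume-completed``.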