Infer Commands
==============

The ``badc infer`` namespace schedules HawkEars (or a stub runner) across
local GPUs/CPUs, writes per-chunk JSON detections, and aggregates results for
downstream analysis.

.. contents:: On this page
   :local:
   :depth: 1

Overview
--------

* Input is always a chunk manifest CSV produced by ``badc chunk``.
* Outputs default to ``artifacts/infer`` (or ``/artifacts/infer`` when the
  manifest lives inside a DataLad dataset).
* Telemetry is logged per run (default ``data/telemetry/infer/_.jsonl`` or
  ``/artifacts/telemetry/…`` when the manifest lives inside a dataset) and
  the CLI prints the path so ``badc infer monitor`` / ``badc telemetry`` can
  tail it.
* ``--print-datalad-run`` exposes a ready-to-use command for
  provenance-friendly workflows.

``badc infer run``
------------------

Run HawkEars against every chunk listed in the manifest.

Usage::

   badc infer run MANIFEST.csv [--max-gpus N] [--cpu-workers N]
       [--output-dir PATH] [--runner-cmd CMD | --use-hawkears]
       [--hawkears-arg ARG ...] [--max-retries N]
       [--resume-summary PATH] [--print-datalad-run]

Key options:

``--max-gpus``
   Limit how many detected GPUs are used. Defaults to all GPUs reported by
   ``nvidia-smi``.
``--cpu-workers``
   Additional CPU worker threads to append to the GPU pool. When no GPUs
   exist, BADC still runs at least one CPU worker even if this is left at
   ``0``.
``--runner-cmd``
   Custom executable to run per chunk (e.g., a container wrapper). Mutually
   exclusive with ``--use-hawkears``.
``--use-hawkears``
   Invoke the vendored HawkEars ``analyze.py`` script directly. BADC injects
   chunk/audio arguments and parses ``HawkEars_labels.csv`` into JSON
   detections.
``--hawkears-arg``
   Repeatable passthrough argument (e.g., ``--hawkears-arg --config
   --hawkears-arg config.yaml``).
``--max-retries``
   Number of automatic retries per chunk when the runner exits non-zero
   (default 2).
``--output-dir``
   Override the destination for JSON outputs.
   When omitted *and* chunks live in a DataLad dataset, BADC writes under
   ``/artifacts/infer`` so the files remain inside the dataset boundary.
``--telemetry-log``
   Override the telemetry log path (JSONL) that records scheduler events.
   Defaults to a unique file per manifest/timestamp under
   ``data/telemetry/infer`` or ``/artifacts/telemetry``. Each run also writes
   a ``*.summary.json`` next to the log capturing per-worker/per-chunk
   outcomes for resumable workflows.
``--resume-summary``
   Provide a previously written ``*.summary.json`` (usually next to the
   telemetry log) to skip chunks already marked ``success``. Helpful when
   resuming an interrupted run or iterating on a subset of failures.
``--print-datalad-run``
   Instead of running inference, emit a ``datalad run`` command tailored to
   the manifest/output pair.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``MANIFEST.csv``
     - Chunk manifest generated by ``badc chunk`` commands.
     - Required
   * - ``--max-gpus N``
     - Upper bound on detected GPUs to enlist.
     - All GPUs
   * - ``--cpu-workers N``
     - Additional CPU worker threads (at least one CPU worker is added
       automatically when no GPUs exist).
     - ``0``
   * - ``--runner-cmd CMD``
     - Custom executable invoked per chunk.
     - Stub runner
   * - ``--use-hawkears``
     - Call vendored HawkEars ``analyze.py`` instead of a custom runner.
     - Disabled
   * - ``--hawkears-arg ARG``
     - Repeatable passthrough flag forwarded to HawkEars.
     - None
   * - ``--max-retries N``
     - Retry budget for failed chunks.
     - ``2``
   * - ``--output-dir PATH``
     - Destination folder for JSON detections.
     - ``artifacts/infer`` (dataset-aware)
   * - ``--telemetry-log PATH``
     - Telemetry log file capturing scheduler events.
     - Derived from manifest name
   * - ``--resume-summary PATH``
     - Scheduler summary JSON for resuming interrupted runs.
     - Disabled
   * - ``--print-datalad-run``
     - Emit provenance-friendly command instead of executing jobs.
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc infer run --help
   Usage: badc infer run [OPTIONS] MANIFEST

     Run HawkEars (or a custom runner) for every chunk in a manifest.

   Arguments:
     MANIFEST  Path to chunk manifest CSV.  [required]

   Options:
     --max-gpus INTEGER     Limit number of GPUs to use.
     --cpu-workers INTEGER  Extra CPU worker threads to append to the GPU pool.
     --output-dir PATH      Directory for inference outputs.
     --runner-cmd TEXT      Command used to invoke HawkEars (default stub).
     --use-hawkears / --stub-runner
                            Invoke the embedded HawkEars analyzer.
     --hawkears-arg TEXT    Extra argument to pass to HawkEars (repeatable).
     --max-retries INTEGER  Maximum retries per chunk.
     --telemetry-log PATH   Telemetry log path (JSONL).
     --resume-summary PATH  Skip chunks marked success in this scheduler
                            summary JSON.
     --print-datalad-run    Show a ready-to-run `datalad run` command.
     --help                 Show this message and exit.

Workflow notes:

* Worker pool: BADC pairs each chunk with a ``GPUWorker`` (index + UUID)
  derived from ``nvidia-smi``. ``--cpu-workers`` adds CPU threads on top of
  the GPU pool, and when no GPUs are found the CLI still spins up at least
  one CPU worker.
* Telemetry: every chunk emits a JSON record with timestamps, runtime, GPU
  index/name, and (when available) GPU utilization/memory snapshots. The CLI
  prints the log path; monitor progress via ``badc infer monitor --log``
  (rich GPU summary) or ``badc telemetry --log`` (plain tail). A sibling
  ``*.summary.json`` file captures the per-worker/per-chunk outcomes so
  interrupted runs can resume without repeating successful chunks; pass that
  path to ``--resume-summary`` on the next invocation to skip completed jobs.
* Worker summary: once all jobs finish, ``badc infer run`` prints a
  per-worker table (GPU/CPU label, total jobs, failures, successful retry
  counts, and failed attempts) so long runs surface retry hot spots without
  diving into telemetry logs.
* Failure handling: if any worker raises an exception, the scheduler stops
  submitting new jobs and re-raises the first error after threads finish.

Example::

   badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
       --use-hawkears --max-gpus 2 --hawkears-arg --min_score --hawkears-arg 0.7

See :ref:`usage-infer-examples` for more end-to-end command snippets (stub,
GPU, and datalad-run).

``badc infer run-config``
-------------------------

Load a TOML configuration (see :file:`configs/hawkears-local.toml`) and
delegate to ``badc infer run`` so teams can share presets without copying
long command lines.

Usage::

   badc infer run-config configs/hawkears-local.toml

Behavior:

* Parses the ``[runner]`` table for manifest path, GPU/CPU limits, telemetry
  log, and optional ``runner_cmd`` overrides.
* Forwards ``[hawkears].extra_args`` directly to
  ``vendor/HawkEars/analyze.py`` when ``runner.use_hawkears`` is ``true``.
* Supports ``--print-datalad-run`` to preview the exact command that would be
  executed from inside the dataset.

Options:

.. list-table::
   :header-rows: 1

   * - Option
     - Description
   * - ``--print-datalad-run``
     - Show the generated ``datalad run`` command instead of executing
       inference.

``badc infer aggregate``
------------------------

Summarize detection JSON into a CSV that analysts can ingest into notebooks,
DuckDB, or dashboards.

Usage::

   badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv \
       --parquet artifacts/aggregate/detections.parquet

Behavior:

* Walks the ``detections_dir`` and parses each JSON file via
  ``badc.aggregate`` helpers.
* When ``--manifest`` is supplied, missing chunk metadata (start/end offsets,
  hashes, recording IDs) is filled from the manifest so custom runners that
  omit chunk metadata still aggregate cleanly.
* Emits a CSV with canonical detection columns (chunk start/end offsets,
  detection start/end relative to the chunk, absolute timestamps, label
  code/name, confidence, runner + model metadata).
* Optionally writes a Parquet file (requires the ``duckdb`` package) suitable
  for DuckDB queries or downstream analytics notebooks.
* Skips empty directories with a warning so it is safe to run even before
  inference completes.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``DETECTIONS_DIR``
     - Folder containing JSON outputs from ``badc infer run``.
     - Required
   * - ``--output PATH``
     - Summary CSV destination.
     - ``artifacts/aggregate/summary.csv``
   * - ``--manifest PATH``
     - Optional chunk manifest CSV for metadata enrichment.
     - Disabled
   * - ``--parquet PATH``
     - Optional Parquet export (requires ``duckdb``).
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc infer aggregate --help
   Usage: badc infer aggregate [OPTIONS] DETECTIONS_DIR

     Aggregate per-chunk detection JSON files into canonical summaries.

   Arguments:
     DETECTIONS_DIR  Directory containing inference outputs (JSON).  [required]

   Options:
     --output PATH   Summary CSV path.
     --parquet PATH  Optional Parquet export (requires duckdb).
     --help          Show this message and exit.

Common pattern::

   badc infer aggregate /artifacts/infer --output /artifacts/aggregate/summary.csv

Combine with ``--manifest`` so chunk metadata survives even when custom
runners omit per-chunk JSON fields. Each detection row now includes
chunk-relative start/end times, absolute timestamps, label codes/names,
confidence, runner, and ``model_version`` whenever ``--use-hawkears`` is
active. Pair the command with ``datalad run`` or ``git annex`` metadata to
track how raw detections feed downstream reports.

When ``--parquet`` is enabled you can open the file directly in DuckDB::

   duckdb -c "SELECT label, count(*) FROM 'artifacts/aggregate/detections.parquet' GROUP BY 1"

Hand the Parquet export to :doc:`/cli/report` for richer console summaries
(group-by label, recording, or both) or to the
:doc:`/howto/aggregate-results` workflow for notebook analysis.
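The aggregation pass described above is, at its core, a walk-and-flatten over
per-chunk JSON files. The sketch below shows that shape with only the
standard library; the JSON layout it assumes (a top-level ``detections`` list
whose entries carry ``label``/``start``/``end``/``confidence`` keys) is a
hypothetical stand-in, not the canonical schema ``badc infer run`` writes, so
treat it as an illustration rather than a drop-in replacement for
``badc infer aggregate``:

```python
import csv
import json
from pathlib import Path


def aggregate_detections(detections_dir, output_csv):
    """Flatten per-chunk detection JSON files into a single CSV.

    NOTE: the field names used here ("detections", "label", "start",
    "end", "confidence") are illustrative assumptions; inspect the JSON
    your runner actually emits for the real schema.
    """
    fieldnames = ["chunk", "label", "start", "end", "confidence"]
    rows = []
    # Sort so output order is deterministic across runs.
    for path in sorted(Path(detections_dir).glob("*.json")):
        payload = json.loads(path.read_text())
        for det in payload.get("detections", []):
            rows.append({
                "chunk": path.stem,
                "label": det.get("label"),
                "start": det.get("start"),
                "end": det.get("end"),
                "confidence": det.get("confidence"),
            })
    with open(output_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because every row carries its source chunk ID, the resulting CSV joins
cleanly against a chunk manifest, which is the same reason the real command
offers ``--manifest`` enrichment.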
``badc infer monitor``
----------------------

Stream GPU utilization and per-chunk telemetry directly from the JSONL logs
produced by ``badc infer run``. The view renders two ``rich`` tables: a
per-GPU summary with success/failure counts, retry attempts, failed-attempt
totals, average runtimes, utilization trends (min/avg/max), peak VRAM usage,
and ASCII sparklines showing rolling utilization/VRAM/retry history, plus a
live tail of recent chunk events (status, runtime, attempt counter, GPU,
utilization/memory snapshot).

Usage::

   badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl --tail 20

Options:

``--log``
   Telemetry log path. Defaults to ``data/telemetry/infer/log.jsonl`` but
   the run command prints the exact location for each manifest/timestamp.
``--tail``
   Number of recent events to display in the lower table.
``--follow``
   Refresh the tables every ``--interval`` seconds (Ctrl+C to stop).

Use this view during long HawkEars jobs to confirm GPUs remain busy
(sustained utilization, stable VRAM headroom) and to spot retry spikes
immediately via the per-GPU retry counters, retry sparkline, and the event
tail's attempt column. The sparkline columns update every refresh when
``--follow`` is enabled, exposing rolling trends without leaving the CLI.

``badc infer orchestrate``
--------------------------

Plan inference runs across an entire dataset (or a saved chunk plan) without
executing HawkEars immediately.

Usage::

   badc infer orchestrate data/datalad/bogus \
       --manifest-dir manifests \
       --output-dir artifacts/infer \
       --telemetry-dir artifacts/telemetry \
       --plan-csv plans/infer.csv \
       --print-datalad-run

Highlights:

* Loads manifests from ``/manifests`` (or a supplied chunk plan CSV/JSON).
* Builds per-recording plans that capture manifest path, output directory,
  telemetry log, HawkEars settings, CPU/GPU worker overrides, and the chunk
  status located under ``///.chunk_status.json`` (defaults to
  ``artifacts/chunks``). Status must be ``completed`` unless you opt into
  ``--allow-partial-chunks``; this prevents Sockeye scripts and local
  ``--apply`` runs from launching inference against half-finished chunk jobs.
* ``--plan-csv`` / ``--plan-json`` save the plan for HPC submission scripts
  or future re-runs.
* ``--print-datalad-run`` emits commands such as::

     datalad run -m "Infer REC" --input manifests/REC.csv \
         --output artifacts/infer/REC \
         -- badc infer run manifests/REC.csv \
         --use-hawkears \
         --output-dir artifacts/infer/REC \
         --telemetry-log artifacts/telemetry/infer/REC.jsonl

* ``--apply`` executes :command:`badc infer run` for each plan entry using
  the saved settings. When the dataset has ``.datalad`` and the CLI is
  available, runs are wrapped in ``datalad run`` by default (disable via
  ``--no-record-datalad``) so provenance is captured automatically.
* ``--sockeye-script`` (plus the optional ``--sockeye-*`` overrides) writes a
  SLURM job-array script so Sockeye submissions no longer require
  hand-written ``sbatch`` files. Each array task maps to a manifest/output
  pair from the generated plan. Pair it with ``--sockeye-resume-completed``
  to have the script automatically append ``--resume-summary`` whenever a
  telemetry ``*.summary.json`` already exists, so reruns skip completed
  chunks. Add ``--sockeye-bundle`` to chain ``badc infer aggregate`` and
  ``badc report bundle`` right after each inference run so Phase 2
  quicklook/parquet artifacts land alongside the detections. The emitted
  script also validates the chunk status file before running HawkEars; array
  tasks exit early with a descriptive error if the status file is missing or
  reports anything other than ``completed``.
* ``--resume-completed`` tells ``--apply`` runs to look for the telemetry
  ``*.summary.json`` that the prior run produced and pass
  ``--resume-summary`` automatically so only unfinished chunks are retried.
* ``--bundle`` mirrors the Sockeye automation locally: after each ``--apply``
  recording finishes, BADC aggregates detections into
  ``artifacts/aggregate/`` and runs ``badc report bundle`` so the quicklook
  CSVs, parquet report, and DuckDB database live alongside the dataset
  without extra commands. Override paths via ``--bundle-aggregate-dir`` and
  adjust the timeline window via ``--bundle-bucket-minutes``. Append
  ``--bundle-rollup`` (plus its ``--bundle-rollup-limit`` and
  ``--bundle-rollup-export-dir`` knobs) to run
  :command:`badc report aggregate-dir` after the queue drains; the rollup
  emits dataset-wide label/recording leaderboards to ``/aggregate_summary``
  by default so Erin immediately sees cross-run coverage.

Chunk status paths default to ``artifacts/chunks//.chunk_status.json``;
customize the root with ``--chunks-dir`` when your dataset stores chunk WAVs
elsewhere, or pass ``--allow-partial-chunks`` if you intentionally want to
run inference on manifests whose chunk status is missing or still marked
``failed``/``in_progress``.

Combine this with :command:`badc chunk orchestrate` to move from chunk plans
to inference runs in a single workflow.
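The resume options above (``--resume-summary`` on ``badc infer run``,
``--resume-completed`` on orchestrated ``--apply`` runs) boil down to one
filter: drop any chunk a prior run already marked successful. A minimal
sketch of that filter follows; the summary layout it assumes, a ``chunks``
mapping of chunk IDs to a ``status`` string, is a hypothetical stand-in for
the real ``*.summary.json`` schema:

```python
import json
from pathlib import Path


def chunks_to_run(manifest_chunk_ids, summary_path):
    """Drop chunk IDs already marked success in a prior run's summary.

    NOTE: the layout assumed here, {"chunks": {"<id>": {"status": ...}}},
    is a hypothetical stand-in for the real *.summary.json written by
    `badc infer run`. A missing summary file means nothing is skipped,
    matching a fresh run.
    """
    completed = set()
    summary = Path(summary_path)
    if summary.exists():
        data = json.loads(summary.read_text())
        completed = {
            chunk_id
            for chunk_id, info in data.get("chunks", {}).items()
            if info.get("status") == "success"
        }
    # Preserve manifest order so retries run in the original sequence.
    return [c for c in manifest_chunk_ids if c not in completed]
```

Treating an absent summary as "run everything" is what makes the flag safe to
pass unconditionally, which is how the generated Sockeye scripts use
``--sockeye-resume-completed``.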