Usage Overview
==============

The snippets below show the canonical BADC workflow end to end. Each section
links back to the corresponding CLI reference page so you can dive deeper when
needed.

.. _usage-bootstrap:

Bootstrap a checkout
--------------------

The CLI entry point is ``badc``. After cloning the repo, initialise submodules
and connect the bogus DataLad dataset so sample audio is available locally::

   $ git submodule update --init --recursive
   $ badc data connect bogus --pull

``badc data connect`` records the dataset path in ``~/.config/badc/data.toml``
(see :doc:`cli/data`). You can confirm the registry at any time::

   $ badc data status
   Tracked datasets:
     - bogus: connected (/home/gep/projects/badc/data/datalad/bogus)

To detach the dataset (and optionally drop annexed content)::

   $ badc data disconnect bogus --drop-content
   Dataset bogus marked as disconnected; data removed.

.. _usage-chunk-examples:

Chunk audio and build manifests
-------------------------------

Refer to :doc:`cli/chunk` for option details. A typical sequence:

1. Probe a file to estimate viable chunk durations::

      $ badc chunk probe data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
          --initial-duration 120 --max-duration 600 --tolerance 10
      Recommended chunk duration: 248.00 s (strategy: memory_estimator_v1)
      Notes: GPU 0 (Quadro RTX 4000) limit 6554 MiB
      Telemetry log: artifacts/telemetry/chunk_probe/XXXX-000_20251001_093000_20251208T210945Z.jsonl
      Recent attempts:
        • 120.00s -> 1050.5 MiB fits (fits memory budget)
        • 360.00s -> 3151.6 MiB fits (fits memory budget)
        • 480.00s -> 4202.1 MiB fits (fits memory budget)

2. Generate a manifest without writing audio (hashes optional)::

      $ badc chunk manifest data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
          --chunk-duration 60 --hash-chunks \
          --output manifests/XXXX-000_20251001_093000.csv
      Wrote manifest with chunk duration 60s to manifests/XXXX-000_20251001_093000.csv (with hashes)
3. Split chunks to disk and emit a manifest in one pass::

      $ badc chunk run data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
          --chunk-duration 60 --overlap 5 --output-dir artifacts/chunks \
          --manifest manifests/XXXX-000_20251001_093000.csv
      Chunks written to artifacts/chunks; manifest at manifests/XXXX-000_20251001_093000.csv

When planning large batch jobs, pair ``badc chunk run`` with ``datalad run``
(see :doc:`howto/chunk-audio`) so provenance is captured alongside the
generated WAVs.

.. _usage-infer-examples:

Run inference
-------------

The :doc:`cli/infer` page covers every option; the quick hits below show
common patterns.

Stub/local runs (no HawkEars, great for CI)::

   $ badc infer run manifests/XXXX-000_20251001_093000.csv --runner-cmd "echo hawkears-stub"
   Processed 3 jobs; outputs stored in artifacts/infer

Leverage HawkEars directly (requires CUDA + vendor checkout)::

   $ badc infer run manifests/XXXX-000_20251001_093000.csv --use-hawkears \
       --hawkears-arg --min_score --hawkears-arg 0.7
   Processed 3 jobs; outputs stored in artifacts/infer

Reuse the shared config file instead of hand-writing flags::

   $ badc infer run-config configs/hawkears-local.toml
   Processed 3 jobs; outputs stored in artifacts/infer

CPU-heavy fallback (e.g. developers without GPUs, or teams that want CPU
assist threads)::

   $ badc infer run manifests/XXXX-000_20251001_093000.csv --cpu-workers 4
   Processed 3 jobs; outputs stored in artifacts/infer

Preview a ``datalad run`` command without executing jobs::

   $ badc infer run manifests/XXXX-000_20251001_093000.csv --print-datalad-run
   Run the following from the dataset root (/home/gep/projects/badc/data/datalad/bogus):
   datalad run -m "badc infer ..." --input manifests/... --output artifacts/infer -- badc infer run ...

When chunk inputs live inside a DataLad dataset (for example
``data/datalad/bogus``), inference outputs default to ``/artifacts/infer`` so
you can immediately ``datalad save`` and push.
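Before queueing a long inference run, it can help to sanity-check the manifest
you just generated. The sketch below is a minimal stdlib-Python example that
assumes hypothetical ``start_s``/``end_s`` columns holding chunk offsets in
seconds; inspect your manifest's actual header and adjust the names first::

   import csv

   def manifest_stats(path):
       """Summarise a chunk manifest: chunk count and total covered seconds.

       Assumes the manifest has ``start_s``/``end_s`` columns with per-chunk
       offsets in seconds (an illustrative guess, not the documented schema).
       """
       total = 0.0
       count = 0
       with open(path, newline="") as fh:
           for row in csv.DictReader(fh):
               total += float(row["end_s"]) - float(row["start_s"])
               count += 1
       return count, total

A quick ``manifest_stats("manifests/XXXX-000_20251001_093000.csv")`` before
submitting a batch job catches empty or truncated manifests early.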
GPU planning helpers::

   $ badc gpus
   nvidia-smi reported 'Insufficient Permissions'. GPU inventory usually requires NVML access—try
   running `sudo nvidia-smi` to confirm the driver works or ask the cluster admin to grant your
   user access to the NVIDIA devices.
   No GPUs detected via nvidia-smi.

Use ``--max-gpus`` to cap the GPU pool or ``--cpu-workers`` to append CPU
threads (BADC still adds one CPU worker automatically when no GPUs are
detected). When detection succeeds the utility lists each GPU (index, name,
memory). If you see a permissions warning, escalate to the system
administrator; otherwise BADC will fall back to CPU workers.

.. _usage-aggregate-telemetry:

Aggregate detections and monitor telemetry
------------------------------------------

Summarise detections via :doc:`cli/infer`, optionally emitting a Parquet file
for DuckDB and pulling chunk metadata from the original manifest::

   $ badc infer aggregate artifacts/infer \
       --manifest manifests/XXXX-000_20251001_093000.csv \
       --output artifacts/aggregate/summary.csv \
       --parquet artifacts/aggregate/detections.parquet
   Wrote detection summary to artifacts/aggregate/summary.csv
   Wrote Parquet export to artifacts/aggregate/detections.parquet

Hand the Parquet file to :doc:`cli/report` for quick pivots. Each row carries
the chunk-relative start/end offsets, absolute timestamps, HawkEars label
code/name pairs, confidence, runner label, and the detected HawkEars
``model_version``, so downstream DuckDB queries have everything needed for
Phase 2 aggregation::

   $ badc report summary --parquet artifacts/aggregate/detections.parquet --group-by label
   +--------+------------+----------------+
   | label  | detections | avg_confidence |
   +--------+------------+----------------+
   | grouse | 42         | 0.87           |
   +--------+------------+----------------+

Need more color than a single pivot?
``badc report quicklook`` hits the same Parquet file but emits multiple Rich
tables (top labels, top recordings, chunk timeline) and ASCII sparklines,
while also writing optional CSV snapshots for notebooks::

   $ badc report quicklook --parquet artifacts/aggregate/detections.parquet \
       --output-dir artifacts/aggregate/quicklook
   $ ls artifacts/aggregate/quicklook
   chunks.csv  labels.csv  recordings.csv

Telemetry logs are unique per manifest/timestamp (the default path is printed
by ``badc infer run``). Tail them with :doc:`cli/infer`'s monitor view
(per-GPU utilization/memory trends, rolling ASCII sparklines, success/failure
counts, and a live event tail) or the lightweight :doc:`cli/misc` command::

   $ badc infer monitor --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
   $ badc telemetry --log data/telemetry/infer/GNWT-290_20251207T080000Z.jsonl
   Telemetry records (4):
     [success] GNWT-290_chunk_1 (GPU 0) 2025-12-06T18:22:11 runtime=12.4

Telemetry logs are JSONL files, so you can also ingest them into notebooks or
log shippers for dashboards. See :doc:`howto/aggregate-results` for a guided
walkthrough tying together aggregation, Parquet exports, DuckDB summaries, and
telemetry monitoring.

See also
--------

* :doc:`cli/data` – dataset helpers.
* :doc:`cli/chunk` – chunk probe/split/manifest/run parameters.
* :doc:`cli/infer` – inference, aggregation, and telemetry options.
* :doc:`howto/infer-hpc` – SLURM-ready instructions for Sockeye/Chinook deployments.
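Because the telemetry logs are plain JSONL, a few lines of stdlib Python are
enough to tally outcomes for a dashboard or notebook. The ``status`` field
name below is inferred from the ``badc telemetry`` sample output and may not
match your schema exactly; treat this as a sketch::

   import json
   from collections import Counter

   def tally_statuses(log_path):
       """Count telemetry events by status (e.g. success/failure).

       Assumes each JSONL record carries a ``status`` field, as the
       ``[success]`` tags in the sample output suggest; adapt as needed.
       """
       counts = Counter()
       with open(log_path) as fh:
           for line in fh:
               line = line.strip()
               if line:
                   counts[json.loads(line).get("status", "unknown")] += 1
       return counts

The same loop is the starting point for shipping records to a log aggregator:
each line is an independent JSON document, so no framing logic is needed.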