Track Inference with ``datalad run``
====================================

This recipe shows how to pair ``badc`` with ``datalad run`` so every HawkEars
inference is reproducible: inputs, outputs, and the exact command line end up
in the DataLad commit history. The workflow assumes you have already cloned a
dataset such as ``data/datalad/bogus`` (or attached one with
``badc data connect``).

.. contents:: Steps
   :local:
   :depth: 1

Prerequisites
-------------

1. DataLad and git-annex installed (see ``notes/datalad-plan.md`` for platform
   specifics).
2. A BADC checkout with the Typer CLI entry points available
   (``pip install -e .`` or ``uv pip install -e .``).
3. A chunk manifest CSV living *inside* the target DataLad dataset. The
   manifest can be generated via
   ``badc chunk split --manifest data/datalad/bogus/...``.
4. ``badc data connect bogus`` (or the equivalent for your production dataset)
   executed so the local registry knows where the files reside.

Step 1 – Confirm dataset layout
-------------------------------

.. code-block:: console

   $ badc data status
   Tracked datasets:
    - bogus: connected (/home/user/projects/badc/data/datalad/bogus)

Change into the dataset root and verify that the DataLad metadata exists::

   $ cd data/datalad/bogus
   $ ls .datalad
   config  config.datalad  siblings.datalad

Step 2 – Generate (or locate) a chunk manifest
----------------------------------------------

Use the chunk CLI to rewrite or validate the manifest so that every chunk path
stays relative to the dataset root. For example, from the dataset root::

   $ badc chunk split audio/GNWT-290_20230331_235938.wav \
       --chunk-duration 60 \
       --manifest manifests/GNWT-290.csv

The manifest file now sits inside the DataLad dataset (under ``manifests/``).
``--print-datalad-run`` requires this because BADC has to declare ``--input``
paths relative to the dataset root.

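
The "inside the dataset" requirement can be checked before drafting any
command. The following is a minimal sketch, not BADC's own implementation: it
assumes a dataset root is the nearest ancestor directory containing
``.datalad``, and the helper names ``locate_dataset_root`` and
``dataset_relative`` are hypothetical.

.. code-block:: python

   from pathlib import Path
   from typing import Optional


   def locate_dataset_root(path: Path) -> Optional[Path]:
       """Return the nearest ancestor directory holding ``.datalad``, or None.

       Hypothetical helper for illustration; BADC may resolve roots differently.
       """
       resolved = path.resolve()
       start = resolved if resolved.is_dir() else resolved.parent
       for candidate in (start, *start.parents):
           if (candidate / ".datalad").is_dir():
               return candidate
       return None


   def dataset_relative(path: Path) -> Path:
       """Express ``path`` relative to its dataset root, or raise ValueError."""
       root = locate_dataset_root(path)
       if root is None:
           raise ValueError(f"{path} is not inside a DataLad dataset")
       return path.resolve().relative_to(root)

Calling ``dataset_relative(Path("data/datalad/bogus/manifests/GNWT-290.csv"))``
from the project root would be expected to yield ``manifests/GNWT-290.csv``,
the form that a ``datalad run --input`` declaration needs.
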

Step 3 – Ask BADC to draft the ``datalad run`` command
------------------------------------------------------

The preview works from any directory as long as the manifest path resolves.
From the project root, run::

   $ badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
       --use-hawkears \
       --print-datalad-run

BADC inspects every chunk, finds the dataset root via
``badc.data.find_dataset_root``, and emits a command similar to::

   datalad run -m "badc infer GNWT-290.csv" \
       --input manifests/GNWT-290.csv \
       --output artifacts/infer \
       -- badc infer run manifests/GNWT-290.csv --use-hawkears

Nothing executes yet. This step is a dry-run preview that guarantees the
manifest and output folder live inside the same dataset and that all relative
paths are valid.

Step 4 – Execute inside the dataset
-----------------------------------

Change to the dataset root (``cd data/datalad/bogus``) and run the suggested
command. DataLad will:

* Materialize the required inputs (via ``datalad get``, which relies on
  git-annex).
* Execute ``badc infer run ...`` exactly as printed.
* Save the produced JSON/CSV files under ``artifacts/infer`` (BADC chooses
  this path automatically when it detects that chunks live inside a dataset).
* Create a commit that references both the manifest and the resulting
  artifacts, with the shell command stored as a run record in the commit
  message.

Step 5 – Push provenance + outputs
----------------------------------

After the ``datalad run`` command succeeds::

   $ datalad status
   $ datalad save -m "HawkEars inference for GNWT-290"
   $ datalad push --to origin

``datalad run`` already saves the declared outputs, so ``datalad status``
should report a clean dataset; the ``datalad save`` call is only needed if
other files changed afterwards. ``datalad push`` then publishes the new commit
and the annexed output objects to the configured sibling and its storage
remote (S3 or GitHub, depending on the dataset).

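
To confirm what a run actually recorded, the commit message can be parsed
directly. The sketch below assumes DataLad's default behaviour of embedding
the run record as JSON between two marker lines in the commit message (records
stored in sidecar files are not handled); the function names are illustrative
and not part of BADC or DataLad.

.. code-block:: python

   import json
   import subprocess

   # Marker lines that DataLad places around the JSON run record (assumed
   # default commit-message layout).
   RECORD_START = "=== Do not change lines below ==="
   RECORD_END = "^^^ Do not change lines above ^^^"


   def extract_run_record(message: str) -> dict:
       """Parse the JSON run record embedded in a ``datalad run`` commit message."""
       start = message.find(RECORD_START)
       end = message.find(RECORD_END)
       if start == -1 or end == -1 or end < start:
           raise ValueError("commit message does not contain a run record")
       return json.loads(message[start + len(RECORD_START):end])


   def latest_commit_message(dataset: str = ".") -> str:
       """Return the full message of the most recent commit in ``dataset``."""
       result = subprocess.run(
           ["git", "-C", dataset, "log", "-1", "--format=%B"],
           check=True,
           capture_output=True,
           text=True,
       )
       return result.stdout

Inside the dataset, ``extract_run_record(latest_commit_message())`` returns a
dictionary whose ``cmd``, ``inputs``, and ``outputs`` entries should match the
command drafted in Step 3.
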

Variant: scripting multiple manifests
-------------------------------------

When scheduling many manifests, change into the dataset root once and repeat
Step 3 in a loop, then execute the emitted commands as in Step 4::

   cd data/datalad/bogus
   for manifest in manifests/*.csv; do
       # Print the suggested command; copy/paste it or tee it into a shell script
       badc infer run "$manifest" --use-hawkears --print-datalad-run
   done

Changing directory before the loop matters: the ``manifests/*.csv`` glob is
expanded relative to the current directory, so it has to be evaluated inside
the dataset.

In addition, pass ``--max-gpus`` or ``--cpu-workers`` to fine-tune concurrency
(BADC still schedules at least one CPU worker automatically when GPUs are
absent), and append ``--hawkears-arg`` repeatedly to forward custom switches
to the HawkEars ``analyze.py`` entry point.

Troubleshooting
---------------

* If BADC reports that the manifest is outside the dataset root, move the file
  under the dataset (e.g., ``data/datalad/bogus/manifests``) or supply
  ``--output-dir`` so the workflow does not rely on dataset-relative paths.
* If ``datalad run`` fails because outputs already exist, remove the previous
  ``artifacts/infer`` tree or use a unique ``--output-dir`` value per run.
* To reproduce a recorded run without retyping the command, call
  ``datalad rerun`` on the recorded commit; DataLad will fetch the same inputs
  and re-execute the saved command.

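
The advice to use a unique ``--output-dir`` per run combines naturally with the
multi-manifest variant. The sketch below only mirrors the shape of the command
shown in Step 3; the ``artifacts/infer/<manifest stem>`` layout, the helper
names, and the assumption that ``badc infer run`` accepts ``--output-dir`` in
this position are illustrative. The output of ``--print-datalad-run`` remains
the authoritative form of the command.

.. code-block:: python

   import shlex
   from pathlib import Path
   from typing import Iterable, List


   def build_datalad_run(
       manifest: Path, output_root: Path = Path("artifacts/infer")
   ) -> List[str]:
       """Assemble a ``datalad run`` argv with one output directory per manifest."""
       # Assumed layout: artifacts/infer/<manifest stem>, so runs never collide.
       output_dir = output_root / manifest.stem
       return [
           "datalad", "run",
           "-m", f"badc infer {manifest.name}",
           "--input", manifest.as_posix(),
           "--output", output_dir.as_posix(),
           "--",
           "badc", "infer", "run", manifest.as_posix(),
           "--use-hawkears",
           "--output-dir", output_dir.as_posix(),
       ]


   def render_script(manifests: Iterable[Path]) -> str:
       """Render one shell-quoted ``datalad run`` line per manifest, sorted by path."""
       lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
       for manifest in sorted(manifests):
           lines.append(shlex.join(build_datalad_run(manifest)))
       return "\n".join(lines) + "\n"

Writing ``render_script(Path("manifests").glob("*.csv"))`` to a file from the
dataset root gives a script that can be reviewed before anything is executed.
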