Track Inference with datalad run

This recipe shows how to pair badc with datalad run so that every HawkEars inference is reproducible: the inputs, the outputs, and the exact command line all end up in the DataLad commit history. The workflow assumes you have already cloned a dataset such as data/datalad/bogus, or attached it with badc data connect.

Prerequisites

  1. DataLad + git-annex installed (see notes/datalad-plan.md for platform specifics).

  2. A BADC checkout with Typer CLI entry points available (pip install -e . or uv pip install -e .).

  3. Chunk manifest CSV living inside the target DataLad dataset. The manifest can be generated via badc chunk split --manifest data/datalad/bogus/....

  4. badc data connect bogus (or your production dataset) executed so the local registry knows where the files reside.

Step 1 – Confirm dataset layout

$ badc data status
Tracked datasets:
  - bogus: connected (/home/user/projects/badc/data/datalad/bogus)

Change into the dataset root and verify that DataLad metadata exists (the exact listing varies between datasets; the presence of .datalad/config is what matters):

$ cd data/datalad/bogus
$ ls .datalad
config  config.datalad  siblings.datalad
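
The same check can be scripted. The following is a minimal sketch, not BADC's own implementation: the helper name locate_dataset_root is hypothetical, and it assumes a dataset root can be recognized by the .datalad/config file that DataLad creates there.

```python
from pathlib import Path
from typing import Optional


def locate_dataset_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory holding .datalad/config is found.

    Hypothetical helper for illustration; BADC's badc.data.find_dataset_root
    may use different rules.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / ".datalad" / "config").is_file():
            return candidate
    return None  # `start` is not inside a DataLad dataset


if __name__ == "__main__":
    print(locate_dataset_root(Path.cwd()))
```

Run from any directory below data/datalad/bogus, this should print the dataset root; outside a dataset it prints None.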

Step 2 – Generate (or locate) a chunk manifest

From the dataset root, use the chunk CLI to generate (or regenerate) the manifest so that every chunk path stays relative to the dataset root. Example:

$ badc chunk split audio/GNWT-290_20230331_235938.wav \
    --chunk-duration 60 \
    --manifest manifests/GNWT-290.csv

The manifest file now sits under the DataLad repo (manifests/). This is a requirement for --print-datalad-run to work because BADC needs to declare --input paths relative to the dataset root.
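
To check the "relative to the dataset root" requirement before invoking BADC, a small script can flag offending rows. This is a sketch under stated assumptions: the manifest schema is not shown in this recipe, so the column name chunk_path is hypothetical and should be replaced with whatever column your manifest actually uses.

```python
import csv
from pathlib import Path
from typing import List


def find_escaping_paths(
    manifest: Path, dataset_root: Path, column: str = "chunk_path"
) -> List[str]:
    """Return manifest entries that are absolute or resolve outside dataset_root.

    `column` is a hypothetical column name; adjust it to the real manifest header.
    """
    root = dataset_root.resolve()
    offenders = []
    with manifest.open(newline="") as handle:
        for row in csv.DictReader(handle):
            raw = row[column]
            path = Path(raw)
            if path.is_absolute():
                offenders.append(raw)
                continue
            try:
                # relative_to raises ValueError when the path leaves the root
                (root / path).resolve().relative_to(root)
            except ValueError:
                offenders.append(raw)  # e.g. "../elsewhere/chunk.wav"
    return offenders
```

An empty list means every entry in that column stays inside the dataset.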

Step 3 – Ask BADC to draft the datalad run command

From the project root (or from the dataset root, with the manifest path adjusted accordingly) run:

$ badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
    --use-hawkears \
    --print-datalad-run

BADC inspects every chunk, finds the dataset root via badc.data.find_dataset_root, and emits a command similar to:

datalad run -m "badc infer GNWT-290.csv" \
  --input manifests/GNWT-290.csv \
  --output artifacts/infer \
  -- badc infer run manifests/GNWT-290.csv --use-hawkears

Nothing executes yet. This step is a dry-run preview that confirms the manifest and output folder live inside the same dataset and that all relative paths are valid.
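
To make the shape of the drafted command concrete, here is a minimal sketch of how such a command line could be assembled. It is not BADC's implementation: the function name draft_datalad_run is hypothetical, and the default output folder simply mirrors the artifacts/infer path shown in the example command.

```python
import shlex
from pathlib import Path


def draft_datalad_run(
    manifest: Path, dataset_root: Path, output_dir: str = "artifacts/infer"
) -> str:
    """Compose a `datalad run` command with dataset-relative paths (illustrative only)."""
    root = dataset_root.resolve()
    # relative_to raises ValueError if the manifest lives outside the dataset
    rel_manifest = manifest.resolve().relative_to(root).as_posix()
    argv = [
        "datalad", "run",
        "-m", f"badc infer {Path(rel_manifest).name}",
        "--input", rel_manifest,
        "--output", output_dir,
        "--",
        "badc", "infer", "run", rel_manifest, "--use-hawkears",
    ]
    # shlex.join quotes arguments that contain spaces, such as the message
    return shlex.join(argv)
```

Because the manifest path is made relative to the dataset root first, a manifest stored elsewhere fails early with ValueError instead of producing a command that DataLad cannot track.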

Step 4 – Execute inside the dataset

Change to the dataset root (cd data/datalad/bogus) and run the suggested command. DataLad will:

  • Materialize required inputs (via git-annex/datalad get).

  • Execute badc infer run ... exactly as printed.

  • Save the produced JSON/CSV files under artifacts/infer (BADC chooses this path automatically when it detects that chunks live inside a dataset).

  • Create a commit that references both the manifest and the resulting artifacts, with the executed command stored as a machine-readable run record in the commit message.
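
The run record that DataLad stores in the commit message can be read back for auditing. The sketch below assumes the delimiter lines that DataLad writes around the JSON record; the function name extract_run_record is hypothetical.

```python
import json
from typing import Optional

# Delimiter lines DataLad places around the JSON run record (assumed unchanged)
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> Optional[dict]:
    """Return the JSON run record embedded in a `datalad run` commit message, if any."""
    if START not in commit_message or END not in commit_message:
        return None  # an ordinary commit without a run record
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)
```

Feed it the output of git log -1 --format=%B from the dataset root; the returned dictionary should contain keys such as cmd, inputs, and outputs.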

Step 5 – Push provenance + outputs

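After the datalad run command succeeds, its outputs are already saved in a commit. Confirm that the dataset is clean, then push:

$ datalad status
$ datalad push --to origin

If datalad status lists other modified or untracked files, record them first with datalad save -m "<message>". The push sends the new commit plus the annexed output objects to the origin sibling; where the annexed content ends up (for example, an S3 special remote attached to a GitHub sibling) depends on how the dataset's siblings are configured.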

Variant: scripting multiple manifests

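When scheduling many manifests, wrap Step 3 in a loop (run from the project root):

(
    # Enter the dataset first so the glob and the manifest paths are dataset-relative
    cd data/datalad/bogus || exit 1
    for manifest in manifests/*.csv; do
        # Preview only: prints the datalad run command, nothing executes
        badc infer run "$manifest" --use-hawkears --print-datalad-run
        # Copy/paste the emitted command or tee it into a shell script
    done
)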

To tune concurrency, pass --max-gpus or --cpu-workers (BADC still schedules at least one CPU worker automatically when no GPUs are present). To forward custom switches to the HawkEars analyze.py entry point, repeat --hawkears-arg once per switch.
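
The same batch preview can be driven from Python, which makes it easier to attach repeated --hawkears-arg values. This is a sketch only: preview_commands is a hypothetical helper that builds argument lists without executing anything, and it assumes manifests live under manifests/ inside the dataset.

```python
from pathlib import Path
from typing import Iterable, List


def preview_commands(
    dataset_root: Path, hawkears_args: Iterable[str] = ()
) -> List[List[str]]:
    """Build one `badc infer run ... --print-datalad-run` argv per manifest."""
    # Repeat --hawkears-arg once for every forwarded switch
    extra = [part for arg in hawkears_args for part in ("--hawkears-arg", arg)]
    commands = []
    for manifest in sorted((dataset_root / "manifests").glob("*.csv")):
        rel = manifest.relative_to(dataset_root).as_posix()
        commands.append(
            ["badc", "infer", "run", rel, "--use-hawkears", "--print-datalad-run", *extra]
        )
    return commands
```

Each returned list can be handed to subprocess.run(argv, cwd=dataset_root, check=True) so that the dataset-relative manifest path resolves correctly.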

Troubleshooting

  • If BADC reports that the manifest is outside the dataset root, move the file under the dataset (e.g., data/datalad/bogus/manifests) or supply --output-dir so the workflow does not rely on dataset-relative paths.

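  • datalad run can fail when outputs from a previous run already exist, and it refuses to start while the dataset has unsaved changes. Use a unique --output-dir value per run, or remove the previous artifacts/infer tree and save that removal before running again.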

  • To reproduce a run without retyping the command, call datalad rerun on the recorded commit; DataLad will restore the same inputs and re-execute the saved command, so inference is recomputed.
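
For the "unique --output-dir values per run" advice, one possible naming scheme is a per-manifest folder with a UTC timestamp. The helper below is a hypothetical sketch; the naming convention is an assumption, not something BADC prescribes.

```python
from datetime import datetime, timezone
from pathlib import Path


def pick_output_dir(
    dataset_root: Path, manifest_name: str, base: str = "artifacts/infer"
) -> str:
    """Return a dataset-relative output directory that does not exist yet."""
    stem = Path(manifest_name).stem
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    candidate = f"{base}/{stem}-{stamp}"
    suffix = 1
    # Add a counter if a folder with the same timestamp already exists
    while (dataset_root / candidate).exists():
        candidate = f"{base}/{stem}-{stamp}-{suffix}"
        suffix += 1
    return candidate
```

Pass the returned value to --output-dir, and use the same path for the --output argument of datalad run so the new folder is tracked.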