Track Inference with datalad run
This recipe shows how to pair badc with datalad run so that every HawkEars
inference is reproducible: inputs, outputs, and the exact command line end up in
the DataLad commit history. The workflow assumes you have already cloned a
dataset such as data/datalad/bogus, or connected it with badc data connect.
Prerequisites
- DataLad + git-annex installed (see notes/datalad-plan.md for platform specifics).
- A BADC checkout with Typer CLI entry points available (pip install -e . or uv pip install -e .).
- Chunk manifest CSV living inside the target DataLad dataset. The manifest can be generated via badc chunk split --manifest data/datalad/bogus/....
- badc data connect bogus (or your production dataset) executed so the local registry knows where the files reside.
Step 1 – Confirm dataset layout
$ badc data status
Tracked datasets:
- bogus: connected (/home/user/projects/badc/data/datalad/bogus)
Change into the dataset root and verify DataLad metadata exists:
$ cd data/datalad/bogus
$ ls .datalad
config config.datalad siblings.datalad
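The same check can be scripted. The sketch below is illustrative only: find_datalad_root is a hypothetical helper, not BADC's own badc.data.find_dataset_root, and it assumes a dataset is recognizable by a .datalad/config file at its root.

```python
from pathlib import Path
from typing import Optional


def find_datalad_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that holds `.datalad/config`.

    Hypothetical helper for illustration; BADC's real lookup may differ.
    """
    start = start.resolve()
    # Include `start` itself when it is a directory, then every ancestor.
    candidates = [start, *start.parents] if start.is_dir() else list(start.parents)
    for candidate in candidates:
        if (candidate / ".datalad" / "config").is_file():
            return candidate
    return None  # no DataLad dataset found on the way up
```

Calling find_datalad_root(Path("data/datalad/bogus/manifests")) from the project root should return the dataset root if the layout matches the listing above.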
Step 2 – Generate (or locate) a chunk manifest
Use the chunk CLI to rewrite or validate the manifest so that every chunk path stays relative to the dataset root. Example:
$ badc chunk split audio/GNWT-290_20230331_235938.wav \
--chunk-duration 60 \
--manifest manifests/GNWT-290.csv
The manifest file now sits under the DataLad repo (manifests/). This is a
requirement for --print-datalad-run to work because BADC needs to declare
--input paths relative to the dataset root.
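To see what "relative to the dataset root" means in practice, here is a minimal sketch. The helper name dataset_relative is an assumption made for illustration and is not part of BADC. It compares paths lexically rather than resolving symlinks, because annexed files in a DataLad dataset are typically symlinks into .git/annex.

```python
import os
from pathlib import Path


def dataset_relative(path: Path, dataset_root: Path) -> Path:
    """Return `path` relative to `dataset_root`, or raise ValueError if it lies outside.

    Hypothetical helper; uses abspath (no symlink resolution) so that
    annexed files keep their in-dataset location.
    """
    absolute = Path(os.path.abspath(path))
    root = Path(os.path.abspath(dataset_root))
    try:
        return absolute.relative_to(root)
    except ValueError as exc:
        raise ValueError(f"{path} is outside the dataset root {root}") from exc
```

A manifest stored under the dataset's manifests/ folder maps to a path such as manifests/GNWT-290.csv, while a manifest kept elsewhere raises ValueError.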
Step 3 – Ask BADC to draft the datalad run command
From the project root (adjust the manifest path if you start from the dataset root) run:
$ badc infer run data/datalad/bogus/manifests/GNWT-290.csv \
--use-hawkears \
--print-datalad-run
BADC inspects every chunk, finds the dataset root via badc.data.find_dataset_root,
and emits a command similar to:
datalad run -m "badc infer GNWT-290.csv" \
--input manifests/GNWT-290.csv \
--output artifacts/infer \
-- badc infer run manifests/GNWT-290.csv --use-hawkears
Nothing executes yet. This step is a dry-run preview that verifies the manifest and output folder live inside the same dataset and that all relative paths are valid.
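For readers who want to see how such a command line fits together, the following sketch assembles the same shape of command from a dataset-relative manifest path. It is not BADC's implementation: draft_datalad_run is a hypothetical function, and the default output folder simply mirrors the example above.

```python
import shlex
from pathlib import PurePosixPath


def draft_datalad_run(manifest: str, output_dir: str = "artifacts/infer") -> str:
    """Build a `datalad run` command line for one dataset-relative manifest.

    Hypothetical helper; only mirrors the command shape shown in this recipe.
    """
    name = PurePosixPath(manifest).name
    argv = [
        "datalad", "run",
        "-m", f"badc infer {name}",   # commit message
        "--input", manifest,          # fetched before the command runs
        "--output", output_dir,       # unlocked and saved afterwards
        "--",                         # everything after this is the wrapped command
        "badc", "infer", "run", manifest, "--use-hawkears",
    ]
    # shlex.join quotes arguments that contain spaces or shell metacharacters.
    return shlex.join(argv)
```

Building the command as an argument list and quoting it once at the end avoids quoting mistakes when manifest names contain spaces.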
Step 4 – Execute inside the dataset
Change to the dataset root (cd data/datalad/bogus) and run the suggested
command. DataLad will:
1. Materialize required inputs (via git-annex/datalad get).
2. Execute badc infer run ... exactly as printed.
3. Save the produced JSON/CSV files under artifacts/infer (BADC chooses this path automatically when it detects that chunks live inside a dataset).
4. Create a commit referencing both the manifest and the resulting artifacts, with the shell command stored as a run record in the commit message.
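DataLad stores the run record as a JSON block in the commit message, delimited by two marker lines. The sketch below shows one way to read it back from a message obtained with, for example, git log -1 --format=%B. The function name is hypothetical, and the sketch assumes the default layout where the record is embedded in the message rather than kept in a sidecar file.

```python
import json
import re
from typing import Any, Optional

# Marker lines that DataLad places around the JSON run record.
_RUN_RECORD = re.compile(
    r"^=== Do not change lines below ===\n(?P<record>.*?)\n"
    r"\^\^\^ Do not change lines above \^\^\^$",
    re.DOTALL | re.MULTILINE,
)


def extract_run_record(commit_message: str) -> Optional[Any]:
    """Return the parsed run record from a commit message, or None if absent."""
    match = _RUN_RECORD.search(commit_message)
    if match is None:
        return None
    return json.loads(match.group("record"))
```

For a commit created by datalad run, the returned dictionary typically includes the cmd, inputs, and outputs fields, which is the information datalad rerun relies on.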
Step 5 – Push provenance + outputs
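After the datalad run command succeeds, check the dataset state and push. datalad run already commits its outputs, so datalad save is only needed if datalad status still reports changes: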
$ datalad status
$ datalad save -m "HawkEars inference for GNWT-290"
$ datalad push --to origin
This pushes the new commit plus annexed output objects to the origin sibling (backed by S3 or GitHub, depending on the dataset).
Variant: scripting multiple manifests
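When scheduling many manifests, wrap Step 3 in a loop. Change into the dataset first so the glob expands against the dataset's manifests/ folder:
cd data/datalad/bogus
for manifest in manifests/*.csv; do
  # Print the suggested datalad run command for this manifest
  badc infer run "$manifest" --use-hawkears --print-datalad-run
  # Copy/paste the emitted command or tee it into a shell script
done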
You can also pass --max-gpus or --cpu-workers to fine-tune concurrency
(BADC still schedules at least one CPU worker automatically when GPUs are absent),
and append --hawkears-arg repeatedly to forward custom switches to the
HawkEars analyze.py entry point.
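When batching, it can help to give each manifest its own output folder so runs do not collide. The sketch below generates one datalad run line per manifest. It is an illustration under assumptions: batch_commands is a hypothetical helper, the artifacts/infer/<manifest stem> layout is invented here, and it assumes badc infer run accepts --output-dir as mentioned under Troubleshooting.

```python
import shlex
from pathlib import PurePosixPath
from typing import Iterable, List


def batch_commands(manifests: Iterable[str], output_base: str = "artifacts/infer") -> List[str]:
    """Return one `datalad run` command per manifest, each with its own output folder.

    Hypothetical helper; the per-manifest folder layout is an assumption.
    """
    commands = []
    for manifest in sorted(manifests):
        stem = PurePosixPath(manifest).stem
        output_dir = f"{output_base}/{stem}"  # unique folder per manifest
        argv = [
            "datalad", "run",
            "-m", f"badc infer {stem}",
            "--input", manifest,
            "--output", output_dir,
            "--",
            "badc", "infer", "run", manifest,
            "--use-hawkears",
            "--output-dir", output_dir,
        ]
        commands.append(shlex.join(argv))
    return commands
```

The returned lines can be written to a shell script and executed from the dataset root, one commit per manifest.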
Troubleshooting
- If BADC reports that the manifest is outside the dataset root, move the file under the dataset (e.g., data/datalad/bogus/manifests) or supply --output-dir so the workflow does not rely on dataset-relative paths.
- datalad run fails when outputs already exist. Remove the previous artifacts/infer tree or use unique --output-dir values per run.
- To reproduce a recorded run, call datalad rerun on the recorded commit; DataLad will restore the same inputs and re-execute the saved command.