# Run Inference on Sockeye
This cookbook stitches together DataLad staging, Apptainer containers, and Sockeye SLURM jobs so you can run manifest-driven HawkEars inference at scale on the cluster.
## Pre-flight checklist
- Review the Sockeye HPC page for the latest partition, module, and job-array guidelines.
- Ensure the DataLad dataset is clean on Chinook before copying manifests to Sockeye (`datalad status` should report "nothing to save").
- Run `badc data status` on both Chinook and Sockeye to confirm the bogus/production datasets point at the same commit.
- Capture the commands you will run (the `datalad run` invocation, the `sbatch` submission) in `CHANGE_LOG.md` as soon as the batch finishes.
## GPU planning
- Use `badc gpus` inside an interactive `srun --gres=gpu:1 --pty bash` shell to confirm which devices will be available to the job.
- If a manifest requires fewer workers than the GPUs requested, pass `--max-gpus` to keep HawkEars from spawning unnecessary processes while still holding the reservation for future chunks.
- Sockeye arrays: `badc infer orchestrate --sockeye-script sockeye_array.sh` now emits a ready-to-use SLURM array script (one task per manifest). Pair it with `sbatch sockeye_array.sh` to run the entire plan without hand-editing bash snippets, and store the command string in the job log for provenance. When re-running an interrupted array, add `--sockeye-resume-completed`: the generated script passes `--resume-summary` whenever a prior telemetry `*.summary.json` exists, so finished chunks are skipped automatically. Use `--sockeye-bundle` when generating the script to append `badc infer aggregate` + `badc report bundle` per manifest, leaving the quicklook CSV, Parquet report, and DuckDB exports behind right after each recording finishes. A combined invocation is sketched after this list.
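For example, generating and submitting an array with resume and bundling enabled could look like the following. This is a minimal sketch that assumes you run it from the dataset root and supply whatever manifest-selection arguments your plan needs:

```bash
# Emit the SLURM array script (one task per manifest) with resume + bundle.
badc infer orchestrate \
  --sockeye-script sockeye_array.sh \
  --sockeye-resume-completed \
  --sockeye-bundle

# Submit, and store this command string in the job log for provenance.
sbatch sockeye_array.sh
```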
## Notebook hand-off
- After collecting results (step 5), launch the notebook gallery (`docs/notebooks/index`) locally or on Chinook to visualize detection counts (`notebooks/aggregate_analysis.ipynb`) before pushing to collaborators.
- The same datasets used for inference can host derived tables: run `badc infer aggregate` inside the dataset, `datalad save`, then open the notebook with `datalad run jupyter lab` if you need provenance for figure generation. A sketch of this flow follows the list.
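A minimal sketch of that provenance-tracked flow, reusing the aggregate paths from step 5:

```bash
# Inside the dataset root: derive tables, save, then run the notebook
# session under DataLad so figures get provenance too.
badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv
datalad save -m "Add aggregate tables"
datalad run -m "Notebook session: aggregate analysis" jupyter lab
```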
## 1. Prepare the dataset on Chinook
- Clone or create the DataLad dataset in `/project/<pi>/badc/data/datalad/<name>` (a creation sketch follows this list).
- Populate (or update) manifests under `manifests/` and save with `datalad save`.
- Push metadata + content upstream:

```console
$ datalad push --to origin
$ datalad push --to arbutus-s3
```
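If the dataset does not exist yet, one way to bootstrap it could look like this; the `text2git` run-procedure is an assumption here, not something this guide mandates, so substitute whatever configuration your group standardizes on:

```bash
# Hypothetical dataset bootstrap; the text2git procedure is an assumption.
datalad create -c text2git /project/<pi>/badc/data/datalad/<name>
cd /project/<pi>/badc/data/datalad/<name>
mkdir -p manifests            # drop manifest CSVs here
datalad save -m "Add manifests"
```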
## 2. Stage code + containers on Sockeye
- Clone this repo with submodules into `/project/<pi>/badc`.
- Build (or copy) `badc-hawkears.sif` into `/project/<pi>/containers`.
- Create a Python virtual environment for helper scripts (optional if the container is used for everything). A staging sketch follows this list.
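A hedged staging sketch; the repository URL and image source below are placeholders, since this guide does not pin them here:

```bash
# Repository URL is a placeholder for this sketch.
git clone --recurse-submodules <repo-url> /project/<pi>/badc

# Copy a pre-built image into place (or build it per the container docs).
mkdir -p /project/<pi>/containers
cp /path/to/badc-hawkears.sif /project/<pi>/containers/

# Optional: helper-script venv outside the container.
python3 -m venv /project/<pi>/badc/.venv
source /project/<pi>/badc/.venv/bin/activate
```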
## 3. Draft the manifest-specific job script
Hint: `badc infer orchestrate --sockeye-script job.sh` can emit this SLURM array template automatically (one array index per manifest). Edit the generated script or start from the example below if you prefer a custom layout.
```bash
#!/bin/bash
#SBATCH --job-name=hawkears-gnwt290
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x-%j.out
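
# Load the container runtime and a CUDA stack compatible with the image.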
module load apptainer/1.3 cuda/12.2
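
# Adjust these paths to your allocation, dataset, and manifest.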
DATASET=/project/pi-mygroup/badc/data/datalad/bogus
MANIFEST=manifests/GNWT-290.csv
IMG=/project/pi-mygroup/containers/badc-hawkears.sif
cd "$DATASET"
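# Pull in the latest commits/manifests pushed from Chinook.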
datalad update --how=merge --recursive
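
# Record the container invocation as a DataLad run so outputs carry provenance.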
datalad run -m "hawkears $(basename "$MANIFEST")" \
  --input "$MANIFEST" \
  --output artifacts/infer \
  -- \
  apptainer exec --nv "$IMG" badc infer run "$MANIFEST" \
    --use-hawkears --max-gpus 4 --hawkears-arg --min_score --hawkears-arg 0.7
```
## 4. Submit + monitor
- `sbatch job.sh`
- `squeue -u $USER` to watch state.
- Tail the log: `tail -f logs/hawkears-gnwt290-<jobid>.out`.
- Inspect telemetry in real time: `badc telemetry --log data/telemetry/infer/log.jsonl` (from the dataset root). The pieces are combined in the sketch below.
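Put together, a minimal monitoring session might look like this; the job name and log path follow the step 3 script, and `<jobid>` comes from the `sbatch` output:

```bash
sbatch job.sh                                    # prints "Submitted batch job <jobid>"
squeue -u "$USER"                                # watch queue state
tail -f logs/hawkears-gnwt290-<jobid>.out        # stream the job log
badc telemetry --log data/telemetry/infer/log.jsonl    # run from the dataset root
```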
## 5. Collect results
- Once the job finishes, verify new JSON appears under `artifacts/infer/<recording>/` (a quick check is sketched after this step).
- Run aggregation locally or as a follow-up job:

```console
$ badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv
$ datalad save -m "Aggregate GNWT-290 Sockeye run"
```
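One quick, hedged way to confirm the per-recording JSON landed before aggregating (this assumes one directory per recording under `artifacts/infer/`, as above):

```bash
# Count recording directories and peek at the JSON outputs.
find artifacts/infer -mindepth 1 -maxdepth 1 -type d | wc -l
find artifacts/infer -name '*.json' | head
```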
## 6. Push back to Chinook/GitHub
From the dataset root:

```console
$ datalad push --to origin
$ datalad push --to arbutus-s3
```
Update `CHANGE_LOG.md` with the commands executed.
## Troubleshooting
- `apptainer exec` fails with `failed to communicate with slurmstepd`: make sure you request GPUs via `--gres` and include `--nv`.
- DataLad complains about a dirty worktree: ensure `datalad run` executes inside the dataset root and that your manifest path is relative to that directory.
- GPU count mismatch: Sockeye injects `CUDA_VISIBLE_DEVICES`. Let BADC detect GPUs automatically (the default) or pass `--max-gpus` explicitly to stay within the allocation. A quick GPU-visibility check follows.
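When GPU behaviour looks wrong, a short interactive check (assuming the image path from step 3) usually narrows it down:

```bash
# Grab one GPU interactively (see "GPU planning"):
srun --partition=gpu --gres=gpu:1 --pty bash

# Then, inside the allocation:
echo "$CUDA_VISIBLE_DEVICES"    # injected by SLURM on Sockeye
apptainer exec --nv /project/pi-mygroup/containers/badc-hawkears.sif nvidia-smi
```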