# Sockeye GPU Workflow
Sockeye is UBC ARC’s multi-tenant GPU cluster. This page captures the conventions we use to run HawkEars inference there without breaking cluster policies.
## Prerequisites
- Active Sockeye account with permission to submit GPU jobs.
- `module load apptainer` and `module load cuda/<version>` available on login nodes.
- DataLad + git-annex installed in your home or project environment (see `notes/datalad-plan.md`).
- Data staged in a project directory (`/project/<pi>/badc`) or on the scratch filesystem; avoid using `$HOME` for large data.
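The `$HOME` rule above is easy to enforce with a small guard in job scripts. This is a sketch of the convention, not a BADC feature; the `check_data_root` helper is illustrative:

```bash
#!/usr/bin/env bash
# Illustrative guard (hypothetical helper, not part of the badc CLI):
# refuse dataset roots that live under $HOME.
check_data_root() {
  case "$1" in
    "$HOME"|"$HOME"/*)
      echo "refusing $1: stage data in /project/<pi>/badc or scratch, not \$HOME" >&2
      return 1 ;;
  esac
  return 0
}

check_data_root /project/pi-mygroup/badc && echo "data root ok"
```

Call it on `$DATASET` at the top of any sbatch script before touching the data.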
## Resource requests
- GPU partitions: `gpu` (A100/H100 mix) and `gpu-a40` (A40 cards). Use `--gres=gpu:<N>` where `N ≤ 4` per job.
- Typical HawkEars run: `--time=04:00:00 --mem=64G --cpus-per-task=8` works for four concurrent GPUs. Adjust according to chunk duration.
- Set `--account=<pi-allocation>` so the job charges the correct project.
## Working with DataLad
- Clone the source repo (with submodules) on the login node: `git clone --recurse-submodules`.
- For each dataset, run `badc data connect bogus --path /project/<pi>/badc/data/datalad` or point at your production dataset clone. The CLI writes metadata to `~/.config/badc/data.toml` so future jobs can locate it.
- Inside job scripts, prefer `datalad run` to wrap the inference command. This records provenance and keeps outputs inside the dataset root.
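If a job script needs to recover the dataset path recorded by `badc data connect`, a few lines of `sed` are enough. The `[datasets.<name>]` / `path = "…"` layout of `~/.config/badc/data.toml` shown below is an assumption, so check your actual file before relying on this sketch:

```bash
# Assumed (hypothetical) layout of ~/.config/badc/data.toml:
#   [datasets.bogus]
#   path = "/project/pi-mygroup/badc/data/datalad/bogus"
dataset_path() {
  local toml=$1 name=$2
  # Print the path line inside the [datasets.<name>] table.
  sed -n "/^\[datasets\.$name\]/,/^\[/s/^path = \"\(.*\)\"/\1/p" "$toml"
}

cat > /tmp/data.toml <<'EOF'
[datasets.bogus]
path = "/project/pi-mygroup/badc/data/datalad/bogus"
EOF
dataset_path /tmp/data.toml bogus
```

A proper TOML parser is safer if the file grows more structure; this is just enough for a batch script.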
## Example SLURM script
```bash
#!/bin/bash
#SBATCH --job-name=badc-hawkears
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --account=pi-mygroup
#SBATCH --output=logs/%x-%j.out

module load apptainer/1.3 cuda/12.2
source /project/pi-mygroup/badc/.venv/bin/activate

DATASET=/project/pi-mygroup/badc/data/datalad/bogus
cd "$DATASET"

# Ensure latest audio
datalad update --how=merge --recursive

MANIFEST=manifests/GNWT-290.csv
IMG=/project/pi-mygroup/containers/badc-hawkears.sif

datalad run -m "hawkears $(basename "$MANIFEST")" \
  --input "$MANIFEST" \
  --output artifacts/infer \
  -- \
  apptainer exec --nv "$IMG" badc infer run "$MANIFEST" \
    --use-hawkears --max-gpus 4 --hawkears-arg --min_score --hawkears-arg 0.7
```
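One easy way to lose a job: SLURM does not create the directory named in `--output=logs/%x-%j.out`, so if `logs/` is missing the job cannot write its log. Create it before submitting (the script filename below is illustrative):

```bash
# SLURM will not create logs/ for --output=logs/%x-%j.out; make it first.
mkdir -p logs
# sbatch badc-hawkears.sh   # hypothetical script filename
echo "log dir ready: $(ls -d logs)"
```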
## Automatic GPU detection
- Run `badc gpus` inside the allocation after loading modules to confirm the devices exposed via `CUDA_VISIBLE_DEVICES` match the `--gres` request.
- Limit utilization with `--max-gpus` when you intentionally reserve more GPUs than needed (e.g., staging pipelines or throttling HawkEars concurrency).
- CPU-only dry runs: supply `--cpu-workers N` (>= 1) and omit `--max-gpus` to request additional CPU threads; BADC will still add at least one CPU worker automatically when no GPUs are detected.
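The `badc gpus` cross-check can also be done at the shell level by counting the devices SLURM exposed. The `CUDA_VISIBLE_DEVICES` value is hard-coded here for illustration; inside a real allocation SLURM sets it:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3   # illustrative; SLURM sets this inside the allocation
expected=4                     # should match the --gres=gpu:4 request

# Split the comma-separated device list and count entries.
IFS=',' read -r -a devs <<< "$CUDA_VISIBLE_DEVICES"
if [ "${#devs[@]}" -ne "$expected" ]; then
  echo "expected $expected GPUs, got ${#devs[@]}" >&2
  exit 1
fi
echo "visible GPUs: ${#devs[@]}"
```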
## Job arrays & manifests
Batch many manifests by storing their relative paths in a text file and launching a SLURM array:

```bash
#SBATCH --array=1-10

MANIFEST=$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifests.txt)
datalad run -m "hawkears $(basename "$MANIFEST")" \
  --input "$MANIFEST" --output artifacts/infer \
  -- apptainer exec --nv "$IMG" badc infer run "$MANIFEST" --use-hawkears
```

Keep array jobs idempotent by writing outputs under `artifacts/infer/<recording>/` inside the dataset so failed tasks can be retried with `datalad rerun`.
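The `sed -n "${SLURM_ARRAY_TASK_ID}p"` lookup simply selects line N of `manifests.txt` for array task N. A self-contained illustration (the task ID is fixed here; SLURM sets it in real array jobs):

```bash
# Two example manifest paths, one per line.
printf '%s\n' manifests/GNWT-290.csv manifests/GNWT-291.csv > manifests.txt

SLURM_ARRAY_TASK_ID=2   # set by SLURM in a real array job
MANIFEST=$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifests.txt)
echo "$MANIFEST"        # → second line of manifests.txt
```

Because SLURM array indices start wherever `--array` says they do, keep `--array=1-<lines>` in sync with the line count of `manifests.txt` (lines are 1-indexed for `sed`).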
## Automated script generation
Instead of hand-authoring sbatch files, run `badc infer orchestrate --sockeye-script` to emit a ready-to-submit script that already knows about your manifests, telemetry logs, and aggregation paths. Example (recorded 2025-12-10 after the bogus dataset refresh):

```bash
badc infer orchestrate data/datalad/bogus \
  --manifest-dir manifests \
  --include-existing \
  --sockeye-script artifacts/sockeye/bogus_bundle.sh \
  --sockeye-job-name badc-bogus \
  --sockeye-account pi-fresh \
  --sockeye-partition gpu \
  --sockeye-gres gpu:4 \
  --sockeye-time 06:00:00 \
  --sockeye-cpus-per-task 8 \
  --sockeye-mem 64G \
  --sockeye-resume-completed \
  --sockeye-log-dir /scratch/$USER/badc-logs \
  --sockeye-bundle \
  --sockeye-bundle-aggregate-dir artifacts/aggregate \
  --sockeye-bundle-bucket-minutes 30
```

The generated script populates `MANIFESTS`, `OUTPUTS`, `TELEMETRY`, and (when `--sockeye-resume-completed` is enabled) `RESUMES` arrays. Each array index receives the right telemetry summary and automatically appends `--resume-summary` so chunks already marked `success` are skipped inline. Adding `--sockeye-bundle` injects the aggregation steps directly after `badc infer run`:

```bash
AGG_CMD=(badc infer aggregate "$OUTPUT" --manifest "$MANIFEST" --output "$SUMMARY_PATH" --parquet "$PARQUET_PATH")
BUNDLE_CMD=(badc report bundle --parquet "$PARQUET_PATH" --output-dir "$AGGREGATE_DIR" --bucket-minutes 30)
```

Because the dataset lives under DataLad/git-annex, the manifest/telemetry/resume entries resolve to `.git/modules/.../annex/objects` locations. Always change into the dataset root (`cd "$DATASET"` as shown in the script) so `datalad` can materialise those files via `datalad get` before the job starts.

Use `--sockeye-log-dir` when you want telemetry/summary logs to land under a cluster-friendly path (e.g., `$SCRATCH/logs/badc`). The emitted script pre-creates the directory, points both the `TELEMETRY` and `RESUMES` arrays at that location, echoes whether each resume summary exists, and prints bundle artifact locations (summary/parquet/duckdb) after the reporting commands finish so you can archive logs alongside SLURM output.

The array script also validates `artifacts/chunks/<recording>/.chunk_status.json` before running HawkEars. If the status file is missing or not `completed`, the task exits with a clear error so you can rerun `badc chunk orchestrate --apply` (or copy the missing chunk artifacts) before consuming GPU allocations.
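The pre-flight status check can be reproduced standalone. The exact JSON layout of `.chunk_status.json` (a top-level `status` field) is an assumption here, so treat this as a sketch rather than the generated script's literal logic:

```bash
# Sketch of the chunk-status guard; assumes a top-level "status" field
# in .chunk_status.json (layout is an assumption).
chunk_ready() {
  local f=$1
  [ -f "$f" ] || { echo "missing $f; rerun badc chunk orchestrate --apply" >&2; return 1; }
  grep -q '"status"[[:space:]]*:[[:space:]]*"completed"' "$f" || {
    echo "chunks behind $f not completed" >&2; return 1
  }
}

mkdir -p artifacts/chunks/GNWT-290
printf '{"status": "completed"}\n' > artifacts/chunks/GNWT-290/.chunk_status.json
chunk_ready artifacts/chunks/GNWT-290/.chunk_status.json && echo "chunks ok"
```

Running a check like this on the login node before `sbatch` saves a wasted GPU allocation.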
## Telemetry & monitoring
- Use `badc telemetry --log data/telemetry/infer/log.jsonl` to inspect the last few chunks.
- For real-time GPU stats, add `nvidia-smi dmon -s pucm -i 0,1,2,3` in a separate `srun` shell, or run `badc gpus` at job start to confirm visibility.
- Generate scripts with `--sockeye-log-dir /scratch/$USER/badc-logs` when SLURM nodes cannot write telemetry into the DataLad tree. The script pre-creates that directory, redirects telemetry + resume summaries there, and prints the final bundle artifact paths so you can tail logs from a login node without `datalad get`.
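Since the telemetry log is JSONL (one object per line), plain `tail` is enough for a quick look from a login node when the log lives on scratch. The entries below are fabricated for illustration:

```bash
LOG=data/telemetry/infer/log.jsonl
mkdir -p "$(dirname "$LOG")"
# Illustrative entries; real logs are written by badc during inference.
printf '{"chunk": "a"}\n{"chunk": "b"}\n{"chunk": "c"}\n' > "$LOG"

tail -n 2 "$LOG"   # last two telemetry records
```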
## Common pitfalls
- Forgetting `module load apptainer` causes `exec: apptainer: not found` errors.
- Always run `datalad run` from inside the dataset root; otherwise `--print-datalad-run` refuses to emit commands.
- Clean up scratch: Sockeye purges `$SCRATCH` periodically. Store authoritative outputs back in the DataLad dataset and push to Chinook/GitHub once the run finishes.