Run Inference on Sockeye ======================== This cookbook stitches together DataLad staging, Apptainer containers, and Sockeye SLURM jobs so you can blast through manifest-sized HawkEars runs on the cluster. .. contents:: Steps :local: :depth: 1 Pre-flight checklist -------------------- * Review the Sockeye HPC page for the latest partition, module, and job-array guidelines. * Ensure the DataLad dataset is clean on Chinook before copying manifests to Sockeye (``datalad status`` should report "nothing to save"). * Run ``badc data status`` on both Chinook and Sockeye to confirm the bogus/production datasets point at the same commit. * Capture the commands you will run (``datalad run`` invocation, ``sbatch`` submission) in ``CHANGE_LOG.md`` as soon as the batch finishes. GPU planning ------------ * Use ``badc gpus`` inside an interactive ``srun --gres=gpu:1 --pty bash`` shell to confirm which devices will be available to the job. * If a manifest requires fewer workers than GPUs requested, pass ``--max-gpus`` to keep HawkEars from spawning unnecessary processes while still holding the reservation for future chunks. * Sockeye arrays: ``badc infer orchestrate --sockeye-script sockeye_array.sh`` now emits a ready-to-use SLURM array script (one task per manifest). Pair this with ``sbatch sockeye_array.sh`` to run the entire plan without hand-editing bash snippets. Store the command string in the job log for provenance. When re-running an interrupted array, add ``--sockeye-resume-completed`` so the generated script automatically passes ``--resume-summary`` if the prior telemetry ``*.summary.json`` exists—finished chunks are skipped automatically. Use ``--sockeye-bundle`` when generating the script to automatically append ``badc infer aggregate`` + ``badc report bundle`` per manifest, leaving behind the CSV/Parquet/DuckDB exports (quicklook, parquet report, DuckDB) right after each recording finishes. Notebook hand-off ----------------- * After collecting results (step 5), launch the notebook gallery (the notebook gallery (``docs/notebooks/index``)) locally or on Chinook to visualize detection counts (``notebooks/aggregate_analysis.ipynb``) before pushing to collaborators. * The same datasets used for inference can host derived tables—run ``badc infer aggregate`` inside the dataset, ``datalad save``, then open the notebook with ``datalad run jupyter lab`` if you need provenance for figure generation. 1. Prepare the dataset on Chinook --------------------------------- * Clone or create the DataLad dataset in ``/project//badc/data/datalad/``. * Populate (or update) manifests under ``manifests/`` and save with ``datalad save``. * Push metadata + content upstream: .. code-block:: console $ datalad push --to origin $ datalad push --to arbutus-s3 2. Stage code + containers on Sockeye ------------------------------------- * Clone this repo with submodules into ``/project//badc``. * Build (or copy) ``badc-hawkears.sif`` into ``/project//containers``. * Create a Python virtual environment for helper scripts (optional if container used for everything). 3. Draft the manifest-specific job script ----------------------------------------- Hint: ``badc infer orchestrate --sockeye-script job.sh`` can emit this SLURM array template automatically (one array index per manifest). Edit the generated script or start from the example below if you prefer a custom layout. .. code-block:: bash #!/bin/bash #SBATCH --job-name=hawkears-gnwt290 #SBATCH --partition=gpu #SBATCH --gres=gpu:4 #SBATCH --cpus-per-task=8 #SBATCH --mem=64G #SBATCH --time=04:00:00 #SBATCH --output=logs/%x-%j.out module load apptainer/1.3 cuda/12.2 DATASET=/project/pi-mygroup/badc/data/datalad/bogus MANIFEST=manifests/GNWT-290.csv IMG=/project/pi-mygroup/containers/badc-hawkears.sif cd "$DATASET" datalad update --how=merge --recursive datalad run -m "hawkears $(basename "$MANIFEST")" \ --input "$MANIFEST" \ --output artifacts/infer \ -- \ apptainer exec --nv "$IMG" badc infer run "$MANIFEST" \ --use-hawkears --max-gpus 4 --hawkears-arg --min_score --hawkears-arg 0.7 4. Submit + monitor ------------------- * ``sbatch job.sh`` * ``squeue -u $USER`` to watch state. * Tail the log: ``tail -f logs/hawkears-gnwt290-.out``. * Inspect telemetry in real time: ``badc telemetry --log data/telemetry/infer/log.jsonl`` (from the dataset root). 5. Collect results ------------------ * Once the job finishes, verify new JSON appears under ``artifacts/infer//``. * Run aggregation locally or as a follow-up job: .. code-block:: console $ badc infer aggregate artifacts/infer --output artifacts/aggregate/summary.csv $ datalad save -m "Aggregate GNWT-290 Sockeye run" 6. Push back to Chinook/GitHub ------------------------------ * From the dataset root: .. code-block:: console $ datalad push --to origin $ datalad push --to arbutus-s3 * Update ``CHANGE_LOG.md`` with the commands executed. Troubleshooting --------------- * ``apptainer exec`` fails with ``failed to communicate with slurmstepd`` – make sure you request GPUs via ``--gres`` and include ``--nv``. * DataLad complains about dirty worktree – ensure ``datalad run`` executes inside the dataset root and that your manifest path is relative to that directory. * GPU count mismatch – Sockeye injects ``CUDA_VISIBLE_DEVICES``. Let BADC detect GPUs automatically (default) or pass ``--max-gpus`` explicitly to stay within the allocation.