Chinook Storage Strategy¶

Chinook provides the object storage backend (Arbutus S3) and long-term POSIX space for BADC data. Use it to host DataLad datasets, Apptainer images, and large inference outputs.

S3 special remote ¶

We follow the “GitHub metadata + Arbutus S3 content” pattern documented in notes/datalad-plan.md.
Configure credentials via setup/datalad_config.sh (ignored by git). Required variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_ENDPOINT_URL, S3_BUCKET_NAME, GITHUB_ORG, GITHUB_REPO_NAME.
The bootstrap script scripts/setup_bogus_datalad.sh demonstrates how to:
1. Run datalad create to initialize the dataset.
2. Copy audio fixtures into audio/.
3. Run git annex initremote arbutus-s3 ..., which creates the bucket automatically.
4. Run datalad create-sibling-github to publish the metadata repo.
If the script fails after creating the bucket, delete the bucket (or its git-annex-uuid object) manually before rerunning; reusing an existing bucket is not yet reliable.
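Because a half-finished bootstrap leaves a bucket to clean up, it may be worth confirming that all six variables are set before the script runs. The following is a minimal sketch, not part of the repository: the function name check_datalad_env is hypothetical, and it only checks that each variable is non-empty, not that the credentials are valid.

```shell
# Hypothetical preflight helper (not part of the repo): report which of the
# variables expected from setup/datalad_config.sh are unset or empty.
check_datalad_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY S3_ENDPOINT_URL \
              S3_BUCKET_NAME GITHUB_ORG GITHUB_REPO_NAME; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  unset value   # do not leave a secret lying around in a scratch variable
  return "$missing"
}

# Possible use from the repo root:
#   . setup/datalad_config.sh && check_datalad_env && bash scripts/setup_bogus_datalad.sh
```

The function prints only variable names, never values, so its output is safe to paste into an issue or chat.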

POSIX workspace layout ¶

Keep the source repo, virtual environments, and containers under /project/<pi>/badc so that Sockeye and Chinook use the same paths.
Store large artifacts (e.g., aggregated CSVs) under /data/<pi>/badc if you need higher quotas; symlink them back into the DataLad dataset when saving commits.
When Sockeye jobs write telemetry to $SCRATCH via --sockeye-log-dir, rsync that directory back to Chinook alongside the DataLad pushes so that log and resume files remain reachable for audits.
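The rsync step could be wrapped roughly as follows. This is a sketch under assumptions: the helper name sync_sockeye_logs, the DRY_RUN switch, and the example source and destination in the comments are all hypothetical, and whether Chinook is reached through an SSH host alias or a mounted path depends on your setup.

```shell
# Hypothetical helper: copy a Sockeye log directory to Chinook with rsync.
# With DRY_RUN=1 it only prints the command it would run.
sync_sockeye_logs() {
  src=$1    # directory that was passed to --sockeye-log-dir, e.g. under $SCRATCH
  dest=$2   # destination on Chinook, e.g. "chinook:/project/<pi>/badc/logs" (assumed)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "rsync -av --partial $src/ $dest/"
  else
    # -a preserves timestamps and permissions; --partial keeps partially
    # transferred files so an interrupted copy can pick up where it stopped.
    rsync -av --partial "$src/" "$dest/"
  fi
}
```

The trailing slashes make rsync copy the contents of the log directory rather than nesting it one level deeper at the destination.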

Dataset lifecycle example ¶

On your workstation, prepare manifests and telemetry folders under data/datalad/<name>.
Run datalad save to capture the changes locally, then datalad push --to origin to publish the metadata to GitHub.
From Chinook, run datalad update --how=merge followed by datalad get for any new audio/manifest paths.
After Sockeye jobs finish and push artifacts back (see the “Run Inference on Sockeye” how-to), execute datalad push --to arbutus-s3 on Chinook to ensure annexed WAVs land in the bucket.
Record the sync in CHANGE_LOG.md so collaborators understand which dataset revision reached Chinook.
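The Chinook side of this lifecycle might look like the sketch below. The function name refresh_dataset is hypothetical, and it assumes you run it from inside the dataset clone on Chinook with the paths you want passed as arguments.

```shell
# Hypothetical wrapper for the Chinook refresh step.
refresh_dataset() {
  # Merge the metadata that was pushed to GitHub from the workstation.
  datalad update --how=merge || return 1
  # Fetch annexed content for the requested paths (for example audio/ or a manifest).
  datalad get "$@"
}

# Possible use:
#   refresh_dataset audio manifests
```

Stopping when the update fails avoids running datalad get against a stale file listing.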

Publishing changes ¶

Stage work in the dataset: datalad save -m "Add GNWT-290 chunks".
Push metadata to GitHub: datalad push --to origin.
Push annexed content to Chinook: datalad push --to arbutus-s3.
Record the commands in CHANGE_LOG.md per AGENTS.md.
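The steps above can be chained so that a failed step stops the sequence. This is only a sketch: publish_changes is a hypothetical name, and the final printout is one possible way to collect the commands for CHANGE_LOG.md, not a format prescribed by AGENTS.md.

```shell
# Hypothetical wrapper: save, push metadata, push content, then print the
# commands that were run so they can be pasted into CHANGE_LOG.md.
publish_changes() {
  message=$1
  datalad save -m "$message" || return 1
  datalad push --to origin || return 1       # metadata to GitHub
  datalad push --to arbutus-s3 || return 1   # annexed content to the bucket
  printf '%s\n' \
    "datalad save -m \"$message\"" \
    "datalad push --to origin" \
    "datalad push --to arbutus-s3"
}

# Possible use:
#   publish_changes "Add GNWT-290 chunks"
```

Nothing is printed unless all three commands succeed, so a pasted log entry always reflects a completed publish.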

Credential rotation ¶

Store AWS/GitHub tokens in setup/datalad_config.sh and source it before running the helper scripts.
When rotating credentials, issue datalad siblings configure --name arbutus-s3 ... to update stored access keys without recreating the dataset.
Validate connectivity with git annex testremote arbutus-s3 before launching large transfers.
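A connectivity check after rotation could be scripted along these lines. The helper name verify_arbutus_remote is hypothetical, and the sketch assumes that setup/datalad_config.sh exports the rotated AWS variables and that you run it from inside the dataset. Note that git annex testremote exercises the remote by uploading and removing temporary objects, so it needs write access to the bucket.

```shell
# Hypothetical helper: load the (rotated) credentials, then test the remote.
verify_arbutus_remote() {
  config=${1:-setup/datalad_config.sh}
  if [ ! -r "$config" ]; then
    echo "cannot read $config" >&2
    return 1
  fi
  . "$config" || return 1
  git annex testremote arbutus-s3
}

# Possible use:
#   verify_arbutus_remote && datalad push --to arbutus-s3
```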

Credential hygiene ¶

Never commit setup/datalad_config.sh; the filename is already gitignored.
When sharing instructions, reference environment variables rather than pasting secrets.
On Sockeye, export the same variables in ~/.bashrc or ~/.bash_profile if the job needs to talk to Chinook directly (e.g., datalad push --to arbutus-s3 inside a batch script).
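One way to double-check the first point is to ask Git directly whether the config file is ignored. The helper name config_is_ignored is hypothetical, and the sketch assumes you run it from the root of the repository that contains setup/.

```shell
# Hypothetical check: succeeds only if Git's ignore rules cover the config file.
config_is_ignored() {
  git check-ignore -q "${1:-setup/datalad_config.sh}"
}

# Possible use before committing:
#   config_is_ignored || echo "WARNING: setup/datalad_config.sh is not ignored" >&2
```

git check-ignore only evaluates ignore patterns; it does not detect a file that was committed before the ignore rule was added.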
