# Chinook Storage Strategy
Chinook provides the object storage backend (Arbutus S3) and long-term POSIX space for BADC data. Use it to host DataLad datasets, Apptainer images, and large inference outputs.
## S3 special remote

We follow the "GitHub metadata + Arbutus S3 content" pattern documented in `notes/datalad-plan.md`. Configure credentials via `setup/datalad_config.sh` (ignored by git). Required variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `S3_ENDPOINT_URL`, `S3_BUCKET_NAME`, `GITHUB_ORG`, `GITHUB_REPO_NAME`.

The bootstrap script `scripts/setup_bogus_datalad.sh` demonstrates how to:

1. `datalad create` the dataset.
2. Copy audio fixtures into `audio/`.
3. Run `git annex initremote arbutus-s3 ...`, which creates the bucket automatically.
4. `datalad create-sibling-github` to publish the metadata repo.

If the script fails after creating the bucket, delete the bucket (or its `git-annex-uuid` object) manually before rerunning; reuse support is still fragile.
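The sketch below illustrates roughly what those bootstrap steps look like. The dataset name, fixture path, sibling name, and initremote parameters are assumptions for illustration; `scripts/setup_bogus_datalad.sh` remains the authoritative version.

```bash
# Illustrative bootstrap sequence (names and parameters are assumptions;
# see scripts/setup_bogus_datalad.sh for the real invocation).
source setup/datalad_config.sh                # exports AWS_*, S3_*, GITHUB_* variables

datalad create bogus-dataset && cd bogus-dataset
mkdir -p audio && cp /path/to/fixtures/*.wav audio/   # placeholder fixture path

# git-annex creates the bucket on first use; the host is derived from the endpoint URL here.
# Depending on the endpoint, additional S3 parameters may be required.
git annex initremote arbutus-s3 \
  type=S3 \
  encryption=none \
  protocol=https \
  host="${S3_ENDPOINT_URL#https://}" \
  bucket="$S3_BUCKET_NAME" \
  autoenable=true

datalad save -m "Add audio fixtures"
# Publish the metadata repo; exact create-sibling-github arguments depend on your DataLad version.
datalad create-sibling-github "$GITHUB_ORG/$GITHUB_REPO_NAME" -s origin
```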
## POSIX workspace layout

- Keep the source repo, virtual environments, and containers under `/project/<pi>/badc` so Sockeye and Chinook share paths.
- Store large artifacts (e.g., aggregated CSVs) under `/data/<pi>/badc` if you need higher quotas; symlink them back into the DataLad dataset when saving commits.
- When Sockeye jobs write telemetry to `$SCRATCH` via `--sockeye-log-dir`, rsync that directory back to Chinook alongside the DataLad pushes so log and resume files remain reachable for audits (see the sketch below).
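A minimal rsync sketch for that telemetry copy, assuming a `chinook` SSH alias and the `/project` layout above; the log directory and destination path are placeholders.

```bash
# Copy Sockeye telemetry (logs + resume files) back to the Chinook workspace.
# "chinook" is an assumed SSH alias; replace <pi>/<name> and the log directory
# with your allocation, dataset, and --sockeye-log-dir value.
rsync -avz --partial \
  "$SCRATCH/badc-logs/" \
  chinook:/project/<pi>/badc/data/datalad/<name>/telemetry/
```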
## Dataset lifecycle example

1. On your workstation, prepare manifests and telemetry folders under `data/datalad/<name>`. Run `datalad save` to capture the changes locally, then `datalad push --to origin` (GitHub metadata).
2. From Chinook, run `datalad update --how=merge` followed by `datalad get` for any new audio/manifest paths.
3. After Sockeye jobs finish and push artifacts back (see the "Run Inference on Sockeye" how-to), execute `datalad push --to arbutus-s3` on Chinook to ensure annexed WAVs land in the bucket.
4. Record the sync in `CHANGE_LOG.md` so collaborators understand which dataset revision reached Chinook.
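Condensed, the Chinook side of that cycle (steps 2 and 3) might look like the following; the dataset path and the fetched subpaths are placeholders.

```bash
# Run from the dataset clone on Chinook; paths are placeholders.
cd /project/<pi>/badc/data/datalad/<name>
datalad update --how=merge        # pull new metadata committed from the workstation
datalad get audio/ manifests/     # fetch any newly referenced audio/manifest content
datalad push --to arbutus-s3      # upload annexed WAVs produced by Sockeye to the bucket
```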
## Publishing changes

1. Stage work in the dataset: `datalad save -m "Add GNWT-290 chunks"`.
2. Push metadata to GitHub: `datalad push --to origin`.
3. Push annexed content to Chinook: `datalad push --to arbutus-s3`.
4. Record the commands in `CHANGE_LOG.md` per `AGENTS.md`.
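Put together, the publish sequence is three commands (the commit message here is just an example):

```bash
datalad save -m "Add GNWT-290 chunks"   # stage work in the dataset
datalad push --to origin                # metadata -> GitHub
datalad push --to arbutus-s3            # annexed content -> Arbutus S3 bucket
```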
## Credential rotation

- Store AWS/GitHub tokens in `setup/datalad_config.sh` and source it before running the helper scripts.
- When rotating credentials, issue `datalad siblings configure --name arbutus-s3 ...` to update stored access keys without recreating the dataset.
- Validate connectivity with `git annex testremote arbutus-s3` before launching large transfers.
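After swapping keys in `setup/datalad_config.sh`, a quick validation pass might look like this; the exact `datalad siblings configure` arguments are intentionally elided, as in the list above.

```bash
# Load the rotated credentials, then exercise the remote before launching large transfers.
source setup/datalad_config.sh
git annex testremote arbutus-s3 --fast   # short round-trip test of store/retrieve/remove
```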
## Credential hygiene

- Never commit `setup/datalad_config.sh`; the filename is already gitignored.
- When sharing instructions, reference environment variables rather than pasting secrets.
- On Sockeye, export the same variables in `~/.bashrc` or `~/.bash_profile` if the job needs to talk to Chinook directly (e.g., `datalad push --to arbutus-s3` inside a batch script).
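One way to make those variables available to Sockeye batch jobs, assuming the shared `/project` layout described above (replace `<pi>` with your allocation):

```bash
# Snippet for ~/.bashrc (or ~/.bash_profile) on Sockeye.
# Sources the gitignored config so `datalad push --to arbutus-s3` works inside batch scripts.
if [ -f "/project/<pi>/badc/setup/datalad_config.sh" ]; then
  source "/project/<pi>/badc/setup/datalad_config.sh"
fi
```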