Data Repository Commands¶
The badc data namespace abstracts how users clone, update, and detach
DataLad-backed repositories that hold primary bird audio. It mirrors the
practices captured in notes/datalad-plan.md and stores the results in a
small TOML registry so other tools (chunking, inference, notebooks) know where
those datasets live.
Overview¶
- Runs happily inside a virtual environment without elevating privileges.
- Supports both `git` and `datalad` clone flows. BADC auto-detects which tool is available unless `--method` forces a choice.
- Records every connected dataset in a per-user config file (`~/.config/badc/data.toml` by default, override via `BADC_DATA_CONFIG`).
- Knows about the `bogus` dataset out of the box so people can smoke-test the CLI without touching production remotes. You can override the URL via `--url` for custom mirrors or private deployments.
- When `NAME` is `bogus` and the repository contains the `data/datalad/bogus` git submodule, `badc data connect` automatically runs `git submodule update --init --recursive data/datalad/bogus` instead of cloning a new copy, keeping the registry aligned with the in-tree dataset.
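The config-file lookup and the tool auto-detection described above can be sketched in a few lines of Python. This is an illustrative sketch only, not BADC's actual code: the helper names `resolve_config_path` and `detect_method` are hypothetical, and the only behavior assumed is what the list states (the `BADC_DATA_CONFIG` override, and a preference for `datalad` with a fallback to `git`).

```python
import os
import shutil
from pathlib import Path

# Default registry location named in the overview.
DEFAULT_CONFIG = Path("~/.config/badc/data.toml")


def resolve_config_path(environ=None):
    """Return the registry path, honouring BADC_DATA_CONFIG when it is set.

    Hypothetical helper: the name and signature are not taken from BADC.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("BADC_DATA_CONFIG")
    if override:
        return Path(override).expanduser()
    return DEFAULT_CONFIG.expanduser()


def detect_method(forced=None, which=shutil.which):
    """Pick the clone tool: an explicit choice wins, then datalad, then git.

    Hypothetical helper; `which` is injectable so the logic can be tested
    without depending on what is installed on PATH.
    """
    if forced is not None:
        if forced not in ("git", "datalad"):
            raise ValueError(f"unsupported method: {forced!r}")
        return forced
    return "datalad" if which("datalad") else "git"
```

Calling `detect_method()` with no arguments inspects the real `PATH`, and passing `forced="git"` mirrors `--method git`.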
Dataset registry¶
Each successful connect run (without --dry-run) writes an entry that
looks like this:
[datasets.bogus]
path = "/home/user/projects/badc/data/datalad/bogus"
url = "https://github.com/UBC-FRESH/badc-bogus-data.git"
method = "datalad"
status = "connected"
last_connected = "2025-01-12T21:07:14.889192+00:00"
You can point BADC at an alternate config file by exporting
BADC_DATA_CONFIG=/path/to/custom.toml before invoking the CLI. That is
handy when staging datasets on different filesystems (e.g., NVMe for inference,
Ceph for archiving) or when running automated tests that should not touch the
real registry.
badc data connect¶
Clone (or refresh) a dataset under data/datalad/<name> and track it in the
registry. The command returns immediately after the clone completes or after a
pull, so you can chain it inside scripts or datalad run records.
Usage:
badc data connect NAME [--path PATH] [--url URL] [--method git|datalad]
[--pull / --no-pull] [--dry-run]
Key options:
- `NAME`: Dataset identifier. `bogus` resolves to the public sample repository maintained for this project. Unknown names require `--url`.
- `--path`: Base directory that will hold dataset folders. Defaults to `data/datalad` relative to the current working tree.
- `--url`: Override the clone URL. Required when `NAME` is not in the built-in table defined in `badc.data.DEFAULT_DATASETS`.
- `--method`: Force `git` or `datalad`. When omitted, BADC prefers `datalad` if the binary is on `PATH` and falls back to `git` otherwise.
- `--pull / --no-pull`: Controls what happens when the target directory already exists. `--pull` (default) merges upstream changes; `--no-pull` simply confirms the presence of the dataset.
- `--dry-run`: Print what would happen without touching the filesystem.
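The rule that unknown names require `--url` might look roughly like this. It is a sketch under stated assumptions: the `DEFAULT_DATASETS` dictionary here is a stand-in whose shape is guessed (the real `badc.data.DEFAULT_DATASETS` may differ), the `bogus` URL is taken from the registry example, and `resolve_url` and `target_dir` are hypothetical helpers.

```python
from pathlib import Path

# Stand-in for badc.data.DEFAULT_DATASETS; the real table's shape is an assumption.
DEFAULT_DATASETS = {
    "bogus": "https://github.com/UBC-FRESH/badc-bogus-data.git",
}


def resolve_url(name, url=None, defaults=DEFAULT_DATASETS):
    """Return the clone URL: an explicit --url wins, then the built-in table."""
    if url:
        return url
    try:
        return defaults[name]
    except KeyError:
        raise ValueError(f"unknown dataset {name!r}: pass --url") from None


def target_dir(name, base="data/datalad"):
    """Dataset folder under the base directory, i.e. <base>/<name>."""
    return Path(base) / name
```

For example, `resolve_url("sockeye-prod")` raises `ValueError`, while passing the URL explicitly returns it unchanged.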
Option reference¶
| Option / Argument | Description | Default |
|---|---|---|
| `NAME` | Dataset registry key. | Required |
| `--path` | Base directory where datasets are created/updated. | `data/datalad` |
| `--url` | Override clone URL for custom/private datasets. | Registry value or required for unknown names |
| `--method` | Force clone implementation. Auto-detected when omitted. | Auto |
| `--pull / --no-pull` | Update an existing dataset after verifying it exists. | `--pull` |
| `--dry-run` | Print planned actions without touching disk or the registry. | Disabled |
Help excerpt¶
$ badc data connect --help
Usage: badc data connect [OPTIONS] NAME
Clone or update a DataLad dataset and record its metadata.
Arguments:
NAME Dataset name, e.g., 'bogus'. [required]
Options:
--path DIRECTORY Target path for the dataset. [default: data/datalad]
--url TEXT Override dataset URL (required for unknown names).
--method TEXT Preferred clone method: git or datalad.
--pull / --no-pull Update the dataset when it already exists locally.
--dry-run / --apply Preview actions without running commands.
--help Show this message and exit.
Examples:
# Clone the public bogus dataset using DataLad (auto-detected)
badc data connect bogus
# Clone into a scratch filesystem and skip pulling if it already exists
badc data connect bogus --path /mnt/scratch/badc --no-pull
# Register a private repository
badc data connect sockeye-prod --url git@github.com:UBC-FRESH/badc-prod-data.git
See Bootstrap a checkout for a complete checkout walkthrough.
badc data disconnect¶
Remove a dataset from the active registry and optionally delete its contents.
The command never deletes anything unless you pass --drop-content—with that flag
enabled BADC first runs datalad drop --recursive --reckless auto (when available)
so annexed content is removed cleanly before the directory tree is deleted.
Usage:
badc data disconnect NAME [--drop-content / --keep-content]
[--path PATH] [--dry-run]
- `--drop-content`: Recursively delete the dataset directory after marking it as disconnected.
- `--path`: Fallback search root when the dataset is not in the registry, useful for first-time disconnects or misconfigured machines.
- `--dry-run`: Preview the deletion/recording steps without touching the filesystem.
Option reference¶
| Option / Argument | Description | Default |
|---|---|---|
| `NAME` | Dataset identifier to mark as disconnected. | Required |
| `--drop-content / --keep-content` | Remove dataset files after recording the disconnection. | `--keep-content` |
| `--path` | Base directory to search when the registry entry is missing. | |
| `--dry-run` | Emit the pending actions without deleting or editing files. | Disabled |
Help excerpt¶
$ badc data disconnect --help
Usage: badc data disconnect [OPTIONS] NAME
Mark a dataset as disconnected and optionally drop its contents.
Arguments:
NAME Dataset name to detach. [required]
Options:
--drop-content / --keep-content Drop annexed files after disconnecting.
--path DIRECTORY Base directory that holds dataset folders.
--dry-run / --apply Preview actions without modifying files.
--help Show this message and exit.
badc data status¶
List every dataset currently recorded in the registry along with its local path and whether the directory still exists. This is the fastest way to confirm that the bogus dataset submodule was connected correctly or to audit scratch mounts before a cleanup.
Usage:
badc data status [--path PATH]
Example output (after git submodule update --init --recursive followed by badc data connect bogus):
$ badc data status
NAME PATH METHOD STATUS LAST_CONNECTED
bogus /home/user/projects/badc/data/datalad/bogus datalad connected 2025-12-10T02:18:31+00:00
If PATH is missing on disk the command prints missing in the STATUS column and suggests
rerunning badc data connect NAME --pull.
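The `missing` check can be approximated outside the CLI for scripts that already hold the parsed registry. In this sketch `status_rows` is a hypothetical helper, and the choice to flag only entries recorded as `connected` is an assumption rather than documented BADC behavior.

```python
from pathlib import Path


def status_rows(datasets, exists=lambda p: Path(p).is_dir()):
    """Build (name, path, method, status) rows from parsed registry entries.

    A connected entry whose directory is gone is reported as 'missing'.
    `exists` is injectable so the check can be tested without real directories.
    """
    rows = []
    for name, entry in sorted(datasets.items()):
        status = entry.get("status", "unknown")
        if status == "connected" and not exists(entry["path"]):
            status = "missing"
        rows.append((name, entry["path"], entry.get("method", ""), status))
    return rows


# Example with a path that is very unlikely to exist on this machine.
demo = {"bogus": {"path": "/nonexistent/badc/bogus", "method": "datalad", "status": "connected"}}
for row in status_rows(demo):
    print(*row)
```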
Cleanup workflow¶
When you are done with a dataset copy (or need to reclaim space on a dev server), follow this pattern:
1. `datalad status` inside the dataset to confirm there are no uncommitted changes.
2. `datalad drop --recursive --reckless auto` to remove annexed content but keep metadata.
3. `badc data disconnect NAME --drop-content` if you want to delete the working tree entirely and remove it from the registry. Use `--keep-content` when you only want to mark it inactive.
4. Rerun `badc data status` to confirm the entry flipped to `disconnected`.
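For scripted cleanups, the steps above can be laid out as an ordered command plan before anything is executed. The sketch only builds and prints the commands; `cleanup_commands` is a hypothetical helper, and the command lines are copied from the steps above.

```python
import shlex


def cleanup_commands(name, dataset_path, drop_content=True):
    """Return (cwd, argv) pairs for the cleanup steps, in order.

    cwd is the dataset directory for the datalad steps and None for the
    badc steps, which run from the current working tree.
    """
    flag = "--drop-content" if drop_content else "--keep-content"
    return [
        (dataset_path, ["datalad", "status"]),
        (dataset_path, ["datalad", "drop", "--recursive", "--reckless", "auto"]),
        (None, ["badc", "data", "disconnect", name, flag]),
        (None, ["badc", "data", "status"]),
    ]


# Print the plan instead of running it, so it can be reviewed first.
for cwd, argv in cleanup_commands("bogus", "data/datalad/bogus"):
    print(f"[{cwd or '.'}] {shlex.join(argv)}")
```

Each pair can be handed to `subprocess.run(argv, cwd=cwd, check=True)` once the printed plan looks right.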
Because the bogus dataset lives at data/datalad/bogus as a git submodule, badc data connect
prefers updating the existing checkout instead of recloning. If you do delete that directory,
badc data connect bogus automatically re-runs
git submodule update --init --recursive data/datalad/bogus before refreshing the registry entry.
See notes/datalad-plan.md for end-to-end scenarios (clone, publish, cleanup).
The registry retains the last known path and timestamp so future connect
operations can reconcile state when pointed at the same location.
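The idea that a disconnect flips the status while keeping the last known path and timestamp can be pictured as a small update on the parsed registry. This is a sketch of the concept only: `mark_disconnected` is a hypothetical helper, and how BADC writes the file back to disk is not shown.

```python
import copy


def mark_disconnected(registry, name):
    """Return a copy of the registry with NAME flagged as disconnected.

    The path, url, method, and last_connected fields are left untouched so
    that a later connect can reconcile against them.
    """
    if name not in registry.get("datasets", {}):
        raise KeyError(name)
    updated = copy.deepcopy(registry)
    updated["datasets"][name]["status"] = "disconnected"
    return updated
```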
badc data status¶
Display the tracked datasets, their status (connected or disconnected), filesystem paths, and—when requested—DataLad sibling information. Example summary output:
$ badc data status
Tracked datasets:
- bogus: connected (/home/gep/projects/badc/data/datalad/bogus) [present]
Request extended metadata and siblings when debugging dataset plumbing:
$ badc data status --details --show-siblings
bogus — connected (method: datalad)
Path: /home/gep/projects/badc/data/datalad/bogus
Exists: yes; type: datalad
Siblings:
- origin state=present https://github.com/UBC-FRESH/badc-bogus-data.git
- arbutus-s3 state=present s3://ubc-fresh-badc-bogus-data
Option reference¶
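| Option / Argument | Description | Default |
|---|---|---|
| `--details / --summary` | Toggle extended output (method, filesystem checks, notes). | `--summary` |
| `--show-siblings / --hide-siblings` | Include `datalad siblings` output (requires DataLad). | `--hide-siblings` |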
Help excerpt¶
$ badc data status --help
Usage: badc data status [OPTIONS]
Report all datasets tracked in ~/.config/badc/data.toml.
Options:
--details / --summary Show extended metadata for each dataset.
--show-siblings / --hide-siblings
Include `datalad siblings` output (requires DataLad).
--help Show this message and exit.
Use this command while debugging datalad run pipelines or before chaining a
chunk/infer workflow to confirm that the referenced repositories exist locally.
Automation tips¶
- Combine `badc data connect` with `git submodule update --init --recursive` in bootstrap scripts so cloned worktrees always have both the source tree and the audio datasets they require.
- When integrating with `datalad run`, call `badc data connect` as the first recorded action so downstream provenance captures the origin of the dataset.
- Emit `badc data status` as part of telemetry bundles to help future readers understand which repository revision supplied the raw WAV files.
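A bootstrap script following these tips could be as small as the sketch below. The helper names `bootstrap_commands` and `run_all` are hypothetical, and the choice of `bogus` as the default dataset is only for illustration. The commands themselves are the ones named in the tips.

```python
import subprocess


def bootstrap_commands(datasets=("bogus",)):
    """Ordered commands: init submodules, connect each dataset, report status."""
    commands = [["git", "submodule", "update", "--init", "--recursive"]]
    commands += [["badc", "data", "connect", name] for name in datasets]
    commands.append(["badc", "data", "status"])
    return commands


def run_all(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

Running `run_all(bootstrap_commands())` from the repository root executes the sequence, and swapping in a different `runner` makes it easy to log or dry-run the same plan.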