Data Repository Commands

The badc data namespace abstracts how users clone, update, and detach DataLad-backed repositories that hold primary bird audio. It mirrors the practices captured in notes/datalad-plan.md and stores the results in a small TOML registry so other tools (chunking, inference, notebooks) know where those datasets live.

Overview

Runs happily inside a virtual environment without elevating privileges.
Supports both git and datalad clone flows. BADC auto-detects which tool is available unless --method forces a choice.
Records every connected dataset in a per-user config file (~/.config/badc/data.toml by default, override via BADC_DATA_CONFIG).
Knows about the bogus dataset out of the box so people can smoke-test the CLI without touching production remotes. You can override the URL via --url for custom mirrors or private deployments.
When NAME is bogus and the repository contains the data/datalad/bogus git submodule, badc data connect automatically runs git submodule update --init --recursive data/datalad/bogus instead of cloning a new copy, keeping the registry aligned with the in-tree dataset.

Dataset registry

Each successful connect run (without --dry-run) writes an entry that looks like this:

[datasets.bogus]
path = "/home/user/projects/badc/data/datalad/bogus"
url = "https://github.com/UBC-FRESH/badc-bogus-data.git"
method = "datalad"
status = "connected"
last_connected = "2025-01-12T21:07:14.889192+00:00"

You can point BADC at an alternate config file by exporting BADC_DATA_CONFIG=/path/to/custom.toml before invoking the CLI. That is handy when staging datasets on different filesystems (e.g., NVMe for inference, Ceph for archiving) or when running automated tests that should not touch the real registry.
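The lookup order described here (environment override first, per-user default second) can be sketched as follows. This is a minimal illustration under the assumption that an empty BADC_DATA_CONFIG counts as unset; registry_path is a hypothetical name, not BADC's actual function.

```python
import os
from pathlib import Path

# Default location named in the Overview section.
DEFAULT_CONFIG = Path("~/.config/badc/data.toml")


def registry_path(env=None) -> Path:
    """Resolve the registry file, honouring BADC_DATA_CONFIG when it is set.

    Hypothetical helper; ``env`` defaults to the real process environment
    and can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    override = env.get("BADC_DATA_CONFIG")
    if override:
        return Path(override).expanduser()
    return DEFAULT_CONFIG.expanduser()


print(registry_path({"BADC_DATA_CONFIG": "/tmp/badc-test/data.toml"}))
print(registry_path({}))
```

Passing the environment as a parameter keeps the function easy to exercise in automated tests without touching the real registry.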

`badc data connect`

Clone (or refresh) a dataset under data/datalad/<name> and track it in the registry. The command returns immediately after the clone completes or after a pull, so you can chain it inside scripts or datalad run records.

Usage:

badc data connect NAME [--path PATH] [--url URL] [--method git|datalad]
                     [--pull / --no-pull] [--dry-run]

Key options:

NAME: Dataset identifier. bogus resolves to the public sample repository maintained for this project. Unknown names require --url.
--path: Base directory that will hold dataset folders. Defaults to data/datalad relative to the current working tree.
--url: Override the clone URL. Required when NAME is not in the built-in table defined in badc.data.DEFAULT_DATASETS.
--method: Force git or datalad. When omitted, BADC prefers datalad if the binary is on PATH and falls back to git otherwise.
--pull / --no-pull: Controls what happens when the target directory already exists. --pull (default) merges upstream changes; --no-pull simply confirms the presence of the dataset.
--dry-run: Print what would happen without touching the filesystem.
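The interaction between --method and --url described in the options above can be sketched in a few lines. This is an illustration only: KNOWN_DATASETS is a hypothetical stand-in for badc.data.DEFAULT_DATASETS (whose real structure may differ), and choose_method and resolve_url are names invented for this example.

```python
import shutil

# Hypothetical stand-in for badc.data.DEFAULT_DATASETS; the real table may differ.
KNOWN_DATASETS = {"bogus": "https://github.com/UBC-FRESH/badc-bogus-data.git"}


def choose_method(forced=None, which=shutil.which) -> str:
    """Return the forced method, else datalad if its binary is on PATH, else git."""
    if forced is not None:
        if forced not in ("git", "datalad"):
            raise ValueError(f"unsupported method: {forced!r}")
        return forced
    return "datalad" if which("datalad") else "git"


def resolve_url(name, url=None) -> str:
    """An explicit URL wins; otherwise the name must be a known dataset."""
    if url:
        return url
    if name not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset {name!r}: pass --url")
    return KNOWN_DATASETS[name]


print(choose_method())        # depends on whether datalad is installed locally
print(resolve_url("bogus"))
```

Injecting the which lookup as a parameter makes the auto-detection testable on machines with or without DataLad installed.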

Option reference

Option / Argument	Description	Default
`NAME`	Dataset registry key (`bogus` built-in).	Required
`--path PATH`	Base directory where datasets are created/updated.	`data/datalad`
`--url URL`	Override clone URL for custom/private datasets.	Registry value or required for unknown names
`--method git\|datalad`	Force clone implementation. Auto-detected when omitted.	Auto
`--pull` / `--no-pull`	Update an existing dataset after verifying it exists.	`--pull`
`--dry-run`	Print planned actions without touching disk or the registry.	Disabled

Help excerpt

$ badc data connect --help
Usage: badc data connect [OPTIONS] NAME
  Clone or update a DataLad dataset and record its metadata.
Arguments:
  NAME  Dataset name, e.g., 'bogus'.  [required]
Options:
  --path DIRECTORY        Target path for the dataset.  [default: data/datalad]
  --url TEXT              Override dataset URL (required for unknown names).
  --method TEXT           Preferred clone method: git or datalad.
  --pull / --no-pull      Update the dataset when it already exists locally.
  --dry-run / --apply     Preview actions without running commands.
  --help                  Show this message and exit.

Examples:

# Clone the public bogus dataset using DataLad (auto-detected)
badc data connect bogus

# Clone into a scratch filesystem and skip pulling if it already exists
badc data connect bogus --path /mnt/scratch/badc --no-pull

# Register a private repository
badc data connect sockeye-prod --url git@github.com:UBC-FRESH/badc-prod-data.git

See Bootstrap a checkout for a complete checkout walkthrough.

`badc data disconnect`

Remove a dataset from the active registry and optionally delete its contents. The command never deletes anything unless you pass --drop-content. With that flag, BADC first runs datalad drop --recursive --reckless auto (when DataLad is available) so annexed content is removed cleanly before the directory tree is deleted.

Usage:

badc data disconnect NAME [--drop-content / --keep-content]
                       [--path PATH] [--dry-run]

--drop-content: Recursively delete the dataset directory after marking it as disconnected.
--path: Fallback search root when the dataset is not in the registry, useful for first-time disconnects or misconfigured machines.
--dry-run: Preview the deletion/recording steps without touching the filesystem.
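To make the ordering concrete, here is a minimal sketch of the steps a disconnect would take, expressed as a preview that runs nothing (similar in spirit to --dry-run). The function name, its parameters, and the step wording are assumptions made for illustration; BADC's real implementation may order or phrase things differently.

```python
def plan_disconnect(name, dataset_dir, drop_content=False, datalad_available=True):
    """List the disconnect steps in order as strings, without executing anything.

    Hypothetical helper: the registry is updated first, and content is only
    touched when drop_content is True.
    """
    steps = [f"mark {name} as disconnected in the registry"]
    if drop_content:
        if datalad_available:
            # Annexed content is dropped before the tree itself is removed.
            steps.append(f"run datalad drop in {dataset_dir} to remove annexed content")
        steps.append(f"delete directory tree {dataset_dir}")
    return steps


for step in plan_disconnect("bogus", "data/datalad/bogus", drop_content=True):
    print("-", step)
```

With the default drop_content=False the plan contains only the registry update, which matches the documented guarantee that nothing is deleted unless --drop-content is passed.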

Option reference

Option / Argument	Description	Default
`NAME`	Dataset identifier to mark as disconnected.	Required
`--drop-content` / `--keep-content`	Remove dataset files after recording the disconnection.	`--keep-content`
`--path PATH`	Base directory to search when the registry entry is missing.	`data/datalad`
`--dry-run`	Emit the pending actions without deleting or editing files.	Disabled

Help excerpt

$ badc data disconnect --help
Usage: badc data disconnect [OPTIONS] NAME
  Mark a dataset as disconnected and optionally drop its contents.
Arguments:
  NAME  Dataset name to detach.  [required]
Options:
  --drop-content / --keep-content  Drop annexed files after disconnecting.
  --path DIRECTORY                Base directory that holds dataset folders.
  --dry-run / --apply             Preview actions without modifying files.
  --help                          Show this message and exit.

`badc data status`

List every dataset currently recorded in the registry along with its local path and whether the directory still exists. This is the fastest way to confirm that the bogus dataset submodule was connected correctly or to audit scratch mounts before a cleanup.

Usage:

badc data status [--path PATH]

Example output (after git submodule update --init --recursive followed by badc data connect bogus):

$ badc data status
NAME   PATH                                         METHOD   STATUS     LAST_CONNECTED
bogus  /home/user/projects/badc/data/datalad/bogus  datalad  connected  2025-12-10T02:18:31+00:00

If PATH is missing on disk the command prints missing in the STATUS column and suggests rerunning badc data connect NAME --pull.
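The "missing" check can be pictured as a simple existence test over the registry entries. The sketch below assumes the registry has already been loaded into a dictionary keyed by dataset name; status_rows is a hypothetical helper and does not reproduce BADC's actual output format.

```python
import tempfile
from pathlib import Path


def status_rows(datasets: dict[str, dict]) -> list[tuple[str, str, str]]:
    """Return (name, path, status) rows, reporting 'missing' when the path is absent."""
    rows = []
    for name, entry in sorted(datasets.items()):
        status = entry.get("status", "unknown")
        if not Path(entry["path"]).is_dir():
            status = "missing"
        rows.append((name, entry["path"], status))
    return rows


# Demonstrate with one directory that exists and one that does not.
with tempfile.TemporaryDirectory() as present:
    sample = {
        "bogus": {"path": present, "status": "connected"},
        "gone": {"path": present + "/does-not-exist", "status": "connected"},
    }
    rows = status_rows(sample)

for name, path, status in rows:
    print(f"{name:<8}{status:<12}{path}")
```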

Cleanup workflow

When you are done with a dataset copy (or need to reclaim space on a dev server), follow this pattern:

Run datalad status inside the dataset to confirm there are no uncommitted changes.
Run datalad drop --recursive --reckless auto to remove annexed content while keeping the dataset metadata.
Run badc data disconnect NAME --drop-content to delete the working tree entirely and mark the dataset as disconnected in the registry. Use --keep-content when you only want to mark it inactive.
Rerun badc data status to confirm the entry flipped to disconnected.

Because the bogus dataset lives at data/datalad/bogus as a git submodule, badc data connect prefers updating the existing checkout instead of recloning. If you do delete that directory, badc data connect bogus automatically re-runs git submodule update --init --recursive data/datalad/bogus before refreshing the registry entry. See notes/datalad-plan.md for end-to-end scenarios (clone, publish, cleanup).

The registry retains the last known path and timestamp so future connect operations can reconcile state when pointed at the same location.

`badc data status`

Display the tracked datasets, their status (connected or disconnected), filesystem paths, and—when requested—DataLad sibling information. Example summary output:

$ badc data status
Tracked datasets:
 - bogus: connected (/home/gep/projects/badc/data/datalad/bogus) [present]

Request extended metadata and siblings when debugging dataset plumbing:

$ badc data status --details --show-siblings
bogus — connected (method: datalad)
  Path: /home/gep/projects/badc/data/datalad/bogus
  Exists: yes; type: datalad
  Siblings:
    - origin state=present https://github.com/UBC-FRESH/badc-bogus-data.git
    - arbutus-s3 state=present s3://ubc-fresh-badc-bogus-data

Option reference

Option / Argument	Description	Default
`--details` / `--summary`	Toggle extended output (method, filesystem checks, notes).	`--summary`
`--show-siblings` / `--hide-siblings`	Include `datalad siblings` output (requires DataLad and a dataset with `.datalad` metadata).	`--hide-siblings`

Help excerpt

$ badc data status --help
Usage: badc data status [OPTIONS]
  Report all datasets tracked in ~/.config/badc/data.toml.
Options:
  --details / --summary            Show extended metadata for each dataset.
  --show-siblings / --hide-siblings
                                  Include `datalad siblings` output (requires DataLad).
  --help                          Show this message and exit.

Use this command while debugging datalad run pipelines or before chaining a chunk/infer workflow to confirm that the referenced repositories exist locally.

Automation tips

Combine badc data connect with git submodule update --init --recursive in bootstrap scripts so cloned worktrees always have both the source tree and the audio datasets they require.
When integrating with datalad run, call badc data connect as the first recorded action so downstream provenance captures the origin of the dataset.
Emit badc data status as part of telemetry bundles to help future readers understand which repository revision supplied the raw WAV files.
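A bootstrap script following these tips might look like the sketch below. It only chains commands that appear in this document; the bootstrap function, its dry_run default, and the injectable run parameter are assumptions made for illustration.

```python
import subprocess

# Commands taken from this page, in the order a bootstrap script would need them.
BOOTSTRAP_COMMANDS = [
    ["git", "submodule", "update", "--init", "--recursive"],
    ["badc", "data", "connect", "bogus"],
    ["badc", "data", "status"],
]


def bootstrap(commands=BOOTSTRAP_COMMANDS, run=subprocess.run, dry_run=True):
    """Echo each command and, unless dry_run is True, execute it in order.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit status, which stops the sequence at the first failing command.
    """
    executed = []
    for argv in commands:
        print("+", " ".join(argv))
        if not dry_run:
            run(argv, check=True)
        executed.append(argv)
    return executed


bootstrap()  # dry run: prints the plan without invoking git or badc
```

Call bootstrap(dry_run=False) from a checkout where git and badc are installed to execute the sequence for real.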
