Data Repository Commands

The badc data namespace abstracts how users clone, update, and detach DataLad-backed repositories that hold primary bird audio. It mirrors the practices captured in notes/datalad-plan.md and stores the results in a small TOML registry so other tools (chunking, inference, notebooks) know where those datasets live.

Overview

  • Runs inside a virtual environment and needs no elevated privileges.

  • Supports both git and datalad clone flows—BADC auto-detects which tool is available unless --method forces a choice.

  • Records every connected dataset in a per-user config file (~/.config/badc/data.toml by default, override via BADC_DATA_CONFIG).

  • Knows about the bogus dataset out of the box so people can smoke-test the CLI without touching production remotes. You can override the URL via --url for custom mirrors or private deployments.

  • When NAME is bogus and the repository contains the data/datalad/bogus git submodule, badc data connect automatically runs git submodule update --init --recursive data/datalad/bogus instead of cloning a new copy, keeping the registry aligned with the in-tree dataset.

Dataset registry

Each successful connect run (without --dry-run) writes an entry that looks like this:

[datasets.bogus]
path = "/home/user/projects/badc/data/datalad/bogus"
url = "https://github.com/UBC-FRESH/badc-bogus-data.git"
method = "datalad"
status = "connected"
last_connected = "2025-01-12T21:07:14.889192+00:00"

You can point BADC at an alternate config file by exporting BADC_DATA_CONFIG=/path/to/custom.toml before invoking the CLI. That is handy when staging datasets on different filesystems (e.g., NVMe for inference, Ceph for archiving) or when running automated tests that should not touch the real registry.

badc data connect

Clone (or refresh) a dataset under data/datalad/<name> and track it in the registry. The command returns as soon as the clone or pull completes, so you can chain it inside scripts or datalad run records.

Usage:

badc data connect NAME [--path PATH] [--url URL] [--method git|datalad]
                     [--pull / --no-pull] [--dry-run]

Key options:

NAME

Dataset identifier. bogus resolves to the public sample repository maintained for this project. Unknown names require --url.

--path

Base directory that will hold dataset folders. Defaults to data/datalad relative to the current working tree.

--url

Override the clone URL. Required when NAME is not in the built-in table defined in badc.data.DEFAULT_DATASETS.

--method

Force git or datalad. When omitted, BADC prefers datalad if the binary is on PATH and falls back to git otherwise.

--pull / --no-pull

Controls what happens when the target directory already exists. --pull (default) merges upstream changes; --no-pull simply confirms the presence of the dataset.

--dry-run

Print what would happen without touching the filesystem.
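The URL and method rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in badc.data: the DEFAULT_DATASETS dictionary here is a hypothetical stand-in holding only the bogus URL shown in the registry example, and resolve_url and choose_method are invented helper names.

```python
import shutil

# Hypothetical stand-in for badc.data.DEFAULT_DATASETS; the real table may differ.
DEFAULT_DATASETS = {"bogus": "https://github.com/UBC-FRESH/badc-bogus-data.git"}


def resolve_url(name, url=None):
    """Return the clone URL; unknown names must supply one explicitly."""
    if url:
        return url
    if name in DEFAULT_DATASETS:
        return DEFAULT_DATASETS[name]
    raise ValueError(f"unknown dataset {name!r}: pass --url")


def choose_method(forced=None, which=shutil.which):
    """Honour a forced method, else prefer datalad when it is on PATH."""
    if forced is not None:
        if forced not in ("git", "datalad"):
            raise ValueError("--method must be 'git' or 'datalad'")
        return forced
    return "datalad" if which("datalad") else "git"
```

Passing the PATH lookup in as the which parameter keeps the detection easy to exercise in tests without installing either tool.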

Option reference

Option / Argument     Description                                                   Default
NAME                  Dataset registry key (bogus built-in).                        Required
--path PATH           Base directory where datasets are created/updated.            data/datalad
--url URL             Override clone URL for custom/private datasets.               Registry value or required for unknown names
--method git|datalad  Force clone implementation. Auto-detected when omitted.       Auto
--pull / --no-pull    Update an existing dataset after verifying it exists.         --pull
--dry-run             Print planned actions without touching disk or the registry.  Disabled

Help excerpt

$ badc data connect --help
Usage: badc data connect [OPTIONS] NAME
  Clone or update a DataLad dataset and record its metadata.
Arguments:
  NAME  Dataset name, e.g., 'bogus'.  [required]
Options:
  --path DIRECTORY        Target path for the dataset.  [default: data/datalad]
  --url TEXT              Override dataset URL (required for unknown names).
  --method TEXT           Preferred clone method: git or datalad.
  --pull / --no-pull      Update the dataset when it already exists locally.
  --dry-run / --apply     Preview actions without running commands.
  --help                  Show this message and exit.

Examples:

# Clone the public bogus dataset using DataLad (auto-detected)
badc data connect bogus

# Clone into a scratch filesystem and skip pulling if it already exists
badc data connect bogus --path /mnt/scratch/badc --no-pull

# Register a private repository
badc data connect sockeye-prod --url git@github.com:UBC-FRESH/badc-prod-data.git

See Bootstrap a checkout for a complete checkout walkthrough.

badc data disconnect

Mark a dataset as disconnected in the registry and optionally delete its contents. The command never deletes anything unless you pass --drop-content. With that flag enabled, BADC first runs datalad drop --recursive --reckless auto (when DataLad is available) so annexed content is removed cleanly before the directory tree is deleted.

Usage:

badc data disconnect NAME [--drop-content / --keep-content]
                       [--path PATH] [--dry-run]

Key options:

--drop-content

Recursively delete the dataset directory after marking it as disconnected.

--path

Fallback search root when the dataset is not in the registry, useful for first-time disconnects or misconfigured machines.

--dry-run

Preview the deletion/recording steps without touching the filesystem.

Option reference

Option / Argument                Description                                                   Default
NAME                             Dataset identifier to mark as disconnected.                   Required
--drop-content / --keep-content  Remove dataset files after recording the disconnection.       --keep-content
--path PATH                      Base directory to search when the registry entry is missing.  data/datalad
--dry-run                        Emit the pending actions without deleting or editing files.   Disabled

Help excerpt

$ badc data disconnect --help
Usage: badc data disconnect [OPTIONS] NAME
  Mark a dataset as disconnected and optionally drop its contents.
Arguments:
  NAME  Dataset name to detach.  [required]
Options:
  --drop-content / --keep-content  Drop annexed files after disconnecting.
  --path DIRECTORY                Base directory that holds dataset folders.
  --dry-run / --apply             Preview actions without modifying files.
  --help                          Show this message and exit.

badc data status

List every dataset currently recorded in the registry along with its local path and whether the directory still exists. This is the fastest way to confirm that the bogus dataset submodule was connected correctly or to audit scratch mounts before a cleanup.

Usage:

badc data status [--path PATH]

Example output (after git submodule update --init --recursive followed by badc data connect bogus):

$ badc data status
NAME    PATH                                         METHOD   STATUS      LAST_CONNECTED
bogus   /home/user/projects/badc/data/datalad/bogus  datalad  connected   2025-12-10T02:18:31+00:00

If a dataset's PATH is missing on disk, the command prints missing in the STATUS column and suggests rerunning badc data connect NAME --pull.
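The missing check amounts to comparing each registry entry against the filesystem. A minimal sketch of that idea, assuming entries shaped like the TOML example earlier on this page (status_column and reconnect_hint are hypothetical names, not BADC functions):

```python
from pathlib import Path


def status_column(entry):
    """Return 'missing' when the recorded path is gone, else the stored status."""
    if not Path(entry["path"]).exists():
        return "missing"
    return entry.get("status", "connected")


def reconnect_hint(name):
    """Build the command the text suggests rerunning for a missing dataset."""
    return f"badc data connect {name} --pull"
```

A script could loop over the registry, call status_column on each entry, and print reconnect_hint for every dataset reported as missing.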

Cleanup workflow

When you are done with a dataset copy (or need to reclaim space on a dev server), follow this pattern:

  1. Run datalad status inside the dataset to confirm there are no uncommitted changes.

  2. Run datalad drop --recursive --reckless auto to remove annexed content while keeping metadata.

  3. Run badc data disconnect NAME --drop-content to delete the working tree entirely and mark the entry as disconnected. Use --keep-content when you only want to mark it inactive.

  4. Rerun badc data status to confirm the entry flipped to disconnected.

Because the bogus dataset lives at data/datalad/bogus as a git submodule, badc data connect prefers updating the existing checkout instead of recloning. If you do delete that directory, badc data connect bogus automatically re-runs git submodule update --init --recursive data/datalad/bogus before refreshing the registry entry. See notes/datalad-plan.md for end-to-end scenarios (clone, publish, cleanup).

The registry retains the last known path and timestamp so future connect operations can reconcile state when pointed at the same location.
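For scripted cleanups, the four steps can be written down as data before anything is executed. The sketch below only builds and prints the commands. It deliberately does not run them, because the output of datalad status should be reviewed before any content is dropped. The function name cleanup_steps is hypothetical; the commands and flags are the ones listed in this section.

```python
import shlex
from pathlib import Path


def cleanup_steps(name, dataset_dir, drop_content=True):
    """Return (working_directory, argv) pairs for the cleanup workflow, in order.

    A working directory of None means the command runs from the project root.
    """
    dataset_dir = Path(dataset_dir)
    content_flag = "--drop-content" if drop_content else "--keep-content"
    return [
        (dataset_dir, ["datalad", "status"]),
        (dataset_dir, ["datalad", "drop", "--recursive", "--reckless", "auto"]),
        (None, ["badc", "data", "disconnect", name, content_flag]),
        (None, ["badc", "data", "status"]),
    ]


def print_plan(name, dataset_dir, drop_content=True):
    """Print each step as a shell-ready line for manual review."""
    for cwd, argv in cleanup_steps(name, dataset_dir, drop_content):
        prefix = f"(cd {cwd}) " if cwd is not None else ""
        print(prefix + shlex.join(argv))
```

Calling print_plan("bogus", "data/datalad/bogus") lists the commands in the order given above, so an operator can run them one at a time.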

badc data status

Display the tracked datasets, their status (connected or disconnected), filesystem paths, and—when requested—DataLad sibling information. Example summary output:

$ badc data status
Tracked datasets:
 - bogus: connected (/home/gep/projects/badc/data/datalad/bogus) [present]

Request extended metadata and siblings when debugging dataset plumbing:

$ badc data status --details --show-siblings
bogus — connected (method: datalad)
  Path: /home/gep/projects/badc/data/datalad/bogus
  Exists: yes; type: datalad
  Siblings:
    - origin state=present https://github.com/UBC-FRESH/badc-bogus-data.git
    - arbutus-s3 state=present s3://ubc-fresh-badc-bogus-data

Option reference

Option / Argument                  Description                                                                               Default
--details / --summary              Toggle extended output (method, filesystem checks, notes).                                --summary
--show-siblings / --hide-siblings  Include datalad siblings output (requires DataLad and a dataset with .datalad metadata).  --hide-siblings

Help excerpt

$ badc data status --help
Usage: badc data status [OPTIONS]
  Report all datasets tracked in ~/.config/badc/data.toml.
Options:
  --details / --summary            Show extended metadata for each dataset.
  --show-siblings / --hide-siblings
                                  Include `datalad siblings` output (requires DataLad).
  --help                          Show this message and exit.

Use this command while debugging datalad run pipelines or before chaining a chunk/infer workflow to confirm that the referenced repositories exist locally.

Automation tips

  • Combine badc data connect with git submodule update --init --recursive in bootstrap scripts so cloned worktrees always have both the source tree and the audio datasets they require.

  • When integrating with datalad run, call badc data connect as the first recorded action so downstream provenance captures the origin of the dataset.

  • Emit badc data status as part of telemetry bundles to help future readers understand which repository revision supplied the raw WAV files.
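A bootstrap script following these tips might look like the sketch below. It is an assumption-laden illustration rather than a script shipped with BADC: bootstrap_commands and bootstrap are hypothetical names, and the default dataset list contains only the built-in bogus dataset. The dry_run switch belongs to this sketch and simply prints the commands instead of running them.

```python
import subprocess


def bootstrap_commands(datasets=("bogus",)):
    """Return the commands for a fresh checkout: submodules, datasets, then status."""
    commands = [["git", "submodule", "update", "--init", "--recursive"]]
    commands += [["badc", "data", "connect", name] for name in datasets]
    commands.append(["badc", "data", "status"])
    return commands


def bootstrap(datasets=("bogus",), dry_run=True):
    """Print the commands, or run them in order and stop at the first failure."""
    for argv in bootstrap_commands(datasets):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Ending with badc data status leaves a record of which datasets were connected at bootstrap time, which matches the telemetry tip above.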