Data Repository Commands
========================

The ``badc data`` namespace abstracts how users clone, update, and detach
DataLad-backed repositories that hold primary bird audio. It mirrors the
practices captured in ``notes/datalad-plan.md`` and stores the results in a
small TOML registry so other tools (chunking, inference, notebooks) know where
those datasets live.

.. contents:: On this page
   :local:
   :depth: 1

Overview
--------

* Runs happily inside a virtual environment without elevating privileges.
* Supports both ``git`` and ``datalad`` clone flows—BADC auto-detects which
  tool is available unless ``--method`` forces a choice.
* Records every connected dataset in a per-user config file
  (``~/.config/badc/data.toml`` by default, override via ``BADC_DATA_CONFIG``).
* Knows about the ``bogus`` dataset out of the box so people can smoke-test
  the CLI without touching production remotes. You can override the URL via
  ``--url`` for custom mirrors or private deployments.
* When ``NAME`` is ``bogus`` and the repository contains the
  ``data/datalad/bogus`` git submodule, ``badc data connect`` automatically
  runs ``git submodule update --init --recursive data/datalad/bogus`` instead
  of cloning a new copy, keeping the registry aligned with the in-tree
  dataset.

Dataset registry
----------------

Each successful ``connect`` run (without ``--dry-run``) writes an entry that
looks like this::

   [datasets.bogus]
   path = "/home/user/projects/badc/data/datalad/bogus"
   url = "https://github.com/UBC-FRESH/badc-bogus-data.git"
   method = "datalad"
   status = "connected"
   last_connected = "2025-01-12T21:07:14.889192+00:00"

You can point BADC at an alternate config file by exporting
``BADC_DATA_CONFIG=/path/to/custom.toml`` before invoking the CLI. That is
handy when staging datasets on different filesystems (e.g., NVMe for
inference, Ceph for archiving) or when running automated tests that should not
touch the real registry.
``badc data connect``
~~~~~~~~~~~~~~~~~~~~~

Clone (or refresh) a dataset under ``data/datalad/`` and track it in the
registry. The command returns immediately after the clone completes or after a
pull, so you can chain it inside scripts or ``datalad run`` records.

Usage::

   badc data connect NAME [--path PATH] [--url URL] [--method git|datalad]
                          [--pull / --no-pull] [--dry-run]

Key options:

``NAME``
   Dataset identifier. ``bogus`` resolves to the public sample repository
   maintained for this project. Unknown names require ``--url``.
``--path``
   Base directory that will hold dataset folders. Defaults to
   ``data/datalad`` relative to the current working tree.
``--url``
   Override the clone URL. Required when ``NAME`` is not in the built-in
   table defined in ``badc.data.DEFAULT_DATASETS``.
``--method``
   Force ``git`` or ``datalad``. When omitted, BADC prefers ``datalad`` if
   the binary is on ``PATH`` and falls back to ``git`` otherwise.
``--pull / --no-pull``
   Controls what happens when the target directory already exists. ``--pull``
   (default) merges upstream changes; ``--no-pull`` simply confirms the
   presence of the dataset.
``--dry-run``
   Print what would happen without touching the filesystem.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``NAME``
     - Dataset registry key (``bogus`` built-in).
     - Required
   * - ``--path PATH``
     - Base directory where datasets are created/updated.
     - ``data/datalad``
   * - ``--url URL``
     - Override clone URL for custom/private datasets.
     - Registry value or required for unknown names
   * - ``--method git|datalad``
     - Force clone implementation. Auto-detected when omitted.
     - Auto
   * - ``--pull`` / ``--no-pull``
     - Update an existing dataset after verifying it exists.
     - ``--pull``
   * - ``--dry-run``
     - Print planned actions without touching disk or the registry.
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc data connect --help
   Usage: badc data connect [OPTIONS] NAME

     Clone or update a DataLad dataset and record its metadata.

   Arguments:
     NAME  Dataset name, e.g., 'bogus'.  [required]

   Options:
     --path DIRECTORY     Target path for the dataset.  [default: data/datalad]
     --url TEXT           Override dataset URL (required for unknown names).
     --method TEXT        Preferred clone method: git or datalad.
     --pull / --no-pull   Update the dataset when it already exists locally.
     --dry-run / --apply  Preview actions without running commands.
     --help               Show this message and exit.

Examples::

   # Clone the public bogus dataset using DataLad (auto-detected)
   badc data connect bogus

   # Clone into a scratch filesystem and skip pulling if it already exists
   badc data connect bogus --path /mnt/scratch/badc --no-pull

   # Register a private repository
   badc data connect sockeye-prod --url git@github.com:UBC-FRESH/badc-prod-data.git

See :ref:`usage-bootstrap` for a complete checkout walkthrough.

``badc data disconnect``
~~~~~~~~~~~~~~~~~~~~~~~~

Remove a dataset from the active registry and optionally delete its contents.
The command never deletes anything unless you pass ``--drop-content``—with
that flag enabled BADC first runs ``datalad drop --recursive --reckless auto``
(when available) so annexed content is removed cleanly before the directory
tree is deleted.

Usage::

   badc data disconnect NAME [--drop-content / --keep-content] [--path PATH]
                             [--dry-run]

``--drop-content``
   Recursively delete the dataset directory after marking it as disconnected.
``--path``
   Fallback search root when the dataset is not in the registry, useful for
   first-time disconnects or misconfigured machines.
``--dry-run``
   Preview the deletion/recording steps without touching the filesystem.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``NAME``
     - Dataset identifier to mark as disconnected.
     - Required
   * - ``--drop-content`` / ``--keep-content``
     - Remove dataset files after recording the disconnection.
     - ``--keep-content``
   * - ``--path PATH``
     - Base directory to search when the registry entry is missing.
     - ``data/datalad``
   * - ``--dry-run``
     - Emit the pending actions without deleting or editing files.
     - Disabled

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc data disconnect --help
   Usage: badc data disconnect [OPTIONS] NAME

     Mark a dataset as disconnected and optionally drop its contents.

   Arguments:
     NAME  Dataset name to detach.  [required]

   Options:
     --drop-content / --keep-content  Drop annexed files after disconnecting.
     --path DIRECTORY                 Base directory that holds dataset folders.
     --dry-run / --apply              Preview actions without modifying files.
     --help                           Show this message and exit.

Cleanup workflow
----------------

When you are done with a dataset copy (or need to reclaim space on a dev
server), follow this pattern:

1. ``datalad status`` inside the dataset to confirm there are no uncommitted
   changes.
2. ``datalad drop --recursive --reckless auto`` to remove annexed content but
   keep metadata.
3. ``badc data disconnect NAME --drop-content`` if you want to delete the
   working tree entirely **and** remove it from the registry. Use
   ``--keep-content`` when you only want to mark it inactive.
4. Rerun ``badc data status`` to confirm the entry flipped to
   ``disconnected``.

Because the bogus dataset lives at ``data/datalad/bogus`` as a git submodule,
``badc data connect`` prefers updating the existing checkout instead of
recloning. If you *do* delete that directory, ``badc data connect bogus``
automatically re-runs ``git submodule update --init --recursive
data/datalad/bogus`` before refreshing the registry entry.

See ``notes/datalad-plan.md`` for end-to-end scenarios (clone, publish,
cleanup). The registry retains the last known path and timestamp so future
``connect`` operations can reconcile state when pointed at the same location.

``badc data status``
~~~~~~~~~~~~~~~~~~~~

List every dataset currently recorded in the registry along with its status
(connected or disconnected), its filesystem path, and—when requested—DataLad
sibling information. This is the fastest way to confirm that the bogus dataset
submodule was connected correctly or to audit scratch mounts before a cleanup.

Example summary output::

   $ badc data status
   Tracked datasets:
     - bogus: connected (/home/gep/projects/badc/data/datalad/bogus) [present]

Request extended metadata and siblings when debugging dataset plumbing::

   $ badc data status --details --show-siblings
   bogus — connected (method: datalad)
     Path: /home/gep/projects/badc/data/datalad/bogus
     Exists: yes; type: datalad
     Siblings:
       - origin      state=present  https://github.com/UBC-FRESH/badc-bogus-data.git
       - arbutus-s3  state=present  s3://ubc-fresh-badc-bogus-data

If a dataset's path is missing on disk, the command reports it as ``missing``
and suggests rerunning ``badc data connect NAME --pull``.

Option reference
^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Option / Argument
     - Description
     - Default
   * - ``--details`` / ``--summary``
     - Toggle extended output (method, filesystem checks, notes).
     - ``--summary``
   * - ``--show-siblings`` / ``--hide-siblings``
     - Include ``datalad siblings`` output (requires DataLad and a dataset
       with ``.datalad`` metadata).
     - ``--hide-siblings``

Help excerpt
^^^^^^^^^^^^

.. code-block:: console

   $ badc data status --help
   Usage: badc data status [OPTIONS]

     Report all datasets tracked in ~/.config/badc/data.toml.
   Options:
     --details / --summary              Show extended metadata for each dataset.
     --show-siblings / --hide-siblings  Include `datalad siblings` output
                                        (requires DataLad).
     --help                             Show this message and exit.

Use this command while debugging ``datalad run`` pipelines or before chaining
a chunk/infer workflow to confirm that the referenced repositories exist
locally.

Automation tips
---------------

* Combine ``badc data connect`` with ``git submodule update --init
  --recursive`` in bootstrap scripts so cloned worktrees always have both the
  source tree and the audio datasets they require.
* When integrating with ``datalad run``, call ``badc data connect`` as the
  first recorded action so downstream provenance captures the origin of the
  dataset.
* Emit ``badc data status`` as part of telemetry bundles to help future
  readers understand which repository revision supplied the raw WAV files.