Chunk Audio Recordings

Use this guide to turn long WAV or FLAC recordings into inference-ready chunks across an entire DataLad dataset. BADC reads audio through soundfile/libsndfile and always emits WAV chunks for HawkEars.

  1. Probe a representative file to determine safe chunk durations.

    $ badc chunk probe data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
        --initial-duration 60 --max-duration 600 --tolerance 5
    

    The command prints the recommended chunk duration and stores telemetry under artifacts/telemetry/chunk_probe/ for later reference.
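
The guide does not spell out how the probe searches between --initial-duration and --max-duration. As a rough mental model only, a bisection that stops once the bracket is narrower than --tolerance could look like the sketch below. The function name probe_chunk_duration and the fits callback are hypothetical stand-ins (for example, "does a chunk of this length run without exhausting memory"); they are not part of BADC's API, and the real probe may work differently.

```python
from typing import Callable


def probe_chunk_duration(
    fits: Callable[[float], bool],
    initial: float,
    maximum: float,
    tolerance: float,
) -> float:
    """Return the longest duration (seconds) found to satisfy `fits`.

    Assumption: `fits` is monotonic, i.e. if a duration fits,
    every shorter duration fits as well.
    """
    if not fits(initial):
        raise ValueError("even the initial duration does not fit")
    if fits(maximum):
        return maximum

    low, high = initial, maximum  # low always fits, high never does
    while high - low > tolerance:
        mid = (low + high) / 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

With the flags from the example above, a call would be probe_chunk_duration(fits, 60, 600, 5), where fits is whatever check you consider "safe" for your hardware.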

  2. Plan per-recording chunk runs with badc chunk orchestrate (Phase 2 CLI scaffold).

    $ badc chunk orchestrate data/datalad/bogus \
        --pattern "*.wav" \
        --chunk-duration 60 \
        --overlap 0 \
        --manifest-dir manifests \
        --chunks-dir artifacts/chunks \
        --workers 4 \
        --print-datalad-run
    

    The command prints a Rich table showing which recordings still need manifests and the directories where chunks and manifests will live. With --print-datalad-run enabled, it also prints commands you can copy and paste, similar to:

    datalad run -m "Chunk XXXX" \
      --input audio/XXXX.wav \
      --output artifacts/chunks/XXXX \
      --output manifests/XXXX.csv \
      -- badc chunk run audio/XXXX.wav \
           --chunk-duration 60 \
           --overlap 0 \
           --output-dir artifacts/chunks/XXXX \
           --manifest manifests/XXXX.csv
    

    Run each command from the dataset root to keep provenance in Git/annex.
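
To see what --chunk-duration and --overlap imply for a given recording, the arithmetic can be sketched as follows. This is an illustration rather than BADC's implementation: the helper name chunk_bounds is hypothetical, and it assumes that the overlap is given in seconds, that each chunk starts chunk duration minus overlap after the previous one, and that a shorter final chunk is kept.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float, chunk_s: float, overlap_s: float = 0.0
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds for each chunk."""
    if chunk_s <= 0:
        raise ValueError("chunk duration must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be >= 0 and shorter than the chunk")

    step = chunk_s - overlap_s  # distance between consecutive chunk starts
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)  # clip the last chunk
        bounds.append((start, end))
        if end >= total_s:
            break  # the recording is fully covered
        start += step
    return bounds
```

Under these assumptions, a 150-second recording with --chunk-duration 60 and --overlap 0 yields three windows, the last one 30 seconds long.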

  3. Write chunks and manifests with badc chunk run, either manually or by reusing the command emitted above. If you trust the plan, append --apply to badc chunk orchestrate to run every chunk job automatically (optionally capturing provenance with --plan-csv/--plan-json).

    When the source resides inside a DataLad dataset, you can omit --output-dir/--manifest; BADC then places chunks under <dataset>/artifacts/chunks/<recording> and manifests under <dataset>/manifests/<recording>.csv. By default, --record-datalad wraps each applied job in datalad run; use --no-record-datalad together with --workers N when you want multi-recording parallelism without provenance tracking.

    Non-WAV inputs (e.g., FLAC) require soundfile (installed with BADC) and are automatically transcoded to WAV before hashing. Every applied run writes artifacts/chunks/<recording>/.chunk_status.json, which records whether the job is in_progress, completed, or failed, along with timestamps, manifest row counts, and the CLI arguments used. Follow-up orchestrate runs automatically resume anything marked failed or in_progress; to reprocess recordings already marked completed, pass --include-existing.

    $ datalad run -m "Chunk XXXX" \
        --input audio/XXXX.wav \
        --output artifacts/chunks/XXXX \
        --output manifests/XXXX.csv \
        -- badc chunk run audio/XXXX.wav \
             --chunk-duration 60 \
             --overlap 0 \
             --output-dir artifacts/chunks/XXXX \
             --manifest manifests/XXXX.csv
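
If you want to check by hand which recordings a follow-up orchestrate pass would pick up, the status files can be inspected with a few lines of Python. The sketch below is an approximation of the resume rule described in this step, not BADC code. The JSON key name "status" is an assumption (the guide names the three states but not the file's schema), and recordings_to_process is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List

# States that a follow-up run resumes, per the guide.
RESUMABLE = {"failed", "in_progress"}


def recordings_to_process(chunks_dir, include_existing: bool = False) -> List[str]:
    """List recording names whose status file calls for another run.

    Assumption: each .chunk_status.json stores its state under the
    key "status"; adjust the key if the real file differs.
    """
    pending = []
    for status_file in sorted(Path(chunks_dir).glob("*/.chunk_status.json")):
        state = json.loads(status_file.read_text()).get("status")
        if state in RESUMABLE or (include_existing and state == "completed"):
            pending.append(status_file.parent.name)
    return pending
```

For example, recordings_to_process("artifacts/chunks") lists only failed or interrupted jobs, while passing include_existing=True mirrors --include-existing. Recordings that have never been chunked have no status file and therefore do not appear in either list.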
    
  4. Validate the manifest by running badc chunk manifest or badc chunk split if you only need placeholders. Once chunking is complete, proceed with badc infer run as described in Run Local HawkEars Inference. Resume-friendly status files ensure you can rerun orchestrator passes at any time without duplicating work.