Chunk Audio Recordings
Use this guide when you need to turn long WAV or FLAC recordings into inference-ready chunks across
an entire DataLad dataset (BADC relies on soundfile/libsndfile behind the scenes and always
emits WAV chunks for HawkEars).
Probe a representative file to determine safe chunk durations.
$ badc chunk probe data/datalad/bogus/audio/XXXX-000_20251001_093000.wav \
    --initial-duration 60 --max-duration 600 --tolerance 5
The command prints the recommended chunk duration and stores telemetry under
artifacts/telemetry/chunk_probe/ for posterity.

Plan per-recording chunk runs with badc chunk orchestrate (Phase 2 CLI scaffold).

$ badc chunk orchestrate data/datalad/bogus \
    --pattern "*.wav" \
    --chunk-duration 60 \
    --overlap 0 \
    --manifest-dir manifests \
    --chunks-dir artifacts/chunks \
    --workers 4 \
    --print-datalad-run
The command prints a Rich table showing which recordings still need manifests and the directories where chunks and manifests will live. With --print-datalad-run enabled you also get copy/pastable commands similar to:

datalad run -m "Chunk XXXX" \
    --input audio/XXXX.wav \
    --output artifacts/chunks/XXXX \
    --output manifests/XXXX.csv \
    -- badc chunk run audio/XXXX.wav \
    --chunk-duration 60 \
    --overlap 0 \
    --output-dir artifacts/chunks/XXXX \
    --manifest manifests/XXXX.csv
Run each command from the dataset root to keep provenance in Git/annex.
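If you would rather generate these commands yourself than copy them from the orchestrator output, the sketch below assembles one datalad run invocation per recording. It uses only the flags shown in the example above. The helper name datalad_chunk_command and the audio/, artifacts/chunks/, and manifests/ layout are assumptions taken from that example, not part of BADC.

```python
import shlex


def datalad_chunk_command(stem, chunk_duration=60, overlap=0):
    """Return the argv list for one provenance-tracked chunk job.

    Hypothetical helper: paths mirror the example layout (audio/,
    artifacts/chunks/, manifests/) and may need adjusting for your dataset.
    """
    audio = f"audio/{stem}.wav"
    chunks_dir = f"artifacts/chunks/{stem}"
    manifest = f"manifests/{stem}.csv"
    return [
        "datalad", "run", "-m", f"Chunk {stem}",
        "--input", audio,
        "--output", chunks_dir,
        "--output", manifest,
        "--",
        "badc", "chunk", "run", audio,
        "--chunk-duration", str(chunk_duration),
        "--overlap", str(overlap),
        "--output-dir", chunks_dir,
        "--manifest", manifest,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for a placeholder recording name.
    print(shlex.join(datalad_chunk_command("XXXX")))
```

Returning a list rather than a string means the command can be passed straight to subprocess.run without shell quoting issues. shlex.join is only needed when you want to print or log it.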
Write chunks + manifests using badc chunk run (either manually or by reusing the command emitted above). If you trust the plan, append --apply to badc chunk orchestrate to run every chunk job automatically (optionally capturing provenance with --plan-csv/--plan-json). When the source resides inside a DataLad dataset you can omit --output-dir/--manifest and BADC will place chunks under <dataset>/artifacts/chunks/<recording> and manifests under <dataset>/manifests/<recording>.csv automatically. --record-datalad (default) wraps each applied job in datalad run; use --no-record-datalad plus --workers N when you want multi-recording parallelism without provenance tracking.

Non-WAV inputs (e.g., FLAC) require soundfile (installed with BADC) and are automatically transcoded to WAV before hashing. Every applied run writes artifacts/chunks/<recording>/.chunk_status.json, which records whether the job is in_progress, completed, or failed, along with timestamps, manifest row counts, and the CLI arguments used. Follow-up orchestrate runs automatically resume anything marked failed or in_progress; completed recordings are reprocessed only when you pass --include-existing.

$ datalad run -m "Chunk XXXX" \
    --input audio/XXXX.wav \
    --output artifacts/chunks/XXXX \
    --output manifests/XXXX.csv \
    -- badc chunk run audio/XXXX.wav \
    --chunk-duration 60 \
    --overlap 0 \
    --output-dir artifacts/chunks/XXXX \
    --manifest manifests/XXXX.csv
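To get a feel for how --chunk-duration and --overlap interact before committing to a long run, the following sketch computes chunk boundaries for a recording of known length. It is an illustration only: it assumes chunks start every (chunk duration minus overlap) seconds and that the last chunk is truncated at the end of the recording. BADC's actual boundary handling is not documented here and may differ, and chunk_spans is a hypothetical name.

```python
def chunk_spans(total_seconds, chunk_duration, overlap=0.0):
    """Return (start, end) times in seconds for each chunk of a recording.

    Assumption: consecutive chunks start every (chunk_duration - overlap)
    seconds and the final chunk is cut short at the end of the recording.
    """
    if chunk_duration <= 0:
        raise ValueError("chunk_duration must be positive")
    if not 0 <= overlap < chunk_duration:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_duration")

    step = chunk_duration - overlap
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_duration, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the recording is fully covered
        start += step
    return spans
```

Under these assumptions a 150-second recording with --chunk-duration 60 and --overlap 0 yields three chunks, the last one 30 seconds long. Comparing the number of spans with the manifest row count is a quick sanity check, provided the real tool follows the same convention.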
Validate the manifest by running badc chunk manifest or badc chunk split if you only need placeholders. Once chunking is complete, proceed with badc infer run as described in Run Local HawkEars Inference. Resume-friendly status files ensure you can rerun orchestrator passes at any time without duplicating work.
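Before rerunning the orchestrator it can help to see which recordings it is likely to pick up again. The sketch below scans the per-recording status files and lists everything not marked completed. The JSON schema is not specified in this guide, so the top-level "status" key is an assumption (exposed as the status_key parameter), and recordings_to_resume is a hypothetical helper rather than a BADC API.

```python
import json
from pathlib import Path


def recordings_to_resume(chunks_root, status_key="status"):
    """List (recording, state) pairs whose status file is not 'completed'.

    Assumption: each .chunk_status.json is a JSON object that stores its
    state under a top-level key (status_key). Inspect a real status file
    and adjust the key if BADC uses a different name.
    """
    pending = []
    for status_file in sorted(Path(chunks_root).glob("*/.chunk_status.json")):
        try:
            data = json.loads(status_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = None  # unreadable or partially written file
        state = data.get(status_key) if isinstance(data, dict) else None
        if state != "completed":
            pending.append((status_file.parent.name, state))
    return pending


if __name__ == "__main__":
    for name, state in recordings_to_resume("artifacts/chunks"):
        print(f"{name}: {state}")
```

Files that cannot be parsed are reported with a state of None rather than skipped, so a job interrupted mid-write still shows up in the list.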