aac_datasets package
Audio Captioning datasets for PyTorch.
- class AudioCaps(
    root: str | Path | None = None,
    subset: 'train' | 'val' | 'test' | 'train_fixed' = 'train',
    download: bool = False,
    transform: Callable[[AudioCapsItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    audio_duration: float = 10.0,
    audio_format: str = 'flac',
    audio_n_channels: int = 1,
    download_audio: bool = True,
    exclude_removed_audio: bool = True,
    ffmpeg_path: str | Path | None = None,
    flat_captions: bool = False,
    max_workers: int | None = 1,
    sr: int = 32000,
    with_tags: bool = False,
    ytdlp_path: str | Path | None = None,
    ytdlp_opts: Iterable[str] = (),
    version: 'v1' | 'v2' = 'v1',
    num_dl_attempts: int = 2,
)
Bases: AACDataset[AudioCapsItem]
Unofficial AudioCaps PyTorch dataset.
Subsets available are ‘train’, ‘val’, ‘test’ and ‘train_fixed’.
Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. Target is a list of strings containing the captions. The ‘train’ subset has only 1 caption per sample, whereas ‘val’ and ‘test’ have 5 captions per sample. Downloading from YouTube requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
- /!\ YouTube can sometimes block your IP address when downloading audio, with the error:
Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies. Also see https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies for tips on effectively exporting YouTube cookies.
You can pass yt-dlp arguments through the ytdlp_opts argument, e.g. AudioCaps(ytdlp_opts=["--cookies-from-browser", "firefox"]).
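A minimal sketch of the cookie workaround described above. The helper names `cookie_opts`, `audiocaps_kwargs` and `load_audiocaps` are hypothetical; only `AudioCaps` and the `root`, `subset`, `download` and `ytdlp_opts` parameters come from this page. The import is deferred so that the option-building part can be read and tested without the package installed.

```python
from __future__ import annotations

from typing import Any


def cookie_opts(browser: str = "firefox") -> list[str]:
    """Return yt-dlp options that reuse the cookies of a local browser."""
    return ["--cookies-from-browser", browser]


def audiocaps_kwargs(root: str, browser: str = "firefox") -> dict[str, Any]:
    """Build keyword arguments for AudioCaps (hypothetical helper)."""
    return {
        "root": root,
        "subset": "val",
        "download": True,
        "ytdlp_opts": cookie_opts(browser),
    }


def load_audiocaps(root: str, browser: str = "firefox"):
    """Instantiate the dataset; requires aac_datasets, yt-dlp and ffmpeg."""
    from aac_datasets import AudioCaps  # deferred import

    return AudioCaps(**audiocaps_kwargs(root, browser))
```

With the package installed, calling `load_audiocaps("data")` would start the download with the browser cookies forwarded to yt-dlp. Whether that is enough to avoid the block depends on YouTube, so treat it as a starting point.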
See also: AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf
Dataset folder tree (for version v1)
{root}
└── AUDIOCAPS
    ├── csv_files_v1
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
- CARD : ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>
- property subset : 'train' | 'val' | 'test' | 'train_fixed'
- property version : 'v1' | 'v2'
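To make the AudioCaps v1 folder tree above concrete, here is a small sketch that rebuilds the expected paths with `pathlib`. The function names are hypothetical, and two details are assumptions rather than documented behaviour: that the audio directory name is derived from `sr`, and that the CSV directory name is derived from `version`. Only the three subsets shown in the tree are handled.

```python
from __future__ import annotations

from pathlib import Path

# Subsets that appear in the documented v1 folder tree.
_TREE_SUBSETS = ("train", "val", "test")


def audiocaps_audio_dir(root: str | Path, subset: str = "train", sr: int = 32000) -> Path:
    """Expected audio directory, assuming the 'audio_{sr}Hz' naming from the tree."""
    if subset not in _TREE_SUBSETS:
        raise ValueError(f"subset {subset!r} is not shown in the folder tree")
    return Path(root) / "AUDIOCAPS" / f"audio_{sr}Hz" / subset


def audiocaps_csv_path(root: str | Path, subset: str = "train", version: str = "v1") -> Path:
    """Expected captions CSV, assuming the 'csv_files_{version}' naming."""
    if subset not in _TREE_SUBSETS:
        raise ValueError(f"subset {subset!r} is not shown in the folder tree")
    return Path(root) / "AUDIOCAPS" / f"csv_files_{version}" / f"{subset}.csv"


def count_flac_files(directory: str | Path) -> int:
    """Count downloaded flac files, e.g. to compare with the totals in the tree."""
    return sum(1 for _ in Path(directory).glob("*.flac"))
```

Comparing `count_flac_files(audiocaps_audio_dir(root, "val"))` with the figures in the tree is one way to see how many clips were unavailable at download time.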
- class Clotho(
    root: str | Path | None = None,
    subset: 'dev' | 'val' | 'eval' = 'dev',
    download: bool = False,
    transform: Callable[[ClothoItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    clean_archives: bool = True,
    flat_captions: bool = False,
    version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION,
)
- class Clotho(
    root: str | Path | None = None,
    *,
    subset: 'dcase_aac_test',
    download: bool = False,
    transform: Callable[[ClothoItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    clean_archives: bool = True,
    flat_captions: bool = False,
    version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION,
)
- class Clotho(
    root: str | Path | None = None,
    *,
    subset: 'dcase_aac_analysis',
    download: bool = False,
    transform: Callable[[ClothoItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    clean_archives: bool = True,
    flat_captions: bool = False,
    version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION,
)
- class Clotho(
    root: str | Path | None = None,
    *,
    subset: 'dcase_t2a_audio',
    download: bool = False,
    transform: Callable[[ClothoItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    clean_archives: bool = True,
    flat_captions: bool = False,
    version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION,
)
- class Clotho(
    root: str | Path | None = None,
    *,
    subset: 'dcase_t2a_captions',
    download: bool = False,
    transform: Callable[[ClothoItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    clean_archives: bool = True,
    flat_captions: bool = False,
    version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION,
)
Bases: Generic[T_ClothoItem], AACDataset[T_ClothoItem]
Unofficial Clotho PyTorch dataset.
Subsets available are ‘dev’, ‘val’, ‘eval’, ‘dcase_aac_test’, ‘dcase_aac_analysis’, ‘dcase_t2a_audio’ and ‘dcase_t2a_captions’.
Audio samples are waveforms of 15 to 30 seconds, sampled at 44100 Hz. Target is a list of 5 different caption strings describing an audio sample. Captions contain at most 20 words.
Clotho V1 paper: https://arxiv.org/pdf/1910.09387.pdf
Dataset folder tree for version ‘v2.1’, with all subsets
{root}
└── CLOTHO_v2.1
    ├── archives
    │   └── (5 7z files, ~8.9GB)
    ├── clotho_audio_files
    │   ├── clotho_analysis
    │   │   └── (8360 wav files, ~19GB)
    │   ├── development
    │   │   └── (3839 wav files, ~7.1GB)
    │   ├── evaluation
    │   │   └── (1045 wav files, ~2.0GB)
    │   ├── test
    │   │   └── (1043 wav files, ~2.0GB)
    │   ├── test_retrieval_audio
    │   │   └── (1000 wav files, ~2.0GB)
    │   └── validation
    │       └── (1045 wav files, ~2.0GB)
    └── clotho_csv_files
        ├── clotho_captions_development.csv
        ├── clotho_captions_evaluation.csv
        ├── clotho_captions_validation.csv
        ├── clotho_metadata_development.csv
        ├── clotho_metadata_evaluation.csv
        ├── clotho_metadata_test.csv
        ├── clotho_metadata_validation.csv
        ├── retrieval_audio_metadata.csv
        └── retrieval_captions.csv
- CARD : ClassVar[ClothoCard] = <aac_datasets.datasets.functional.clotho.ClothoCard object>
- property subset : 'dev' | 'val' | 'eval' | 'dcase_aac_test' | 'dcase_aac_analysis' | 'dcase_t2a_audio' | 'dcase_t2a_captions'
- property version : 'v1' | 'v2' | 'v2.1'
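The relation between Clotho subset names and the audio folders in the tree above is not spelled out on this page. The sketch below encodes one plausible reading of it. The mapping `CLOTHO_AUDIO_DIRS`, the function name, and the `CLOTHO_{version}` directory pattern are all assumptions inferred from the ‘v2.1’ tree, not a documented API.

```python
from __future__ import annotations

from pathlib import Path

# Assumed subset -> audio folder mapping, inferred from the folder tree.
# 'dcase_t2a_captions' is assumed to be text-only (retrieval_captions.csv).
CLOTHO_AUDIO_DIRS: dict[str, str | None] = {
    "dev": "development",
    "val": "validation",
    "eval": "evaluation",
    "dcase_aac_test": "test",
    "dcase_aac_analysis": "clotho_analysis",
    "dcase_t2a_audio": "test_retrieval_audio",
    "dcase_t2a_captions": None,
}


def clotho_audio_dir(root: str | Path, subset: str, version: str = "v2.1") -> Path | None:
    """Return the assumed audio folder for a subset, or None if it has no audio."""
    if subset not in CLOTHO_AUDIO_DIRS:
        raise ValueError(f"unknown Clotho subset: {subset!r}")
    dname = CLOTHO_AUDIO_DIRS[subset]
    if dname is None:
        return None
    return Path(root) / f"CLOTHO_{version}" / "clotho_audio_files" / dname
```

A helper like this is mainly useful for checking that a download produced the expected folders before instantiating the dataset with `download=False`.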
- class MACS(
    root: str | Path | None = None,
    subset: 'full' = 'full',
    download: bool = False,
    transform: Callable[[MACSItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    clean_archives: bool = True,
    flat_captions: bool = False,
)
Bases: AACDataset[MACSItem]
Unofficial MACS PyTorch dataset.
Dataset folder tree
{root}
└── MACS
    ├── audio
    │   └── (3930 wav files, ~13GB)
    ├── LICENCE.txt
    ├── MACS.yaml
    ├── MACS_competence.csv
    └── tau_meta
        ├── fold1_evaluate.csv
        ├── fold1_test.csv
        ├── fold1_train.csv
        └── meta.csv
- get_annotator_id_to_competence_dict() → dict[int, float]
Get annotator to competence dictionary.
- property subset : 'full'
- class WavCaps(
    root: str | Path | None = None,
    subset: 'audioset' | 'bbc' | 'freesound' | 'soundbible' | 'audioset_no_audiocaps_v1' | 'freesound_no_clotho_v2' = 'audioset_no_audiocaps_v1',
    download: bool = False,
    transform: Callable[[WavCapsItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    clean_archives: bool = False,
    hf_cache_dir: str | None = None,
    repo_id: str | None = None,
    revision: str | None = '85a0c21e26fa7696a5a74ce54fada99a9b43c6de',
    zip_path: str | Path | None = None,
)
Bases: AACDataset[WavCapsItem]
Unofficial WavCaps PyTorch dataset.
WavCaps paper: https://arxiv.org/pdf/2303.17395.pdf
HuggingFace source: https://huggingface.co/datasets/cvssp/WavCaps
This dataset contains 4 training subsets, extracted from different sources:
- BBC Sound Effects: “bbc”
- SoundBible: “soundbible”
- AudioSet strongly labeled, without the AudioCaps V1 val and test subsets: “audioset_no_audiocaps_v1”
- FreeSound, without the Clotho dev, val, eval and test subsets: “freesound_no_clotho_v2”
Other subsets exist, but they do not comply with the DCASE Challenge rules:
- AudioSet strongly labeled: “audioset”
- FreeSound: “freesound”
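The grouping of WavCaps subsets described above can be captured in a small guard, for instance to avoid training on a non-compliant subset by accident. This is only a sketch: the constant and function names are hypothetical, and the grouping is copied from the two lists above.

```python
# Subsets listed above as usable under the DCASE Challenge rules.
DCASE_COMPLIANT_SUBSETS = frozenset(
    {"bbc", "soundbible", "audioset_no_audiocaps_v1", "freesound_no_clotho_v2"}
)
# Subsets listed above as not complying with the DCASE Challenge rules.
NON_COMPLIANT_SUBSETS = frozenset({"audioset", "freesound"})


def is_dcase_compliant(subset: str) -> bool:
    """Tell whether a WavCaps subset name is in the DCASE-compliant group."""
    if subset in DCASE_COMPLIANT_SUBSETS:
        return True
    if subset in NON_COMPLIANT_SUBSETS:
        return False
    raise ValueError(f"unknown WavCaps subset: {subset!r}")
```

Raising on unknown names, instead of returning False, keeps a typo in the subset name from being silently treated as a valid but non-compliant choice.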
Warning
WavCaps download is experimental; it requires a lot of disk space and can take a very long time to download and extract, so you may encounter errors.
Dataset folder tree
{root}
└── WavCaps
    ├── Audio
    │   ├── AudioSet_SL
    │   │   └── (108317 flac files, ~64GB)
    │   ├── BBC_Sound_Effects
    │   │   └── (31201 flac files, ~142GB)
    │   ├── FreeSound
    │   │   └── (262300 flac files, ~1.4TB)
    │   └── SoundBible
    │       └── (1232 flac files, ~884MB)
    ├── Zip_files
    │   ├── AudioSet_SL
    │   │   └── (8 zip files, ~76GB)
    │   ├── BBC_Sound_Effects
    │   │   └── (26 zip files, ~562GB)
    │   ├── FreeSound
    │   │   └── (123 zip? files, ~1.4TB)
    │   └── SoundBible
    │       └── (1 zip? files, ~624GB)
    ├── json_files
    │   ├── AudioSet_SL
    │   │   └── as_final.json
    │   ├── BBC_Sound_Effects
    │   │   └── bbc_final.json
    │   ├── FreeSound
    │   │   ├── fsd_final_2s.json
    │   │   └── fsd_final.json
    │   ├── SoundBible
    │   │   └── sb_final.json
    │   └── blacklist
    │       ├── blacklist_exclude_all_ac.json
    │       ├── blacklist_exclude_test_ac.json
    │       └── blacklist_exclude_ubs8k_esc50_vggsound.json
    ├── .gitattributes
    └── README.md
- CARD : ClassVar[WavCapsCard] = <aac_datasets.datasets.functional.wavcaps.WavCapsCard object>
- property subset : 'audioset' | 'bbc' | 'freesound' | 'soundbible' | 'audioset_no_audiocaps_v1' | 'freesound_no_clotho_v2'
- get_default_ffmpeg_path() → str
Returns the default ffmpeg executable path.
If set_default_ffmpeg_path() has been called before with a string argument, this function returns that value. Otherwise, if the environment variable AAC_DATASETS_FFMPEG_PATH is set to a string, it returns the variable's value. Otherwise it returns “ffmpeg”.
- get_default_root() → str
Returns the default root directory path.
If set_default_root() has been called before with a string argument, this function returns that value. Otherwise, if the environment variable AAC_DATASETS_ROOT is set to a string, it returns the variable's value. Otherwise it returns “.”.
- get_default_ytdlp_path() → str
Returns the default yt-dlp executable path.
If set_default_ytdlp_path() has been called before with a string argument, this function returns that value. Otherwise, if the environment variable AAC_DATASETS_YTDLP_PATH is set to a string, it returns the variable's value. Otherwise it returns “yt-dlp”.
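The three getters above share the same resolution order: explicit override first, then the environment variable, then a built-in fallback. The following standalone sketch illustrates that order. It is not the package's implementation; `resolve_default` and its `environ` parameter are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import os
from typing import Mapping


def resolve_default(
    override: str | None,
    env_var: str,
    fallback: str,
    environ: Mapping[str, str] | None = None,
) -> str:
    """Resolve a default value: override, then environment variable, then fallback."""
    env = os.environ if environ is None else environ
    if override is not None:
        # A value given to the matching set_default_*() call wins.
        return override
    value = env.get(env_var)
    if value is not None:
        # Next comes the AAC_DATASETS_* environment variable.
        return value
    # Finally the hard-coded default (".", "ffmpeg" or "yt-dlp").
    return fallback
```

For example, `resolve_default(None, "AAC_DATASETS_ROOT", ".")` mirrors what the text describes for `get_default_root()` when no override has been set.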
- load_dataset( ) → AACDataset
- set_default_root(cache_path: str | Path | None)
Override default root directory path.
Subpackages
- aac_datasets.datasets package
- aac_datasets.utils package