aac_datasets package

Audio Captioning datasets for PyTorch.

class AudioCaps(root: str | Path | None = None, subset: 'train' | 'val' | 'test' | 'train_fixed' = 'train', download: bool = False, transform: Callable[[AudioCapsItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, *, audio_duration: float = 10.0, audio_format: str = 'flac', audio_n_channels: int = 1, download_audio: bool = True, exclude_removed_audio: bool = True, ffmpeg_path: str | Path | None = None, flat_captions: bool = False, max_workers: int | None = 1, sr: int = 32000, with_tags: bool = False, ytdlp_path: str | Path | None = None, ytdlp_opts: Iterable[str] = (), version: 'v1' | 'v2' = 'v1', num_dl_attempts: int = 2)

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Subsets available are ‘train’, ‘val’ and ‘test’; the signature above also accepts ‘train_fixed’.

Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of caption strings. Each sample has a single caption in the ‘train’ subset and 5 captions in the ‘val’ and ‘test’ subsets. Downloading from YouTube requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
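As a quick check of the figures above, the upper bound on n_times follows from the clip duration and the sample rate. This is a minimal sketch; the helper name max_n_times is hypothetical and not part of aac_datasets.

```python
def max_n_times(audio_duration: float = 10.0, sr: int = 32000) -> int:
    """Upper bound on the number of samples per channel for one clip."""
    return int(round(audio_duration * sr))


# With the documented defaults, a waveform has shape (1, n_times)
# with n_times no larger than the value printed here.
print(max_n_times())  # 320000
```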

Warning

YouTube can sometimes block your IP address when downloading audio, with the following error:

Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies. Also see https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies for tips on effectively exporting YouTube cookies.

You can pass yt-dlp arguments with the ytdlp_opts argument, e.g. AudioCaps(ytdlp_opts=["--cookies-from-browser", "firefox"]).
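The cookie workaround can be sketched as follows. The two helper functions are hypothetical names introduced for illustration; only AudioCaps and its root, subset, download and ytdlp_opts parameters come from the signature above. The import is done lazily so the keyword arguments can be inspected without the package installed.

```python
from typing import Any, Dict


def audiocaps_cookie_kwargs(browser: str = "firefox") -> Dict[str, Any]:
    """Build keyword arguments that forward browser cookies to yt-dlp."""
    return {
        "subset": "val",
        "download": True,
        # Passed through to the yt-dlp command line.
        "ytdlp_opts": ["--cookies-from-browser", browser],
    }


def build_audiocaps(root: str, browser: str = "firefox"):
    """Create the dataset (requires aac_datasets, yt-dlp and ffmpeg)."""
    from aac_datasets import AudioCaps  # lazy import, see lead-in

    return AudioCaps(root=root, **audiocaps_cookie_kwargs(browser))
```

Calling build_audiocaps("data") would start the download, so it is only worth running once yt-dlp and ffmpeg are available on the machine.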

See also the AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf

Dataset folder tree (for version v1)

{root}
└── AUDIOCAPS
    ├── csv_files_v1
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
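To check how many audio files were actually downloaded, the tree above can be walked with pathlib. This is a sketch under assumptions: the helper names are hypothetical, and the audio_{sr}Hz folder naming is inferred from the single audio_32000Hz entry shown in the tree.

```python
from pathlib import Path
from typing import Union


def audiocaps_audio_dir(
    root: Union[str, Path], subset: str = "train", sr: int = 32000
) -> Path:
    """Expected audio folder for one subset, following the tree above."""
    # Assumption: the sample rate is embedded in the folder name.
    return Path(root) / "AUDIOCAPS" / f"audio_{sr}Hz" / subset


def count_flac_files(
    root: Union[str, Path], subset: str = "train", sr: int = 32000
) -> int:
    """Number of flac files present, or 0 if the folder does not exist."""
    audio_dir = audiocaps_audio_dir(root, subset, sr)
    if not audio_dir.is_dir():
        return 0
    return sum(1 for _ in audio_dir.glob("*.flac"))
```

Comparing the result with the counts listed in the tree gives a rough idea of how many videos were unavailable at download time.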

CARD : ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>

property download : bool

property exclude_removed_audio : bool

property index_to_name : dict[int, str]

property root : str

property sr : int

property subset : 'train' | 'val' | 'test' | 'train_fixed'

property version : 'v1' | 'v2'

property with_tags : bool

class Clotho(root: str | Path | None = None, subset: 'dev' | 'val' | 'eval' = 'dev', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, *, clean_archives: bool = True, flat_captions: bool = False, version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION)

class Clotho(root: str | Path | None = None, *, subset: 'dcase_aac_test', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, clean_archives: bool = True, flat_captions: bool = False, version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION)

class Clotho(root: str | Path | None = None, *, subset: 'dcase_aac_analysis', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, clean_archives: bool = True, flat_captions: bool = False, version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION)

class Clotho(root: str | Path | None = None, *, subset: 'dcase_t2a_audio', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, clean_archives: bool = True, flat_captions: bool = False, version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION)

class Clotho(root: str | Path | None = None, *, subset: 'dcase_t2a_captions', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, clean_archives: bool = True, flat_captions: bool = False, version: 'v1' | 'v2' | 'v2.1' = ClothoCard.DEFAULT_VERSION)

Bases: Generic[T_ClothoItem], AACDataset[T_ClothoItem]

Unofficial Clotho PyTorch dataset.

Subsets available are ‘dev’, ‘val’, ‘eval’, ‘dcase_aac_test’, ‘dcase_aac_analysis’, ‘dcase_t2a_audio’ and ‘dcase_t2a_captions’.

Audio samples are waveforms of 15 to 30 seconds, sampled at 44100 Hz. The target is a list of 5 different caption strings describing an audio sample. Captions contain at most 20 words.

Clotho V1 Paper: https://arxiv.org/pdf/1910.09387.pdf

Dataset folder tree for version ‘v2.1’, with all subsets

{root}
└── CLOTHO_v2.1
    ├── archives
    │   └── (5 7z files, ~8.9GB)
    ├── clotho_audio_files
    │   ├── clotho_analysis
    │   │    └── (8360 wav files, ~19GB)
    │   ├── development
    │   │    └── (3839 wav files, ~7.1GB)
    │   ├── evaluation
    │   │    └── (1045 wav files, ~2.0GB)
    │   ├── test
    │   │    └── (1043 wav files, ~2.0GB)
    │   ├── test_retrieval_audio
    │   │    └── (1000 wav files, ~2.0GB)
    │   └── validation
    │        └── (1045 wav files, ~2.0GB)
    └── clotho_csv_files
        ├── clotho_captions_development.csv
        ├── clotho_captions_evaluation.csv
        ├── clotho_captions_validation.csv
        ├── clotho_metadata_development.csv
        ├── clotho_metadata_evaluation.csv
        ├── clotho_metadata_test.csv
        ├── clotho_metadata_validation.csv
        ├── retrieval_audio_metadata.csv
        └── retrieval_captions.csv
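The subset names accepted by the class do not match the folder names in the tree one to one. The mapping below is a guess inferred from the names alone, not taken from the library, and the helper clotho_audio_dir is hypothetical; it only covers the ‘v2.1’ layout shown above.

```python
from pathlib import Path
from typing import Dict, Optional, Union

# Assumed correspondence between subset names and audio folders.
CLOTHO_AUDIO_SUBDIRS: Dict[str, Optional[str]] = {
    "dev": "development",
    "val": "validation",
    "eval": "evaluation",
    "dcase_aac_test": "test",
    "dcase_aac_analysis": "clotho_analysis",
    "dcase_t2a_audio": "test_retrieval_audio",
    "dcase_t2a_captions": None,  # captions only, no audio folder in the tree
}


def clotho_audio_dir(root: Union[str, Path], subset: str) -> Optional[Path]:
    """Expected audio folder for a subset in the 'v2.1' layout, if any."""
    subdir = CLOTHO_AUDIO_SUBDIRS[subset]
    if subdir is None:
        return None
    return Path(root) / "CLOTHO_v2.1" / "clotho_audio_files" / subdir
```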

CARD : ClassVar[ClothoCard] = <aac_datasets.datasets.functional.clotho.ClothoCard object>

INVALID_SOUND_ID : ClassVar[str] = 'Not found'

INVALID_SOUND_LINK : ClassVar[str] = 'NA'

INVALID_START_END_SAMPLES : ClassVar[str] = ''

property download : bool

property root : str

property sr : int

property subset : 'dev' | 'val' | 'eval' | 'dcase_aac_test' | 'dcase_aac_analysis' | 'dcase_t2a_audio' | 'dcase_t2a_captions'

property version : 'v1' | 'v2' | 'v2.1'

class MACS(root: str | Path | None = None, subset: 'full' = 'full', download: bool = False, transform: Callable[[MACSItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, *, clean_archives: bool = True, flat_captions: bool = False)

Bases: AACDataset[MACSItem]

Unofficial MACS PyTorch dataset.

Dataset folder tree

{root}
└── MACS
    ├── audio
    │    └── (3930 wav files, ~13GB)
    ├── LICENCE.txt
    ├── MACS.yaml
    ├── MACS_competence.csv
    └── tau_meta
        ├── fold1_evaluate.csv
        ├── fold1_test.csv
        ├── fold1_train.csv
        └── meta.csv

CARD : ClassVar[MACSCard] = <aac_datasets.datasets.functional.macs.MACSCard object>

property download : bool

get_annotator_id_to_competence_dict() → dict[int, float]: Get annotator to competence dictionary.

get_competence(annotator_id: int) → float: Get competence value for a specific annotator id.

property root : str

property sr : int

property subset : 'full'

Bases: AACDataset[WavCapsItem]

Unofficial WavCaps PyTorch dataset.

WavCaps paper: https://arxiv.org/pdf/2303.17395.pdf

HuggingFace source: https://huggingface.co/datasets/cvssp/WavCaps

This dataset contains 4 training subsets, extracted from different sources:

- BBC Sound Effects: “bbc”
- SoundBible: “soundbible”
- AudioSet strongly labeled, without the AudioCaps V1 val and test subsets: “audioset_no_audiocaps_v1”
- FreeSound, without the Clotho dev, val, eval and test subsets: “freesound_no_clotho_v2”

Other subsets exist, but they do not comply with the DCASE Challenge rules:

- AudioSet strongly labeled: “audioset”
- FreeSound: “freesound”
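The split between the two groups of subsets can be written down as a small lookup. This is only an illustration of the lists above: the constant and function names are hypothetical and not part of aac_datasets, and treating the four training subsets as DCASE-compliant is an inference from the wording.

```python
# Subset names copied from the two lists above.
DCASE_COMPLIANT_SUBSETS = frozenset(
    {"bbc", "soundbible", "audioset_no_audiocaps_v1", "freesound_no_clotho_v2"}
)
NON_COMPLIANT_SUBSETS = frozenset({"audioset", "freesound"})


def is_dcase_compliant(subset: str) -> bool:
    """Return True for the subsets assumed to follow the DCASE Challenge rules."""
    if subset in DCASE_COMPLIANT_SUBSETS:
        return True
    if subset in NON_COMPLIANT_SUBSETS:
        return False
    raise ValueError(f"Unknown WavCaps subset: {subset!r}")
```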

Warning

WavCaps download is experimental: it requires a lot of disk space and can take a very long time to download and extract, so errors are to be expected.

Dataset folder tree

{root}
└── WavCaps
    ├── Audio
    │   ├── AudioSet_SL
    │   │    └── (108317 flac files, ~64GB)
    │   ├── BBC_Sound_Effects
    │   │    └── (31201 flac files, ~142GB)
    │   ├── FreeSound
    │   │    └── (262300 flac files, ~1.4TB)
    │   └── SoundBible
    │        └── (1232 flac files, ~884MB)
    ├── Zip_files
    │   ├── AudioSet_SL
    │   │    └── (8 zip files, ~76GB)
    │   ├── BBC_Sound_Effects
    │   │    └── (26 zip files, ~562GB)
    │   ├── FreeSound
    │   │    └── (123 zip? files, ~1.4TB)
    │   └── SoundBible
    │        └── (1 zip? files, ~624GB)
    ├── json_files
    │    ├── AudioSet_SL
    │    │    └── as_final.json
    │    ├── BBC_Sound_Effects
    │    │    └── bbc_final.json
    │    ├── FreeSound
    │    │    ├── fsd_final_2s.json
    │    │    └── fsd_final.json
    │    ├── SoundBible
    │    │    └── sb_final.json
    │    └── blacklist
    │         ├── blacklist_exclude_all_ac.json
    │         ├── blacklist_exclude_test_ac.json
    │         └── blacklist_exclude_ubs8k_esc50_vggsound.json
    ├── .gitattributes
    └── README.md

CARD : ClassVar[WavCapsCard] = <aac_datasets.datasets.functional.wavcaps.WavCapsCard object>

property download : bool

property root : str

property sr : int

property subset : 'audioset' | 'bbc' | 'freesound' | 'soundbible' | 'audioset_no_audiocaps_v1' | 'freesound_no_clotho_v2'

get_default_ffmpeg_path() → str

Returns the default ffmpeg executable path.

If set_default_ffmpeg_path() was previously called with a string argument, that value is returned. Otherwise, if the environment variable AAC_DATASETS_FFMPEG_PATH is set to a string, its value is returned. Otherwise, the default is “ffmpeg”.

get_default_root() → str

Returns the default root directory path.

If set_default_root() was previously called with a string argument, that value is returned. Otherwise, if the environment variable AAC_DATASETS_ROOT is set to a string, its value is returned. Otherwise, the default is “.”.

get_default_ytdlp_path() → str

Returns the default yt-dlp executable path.

If set_default_ytdlp_path() was previously called with a string argument, that value is returned. Otherwise, if the environment variable AAC_DATASETS_YTDLP_PATH is set to a string, its value is returned. Otherwise, the default is “yt-dlp”.
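The three getters share the same precedence: explicit override first, then the environment variable, then a hard-coded fallback. The sketch below reproduces that order in plain Python. It is not the library's implementation; set_override, resolve_default and the override store are hypothetical, and clearing the override with None is an assumption based on the setters accepting None.

```python
import os
from typing import Dict, Optional

# Hypothetical in-process store for explicit overrides.
_overrides: Dict[str, str] = {}


def set_override(key: str, value: Optional[str]) -> None:
    """Record an explicit override; None clears it (assumed behaviour)."""
    if value is None:
        _overrides.pop(key, None)
    else:
        _overrides[key] = str(value)


def resolve_default(key: str, env_var: str, fallback: str) -> str:
    """Apply the documented order: override, environment variable, fallback."""
    if key in _overrides:
        return _overrides[key]
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return env_value
    return fallback
```

With this sketch, resolve_default("root", "AAC_DATASETS_ROOT", ".") mirrors the behaviour described for get_default_root().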

list_datasets_names() → tuple['AudioCaps' | 'Clotho' | 'MACS' | 'WavCaps', ...]

load_dataset(name: 'AudioCaps' | 'Clotho' | 'MACS' | 'WavCaps', *args, **kwargs) → AACDataset
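A dataset can therefore be selected by name at run time. The wrapper below is a hypothetical convenience, not part of the package: it checks the name against the four values listed in the signature before delegating to load_dataset, which is imported lazily so the check itself works without the package installed. How load_dataset reacts to an unknown name is not documented here, hence the explicit check.

```python
from typing import Any, Tuple

# Names copied from the load_dataset signature.
DATASET_NAMES: Tuple[str, ...] = ("AudioCaps", "Clotho", "MACS", "WavCaps")


def load_checked(name: str, *args: Any, **kwargs: Any):
    """Validate the dataset name, then forward everything to load_dataset."""
    if name not in DATASET_NAMES:
        raise ValueError(
            f"Unknown dataset {name!r}; expected one of {DATASET_NAMES}."
        )
    from aac_datasets import load_dataset  # lazy import, see lead-in

    return load_dataset(name, *args, **kwargs)
```

For example, load_checked("Clotho", root="data", subset="val") would forward root and subset to the Clotho constructor, assuming load_dataset passes its extra arguments through unchanged.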

set_default_ffmpeg_path(tmp_path: str | Path | None) → None: Override default ffmpeg executable path.

set_default_root(cache_path: str | Path | None) → None: Override default root directory path.

set_default_ytdlp_path(java_path: str | Path | None) → None: Override default yt-dlp executable path.
