aac_datasets.datasets.audiocaps module

class AudioCaps(
root: str | Path | None = None,
subset: 'train' | 'val' | 'test' | 'train_fixed' = 'train',
download: bool = False,
transform: Callable[[AudioCapsItem], Any] | None = None,
verbose: int = 0,
force_download: bool = False,
verify_files: bool = False,
*,
audio_duration: float = 10.0,
audio_format: str = 'flac',
audio_n_channels: int = 1,
download_audio: bool = True,
exclude_removed_audio: bool = True,
ffmpeg_path: str | Path | None = None,
flat_captions: bool = False,
max_workers: int | None = 1,
sr: int = 32000,
with_tags: bool = False,
ytdlp_path: str | Path | None = None,
ytdlp_opts: Iterable[str] = (),
version: 'v1' | 'v2' = 'v1',
num_dl_attempts: int = 2,
)[source]

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Available subsets are ‘train’, ‘val’ and ‘test’; the signature also accepts ‘train_fixed’.

Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of strings containing the captions. The ‘train’ subset has only 1 caption per sample, while ‘val’ and ‘test’ have 5 captions per sample. Downloading from YouTube requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
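As a quick sanity check of the sizes described above, the sketch below computes the largest waveform shape implied by the default parameters. The helper names are hypothetical and are not part of aac_datasets. It assumes that sr is the sampling rate of the returned waveform and that n_times never exceeds audio_duration * sr. Whether shorter clips are padded is not stated here.

```python
from typing import Tuple


def expected_max_shape(
    audio_duration: float = 10.0,
    sr: int = 32000,
    audio_n_channels: int = 1,
) -> Tuple[int, int]:
    """Return the largest (n_channels, n_times) shape implied by the parameters."""
    # Assumption: the number of time steps is bounded by duration * sampling rate.
    n_times_max = int(round(audio_duration * sr))
    return (audio_n_channels, n_times_max)


def duration_seconds(n_times: int, sr: int = 32000) -> float:
    """Convert a number of samples back to a duration in seconds."""
    return n_times / sr
```

With the defaults (10 seconds, 32 kHz, mono), the upper bound is 320000 samples per channel.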

Warning: YouTube can sometimes block your IP address when downloading audio, with the error:

Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies. Also see https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies for tips on effectively exporting YouTube cookies.

You can pass yt-dlp arguments with the ytdlp_opts argument, e.g. AudioCaps(ytdlp_opts=["--cookies-from-browser", "firefox"]).
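The cookie workaround can be sketched as a small helper that builds the ytdlp_opts list. The names cookie_opts and load_val_with_cookies are hypothetical helpers, not part of aac_datasets. The sketch relies only on the two yt-dlp flags named in the error message and on the constructor arguments listed in the signature above.

```python
from typing import List, Optional


def cookie_opts(
    browser: Optional[str] = None,
    cookies_file: Optional[str] = None,
) -> List[str]:
    """Build extra yt-dlp arguments for cookie-based authentication."""
    if browser is not None and cookies_file is not None:
        raise ValueError("Pass either browser or cookies_file, not both.")
    if browser is not None:
        # Read cookies directly from an installed browser profile.
        return ["--cookies-from-browser", browser]
    if cookies_file is not None:
        # Read cookies from a previously exported cookies file.
        return ["--cookies", cookies_file]
    return []


def load_val_with_cookies(root: str, browser: str = "firefox"):
    """Download the 'val' subset, forwarding browser cookies to yt-dlp."""
    # Imported lazily so cookie_opts stays usable without aac-datasets installed.
    from aac_datasets import AudioCaps

    return AudioCaps(
        root=root,
        subset="val",
        download=True,
        ytdlp_opts=cookie_opts(browser=browser),
    )
```

Calling load_val_with_cookies triggers a real download and needs the yt-dlp and ffmpeg commands, so it is best tried on the small ‘val’ subset first.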

See also: AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf

Dataset folder tree (for version v1)
{root}
└── AUDIOCAPS
    ├── csv_files_v1
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
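To check a local copy against the tree above, the following sketch builds the expected paths and counts the audio files found on disk. The function names are hypothetical. The sketch assumes that the directory names audio_32000Hz and csv_files_v1 follow the sr and version parameters, which the tree only shows for 32 kHz and v1.

```python
from pathlib import Path
from typing import Dict


def audiocaps_paths(
    root: str,
    subset: str = "train",
    sr: int = 32000,
    version: str = "v1",
) -> Dict[str, Path]:
    """Return the expected CSV file and audio directory for one subset."""
    base = Path(root) / "AUDIOCAPS"
    return {
        # Assumption: the CSV folder name is suffixed with the dataset version.
        "csv": base / f"csv_files_{version}" / f"{subset}.csv",
        # Assumption: the audio folder name encodes the sampling rate in Hz.
        "audio_dir": base / f"audio_{sr}Hz" / subset,
    }


def count_audio_files(audio_dir: Path, audio_format: str = "flac") -> int:
    """Count audio files with the given extension; 0 if the folder is missing."""
    if not audio_dir.is_dir():
        return 0
    return sum(1 for _ in audio_dir.glob(f"*.{audio_format}"))
```

Comparing count_audio_files(audiocaps_paths(root)["audio_dir"]) with the figures in the tree (for example 46231 of 49838 files for ‘train’) gives a rough idea of how many clips could still be downloaded.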
CARD : ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>
property download : bool
property exclude_removed_audio : bool
property index_to_name : dict[int, str]
property root : str
property sr : int
property subset : 'train' | 'val' | 'test' | 'train_fixed'
property version : 'v1' | 'v2'
property with_tags : bool
class AudioCapsItem[source]

Bases: TypedDict

Class representing a single AudioCaps item.

audio : Tensor
audiocaps_ids : List[int]
captions : List[str]
dataset : str
duration : float
fname : str
index : int
sr : int
start_time : int
subset : Literal['train', 'val', 'test', 'train_fixed']
tags : NotRequired[List[int]]
youtube_id : str
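Since transform is typed as Callable[[AudioCapsItem], Any], a transform can be sketched as a plain function over the item dictionary. The two helpers below are hypothetical and only use the audio and captions keys listed above. They treat the item as an ordinary dict, so they can be tried without torch or aac_datasets installed.

```python
import random
from typing import Any, Callable, Dict, Tuple

Item = Dict[str, Any]  # stand-in for AudioCapsItem in this sketch


def first_caption_transform(item: Item) -> Tuple[Any, str]:
    """Return the audio and its first caption (deterministic)."""
    return item["audio"], item["captions"][0]


def make_random_caption_transform(seed: int = 0) -> Callable[[Item], Tuple[Any, str]]:
    """Build a transform that picks one caption at random per call."""
    rng = random.Random(seed)  # private generator, independent of global state

    def transform(item: Item) -> Tuple[Any, str]:
        return item["audio"], rng.choice(item["captions"])

    return transform
```

Either function could then be passed as AudioCaps(transform=...). A random pick is mostly relevant for ‘val’ and ‘test’, which have 5 captions per sample.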