aac_datasets.datasets.audiocaps module

class AudioCaps(
root: str | Path | None = None,
subset: typing_extensions.Literal[train, val, test, train_fixed] = 'train',
download: bool = False,
transform: Callable[[AudioCapsItem], Any] | None = None,
verbose: int = 0,
force_download: bool = False,
verify_files: bool = False,
*,
audio_duration: float = 10.0,
audio_format: str = 'flac',
audio_n_channels: int = 1,
download_audio: bool = True,
exclude_removed_audio: bool = True,
ffmpeg_path: str | Path | None = None,
flat_captions: bool = False,
max_workers: int | None = 1,
sr: int = 32000,
with_tags: bool = False,
ytdlp_path: str | Path | None = None,
ytdlp_opts: Iterable[str] = (),
version: typing_extensions.Literal[v1, v2] = 'v1',
num_dl_attempts: int = 2,
)

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Available subsets are ‘train’, ‘val’ and ‘test’; the signature also accepts ‘train_fixed’.

Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of caption strings: the ‘train’ subset has 1 caption per sample, while ‘val’ and ‘test’ have 5 captions per sample. Downloading from YouTube requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
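As a rough illustration of the caption counts described above, the sketch below tallies how many captions each item carries. The helper name `caption_count_histogram` and the stand-in items are hypothetical; the only assumption about real items is that they expose a "captions" list, as the AudioCapsItem fields suggest.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def caption_count_histogram(items: Sequence[Mapping[str, Any]]) -> Counter:
    """Map 'number of captions' -> 'number of items with that many captions'."""
    return Counter(len(item["captions"]) for item in items)


# Stand-in items (not real AudioCaps data) so the sketch runs without any download.
fake_val = [
    {"captions": ["a dog barks"] * 5},
    {"captions": ["rain falls on a roof"] * 5},
]
fake_train = [{"captions": ["a car passes by"]}]

val_hist = caption_count_histogram(fake_val)
train_hist = caption_count_histogram(fake_train)

# With the real dataset (needs the package and local data), usage might look like:
#   from aac_datasets.datasets.audiocaps import AudioCaps
#   dataset = AudioCaps(root=".", subset="val", download=False)
#   item = dataset[0]
#   print(item["audio"].shape, len(item["captions"]))
```

On the real ‘val’ or ‘test’ subsets one would expect every item to land in the 5-caption bucket, and on ‘train’ in the 1-caption bucket, if the description above holds for the downloaded files.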

Warning: YouTube can sometimes block your IP when downloading audio, with the error:

Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies. Also see https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies for tips on effectively exporting YouTube cookies.

You can pass yt-dlp arguments through the ytdlp_opts argument, e.g. AudioCaps(ytdlp_opts=["--cookies-from-browser", "firefox"]).
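A minimal sketch of building the option list for ytdlp_opts is shown below. The helper `build_cookie_opts` is hypothetical and not part of the package; it only assembles the two yt-dlp flags named in the error message above.

```python
from typing import List, Optional


def build_cookie_opts(
    browser: Optional[str] = None,
    cookies_file: Optional[str] = None,
) -> List[str]:
    """Return yt-dlp cookie arguments as a flat list of strings.

    Hypothetical helper: it emits only the flags that were requested.
    """
    opts: List[str] = []
    if browser is not None:
        opts += ["--cookies-from-browser", browser]
    if cookies_file is not None:
        opts += ["--cookies", cookies_file]
    return opts


firefox_opts = build_cookie_opts(browser="firefox")

# The resulting list could then be passed to the dataset, for example:
#   AudioCaps(root=".", download=True, ytdlp_opts=firefox_opts)
```

Keeping the options in a list of separate strings, rather than one space-joined string, matches the Iterable[str] type in the signature.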

See also: AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf

Dataset folder tree (for version v1)
{root}
└── AUDIOCAPS
    ├── csv_files_v1
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
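To check a local copy against the tree above, something like the following sketch could be used. The helpers `expected_layout` and `count_flac_files` are hypothetical. The folder patterns `csv_files_{version}` and `audio_{sr}Hz` are extrapolated from the v1 / 32 kHz tree shown here, so other versions, sample rates, or the ‘train_fixed’ subset may be laid out differently.

```python
from pathlib import Path
from typing import Dict


def expected_layout(
    root: str,
    subset: str = "train",
    sr: int = 32000,
    version: str = "v1",
) -> Dict[str, Path]:
    """Build the CSV path and audio directory implied by the documented tree.

    Assumption: the naming pattern seen for v1 at 32000 Hz generalizes.
    """
    base = Path(root) / "AUDIOCAPS"
    return {
        "csv": base / f"csv_files_{version}" / f"{subset}.csv",
        "audio_dir": base / f"audio_{sr}Hz" / subset,
    }


def count_flac_files(audio_dir: Path) -> int:
    """Count .flac files directly inside audio_dir (0 if the directory is missing)."""
    if not audio_dir.is_dir():
        return 0
    return sum(1 for _ in audio_dir.glob("*.flac"))


layout = expected_layout("/data", subset="val")
```

Comparing `count_flac_files(layout["audio_dir"])` with the counts listed in the tree (for example 465 of 495 for ‘val’) gives a quick sense of how many clips were actually retrieved.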
CARD: ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>
property download: bool
property exclude_removed_audio: bool
property index_to_name: Dict[int, str]
property root: str
property sr: int
property subset: typing_extensions.Literal[train, val, test, train_fixed]
property version: typing_extensions.Literal[v1, v2]
property with_tags: bool
class AudioCapsItem

Bases: TypedDict

Class representing a single AudioCaps item.

audio: Tensor
audiocaps_ids: List[int]
captions: List[str]
dataset: str
duration: float
fname: str
index: int
sr: int
start_time: int
subset: typing_extensions.Literal[train, val, test, train_fixed]
tags: typing_extensions.NotRequired[List[int]]
youtube_id: str
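As one illustration of how the item fields fit together, the sketch below rebuilds a link to the source clip from youtube_id and start_time. The helper `youtube_url` and the sample identifier are hypothetical, and the code assumes start_time is expressed in seconds.

```python
from typing import Any, Mapping


def youtube_url(item: Mapping[str, Any]) -> str:
    """Build a YouTube watch URL that starts at the item's start_time.

    Assumption: item["start_time"] is an offset in seconds.
    """
    return (
        "https://www.youtube.com/watch"
        f"?v={item['youtube_id']}&t={int(item['start_time'])}s"
    )


# Made-up item holding only the two fields the helper reads.
example_item = {"youtube_id": "abc123XYZ_0", "start_time": 30}
example_url = youtube_url(example_item)
```

A link like this can help when spot-checking a caption against the original video, provided the video is still available.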