aac_datasets.datasets.audiocaps module

class AudioCaps(
    root: str | Path | None = None,
    subset: 'train' | 'val' | 'test' | 'train_fixed' = 'train',
    download: bool = False,
    transform: Callable[[AudioCapsItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    audio_duration: float = 10.0,
    audio_format: str = 'flac',
    audio_n_channels: int = 1,
    download_audio: bool = True,
    exclude_removed_audio: bool = True,
    ffmpeg_path: str | Path | None = None,
    flat_captions: bool = False,
    max_workers: int | None = 1,
    sr: int = 32000,
    with_tags: bool = False,
    ytdlp_path: str | Path | None = None,
    ytdlp_opts: Iterable[str] = (),
    version: 'v1' | 'v2' = 'v1',
    num_dl_attempts: int = 2,
)

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Available subsets are ‘train’, ‘val’ and ‘test’; the signature additionally accepts ‘train_fixed’.

Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of caption strings. The ‘train’ subset has 1 caption per sample, while ‘val’ and ‘test’ have 5 captions per sample. Downloading from YouTube requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
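As a rough illustration of the shapes described above, the sketch below computes the largest n_times implied by the default audio_duration and sr, and wraps dataset construction in a function. The helper names max_n_times and load_audiocaps are hypothetical and not part of the library. The import is done lazily inside the function, and the sketch assumes the audio files are already on disk.

```python
from typing import Any


def max_n_times(audio_duration: float = 10.0, sr: int = 32000) -> int:
    """Upper bound on n_times for a clip of audio_duration seconds at sr Hz."""
    return round(audio_duration * sr)


def load_audiocaps(root: str, subset: str = "val") -> Any:
    """Hypothetical helper: build the dataset from files already on disk."""
    # Lazy import: only needed when the function is called,
    # and requires the aac_datasets package to be installed.
    from aac_datasets.datasets.audiocaps import AudioCaps

    return AudioCaps(root=root, subset=subset, download=False)


if __name__ == "__main__":
    # 10 seconds at 32 kHz
    print(max_n_times())  # 320000
```

With the data in place, the audio tensor of an item returned by load_audiocaps would be expected to have shape (1, n_times) with n_times no larger than this bound.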

Warning: YouTube can sometimes block your IP when downloading audio, with the error: "Sign in to confirm you’re not a bot. Use --cookies-from-browser or --cookies for the authentication. See https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp for how to manually pass cookies. Also see https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies for tips on effectively exporting YouTube cookies."

You can pass yt-dlp arguments with the ytdlp_opts argument, e.g. AudioCaps(ytdlp_opts=["--cookies-from-browser", "firefox"]).
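A minimal sketch of building the cookie-related options for ytdlp_opts is shown here. The helper cookie_opts is hypothetical; it only assembles the two yt-dlp flags named in the error message, and rejecting both inputs at once is a choice made for this sketch rather than a yt-dlp requirement.

```python
from typing import List, Optional


def cookie_opts(
    browser: Optional[str] = None,
    cookie_file: Optional[str] = None,
) -> List[str]:
    """Build yt-dlp cookie arguments suitable for ytdlp_opts (hypothetical helper)."""
    if browser is not None and cookie_file is not None:
        # Sketch-level choice: keep the two authentication modes separate.
        raise ValueError("pass either browser or cookie_file, not both")
    if browser is not None:
        return ["--cookies-from-browser", browser]
    if cookie_file is not None:
        return ["--cookies", cookie_file]
    return []
```

The result can then be passed at construction time, for example AudioCaps(root=root, download=True, ytdlp_opts=cookie_opts(browser="firefox")).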

See also: AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf

Dataset folder tree (for version v1)

{root}
└── AUDIOCAPS
    ├── csv_files_v1
    │   ├── train.csv
    │   ├── val.csv
    │   └── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
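The layout above can be checked on disk with a small sketch like the following. The helpers audiocaps_v1_paths and count_flac_files are hypothetical. The sketch assumes that the audio folder name follows the audio_{sr}Hz pattern suggested by audio_32000Hz, and that the flac files sit directly inside each subset folder.

```python
from pathlib import Path
from typing import Dict, Union


def audiocaps_v1_paths(
    root: Union[str, Path],
    subset: str = "train",
    sr: int = 32000,
) -> Dict[str, Path]:
    """Expected CSV file and audio folder for one subset (v1 layout)."""
    base = Path(root) / "AUDIOCAPS"
    return {
        "csv": base / "csv_files_v1" / f"{subset}.csv",
        # Assumption: the folder name generalizes as audio_{sr}Hz.
        "audio_dir": base / f"audio_{sr}Hz" / subset,
    }


def count_flac_files(audio_dir: Path) -> int:
    """Count .flac files directly inside audio_dir (0 if the folder is missing)."""
    if not audio_dir.is_dir():
        return 0
    return sum(1 for _ in audio_dir.glob("*.flac"))
```

Comparing count_flac_files(audiocaps_v1_paths(root, "val")["audio_dir"]) against the counts listed in the tree gives a quick indication of how many clips were actually downloaded.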

CARD : ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>

property download : bool

property exclude_removed_audio : bool

property index_to_name : dict[int, str]

property root : str

property sr : int

property subset : 'train' | 'val' | 'test' | 'train_fixed'

property version : 'v1' | 'v2'

property with_tags : bool

class AudioCapsItem

Bases: TypedDict

Class representing a single AudioCaps item.

audio : Tensor

audiocaps_ids : List[int]

captions : List[str]

dataset : str

duration : float

fname : str

index : int

sr : int

start_time : int

subset : Literal['train', 'val', 'test', 'train_fixed']

tags : NotRequired[List[int]]

youtube_id : str
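Because AudioCapsItem is a TypedDict, an item can be read like a plain mapping. The sketch below summarizes one item using only the keys listed above. The function describe_item is hypothetical, and tags is read with .get() since it is declared NotRequired.

```python
from typing import Any, List, Mapping


def describe_item(item: Mapping[str, Any]) -> str:
    """Return a one-line summary of an AudioCapsItem-like mapping."""
    captions: List[str] = item["captions"]
    # "tags" is NotRequired, so it may be absent from the item.
    tags = item.get("tags")
    n_tags = "n/a" if tags is None else str(len(tags))
    return (
        f"{item['fname']} ({item['subset']}): "
        f"{item['duration']:.1f}s at {item['sr']} Hz, "
        f"{len(captions)} caption(s), tags={n_tags}"
    )
```

The same function works on a hand-built dict with these keys, which can be convenient for testing a transform without downloading any audio.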