aac_datasets.datasets.audiocaps module

class AudioCaps(
root: str | Path | None = None,
subset: str = 'train',
download: bool = False,
transform: Callable[[AudioCapsItem], Any] | None = None,
verbose: int = 0,
force_download: bool = False,
verify_files: bool = False,
*,
audio_duration: float = 10.0,
audio_format: str = 'flac',
audio_n_channels: int = 1,
download_audio: bool = True,
exclude_removed_audio: bool = True,
ffmpeg_path: str | Path | None = None,
flat_captions: bool = False,
max_workers: int | None = 1,
sr: int = 32000,
with_tags: bool = False,
ytdlp_path: str | Path | None = None,
)[source]

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Subsets available are ‘train’, ‘val’ and ‘test’.

Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long, sampled at 32 kHz by default. The target is a list of strings containing the captions. The ‘train’ subset has only 1 caption per sample, while ‘val’ and ‘test’ have 5 captions each. Downloading requires the ‘yt-dlp’ and ‘ffmpeg’ commands.

AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf
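The `transform` argument accepts any callable applied to each loaded item. A minimal sketch, using a hand-built stand-in dict instead of a real `AudioCapsItem` (the keys shown are from the documented item fields; the audio value is a placeholder for the waveform tensor):

```python
# Sketch of a `transform` callable; assumes each item is a mapping
# with the documented AudioCapsItem keys. The item below is a
# hypothetical stand-in, not loaded from the real dataset.

def captions_only(item):
    """Keep only the captions from a full AudioCaps item."""
    return item["captions"]

# Stand-in item mimicking a few documented fields (the nested list
# stands in for a (1, n_times) waveform tensor).
sample_item = {
    "audio": [[0.0] * 16],
    "captions": ["A dog barks in the distance."],
    "sr": 32000,
    "subset": "train",
}

print(captions_only(sample_item))  # → ['A dog barks in the distance.']
```

With the real dataset, such a callable would be passed at construction time, e.g. `AudioCaps(root, transform=captions_only)`.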

Dataset folder tree
{root}
└── AUDIOCAPS
    ├── train.csv
    ├── val.csv
    ├── test.csv
    └── audio_32000Hz
        ├── train
        │    └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │    └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
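The folder tree above can be reproduced programmatically. A small sketch, assuming the root path `/data` (any root works) and the default 32 kHz sample rate shown in the tree:

```python
from pathlib import Path

# Reconstruct the documented AUDIOCAPS folder layout for each subset.
root = Path("/data")  # hypothetical root; substitute your own
sr = 32000            # matches the default `sr` and the audio_32000Hz dir

for subset in ("train", "val", "test"):
    csv_path = root / "AUDIOCAPS" / f"{subset}.csv"
    audio_dir = root / "AUDIOCAPS" / f"audio_{sr}Hz" / subset
    print(csv_path, audio_dir)
```

This mirrors where the dataset expects its metadata CSVs and per-subset audio directories to live under `{root}`.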
CARD: ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>
property download: bool
property exclude_removed_audio: bool
property index_to_name: Dict[int, str]
property root: str
property sr: int
property subset: str
property with_tags: bool
class AudioCapsItem[source]

Bases: TypedDict

Class representing a single AudioCaps item.

audio: Tensor
audiocaps_ids: List[int]
captions: List[str]
dataset: str
duration: float
fname: str
index: int
sr: int
start_time: int
subset: str
tags: typing_extensions.NotRequired[List[int]]
youtube_id: str
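Since `AudioCapsItem` is a `TypedDict`, each item behaves as a plain dict with the keys listed above. A sketch with placeholder values (`tags` is `NotRequired`, so it may be absent; the audio value stands in for a Tensor):

```python
# Stand-in dict with the documented AudioCapsItem keys.
# All values are illustrative placeholders, not real dataset content.
item = {
    "audio": [[0.0] * 320000],   # stands in for a (1, n_times) Tensor
    "audiocaps_ids": [12345],
    "captions": ["Rain falls on a tin roof."],
    "dataset": "audiocaps",
    "duration": 10.0,
    "fname": "abc123.flac",
    "index": 0,
    "sr": 32000,
    "start_time": 30,
    "subset": "train",
    "youtube_id": "abc123",
}

# Every key except "tags" is required by the TypedDict definition.
required = {"audio", "audiocaps_ids", "captions", "dataset", "duration",
            "fname", "index", "sr", "start_time", "subset", "youtube_id"}
print(required.issubset(item))  # → True
```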