aac_datasets.datasets.audiocaps module

class AudioCaps(
    root: str | Path | None = None,
    subset: str = 'train',
    download: bool = False,
    transform: Callable[[AudioCapsItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    audio_duration: float = 10.0,
    audio_format: str = 'flac',
    audio_n_channels: int = 1,
    download_audio: bool = True,
    exclude_removed_audio: bool = True,
    ffmpeg_path: str | Path | None = None,
    flat_captions: bool = False,
    max_workers: int | None = 1,
    sr: int = 32000,
    with_tags: bool = False,
    ytdlp_path: str | Path | None = None,
)

Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.

Subsets available are ‘train’, ‘val’ and ‘test’.

Each audio item is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of caption strings: the ‘train’ subset has 1 caption per sample, while ‘val’ and ‘test’ have 5 captions per sample. Downloading requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
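A minimal usage sketch, assuming the `aac_datasets` package is installed and the files are already present under a local root directory. The helper names `build_audiocaps_kwargs` and `load_audiocaps` are hypothetical and not part of the library; only the `AudioCaps` constructor arguments are taken from the signature above. The import is deferred so that the argument-building part runs without the package.

```python
from pathlib import Path
from typing import Any, Dict, Union

VALID_SUBSETS = ("train", "val", "test")  # subsets listed in this reference


def build_audiocaps_kwargs(
    root: Union[str, Path], subset: str = "val"
) -> Dict[str, Any]:
    """Hypothetical helper: collect constructor arguments for AudioCaps."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {VALID_SUBSETS}, got {subset!r}")
    return {
        "root": str(root),
        "subset": subset,
        "download": False,   # assumption: the files are already on disk
        "sr": 32000,         # documented default sampling rate
        "with_tags": False,  # documented default
    }


def load_audiocaps(root: Union[str, Path], subset: str = "val"):
    """Hypothetical helper: build the dataset (needs aac_datasets installed)."""
    # Deferred import, so the helper above works without the package.
    from aac_datasets.datasets.audiocaps import AudioCaps

    return AudioCaps(**build_audiocaps_kwargs(root, subset))
```

Indexing the resulting dataset with an integer should yield an `AudioCapsItem`, so `dataset[0]["captions"]` would be the list of caption strings for the first sample.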

AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf

Dataset folder tree

{root}
└── AUDIOCAPS
    ├── train.csv
    ├── val.csv
    ├── test.csv
    └── audio_32000Hz
        ├── train
        │    └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │    └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
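The layout above can be sketched as a small path helper, which may be handy for checking how many audio files are present locally. This is an illustration only: `audiocaps_paths` and `count_flac_files` are hypothetical names, and the code assumes the audio folder name is derived from the sampling rate (as in `audio_32000Hz`), which the tree suggests but does not state.

```python
from pathlib import Path
from typing import Dict, Union


def audiocaps_paths(
    root: Union[str, Path], subset: str, sr: int = 32000
) -> Dict[str, Path]:
    """Hypothetical helper: expected CSV and audio locations for one subset.

    Assumption: the audio folder is named after the sampling rate,
    e.g. `audio_32000Hz` for sr=32000.
    """
    base = Path(root) / "AUDIOCAPS"
    return {
        "csv": base / f"{subset}.csv",
        "audio_dir": base / f"audio_{sr}Hz" / subset,
    }


def count_flac_files(audio_dir: Union[str, Path]) -> int:
    """Count the .flac files directly inside audio_dir (0 if it is missing)."""
    audio_dir = Path(audio_dir)
    if not audio_dir.is_dir():
        return 0
    return sum(1 for p in audio_dir.glob("*.flac") if p.is_file())
```

For instance, `count_flac_files(audiocaps_paths(root, "val")["audio_dir"])` could be compared against the file counts listed in the tree.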

CARD: ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>

property download: bool

property exclude_removed_audio: bool

property index_to_name: Dict[int, str]

property root: str

property sr: int

property subset: str

property with_tags: bool

class AudioCapsItem

Bases: TypedDict

Class representing a single AudioCaps item.

audio: Tensor

audiocaps_ids: List[int]

captions: List[str]

dataset: str

duration: float

fname: str

index: int

sr: int

start_time: int

subset: str

tags: typing_extensions.NotRequired[List[int]]

youtube_id: str
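To illustrate how the fields listed above fit together, here is a plain-Python stand-in for the item type. It is a sketch, not the library's class: `ItemSketch` and `describe_item` are hypothetical names, `audio` is typed as `Any` to avoid a torch dependency, and the optional `tags` key is modelled with `total=False` rather than `NotRequired`.

```python
from typing import Any, List, TypedDict


class _RequiredFields(TypedDict):
    audio: Any  # a torch Tensor in the real item
    audiocaps_ids: List[int]
    captions: List[str]
    dataset: str
    duration: float
    fname: str
    index: int
    sr: int
    start_time: int
    subset: str
    youtube_id: str


class ItemSketch(_RequiredFields, total=False):
    tags: List[int]  # optional key, mirroring NotRequired in the reference


def describe_item(item: ItemSketch) -> str:
    """Summarize an item; `tags` may be absent, so read it with .get()."""
    n_tags = len(item.get("tags", []))
    n_caps = len(item["captions"])
    return f"{item['fname']} ({item['subset']}): {n_caps} caption(s), {n_tags} tag(s)"
```

Because `tags` is not required, code that consumes items should check for the key (or use `.get`) instead of indexing it directly.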