aac_datasets.datasets.audiocaps module
class AudioCaps(
    root: str | Path | None = None,
    subset: str = 'train',
    download: bool = False,
    transform: Callable[[AudioCapsItem], Any] | None = None,
    verbose: int = 0,
    force_download: bool = False,
    verify_files: bool = False,
    *,
    audio_duration: float = 10.0,
    audio_format: str = 'flac',
    audio_n_channels: int = 1,
    download_audio: bool = True,
    exclude_removed_audio: bool = True,
    ffmpeg_path: str | Path | None = None,
    flat_captions: bool = False,
    max_workers: int | None = 1,
    sr: int = 32000,
    with_tags: bool = False,
    ytdlp_path: str | Path | None = None,
)
Bases: AACDataset[AudioCapsItem]

Unofficial AudioCaps PyTorch dataset.
Subsets available are ‘train’, ‘val’ and ‘test’.
Audio is a waveform tensor of shape (1, n_times), at most 10 seconds long and sampled at 32 kHz by default. The target is a list of strings containing the captions. The ‘train’ subset has 1 caption per sample, while ‘val’ and ‘test’ have 5 captions each. Downloading the dataset requires the ‘yt-dlp’ and ‘ffmpeg’ commands.
AudioCaps paper: https://www.aclweb.org/anthology/N19-1011.pdf
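The expected waveform shape follows directly from the default arguments above. A minimal sketch of that arithmetic (plain Python, not taken from the library internals):

```python
# Maximum number of samples implied by the defaults:
# audio_duration seconds at sr Hz, with audio_n_channels channels.
audio_duration = 10.0   # default audio_duration
sr = 32000              # default sample rate
audio_n_channels = 1    # default channel count

n_times_max = int(audio_duration * sr)
shape = (audio_n_channels, n_times_max)
print(shape)  # (1, 320000)
```

So with the defaults, each waveform tensor has at most 320000 time steps; clips shorter than 10 seconds yield a smaller n_times.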
{root}
└── AUDIOCAPS
    ├── train.csv
    ├── val.csv
    ├── test.csv
    └── audio_32000Hz
        ├── train
        │   └── (46231/49838 flac files, ~42G for 32kHz)
        ├── val
        │   └── (465/495 flac files, ~425M for 32kHz)
        └── test
            └── (913/975 flac files, ~832M for 32kHz)
CARD: ClassVar[AudioCapsCard] = <aac_datasets.datasets.functional.audiocaps.AudioCapsCard object>
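The flat_captions option in the signature above repeats each audio clip once per caption instead of attaching the full caption list to a single item. A minimal pure-Python sketch of that transformation (toy data; this is an illustration of the behavior, not the actual implementation):

```python
# Toy items mimicking the audio/captions structure of dataset entries.
items = [
    {"fname": "a.flac", "captions": ["c1", "c2"]},
    {"fname": "b.flac", "captions": ["c3"]},
]

def flatten_captions(items):
    """Yield one item per (audio, caption) pair, as flat_captions=True would."""
    for item in items:
        for caption in item["captions"]:
            yield {"fname": item["fname"], "captions": [caption]}

flat = list(flatten_captions(items))
print(len(flat))  # 3 items: a.flac appears twice, once per caption
```

With flat_captions=False (the default), the two captions of "a.flac" stay together in one item; with the flattened form, dataset length grows to the total number of captions.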