aac_datasets.datasets.functional.audiocaps module

class AudioCapsCard[source]

Bases: DatasetCard

ANNOTATIONS_CREATORS: Tuple[str, ...] = ('crowdsourced',)
CAPTIONS_PER_AUDIO: Dict[str, int] = {'test': 5, 'train': 1, 'train_v2': 1, 'val': 5}
CITATION: str = '\n    @inproceedings{kim_etal_2019_audiocaps,\n        title        = {{A}udio{C}aps: Generating Captions for Audios in The Wild},\n        author       = {Kim, Chris Dongjoo  and Kim, Byeongchang  and Lee, Hyunmin  and Kim, Gunhee},\n        year         = 2019,\n        month        = jun,\n        booktitle    = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},\n        publisher    = {Association for Computational Linguistics},\n        address      = {Minneapolis, Minnesota},\n        pages        = {119--132},\n        doi          = {10.18653/v1/N19-1011},\n        url          = {https://aclanthology.org/N19-1011},\n    }\n    '
DEFAULT_SUBSET: str = 'train'
HOMEPAGE: str = 'https://audiocaps.github.io/'
LANGUAGE: Tuple[str, ...] = ('en',)
LANGUAGE_DETAILS: Tuple[str, ...] = ('en-US',)
NAME: str = 'audiocaps'
PRETTY_NAME: str = 'AudioCaps'
SIZE_CATEGORIES: Tuple[str, ...] = ('10K<n<100K',)
SUBSETS: Tuple[str, ...] = ('train', 'val', 'test', 'train_v2')
TASK_CATEGORIES: Tuple[str, ...] = ('audio-to-text', 'text-to-audio')
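The card's attributes are plain class constants. As a minimal sketch of how they can be used (the dict below is copied from CAPTIONS_PER_AUDIO above rather than imported, so the snippet runs without the library installed):

```python
# Caption counts per subset, copied from AudioCapsCard.CAPTIONS_PER_AUDIO.
# In practice, read AudioCapsCard.CAPTIONS_PER_AUDIO from aac_datasets directly.
CAPTIONS_PER_AUDIO = {"test": 5, "train": 1, "train_v2": 1, "val": 5}

def expected_caption_count(subset: str, n_audios: int) -> int:
    """Total number of captions for n_audios items in the given subset."""
    return CAPTIONS_PER_AUDIO[subset] * n_audios

print(expected_caption_count("val", 100))  # 500
```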
download_audiocaps_dataset(
root: str | Path | None = None,
subset: str = 'train',
force: bool = False,
verbose: int = 0,
verify_files: bool = False,
audio_duration: float = 10.0,
audio_format: str = 'flac',
audio_n_channels: int = 1,
download_audio: bool = True,
ffmpeg_path: str | Path | None = None,
max_workers: int | None = 1,
sr: int = 32000,
ytdlp_path: str | Path | None = None,
with_tags: bool = False,
) None[source]

Prepare AudioCaps data (audio, labels, metadata).

Parameters:
  • root – Dataset root directory. The data will be stored in the ‘AUDIOCAPS’ subdirectory. defaults to “.”.

  • subset – The subset of AudioCaps to use. Can be one of SUBSETS. defaults to “train”.

  • force – If True, force re-download of files even if they already exist on disk. defaults to False.

  • verbose – Verbose level. defaults to 0.

  • verify_files – If True, check file hash values when possible. defaults to False.

  • audio_duration – Extracted duration for each audio file in seconds. defaults to 10.0.

  • audio_format – Audio format and extension name. defaults to “flac”.

  • audio_n_channels – Number of channels extracted for each audio file. defaults to 1.

  • download_audio – If True, download audio, metadata and labels files. Otherwise, only metadata and labels files are downloaded. defaults to True.

  • ffmpeg_path – Path to ffmpeg executable file. defaults to “ffmpeg”.

  • max_workers – Number of threads used to download audio files in parallel. Avoid values that are too high, which can trigger a “Too Many Requests” error. The value None will use min(32, os.cpu_count() + 4) workers, which is the default of ThreadPoolExecutor. defaults to 1.

  • sr – The sample rate used for audio files in the dataset (in Hz). Since the original YouTube videos are recorded in various settings, this parameter allows downloading all audio files at a single, specific sample rate. defaults to 32000.

  • with_tags – If True, download the tags from AudioSet dataset. defaults to False.

  • ytdlp_path – Path to yt-dlp or ytdlp executable. defaults to “yt-dlp”.
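A sketch of assembling a call to download_audiocaps_dataset() with the parameters above. The helper make_download_kwargs and the SUBSETS tuple below are illustrative (SUBSETS is copied from AudioCapsCard.SUBSETS); the actual call is commented out because it requires network access, ffmpeg and yt-dlp:

```python
from pathlib import Path

# Subset names copied from AudioCapsCard.SUBSETS above.
SUBSETS = ("train", "val", "test", "train_v2")

def make_download_kwargs(root, subset="val", max_workers=1):
    """Build a keyword dict for download_audiocaps_dataset() (illustrative helper)."""
    if subset not in SUBSETS:
        raise ValueError(f"Invalid subset {subset!r}; expected one of {SUBSETS}")
    return dict(
        root=str(Path(root)),
        subset=subset,
        max_workers=max_workers,  # keep low to avoid "Too Many Requests" errors
        audio_format="flac",
        sr=32000,
        with_tags=True,
    )

kwargs = make_download_kwargs("./data", subset="val")
# from aac_datasets.datasets.functional.audiocaps import download_audiocaps_dataset
# download_audiocaps_dataset(**kwargs)  # requires network, ffmpeg and yt-dlp
```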

download_audiocaps_datasets(
root: str | Path | None = None,
subsets: str | Iterable[str] = 'train',
force: bool = False,
verbose: int = 0,
verify_files: bool = False,
audio_duration: float = 10.0,
audio_format: str = 'flac',
audio_n_channels: int = 1,
download_audio: bool = True,
ffmpeg_path: str | Path | None = None,
max_workers: int | None = 1,
sr: int = 32000,
with_tags: bool = False,
ytdlp_path: str | Path | None = None,
) None[source]

Function helper to download a list of subsets. See download_audiocaps_dataset() for details.
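Per the signature above, subsets accepts either a single name or an iterable of names. A small sketch mirroring that normalization (normalize_subsets is an illustrative helper, not part of the library; the actual call is commented out since it requires network access):

```python
from typing import Iterable, Union

def normalize_subsets(subsets: Union[str, Iterable[str]]) -> tuple:
    """Accept 'train' or ('train', 'val') and return a tuple of subset names."""
    if isinstance(subsets, str):
        return (subsets,)
    return tuple(subsets)

# from aac_datasets.datasets.functional.audiocaps import download_audiocaps_datasets
# download_audiocaps_datasets(root=".", subsets=normalize_subsets(["val", "test"]))

print(normalize_subsets("train"))          # ('train',)
print(normalize_subsets(["val", "test"]))  # ('val', 'test')
```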

load_audiocaps_dataset(
root: str | Path | None = None,
subset: str = 'train',
verbose: int = 0,
audio_format: str = 'flac',
exclude_removed_audio: bool = True,
sr: int = 32000,
with_tags: bool = False,
) Tuple[Dict[str, List[Any]], Dict[int, str]][source]

Load AudioCaps metadata.

Parameters:
  • root – Dataset root directory. The data will be stored in the ‘AUDIOCAPS’ subdirectory. defaults to “.”.

  • subset – The subset of AudioCaps to use. Can be one of SUBSETS. defaults to “train”.

  • verbose – Verbose level. defaults to 0.

  • audio_format – Audio format and extension name. defaults to “flac”.

  • exclude_removed_audio – If True, audio files that could not be downloaded from YouTube (i.e. not present on disk) are excluded from the dataset. If False, these invalid audio items will return an empty tensor of shape (0,). defaults to True.

  • sr – The sample rate used for audio files in the dataset (in Hz). Since the original YouTube videos are recorded in various settings, this parameter allows downloading all audio files at a single, specific sample rate. defaults to 32000.

  • with_tags – If True, load the tags from the AudioSet dataset. Note: tags need to be downloaded with download=True & with_tags=True before being used. defaults to False.

Returns:

A tuple containing a dictionary of lists with each metadata column (expected keys: “audiocaps_ids”, “youtube_id”, “start_time”, “captions”, “fname”, “tags”, “is_on_disk”) and a dictionary mapping AudioSet tag indices to their names.
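The first element of the returned tuple is column-oriented: each key maps to a list with one entry per audio item. A sketch of turning it into per-item records (the sample values below are illustrative placeholders shaped like the documented keys, not real dataset content):

```python
# Illustrative stand-in for the dict returned by load_audiocaps_dataset();
# only a subset of the documented keys is shown.
raw_data = {
    "youtube_id": ["abc123", "def456"],
    "start_time": [10, 30],
    "captions": [["A dog barks."], ["Rain falls on a roof."]],
    "fname": ["abc123_10.flac", "def456_30.flac"],
}

def to_records(data):
    """Convert a dict of equal-length lists into a list of per-item dicts."""
    keys = list(data)
    n = len(data[keys[0]])
    return [{k: data[k][i] for k in keys} for i in range(n)]

records = to_records(raw_data)
print(records[0]["fname"])  # abc123_10.flac
```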