About dataset subsets

AudioCaps

The original AudioCaps (V1) dataset contains only 3 subsets: train, val and test.

A fourth subset for V1, named train_fixed, is another version of the train subset in which captions have been manually corrected or deleted. I recommend using it instead of train to train models. For more details, see the paper “CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding”.

In 2025, a new AudioCaps version was released. It contains more data and provides the subsets train, val and test. train_fixed is not available for that version.

Clotho

Clotho v2.1 contains 7 subsets:

  • dev : contains 3.8K files for training,

  • val : contains 1K files for validation,

  • eval : contains 1K files for testing,

  • dcase_aac_test : contains 1K files without captions used in the evaluation of the DCASE challenge audio captioning task (AAC),

  • dcase_aac_analysis : contains 6K audio files without captions used in the evaluation of the DCASE challenge audio captioning task (AAC),

  • dcase_t2a_audio : contains 1K audio files without captions used in the evaluation of the DCASE challenge language-based audio retrieval task (also called Text-to-Audio retrieval, T2A),

  • dcase_t2a_captions : contains 1K captions (queries) without audio files used in the evaluation of the DCASE challenge language-based audio retrieval task (also called Text-to-Audio retrieval, T2A).

Older Clotho versions (v1 and v2) contain less data or minor typos and are not recommended.

In the DCASE challenge for Audio Captioning, organizers followed a different convention for the subset names.

Clotho subset names

Clotho convention     DCASE convention
dev                   development-training
val                   development-validation
eval                  development-testing
dcase_aac_test        evaluation (-testing)
dcase_aac_analysis    analysis
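The correspondence between the two naming conventions can be kept as a small lookup when switching between this package and DCASE material. The following is a minimal sketch in plain Python. The names CLOTHO_TO_DCASE and to_dcase_name are hypothetical helpers written for illustration; they are not part of aac-datasets.

```python
# Mapping copied from the Clotho / DCASE naming table in this section.
# NOTE: CLOTHO_TO_DCASE and to_dcase_name are illustrative names,
# not part of the aac-datasets API.
CLOTHO_TO_DCASE = {
    "dev": "development-training",
    "val": "development-validation",
    "eval": "development-testing",
    # Listed as "evaluation (-testing)" in the table.
    "dcase_aac_test": "evaluation",
    "dcase_aac_analysis": "analysis",
}


def to_dcase_name(clotho_subset: str) -> str:
    """Translate a Clotho subset name into the DCASE convention."""
    try:
        return CLOTHO_TO_DCASE[clotho_subset]
    except KeyError:
        known = ", ".join(sorted(CLOTHO_TO_DCASE))
        raise ValueError(
            f"No DCASE name listed for {clotho_subset!r} (known: {known})"
        ) from None


if __name__ == "__main__":
    for name in CLOTHO_TO_DCASE:
        print(f"{name} -> {to_dcase_name(name)}")
```

The two dcase_t2a_* subsets are deliberately absent from the mapping, since the table only covers the audio captioning subsets.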

MACS

MACS contains only 1 subset: full. Its data is typically used as additional training data.

WavCaps

WavCaps contains 7 subsets. Recommended subsets for training are:

  • bbc : contains 31K files from the BBC Sound Effects website,

  • soundbible : contains 1.2K files from the SoundBible website,

  • audioset_no_audiocaps_v1 : contains 99K files from the audioset subset that do not overlap with AudioCaps,

  • freesound_no_clotho_v2 : contains 257K files from the freesound subset that do not overlap with Clotho.

Other subsets exist but overlap with several test datasets, and are therefore not recommended:

  • audioset : contains 108K files from the AudioSet strongly labeled dataset,

  • freesound : contains 262K files from the FreeSound website.

Since WavCaps does not contain validation or testing subsets, all of its data is used as additional training data. The subsets audioset_no_audiocaps_v1 and freesound_no_clotho_v2 are provided to avoid biases when evaluating on the AudioCaps or Clotho datasets (except for Clotho dcase_aac_test).

The subset freesound_no_clotho was removed in aac-datasets 0.6.0. It contained 258K files from the freesound subset that did not overlap with Clotho (except for the subsets of DCASE2023 task 6a).
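The WavCaps recommendations above can be turned into a simple guard in a training script. This is a minimal sketch in plain Python that only uses the subset names listed in this section. The function find_risky_wavcaps_subsets and the two constants are hypothetical names chosen for illustration; they are not provided by aac-datasets.

```python
# Subset names copied from the WavCaps lists in this section.
# NOTE: these constants and the function below are illustrative only.
WAVCAPS_RECOMMENDED = (
    "bbc",
    "soundbible",
    "audioset_no_audiocaps_v1",
    "freesound_no_clotho_v2",
)
WAVCAPS_OVERLAPPING = ("audioset", "freesound")


def find_risky_wavcaps_subsets(requested):
    """Return the requested subsets that overlap with test datasets.

    Raises ValueError for names that are in neither list, for example
    the removed freesound_no_clotho subset.
    """
    risky = []
    for name in requested:
        if name in WAVCAPS_OVERLAPPING:
            risky.append(name)
        elif name not in WAVCAPS_RECOMMENDED:
            raise ValueError(f"Unknown WavCaps subset: {name!r}")
    return risky


if __name__ == "__main__":
    wanted = ["bbc", "soundbible", "freesound"]
    risky = find_risky_wavcaps_subsets(wanted)
    if risky:
        print("Subsets overlapping with test data:", ", ".join(risky))
```

Running such a check before building the training set makes an accidental use of audioset or freesound visible early, instead of surfacing later as inflated evaluation scores.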

Dataset overlaps

Audio-text datasets typically come from other audio classification datasets or similar websites, which can lead to overlaps that create data leaks in your training. Here is a list of known overlaps between different sound event datasets that you should be aware of:

Overlaps between sound event datasets

Dataset A    Dataset B              Proportion of A in B (%)
AudioCaps    AudioSet (train)       100
Clotho       FSD50K (train)         5.4
AudioCaps    WavCaps (audioset)     17.6
Clotho       WavCaps (freesound)    89.0

If you do not take these overlaps into account, you might overestimate the results of your AAC model. Overlaps with WavCaps can be avoided by using the audioset_no_audiocaps_v1 and freesound_no_clotho_v2 subsets.
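For other dataset pairs, an overlap can be estimated and removed by comparing source identifiers. The sketch below shows one plausible reading of the "Proportion of A in B" figure: the share of items of A whose identifier also appears in B. It assumes that each item carries a comparable source identifier (which identifier to use depends on the datasets and is not specified here), and the identifiers in the example are made up. The function names are hypothetical.

```python
def proportion_in(ids_a, ids_b) -> float:
    """Percentage of the distinct items of A that also appear in B."""
    set_a = set(ids_a)
    if not set_a:
        raise ValueError("Dataset A has no identifiers")
    return 100.0 * len(set_a & set(ids_b)) / len(set_a)


def remove_overlap(train_ids, eval_ids):
    """Keep only the training items that are absent from the evaluation set."""
    eval_set = set(eval_ids)
    return [item for item in train_ids if item not in eval_set]


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    eval_ids = ["a1", "a2", "a3", "a4"]
    train_ids = ["a3", "a4", "b1", "b2", "b3"]

    print(f"{proportion_in(eval_ids, train_ids):.1f}% of eval is in train")
    print("Leak-free training ids:", remove_overlap(train_ids, eval_ids))
```

Matching on identifiers only catches exact re-use of the same source file; re-encoded or trimmed copies of the same recording would need a different comparison.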