aac_datasets.datasets.clotho module

class Clotho(root: str | Path | None = None, subset: str = 'dev', download: bool = False, transform: Callable[[ClothoItem], Any] | None = None, verbose: int = 0, force_download: bool = False, verify_files: bool = False, *, clean_archives: bool = True, flat_captions: bool = False, version: str = 'v2.1')

Bases: AACDataset[ClothoItem]

Unofficial Clotho PyTorch dataset.

Subsets available are ‘train’, ‘val’, ‘eval’, ‘dcase_aac_test’, ‘dcase_aac_analysis’, ‘dcase_t2a_audio’ and ‘dcase_t2a_captions’.
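A minimal usage sketch, assuming the `aac_datasets` package is installed and the data is already present under `root`. The helper name `load_clotho` is hypothetical; the keyword arguments and their defaults are taken from the constructor signature above.

```python
from typing import Any, Optional


def load_clotho(root: Optional[str] = None, subset: str = "dev", download: bool = False) -> Any:
    """Build a Clotho dataset from the documented constructor arguments.

    `load_clotho` is a hypothetical wrapper written for this sketch, not part of the library.
    """
    # Imported lazily so that this sketch can be imported without the package installed.
    from aac_datasets.datasets.clotho import Clotho

    return Clotho(root=root, subset=subset, download=download, version="v2.1")


# Intended use (not executed here, since it needs the dataset files on disk):
#   dataset = load_clotho(root="/path/to/data", subset="dev")
#   print(dataset.subset, dataset.version, dataset.sr)  # properties documented on this page
```

Passing `download=True` would presumably fetch the archives first; the tree further down suggests budgeting for the listed archive and audio sizes before doing so.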

Audio clips are waveforms of 15 to 30 seconds, sampled at 44100 Hz. The target for each audio sample is a list of 5 different caption strings describing it. Captions contain at most 20 words.
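The figures above translate into simple bounds that can be sanity-checked on loaded items. The sketch below is illustrative only: the constant and function names are hypothetical, and counting words by whitespace splitting is an assumption about how the 20-word limit is measured.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 44100          # from the description above
MIN_SECONDS, MAX_SECONDS = 15, 30
CAPTIONS_PER_CLIP = 5
MAX_WORDS_PER_CAPTION = 20


def expected_sample_range() -> Tuple[int, int]:
    """Return the (min, max) number of samples per clip implied by duration and rate."""
    return MIN_SECONDS * SAMPLE_RATE_HZ, MAX_SECONDS * SAMPLE_RATE_HZ


def captions_look_valid(captions: List[str]) -> bool:
    """Check the documented caption count and word limit (words split on whitespace)."""
    if len(captions) != CAPTIONS_PER_CLIP:
        return False
    return all(len(caption.split()) <= MAX_WORDS_PER_CAPTION for caption in captions)
```

A 15-second clip at 44100 Hz therefore holds 661500 samples, and a 30-second clip holds 1323000.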

Clotho V1 paper: https://arxiv.org/pdf/1910.09387.pdf

Dataset folder tree for version ‘v2.1’, with all subsets

{root}
└── CLOTHO_v2.1
    ├── archives
    │   └── (5 7z files, ~8.9GB)
    ├── clotho_audio_files
    │   ├── clotho_analysis
    │   │    └── (8360 wav files, ~19GB)
    │   ├── development
    │   │    └── (3839 wav files, ~7.1GB)
    │   ├── evaluation
    │   │    └── (1045 wav files, ~2.0GB)
    │   ├── test
    │   │    └── (1043 wav files, ~2.0GB)
    │   ├── test_retrieval_audio
    │   │    └── (1000 wav files, ~2.0GB)
    │   └── validation
    │        └── (1045 wav files, ~2.0GB)
    └── clotho_csv_files
        ├── clotho_captions_development.csv
        ├── clotho_captions_evaluation.csv
        ├── clotho_captions_validation.csv
        ├── clotho_metadata_development.csv
        ├── clotho_metadata_evaluation.csv
        ├── clotho_metadata_test.csv
        ├── clotho_metadata_validation.csv
        ├── retrieval_audio_metadata.csv
        └── retrieval_captions.csv
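The layout above can be checked on disk with a few lines of `pathlib`. This is a sketch under stated assumptions: the helper names are hypothetical, the folder names are copied from the tree, and the `CLOTHO_{version}` pattern is only confirmed here for ‘v2.1’.

```python
from pathlib import Path
from typing import Dict, Union

# Audio folder names as they appear in the tree above.
AUDIO_FOLDERS = (
    "clotho_analysis",
    "development",
    "evaluation",
    "test",
    "test_retrieval_audio",
    "validation",
)


def audio_dirs(root: Union[str, Path], version: str = "v2.1") -> Dict[str, Path]:
    """Map each audio folder name to its expected path under `root`."""
    base = Path(root) / f"CLOTHO_{version}" / "clotho_audio_files"
    return {name: base / name for name in AUDIO_FOLDERS}


def count_wavs(directory: Path) -> int:
    """Count the .wav files directly inside `directory`; 0 if it does not exist."""
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob("*.wav"))
```

Comparing `count_wavs` for each folder against the file counts in the tree gives a quick way to spot an incomplete extraction.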

CARD: ClassVar[ClothoCard] = <aac_datasets.datasets.functional.clotho.ClothoCard object>

INVALID_SOUND_ID: ClassVar[str] = 'Not found'

INVALID_SOUND_LINK: ClassVar[str] = 'NA'

INVALID_START_END_SAMPLES: ClassVar[str] = ''
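Judging by their names, the three `INVALID_*` values appear to act as placeholders for metadata that is unavailable for a clip; that reading is an assumption, not something this page states. Under it, a hypothetical helper can flag which metadata fields of an item carry no usable value. The constants are copied locally so the sketch runs without the library.

```python
from typing import Any, List, Mapping

# Values copied from the class attributes documented above.
INVALID_SOUND_ID = "Not found"
INVALID_SOUND_LINK = "NA"
INVALID_START_END_SAMPLES = ""

# Assumed pairing of item fields with their placeholder values.
_PLACEHOLDERS = {
    "sound_id": INVALID_SOUND_ID,
    "sound_link": INVALID_SOUND_LINK,
    "start_end_samples": INVALID_START_END_SAMPLES,
}


def missing_metadata_fields(item: Mapping[str, Any]) -> List[str]:
    """List the fields that are absent or equal to their placeholder value."""
    return [
        field
        for field, placeholder in _PLACEHOLDERS.items()
        if item.get(field, placeholder) == placeholder
    ]
```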

property download: bool

property root: str

property sr: int

property subset: str

property version: str

class ClothoItem

Bases: TypedDict

Class representing a single Clotho item.

audio: typing_extensions.NotRequired[Tensor]

captions: typing_extensions.NotRequired[List[str]]

dataset: str

duration: typing_extensions.NotRequired[float]

fname: typing_extensions.NotRequired[str]

index: int

keywords: typing_extensions.NotRequired[List[str]]

license: typing_extensions.NotRequired[str]

manufacturer: typing_extensions.NotRequired[str]

sound_id: typing_extensions.NotRequired[str]

sound_link: typing_extensions.NotRequired[str]

sr: typing_extensions.NotRequired[int]

start_end_samples: typing_extensions.NotRequired[str]

subset: str
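Because only `dataset`, `index` and `subset` are required, code that consumes items should read the other keys defensively. The sketch below mirrors part of the field list with a standard-library `TypedDict`; `ItemSketch` and `describe` are hypothetical names, `Any` stands in for the `Tensor` type, and the split into a required base class plus a `total=False` subclass is one way to express `NotRequired` without `typing_extensions`, not the library's own definition.

```python
from typing import Any, List, TypedDict


class _RequiredFields(TypedDict):
    # Keys documented above without NotRequired.
    dataset: str
    index: int
    subset: str


class ItemSketch(_RequiredFields, total=False):
    # A subset of the optional keys; each may be absent from a given item.
    audio: Any  # stands in for a torch Tensor
    captions: List[str]
    duration: float
    fname: str
    sr: int


def describe(item: ItemSketch) -> str:
    """Summarize an item, tolerating a missing `captions` key."""
    n_captions = len(item.get("captions", []))
    return f"{item['dataset']}/{item['subset']}#{item['index']}: {n_captions} caption(s)"
```

Using `item.get(...)` for optional keys avoids a `KeyError` when an item carries no captions.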