Audiovisual data and subtitles datasets

We have released datasets under an experimental licence to support R&D.

Yle audiovisual and subtitles datasets v1.1

Sun January 1st, 2023

In early 2021, Yle released three datasets under an experimental licence for a limited time to support the development of language- and media-related technologies. Our licence to distribute this data has expired, but details of the dataset contents remain available on this page.

If you are interested in accessing these datasets, please contact us through our Archive Sales service, either

  • by e-mail: arkisto.myynti@yle.fi or
  • by filling in an inquiry form (currently only in Finnish).

Previously registered projects can continue using the data until the project ends or until 5 years have passed since we provided the project with access to the data, whichever comes first.

Dataset descriptions

Dataset 1: Yle media evaluation dataset

Audio files, subtitles and ground truth transcripts, speaker diarizations and NER annotations of 16 factual programs in Finnish and Swedish.

Video files, subtitles, metadata and annotations for 8 factual programs that have been used for demonstration and test purposes in the MeMAD project.

This dataset contains 12.7 hours of media in total.

Dataset 2: Yle multimodal media and machine translation dataset

Browse-quality video files, accompanied by parallel multilingual subtitles and program metadata such as production years, genre classifications and topical segmentation timecodes from Yle production systems for 113 news, current affairs and factual programs.

This dataset is split into the following subtitle language pairs: FIN-ENG, FIN-SWE and SWE-ENG. It also includes some additional content to demonstrate typical professional media products such as news broadcasts.

This dataset contains 59.95 hours of media in total.

Dataset 3: Yle machine translated subtitles evaluation dataset

Semi-automatically cleaned, parallel professional subtitles from 44 programs, containing 10.3k aligned sentence pairs for these language pairs: FIN-SWE, FIN-ENG, SWE-ENG.

This dataset does not contain video or audio, but the total content length covered by the subtitles is 22.46 hours.

This work has been supported by the European Union's Horizon 2020 research and innovation programme via the project MeMAD (grant agreement 780069).

Elävä arkisto metadata

Yle Elävä arkisto and Yle Arkivet metadata is also open for public use.