Audiovisual data and subtitles datasets

We have released datasets under an experimental licence to support R&D.

Yle audiovisual and subtitles datasets v1.1

Thu November 4th, 2021

Yle has released three datasets under an experimental licence for a limited time to support the development of language- and media-related technologies. These datasets were originally created by the MeMAD research and innovation project, a collaboration between media industry members and research groups.

In short:

  • You can request this data for research purposes, and use it in your research project for the project duration. See the terms of use for details.
  • If your project has multiple partners, all of them can use the data, provided that every partner accepts the licence and terms of use.
  • The party requesting the data has to be located in Finland to gain access to the data (but your other project partners do not need to be).
  • Our current licence from the rights holders covers up to 50 projects requesting data until the end of December 2022. If we receive more than 50 requests, or if you need to start a new project after this date, please contact us. Extending the licence agreement should be possible, though it will likely cause a short delay.

Dataset descriptions

Dataset 1: Yle media evaluation dataset

Audio files, subtitles, ground-truth transcripts, speaker diarizations and NER annotations for 16 factual programs in Finnish and Swedish.

Video files, subtitles, metadata and annotations for 8 factual programs that have been used for demonstration and test purposes in the MeMAD project.

This dataset contains 12.7 hours of media in total.

Dataset 2: Yle multimodal media and machine translation dataset

Browse-quality video files, accompanied by parallel multilingual subtitles and program metadata such as production years, genre classifications and topical segmentation timecodes from Yle production systems for 113 news, current affairs and factual programs.

This dataset is split by subtitle language pair (FIN-ENG, FIN-SWE and SWE-ENG), with some additional content that demonstrates typical professional media products such as news broadcasts.

This dataset contains 59.95 hours of media in total.

Dataset 3: Yle machine translated subtitles evaluation dataset

Semi-automatically cleaned, parallel professional subtitles from 44 programs, containing 10.3k aligned sentence pairs for these language pairs: FIN-SWE, FIN-ENG, SWE-ENG.

This dataset does not contain video or audio, but the total content length covered by the subtitles is 22.46 hours.
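For a quick overview of how much material the three datasets cover, the figures above can be tallied as in the sketch below. The `Dataset` class and its field names are hypothetical and used only for illustration; they are not part of any Yle tooling or of the delivered data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Illustrative record; the field names are assumptions, not a Yle schema."""
    name: str
    hours: float      # content length stated in the dataset description
    has_media: bool   # False when only subtitles are included


DATASETS = [
    Dataset("Yle media evaluation dataset", 12.7, True),
    Dataset("Yle multimodal media and machine translation dataset", 59.95, True),
    Dataset("Yle machine translated subtitles evaluation dataset", 22.46, False),
]

# Hours of actual audio or video (datasets 1 and 2 only).
media_hours = round(sum(d.hours for d in DATASETS if d.has_media), 2)

# Hours of content covered overall, including the subtitle-only dataset 3.
covered_hours = round(sum(d.hours for d in DATASETS), 2)

if __name__ == "__main__":
    print(f"Media included: {media_hours:.2f} h")
    print(f"Content covered: {covered_hours:.2f} h")
```

Under these figures, datasets 1 and 2 together contain 72.65 hours of media, and all three datasets cover 95.11 hours of content.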

Interested?

First: Read the Terms of Use. By submitting the request form, you agree to them.

Second: Fill in and send this form to request the data: https://forms.gle/VSTJeLkrdvoNeQzg8.

We will review your request and, if everything is in order, send you instructions on how to access the data. If needed, we will ask for clarification.

This work has been supported by the European Union's Horizon 2020 research and innovation programme via the project MeMAD (grant agreement 780069).

Elävä arkisto metadata

Yle Elävä arkisto and Yle Arkivet metadata is also open for public use.