Yle audiovisual and subtitles datasets v1.1
Yle has released three datasets under an experimental licence for a limited time to support the development of language and media related technologies. These datasets were originally created by the MeMAD research and innovation project, a collaboration between media industry members and research groups.
- The party requesting the data must be located in Finland to gain access, but its other project partners do not need to be.
- Our current licence from the rights holders covers up to 50 projects requesting data until the end of December 2022. If we get more than 50 requests or you need to start a new project after this date, please contact us. There will likely be a small delay as we expand our licence agreement, but this should be possible.
Dataset 1: Yle media evaluation dataset
Audio files, subtitles and ground truth transcripts, speaker diarizations and NER annotations of 16 factual programs in Finnish and Swedish.
Video files, subtitles, metadata and annotations for 8 factual programs that have been used for demonstration and test purposes in the MeMAD project.
This dataset contains 12.7 hours of media in total.
Dataset 2: Yle multimodal media and machine translation dataset
Browse-quality video files, accompanied by parallel multilingual subtitles and program metadata such as production years, genre classifications and topical segmentation timecodes from Yle production systems for 113 news, current affairs and factual programs.
This dataset is split into these subtitle language pairs: FIN-ENG, FIN-SWE, SWE-ENG with some additional content to demonstrate typical professional media products such as news broadcasts.
This dataset contains 59.95 hours of media in total.
Dataset 3: Yle machine translated subtitles evaluation dataset
Semi-automatically cleaned, parallel professional subtitles from 44 programs, containing 10.3k aligned sentence pairs for these language pairs: FIN-SWE, FIN-ENG, SWE-ENG.
This dataset does not contain video or audio, but the total content length covered by the subtitles is 22.46 hours.
Second: Fill in and send this form to request the data: https://forms.gle/VSTJeLkrdvoNeQzg8.
We will review your request and if everything is in order, we'll send you instructions on how to access the data. If needed, we will ask for clarifications.
This work has been supported by the European Union's Horizon 2020 research and innovation programme via the project MeMAD (grant agreement 780069).
Elävä arkisto metadata
Yle Elävä arkisto and Yle Arkivet metadata is also open for public use.