Word vectors based on Yle's article corpus

We release word vectors based on Yle's article corpus to support the development of Finnish-language artificial intelligence applications. Happy coding!

Wait, but what on earth is a "Word Vector" 🤯? Learn more about these great tools for natural language processing, e.g. here!

Latest Release v2

Thu May 9th, 2019

The word vectors can be used for commercial purposes, but Yle should be credited as one of the sources, though not in a way that relates to selling, advertising or promoting your products.

Available formats

  • Word2vec (by Mikolov et al.)
  • fastText (open-source algorithm created by Facebook)

Preprocessing

Before training the Word2vec and fastText models, we cleaned the source data by removing control characters, special Unicode sequences and news-specific article metadata, such as references and names of journalists.
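
As a rough illustration of this kind of cleaning, here is a minimal Python sketch. It is not the pipeline used for the released models: the function name `clean_text` is hypothetical, and treating the Unicode categories `Cc` (control) and `Cf` (invisible format characters) as the "special sequences" to drop is an assumption. Removing article metadata is not covered, since the rules for it are not described here.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Drop control/format characters and collapse whitespace (illustrative only)."""
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(" ")  # treat line breaks and tabs as plain spaces
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # skip control characters and invisible format characters
        else:
            kept.append(ch)
    # Collapse any runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", "".join(kept)).strip()


# Sample text with a soft hyphen, a bell character and a zero-width space.
sample = "Kolumbian\u00ad FARC-sissit\x07 vapauttivat\u200b\n neljä panttivankia"
print(clean_text(sample))
```

Legitimate non-ASCII letters such as "ä" are kept, because they belong to letter categories rather than to the control or format categories.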

Word2vec & fastText models

The Word2vec models were trained using Python's gensim package with default parameters, while the fastText models were trained using Facebook's fasttext program.

Vector representations

Gzipped CSV files and model-specific binary files are available. In the CSV files, each row contains a word and its 100-dimensional vector representation, as in the example below.

kolumbian -0.136994 -0.460687 -3.338988 ... -0.275282
farc-sissit 0.513632 1.526210 -0.746329 ... 1.053898
vapauttivat 2.613727 3.710677 4.826252 ... -3.551630
neljä -3.279752 -0.583108 -3.386821 ... 2.775387
panttivankia 1.585605 1.487833 -1.297238 ... -3.973775
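
Rows in this shape can be read with the Python standard library alone. The sketch below assumes the fields are separated by whitespace, as in the example above, and that the file has no header row; the actual files may differ, so check a few lines first. The function name `load_vectors`, its `delimiter` parameter and the demo file name are hypothetical, and the demo rows are truncated to three dimensions for brevity.

```python
import gzip
import os
import tempfile
from typing import Dict, List, Optional


def load_vectors(path: str, delimiter: Optional[str] = None) -> Dict[str, List[float]]:
    """Read rows of `word v1 v2 ...` from a gzipped text file.

    delimiter=None splits on any whitespace; pass "," for comma-separated rows.
    """
    vectors: Dict[str, List[float]] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(delimiter)
            if len(parts) < 2:
                continue  # skip blank or malformed rows
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Demo with a tiny stand-in file (first three values of two example rows).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_vectors.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("kolumbian -0.136994 -0.460687 -3.338988\n")
    handle.write("neljä -3.279752 -0.583108 -3.386821\n")

vectors = load_vectors(demo_path)
print(len(vectors), vectors["neljä"])
```

Loading a full vocabulary into plain Python lists uses a lot of memory; for real work, converting each row to a NumPy array or using the model-specific binary files is likely the better choice.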

Evaluation results

The following tables list evaluation results for both corpora (Yle, Yle+Wikipedia) and for both fastText and Word2vec models. The hyperparameters used are listed in each row.

fastText cbow

Corpus         lr    dim  ws  epoch  neg  mincount  minn  maxn  Euclidean Similarity  Cosine Similarity  Word Intrusion
yle            0.05  100  5   5      5    5         3     6     0.2928                0.3309             0.30894
yle-wikipedia  0.05  100  5   5      5    5         3     6     0.2843                0.3281             0.37367

fastText skipgram

Corpus         lr    dim  ws  epoch  neg  mincount  minn  maxn  Euclidean Similarity  Cosine Similarity  Word Intrusion
yle            0.05  100  5   5      5    5         3     6     0.28201               0.30515            0.73045
yle-wikipedia  0.05  100  5   5      5    5         3     6     0.28032               0.30730            0.78213

Word2vec cbow

Corpus         lr    dim  ws  epoch  neg  mincount  minn  maxn  Euclidean Similarity  Cosine Similarity  Word Intrusion
yle            0.05  100  5   5      5    5         -     -     0.2284                0.2687             0.39788
yle-wikipedia  0.05  100  5   5      5    5         -     -     0.2495                0.2943             0.45905

Word2vec skipgram

Corpus         lr    dim  ws  epoch  neg  mincount  minn  maxn  Euclidean Similarity  Cosine Similarity  Word Intrusion
yle            0.05  100  5   5      5    5         -     -     0.30698               0.30878            0.66951
yle-wikipedia  0.05  100  5   5      5    5         -     -     0.29046               0.28948            0.76215
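
The exact evaluation procedure behind these scores is not described here. For readers who want to compare word vectors themselves, the two basic measures that the column names refer to can be sketched in plain Python as follows. The function names are hypothetical, and this is not the evaluation code used to produce the tables.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors (smaller means closer)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy two-dimensional vectors, chosen only to make the arithmetic easy to check.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # a 3-4-5 right triangle
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to both, which is one reason the two columns can rank models differently.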

Download Word Vector Datasets

👉 You are almost there! To download the datasets, complete two quick steps:

1. Read the license

The license is available in Finnish only, sorry! If necessary, please consult someone with sufficient Finnish language skills.

To summarize the license: the word vectors can be used for commercial purposes, but Yle should be credited as one of the sources, though not in a way that relates to selling, advertising or promoting your products.

2. Tell us about your plans

Note: We are very serious about privacy, and will not distribute your email to third parties. This is just for letting us learn a little about our user community.

Elävä arkisto metadata

Yle Elävä arkisto and Yle Arkivet metadata is also open for public use.