Latest Release v2
The word vectors can be used for commercial purposes, but Yle should be mentioned as one of the sources, in a way that is not connected to selling, advertising or promoting your products.
Available formats
- Word2vec (by Mikolov et al.)
- fastText (open-source algorithm created by Facebook)
Preprocessing
We preprocessed the source data before training the Word2vec and fastText models: we removed control characters, special Unicode sequences and news-specific article metadata, such as references and names of journalists.
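The exact cleaning rules are not published here, so the following is only a minimal sketch of one plausible step: dropping control and other non-printing characters. The function name `strip_control_characters` and the choice to filter on Unicode general categories starting with "C" are assumptions for illustration, not the procedure actually used.

```python
import unicodedata


def strip_control_characters(text: str) -> str:
    """Remove non-printing characters and normalise whitespace.

    Assumption: "control characters" is read here as every code point whose
    Unicode general category starts with "C" (control, format, and so on).
    """
    cleaned = []
    for ch in text:
        if ch in "\n\r\t":
            # Keep word boundaries: turn line breaks and tabs into spaces.
            cleaned.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop NUL bytes, zero-width characters and similar debris.
            continue
        else:
            cleaned.append(ch)
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(cleaned).split())


sample = "Kolumbian\x00 FARC-sissit\u200b vapauttivat\tneljä"
print(strip_control_characters(sample))
```

Removing article metadata such as journalist names depends on the structure of the source data and is not covered by this sketch.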
Word2vec & fastText models
Word2vec models are trained using Python's gensim package with default parameters, while fastText models are trained using Facebook's fasttext program.
Word2vec
- Training algorithm: continuous bag-of-words (CBOW) with negative sampling
- Default parameters: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
fastText
- Training algorithm: continuous bag-of-words (CBOW) with negative sampling
- Default parameters: https://github.com/facebookresearch/fastText#full-documentation
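As a rough illustration of the fastText side, the sketch below assembles a command line for the `fasttext` program from the hyperparameter values listed in the fastText evaluation tables of this document. The flags (`-input`, `-output`, `-loss`, `-lr`, `-dim`, `-ws`, `-epoch`, `-neg`, `-minCount`, `-minn`, `-maxn`) are standard fastText options; the file names `corpus.txt` and `yle-cbow` are hypothetical, and the actual command used for the published models is not stated.

```python
# Values copied from the fastText rows of the evaluation tables.
FASTTEXT_PARAMS = {
    "lr": 0.05,
    "dim": 100,
    "ws": 5,
    "epoch": 5,
    "neg": 5,
    "minCount": 5,
    "minn": 3,
    "maxn": 6,
}


def build_fasttext_command(model, input_path, output_prefix, params=None):
    """Return the argument list for a `fasttext cbow` or `fasttext skipgram` run."""
    if model not in ("cbow", "skipgram"):
        raise ValueError("model must be 'cbow' or 'skipgram'")
    params = FASTTEXT_PARAMS if params is None else params
    # "-loss ns" selects negative sampling, as named in the list above.
    command = ["fasttext", model, "-input", input_path,
               "-output", output_prefix, "-loss", "ns"]
    for name, value in params.items():
        command += [f"-{name}", str(value)]
    return command


# Hypothetical file names, for illustration only.
command = build_fasttext_command("cbow", "corpus.txt", "yle-cbow")
print(" ".join(command))
```

The list can be passed to `subprocess.run` on a machine where the `fasttext` binary is installed.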
Vector representations
Gzipped CSV files and model-specific binary files are available. In the CSV files, each row contains a word and its 100-dimensional vector representation. See the example below.
kolumbian | -0.136994 | -0.460687 | -3.338988 | ... | -0.275282 |
farc-sissit | 0.513632 | 1.526210 | -0.746329 | ... | 1.053898 |
vapauttivat | 2.613727 | 3.710677 | 4.826252 | ... | -3.551630 |
neljä | -3.279752 | -0.583108 | -3.386821 | ... | 2.775387 |
panttivankia | 1.585605 | 1.487833 | -1.297238 | ... | -3.973775 |
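Rows like the ones above can be read with Python's standard library. This is a minimal sketch that assumes a UTF-8 encoded, comma-delimited file with no header row; the actual delimiter and quoting of the published files are not stated here, so `delimiter` is left as a parameter, and the function name `read_vectors` is an illustrative choice.

```python
import csv
import gzip


def read_vectors(path, delimiter=","):
    """Read a gzipped CSV of word vectors into a dict of word -> list of floats.

    Assumptions: UTF-8 text, one word per row followed by its vector
    components, no header row.
    """
    vectors = {}
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) < 2:
                # Skip blank or malformed rows.
                continue
            word, *values = row
            # Ignore empty cells, e.g. from a trailing delimiter.
            vectors[word] = [float(v) for v in values if v.strip()]
    return vectors
```

For the full files, holding every vector in a plain dict of lists may use a lot of memory; loading the binary model files with the matching library is likely more practical for large-scale use.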
Evaluation results
The following tables list evaluation results for both corpora (Yle, Yle+Wikipedia) and for both fastText and Word2vec models. The hyperparameters used are listed in each row.
fastText cbow
Corpus | lr | dim | ws | epoch | neg | mincount | minn | maxn | Euclidean Similarity | Cosine Similarity | Word Intrusion |
---|---|---|---|---|---|---|---|---|---|---|---|
yle | 0.05 | 100 | 5 | 5 | 5 | 5 | 3 | 6 | 0.2928 | 0.3309 | 0.30894 |
yle-wikipedia | 0.05 | 100 | 5 | 5 | 5 | 5 | 3 | 6 | 0.2843 | 0.3281 | 0.37367 |
fastText skipgram
Corpus | lr | dim | ws | epoch | neg | mincount | minn | maxn | Euclidean Similarity | Cosine Similarity | Word Intrusion |
---|---|---|---|---|---|---|---|---|---|---|---|
yle | 0.05 | 100 | 5 | 5 | 5 | 5 | 3 | 6 | 0.28201 | 0.30515 | 0.73045 |
yle-wikipedia | 0.05 | 100 | 5 | 5 | 5 | 5 | 3 | 6 | 0.28032 | 0.30730 | 0.78213 |
Word2vec cbow
Corpus | lr | dim | ws | epoch | neg | mincount | minn | maxn | Euclidean Similarity | Cosine Similarity | Word Intrusion |
---|---|---|---|---|---|---|---|---|---|---|---|
yle | 0.05 | 100 | 5 | 5 | 5 | 5 | - | - | 0.2284 | 0.2687 | 0.39788 |
yle-wikipedia | 0.05 | 100 | 5 | 5 | 5 | 5 | - | - | 0.2495 | 0.2943 | 0.45905 |
Word2vec skipgram
Corpus | lr | dim | ws | epoch | neg | mincount | minn | maxn | Euclidean Similarity | Cosine Similarity | Word Intrusion |
---|---|---|---|---|---|---|---|---|---|---|---|
yle | 0.05 | 100 | 5 | 5 | 5 | 5 | - | - | 0.30698 | 0.30878 | 0.66951 |
yle-wikipedia | 0.05 | 100 | 5 | 5 | 5 | 5 | - | - | 0.29046 | 0.28948 | 0.76215 |
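Two of the columns above are named after standard ways of comparing vectors. The evaluation procedure itself is not described here, so the sketch below only shows the two underlying measures between a pair of vectors; it does not reproduce how the reported scores were computed. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, which is one reason the two columns can rank models differently.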
Download Word Vector Datasets
1. Read the license
The license is available in Finnish only, sorry! If necessary, please consult someone with sufficient Finnish language skills.
To summarize the license: the word vectors can be used for commercial purposes, but Yle should be mentioned as one of the sources, in a way that is not connected to selling, advertising or promoting your products.
2. Tell us about your plans
Finnish
Source | Model | CSV | Binary |
---|---|---|---|
Yle | fastText | CSV | Binary |
Yle | Word2Vec | CSV | Binary |
Yle + Wikipedia | fastText | CSV | Binary |
Yle + Wikipedia | Word2Vec | CSV | Binary |
Swedish
Source | Model | CSV | Binary |
---|---|---|---|
Yle | fastText | CSV | Binary |
Yle | Word2Vec | CSV | Binary |
Yle + Wikipedia | fastText | CSV | Binary |
Yle + Wikipedia | Word2Vec | CSV | Binary |
Elävä arkisto metadata
Yle Elävä arkisto and Yle Arkivet metadata is also open for public use.