
I am training word2vec on my own text corpus using Mikolov's implementation from here. Not all unique words from the corpus get a vector, even though I have set min-count to 1. Are there any parameters I may have missed that might be the reason not all unique words get a vector? What else might be the reason?

To test word2vec's behavior I have written the following script, providing a text file with 20058 sentences and 278896 words (all words and punctuation marks are space-separated, and there is one sentence per line).

import subprocess


def get_w2v_vocab(path_embs):
    # Read the vocabulary back out of the embedding file written by word2vec.
    vocab = set()
    with open(path_embs, 'r', encoding='utf8') as f:
        next(f)  # skip the header line (vocab size and vector dimensionality)
        for line in f:
            word = line.split(' ')[0]
            vocab.add(word)
    return vocab - {'</s>'}


def train(path_corpus, path_embs):
    # Train word2vec.c on the corpus, keeping every word (-min-count 1).
    subprocess.call(["./word2vec", "-threads", "6", "-train", path_corpus,
                     "-output", path_embs, "-min-count", "1"])


def get_unique_words_in_corpus(path_corpus):
    # Collect every whitespace-separated token in the corpus.
    vocab = []
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            vocab.extend(line.strip('\n').split(' '))
    return set(vocab)


def check_equality(expected, actual):
    if expected != actual:
        diff = len(expected - actual)
        raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
    print('Expected vocab and actual vocab are equal.')


def main():
    path_corpus = 'test_corpus2.txt'
    path_embs = 'embeddings.vec'
    vocab_expected = get_unique_words_in_corpus(path_corpus)
    train(path_corpus, path_embs)
    vocab_actual = get_w2v_vocab(path_embs)
    check_equality(vocab_expected, vocab_actual)


if __name__ == '__main__':
    main()

This script gives me the following output:

Starting training using file test_corpus2.txt
Vocab size: 33651
Words in train file: 298954
Alpha: 0.000048  Progress: 99.97%  Words/thread/sec: 388.16k  Traceback (most recent call last):
  File "test_w2v_behaviour.py", line 44, in <module>
    main()
  File "test_w2v_behaviour.py", line 40, in main
    check_equality(vocab_expected, vocab_actual)
  File "test_w2v_behaviour.py", line 29, in check_equality
    raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
Exception: Not equal! Vocab expected: 42116, Vocab actual: 33650, Diff: 17316

2 Answers


As long as you're using Python, you might want to use the Word2Vec implementation in the gensim package. It does everything the original Mikolov/Google word2vec.c does, and more, and is usually performance-competitive.

In particular, it won't have any issues with UTF-8 encoding, whereas I'm not sure the Mikolov/Google word2vec.c handles UTF-8 correctly. That may be a source of your discrepancy.
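
For example, here is a minimal sketch of the gensim route. This is my sketch, not code from the question: it assumes the corpus file name test_corpus2.txt from the question and uses gensim's LineSentence iterator for one-sentence-per-line, whitespace-tokenized text.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sentence per line, tokens already separated by spaces.
sentences = LineSentence('test_corpus2.txt')
# Keep every word, mirroring the -min-count 1 flag passed to word2vec.c.
model = Word2Vec(sentences, min_count=1, workers=6)

print('vocabulary size:', len(model.wv.key_to_index))  # gensim >= 4.0
some_word = next(iter(model.wv.key_to_index))           # any word from the learned vocabulary
print(model.wv[some_word])                               # its embedding vector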

If you need to get to the bottom of your discrepancy, I would suggest:

  • have your get_unique_words_in_corpus() also tally/report the total number of word tokens (not just unique words) its tokenization creates. If that's not the same as the 298954 reported by word2vec.c, then the two processes are clearly not working from the same baseline understanding of what 'words' are in the source file.

  • find some words, or at least one representative word, that your token count expects to be in the final model but isn't. Review those for any common characteristic, including their context in the file. That will probably reveal why the two tallies differ; a sketch of both checks follows this list.
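
Here is a minimal sketch of both checks. This is my code, not the question's: it assumes the file names test_corpus2.txt and embeddings.vec from the question and repeats the same whitespace tokenization the question's script uses.

from collections import Counter


def corpus_token_counts(path_corpus):
    # Count every token using the same split as the question's script.
    counts = Counter()
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            counts.update(line.strip('\n').split(' '))
    return counts


def embedding_vocab(path_embs):
    # The first line of word2vec.c's text output is a header; every other
    # line starts with the word itself.
    with open(path_embs, 'r', encoding='utf8') as f:
        next(f)
        return {line.split(' ')[0] for line in f} - {'</s>'}


counts = corpus_token_counts('test_corpus2.txt')
# word2vec.c appears to count one end-of-sentence token per newline
# (278896 words + 20058 lines = 298954), so expect the totals to differ
# by roughly the number of lines.
print('total tokens:', sum(counts.values()))
print('unique tokens:', len(counts))

missing = set(counts) - embedding_vocab('embeddings.vec')
print('expected but missing:', len(missing))
# Inspect the most frequent missing words for unusual lengths or
# multi-byte (non-ASCII) characters.
for w in sorted(missing, key=counts.get, reverse=True)[:20]:
    print(repr(w), counts[w], 'bytes:', len(w.encode('utf8')))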

Again, I suspect something UTF-8 related, or perhaps related to other implementation limits in word2vec.c (such as a maximum word length) that are not mirrored in your Python-based word tallies.

  • Thanks, my text file is UTF-8 encoded, so it is probably an encoding-related error. I will try out the gensim implementation.
    – sinaj
    Commented Feb 15, 2019 at 17:01

You could use FastText instead of Word2Vec. FastText is able to embed out-of-vocabulary words by looking at subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:

from gensim.models import FastText

# training_data: an iterable of tokenized sentences, e.g. a list of lists of strings
model = FastText(sentences=training_data)

word = 'blablabla'              # can be out of vocabulary
embedded_word = model.wv[word]  # fetches the word embedding (model[word] was removed in gensim 4.x)

See https://stackoverflow.com/a/54709303/3275464
