
I am learning Named Entity Recognition, and I see that the training script uses a variable called vocab, which looks like this:

vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

My guess is that the model is supposed to learn all the characters present in the text, like abcd... etc. What I don't understand is the use of characters like \n and \t. What is the purpose of these characters, and of this variable in general?

Thanks in advance.

1 Answer


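This string is the vocabulary. In NLP, a vocabulary is the set of all tokens (words or characters) the model knows about, typically collected from the training set. In your example the vocabulary is a list of characters. The escape sequences are whitespace characters: \n is a newline, \t a tab, \r a carriage return, \x0b a vertical tab, and \x0c a form feed.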
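To make this concrete, here is a minimal sketch of how such a string is commonly turned into a lookup table. I don't know what your training script actually does with vocab, so the names char_to_index, UNK and encode, and the choice of reserving index 0 for unknown characters, are my own assumptions for illustration.

```python
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

# Hypothetical scheme: index 0 is reserved for characters outside the
# vocabulary, so real characters start at index 1.
UNK = 0
char_to_index = {ch: i + 1 for i, ch in enumerate(vocab)}


def encode(text):
    """Turn a string into a list of integer ids, one per character."""
    return [char_to_index.get(ch, UNK) for ch in text]


# The newline gets its own id, just like a letter does.
encoded = encode("Hi\n")
print(encoded)
```

With a table like this, whitespace characters such as \n are ordinary entries: the model sees them as tokens and can learn whether they matter.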

For NER and other NLP tasks, we usually use the vocabulary to produce an embedding for each token (word or character), and these embeddings are fed to the machine learning model (nowadays, neural network architectures such as LSTMs are commonly used). Character-based embeddings have an advantage over word-based embeddings for OOV (out-of-vocabulary) words, i.e. words that do not appear in the training set but are encountered during inference: such a word can still be represented through its characters.
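The OOV point can be illustrated with a small sketch. The training words below are made up, and is_known_word and is_covered_by_chars are hypothetical helpers, not part of any library or of your script.

```python
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

# Made-up word-level vocabulary, as if collected from a tiny training set.
train_words = {"Alice", "lives", "in", "Paris"}


def is_known_word(word):
    """Word-level lookup: only words seen during training are known."""
    return word in train_words


def is_covered_by_chars(word):
    """Character-level lookup: the word is usable if every character is in vocab."""
    return all(ch in vocab for ch in word)


# "Berlin" never appeared in training, so a word-level model treats it as OOV,
# but every one of its characters is in the character vocabulary.
print(is_known_word("Berlin"), is_covered_by_chars("Berlin"))
```

A word-level model would have to map "Berlin" to a single unknown token, while a character-level model can still build a representation from the individual letters.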

  • What happens if I do not include \n in the vocab?
    – Ryan
    Commented Aug 13, 2019 at 7:25
  • @Ryan It depends on your model, but basically it means you consider newlines as non-significant for the task (NER, in your case).
    – dimid
    Commented Aug 13, 2019 at 7:30
