Vocabulary

The set of supported words


The vocabulary of a model is the set of supported words. In a machine translation system, the vocabulary may refer to the set of words supported in either the input or the output. An input-supported word is a word that can be translated. An output-supported word is a word that can be generated by the translation model.

Typically, the vocabulary is created from training data by retaining the N most frequent words in each of the source and target languages. Words that are not in the vocabulary are called out-of-vocabulary (OOV) words.
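As a rough illustration of frequency-based selection, the sketch below builds a vocabulary from a toy corpus and lists the OOV words in a new sentence. The function names, the whitespace tokenisation and the corpus are assumptions made for the example, not part of any particular toolkit.

```python
from collections import Counter


def build_vocabulary(sentences, max_size):
    """Keep the max_size most frequent whitespace-separated tokens."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}


def find_oov(sentence, vocabulary):
    """Return the tokens of a sentence that are not in the vocabulary."""
    return [token for token in sentence.split() if token not in vocabulary]


# Toy training data (hypothetical).
corpus = ["the cat sat", "the dog sat", "the cat ran"]

vocab = build_vocabulary(corpus, max_size=3)
oov = find_oov("the dog sat", vocab)

print(sorted(vocab))  # ['cat', 'sat', 'the']
print(oov)            # ['dog']
```

In practice, the source and target sides would each get their own counts and their own N, unless a shared vocabulary is used.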

The vocabulary of a language is the set of all possible words in the language.

The vocabulary of a pretrained word embedding model is the set of all words with defined embeddings.

Challenges

  • A large vocabulary allows the system to learn to translate more words, but makes the model larger and slower, because the embedding and output layers grow with the vocabulary size.
  • If pretrained word embeddings are used, the vocabulary is already determined and not trivial to update with new data.
  • Open-class words, such as proper names and numbers, present challenges because new ones can always appear.
  • Agglutinative and highly inflected languages have very large vocabularies, and may require massive amounts of training data to observe enough of the vocabulary.
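To give a sense of the size cost in the first challenge above, the following back-of-the-envelope sketch counts the parameters that depend directly on the vocabulary. It assumes a simplified neural model with one embedding matrix and one output projection over a single vocabulary; the dimension of 512 and the vocabulary sizes are illustrative choices, not recommendations.

```python
def vocabulary_parameters(vocab_size, embedding_dim, tied=False):
    """Count parameters that scale with vocabulary size (biases ignored).

    Assumes one input embedding matrix and one output projection,
    each of shape vocab_size x embedding_dim. With tied=True the two
    share a single matrix.
    """
    input_embedding = vocab_size * embedding_dim
    output_projection = 0 if tied else vocab_size * embedding_dim
    return input_embedding + output_projection


# Hypothetical settings for comparison.
for size in (16000, 32000, 64000):
    print(size, vocabulary_parameters(size, 512))
# 16000 16384000
# 32000 32768000
# 64000 65536000
```

The count grows linearly with the vocabulary size, and the output layer must also score every vocabulary entry at each decoding step, which is one reason larger vocabularies are slower.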

Dependencies

  • The vocabulary is affected by the choice of tokenisation algorithm and by the use of subword models such as byte-pair encoding. With byte-pair encoding, the vocabulary size becomes a hyperparameter that affects the generalisation of the model. By 2022, vocabulary sizes with subword models typically ranged from 16,000 to 64,000.
  • The vocabulary is affected by normalisation of capitalisation and accents.
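The effect of normalisation on vocabulary size can be sketched with the Python standard library. The `normalise` function below is a hypothetical helper that lowercases a token and strips combining accents; real systems may normalise differently or not at all.

```python
import unicodedata


def normalise(token):
    """Lowercase a token and remove combining accent marks."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy token list (hypothetical).
tokens = ["Café", "café", "cafe", "CAFE", "Résumé", "resume"]

raw_vocab = set(tokens)
normalised_vocab = {normalise(token) for token in tokens}

print(len(raw_vocab))            # 6
print(sorted(normalised_vocab))  # ['cafe', 'resume']
```

Normalisation shrinks the vocabulary, but it also discards distinctions that can matter for translation, such as the difference between "résumé" and "resume".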


Licensed under CC-BY-SA-4.0.