Vocabulary
The set of supported words
The vocabulary of a model is the set of supported words. In a machine translation system, the term may refer to either the input or the output vocabulary: an input-supported word is one that the system can translate, and an output-supported word is one that the translation model can generate.
Typically, the vocabulary is created from the training data by retaining the N most frequent words in the source and target languages. Words that are not in the vocabulary are called out-of-vocabulary (OOV) words.
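A minimal sketch of this procedure in Python, assuming simple whitespace tokenisation and a frequency cut-off; the <unk> token and the function names here are illustrative, not part of any particular toolkit.

```python
from collections import Counter

UNK = "<unk>"  # placeholder token used for out-of-vocabulary words

def build_vocabulary(sentences, max_size):
    """Keep the max_size most frequent words seen in the training data."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}

def map_oov(sentence, vocabulary):
    """Replace out-of-vocabulary words with the <unk> token."""
    return " ".join(w if w in vocabulary else UNK for w in sentence.split())

# Toy usage: "sofa" never appears in training, so it is mapped to <unk>.
training_data = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocabulary(training_data, max_size=5)
print(map_oov("the cat sat on the sofa", vocab))
```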
The vocabulary of a language is the set of all possible words in the language.
The vocabulary of a pretrained word embedding model is the set of all words with defined embeddings.
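For pretrained embeddings distributed as text, the vocabulary can be read directly from the embedding file. A sketch assuming a GloVe-style format (one word followed by its vector per line); the file name is a placeholder.

```python
def embedding_vocabulary(path):
    """Collect the set of words that have a pretrained vector (GloVe-style text file)."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.split(" ", 1)[0]  # the first field on each line is the word
            vocab.add(word)
    return vocab

# Words missing from this set are out-of-vocabulary for the embedding model.
# vocab = embedding_vocabulary("glove.6B.100d.txt")  # placeholder file name
```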
Challenges
- A large vocabulary allows the system to learn to translate more words, but it makes the model much larger and slower.
- If pretrained word embeddings are used, the vocabulary is already determined and not trivial to update with new data.
- Open-class words, such as proper names and numbers, present challenges.
- Agglutinative and highly inflected languages have very large vocabularies and may require massive amounts of training data before enough of the vocabulary has been observed.
Dependencies
- The vocabulary is affected by the choice of tokenisation algorithm and by the use of subword models such as byte-pair encoding. With byte-pair encoding, the vocabulary size becomes a hyperparameter that affects the generalisation of the model (a minimal sketch of the merge-learning step follows this list). By 2022, vocabulary sizes with subword models typically ranged from 16,000 to 64,000.
- The vocabulary is affected by normalisation of capitalisation and accents (see the second example below).
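A minimal sketch of byte-pair-encoding merge learning in Python, in the spirit of the standard algorithm but simplified: it starts from characters, omits the end-of-word marker, and takes a toy word-frequency dictionary as input. Production systems use optimised implementations, and the number of merges is the hyperparameter that sets the subword vocabulary size.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges from a word-frequency dictionary."""
    # Represent each word as a sequence of characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

# Toy word frequencies; real systems learn 16,000-64,000 merges from large corpora.
print(learn_bpe({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, num_merges=4))
```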
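A second small example showing how case and accent normalisation collapses surface forms into a single vocabulary entry, using Python's standard unicodedata module; whether to normalise is a design choice, since it discards information.

```python
import unicodedata

def normalise(word):
    """Lowercase and strip accents, shrinking the effective vocabulary."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalise("Café"))  # -> "cafe"
```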