Vocabulary
The set of supported words
The vocabulary of a model is the set of supported words. In a machine translation system, the term may refer to either the input or the output vocabulary: an input-supported word is one that the system can translate, and an output-supported word is one that the translation model can generate.
Typically, the vocabulary is created from the training data by retaining the N most frequent words in the source and target languages. Words that are not in the vocabulary are called out-of-vocabulary (OOV) words.
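A minimal sketch of this procedure in Python, assuming simple whitespace tokenisation and a frequency cut-off; the <unk> token and the function names here are illustrative, not part of any particular toolkit.

```python
from collections import Counter

UNK = "<unk>"  # placeholder token used for out-of-vocabulary words

def build_vocabulary(sentences, max_size):
    """Keep the max_size most frequent words seen in the training data."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}

def map_oov(sentence, vocabulary):
    """Replace out-of-vocabulary words with the <unk> token."""
    return " ".join(w if w in vocabulary else UNK for w in sentence.split())

# Toy usage: "sofa" never appears in training, so it is mapped to <unk>.
training_data = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocabulary(training_data, max_size=5)
print(map_oov("the cat sat on the sofa", vocab))
```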
The vocabulary of a language is the set of all possible words in the language.
The vocabulary of a pretrained word embedding model is the set of all words with defined embeddings.
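For pretrained embeddings distributed as text, the vocabulary can be read directly from the embedding file. A sketch assuming a GloVe-style format (one word followed by its vector per line); the file name is a placeholder.

```python
def embedding_vocabulary(path):
    """Collect the set of words that have a pretrained vector (GloVe-style text file)."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.split(" ", 1)[0]  # the first field on each line is the word
            vocab.add(word)
    return vocab

# Words missing from this set are out-of-vocabulary for the embedding model.
# vocab = embedding_vocabulary("glove.6B.100d.txt")  # placeholder file name
```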
Challenges
- A large vocabulary allows the system to learn to translate more words, but it makes the model much larger and slower.
- If pretrained word embeddings are used, the vocabulary is already determined and not trivial to update with new data.
- Open-class words, such as proper names and numbers, present challenges.
- Agglutinative and highly inflected languages have very large vocabularies and may require massive amounts of training data before enough of the vocabulary has been observed.
Dependencies
- The vocabulary is affected by the choice of tokenisation algorithm and by the use of subword models such as byte-pair encoding. With byte-pair encoding, the vocabulary size becomes a hyperparameter that affects the generalisation of the model (a minimal sketch of the merge-learning step follows this list). By 2022, vocabulary sizes with subword models typically ranged from 16,000 to 64,000.
- The vocabulary is affected by normalisation of capitalisation and accents (see the second example below).
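A minimal sketch of byte-pair-encoding merge learning in Python, in the spirit of the standard algorithm but simplified: it starts from characters, omits the end-of-word marker, and takes a toy word-frequency dictionary as input. Production systems use optimised implementations, and the number of merges is the hyperparameter that sets the subword vocabulary size.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges from a word-frequency dictionary."""
    # Represent each word as a sequence of characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

# Toy word frequencies; real systems learn 16,000-64,000 merges from large corpora.
print(learn_bpe({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, num_merges=4))
```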
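A second small example showing how case and accent normalisation collapses surface forms into a single vocabulary entry, using Python's standard unicodedata module; whether to normalise is a design choice, since it discards information.

```python
import unicodedata

def normalise(word):
    """Lowercase and strip accents, shrinking the effective vocabulary."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalise("Café"))  # -> "cafe"
```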