N-gram
A short sequence of word types
An n-gram is a short sequence of word types, where n is the number of words in the sequence. For example, a 2-gram has two word types. N-grams have many applications in machine translation:
- language models, such as an n-gram maximum likelihood estimate
- translation models for statistical machine translation
- evaluation metrics, such as BLEU
- language identification
Number of words | Common name | Sequence notation | N-gram language model notation |
---|---|---|---|
1 | Unigram | $w_1$ | $p(w_1)$ |
2 | Bigram | $w_1 w_2$ | $p(w_2 \mid w_1)$ |
3 | Trigram | $w_1 w_2 w_3$ | $p(w_3 \mid w_1 w_2)$ |
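As an illustration of the n-gram maximum likelihood estimate mentioned in the list of applications, the following Python sketch estimates bigram probabilities by relative frequency, p(w2 | w1) = count(w1 w2) / count(w1). The function name `bigram_mle` and the toy corpus are assumptions made for this example. The sketch also leaves out sentence-boundary markers and smoothing, both of which a practical language model would normally need.

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate p(w2 | w1) as count(w1 w2) / count(w1 used as a history).

    `sentences` is a list of token lists. Hypothetical helper for
    illustration only: no smoothing, no sentence-boundary markers.
    """
    bigram_counts = Counter()
    history_counts = Counter()
    for tokens in sentences:
        # Pair each token with its successor.
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_counts[(w1, w2)] += 1
            history_counts[w1] += 1
    return {
        bigram: count / history_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Toy corpus (made up for this example).
corpus = [
    ["the", "car", "has", "two", "doors"],
    ["the", "car", "is", "red"],
    ["the", "house", "has", "two", "doors"],
]
probs = bigram_mle(corpus)
print(probs[("the", "car")])    # "the" is followed by "car" in 2 of 3 cases
print(probs[("the", "house")])  # and by "house" in 1 of 3 cases
```

Because the estimate is a relative frequency, the probabilities of all words seen after a given history sum to one, and any bigram that never occurs in the corpus gets no entry at all (an implicit probability of zero).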
Example
String in English: "The car has two doors."
Tokens: "The", "car", "has", "two", "doors", "."
Unigrams: "The", "car", "has", "two", "doors", "."
Bigrams: ("The", "car"), ("car", "has"), ("has", "two"), ("two", "doors"), ("doors", ".")
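The extraction shown in this example can be sketched in a few lines of Python. The helper name `ngrams` is an assumption made for illustration, and the token list is taken as given, so tokenization itself is not shown.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams in `tokens` as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens contains k - n + 1 n-grams (none if k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "car", "has", "two", "doors", "."]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

With the six tokens above, this yields five bigrams and four trigrams, the first trigram being ("The", "car", "has").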