Probability of a word in a sequence
A language model takes text as input and outputs a probability distribution over the next word or character. The input is often called the “context” or “history”, because it represents what has been written so far.
the man is riding a ___
blue: 0.5%, …
In this example, the word bike has a 40% probability of being the next word and is the most likely.
Decoding converts the model's output probabilities into text. One way to do it is to repeatedly take the most probable output, append it to the text, and feed the result back in as the new history. This way, a language model can generate new text, as in the following example:
- Step 6:
<BOS> the man is riding a ___
car: 30%, …
- Step 7:
<BOS> the man is riding a bike ___
on: 10%, …
- Step 8:
<BOS> the man is riding a bike to ___
a: 25%, …
- Step 11:
<BOS> the man is riding a bike to the store . ___
<EOS>: 98%, …
The special token <BOS> (beginning of sentence) marks the start of the sequence. When another special token, <EOS> (end of sentence), is generated, the decoder stops.
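The decoding loop described here can be sketched in a few lines of Python. This is only an illustration: `TOY_MODEL` is a hypothetical stand-in for a real language model, it looks at the last two tokens only, and all of its probabilities apart from those echoed from the example are invented.

```python
BOS, EOS = "<BOS>", "<EOS>"

# Hypothetical toy "model": maps the last two tokens to a next-word distribution.
# A real language model would condition on the whole history.
TOY_MODEL = {
    ("<BOS>",): {"the": 0.8, "a": 0.2},
    ("<BOS>", "the"): {"man": 0.7, "dog": 0.3},
    ("the", "man"): {"is": 0.6, "was": 0.4},
    ("man", "is"): {"riding": 0.6, "walking": 0.4},
    ("is", "riding"): {"a": 0.7, "the": 0.3},
    ("riding", "a"): {"bike": 0.4, "car": 0.3, "horse": 0.3},
    ("a", "bike"): {"to": 0.5, ".": 0.4, "on": 0.1},
    ("bike", "to"): {"the": 0.6, "a": 0.25, "school": 0.15},
    ("to", "the"): {"store": 0.7, "park": 0.3},
    ("the", "store"): {".": 0.9, "and": 0.1},
    ("store", "."): {"<EOS>": 0.98, "the": 0.02},
}


def greedy_decode(model, max_len=20):
    """Repeatedly append the single most probable next token until <EOS>."""
    history = [BOS]
    while len(history) < max_len:
        probs = model[tuple(history[-2:])]
        word = max(probs, key=probs.get)  # most probable next token
        if word == EOS:
            break
        history.append(word)
    return history[1:]  # drop <BOS>


print(" ".join(greedy_decode(TOY_MODEL)))
```

The `max_len` guard is a design choice for the sketch: without it, a model that never assigns the highest probability to <EOS> would loop forever.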
This kind of decoding, where only the most probable token is considered at each step, is called greedy decoding. It does not always lead to the most probable or the most fluent output overall, because a locally best choice can force worse choices later. For this reason, beam search is often used instead: it keeps several of the most probable partial outputs at the same time.
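The difference between greedy decoding and beam search can be seen on a tiny example. The sketch below is a simplified beam search, not a reference implementation; `toy_model` and its probabilities are hypothetical and chosen so that the locally best first word leads to a worse full sentence.

```python
import math

BOS, EOS = "<BOS>", "<EOS>"


def toy_model(history):
    """Hypothetical model: next-token distribution given the full history."""
    table = {
        ("<BOS>",): {"a": 0.6, "the": 0.4},
        ("<BOS>", "a"): {"man": 0.55, "bike": 0.45},
        ("<BOS>", "the"): {"man": 0.9, "bike": 0.1},
    }
    # Any history not in the table ends the sentence with certainty.
    return table.get(tuple(history), {EOS: 1.0})


def beam_search(model, beam_size=2, max_len=10):
    """Keep the beam_size best partial outputs, scored by log-probability."""
    beams = [(0.0, [BOS])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for word, p in model(tokens).items():
                candidates.append((score + math.log(p), tokens + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    best_score, best_tokens = max(finished, key=lambda c: c[0])
    words = [t for t in best_tokens if t not in (BOS, EOS)]
    return words, math.exp(best_score)


print(beam_search(toy_model, beam_size=1))  # beam size 1 behaves like greedy decoding
print(beam_search(toy_model, beam_size=2))
```

With a beam size of 1 the search commits to "a" (0.6) and ends with "a man" at probability 0.6 × 0.55 = 0.33, while a beam size of 2 keeps "the" alive and finds "the man" at 0.4 × 0.9 = 0.36.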
There are many different ways in which language models are created.
The easiest one is to count the number of occurrences of a phrase, in this case a word pair $(w_{i-1}, w_i)$, and divide that by the number of occurrences of just $w_{i-1}$. The result is the probability that, with the history (now truncated to just $w_{i-1}$), the next word is $w_i$.
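As a minimal sketch of this counting approach, the following Python code estimates pair probabilities from a tiny corpus. The corpus and the function names `train_bigram` and `bigram_prob` are made up for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word serves as a history and how often each pair occurs."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<BOS>"] + sentence.split() + ["<EOS>"]
        history_counts.update(tokens[:-1])  # every token except <EOS> precedes another token
        pair_counts.update(zip(tokens, tokens[1:]))
    return history_counts, pair_counts


def bigram_prob(prev, word, history_counts, pair_counts):
    """Relative frequency: count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0  # history never seen in training
    return pair_counts[(prev, word)] / history_counts[prev]


# Hypothetical toy corpus.
corpus = [
    "the man is riding a bike",
    "the man is riding a car",
    "a man is riding a bike",
]
history_counts, pair_counts = train_bigram(corpus)

print(bigram_prob("a", "bike", history_counts, pair_counts))
print(bigram_prob("a", "car", history_counts, pair_counts))
```

In this corpus "a" occurs four times as a history and is followed by "bike" twice, so the estimate for "bike" after "a" is 2/4.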
Under the (first-order) Markov assumption, the input is limited to the last word only. This model is quite restricted because it cannot capture even mid-range dependencies within a sentence. There are models that take longer input, e.g. 3-grams, but they create new issues, such as data sparsity: most longer word sequences never occur in the training data. One solution to those issues is language model smoothing.
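One of the simplest smoothing schemes is add-one (Laplace) smoothing, shown here as a sketch on a hypothetical two-sentence corpus. It is only one of several smoothing methods, and it was picked for brevity rather than quality. Every pair count is increased by one, so unseen pairs no longer get zero probability.

```python
from collections import Counter

# Hypothetical toy corpus.
corpus = ["the man is riding a bike", "the man is riding a car"]
sentences = [s.split() for s in corpus]
vocab = {w for tokens in sentences for w in tokens}

history_counts = Counter()
pair_counts = Counter()
for tokens in sentences:
    history_counts.update(tokens[:-1])
    pair_counts.update(zip(tokens, tokens[1:]))


def mle(prev, word):
    """Unsmoothed relative frequency; zero for any unseen pair."""
    if history_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / history_counts[prev]


def add_one(prev, word):
    """Add-one smoothing: (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (pair_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


print(mle("a", "man"), add_one("a", "man"))    # unseen pair
print(mle("a", "bike"), add_one("a", "bike"))  # seen pair
```

The smoothed probabilities for a fixed history still sum to one over the vocabulary, but probability mass has been moved from seen pairs to unseen ones.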
A neural language model is a neural network that computes the probability of the next word. RNN-based approaches compress the whole sentence history into a single vector, and as a result they perform badly on long-range dependencies. This was vastly improved with the advent of attention. State-of-the-art neural language models are based on the Transformer architecture, using either its encoder (e.g. BERT) or its decoder (e.g. GPT).
Phrase-based machine translation relies on a decoding algorithm that tries to cover the original sentence with phrases. Coverage alone can be achieved trivially by using single-word phrases. The key missing ingredient is the cohesion between phrases, called fluency. Therefore, in the decoding phase of phrase-based machine translation, the score of a state is determined partly by the language model probability. Higher probabilities are preferred because they correspond to more natural-sounding sentences.
In phrase-based machine translation, increasing the weight of the language model increases fluency but can decrease adequacy.
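The trade-off can be illustrated with a simplified weighted sum of log-probabilities. The sketch below assumes only two features, a translation model and a language model, whereas real systems combine more. The candidates and all numbers are invented.

```python
import math

# Hypothetical candidate translations with invented model probabilities:
# "tm" = translation model (adequacy), "lm" = language model (fluency).
CANDIDATES = {
    "adequate but clumsy": {"tm": 0.5, "lm": 0.001},
    "fluent but loose": {"tm": 0.1, "lm": 0.05},
}


def score(probs, lm_weight, tm_weight=1.0):
    """Weighted sum of log-probabilities of the two features."""
    return tm_weight * math.log(probs["tm"]) + lm_weight * math.log(probs["lm"])


def best_candidate(lm_weight):
    """Return the name of the highest-scoring candidate for a given LM weight."""
    return max(CANDIDATES, key=lambda name: score(CANDIDATES[name], lm_weight))


print(best_candidate(0.2))  # low language model weight
print(best_candidate(2.0))  # high language model weight
```

With a low language model weight the candidate favoured by the translation model wins, and with a high weight the candidate favoured by the language model takes over.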
The usage of language models in neural machine translation is more subtle. The decoder can be viewed as a language model because its output is a probability distribution over the target vocabulary and it has computational access to the history, i.e. the target words generated so far.
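This view can be written as a factorisation of the translation probability. The notation is an assumption made for illustration: $x$ is the source sentence, $y = y_1, \dots, y_T$ is the target sentence, and $y_1, \dots, y_{t-1}$ is the history at step $t$.

```latex
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)
```

Each factor has the form of a language model prediction, with the source sentence added as extra conditioning.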