Metrics for evaluating machine translation
Metrics measure the quality of the output of a machine translation system.
Metrics are used to compare different machine translation systems.
The best known metric for machine translation is BLEU, a string-based automatic metric.
Automatic evaluation metrics
Automatic quality metrics can be computed relatively quickly.
Automatic quality metrics are divided into string-based metrics and machine learning-based metrics.
String-based metrics
String-based metrics generally measure the word or character distance between the generated target sentence and the reference translation.
String-based metrics are used in research papers and competitions because they are explainable and fair, and they can support any language pair.
But string-based metrics can punish translations that convey the correct meaning, and the scores cannot be compared across language pairs.
The scores generally do not correlate well with human evaluation scores when translation quality is high.
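As a rough illustration of a word-distance metric, the sketch below computes a word-level edit distance between a generated sentence and a reference, then divides it by the reference length. This is a simplified stand-in, not BLEU or any other published metric. The function names `word_edit_distance` and `word_error_rate` and the example sentences are assumptions made for this illustration.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the hypothesis into the reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    # prev[j] holds the distance between the hypothesis words seen so far
    # (excluding the current one) and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # drop a hypothesis word
                curr[j - 1] + 1,     # add a reference word
                prev[j - 1] + cost,  # keep or substitute a word
            ))
        prev = curr
    return prev[-1]


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance divided by reference length (lower is better)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


# Hypothetical example sentences, not taken from any dataset.
reference = "the cat sat on the mat"
close_match = "the cat sat on a mat"
paraphrase = "a feline was sitting on the rug"

print(word_error_rate(close_match, reference))
print(word_error_rate(paraphrase, reference))
```

The paraphrase receives a much worse score than the near-copy even though it conveys the same meaning, which is the weakness of string-based metrics described above.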
Machine learning-based metrics
Machine learning-based metrics use sentence embeddings to calculate the difference between the generated target sentence and the reference translation, or even between the target sentence and the source sentence.
Machine learning-based metrics require a model that was trained on data with the source and target languages.
The score can correlate well with human evaluation scores.
But the scores are not explainable or fair, so they cannot be used in a research competition.
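The comparison of sentence embeddings can be sketched with cosine similarity. This is a minimal sketch under stated assumptions: the vectors below are made-up toy values, whereas a real metric would obtain them from a trained multilingual encoder, and actual learned metrics may combine the embeddings in other ways. The function name `cosine_similarity` is chosen for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
reference_vec = [0.9, 0.1, 0.3]
good_translation_vec = [0.8, 0.2, 0.3]
bad_translation_vec = [0.1, 0.9, 0.0]

print(cosine_similarity(good_translation_vec, reference_vec))
print(cosine_similarity(bad_translation_vec, reference_vec))
```

Because the score depends entirely on the encoder that produced the vectors, it inherits that model's language coverage and biases, which is why such scores are harder to explain than string-based ones.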
Human evaluation metrics
Human evaluation is the gold standard. Common approaches include:
- Average score and average z-score
- Adequacy and fluency judgement
- Relative ranking
- Constituent ranking
- Yes or no constituent judgement
- Direct assessment
But human evaluation is slow, expensive and subjective.
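The difference between an average score and an average z-score can be sketched as follows. The sketch assumes that z-scores are computed by standardizing each annotator's ratings with that annotator's own mean and standard deviation, which is one common way to reduce annotator subjectivity. The data layout, the function names and the ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Each rating is (annotator, system, score); the values are made up.
Rating = tuple[str, str, float]


def average_scores(ratings: list[Rating]) -> dict[str, float]:
    """Plain average of the raw scores per system."""
    by_system = defaultdict(list)
    for _, system, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


def average_z_scores(ratings: list[Rating]) -> dict[str, float]:
    """Average per system after standardizing each annotator's scores."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    # Mean and population standard deviation of each annotator's ratings.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    by_system = defaultdict(list)
    for annotator, system, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who gives every item the same score carries no signal.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        by_system[system].append(z)
    return {system: mean(zs) for system, zs in by_system.items()}


ratings = [
    ("strict", "A", 40), ("strict", "B", 20),
    ("lenient", "A", 90), ("lenient", "B", 80),
]
print(average_scores(ratings))
print(average_z_scores(ratings))
```

In this toy data the two annotators use very different parts of the scale, but after standardization both agree that system A is one standard deviation above their own mean and system B one below.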
Slide from Unbabel for AMTA 2022