Metrics

List of metrics for evaluating machine translation

Metrics measure the quality of the output of a machine translation system.

Metrics are used to compare different machine translation systems.

The best known metric for machine translation is BLEU, a string-based automatic metric.

Table of contents

Automatic evaluation metrics
1. String-based metrics
2. Machine learning-based metrics
Human evaluation metrics
1. Evolution
  1. Slide from Unbabel for AMTA 2022
See also

Automatic evaluation metrics

Automatic quality metrics can be computed relatively quickly.

Automatic quality metrics are divided into string-based metrics and machine learning-based metrics.

String-based metrics

String-based metrics generally measure the word or character distance between the target sentence and the reference translation.
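As a rough illustration of the word-distance idea, the sketch below computes a word-level edit distance between a target sentence and a reference translation. It is a minimal example, not the definition of any particular published metric. The function names `word_edit_distance` and `normalized_distance`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the hypothesis into the reference."""
    # Assumption: naive whitespace tokenization, chosen only for brevity.
    hyp = hypothesis.split()
    ref = reference.split()

    # previous[j] is the distance between the hypothesis words seen so far
    # (before the current one) and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete hyp_word
                current[j - 1] + 1,      # insert ref_word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def normalized_distance(hypothesis: str, reference: str) -> float:
    """Edit distance divided by reference length; 0.0 means identical."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    # Hypothetical system outputs, invented for this example.
    print(normalized_distance("the cat sat on a mat", reference))
    print(normalized_distance("a feline rested on the rug", reference))
```

The second hypothetical output is a reasonable paraphrase, yet it receives a much larger distance than the first because few of its words match the reference exactly.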

Examples:

String-based metrics are used in research papers and competitions because they are explainable and fair, and they can support any language pair.

But string-based metrics can penalize translations that convey the correct meaning in different words, and the scores cannot be compared across language pairs.

The scores generally do not correlate well with human evaluation scores when translation quality is high.
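One common way to quantify how well a metric agrees with human evaluation is a correlation coefficient such as Pearson's r, computed over paired scores for the same translations. The sketch below is a minimal version of that calculation. The `pearson` helper and both score lists are hypothetical, invented purely for illustration, and are not real evaluation results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-segment scores, made up for this example only.
    metric_scores = [0.21, 0.35, 0.40, 0.52, 0.58]
    human_scores = [55.0, 60.0, 72.0, 70.0, 88.0]
    print(f"Pearson r = {pearson(metric_scores, human_scores):.3f}")
```

A value near 1 would mean the metric ranks translations much as human evaluators do, and a value near 0 would mean the metric carries little information about human judgments.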

Machine learning-based metrics

Machine learning-based metrics use sentence embeddings to calculate the difference between the generated target sentence and the reference translation, or even between the target sentence and the source sentence.
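The comparison step can be sketched with cosine similarity between two embedding vectors. This is only an illustration of the general idea, not the scoring function of any specific metric. The `cosine_similarity` helper and the three-dimensional vectors are hypothetical stand-ins. A real metric would obtain its embeddings from a trained model, and that step is deliberately left out here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sentence embeddings, hand-written for this example.
    # In practice they would come from a trained sentence encoder.
    target_embedding = [0.8, 0.1, 0.3]
    reference_embedding = [0.7, 0.2, 0.4]
    print(cosine_similarity(target_embedding, reference_embedding))
```

Because the comparison happens in embedding space rather than on surface strings, two sentences with different wording can still score as similar if the model maps them to nearby vectors.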

Examples:

Machine learning-based metrics require a model that was trained on data with the source and target languages.

The score can correlate well with human evaluation scores.

But the scores are not explainable or fair, so they cannot be used in a research competition.

Human evaluation metrics

Human evaluation is the gold standard.

But human evaluation is slow, expensive and subjective.

Evolution

Slide from Unbabel for AMTA 2022