Human evaluation metrics
Human evaluation metrics for machine translation are standards for assessing and comparing how machine translation systems perform on evaluation sets.
Challenges
- Human evaluation is subjective by nature.
- Human evaluation is slow and expensive.
- There are several competing standards.
- Results from different languages and evaluation sets cannot be compared.
Metrics
MQM
Multidimensional Quality Metrics (MQM) is a framework that defines specific translation error types, their severities, and error weights.
Error dimensions:
- Terminology
- Accuracy
- Linguistic conventions
- Style
- Locale conventions
- Audience appropriateness
- Design and markup
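As a rough illustration of how the pieces fit together, the sketch below computes an MQM-style penalty from a list of annotated errors. The severity weights (minor = 1, major = 5, critical = 10) and the per-100-words normalisation are assumptions for illustration; actual MQM scorecards define their own weights and normalisation.

```python
# Minimal sketch of an MQM-style weighted error penalty.
# Severity weights and normalisation are illustrative assumptions,
# not an official MQM scorecard.

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}  # assumed weights

def mqm_penalty(errors, word_count):
    """Weighted error penalty per 100 source words.

    `errors` is a list of (dimension, severity) tuples, e.g.
    [("Accuracy", "major"), ("Style", "minor")].
    """
    total = sum(SEVERITY_WEIGHTS[severity] for _dimension, severity in errors)
    return 100.0 * total / word_count

# Example: one major and one minor error in a 50-word segment.
print(mqm_penalty([("Accuracy", "major"), ("Style", "minor")], word_count=50))  # 12.0
```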
SQM
The Scalar Quality Metric (SQM) evaluation gathers scalar ratings at the segment level with document context.
Each source segment and its corresponding translated segment are displayed together in a table row, in document order. For each segment, human raters choose a rating on a seven-point scale.
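A minimal sketch of how such segment ratings could be represented and averaged into a system-level score; the 0–6 range used for the seven-point scale and the field names are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SqmRating:
    doc_id: str      # document that provides the context
    segment_id: int  # position of the segment within the document
    source: str
    translation: str
    rating: int      # seven-point scale, assumed here to run from 0 to 6

def system_sqm_score(ratings):
    """Average the segment-level ratings into a system-level score."""
    return mean(r.rating for r in ratings)

ratings = [
    SqmRating("doc1", 1, "Guten Morgen.", "Good morning.", 6),
    SqmRating("doc1", 2, "Wie geht es dir?", "How goes it you?", 3),
]
print(system_sqm_score(ratings))  # 4.5
```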
Average score and average z-score
The average score is the mean of the raw human assessment scores given to a system's translations.
For the average z-score, each assessor's scores are first standardised according to that assessor's overall mean and standard deviation, and the standardised scores are then averaged into a system-level score. Standardisation removes differences in how strictly or leniently individual assessors score.
Average score and average z-score have been the main metrics used in the results for the translation shared task since WMT17.
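The sketch below shows one way to compute both statistics from raw assessor scores; the 0–100 score scale and the data layout are assumptions for illustration.

```python
from statistics import mean, pstdev
from collections import defaultdict

# Raw human assessment scores: (assessor, system, score on an assumed 0-100 scale).
scores = [
    ("a1", "sysA", 80), ("a1", "sysB", 60),
    ("a2", "sysA", 95), ("a2", "sysB", 90),
]

# Average score: mean of the raw scores per system.
by_system = defaultdict(list)
for assessor, system, score in scores:
    by_system[system].append(score)
avg_score = {s: mean(v) for s, v in by_system.items()}

# Average z-score: standardise each score by its assessor's mean and
# standard deviation, then average per system.
by_assessor = defaultdict(list)
for assessor, system, score in scores:
    by_assessor[assessor].append(score)
stats = {a: (mean(v), pstdev(v)) for a, v in by_assessor.items()}

z_by_system = defaultdict(list)
for assessor, system, score in scores:
    mu, sigma = stats[assessor]
    z_by_system[system].append((score - mu) / sigma)
avg_z = {s: mean(v) for s, v in z_by_system.items()}

print(avg_score)  # {'sysA': 87.5, 'sysB': 75.0}
print(avg_z)      # {'sysA': 1.0, 'sysB': -1.0}
```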
TrueSkill
TrueSkill is a skill rating system originally developed by Microsoft Research for the Xbox Live gaming community. For WMT, TrueSkill was adapted to rank machine translation systems from pairwise human judgements.
For WMT14, WMT15 and WMT16, TrueSkill was used as the human evaluation ranking for all translation shared tasks.
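A minimal sketch of feeding pairwise outcomes derived from human judgements into TrueSkill ratings, using the third-party trueskill Python package; the data and the package choice are illustrative assumptions rather than the exact WMT setup.

```python
# pip install trueskill
import trueskill

# Pairwise outcomes extracted from human judgements: (winner, loser).
comparisons = [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC")]

systems = {s for pair in comparisons for s in pair}
ratings = {s: trueskill.Rating() for s in systems}

# Update ratings one comparison at a time.
for winner, loser in comparisons:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rank systems by a conservative skill estimate (mu - 3 * sigma).
for system, r in sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True):
    print(system, round(r.mu, 2), round(r.sigma, 2))
```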
Adequacy and fluency judgement
In adequacy and fluency judgement, for each input, humans rate the output from each system for both adequacy and fluency on a five-point scale. The adequacy score indicates how well the output preserves the meaning of the source, and the fluency score indicates how fluent the output is.
Adequacy and fluency judgement was the official ranking for the translation shared task from WMT06 to WMT07.
Relative ranking
In relative ranking, for each input, humans rank the outputs from all systems. There is no absolute score or label, so there is no measure of absolute quality.
The segment-level rankings are used to calculate system-level rankings, for example with TrueSkill.
Relative ranking was the official ranking for the translation shared task from WMT07 to WMT16.
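As a sketch of this aggregation step, the code below converts per-input rankings into pairwise win proportions, which can then be fed to a rating method such as TrueSkill (as in the sketch above); the data layout is an assumption for illustration.

```python
from itertools import combinations
from collections import Counter

# Per-input relative rankings: each dict maps system name to its rank (1 = best).
rankings = [
    {"sysA": 1, "sysB": 2, "sysC": 3},
    {"sysA": 2, "sysB": 1, "sysC": 3},
]

wins = Counter()
comparisons = Counter()
for ranking in rankings:
    for s1, s2 in combinations(ranking, 2):
        if ranking[s1] == ranking[s2]:
            continue  # ties count as no win for either system
        comparisons[s1] += 1
        comparisons[s2] += 1
        wins[s1 if ranking[s1] < ranking[s2] else s2] += 1

# Proportion of pairwise comparisons each system won.
for system in comparisons:
    print(system, wins[system] / comparisons[system])
```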
Constituent ranking
In constituent ranking, for each input, humans rank the systems' translations of an automatically selected syntactic constituent instead of the complete sentence. The constituent score measures how often a system was judged to be better than any other system.
Constituent ranking was the official ranking for the translation shared task from WMT07 to WMT08.
Yes or no constituent judgement
In yes or no constituent judgement, for each input, humans judge the acceptability of the systems' translations of an automatically selected syntactic constituent. The acceptability score measures the percentage of a system's constituent translations that were judged to be acceptable.
Yes or no constituent judgement was added as an official ranking for WMT08.
Direct assessment
In direct assessment, for each input, humans rate the output from each system with an absolute score or label. The segment-level ratings can then be used to calculate a system-level ranking.
Direct assessment was first added as an investigatory ranking for WMT16 and has been the official ranking for the translation shared task since WMT17.
There are different types of direct assessment.
- Monolingual: Human raters see the system output only.
- Bilingual: Human raters see the system input and output.
- Reference-based: Human raters see the system output and a reference output.
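The sketch below illustrates the difference between the three setups by assembling what a rater would see for one item; the function and field names are hypothetical.

```python
def rater_view(assessment_type, source, system_output, reference=None):
    """Return the material shown to a rater for one direct-assessment item.

    The function and field names are hypothetical; they only illustrate the
    difference between the three setups.
    """
    if assessment_type == "monolingual":
        return {"output": system_output}
    if assessment_type == "bilingual":
        return {"source": source, "output": system_output}
    if assessment_type == "reference-based":
        return {"output": system_output, "reference": reference}
    raise ValueError(f"unknown assessment type: {assessment_type}")

print(rater_view("bilingual", "Guten Morgen.", "Good morning."))
```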
For WMT22, a combination of direct assessment and SQM was used to evaluate the out-of-English and non-English translation pairs.