
WMT

Conference on Machine Translation


WMT is the main event for machine translation and machine translation research. The conference is held annually in connection with larger conferences on natural language processing.

The conference aims to bring together academic scientists, researchers and industry representatives to share their experiences and research results. WMT plays a key role in the fields of computational linguistics and machine translation.

In 2006, the first Workshop on Statistical Machine Translation was held at NAACL, the Annual Meeting of the North American Chapter of the Association for Computational Linguistics.

In 2016, with the rise of neural machine translation, WMT became a conference of its own. The Conference on Machine Translation is still mainly known as WMT.

Universities, research laboratories and big technology companies consistently participate in the conference and are represented in the organising committee.

Table of contents
  1. Events
  2. Shared tasks
    1. Recurrent tasks
      1. Translation tasks
      2. Evaluation tasks
      3. Other tasks
    2. Discontinued tasks
  3. Organisers
  4. Evaluation
    1. Average score and average z-score
    2. TrueSkill
    3. Adequacy and fluency judgement
    4. Relative ranking
    5. Constituent ranking
    6. Yes or no constituent judgement
    7. Direct assessment

Events

WMT22 Seventh Conference on Machine Translation EMNLP 2022
WMT21 Sixth Conference on Machine Translation EMNLP 2021
WMT20 Fifth Conference on Machine Translation EMNLP 2020
WMT19 Fourth Conference on Machine Translation ACL 2019
WMT18 Third Conference on Machine Translation EMNLP 2018
WMT17 Second Conference on Machine Translation EMNLP 2017
WMT16 First Conference on Machine Translation ACL 2016
WMT15 Workshop on Statistical Machine Translation EMNLP 2015
WMT14 Workshop on Statistical Machine Translation ACL 2014
WMT13 Workshop on Statistical Machine Translation ACL 2013
WMT12 Workshop on Statistical Machine Translation NAACL 2012
WMT11 Workshop on Statistical Machine Translation EMNLP 2011
WMT10 Workshop on Statistical Machine Translation ACL 2010
WMT09 Workshop on Statistical Machine Translation EACL 2009
WMT08 Workshop on Statistical Machine Translation ACL 2008
WMT07 Workshop on Statistical Machine Translation ACL 2007
WMT06 Workshop on Statistical Machine Translation NAACL 2006

Shared tasks

WMT includes competitions on different aspects of machine translation. These competitions are known as shared tasks.

Typically, the task organisers provide datasets and instructions. Teams submit the output of their systems. The submissions are ranked with human evaluation. The results of the competition are ready before the conference takes place.

During the main conference, researchers present the results of the shared tasks and winners are announced.

WMT started in 2006 with a translation task. In the following years, WMT added tasks covering all aspects of machine translation: corpus preparation, training, and evaluation.

The main task is the General machine translation task. Until 2022, it was known as the News task because traditionally the content to be translated was news articles.

Recurrent tasks

Translation tasks

  • General machine translation task (formerly the News task)
  • Biomedical translation task
  • Multimodal translation task
  • Unsupervised and very low resource translation task
  • Lifelong learning in machine translation task
  • Chat translation task
  • Machine translation using terminologies task
  • Sign language translation task
  • Robustness translation task
  • Triangular machine translation task
  • Large-scale multilingual machine translation task

Evaluation tasks

  • Metrics task
  • Quality estimation task

Other tasks

  • Automatic post-editing task

Discontinued tasks

  • Medical text translation task
  • Pronoun translation task
  • Bilingual document alignment task
  • Similar language translation task
  • Multilingual low-resource translation task for Indo-European languages
  • Tuning task
  • Parallel corpus filtering task
  • Task on training of neural machine translation
  • Task on bandit learning for machine translation

The published results from the shared tasks and the datasets released for WMT are standard benchmarks across machine translation research.

Organisers

Organisers are the people responsible for the content of the main event and for the content, guidelines, datasets and results of each shared task.

Some people have been organisers over many years:

  • Philipp Koehn
  • Barry Haddow
  • Loïc Barrault
  • Ondřej Bojar
  • Lucia Specia
  • Marco Turchi
  • Matt Post
  • Rajen Chatterjee
  • Christof Monz
  • Matteo Negri
  • Matthias Huck
  • Christian Federmann
  • Yvette Graham
  • Mariana Neves
  • Tom Kocmi

Evaluation

Average score and average z-score

The average score for a system is the mean of the raw human assessment scores for its translations.

For the average z-score, each score is first standardised according to its human assessor’s overall mean and standard deviation, and the standardised scores are then averaged per system. This corrects for assessors who score systematically higher or lower than others.

Average score and average z-score have been the main metrics in the results for the translation shared task since WMT17.
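
As a minimal sketch (the judgement data and system names here are illustrative assumptions, not WMT's official scoring code), both metrics can be computed like this in Python:

    # Sketch of average score and average z-score computation.
    # Raw human judgements as (annotator, system, score) triples.
    from collections import defaultdict
    from statistics import mean, stdev

    judgements = [
        ("a1", "sysA", 78), ("a1", "sysB", 64), ("a1", "sysA", 90),
        ("a2", "sysA", 55), ("a2", "sysB", 47), ("a2", "sysB", 60),
    ]

    # Each annotator's overall mean and standard deviation.
    by_annotator = defaultdict(list)
    for annotator, _, score in judgements:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), stdev(s)) for a, s in by_annotator.items()}

    # Standardise each score by its annotator's statistics,
    # then average raw and standardised scores per system.
    raw, z = defaultdict(list), defaultdict(list)
    for annotator, system, score in judgements:
        mu, sigma = stats[annotator]
        raw[system].append(score)
        z[system].append((score - mu) / sigma)

    for system in raw:
        print(system, mean(raw[system]), mean(z[system]))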

TrueSkill

TrueSkill is a gaming rating system. Microsoft Research originally developed it for the Xbox Live gaming community. For WMT, TrueSkill was adapted to machine translation evaluation.

For WMT14, WMT15 and WMT16, TrueSkill was used as the human evaluation ranking for all translation shared tasks.
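
As a minimal sketch (using the open-source trueskill Python package with made-up preference data, not WMT's actual adaptation), each human pairwise preference between two systems can be treated as a match outcome:

    # Sketch: rating MT systems from pairwise human preferences.
    # Requires the open-source trueskill package (pip install trueskill).
    import trueskill

    ratings = {system: trueskill.Rating() for system in ("sysA", "sysB", "sysC")}

    # Each pair means the first system was judged better than the second.
    comparisons = [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"),
                   ("sysA", "sysB"), ("sysC", "sysB")]

    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(
            ratings[winner], ratings[loser])

    # Rank systems by a conservative skill estimate (mu - 3 * sigma).
    for system, r in sorted(ratings.items(),
                            key=lambda kv: kv[1].mu - 3 * kv[1].sigma,
                            reverse=True):
        print(system, round(r.mu, 2), round(r.sigma, 2))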

Adequacy and fluency judgement

In adequacy and fluency judgement, for each input, humans rate the output from each system on a five-point scale for both adequacy (how well the output preserves the meaning of the input) and fluency (how fluent the output reads).

Adequacy and fluency judgement was the official ranking for the translation shared task from WMT06 to WMT07.

Relative ranking

In relative ranking, for each input, humans rank the outputs from all systems. There is no absolute score or label, so there is no measure of absolute quality.

The sequence-level rankings are used to calculate system-level rankings, for example with TrueSkill.
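
As a minimal sketch (system names and ranks are illustrative assumptions), a single relative ranking of five outputs expands into the pairwise comparisons that a rating method such as TrueSkill consumes:

    # Sketch: expanding one 5-way relative ranking into pairwise comparisons.
    from itertools import combinations

    # Rank 1 is best; tied ranks would simply produce no comparison.
    ranking = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 5, "sysE": 4}

    pairwise = [(a, b) if ranking[a] < ranking[b] else (b, a)
                for a, b in combinations(ranking, 2)
                if ranking[a] != ranking[b]]

    print(pairwise)  # ten (winner, loser) pairs from a single ranking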

Relative ranking was the official ranking for the translation shared task from WMT07 to WMT16.

Constituent ranking

In constituent ranking, for each input, humans rank the translations of an automatically selected syntactic constituent instead of the complete sentence. The constituent score measures how often a system was judged to be better than any other system.

Constituent ranking was the official ranking for the translation shared task from WMT07 to WMT08.

Yes or no constituent judgement

In yes or no constituent judgement, for each input, humans judge whether the translations of an automatically selected syntactic constituent are acceptable. The acceptability score is the percentage of a system’s constituent translations judged to be acceptable.

Yes or no constituent judgement was added as an official ranking for WMT08.

Direct assessment

In direct assessment, for each input, humans rate the output from each system with an absolute score or label. The sequence-level ratings can then be used to calculate a system-level ranking.

Direct assessment was first added as an investigatory ranking for WMT16. Direct assessment is the official ranking for the translation shared task since WMT17.

There are different types of direct assessment.

  • Monolingual: Human raters see the system output only.
  • Bilingual: Human raters see the system input and output.
  • Reference-based: Human raters see the system output and a reference output.
