CoCo4MT 2023

Workshop on Corpus Generation and Corpus Augmentation for Machine Translation


  • Macau Special Administrative Region, CN


Important Dates

Call for papers released 18 May
Shared task release of train, development and test data 19 May
Shared task release of baselines 25 May
Second call for papers 05 June
Third and final call for papers 20 June
Paper submissions due 16 July
Shared task deadline to submit results 16 July
Notification of acceptance 20 July
Shared task system description papers due 20 July
Camera-ready due 31 July
CoCo4MT workshop 04 September

Keynote speakers

  • Manuel Mager, Applied Scientist at AWS AI Labs
  • Jack Halpern, CEO at The CJK Dictionary Institute
  • Marta Costa-jussà, Research Scientist at Meta AI


  • Silvio Amir, Northeastern University


9:00 Opening remarks
9:15 Panel
Responsible Low-Resource MT
Silvio Amir, Northeastern University
Manuel Mager, AWS AI Labs
10:00 ☕️
10:30 Invited talk
Morphological Segmentation of Polysynthetic Languages
Manuel Mager, AWS AI Labs
11:00 Shared task
Introduction and Findings
Ananya Ganesh, University of Colorado Boulder
11:15 Shared task
Williams College's Submission for the CoCo4MT 2023 Shared Task
Alex Root, Mark Hopkins
11:35 Shared task
The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation
Steinþór Steingrímsson
12:00 🍴
14:00 Invited talk
Introducing Large-Scale Synthetic Corpora
Jack Halpern, The CJK Dictionary Institute
15:00 Paper 2
Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation
Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, Andy Way
15:30 ☕️
16:00 Paper 3
Development of Urdu-English Religious Domain Parallel Corpus
Noor e Hira, Sadaf Abdul Rauf
17:00 Invited talk
Beyond Semantic Evaluation in SeamlessM4T - Massively Multilingual & Multimodal Machine Translation
Marta Costa-jussà, Meta AI
17:45 Closing remarks

Call for papers

CoCo4MT is a workshop centered on research into corpus creation, cleansing, and augmentation techniques specifically for machine translation. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance. This should encourage the creation of new datasets for multiple languages and, in turn, make the workshop a general venue to consult for corpus needs in the future.

Topics include, but are not limited to:

  • Difficulties with using existing corpora (for example, political considerations or domain limitations), and their effects on final machine translation systems
  • Strategies for collecting new machine translation datasets (for example, via crowdsourcing)
  • Data augmentation techniques
  • Data cleansing and denoising techniques
  • Quality control strategies for machine translation data
  • Exploration of datasets for pretraining or auxiliary tasks for training machine translation systems

Licensed under CC-BY-SA-4.0.