CoCo4MT 2023
Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
The second workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) was co-located with the MT Summit 2023 conference on 4-5 September 2023.
Keynote speakers
- Manuel Mager, Applied Scientist at AWS AI Labs
- Jack Halpern, CEO at The CJK Dictionary Institute
- Marta Costa-jussà, Research Scientist at Meta AI
Panel
- Silvio Amir, Northeastern University
Schedule
09:00 - 09:15 | Opening remarks |
09:15 - 10:00 | Panel: Responsible Low-Resource MT. Silvio Amir, Northeastern University; Manuel Mager, AWS AI Labs |
10:00 - 10:30 | ☕️ |
10:30 - 11:00 | Invited talk: Morphological Segmentation of Polysynthetic Languages. Manuel Mager, AWS AI Labs |
11:00 - 11:15 | Shared task: Introduction and Finding. Anaya Ganesh, University of Colorado Boulder |
11:15 - 11:35 | Shared task: Williams College’s Submission for the Coco4MT 2023 Shared Task. Alex Root, Mark Hopkins |
11:35 - 12:00 | Shared task: The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation. Steinþór Steingrímsson |
12:00 - 14:00 | 🍴 |
14:00 - 15:00 | Invited talk: Introducing Large-Scale Synthetic Corpora. Jack Halpern, The CJK Dictionary Institute |
15:00 - 15:30 | Paper 2: Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation. Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, Andy Way |
15:30 - 16:00 | ☕️ |
16:00 - 16:30 | Paper 3: Development of Urdu-English Religious Domain Parallel Corpus. Noor e Hira, Sadaf Abdul Rauf |
17:00 - 17:45 | Invited talk: Beyond Semantic Evaluation in SeamlessM4T - Massively Multilingual & Multimodal Machine Translation. Marta Costa-jussà, Meta AI |
17:45 - 18:00 | Closing remarks |
Call for papers
CoCo4MT is a workshop centered on research into corpus creation, cleansing, and augmentation techniques specifically for machine translation. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance, thus encouraging the creation of new datasets for multiple languages. This will, in turn, make the workshop a general venue to consult for corpus needs in the future.
Topics include, but are not limited to:
- Difficulties with using existing corpora (for example, political considerations or domain limitations), and their effects on final machine translation systems
- Strategies for collecting new machine translation datasets (for example, via crowdsourcing)
- Data augmentation techniques
- Data cleansing and denoising techniques
- Quality control strategies for machine translation data
- Exploration of datasets for pretraining or auxiliary tasks for training machine translation systems
sites.google.com/view/coco4mt (CoCo4MT)
Important dates
18 May 2023 | Call for papers released |
19 May 2023 | Shared task release of train, development and test data |
25 May 2023 | Shared task release of baselines |
05 June 2023 | Second call for papers |
20 June 2023 | Third and final call for papers |
16 July 2023 | Paper submissions due |
16 July 2023 | Shared task deadline to submit results |
20 July 2023 | Notification of acceptance |
20 July 2023 | Shared task system description papers due |
31 July 2023 | Camera-ready due |
4-5 September 2023 | CoCo4MT workshop |