CoCo4MT 2023
Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Location
- Macau Special Administrative Region, CN
Important Dates
Call for papers released | 18 May |
Shared task release of train, development and test data | 19 May |
Shared task release of baselines | 25 May |
Second call for papers | 05 June |
Third and final call for papers | 20 June |
Paper submissions due | 16 July |
Shared task deadline to submit results | 16 July |
Notification of acceptance | 20 July |
Shared task system description papers due | 20 July |
Camera-ready due | 31 July |
CoCo4MT workshop | 04 September |
Keynote speakers
- Manuel Mager, Applied Scientist at AWS AI Labs
- Jack Halpern, CEO at The CJK Dictionary Institute
- Marta Costa-jussà, Research Scientist at Meta AI
Panel
- Silvio Amir, Northeastern University
Schedule
9:00 | Opening remarks |
9:15 | Panel: Responsible Low-Resource MT (Silvio Amir, Northeastern University; Manuel Mager, AWS AI Labs) |
10:00 | ☕️ |
10:30 | Invited talk: Morphological Segmentation of Polysynthetic Languages (Manuel Mager, AWS AI Labs) |
11:00 | Shared task: Introduction and Findings (Anaya Ganesh, University of Colorado Boulder) |
11:15 | Shared task: Williams College's Submission for the Coco4MT 2023 Shared Task (Alex Root, Mark Hopkins) |
11:35 | Shared task: The AST Submission for the CoCo4MT 2023 Shared Task on Corpus Construction for Low-Resource Machine Translation (Steinþór Steingrímsson) |
12:00 | 🍴 |
14:00 | Invited talk: Introducing Large-Scale Synthetic Corpora (Jack Halpern, The CJK Dictionary Institute) |
15:00 | Paper 2: Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation (Steinþór Steingrímsson, Pintu Lohar, Hrafn Loftsson, Andy Way) |
15:30 | ☕️ |
16:00 | Paper 3: Development of Urdu-English Religious Domain Parallel Corpus (Noor e Hira, Sadaf Abdul Rauf) |
17:00 | Invited talk: Beyond Semantic Evaluation in SeamlessM4T - Massively Multilingual & Multimodal Machine Translation (Marta Costa-jussà, Meta AI) |
17:45 | Closing remarks |
Call for papers
CoCo4MT is a workshop centered on research into corpus creation, cleansing, and augmentation techniques specifically for machine translation. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance. This should encourage the creation of new datasets for multiple languages and, in turn, make the workshop a general venue to consult for corpus needs in the future.
Topics include, but are not limited to:
- Difficulties with using existing corpora (for example, political considerations or domain limitations), and their effects on final machine translation systems
- Strategies for collecting new machine translation datasets (for example, via crowdsourcing)
- Data augmentation techniques
- Data cleansing and denoising techniques
- Quality control strategies for machine translation data
- Exploration of datasets for pretraining or auxiliary tasks for training machine translation systems