Filtering

Filtering training data for machine translation


Filtering for machine translation is the process of cleaning parallel data for training a machine translation system.

Parallel data can be filtered manually or automatically.

To filter parallel data, translation pairs with obvious noise and risky translations are dropped.

Translations pairs are also dropped if they have other obvious issues:

  • Translations to the wrong language
  • Sentence pairs with mismatched URLs, names, or numbers
  • Sentence pairs with mismatched tags
  • Sentence pairs with mismatched length
  • Translations with broken encoding or invalid characters

Some sentence pairs have issues that are more difficult to filter out.

  • Creative human translations
  • Translations that are structured differently than the original
  • Translations that are equal to the original
  • Translations from a third language
  • Very short input

Preprocessing is often normalisation.

  • Decoding or encoding
  • Invalid characters
  • Leading or trailing whitespace
  • Alternative whitespace characters
  • Alternative punctuation characters

Filtering and preprocessing can hurt translation quality, by removing or changing useful data.

Tools

  • ModelFront
  • Zipporah
  • Bicleaner
  • LASER

Techniques

  • Normalisation
  • Tokenisation
  • Removing duplicate segments
  • Removing non-alphabetical symbols
  • Removing irrelevant languages
  • Spelling out or collapsing acronyms
  • Replacing named entities with placeholders
  • Matching the original and the translated sentences punctuation
  • Fixing typos and spelling mistakes

Want to learn more about Filtering?


Edit this article →

Machine Translate is created and edited by contributors like you!

Learn more about contributing →

Licensed under CC-BY-SA-4.0.

Cite this article →