Crawling the web for machine translation training data
Bilingual and multilingual websites are a valuable source of text for parallel data extraction, and web crawling can also collect monolingual text in the target language. A typical crawling pipeline proceeds as follows:
- Visit the first URL, either specified manually by the user or selected from a pre-defined list of seed URLs
- Extract URLs from the visited web page and add them to a queue of URLs to be visited
- Extract text data from the visited web page and store it in a database for later use
- Filter the extracted URLs to decide which web pages to visit next, based on pre-defined criteria (for example, staying within the same domain)
- Repeat the steps of URL extraction, text extraction, and URL filtering for each URL in the queue
- Preprocess and filter the extracted text data to remove noise and normalise formatting
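The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the in-memory `FAKE_WEB` dictionary stands in for HTTP fetching and HTML parsing, and the `allow` filter is a placeholder for real URL-filtering criteria.

```python
from collections import deque
from typing import Callable

# Toy in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP and parse the HTML instead.
FAKE_WEB = {
    "https://example.com/en": ("Hello world", ["https://example.com/fr"]),
    "https://example.com/fr": ("Bonjour le monde", ["https://example.com/en"]),
    "https://example.com/admin": ("login page", []),
}

def crawl(seed: str, allow: Callable[[str], bool], max_pages: int = 100):
    """Breadth-first crawl: visit, extract text and links, filter, repeat."""
    queue = deque([seed])   # URLs waiting to be visited
    seen = {seed}           # avoid visiting the same URL twice
    corpus = {}             # URL -> extracted text (the "database")
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        page = FAKE_WEB.get(url)
        if page is None:
            continue
        text, links = page
        corpus[url] = text  # store extracted text for later use
        for link in links:  # enqueue new URLs that pass the filter
            if link not in seen and allow(link):
                seen.add(link)
                queue.append(link)
    return corpus

corpus = crawl("https://example.com/en", allow=lambda u: "admin" not in u)
```

The `seen` set prevents revisiting pages, and `max_pages` bounds the crawl, both standard safeguards against crawling the same site indefinitely.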
For parallel data collection, the crawler also applies language identification, so that pages in different languages from the same website can be detected and later aligned.
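The language-identification step might look like the following toy sketch. The stopword lists and scoring are illustrative assumptions only; real pipelines use trained classifiers such as fastText or CLD3 rather than this kind of word-overlap heuristic.

```python
# Toy language identifier based on stopword overlap. The word lists
# below are tiny illustrative samples, not real models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "de", "est", "les"},
    "de": {"der", "die", "und", "das", "ist", "ein"},
}

def identify_language(text: str) -> str:
    """Return the language whose stopwords overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def group_by_language(pages: dict) -> dict:
    """Group crawled pages (URL -> text) by detected language,
    so that pages in different languages can be paired for alignment."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(identify_language(text), []).append(url)
    return groups
```

Grouping a site's pages by language is the precursor to document alignment, where matching pages (e.g. an English page and its French translation) are paired up.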
Web-crawled content is often noisy: it contains irrelevant information, inconsistent formatting, and errors. The extracted and preprocessed data should therefore be evaluated before it is used for training.
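As a concrete illustration of such filtering, a rule-based quality check might reject segments that are too short, too long, or dominated by digits and symbols. The specific thresholds below are assumptions and would be tuned on real data.

```python
# Illustrative quality filter for crawled text segments.
# Thresholds are assumptions, not recommended values.
def is_clean(text: str, min_words: int = 3, max_words: int = 200,
             min_alpha_ratio: float = 0.5) -> bool:
    """Accept a segment only if it has a plausible length and is
    mostly alphabetic (rejects markup, prices, boilerplate, etc.)."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    alpha = sum(c.isalpha() for c in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

lines = ["Buy now!!! $$$ 12345", "This is a normal sentence.", "ok"]
clean = [t for t in lines if is_clean(t)]
```

Here the symbol-heavy advertisement and the too-short fragment are dropped, while the ordinary sentence survives; production pipelines combine many such rules with model-based scoring.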