Crawling the web for machine translation training data
Bilingual and multilingual websites are a valuable source of text for parallel data extraction, and web crawling can also collect monolingual text in the target language. A typical crawling pipeline proceeds as follows:
- Visit the first URL, either specified manually by the user or selected from a pre-defined list of seed URLs
- Extract URLs from the visited web page and add them to a queue of URLs to be visited
- Extract text data from the visited web page and store it in a database for later use
- Filter the extracted URLs to decide which web pages to visit next, based on pre-defined criteria (for example, staying within the same domain)
- Repeat the steps of URL extraction, text extraction, and URL filtering for each URL in the queue
- Preprocess and filter the extracted text data to remove noise and normalise formatting
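The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the in-memory `FAKE_WEB` dictionary stands in for HTTP fetching and HTML parsing, and the `allow` filter is a placeholder for real URL-filtering criteria.

```python
from collections import deque
from typing import Callable

# Toy in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP and parse the HTML instead.
FAKE_WEB = {
    "https://example.com/en": ("Hello world", ["https://example.com/fr"]),
    "https://example.com/fr": ("Bonjour le monde", ["https://example.com/en"]),
    "https://example.com/admin": ("login page", []),
}

def crawl(seed: str, allow: Callable[[str], bool], max_pages: int = 100):
    """Breadth-first crawl: visit, extract text and links, filter, repeat."""
    queue = deque([seed])   # URLs waiting to be visited
    seen = {seed}           # avoid visiting the same URL twice
    corpus = {}             # URL -> extracted text (the "database")
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        page = FAKE_WEB.get(url)
        if page is None:
            continue
        text, links = page
        corpus[url] = text  # store extracted text for later use
        for link in links:  # enqueue new URLs that pass the filter
            if link not in seen and allow(link):
                seen.add(link)
                queue.append(link)
    return corpus

corpus = crawl("https://example.com/en", allow=lambda u: "admin" not in u)
```

The `seen` set prevents revisiting pages, and `max_pages` bounds the crawl, both standard safeguards against crawling the same site indefinitely.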
For parallel data collection, the crawler also applies language identification, so that pages in different languages from the same website can be detected and later aligned.
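The language-identification step might look like the following toy sketch. The stopword lists and scoring are illustrative assumptions only; real pipelines use trained classifiers such as fastText or CLD3 rather than this kind of word-overlap heuristic.

```python
# Toy language identifier based on stopword overlap. The word lists
# below are tiny illustrative samples, not real models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "de", "est", "les"},
    "de": {"der", "die", "und", "das", "ist", "ein"},
}

def identify_language(text: str) -> str:
    """Return the language whose stopwords overlap the text most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def group_by_language(pages: dict) -> dict:
    """Group crawled pages (URL -> text) by detected language,
    so that pages in different languages can be paired for alignment."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(identify_language(text), []).append(url)
    return groups
```

Grouping a site's pages by language is the precursor to document alignment, where matching pages (e.g. an English page and its French translation) are paired up.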
Web-crawled content is often noisy: it contains irrelevant information, inconsistent formatting, and errors. The extracted and preprocessed data should therefore be evaluated before it is used for training.
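As a concrete illustration of such filtering, a rule-based quality check might reject segments that are too short, too long, or dominated by digits and symbols. The specific thresholds below are assumptions and would be tuned on real data.

```python
# Illustrative quality filter for crawled text segments.
# Thresholds are assumptions, not recommended values.
def is_clean(text: str, min_words: int = 3, max_words: int = 200,
             min_alpha_ratio: float = 0.5) -> bool:
    """Accept a segment only if it has a plausible length and is
    mostly alphabetic (rejects markup, prices, boilerplate, etc.)."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    alpha = sum(c.isalpha() for c in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

lines = ["Buy now!!! $$$ 12345", "This is a normal sentence.", "ok"]
clean = [t for t in lines if is_clean(t)]
```

Here the symbol-heavy advertisement and the too-short fragment are dropped, while the ordinary sentence survives; production pipelines combine many such rules with model-based scoring.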