Facebook AI has introduced M2M-100, the first multilingual translation model that translates directly between any pair of 100 languages without the extra step of translating the original text into English. To train the algorithm, the researchers automatically mined 7.5 billion parallel sentences covering those 100 languages, N+1 reports.
In many cases, machine translation from one language to another goes through a mandatory intermediate stage: the source text is first translated into English, and that English text is then translated into the target language. This pivot greatly simplifies the task, especially for statistical translation built on parallel corpora: far more text exists in English than in any other language, so the likelihood of finding an English translation that can be used as training material is much higher.
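To make the contrast concrete, here is a minimal sketch of the two approaches. The `translate` function is a hypothetical single-pair translator used only for illustration, not part of any real library:

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical single-pair translator (placeholder for illustration)."""
    raise NotImplementedError

def pivot_translate(text: str, src: str, tgt: str) -> str:
    # Traditional approach: two hops through English. Errors made in the
    # first hop are carried into the second and can compound.
    english = translate(text, src, "en")
    return translate(english, "en", tgt)

def direct_translate(text: str, src: str, tgt: str) -> str:
    # M2M-100's approach: a single hop directly between the two languages.
    return translate(text, src, tgt)
```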
Neural machine translation has partially removed the need to pivot through English. Until now, however, no multilingual system has dispensed with that extra step entirely.
To teach the system to translate between languages without going through English, the Facebook developers assembled a corpus of parallel sentences. They relied on existing web-mining resources, including CCAligned (a cross-lingual dataset built from Common Crawl), introduced last year. They focused on 100 languages (slightly fewer than Google Translate, which supports 108) and divided them into 14 groups based on language family, the cultural characteristics of their speakers, and the countries where they are spoken.
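The grouping matters because mining all 100 × 99 language pairs at full depth would be prohibitively expensive, so pairs are mined most densely within a group. A rough sketch of the pair enumeration, with purely illustrative group contents:

```python
from itertools import combinations

# Illustrative fragments of two of the 14 groupings (not the actual lists
# used by the developers).
groups = {
    "slavic": ["ru", "uk", "pl", "cs"],
    "romance": ["fr", "es", "it", "pt", "ro"],
}

# Within each group, every unordered pair of languages becomes a mining target.
within_group_pairs = [
    pair for langs in groups.values() for pair in combinations(langs, 2)
]
print(within_group_pairs[:3])  # e.g. [('ru', 'uk'), ('ru', 'pl'), ('ru', 'cs')]
```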
All possible translation pairs among the 100 languages were then ranked by how frequently they are used, and the most popular pairs were given more weight in the resulting parallel corpus. In total, the developers collected 7.5 billion sentence pairs. To identify the language of each sentence, they used fastText, a text-classification library developed at Facebook. They also augmented the data with automatically translated sentences (back-translation), a step that is necessary for language pairs with very little parallel data.
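fastText ships a pretrained language-identification model (`lid.176.bin`, covering 176 languages) that can be used roughly as follows; the input sentence here is just an example:

```python
import fasttext

# Pretrained language-ID model distributed on the fastText website.
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Ceci est une phrase en français.")
print(labels[0], probs[0])  # e.g. __label__fr with a confidence near 1.0
```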
The collected data was used to train a model built on XLM-R, a cross-lingual model that Facebook introduced last year; the largest version of the network has 12 billion parameters.
NIX Solutions notes that the translation quality of M2M-100 exceeds that of systems that pivot through English: it scored 10 BLEU points higher than the other systems tested. (BLEU is a standard metric for assessing machine translation quality; it is defined as a coefficient from 0 to 1 but is conventionally reported on a 0–100 scale, which is the scale the developers use.)
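For reference, BLEU on the conventional 0–100 scale can be computed with the sacrebleu library; the sentences below are made up for illustration:

```python
import sacrebleu

hypotheses = ["the cat is sitting on the mat"]
references = [["the cat sits on the mat"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # score on the 0-100 scale
```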
So far, Facebook has no plans to deploy M2M-100 in its services: the project is primarily a research effort. The researchers have also released the trained model and the training dataset.
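The released checkpoints can be tried via the Hugging Face transformers library. The sketch below uses the smaller publicly available 418M-parameter checkpoint; treat the exact model identifier as an assumption about the current state of the release:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Smaller public checkpoint of M2M-100 (the 12B version is much larger).
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Translate French -> Russian directly, with no English pivot.
tokenizer.src_lang = "fr"
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("ru")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```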