NIX Solutions: Neural Network Translates Text into Sign Language

British developers have created a neural network algorithm that turns text into a video of a person signing the same text in sign language, reports N+1. During training, the algorithm checks the quality of its own output after synthesizing the video, which allowed it to achieve much better results than previous methods, particularly in how accurately the hands are rendered. An article about the algorithm was published on arXiv.org.

People with complete or partial hearing loss communicate with each other in sign language, but the vast majority of events and content are produced for hearing people. In electronic media this problem is solved with subtitles, but at live events or during live broadcasts an interpreter has to translate spoken language into sign language. Researchers have been trying to automate this process for several years. Early attempts relied on animated 3D avatars, but this approach gave poor results. More recently, researchers have switched to neural network synthesis, yet until now they were unable to accurately render important details, above all the hands.

Developers from the University of Surrey, led by Richard Bowden, have created an algorithm that produces better sign language translation, with a separately trained component for high-quality hand synthesis, since hands are extremely important in sign languages. NIX Solutions explains that the algorithm first accepts speech as text. The text passes through an encoder and decoder and is transformed into a skeletal body model illustrating the signer's gestures. The resulting sequence of poses is then encoded into a vector and combined with a vector obtained from a style image, a photograph of the person to be animated. Finally, the sequence of poses is fed to a U-Net convolutional neural network, which turns the poses into realistic video.
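The pipeline can be summarised with a brief PyTorch sketch. The module sizes, the toy transformer-based text-to-pose stage, and the simplified U-Net-style renderer below are illustrative assumptions rather than the authors' implementation:

```python
# Minimal sketch of the described pipeline: text -> pose sequence -> (pose + style) -> frame.
# All dimensions and module names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class TextToPose(nn.Module):
    """Encoder-decoder that maps a token sequence to a sequence of skeletal poses."""
    def __init__(self, vocab_size=1000, d_model=256, n_joints=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.to_pose = nn.Linear(d_model, n_joints * 2)   # (x, y) per joint

    def forward(self, tokens, n_frames=16):
        src = self.embed(tokens)                                           # (B, T_text, d)
        tgt = torch.zeros(tokens.size(0), n_frames, src.size(-1),
                          device=tokens.device)                            # frame queries
        hidden = self.transformer(src, tgt)                                # (B, n_frames, d)
        return self.to_pose(hidden)                                        # pose vectors

class FrameGenerator(nn.Module):
    """Combines a pose vector with a style-image embedding and renders a frame
    through a small U-Net-like encoder/decoder with one skip connection."""
    def __init__(self, pose_dim=100, img_ch=3, base=32):
        super().__init__()
        self.style_enc = nn.Sequential(                    # style photo -> feature map
            nn.Conv2d(img_ch, base, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(base, base, 4, 2, 1), nn.ReLU())
        self.pose_proj = nn.Linear(pose_dim, base)         # pose vector -> channel bias
        self.down = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        self.out = nn.ConvTranspose2d(base * 2, img_ch, 4, 4, 0)  # skip-concat -> frame

    def forward(self, pose, style_img):
        feats = self.style_enc(style_img)                          # (B, base, H/4, W/4)
        x = feats + self.pose_proj(pose)[:, :, None, None]         # inject pose
        skip = x
        x = self.up(self.down(x))
        return torch.tanh(self.out(torch.cat([x, skip], dim=1)))  # (B, 3, H, W)

if __name__ == "__main__":
    text = torch.randint(0, 1000, (1, 12))          # toy token ids
    style = torch.rand(1, 3, 64, 64)                # photo of the person to animate
    poses = TextToPose()(text, n_frames=8)          # (1, 8, 100)
    renderer = FrameGenerator()
    frames = [renderer(poses[:, t], style) for t in range(8)]
    print(frames[0].shape)                          # torch.Size([1, 3, 64, 64])
```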

During training, the discriminator evaluated both the entire generated figure as a whole and, separately, the hands. The neural network's output was annotated with the OpenPose algorithm, which overlays a skeletal body model on a person in an image. By comparing the body model extracted from the generated frames with the manually annotated reference, the algorithm learned during training to synthesize high-quality frames in which details are clearly distinguishable.
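The training signal can be illustrated with a short, hedged sketch: one adversarial term for the full frame, one for a hand crop, and a keypoint-consistency term standing in for the OpenPose comparison. The crop coordinates, the equal loss weighting, and the pose_estimator placeholder are assumptions, not the paper's exact losses:

```python
# Hedged sketch of a dual-discriminator GAN loss with a keypoint-consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_discriminator(in_ch=3):
    """Small PatchGAN-style critic; the same architecture serves frames and hand crops."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, 1, 1))

def generator_loss(fake_frame, ref_keypoints, hand_box, d_frame, d_hand, pose_estimator):
    """Adversarial terms on the whole frame and the hand region, plus keypoint consistency."""
    x0, y0, x1, y1 = hand_box                      # placeholder hand bounding box
    hand_crop = fake_frame[:, :, y0:y1, x0:x1]
    frame_logits = d_frame(fake_frame)
    hand_logits = d_hand(hand_crop)
    adv_frame = F.binary_cross_entropy_with_logits(frame_logits,
                                                   torch.ones_like(frame_logits))
    adv_hand = F.binary_cross_entropy_with_logits(hand_logits,
                                                  torch.ones_like(hand_logits))
    # pose_estimator stands in for OpenPose re-annotating the generated frame.
    pose = F.l1_loss(pose_estimator(fake_frame), ref_keypoints)
    return adv_frame + adv_hand + pose
```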

The authors trained the algorithm on the PHOENIX14T dataset, which consists of 386 annotated recordings of sign language interpreters working on a German TV channel. They tested the algorithm using several metrics, including the structural similarity index (SSIM), which measures how similar two images are. The index was calculated between the original image from the dataset and the synthetic one created by the neural network, both for the entire upper body and separately for the hands. Comparison with other algorithms on the same data showed that the new algorithm outperforms its analogues on all four metrics used.
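For readers who want to reproduce this kind of comparison, the structural similarity index can be computed with scikit-image. The grayscale frames and the hand bounding box below are illustrative placeholders, not the paper's evaluation setup:

```python
# Minimal sketch of an SSIM comparison for the full frame and a hand crop.
import numpy as np
from skimage.metrics import structural_similarity

def body_and_hand_ssim(reference, generated, hand_box):
    """Return SSIM for the whole grayscale frame and for the hand region."""
    full = structural_similarity(reference, generated, data_range=1.0)
    x0, y0, x1, y1 = hand_box
    hand = structural_similarity(reference[y0:y1, x0:x1],
                                 generated[y0:y1, x0:x1], data_range=1.0)
    return full, hand

# Toy usage with random grayscale frames in [0, 1].
ref = np.random.rand(128, 128)
gen = np.clip(ref + 0.05 * np.random.randn(128, 128), 0.0, 1.0)
print(body_and_hand_ssim(ref, gen, hand_box=(40, 60, 88, 108)))
```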

The developers also ran a study with 46 volunteers (28 percent of whom were sign language users), asking them to rate the realism of videos generated by different algorithms. In this comparison, the volunteers chose the new algorithm's output in the vast majority of cases.