Developers at Facebook have created a neural network capable of transferring the visual style of an inscription to new text from just one example. The work was published on the website of Facebook's research division.
Any inscription (whether a handwritten postcard, a store sign or a brand name on food packaging) consists of two parts: the text itself, that is, the semantic component, and the visual style. Depending on their drawing skills, people can copy a style of writing and produce new inscriptions in that style with varying accuracy. Algorithms for transferring style between images have existed for a long time, but they mostly work with drawings, notes Nplus1. Researchers have also tackled the problem of transferring the style of text, but so far with less success: such algorithms require lengthy training on each particular style.
Praveen Krishnan and his colleagues at Facebook Research have created a text style transfer algorithm that needs just one example of the target style. It consists of several neural networks. At the first stage, the input data is fed to two encoders: one for content (the text) and one for style. The content encoder takes a text string, which is converted to an image of the same text rendered in the Verily Serif Mono font on a white background. The style encoder receives an image of the target style with the inscription highlighted on it.
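The two-encoder input stage can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the glyph rendering, projection weights and feature choices are all placeholder assumptions standing in for real convolutional encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_content(text, height=64, char_width=32):
    """Toy stand-in for rendering `text` in a fixed font (the paper uses
    Verily Serif Mono) on a white background: one dark block per character."""
    img = np.ones((height, char_width * len(text)))  # white canvas
    for i, ch in enumerate(text):
        # hypothetical "glyph": band intensity depends on the character
        img[16:48, i * char_width + 8:(i + 1) * char_width - 8] = (ord(ch) % 32) / 64
    return img

def content_encoder(img, dim=128):
    """Toy content encoder: flatten the rendered image and apply a fixed
    random projection to get a compressed content representation."""
    w = rng.standard_normal((dim, img.size))
    return w @ img.ravel()

def style_encoder(style_crop, dim=128):
    """Toy style encoder: crude global statistics of the cropped style
    example, projected to the same embedding size."""
    feats = np.array([style_crop.mean(), style_crop.std()])
    w = rng.standard_normal((dim, feats.size))
    return w @ feats

content_img = render_content("hello")
style_crop = rng.random((64, 160))   # stand-in for a cropped word in the target style
c = content_encoder(content_img)
s = style_encoder(style_crop)
print(c.shape, s.shape)  # (128,) (128,)
```

Both branches end in fixed-size vectors, which is what lets a single generator consume an arbitrary text string together with a single style example.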
Both compressed representations from the encoders are then fed to the generator network. It is based on NVIDIA's StyleGAN2 architecture, modified by the developers to work better with text. Unlike the original StyleGAN2, the new generator works as a conditional neural network with two conditions formed from the encoder outputs. Another modification is that the data from the content encoder is fed directly to the first layer of the generator, while the data from the style encoder is injected in a different way.
To better convey all the distinctive features of the style, the developers placed an additional network between the style encoder and the generator; it encodes various aspects of the style and passes them to separate layers of the generator. As a result, the algorithm conveys both low-level and high-level features of the visual style of the original text in the generated image with the new text.
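The conditioning scheme described above can be sketched in a few lines: the content code enters the first layer directly, while a mapping network turns the style code into one vector per layer, which modulates that layer's features. This is a heavily simplified assumption-laden sketch (linear layers instead of StyleGAN2's modulated convolutions), meant only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(1)

def mapping_network(style_code, n_layers, dim):
    """Toy mapping network: a separate affine map per generator layer, so
    each layer receives its own view of the style (low- vs high-level aspects)."""
    return [rng.standard_normal((dim, style_code.size)) @ style_code
            for _ in range(n_layers)]

def modulated_layer(x, style_vec):
    """Crude stand-in for StyleGAN2 weight modulation: scale features by a
    style-derived gain, then apply a fixed linear map."""
    gain = 1.0 + np.tanh(style_vec[: x.size])
    w = rng.standard_normal((x.size, x.size)) / np.sqrt(x.size)
    return w @ (x * gain)

def generator(content_code, style_code, n_layers=4):
    styles = mapping_network(style_code, n_layers, dim=content_code.size)
    x = content_code            # content conditions the first layer directly
    for s in styles:            # style is injected at every layer
        x = modulated_layer(x, s)
    return x

out = generator(rng.standard_normal(64), rng.standard_normal(32))
print(out.shape)  # (64,)
```

Feeding the style into every layer, rather than only the input, is what lets early layers shape coarse properties (slant, stroke width) and later layers shape fine texture.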
Since there are virtually countless inscription styles, the developers took a self-supervised approach with no labeled data. To do this, they used several loss functions during training that control the transfer of both style and content. The developers also created their own handwriting dataset, Imgur5K, notes NIXsolutions: they selected 5,000 publicly available images of English handwritten text from Imgur and extracted 135,000 words from them. The authors also used the existing datasets ICDAR 2013, ICDAR 2015, TextVQA and the IAM Handwriting Database, and in addition created synthetic images with text superimposed on backgrounds.
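The idea of combining loss terms that separately police content and style can be illustrated with a toy example. The loss shapes and weights below are hypothetical stand-ins: a real system would run a trained text recognizer and a perceptual feature extractor on the generated image, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

def content_loss(recognizer_logits, target_ids):
    """Cross-entropy from a (hypothetical) text recognizer applied to the
    generated image: penalizes outputs that no longer read as the input text."""
    z = recognizer_logits - recognizer_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.log(probs[np.arange(len(target_ids)), target_ids]).mean())

def style_loss(gen_feats, ref_feats):
    """Perceptual-style distance between features of the generated image and
    features of the single style example."""
    return float(np.mean((gen_feats - ref_feats) ** 2))

# Toy tensors standing in for recognizer outputs and image features.
logits = rng.standard_normal((5, 26))    # 5 characters, 26 letter classes
targets = np.array([7, 4, 11, 11, 14])   # "hello" as class ids
gen_f, ref_f = rng.random(256), rng.random(256)

# Hypothetical weighting; the training objective balances several such terms.
total = 1.0 * content_loss(logits, targets) + 500.0 * style_loss(gen_f, ref_f)
```

Because both terms are computed from the generated image itself (against the input text and the style example), no human-labeled style annotations are needed, which is what makes the training self-supervised.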
As a result, the developers created the first algorithm capable of transferring the style of handwritten and printed text from a single example. Along with its successes, the authors also noted the algorithm's limitations: for example, it does not always cope well with short text (three letters or fewer) or with italicized text. As a potential application, Facebook suggests using a similar technology in future augmented-reality glasses for realistic translation of text on objects in front of the user.