NIXSolutions: DALL-E 2 Dictionary Details

American researchers have found an unusual property of the generative neural network DALL-E 2, which creates images from text descriptions. The text that appears in its images, which looks like a random string of characters, is probably not random at all and is often associated with specific objects and concepts. For example, for the query “Apoploe vesrreaitais” the model usually generates images of birds. The researchers suggest that DALL-E 2 forms something like its own vocabulary during training. The article, which has not yet passed peer review, is available on the authors’ website. It sparked an active discussion in the machine learning research community, which refuted some of the authors’ claims and confirmed others.


DALL-E 2 is a new and improved version of the DALL-E generative neural network introduced by OpenAI in early 2021. Back then, the researchers presented two related models at once: DALL-E and CLIP. In a sense they work in opposite directions: DALL-E turns a text description given by a person into a realistic image, while CLIP, given an image, picks out the text description that matches it best. In both cases, the models were trained on a huge collection of images and captions and learned the relationship between the visual and textual representations of objects and concepts. In DALL-E 2, presented this spring, the developers changed some implementation details and managed to make the generated images more realistic, says N+1. However, the model still has notable problems, one of which is rendering text inside images. It typically produces either Latin characters in a scrambled order or non-existent characters and symbols.
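To make the image-text matching side concrete, here is a minimal sketch of scoring candidate captions against an image with the publicly released CLIP weights, via the Hugging Face transformers wrapper. The checkpoint name, image file and captions are illustrative assumptions, not code from OpenAI or from the paper discussed here.

```python
# Minimal sketch: scoring candidate captions against an image with CLIP.
# Assumes the Hugging Face `transformers` and `Pillow` packages; the image
# path and captions below are placeholders, not data from the paper.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bird.jpg")  # hypothetical local file
captions = ["a photo of a bird", "a photo of a beetle", "a photo of a whale"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds an image-text similarity score for each caption;
# softmax turns them into a rough "which caption fits best" distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```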

OpenAI traditionally does not release the code or full versions of its models; this time it launched an on-demand demo in which researchers can test the model by giving it a text description and receiving a set of images in return. Giannis Daras and Alexandros Dimakis of the University of Texas at Austin, who gained access to the demo, found that the random-looking text in the images did not seem to be random after all.

They fed the model descriptions of scenes that asked for text to be generated. For example, in response to the prompt “Two whales talking about food, with subtitles.”, the model drew two whales and an illegible set of characters best transcribed into Latin letters as “Wa ch zod ahaakes rea”. The authors found that if this seemingly meaningless phrase was fed back into the model, it generated images of various kinds of seafood.
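The authors ran these queries through OpenAI’s research demo, so the following is only a hedged sketch of how a similar round trip (request an image with subtitles, then feed the transcribed gibberish back as a new prompt) might look against the public OpenAI Images API. The model name, image size and output handling are assumptions; the prompts are the ones quoted above.

```python
# Rough sketch of the round-trip experiment using the public OpenAI Images API.
# The authors used the research demo, not this API; model name, image size,
# and output handling here are assumptions, not their actual workflow.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Step 1: ask for a scene that should contain text ("subtitles").
first = client.images.generate(
    model="dall-e-2",
    prompt="Two whales talking about food, with subtitles.",
    n=4,
    size="512x512",
)
print([img.url for img in first.data])

# Step 2: feed the gibberish text read off the generated image back as a prompt
# and inspect whether the results cluster around a concept (e.g. seafood).
gibberish = "Wa ch zod ahaakes rea"  # transcription reported in the article
second = client.images.generate(
    model="dall-e-2",
    prompt=gibberish,
    n=4,
    size="512x512",
)
print([img.url for img in second.data])
```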

Similarly, they found several other phrases and words that are consistently associated with specific concepts; for example, for the query “Apoploe vesrreaitais” the model most often generates birds. In that case, they discovered the phrase by asking the model to draw two farmers talking about vegetables. They also found that the model sometimes handles combinations of these phrases correctly. They first noticed that for the query “Contarra ccetnxniams luryca tanniounons” DALL-E 2 often generates beetles and other insects, and in response to the phrase “Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons” it can generate images of birds eating beetles.

The researchers concluded they had found a “hidden language” in DALL-E 2 and published a paper, which caused a heated discussion among developers and researchers. Benjamin Hilton showed that some of the authors’ examples often do not hold up and were apparently the result of coincidence or of cherry-picking successful generations. For the phrase “Contarra ccetnxniams luryca tanniounons” he most often received various animals rather than specifically beetles (although beetles did appear frequently in the results). Moreover, adding a style hint to this phrase makes the alleged link between it and the concept of “bugs” disappear entirely: when asked to render “Contarra ccetnxniams luryca tanniounons” in the style of a drawing, the model consistently draws elderly women, and when asked for a 3D render, it produces shells, dinosaurs and other objects.

At the same time, both he and another researcher confirmed that for the phrase “Apoploe vesrreaitais” DALL-E 2 does consistently generate birds. A likely explanation was found by Twitter user BarneyFlames. He noticed that the tokenizer from the CLIP model, which DALL-E 2 uses to turn text into an embedding, splits the phrase “Apoploe vesrreaitais” into the tokens apo, plo, e, ve, sr, re, ait and ais. The first two tokens occur at the beginning of the names of the bird families Apodidae and Ploceidae. He suggested that the model may have gotten much of its information about birds from scientific illustrations, so that when generating images of birds, DALL-E 2 assembles text from the tokens it most often encountered in the captions of bird images during training.
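BarneyFlames’ observation can be checked against the openly available CLIP tokenizer, for example through the Hugging Face transformers wrapper. The sketch below assumes that wrapper and the standard CLIP checkpoint; the exact subword split can differ slightly between tokenizer versions, so the printed tokens are indicative rather than a guaranteed reproduction of his result.

```python
# Sketch: inspecting how CLIP's BPE tokenizer splits the phrase.
# Uses the Hugging Face wrapper around OpenAI's released CLIP vocabulary;
# the exact subword split may vary slightly between tokenizer versions.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

phrase = "Apoploe vesrreaitais"
tokens = tokenizer.tokenize(phrase)
print(tokens)
# Expected to look roughly like:
# ['apo', 'plo', 'e</w>', 've', 'sr', 're', 'ait', 'ais</w>']
# The leading subwords resemble the starts of the bird family names
# Apodidae and Ploceidae, which may be what ties the phrase to birds.
```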

NIXSolutions notes that after criticism from other researchers, the authors posted an amended version of the article, which, among other things, used the term “hidden vocabulary” instead of “hidden language”.