Canadian researchers have developed a method for finding bugs in games based on video recordings of gameplay. They propose using the CLIP neural network, which can search for frames matching a description in plain language, for example, "A car flying in the air." The development will be presented at the MSR 2022 conference; a preprint of the article is available on arXiv.org.
CLIP is a neural network introduced by OpenAI in January 2021. It consists of two encoders that turn the text or image given to them into an embedding, a compact vector representation of the data. The encoders work with data of different types, but the embeddings they produce have the same dimension and, more importantly, live in the same feature space, says N+1. For example, the vector for an image of a dog lies very close in this space to the vector for the word "dog", while the vectors for a car or a tree lie far from both. Since the network was trained on hundreds of millions of image–text pairs from the Internet, it learned to reliably connect the visual and textual representations of many objects and phenomena.
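The key property of this shared feature space is that closeness between vectors can be measured directly, typically with cosine similarity. The sketch below illustrates the idea with tiny toy vectors standing in for real CLIP embeddings (which have hundreds of dimensions); the specific numbers are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # close to 1.0 for related concepts, lower for unrelated ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings standing in for CLIP's output.
dog_image = [0.9, 0.1, 0.0]   # hypothetical image-encoder output
dog_text  = [0.8, 0.2, 0.1]   # hypothetical text-encoder output for "dog"
car_text  = [0.1, 0.9, 0.3]   # hypothetical text-encoder output for "car"

print(cosine_similarity(dog_image, dog_text))  # high: same concept
print(cosine_similarity(dog_image, car_text))  # low: different concepts
```

In real CLIP the embeddings are L2-normalized, so the cosine similarity reduces to a plain dot product.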
The first task its authors proposed for CLIP is image classification without retraining the model on a specific dataset; instead, CLIP is simply given the set of classes of interest. Mohammad Reza Taesiri, along with colleagues from the University of Alberta, proposed using CLIP to find bugs in the large number of videos published by players and streamers on the Internet.
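This zero-shot classification scheme amounts to encoding each candidate class name as text and picking the class whose embedding is closest to the image embedding. A minimal sketch, assuming already-normalized toy embeddings in place of CLIP's real encoders:

```python
def dot(a, b):
    # For L2-normalized embeddings, the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def classify(image_emb, class_embs):
    # Zero-shot classification: the predicted class is the one whose
    # text embedding is most similar to the image embedding.
    return max(class_embs, key=lambda name: dot(image_emb, class_embs[name]))

# Toy normalized embeddings standing in for real CLIP output.
image = [0.95, 0.31, 0.0]  # hypothetically, an image of a horse
classes = {
    "a photo of a horse": [0.9, 0.43, 0.0],
    "a photo of a car":   [0.0, 0.6, 0.8],
}
print(classify(image, classes))  # -> a photo of a horse
```

No retraining is involved: changing the task means changing only the list of class descriptions.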
To find potential bugs, the model is given a video recording of gameplay and a textual description of what to look for, for example, a horse hanging in the air, which in most games can only happen by mistake. Because CLIP works with images, the algorithm splits the video into individual frames. Each frame is encoded into an embedding, which is compared to the embedding of the text description. If the vectors are close, the frame likely contains what the user described in the query. After analyzing the video, the algorithm produces two metrics: the maximum similarity of any frame to the query and the number of frames matching the query. The authors note that the first metric is more sensitive and more often finds the right moments with bugs.
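The per-video scoring described above can be sketched as follows. The frame embeddings and the threshold value here are illustrative assumptions, not the authors' actual parameters; in practice each frame embedding would come from CLIP's image encoder.

```python
def dot(a, b):
    # Embeddings are assumed L2-normalized, so this is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def score_video(frame_embeddings, query_embedding, threshold=0.3):
    # Similarity of every frame to the text query.
    sims = [dot(f, query_embedding) for f in frame_embeddings]
    max_score = max(sims)                            # best single-frame match
    num_matches = sum(s >= threshold for s in sims)  # frames above threshold
    return max_score, num_matches

# Toy data: a 2-dimensional query embedding and three frame embeddings.
query = [1.0, 0.0]
frames = [[0.1, 0.99], [0.5, 0.87], [0.95, 0.31]]
print(score_video(frames, query))  # -> (0.95, 2)
```

Ranking candidate videos by the first value surfaces clips where at least one frame matches the query strongly, which matches the authors' observation that the maximum score is the more sensitive metric.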
The developers compiled a dataset of 26,954 clips from the "Physics gone wild!" community on Reddit, where users post bugs from various games. For each frame, they precomputed the embeddings with CLIP's image encoder, so that this computation would not have to be repeated at search time. The authors selected eight popular games for evaluation; GTA V led by a wide margin, as videos with unusual car behavior are especially common in that game, notes NIX Solutions.
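Precomputing the image embeddings pays off because they depend only on the frames, not on the query: every new text search then needs just one text encoding and a batch of dot products. A minimal sketch of such an index, where embed_frame is a hypothetical placeholder for CLIP's image encoder:

```python
def embed_frame(frame_id):
    # Placeholder: a real implementation would run CLIP's image encoder
    # on the decoded frame and return its normalized embedding.
    return [float(frame_id % 3 == 0),
            float(frame_id % 3 == 1),
            float(frame_id % 3 == 2)]

def build_index(frame_ids):
    # Computed once per clip and stored on disk; later queries only
    # encode the text and compare it against the cached vectors.
    return {fid: embed_frame(fid) for fid in frame_ids}

index = build_index(range(5))
print(len(index))  # -> 5
```

With roughly 27,000 clips, avoiding repeated image encoding is the dominant saving, since the image encoder is far more expensive than comparing precomputed vectors.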
The developers tested the model on different queries. First they used it to search for specific objects and for objects with a description, and then evaluated it on the main task of finding bugs. They compiled 44 phrases describing probable bugs, for example, "A man stuck in a tree." The test showed an average recall (the proportion of bugs found relative to all bugs in the sample) of 66.35 percent, meaning the algorithm found most of the bugs. Accuracy was highest on clips from Grand Theft Auto V, Just Cause 3 and Far Cry 5.
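The recall figure quoted above is a standard metric: the share of known bug moments the search actually retrieved. A minimal illustration with made-up clip identifiers:

```python
def recall(found_bugs, all_bugs):
    # Recall: fraction of the labeled bugs that the search retrieved.
    return len(set(found_bugs) & set(all_bugs)) / len(set(all_bugs))

# Hypothetical example: 2 of 3 labeled bug clips were retrieved.
print(recall({"clip_1", "clip_7"}, {"clip_1", "clip_7", "clip_9"}))
```

A recall of 66.35 percent thus means that roughly two out of every three labeled bugs in the test sample appeared among the search results.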