DeepMind researchers have taught a machine learning model to understand the basic principles of how objects interact and to be “surprised” by physically impossible behavior, for example when an object suddenly disappears or fails to reappear where it was heading. Unlike similar algorithms, the new one learned the basic physical principles on its own by watching 28 thousand hours of video of various objects interacting. The article was published in Nature Human Behaviour.
Machine learning has made tremendous progress over the past decade, and advanced algorithms built for specific problems often outperform humans. Large language models such as GPT and vision-language models such as CLIP are of particular interest, says N+1. That is because they learn not only to perform a specific task (predicting the next token in a sentence or selecting a description for an image), but also pick up knowledge about many objects and concepts in the world during training, and this knowledge can then be applied to a wide range of tasks. However, machine learning researchers believe that even this is not enough to create a general artificial intelligence. For example, Yann LeCun noted in a recent article that large language models retain a large amount of knowledge after training, but lack the common sense that people form from the experience of interacting with the outside world.
The idea of learning about the world by observing the behavior of objects in it has appeared in scientific papers before. For example, in 2019 American researchers proposed building into an algorithm the behavior of babies, who observe the world, intuitively form an understanding of basic physical principles (for example, that an object falls when you let go of it) and are surprised when the behavior of objects does not match their expectations. The developers created an algorithm that picks out objects, tracks them and is “surprised” when the expected dynamics of the objects do not match the observations.
Researchers at DeepMind, led by Luis Piloto, took a similar approach, but created a model that learns for itself how objects should behave. It was called PLATO (Physics Learning through Auto-encoding and Tracking Objects). PLATO consists of two main parts: a perception module that finds objects in the video, and a dynamics module that predicts how the objects will move.
The perception module receives a frame together with masks that single out each object in it. It then encodes each masked image into an embedding: a compressed vector representation of the same data that keeps just enough information to recover the key details. To learn this, the algorithm turned the images into embeddings, then performed the reverse process and reconstructed the images. During training, the encoder and decoder parameters were adjusted so that the difference between the original image and the reconstructed one was minimal.
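The encode, reconstruct and adjust loop described above can be illustrated with a toy autoencoder. This is only a minimal sketch, not PLATO's perception module: it assumes a purely linear encoder and decoder, plain gradient descent, and made-up 6-dimensional "frames" that really have only two degrees of freedom. All names (`train_step`, `mean_error`, `DIM`, `EMB`) are hypothetical.

```python
import random

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def reconstruction_error(enc, dec, x):
    """Squared difference between x and its reconstruction."""
    x_hat = matvec(dec, matvec(enc, x))
    return sum((a - b) ** 2 for a, b in zip(x_hat, x))

def mean_error(enc, dec, data):
    """Average reconstruction error over a dataset."""
    return sum(reconstruction_error(enc, dec, x) for x in data) / len(data)

def train_step(enc, dec, x, lr):
    """One gradient-descent step on the squared reconstruction error."""
    z = matvec(enc, x)                      # embedding
    x_hat = matvec(dec, z)                  # reconstruction
    r = [a - b for a, b in zip(x_hat, x)]   # residual
    # Error propagated back to the embedding (uses dec before it is updated).
    back = [sum(dec[i][k] * r[i] for i in range(len(r))) for k in range(len(z))]
    for i in range(len(dec)):
        for k in range(len(z)):
            dec[i][k] -= lr * 2 * r[i] * z[k]
    for k in range(len(enc)):
        for j in range(len(x)):
            enc[k][j] -= lr * 2 * back[k] * x[j]

rng = random.Random(0)
DIM, EMB = 6, 2  # assumed sizes of the toy "frame" and of its embedding

# Toy data: 6-dim vectors built from only two underlying factors.
basis = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(EMB)]
data = []
for _ in range(200):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append([a * p + b * q for p, q in zip(basis[0], basis[1])])

enc = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(EMB)]
dec = [[rng.uniform(-0.5, 0.5) for _ in range(EMB)] for _ in range(DIM)]

error_before = mean_error(enc, dec, data)
for epoch in range(100):
    for x in data:
        train_step(enc, dec, x, lr=0.02)
error_after = mean_error(enc, dec, data)
```

Comparing `error_before` with `error_after` shows the reconstruction error shrinking as the encoder and decoder are adjusted together, which is the property the training procedure relies on.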
The dynamics module is based on a long short-term memory (LSTM) neural network, which “looks” at the current embedding and all previous ones in order to predict the next one, describing the future frame. If the model’s predictions then disagree with the actual behavior of the objects in the video, this is interpreted as surprise.
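Treating surprise as prediction error can be sketched in a few lines. The example below is an illustration under stated assumptions, not the model from the paper: the learned LSTM is replaced by a hand-written constant-velocity rule (`predict_next`, a hypothetical stand-in), the "embeddings" are just the 2-D position of a single object, and surprise is taken to be the sum of squared prediction errors.

```python
def predict_next(history):
    """Stand-in dynamics predictor: constant-velocity extrapolation.

    PLATO uses a learned LSTM here; this fixed rule is only a placeholder
    so that the surprise computation can be run end to end.
    """
    if len(history) < 2:
        return list(history[-1])
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]

def surprise(embeddings):
    """Sum of squared prediction errors over a sequence of embeddings."""
    total = 0.0
    for t in range(1, len(embeddings)):
        predicted = predict_next(embeddings[:t])
        actual = embeddings[t]
        total += sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return total

# Toy 2-D "embeddings": the (x, y) position of one object in each frame.
plausible = [[float(t), 0.0] for t in range(8)]   # steady motion to the right
impossible = [list(p) for p in plausible]
impossible[5] = [0.0, 3.0]                        # object jumps elsewhere for one frame

surprise_plausible = surprise(plausible)
surprise_impossible = surprise(impossible)
```

The sequence with the sudden jump accumulates a much larger error than the steady one, which is the signal that gets read as surprise.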
To train the algorithm, the researchers collected the Physical Concepts dataset, which they published on GitHub. It consists of two parts, both made up of short procedurally generated videos in which simple objects move and interact with each other. The first contains 300 thousand videos for training and another five thousand for testing. The second is a test set of five thousand videos, with physically correct and incorrect examples, that checks the algorithm’s understanding of five basic concepts:
- Solidity – objects are made of matter and cannot pass through each other.
- Continuity – for example, if an object moves behind two obstacles, it will be visible before the first obstacle, after the second and between them.
- Permanence – an object cannot simply disappear or appear out of nowhere.
- Immutability – an object retains its properties, such as shape, over time.
- Inertia – an object has a speed and a direction, and resists changes to them when it interacts with other objects.
NIXSolutions notes that, to confirm that the two-module design was the right one, the authors trained both the full model and a simplified one without the object perception module. It turned out that the full PLATO shows “surprise” in the right cases much more often than the simplified model. The researchers also used a dataset from the 2019 article by their colleagues and showed that PLATO is able to adapt to changed data.
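One simple way to quantify how often a model is “surprised” in the right cases is to compare surprise on paired physically correct and incorrect videos. The sketch below assumes such a pairwise score. The function name `accuracy` and all the numbers are made up for illustration and are not results from the paper.

```python
def accuracy(pairs):
    """Fraction of probe pairs in which the impossible video is more surprising.

    Each pair is (surprise_on_possible_video, surprise_on_impossible_video).
    """
    correct = sum(1 for possible, impossible in pairs if impossible > possible)
    return correct / len(pairs)

# Made-up surprise scores, for illustration only (not results from the paper).
full_model = [(1.0, 4.2), (0.8, 3.1), (1.5, 1.2), (0.9, 5.0)]
simplified_model = [(1.0, 1.1), (0.8, 0.7), (1.5, 1.2), (0.9, 0.6)]

acc_full = accuracy(full_model)
acc_simplified = accuracy(simplified_model)
```

A score near 0.5 would mean the model cannot tell the two kinds of video apart, while a score near 1.0 would mean it is almost always more surprised by the impossible one.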