NIX Solutions: Uber Algorithm Learned to Play Atari Games Better Than Humans

Engineers at Uber AI Labs have developed Go-Explore, a family of reinforcement-learning algorithms that outperforms most existing methods on a benchmark of 1980s Atari games, reports N+1. Go-Explore cleared 11 of the hardest games for reinforcement learning, including Montezuma’s Revenge and Pitfall, beating most other algorithms as well as the average human score. The key feature of Go-Explore is its ability to remember previously visited promising states and return to them for further exploration, which makes the reinforcement-learning process far more efficient. The developers have also demonstrated a possible practical application of Go-Explore in robotics, in a robot-arm control task. In the future, the algorithm may prove useful for many other problems, from natural language processing to the development of new drugs.

Reinforcement learning is a form of machine learning in which a program, or the part of an algorithm called an agent, learns to interact with its environment. The agent’s actions produce a response from the environment, which reacts to the agent’s decisions by changing a numerical value that plays the role of a reward. Some actions increase it and are therefore considered beneficial from the agent’s point of view, while others decrease the reward and are treated as undesirable. Thus, by defining a utility function that determines how the agent is rewarded for its actions, one can train it to interact with the environment so as to maximize that function: the agent learns to prefer the combinations of actions that yield the highest total reward.
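To make the agent-environment loop concrete, here is a minimal sketch of it in Python. The toy corridor environment and the random policy are illustrative assumptions, not something from the article or from Go-Explore itself.

```python
# Minimal sketch of the agent-environment loop: act, observe reward, accumulate it.
# The toy environment and random policy below are illustrative, not from the paper.

import random

class ToyCorridor:
    """A 1-D corridor: the agent starts at position 0 and the goal is position 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        """action is -1 (step left) or +1 (step right)."""
        self.position += action
        reward = 1.0 if self.position == 5 else 0.0   # reward only at the goal
        done = self.position == 5
        return self.position, reward, done

def random_policy(state):
    """Placeholder agent; a real RL agent would learn to maximize the total reward."""
    return random.choice([-1, 1])

env = ToyCorridor()
state, total_reward, done = 0, 0.0, False
for _ in range(10_000):                        # step cap so the random walk terminates
    action = random_policy(state)              # the agent acts on the environment ...
    state, reward, done = env.step(action)     # ... which responds with a state and a reward
    total_reward += reward                     # the quantity the agent tries to maximize
    if done:
        break
print("total reward:", total_reward)
```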

Algorithms based on this approach have made significant advances in recent years, notes NIX Solutions. For example, DeepMind’s family of self-learning programs that began with AlphaGo and AlphaGo Zero can beat the best human players at Go, shogi, and chess, while an algorithm from OpenAI defeats professional players in Dota 2.

Reinforcement learning systems can learn not only to play games but, potentially, to perform any task. However, it is often very difficult to design a good utility function, and the efficiency of the algorithm largely depends on it. For example, to send a robot across a room toward a door, one could define a utility function that rewards the robot only when it gets close to the door. But such a reward would be too “sparse”, that is, it would arrive too rarely: if reaching the goal requires many actions and the robot receives no intermediate feedback, it cannot tell which actions brought it closer to success and which ones got in the way. On the other hand, if the reward is, say, the distance to the door, the feedback is continuous and the algorithm is rewarded for getting closer to the target, but naively following the shortest route can lead the robot into a dead end or into an obstacle.
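The two reward designs discussed above can be contrasted in a few lines of code. This is an illustrative sketch with hypothetical door coordinates, not code from the paper.

```python
# Illustrative contrast between sparse and dense rewards for the "reach the door" task.

import math

DOOR = (9.0, 0.0)   # hypothetical door coordinates

def sparse_reward(robot_xy):
    """Reward only when the robot is essentially at the door: informative but very rare."""
    return 1.0 if math.dist(robot_xy, DOOR) < 0.5 else 0.0

def dense_reward(robot_xy):
    """Reward for simply getting closer to the door: constant feedback, but greedily
    following it can steer the robot straight into a dead end or an obstacle."""
    return -math.dist(robot_xy, DOOR)
```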

Developers at Uber’s artificial-intelligence lab, led by Jeff Clune, have created a family of reinforcement-learning algorithms called Go-Explore that is less prone to the problems of infrequent rewards and misleading local optima, thanks to its ability to remember promising states and return to them. According to the developers, this lets the algorithm explore the state space more efficiently and addresses two key failure modes of reinforcement learning. The first is that the algorithm loses the ability to return to earlier, already visited states even when they are highly “promising”, that is, likely to lead to new, previously unreachable areas: while exploring one area, it can abruptly switch to exploring another and forget the way back. The second is that the exploration process itself can prevent a return to earlier states, since a long series of random actions carries the agent far from the intended starting point.

As a test environment for their algorithm, the Uber engineers used a popular benchmark of classic Atari games, selecting the titles that are most difficult and have historically resisted machine-learning algorithms, such as Montezuma’s Revenge and Pitfall, which have themselves become benchmarks. Both games are especially hard for reinforcement learning because their rewards are very sparse: between the moments when the agent receives a reward confirming that its actions were correct, the algorithm may perform hundreds of actions, making it difficult to work out which of them led to success or failure.

To work around this issue, Go-Explore remembers the states it visits in the environment so that it can return to them later. The “archive” of states is built iteratively. Since there are far too many states to store each one explicitly, the algorithm groups similar states into cells. At the start of each cycle, Go-Explore picks a cell from the archive based on the weights assigned to the cells, giving preference to promising, recently discovered ones; it then returns to that state (the “go to state” step) and explores the surrounding areas from there by performing random actions. Afterwards the archive is updated with the number of visits to each cell, the reward obtained in it, and the trajectory that reached it. By memorizing states and returning to them, Go-Explore does not lose track of the most interesting states, does not over-explore the area around the starting point, and does not get stuck near false optima, as in the earlier example of the robot trapped in a dead end.
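The cycle described above can be summarized in a short sketch. The cell representation, the selection weights, and the emulator interface (restore, save, step, sample_action) are simplifying assumptions for illustration, not the exact implementation from the Go-Explore paper.

```python
# Simplified sketch of the exploration phase: select a cell, return to it, explore
# randomly, and update the archive of cells with trajectories and scores.

import random
from dataclasses import dataclass, field

@dataclass
class Cell:
    snapshot: object                                # saved emulator state for the "go to state" step
    trajectory: list = field(default_factory=list)  # actions that reached this cell
    score: float = 0.0                              # best reward accumulated when reaching the cell
    visits: int = 0                                 # how many times the cell was selected

def cell_key(frame):
    """Group similar states into one cell; a placeholder for the real downscaling."""
    return tuple(frame)

def select_cell(archive):
    """Prefer rarely visited cells; the real algorithm uses richer heuristic weights."""
    cells = list(archive.values())
    weights = [1.0 / (c.visits + 1) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]

def exploration_phase(env, archive, iterations=1000, random_steps=100):
    for _ in range(iterations):
        cell = select_cell(archive)
        env.restore(cell.snapshot)                  # "go to state": restore the saved emulator state
        cell.visits += 1
        actions, score = list(cell.trajectory), cell.score
        for _ in range(random_steps):               # explore from there with random actions
            action = env.sample_action()
            frame, reward, done = env.step(action)
            actions.append(action)
            score += reward
            key = cell_key(frame)
            known = archive.get(key)
            if known is None or score > known.score:   # keep new cells or better routes to known ones
                archive[key] = Cell(env.save(), list(actions), score)
            if done:
                break
```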

Furthermore, if necessary, after the exploration phase Go-Explore can “robustify” the solutions it has found, making the trajectories discovered during exploration resistant to possible noise. The point is that, unlike the real world, the game environment in the Atari emulator is largely deterministic: the same actions lead to the same results, and for simplicity the return to a previous state is performed by restoring a previously saved emulator state. Once randomness is introduced into the environment, however, there is no guarantee that after returning to one of the earlier states the previously found path can still be followed and will still lead to the target area. To increase robustness, the developers use a backward algorithm, running it iteratively and moving the agent’s starting point step by step from the end of the trajectory toward its beginning. To reduce determinism, random delays imitating a human style of play were also introduced into the algorithm’s operation.
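A rough sketch of that backward pass might look as follows. The helper functions are illustrative stubs, and the loop is a simplified reading of the procedure described above rather than the authors’ code.

```python
# Hedged sketch of backward robustification: train a policy to reproduce a trajectory
# found in the exploration phase, starting near its end and moving the start point
# toward the beginning once the agent succeeds reliably from the current start.

def train_policy_from(policy, env, action_prefix):
    """Stub: run RL training with the agent placed mid-trajectory by replaying action_prefix."""
    pass

def success_rate(policy, env, action_prefix, n_trials=10):
    """Stub: fraction of noisy rollouts that still reach the end of the trajectory."""
    return 1.0

def robustify(trajectory, policy, env, threshold=0.9):
    start = len(trajectory)
    while start > 0:
        prefix = trajectory[:start]                  # actions that place the agent mid-trajectory
        train_policy_from(policy, env, prefix)
        if success_rate(policy, env, prefix) >= threshold:
            start -= 1                               # the agent copes from here: move the start earlier
    return policy                                    # by the end, the policy is robust from the very start
```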

The raw game frames from the Atari emulator are preprocessed: the image is converted to grayscale and its resolution is reduced. The frame-processing parameters are adjusted dynamically, since the amount of detail shown on the screen can vary even within a single game. In each game, the algorithm processes about 30 billion frames, which is comparable to other algorithms evaluated on the Atari benchmark.
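A simple version of this preprocessing might look like the sketch below. The target resolution and the number of gray levels are illustrative values; as noted above, the actual algorithm adjusts these parameters dynamically.

```python
# Sketch of frame preprocessing: grayscale conversion, downscaling by block averaging,
# and coarse quantization, so that visually similar frames map to the same cell.

import numpy as np

def preprocess_frame(rgb_frame, out_h=8, out_w=11, n_gray_levels=8):
    gray = rgb_frame.mean(axis=2)                               # RGB (H, W, 3) -> grayscale (H, W)
    h, w = gray.shape
    gray = gray[: (h // out_h) * out_h, : (w // out_w) * out_w] # trim so blocks divide evenly
    blocks = gray.reshape(out_h, h // out_h, out_w, w // out_w)
    small = blocks.mean(axis=(1, 3))                            # average each block -> (out_h, out_w)
    return np.floor(small / 256.0 * n_gray_levels).astype(np.uint8)

# Example with a random frame of standard Atari size (210 x 160 RGB):
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
print(preprocess_frame(frame).shape)   # (8, 11) -- a compact key for grouping frames into cells
```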

Go-Explore shows good results even when the agent knows nothing about the environment. Nevertheless, the developers also added domain knowledge to the state representation, extracted from the raw frames that the Atari emulator feeds into the algorithm, such as the agent’s current coordinates in the frame, the current room and level, and the items the character holds (the keys in Montezuma’s Revenge). As one would expect, using this enriched representation leads to a significant increase in the score achieved in the game.
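Such a domain-knowledge state could be represented, for example, as a small record of game features; the exact field set below is an illustrative assumption based on the Montezuma’s Revenge example.

```python
# Sketch of a domain-knowledge cell: instead of a downscaled frame, the cell is keyed
# by a handful of game features read from the emulator frames.

from dataclasses import dataclass

@dataclass(frozen=True)
class DomainKnowledgeCell:
    x: int        # agent's horizontal position, coarsely discretized
    y: int        # agent's vertical position, coarsely discretized
    room: int     # current room
    level: int    # current level
    n_keys: int   # number of keys held (Montezuma's Revenge)

# Frozen dataclasses are hashable, so such cells can serve directly as archive keys:
cell = DomainKnowledgeCell(x=2, y=5, room=1, level=0, n_keys=1)
archive = {cell: {"visits": 1, "score": 400.0}}
```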