NIX Solutions: AI Selects Sounds for Silent Video

American developers have created an algorithm that automatically selects sounds for a video, for example the sound of a bicycle when one moves through the frame. It also adjusts the sound parameters depending on what is happening in the video. A preprint of the paper is available on the authors’ website.

In most cases, cameras record video together with sound from an internal or external microphone. But some kinds of footage have no sound at all. Drones are one example: as a rule they have no microphone, and if they do, the recording mostly contains motor and propeller noise. As a result, editors who want to convey the real sounds of the scene, rather than just add music, have to carefully pick similar sounds from a library and check that they match the behavior of the objects in the frame.

Developers at Carnegie Mellon University and Runway, led by Nikolas Martelaro, have created an algorithm that does this job for the editor. First, the algorithm detects sound sources in the frame. These are of two types: specific objects, and places with a characteristic background sound, such as a cafe. The video is first split into scenes wherever the histogram changes sharply between two consecutive frames. Then the CLIP neural network classifies the objects in each scene, using the effects in the Epidemic Sound library, which contains 90 thousand sounds, as classes, according to N+1. As a result, for each scene the algorithm returns the five most likely effects for objects and environments. By default, the system selects one of them, but the user can enable additional ones.
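The two steps in this paragraph, splitting the video into scenes and ranking candidate sound effects, can be illustrated with a minimal sketch. It is not the authors' code. The frame format (a flat list of grayscale pixels), the bin count, the threshold, and the function names are all assumptions made for illustration. Cosine similarity over precomputed embedding vectors stands in for the CLIP classifier, which is not called here.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[int]  # assumed format: flattened grayscale pixels, 0..255


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized brightness histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def scene_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Indices of frames whose histogram differs sharply from the previous one."""
    cuts: List[int] = []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            # L1 distance between normalized histograms lies in [0, 2].
            distance = sum(abs(a - b) for a, b in zip(hist, prev))
            if distance > threshold:  # threshold is an illustrative value
                cuts.append(i)
        prev = hist
    return cuts


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_labels(scene_vector: Sequence[float],
                 label_vectors: Dict[str, Sequence[float]],
                 k: int = 5) -> List[str]:
    """The k labels whose (hypothetical) embeddings best match the scene."""
    ranked = sorted(label_vectors,
                    key=lambda name: cosine(scene_vector, label_vectors[name]),
                    reverse=True)
    return ranked[:k]
```

With three dark frames followed by two bright ones, `scene_cuts` reports a single cut at the first bright frame. In a real pipeline the vectors passed to `top_k_labels` would come from an image-text model such as CLIP, with one text embedding per sound effect name.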

After selecting sound effects, the algorithm assigns time intervals to them, because an object may be present for only part of a scene rather than all of it, explains NIX Solutions. Each scene is then divided into one-second fragments. For each fragment, the algorithm determines the location of the sound sources and selects matching stereo and volume parameters, so that moving objects sound realistic.
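A rough sketch of this last stage, under stated assumptions: per-second presence flags are merged into time intervals, and the horizontal position of an object in each one-second fragment is mapped to left and right channel gains. The article does not say which panning rule the authors use. The constant-power pan law below is a common audio technique swapped in for illustration, and the function names and input format are hypothetical.

```python
import math
from typing import List, Tuple


def presence_intervals(present: List[bool]) -> List[Tuple[int, int]]:
    """Merge per-second presence flags into [start, end) intervals in seconds."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for second, flag in enumerate(present):
        if flag and start is None:
            start = second
        elif not flag and start is not None:
            intervals.append((start, second))
            start = None
    if start is not None:
        intervals.append((start, len(present)))
    return intervals


def stereo_gains(x_center: float, volume: float = 1.0) -> Tuple[float, float]:
    """Constant-power pan: x_center = 0 is far left, 1 is far right."""
    x = min(max(x_center, 0.0), 1.0)  # clamp to the frame width
    angle = x * math.pi / 2
    return volume * math.cos(angle), volume * math.sin(angle)


def gains_per_second(track: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """track holds one (x_center, volume) pair per one-second fragment."""
    return [stereo_gains(x, min(max(v, 0.0), 1.0)) for x, v in track]
```

For an object in the center of the frame, `stereo_gains(0.5)` gives equal gains of about 0.707 on both channels, so the total power stays constant as the object moves from one side to the other.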