Stability AI, best known for its Stable Diffusion text-to-image model, has publicly released Stable Audio, a neural network that generates short audio clips from a text description of the desired music or sound. Stable Audio builds on the same core AI techniques that underpin Stable Diffusion’s image generation and marks the company’s first move into audio creation.
A Musical Vision in Text
Ed Newton-Rex, Vice President of Audio at Stability AI, said: “Stability AI is best known for its work with images, but now we’re launching our first product for music and audio creation, called Stable Audio. The idea is very simple: you describe the music or audio you want to hear in text, and our system generates it for you.”
Newton-Rex has worked on computer-generated music since 2011, when he founded Jukedeck, which TikTok acquired in 2019. The technology behind Stable Audio, however, originated at Harmonai, Stability AI’s in-house music research studio, founded by Zach Evans. Evans explained that the text model uses a technique known as Contrastive Language Audio Pretraining (CLAP). The Stable Audio model has approximately 1.2 billion parameters, roughly on par with the original Stable Diffusion image generation model.
Beyond MIDI and Symbolic Generation
Generating basic audio tracks by computer is not new, but Stable Audio departs from traditional methods such as symbolic generation, which typically works with MIDI (Musical Instrument Digital Interface) files. Its generative AI approach lets users create music that goes beyond the repetitive note sequences commonly associated with MIDI and symbolic generation.
Stable Audio works directly with raw audio samples for higher-quality output. It was trained on more than 800,000 licensed pieces of music from the AudioSparx audio library. Evans elaborated, “One of the biggest challenges when creating text models is obtaining high-quality audio data with appropriate metadata.”
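To make the difference between symbolic and raw-audio representations concrete, here is a minimal Python sketch. It is purely illustrative and is not Stability AI's code: the three-note melody, the 8,000 Hz sample rate and the `render` helper are assumptions chosen for the example. It shows how a few MIDI-style note events expand into thousands of raw amplitude samples, the kind of data a raw-audio model has to handle.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an illustrative value, not Stable Audio's

# Symbolic representation: (MIDI note number, duration in seconds) events.
melody = [(60, 0.25), (64, 0.25), (67, 0.5)]  # C4, E4, G4


def midi_to_hz(note: int) -> float:
    """Equal-temperament conversion, with A4 (MIDI note 69) at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render(events, sample_rate=SAMPLE_RATE):
    """Turn note events into raw amplitude samples using a plain sine tone."""
    samples = []
    for note, duration in events:
        freq = midi_to_hz(note)
        count = int(duration * sample_rate)
        samples.extend(
            math.sin(2 * math.pi * freq * i / sample_rate) for i in range(count)
        )
    return samples


waveform = render(melody)
print(len(melody), "symbolic events")
print(len(waveform), "raw samples")
```

A symbolic system only has to choose the handful of events in `melody`, while a raw-audio system has to produce every value in `waveform`, which is also what lets it capture timbre and texture that note events alone cannot describe.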
With image generation models, users often ask for output in the style of a specific artist. Stable Audio takes a different approach, as its creators believe most musicians prefer to keep their creative freedom.
Pricing and Availability
Stable Audio is available in a free version and a Pro plan priced at $12 per month, notes NIX Solutions. The free version lets users create up to 20 tracks per month, each up to 20 seconds long. The Pro plan raises the limits to 500 tracks per month and 90 seconds per track, and also permits commercial use of the generated works. As part of the launch, Stability AI will release a guide to text prompts to help users get the most out of Stable Audio.
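For a rough sense of scale, the plan limits quoted above can be turned into a monthly ceiling on generated audio. The short Python sketch below uses only the figures from this article. The `plans` dictionary and the `max_audio_seconds` helper are hypothetical names introduced for illustration.

```python
# Plan limits as quoted in the article; the structure itself is illustrative.
plans = {
    "Free": {"tracks_per_month": 20, "max_seconds": 20, "price_usd": 0},
    "Pro": {"tracks_per_month": 500, "max_seconds": 90, "price_usd": 12},
}


def max_audio_seconds(plan):
    """Upper bound on audio per month if every track uses the full length."""
    return plan["tracks_per_month"] * plan["max_seconds"]


for name, plan in plans.items():
    total = max_audio_seconds(plan)
    print(f"{name}: up to {total} s ({total / 60:.1f} min) of audio per month")
```

Under these figures, the free version tops out at 400 seconds of audio per month and the Pro plan at 45,000 seconds, or 750 minutes.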