Researchers at Google have revealed a text-to-music AI that creates songs that last as long as five minutes.
The team published a paper introducing MusicLM to the world and summarizing their work and findings. The examples are strikingly faithful to their text prompts.
The researchers claim their model “outperforms previous systems in both sound quality and adherence to text descriptions.”
The examples include 30-second snippets of songs generated from input captions like these:
- “The main soundtrack of an arcade game. It’s fast-paced and upbeat, with catchy electric guitar riffs. The music is repetitive and memorable, but contains unexpected sounds like cymbal crashes and drum rolls.”
- “Fusion of reggaeton and electronic dance music with an otherworldly, spacey sound. The music is designed to evoke a feeling of being lost in space: danceable, yet mysterious and awe-inspiring.”
- “Rising synths play reverb-heavy arpeggios, backed by pads, sub-basslines and soft drums. The song is full of synth sounds that create a calm, adventurous vibe. It might be played at a festival between two songs as a buildup.”
Using AI to generate music is nothing new, but until now no published tool could generate genuinely decent music from simple text prompts. According to the team behind MusicLM, that has changed.
The researchers explain in their paper the various challenges facing AI music generation. The first is the scarcity of paired audio/text data. This contrasts with text-to-image machine learning, where huge datasets have “greatly contributed” to recent progress, they say.
For example, OpenAI’s DALL-E tool and Stable Diffusion both sparked a surge of public interest in the area as well as immediate use cases.
Another challenge of AI music generation is that music is structured “along the time dimension”: a music track unfolds over a period of time. That makes it much harder to capture the intent of a piece of music with a basic text caption than it is for a still image.
MusicLM is the first step in overcoming these challenges, the team says.
It is a “hierarchical sequence-to-sequence model for music generation” that uses machine learning to generate sequences of different levels of a song, including structure, melody and individual notes.
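To make the idea of hierarchical generation concrete, here is a minimal toy sketch of a pipeline that first produces coarse tokens describing a song’s high-level structure and then expands each into fine-grained tokens. The function names, vocabulary sizes, and seeded-random “models” are all hypothetical stand-ins, not MusicLM’s actual learned components, which are trained neural sequence models.

```python
import random

def text_to_semantic(prompt: str, length: int = 8) -> list[int]:
    """Stage 1 (stand-in): map a text prompt to coarse 'semantic' tokens
    capturing high-level structure such as melody and rhythm."""
    rng = random.Random(prompt)  # deterministic per prompt, for illustration
    return [rng.randrange(64) for _ in range(length)]

def semantic_to_acoustic(semantic: list[int], per_token: int = 4) -> list[int]:
    """Stage 2 (stand-in): expand each coarse token into several fine-grained
    'acoustic' tokens, which an audio codec would decode into a waveform."""
    acoustic = []
    for tok in semantic:
        rng = random.Random(tok)
        acoustic.extend(rng.randrange(1024) for _ in range(per_token))
    return acoustic

def generate(prompt: str) -> list[int]:
    """Full pipeline: text -> coarse structure -> fine detail."""
    return semantic_to_acoustic(text_to_semantic(prompt))

tokens = generate("fast-paced arcade-game soundtrack")
print(len(tokens))  # 8 coarse tokens x 4 acoustic tokens each = 32
```

The key design point this sketch illustrates is the hierarchy: coarse decisions about structure are made once, and the expensive fine-grained detail is generated conditioned on them, stage by stage.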
To learn how to do this, the model is trained on a large dataset of unlabeled music, along with a dataset of over 5,500 music-caption pairs prepared by musicians. This caption dataset has been released to the public to support future research.
The model also accepts voice input such as whistling and humming, which can help convey the melody of a song; the model then renders that melody “in the style described in the text prompt.”
Although the model has not yet been released, its creators acknowledge the risk of misappropriating creative content if the generated songs are not sufficiently different from the source material the model learned from.