WavJourney: Compositional Audio Creation with
Large Language Models

Xubo Liu¹, Zhongkai Zhu², Haohe Liu¹, Yi Yuan¹, Meng Cui¹, Qiushi Huang¹, Jinhua Liang³,
Yin Cao⁴, Qiuqiang Kong⁵, Mark D. Plumbley¹, Wenwu Wang¹

¹University of Surrey, ²Independent Researcher, ³Queen Mary University of London,
⁴Xi'an Jiaotong-Liverpool University, ⁵The Chinese University of Hong Kong

Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
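
The script-to-program step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not WavJourney's actual compiler: the script fields (`audio_type`, `layout`, `len`), the `fake_model` stand-in that returns a constant signal instead of calling a real generation model, and the rule that foreground items are concatenated while background items are mixed over the result.

```python
from typing import List

SAMPLE_RATE = 8  # tiny, hypothetical rate so the toy signals stay small
Wave = List[float]

def fake_model(kind: str, seconds: int) -> Wave:
    """Stand-in for a speech/music/sound-effect model (hypothetical)."""
    level = {"speech": 1.0, "music": 0.5, "sound_effect": 0.25}[kind]
    return [level] * (seconds * SAMPLE_RATE)

def concatenate(waves: List[Wave]) -> Wave:
    """Join clips end to end."""
    out: Wave = []
    for w in waves:
        out.extend(w)
    return out

def mix(base: Wave, overlay: Wave, offset: int = 0) -> Wave:
    """Sum overlay onto base from sample `offset`, padding with silence."""
    length = max(len(base), offset + len(overlay))
    out = base + [0.0] * (length - len(base))
    for i, sample in enumerate(overlay):
        out[offset + i] += sample
    return out

def compile_script(script: List[dict]) -> List[str]:
    """Turn a script into program lines; each line is one call."""
    lines, foreground, background = [], [], []
    for i, item in enumerate(script):
        name = f"w{i}"
        lines.append(f"{name} = fake_model({item['audio_type']!r}, {item['len']})")
        (foreground if item["layout"] == "foreground" else background).append(name)
    lines.append(f"out = concatenate([{', '.join(foreground)}])")
    for name in background:
        lines.append(f"out = mix(out, {name})")
    return lines

def run(lines: List[str]) -> Wave:
    """Execute the generated lines; only safe because we generated them."""
    env = {"fake_model": fake_model, "concatenate": concatenate, "mix": mix}
    for line in lines:
        exec(line, env)
    return env["out"]

# Hypothetical script: two foreground clips in sequence, music underneath.
script = [
    {"audio_type": "speech", "layout": "foreground", "len": 2},
    {"audio_type": "sound_effect", "layout": "foreground", "len": 1},
    {"audio_type": "music", "layout": "background", "len": 3},
]
program = compile_script(script)
audio = run(program)
print("\n".join(program))
```

Keeping the compiled program as plain text lines mirrors the idea that each line is a single model call or operation, so a person can read or edit it before it is executed.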

Scenario 1: Science Fiction

Instruction: Generate an audio in Science Fiction theme: Mars News reporting that Humans send light-speed probe to Alpha Centauri. Start with news anchor, followed by a reporter interviewing a chief engineer from an organization that built this probe, founded by United Earth and Mars Government, and end with the news anchor again.

Scenario 2: Education

Instruction: Generate a one-minute introduction to quantum mechanics by a professor.

Scenario 3: Fictional Radio

Instruction: Generate a fictional radio show: "In the bustling artistic landscape of 1920s Paris, a local surrealist artist vanishes without a trace, leaving the community in a state of anxious speculation. The broadcast delves into this perplexing disappearance, exploring its impact on the bohemian circles frequenting the famed nightclub, Le Chat Noir. Tune in for a captivating minute of news that uncovers the layers of mystery shrouding the City of Lights. From police bafflement to public intrigue, we bring you the latest on this enigmatic tale that has both captivated and confounded Parisians. Hosted by Edward Thompson for the BBC World Service, this broadcast serves as a haunting reminder of the secrets that lurk in the corners of artistic brilliance and nocturnal Paris."

The images were generated by Midjourney. The script was generated by ChatGPT 4. The video was made by Jeff Barry.

Multi-genre Audio Storytelling Creation

There are seven audio clips for each row:

(a) Generated audio clips by WavJourney.

(b) Generated audio clips by AudioLDM.

(c) Generated audio clips by AudioGen.

(d) Generated audio clips by AudioLDM2.

(e) Generated audio clips by Make an Audio.

(f) Generated audio clips by AudioBox (supports only 10-second generation).

(g) Generated audio clips by Tango (supports only 10-second generation).

Sci-Fi: "Universal translators malfunction; humans and aliens bond over shared melodies."


(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) AudioLDM2	(e) Make an Audio	(f) AudioBox	(g) Tango

Travel Exploration: "Peru's Andes peaks whisper tales of the Inca, where golden cities once stood."


(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) AudioLDM2	(e) Make an Audio	(f) AudioBox	(g) Tango

Romantic Drama: "Secrets whispered, emotions swell, two hearts navigating love's turbulent sea."


(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) AudioLDM2	(e) Make an Audio	(f) AudioBox	(g) Tango

Radio Play: "Seated by the window, rain outside, Kate listens to Jake's poetry."


(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) AudioLDM2	(e) Make an Audio	(f) AudioBox	(g) Tango

Education: "Soundscapes and Symphonies: Introduction to Music Theory and Composition."


(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) AudioLDM2	(e) Make an Audio	(f) AudioBox	(g) Tango

Case study on AudioCaps Benchmark

We present audio samples generated by WavJourney alongside samples from state-of-the-art (SOTA) text-to-audio generation methods on the AudioCaps dataset. WavJourney demonstrates superior performance over the SOTA methods, particularly when conditioned on intricate text descriptions, and it even stands on par with the ground truth. Its compositional design enables WavJourney to model the complex spatio-temporal acoustic relationships among multiple sounds.
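
To make the idea of temporal relationships among multiple sounds concrete, the sketch below lays out one caption on a timeline. It is an illustrative assumption, not the method used by WavJourney: the `Element` type, the `schedule` function, the clip durations, and the rule that background sounds span the whole foreground sequence are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Element:
    desc: str
    layout: str           # "foreground" plays in sequence, "background" overlaps
    seconds: float = 0.0  # ignored for background elements in this sketch

def schedule(elements: List[Element]) -> List[Tuple[str, float, float]]:
    """Return (description, start, end) in seconds for every element."""
    cursor = 0.0
    timeline: List[Tuple[str, float, float]] = []
    background: List[Element] = []
    for el in elements:
        if el.layout == "foreground":
            # "followed by" / "then": each foreground clip starts where the last ended
            timeline.append((el.desc, cursor, cursor + el.seconds))
            cursor += el.seconds
        else:
            background.append(el)
    for el in background:
        # "while": background sounds run under the full foreground sequence
        timeline.append((el.desc, 0.0, cursor))
    return timeline

# Hypothetical decomposition of the caption "A man talking followed by a goat
# baaing then a metal gate sliding while ducks quack and wind blows into a
# microphone"; the durations are made up.
elements = [
    Element("man talking", "foreground", 3.0),
    Element("goat baaing", "foreground", 2.0),
    Element("metal gate sliding", "foreground", 2.0),
    Element("ducks quacking", "background"),
    Element("wind blowing into a microphone", "background"),
]
timeline = schedule(elements)
for desc, start, end in timeline:
    print(f"{start:4.1f}s to {end:4.1f}s  {desc}")
```

Under these assumptions, the three foreground sounds occupy consecutive slots while the ducks and the wind run underneath for the full duration.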

There are seven audio clips for each row:

(a) Generated audio clips by WavJourney.

(b) Generated audio clips by AudioLDM.

(c) Generated audio clips by Tango.

(d) Ground-truth audio from the AudioCaps dataset.

(e) Generated audio clips by AudioLDM2.

(f) Generated audio clips by Make an Audio.

(g) Generated audio clips by AudioBox.

Audio Caption: "A man talking followed by a goat baaing then a metal gate sliding while ducks quack and wind blows into a microphone"



(a) WavJourney	(b) AudioLDM	(c) Tango	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A train running on a railroad track followed by a vehicle door closing and a man talking in the distance while a train horn honks and railroad crossing warning signals ring"



(a) WavJourney	(b) AudioLDM	(c) Tango	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A man speaking followed by a woman talking then plastic clacking as footsteps walk on grass and a rooster crows in the distance"



(a) WavJourney	(b) AudioLDM	(c) Tango	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A man speaking over an intercom as a helicopter engine runs followed by several gunshots firing"



(a) WavJourney	(b) AudioLDM	(c) Tango	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

More samples on AudioCaps:

Audio Caption: "Outside noises of insects buzzing around, birds communicating and a man exchanging information with another man"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A loud burst followed by rustling and then spraying"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A small motor buzzing followed by a man speaking as a metal door closes"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A sewing machine operating idle followed by a man talking then several instances of metal ratcheting"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A gun is fired few times followed by magazine clinking"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

More samples on Clotho:

Audio Caption: "Birds are tweeting with highway traffic in the background"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A circular saw blade is cutting something as music is playing nearby"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "Religious chants over a loud speaker outside with birds chirping"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "The low hum of resonating sounds against a background of conversation"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Audio Caption: "A remote control car is running and then stops and then runs again"



(a) WavJourney	(b) AudioLDM	(c) AudioGen	(d) Ground truth	(e) AudioLDM2	(f) Make an Audio	(g) AudioBox

Acknowledgements

This work is partly supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the Centre for Vision, Speech and Signal Processing, Faculty of Engineering and Physical Sciences, University of Surrey, and the grant "XJTLU RDF-22-01-084". For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.