WavJourney: Compositional Audio Creation with
Large Language Models




Xubo Liu1, Zhongkai Zhu2, Haohe Liu1, Yi Yuan1, Meng Cui1, Qiushi Huang1, Jinhua Liang3,
Yin Cao4, Qiuqiang Kong5, Mark D. Plumbley1, Wenwu Wang1


1University of Surrey, 2Independent Researcher, 3Queen Mary University of London,
4Xian Jiaotong Liverpool University, 5The Chinese University of Hong Kong



Paper                   Code                   App                   Discord






Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.





Scenario 1: Science Fiction

Instruction: Generate an audio in Science Fiction theme: Mars News reporting that Humans send light-speed probe to Alpha Centauri. Start with news anchor, followed by a reporter interviewing a chief engineer from an organization that built this probe, founded by United Earth and Mars Government, and end with the news anchor again.






Scenario 2: Education

Instruction: Generate a one-minute introduction to quantum mechanics by a professor.






Scenario 3: Fictional Radio

Instruction: Generate a fictional radio show: "In the bustling artistic landscape of 1920s Paris, a local surrealist artist vanishes without a trace, leaving the community in a state of anxious speculation. The broadcast delves into this perplexing disappearance, exploring its impact on the bohemian circles frequenting the famed nightclub, Le Chat Noir. Tune in for a captivating minute of news that uncovers the layers of mystery shrouding the City of Lights. From police bafflement to public intrigue, we bring you the latest on this enigmatic tale that has both captivated and confounded Parisians. Hosted by Edward Thompson for the BBC World Service, this broadcast serves as a haunting reminder of the secrets that lurk in the corners of artistic brilliance and nocturnal Paris."

The images were generated by Midjourney. The script was generated by ChatGPT 4. Video made by Jeff Barry.






Multi-genre Audio Storytelling Creation


There are seven audio clips for each row:

(a) Generated audio clips by WavJourney

(b) Generated audio clips by AudioLDM.

(c) Generated audio clips by AudioGen.

(d) Generated audio clips by AudioLDM2.

(e) Generated audio clips by Make an Audio.

(f) Generated audio clips by AudioBox(only support 10s generation).

(g) Generated audio clips by Tango(only support 10s generation).





Sci-Fi: ''Universal translators malfunction; humans and aliens bond over shared melodies. ''

(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) AudioLDM2
(e) Make An Audio
(f) AudioBox
(g) Tango
Travel Exploration: ''Peru's Andes peaks whisper tales of the Inca, where golden cities once stood. ''

(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) AudioLDM2
(e) Make An Audio
(f) AudioBox
(g) Tango
Romantic Drama: ''Secrets whispered, emotions swell, two hearts navigating love's turbulent sea. ''

(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) AudioLDM2
(e) Make An Audio
(f) AudioBox
(g) Tango
Radio Play: ''Seated by the window, rain outside, Kate listens to Jake's poetry. ''

(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) AudioLDM2
(e) Make An Audio
(f) AudioBox
(g) Tango
Education: ''Soundscapes and Symphonies: Introduction to Music Theory and Composition. ''

(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) AudioLDM2
(e) Make An Audio
(f) AudioBox
(g) Tango



Case study on AudioCaps Benchmark


We present audio samples generated by WavJourney in comparison to SOTA text-to-audio generation methods on AudioCaps dataset. WavJourney demonstrates superior performance over SOTA methods, particularly when conditioned on intricate text descriptions. It even stands on par with the ground truth. The compositional design enables WavJourney to model the complex spatio-temporal acoustic relationships among multiple sounds.

There are seven audio clips for each row:

(a) Generated audio clips by WavJourney

(b) Generated audio clips by AudioLDM.

(c) Generated audio clips by Tango.

(d) Ground truth audio in AudioCaps dataset.

(e) Generated audio clips by AudioLDM2.

(f) Generated audio clips by Make an Audio.

(g) Generated audio clips by AudioBox.





Audio Caption: ''A man talking followed by a goat baaing then a metal gate sliding while ducks quack and wind blows into a microphone ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) Tango
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox

Audio Caption: ''A train running on a railroad track followed by a vehicle door closing and a man talking in the distance while a train horn honks and railroad crossing warning signals ring ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) Tango
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox

Audio Caption: ''A man speaking followed by a woman talking then plastic clacking as footsteps walk on grass and a rooster crows in the distance ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) Tango
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox

Audio Caption: ''A man speaking over an intercom as a helicopter engine runs followed by several gunshots firing ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) Tango
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
More samples on AudioCaps:
Audio Caption: ''Outside noises of insects buzzing around, birds communicating and a man exchanging information with another man ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A loud burst followed by rustling and then spraying ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A small motor buzzing followed by a man speaking as a metal door closes ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A sewing machine operating idle followed by a man talking then several instances of metal ratcheting ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A gun is fired few times followed by magazine clinking ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
More samples on Clotho:
Audio Caption: ''Birds are tweeting with highway traffic in the background ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A circular saw blade is cutting something as music is playing nearby ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''Religious chants over a loud speaker outside with birds chirping ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''The low hum of resonating sounds against a background of conversation''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox
Audio Caption: ''A remote control car is running and then stops and then runs again ''

fname
fname
fname
fname
fname
fname
fname
(a) WavJourney
(b) AudioLDM
(c) AudioGen
(d) Ground truth
(e) AudioLDM2
(f) Make an Audio
(g) AudioBox





WavJourney on Twitter




Acknowledgements

This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the Centre for Vision, Speech and Signal Processing, Faculty of Engineering and Physical Science, University of Surrey and a Grant "XJTLU RDF-22-01-084". For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.