Amuse: Human-AI Collaborative Songwriting with Multimodal Inspirations

KAIST    Carnegie Mellon University

We propose Amuse, a songwriting assistant that transforms multimodal (image, text, audio) inputs into chord progressions that can be seamlessly incorporated into songwriters' creative process.

Abstract


Songwriting is often driven by multimodal inspirations, such as imagery, narratives, or existing music, yet current music AI systems offer songwriters little support for incorporating these multimodal inputs into their creative process. We introduce Amuse, a songwriting assistant that transforms multimodal (image, text, or audio) inputs into chord progressions that can be seamlessly incorporated into songwriters' creative process. A key feature of Amuse is its novel method for generating coherent chords that are relevant to music keywords in the absence of datasets with paired examples of multimodal inputs and chords. Specifically, we propose a method that leverages multimodal LLMs to convert multimodal inputs into noisy chord suggestions and uses a unimodal chord model to filter the suggestions. A user study with songwriters shows that Amuse effectively supports transforming multimodal ideas into coherent musical suggestions, enhancing users' agency and creativity throughout the songwriting process.
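The propose-then-filter idea in the abstract can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the paper's actual models: `propose` hard-codes candidate progressions in place of a multimodal LLM, and a small bigram table plays the role of the unimodal chord prior (an LSTM in the paper) that scores coherence. Candidates scoring below a threshold are rejected.

```python
# Toy sketch of rejection filtering of LLM-proposed chord progressions.
# Assumptions: the candidate list, the bigram table, and the threshold
# are all illustrative placeholders, not values from the paper.

CANDIDATES = [
    ["C", "G", "Am", "F"],    # common, coherent progression
    ["C", "F#", "B", "Eb"],   # erratic jumps; should score poorly
    ["Am", "F", "C", "G"],
]

BIGRAM = {  # toy chord-transition log-probabilities
    ("C", "G"): -0.5, ("G", "Am"): -0.7, ("Am", "F"): -0.6,
    ("F", "C"): -0.5, ("Am", "C"): -0.9, ("C", "F"): -0.8,
    ("F", "G"): -0.7, ("G", "C"): -0.4,
}
FLOOR = -5.0  # log-probability assigned to unseen transitions

def prior_log_prob(progression):
    """Score a progression under the toy bigram 'chord prior'."""
    return sum(BIGRAM.get(pair, FLOOR)
               for pair in zip(progression, progression[1:]))

def rejection_filter(candidates, threshold=-3.0):
    """Keep only proposals the prior deems coherent enough."""
    return [c for c in candidates if prior_log_prob(c) >= threshold]

kept = rejection_filter(CANDIDATES)
```

Here the noisy second candidate is discarded because all of its transitions are unseen by the prior, while the two conventional progressions survive; the actual system replaces both the proposal and scoring steps with learned models.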

Songs Created by the Participants


We conducted a user study with 10 songwriters to evaluate Amuse's effectiveness in supporting songwriting with multimodal inspirations. Participants created 8-bar choruses for two prompts: their favorite summer holiday memory and the beginning of an unexpected friendship. Each participant wrote two songs, one with Amuse's assistance and the other without. The following sound clips showcase the songs created by the participants. For more details, please refer to Sections 7-8 of the paper.

Listening Study Audio Samples


We conducted a listening study to assess the musical coherence and keyword relevance of chord progressions generated by our rejection sampling-based chord generation method. Participants evaluated pairs of chord progressions, with each pair generated by two of the following methods: LSTM Prior, GPT-4o, and Amuse (Ours). We list the selected audio samples used for the listening study. Further details can be found in Section 6.2 of the paper.

Musical Coherence

Audio samples 1-5 for each of: LSTM Prior, GPT-4o, and Amuse (Ours).

Keyword Relevance

Audio samples for each method (LSTM Prior, GPT-4o, Amuse (Ours)) under five keyword sets:
energetic, dance pop, disco
acoustic, folk, country
smooth, jazz, swing
bossa nova, latin jazz, samba
emotional, ballad, sad

Citation


@article{kim2024amuse,
    title={Amuse: Human-AI Collaborative Songwriting with Multimodal Inspirations},
    author={Kim, Yewon and Lee, Sung-Ju and Donahue, Chris},
    year={2024},
    journal={arXiv preprint arXiv:2412.18940},
}