# AudioLDM2

**Repository Path**: jvluo/AudioLDM2

## Basic Information

- **Project Name**: AudioLDM2
- **Description**: No description available
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-06-21
- **Last Updated**: 2026-06-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# AudioLDM 2

[![arXiv](https://img.shields.io/badge/arXiv-2308.05734-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2308.05734)  [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://audioldm.github.io/audioldm2/)  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)  

This repo currently support Text-to-Audio (including Music), Text-to-Speech Generation and Super Resolution Inpainting.

<hr>

## Change Log

- 2023-08-27: Add two new checkpoints! 
  - 🌟 **48kHz AudioLDM model**: Now we support high-fidelity audio generation! [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/AudioLDM_48K_Text-to-HiFiAudio_Generation)  
  - **16kHz improved AudioLDM model**: Trained with more data and optimized model architecture.

## TODO

- [x] Add the text-to-speech checkpoint
- [x] Open-source the [AudioLDM training code](https://github.com/haoheliu/AudioLDM-training-finetuning).
- [x] Support the generation of longer audio (> 10s)
- [x] Optimizing the inference speed of the model.
- [x] Integration with the Diffusers library (see [🧨 Diffusers](#hugging-face--diffusers))
- [ ] Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PR welcomed, same logic as [AudioLDMv1](https://github.com/haoheliu/AudioLDM))

## Web APP

1. Prepare running environment

```shell
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://gitee.com/jvluo/AudioLDM2.git
git clone https://gitee.com/jvluo/AudioLDM2.git; cd AudioLDM2
```

2. Start the web application (powered by Gradio)

```shell
python3 app.py
```

3. A link will be printed out. Click the link to open the browser and play.

## Commandline Usage

### Installation

Prepare running environment

```shell
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
```

If you plan to play around with text-to-speech generation. Please also make sure you have installed [espeak](https://espeak.sourceforge.net/download.html). On linux you can do it by

```shell
sudo apt-get install espeak
```

### Run the model in commandline

- Generate sound effect or Music based on a text prompt

```shell
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```

- Generate sound effect or music based on a list of text

```shell
audioldm2 -tl batch.lst
```

- Generate speech based on (1) the transcription and (2) the description of the speaker

```shell
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"
```

Text-to-Speech use the *audioldm2-speech-gigaspeech* checkpoint by default. If you like to run TTS with LJSpeech pretrained checkpoint, simply set *--model_name audioldm2-speech-ljspeech*.

## Random Seed Matters

Sometimes model may not perform well (sounds weird or low quality) when changing into a different hardware. In this case, please adjust the random seed and find the optimal one for your hardware. 

```shell
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```

## Pretrained Models

You can choose model checkpoint by setting up "model_name":

```shell
# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```

We have five checkpoints you can choose:

1. **audioldm2-full** (default): Generate both sound effect and music generation with the AudioLDM2 architecture.
2. **audioldm_48k**: This checkpoint can generate high fidelity sound effect and music.
3. **audioldm_16k_crossattn_t5**: The improved version of [AudioLDM 1.0](https://github.com/haoheliu/AudioLDM).
4. **audioldm2-full-large-1150k**: Larger version of audioldm2-full. 
5. **audioldm2-music-665k**: Music generation. 
6. **audioldm2-speech-gigaspeech** (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.
7. **audioldm2-speech-ljspeech**: Text-to-Speech, trained on LJSpeech Dataset.

We currently support 3 devices:

- cpu
- cuda
- mps ( Notice that the computation requires about 20GB of RAM. )

## Other options

```shell
  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
                 [--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
                 [--seed SEED] [--mode {generation, sr_inpainting}] [-f FILE_PATH]

  optional arguments:
    -h, --help            show this help message and exit
    --mode {generation,sr_inpainting}
                        generation: text-to-audio generation; sr_inpainting: super resolution inpainting 
    -t TEXT, --text TEXT  Text prompt to the model for audio generation
    -f FILE_PATH, --file_path FILE_PATH
                        (--mode sr_inpainting): Original audio file for inpainting; Or 
                        (--mode generation): the guidance audio file for generating similar audio, DEFAULT None
    --transcription TRANSCRIPTION
                        Transcription used for speech synthesis
    -tl TEXT_LIST, --text_list TEXT_LIST
                          A file that contains text prompt to the model for audio generation
    -s SAVE_PATH, --save_path SAVE_PATH
                          The path to save model output
    --model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
                          The checkpoint you gonna use
    -d DEVICE, --device DEVICE
                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
    -b BATCHSIZE, --batchsize BATCHSIZE
                          Generate how many samples at the same time
    --ddim_steps DDIM_STEPS
    -dur DURATION, --duration DURATION
                        The duration of the samples
                          The sampling step for DDIM
    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                          Guidance scale (Large => better quality and relavancy to text; Small => better diversity)
    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                          Automatic quality control. This number control the number of candidates (e.g., generate three audios and choose the best to show you). A Larger value usually lead to better quality with
                          heavier computation
    --seed SEED           Change this value (any integer number) will lead to a different generation result.
```

## Hugging Face 🧨 Diffusers

AudioLDM 2 is available in the Hugging Face [🧨 Diffusers](https://github.com/huggingface/diffusers) library from v0.21.0 
onwards. The official checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/cvssp/audioldm2#checkpoint-details), 
alongside [documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2) and 
[examples scripts](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb).

The Diffusers version of the code runs upwards of **3x faster** than the native AudioLDM 2 implementation, and supports 
generating audios of arbitrary length.

To install 🧨 Diffusers and 🤗 Transformers, run:

```bash
pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate
```

You can then load pre-trained weights into the [AudioLDM2 pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2),
and generate text-conditional audio outputs by providing a text prompt:

```python
from diffusers import AudioLDM2Pipeline
import torch
import scipy

repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```

Tips for obtaining high-quality generations can be found under the AudioLDM 2 [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#tips),
including the use of prompt engineering and negative prompting.

Tips for optimising inference speed can be found in the blog post [AudioLDM 2, but faster ⚡️](https://huggingface.co/blog/audioldm2).

## Cite this work

If you found this tool useful, please consider citing

```bibtex
@article{audioldm2-2024taslp,
  author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining}, 
  year={2024},
  volume={32},
  pages={2871-2883},
  doi={10.1109/TASLP.2024.3399607}
}
```

```bibtex
@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
  pages={21450-21474}
}
```