URL: https://huggingface.co/sesame/csm-1b

sesame/csm-1b · Hugging Face


CSM 1B

2025/05/20 - CSM is available natively in Hugging Face Transformers 🤗 as of version 4.52.1

2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.


CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
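
To make "RVQ audio codes" concrete, here is a toy sketch of residual vector quantization in plain Python. Each level quantizes whatever the previous level left over, so a frame ends up as one integer code per codebook. The codebooks, their sizes, and the helper names (`rvq_encode`, `rvq_decode`, `TOY_CODEBOOKS`) are invented for illustration. Mimi's real codebooks are learned and much larger, so read this as a picture of the idea rather than of the codec itself.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def _nearest(codebook: Codebook, target: Vector) -> int:
    """Return the index of the codebook entry closest to target (squared L2)."""
    def dist(entry: Vector) -> float:
        return sum((e - t) ** 2 for e, t in zip(entry, target))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vector level by level; each level encodes the remaining residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = _nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next level only sees what is left.
        residual = [r - e for r, e in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct an approximation by summing the selected entry of every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + e for o, e in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks over 2-D vectors (illustration only).
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level
]
```

Encoding `[1.2, 1.0]` with these toy codebooks picks the coarse entry `[1.0, 1.0]` and then the fine entry `[0.25, 0.0]`, so the two integer codes together stand in for the original vector.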

A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.

A hosted Hugging Face Space is also available for testing audio generation.

Usage

Generate a sentence

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]Hello from Sesame." # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Hello from Sesame."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")

CSM sounds best when provided with context

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")

Batched Inference 📦

CSM supports batched inference.
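
The example for this section did not survive on this page, so the following is a minimal sketch under stated assumptions, not the authors' code. It assumes `apply_chat_template` accepts a list of conversations (one per batch item) and that `processor.save_audio` accepts a list of output paths. Both match the Transformers CSM documentation as far as I know, but should be checked against the installed version. `build_batch` and `generate_batch` are hypothetical helper names.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def build_batch(rows: Sequence[Row]) -> List[List[Dict[str, Any]]]:
    """Build two prompts: one with an audio context turn, one text-only."""
    def turn(row: Row, with_audio: bool) -> Dict[str, Any]:
        content = [{"type": "text", "text": row["text"]}]
        if with_audio:
            content.append({"type": "audio", "path": row["audio"]["array"]})
        return {"role": str(row["speaker_id"]), "content": content}

    return [
        [turn(rows[0], True), turn(rows[1], False)],  # context + text prompt
        [turn(rows[0], False)],                       # text prompt only
    ]


def generate_batch(model_id: str = "sesame/csm-1b") -> None:
    """Load CSM, run one batched generate call, and write one WAV per prompt."""
    # Heavy imports stay inside the function so build_batch can be used alone.
    import torch
    from datasets import Audio, load_dataset
    from transformers import AutoProcessor, CsmForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # ensure 24kHz audio

    inputs = processor.apply_chat_template(
        build_batch([ds[0], ds[1]]),
        tokenize=True,
        return_dict=True,
    ).to(device)

    audio = model.generate(**inputs, output_audio=True)
    # Assumption: save_audio takes a list of paths, one per batch item.
    processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```

Calling `generate_batch()` downloads the gated checkpoint, so it requires having accepted the model's access conditions.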

Making The Model Go Brrr 🏎️

CSM supports full-graph compilation with CUDA graphs!
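
As a rough sketch of how this is usually switched on: in Transformers, selecting a static KV cache in the generation config is what lets `generate` compile the forward pass. The attribute names below (`generation_config.cache_implementation`, and a `depth_decoder` sub-module with its own generation config) are assumptions based on the Transformers CSM documentation as I recall it. The helper name `use_static_cache` and the `max_length` value of 250 are illustrative. Verify all of this against your installed version before relying on it.

```python
from types import SimpleNamespace


def use_static_cache(model, max_length: int = 250):
    """Configure a model-like object for static-cache generation.

    A fixed max_length keeps tensor shapes stable so the compiled graph
    is not rebuilt between calls.
    """
    model.generation_config.max_length = max_length
    # max_new_tokens would take precedence over max_length, so clear it.
    model.generation_config.max_new_tokens = None
    model.generation_config.cache_implementation = "static"
    # Assumption: the audio decoder is exposed as `depth_decoder`.
    model.depth_decoder.generation_config.cache_implementation = "static"
    return model


# Dry run on a stand-in object so the helper can be checked without a GPU.
stub = SimpleNamespace(
    generation_config=SimpleNamespace(),
    depth_decoder=SimpleNamespace(generation_config=SimpleNamespace()),
)
use_static_cache(stub)
```

With a real model, call `use_static_cache(model)` once after loading and before the first `model.generate(...)`. The first few generations are typically slower while compilation happens.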

Fine-tuning & training 📉

CSM can be fine-tuned using Transformers' Trainer.
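
The sketch below shows a single manual training step rather than a full `Trainer` setup, to illustrate where the loss comes from. It assumes that `apply_chat_template` accepts `output_labels=True` and then returns labels alongside the inputs, so that the forward pass yields a loss. That flag is taken from the Transformers CSM documentation as I understand it and should be verified. `build_training_conversation` and `training_step` are hypothetical helper names.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def build_training_conversation(rows: Sequence[Row]) -> List[Dict[str, Any]]:
    """Every turn carries both text and audio, so each turn has an audio target."""
    return [
        {
            "role": str(row["speaker_id"]),
            "content": [
                {"type": "text", "text": row["text"]},
                {"type": "audio", "path": row["audio"]["array"]},
            ],
        }
        for row in rows
    ]


def training_step(model, processor, conversation, device):
    """Run one forward/backward pass and return the loss (no optimizer step)."""
    model.train()
    inputs = processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
        output_labels=True,  # assumption: asks the processor to also build labels
    ).to(device)
    out = model(**inputs)
    out.loss.backward()
    return out.loss
```

In practice the same inputs would be produced inside a data collator and handed to `Trainer`, which then owns the optimizer and the backward pass.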


FAQ

Does this model come with any voices?

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

Can I converse with the model?

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
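
One way to wire this up, sketched under assumptions: a text LLM writes the reply, and the reply is then tagged with a speaker id in the same `[0]...` format used in the usage example before being passed to CSM. The names `format_turn`, `reply_prompt`, and `echo_llm` are hypothetical, and `echo_llm` is only a stand-in for whichever text model you choose.

```python
from typing import Callable, List, Tuple

Turn = Tuple[int, str]  # (speaker_id, text)


def format_turn(speaker_id: int, text: str) -> str:
    """Prefix text with the `[speaker_id]` tag used by the text-prompt format."""
    return f"[{speaker_id}]{text}"


def reply_prompt(
    history: List[Turn],
    generate_text: Callable[[List[Turn]], str],
    bot_speaker_id: int = 1,
) -> str:
    """Ask the text model for a reply and return it as a CSM text prompt."""
    reply = generate_text(history)
    return format_turn(bot_speaker_id, reply)


def echo_llm(history: List[Turn]) -> str:
    """Stand-in for a real LLM call: repeats the last user message."""
    return f"You said: {history[-1][1]}"
```

The string returned by `reply_prompt` can be passed to the processor exactly like the `text` variable in the first usage example.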

Does it support other languages?

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

  • Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
  • Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
  • Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.

Authors: Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
