Raon-Speech-9B

Demo | Technical Report | Blog (Coming soon)

Raon-Speech is a 9B-parameter speech language model (SpeechLM) that delivers state-of-the-art speech understanding, answering, and generation in English and Korean. It turns a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the LLM's original language capabilities. The model is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.

Key Features

End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
Multi-Reward DPO Post-Training: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
HuggingFace Transformers Integration: Load and run directly via AutoModel.from_pretrained with trust_remote_code=True — no custom package installation required.

Benchmark Results

Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks.

[Figure: Raon-Speech Benchmark Results]

Latency was measured on LibriSpeech test-clean samples using streaming TTS on single-GPU setups. All values are averages.

Metric	RTX 6000 Pro	L40S
RTF	0.27 (3.7× real-time)	0.45 (2.2× real-time)
TTFT	617 ms	887 ms
TBT	135 ms	233 ms

RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
TTFT (Time to First Token): Latency until the first audio chunk is returned.
TBT (Time Between Tokens): Average interval between consecutive audio chunks.
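
As a worked example of the RTF definition above, the sketch below computes RTF as synthesis time divided by the duration of the generated audio, then converts it to the "× real-time" figure used in the table. The function names and the 10-second example run are illustrative assumptions, not part of the benchmark harness.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def real_time_multiple(rtf: float) -> float:
    """Seconds of audio produced per second of compute, i.e. 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical run: 2.7 s of compute for 10 s of audio.
rtf = real_time_factor(2.7, 10.0)
print(f"RTF {rtf:.2f} -> {real_time_multiple(rtf):.1f}x real-time")
```

Applying `real_time_multiple` to the table's RTF values (0.27 and 0.45) reproduces the listed 3.7× and 2.2× figures.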

Requirements

pip install 'transformers>=4.57.1,<5.0' torch torchaudio soundfile accelerate

# Optional
pip install speechbrain # for TTS with speaker voice conditioning
pip install gradio # for Gradio demo

Quick Start

Option 1: Load from Hub (recommended)

No pip install raon needed: the pipeline class is loaded from the custom code in the Hub repository.

from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

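# Fetch the config first so the remote code can be pinned to the same commit.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

# Load the RaonPipeline class from the custom code in the Hub repository.
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(config, "_commit_hash", None),
)

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")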

Option 2: With raon package installed

git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e . # or: uv sync

from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B")

# From local path
pipe = RaonPipeline("/path/to/raon-model")

Tasks

STT (Audio → Text)

text = pipe.stt("audio.wav")
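
To transcribe several recordings in one pass, a small wrapper can loop over a folder. This is a minimal sketch that assumes `pipe.stt` accepts a file path and returns a string, as in the call above; `transcribe_folder` and the `recordings/` folder are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_folder(folder: str, transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Run `transcribe` on every .wav file in `folder`, keyed by file name."""
    results: Dict[str, str] = {}
    # Sort for a deterministic order; files without a .wav suffix are skipped.
    for wav_path in sorted(Path(folder).glob("*.wav")):
        results[wav_path.name] = transcribe(str(wav_path))
    return results


# Usage (assumes `pipe` was created as in Quick Start):
# transcripts = transcribe_folder("recordings/", pipe.stt)
# for name, text in transcripts.items():
#     print(name, text)
```

Passing the transcription function as an argument keeps the helper independent of how the pipeline was loaded.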

TTS (Text → Audio)

# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")

TextQA (Text + Audio → Text)

answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")

SpeechChat (Audio → Text)

answer = pipe.speech_chat("question.wav")

Chat (Multimodal)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
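
The message structure above can be assembled with a small helper so that the audio part is optional. This is a sketch only: `build_user_message` is a hypothetical name, and whether `pipe.chat` accepts a text-only message is an assumption not confirmed here.

```python
from typing import Any, Dict, List, Optional


def build_user_message(text: str, audio_path: Optional[str] = None) -> List[Dict[str, Any]]:
    """Build a single-turn message list in the format shown above."""
    content: List[Dict[str, Any]] = []
    if audio_path is not None:
        # The audio part comes first, matching the example above.
        content.append({"type": "audio", "audio": audio_path})
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


messages = build_user_message("Transcribe and summarise this audio.", audio_path="audio.wav")
# response = pipe.chat(messages)  # assumes `pipe` was created as in Quick Start
```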

Deployment (vLLM-Omni)

1. Clone & Build

git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .

2. Serve

docker run --rm --gpus all \
 --shm-size=16g \
 -p 8000:8000 \
 vllm-omni \
 bash -c "vllm serve KRAFTON/Raon-Speech-9B --omni --port 8000 --trust-remote-code"

3. Test — TTS

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "response_format": "wav"
  }' --output output.wav

4. Test — TTS with voice cloning

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "ref_audio": "data:audio/wav;base64,'"$(base64 -w0 speaker_ref.wav)"'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
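
The same TTS requests can be sent from Python, which also avoids relying on the GNU-specific `base64 -w0` flag. The sketch below uses only the standard library and mirrors the JSON fields of the curl commands above; `build_tts_payload` and `synthesize` are hypothetical helper names, and the server is assumed to be running at `http://localhost:8000` as started in step 2.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, Optional

MODEL_ID = "KRAFTON/Raon-Speech-9B"


def build_tts_payload(text: str, ref_audio_bytes: Optional[bytes] = None) -> Dict[str, Any]:
    """Build the JSON body for /v1/audio/speech, optionally with a reference voice."""
    payload: Dict[str, Any] = {"input": text, "model": MODEL_ID, "response_format": "wav"}
    if ref_audio_bytes is not None:
        # Reference audio is sent inline as a base64 data URI, as in the curl example.
        encoded = base64.b64encode(ref_audio_bytes).decode("ascii")
        payload["ref_audio"] = f"data:audio/wav;base64,{encoded}"
        payload["task_type"] = "Base"
    return payload


def synthesize(text: str, out_path: str, ref_audio_path: Optional[str] = None,
               base_url: str = "http://localhost:8000") -> None:
    """POST a TTS request and write the returned WAV bytes to `out_path`."""
    ref_bytes = None
    if ref_audio_path is not None:
        with open(ref_audio_path, "rb") as f:
            ref_bytes = f.read()
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_tts_payload(text, ref_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Usage (requires the server from step 2 to be running):
# synthesize("Hello, how are you?", "output.wav")
# synthesize("Hello, how are you?", "cloned.wav", ref_audio_path="speaker_ref.wav")
```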

5. Test — STT

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
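
For scripted use, the STT request can be built and its reply parsed in Python. This sketch mirrors the request body of the curl command above; the helper names are hypothetical, and the response is assumed to follow the OpenAI-style chat completion layout (`choices[0].message.content`), which is not confirmed here and should be checked against the actual server output.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

MODEL_ID = "KRAFTON/Raon-Speech-9B"


def build_stt_payload(audio_bytes: bytes,
                      prompt: str = "Transcribe the audio into text.") -> Dict[str, Any]:
    """Build the chat-completions body with the audio inlined as a data URI."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url",
                     "audio_url": {"url": f"data:audio/wav;base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def extract_text(response_body: Dict[str, Any]) -> str:
    """Assumption: OpenAI-style response with choices[0].message.content."""
    return response_body["choices"][0]["message"]["content"]


def transcribe(audio_path: str, base_url: str = "http://localhost:8000") -> str:
    """Send one WAV file to the server and return the transcribed text."""
    with open(audio_path, "rb") as f:
        body = json.dumps(build_stt_payload(f.read())).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_text(json.loads(response.read()))


# Usage (requires the server from step 2 to be running):
# print(transcribe("audio.wav"))
```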

Intended use

This checkpoint is suitable for:

bilingual English/Korean speech research,
speech QA and audio-understanding experiments,
TTS and speaker-conditioned TTS prototyping,
evaluation and serving work on open speech language models,
multimodal assistants that need both audio understanding and speech output.

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

Acknowledgement

The current release includes:

model weights,
Hugging Face custom code,
inference pipeline,
technical report,
demo links,
related GitHub repositories.

For exact architectural details, training hyperparameters, Korean benchmark construction, and the Raon-SpeechChat full-duplex extension, consult the technical report included in this repository.

Citation

@misc{raonspeech,
 title = {Raon-Speech Technical Report},
 author = {{KRAFTON}},
 month = {April},
 year = {2026}
}


URL: https://huggingface.co/KRAFTON/Raon-Speech-9B
