URL: https://huggingface.co/KRAFTON/Raon-Speech-9B

Raon-Speech-9B

๐Ÿ‘ Homepage
๐Ÿ‘ GitHub

๐Ÿ‘ Hugging Face
๐Ÿ‘ X

๐Ÿ‘ License

Demo | Technical Report | Blog (Coming soon)

Raon-Speech is a 9B-parameter speech language model that supports state-of-the-art speech understanding, answering, and generation in English and Korean. It turns a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the LLM's original language capabilities. It is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.

Key Features

  • End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
  • Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
  • Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
  • Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
  • TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
  • Multi-Reward DPO Post-Training: Three-stage training pipeline for high-quality speech generation: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.
  • HuggingFace Transformers Integration: Load and run directly via AutoModel.from_pretrained with trust_remote_code=True; no custom package installation required.

Benchmark Results

Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks.

Measured with LibriSpeech test-clean samples on single-GPU setups via streaming TTS. All values are averaged.

Metric | RTX 6000 Pro | L40S
RTF | 0.27 (3.7× real-time) | 0.45 (2.2× real-time)
TTFT | 617 ms | 887 ms
TBT | 135 ms | 233 ms
  • RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
  • TTFT (Time to First Token): Latency until the first audio chunk is returned.
  • TBT (Time Between Tokens): Average interval between consecutive audio chunks.
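
To make the relationship between these three metrics concrete, here is a minimal sketch that derives them from chunk arrival timestamps. The function name `streaming_metrics` and its arguments are hypothetical and not part of the Raon-Speech code; the measurement harness behind the table above may differ in detail.

```python
from statistics import mean


def streaming_metrics(request_start, chunk_arrivals, audio_seconds):
    """Derive RTF, TTFT and TBT from wall-clock timestamps (all in seconds).

    request_start  -- time at which the TTS request was sent
    chunk_arrivals -- arrival time of each audio chunk, in order
    audio_seconds  -- total duration of the synthesized audio
    """
    if not chunk_arrivals or audio_seconds <= 0:
        raise ValueError("need at least one chunk and a positive audio duration")
    # TTFT: latency until the first audio chunk is returned.
    ttft = chunk_arrivals[0] - request_start
    # TBT: average interval between consecutive chunks (0 if only one chunk).
    gaps = [b - a for a, b in zip(chunk_arrivals, chunk_arrivals[1:])]
    tbt = mean(gaps) if gaps else 0.0
    # RTF: total synthesis time divided by the duration of the audio produced.
    rtf = (chunk_arrivals[-1] - request_start) / audio_seconds
    return {"rtf": rtf, "ttft": ttft, "tbt": tbt, "speedup": 1.0 / rtf}
```

The speedup shown in parentheses in the table is the reciprocal of the RTF: 1 / 0.27 ≈ 3.7 and 1 / 0.45 ≈ 2.2.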

Requirements

pip install 'transformers>=4.57.1,<5.0' torch torchaudio soundfile accelerate

# Optional
pip install speechbrain # for TTS with speaker voice conditioning
pip install gradio # for Gradio demo

Quick Start

Option 1: Load from Hub (recommended)

No pip install raon needed.

from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
 "modeling_raon.RaonPipeline",
 MODEL_ID,
 revision=getattr(config, "_commit_hash", None),
)

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")

Option 2: With raon package installed

git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e . # or: uv sync

from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B")

# From local path
pipe = RaonPipeline("/path/to/raon-model")

Tasks

STT (Audio → Text)

text = pipe.stt("audio.wav")
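
For more than one file, a small wrapper around `pipe.stt` may be convenient. The sketch below assumes only what the call above shows, namely that `pipe.stt` accepts a file path and returns the transcript; the helper name `transcribe_files` is hypothetical.

```python
def transcribe_files(pipe, paths):
    """Run pipe.stt on each path and collect the transcripts keyed by path.

    A failed file is recorded as None so that one bad input does not
    abort the whole batch.
    """
    results = {}
    for path in paths:
        try:
            results[str(path)] = pipe.stt(str(path))
        except Exception as exc:  # broad on purpose: error types are not documented here
            print(f"skipping {path}: {exc}")
            results[str(path)] = None
    return results
```

Usage: `transcripts = transcribe_files(pipe, ["a.wav", "b.wav"])`.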

TTS (Text → Audio)

# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")

TextQA (Text + Audio → Text)

answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")

SpeechChat (Audio → Text)

answer = pipe.speech_chat("question.wav")

Chat (Multimodal)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
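
When building such messages programmatically, a small helper keeps the nesting consistent. This is only a sketch: `audio_text_message` is a hypothetical name, and the dictionary layout simply mirrors the example above.

```python
def audio_text_message(audio_path, prompt, role="user"):
    """Build one chat message that pairs an audio file with a text prompt."""
    return {
        "role": role,
        "content": [
            {"type": "audio", "audio": str(audio_path)},
            {"type": "text", "text": prompt},
        ],
    }


# Produces the same structure as the hand-written list above.
messages = [audio_text_message("audio.wav", "Transcribe and summarise this audio.")]
```

The resulting list can be passed to `pipe.chat(messages)` as before.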

Deployment (vLLM-Omni)

1. Clone & Build

git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .

2. Serve

docker run --rm --gpus all \
 --shm-size=16g \
 -p 8000:8000 \
 vllm-omni \
 bash -c "vllm serve KRAFTON/Raon-Speech-9B --omni --port 8000 --trust-remote-code"

3. Test: TTS

curl -X POST http://localhost:8000/v1/audio/speech \
 -H "Content-Type: application/json" \
 -d '{
 "input": "Hello, how are you?",
 "model": "KRAFTON/Raon-Speech-9B",
 "response_format": "wav"
 }' --output output.wav

4. Test: TTS with voice cloning

curl -X POST http://localhost:8000/v1/audio/speech \
 -H "Content-Type: application/json" \
 -d '{
 "input": "Hello, how are you?",
 "model": "KRAFTON/Raon-Speech-9B",
 "ref_audio": "data:audio/wav;base64,'"$(base64 -w0 speaker_ref.wav)"'",
 "task_type": "Base",
 "response_format": "wav"
 }' --output cloned.wav
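
The same two TTS requests can be sent from Python using only the standard library. This is a sketch under assumptions: the endpoint and JSON fields are copied from the curl examples above, the server is assumed to return raw WAV bytes in the response body (as `--output` suggests), and the names `speech_payload` and `synthesize` are hypothetical.

```python
import base64
import json
import urllib.request


def speech_payload(text, ref_audio_bytes=None, model="KRAFTON/Raon-Speech-9B"):
    """Build the JSON body for /v1/audio/speech, mirroring the curl examples."""
    payload = {"input": text, "model": model, "response_format": "wav"}
    if ref_audio_bytes is not None:
        # Voice cloning: embed the reference audio as a base64 data URL.
        encoded = base64.b64encode(ref_audio_bytes).decode("ascii")
        payload["ref_audio"] = f"data:audio/wav;base64,{encoded}"
        payload["task_type"] = "Base"
    return payload


def synthesize(text, out_path, ref_audio_path=None,
               url="http://localhost:8000/v1/audio/speech"):
    """POST the request and write the response body to out_path."""
    ref = None
    if ref_audio_path is not None:
        with open(ref_audio_path, "rb") as f:
            ref = f.read()
    body = json.dumps(speech_payload(text, ref)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    # Assumption: the body of a successful response is the WAV file itself.
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

With the server from step 2 running, `synthesize("Hello, how are you?", "cloned.wav", ref_audio_path="speaker_ref.wav")` corresponds to the voice-cloning request.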

5. Test: STT

curl -X POST http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "KRAFTON/Raon-Speech-9B",
 "messages": [
 {
 "role": "user",
 "content": [
 {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
 {"type": "text", "text": "Transcribe the audio into text."}
 ]
 }
 ]
 }'
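
A Python equivalent of the STT request might look like the sketch below. The request body follows the curl example; the response parsing assumes the OpenAI-style chat completion layout (`choices[0].message.content`), which this document does not show, so treat `extract_text` as an assumption to check against a real response. All function names are hypothetical.

```python
import base64
import json
import urllib.request


def stt_payload(audio_bytes, prompt="Transcribe the audio into text.",
                model="KRAFTON/Raon-Speech-9B"):
    """Build the chat-completions body with the audio inlined as a data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url",
                     "audio_url": {"url": f"data:audio/wav;base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def extract_text(response_json):
    """Pull the transcript out of the response (assumed OpenAI-style layout)."""
    return response_json["choices"][0]["message"]["content"]


def transcribe(audio_path, url="http://localhost:8000/v1/chat/completions"):
    """Send one audio file for transcription and return the text."""
    with open(audio_path, "rb") as f:
        body = json.dumps(stt_payload(f.read())).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```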

Intended use

This checkpoint is suitable for:

  • bilingual English/Korean speech research,
  • speech QA and audio-understanding experiments,
  • TTS and speaker-conditioned TTS prototyping,
  • evaluation and serving work on open speech language models,
  • multimodal assistants that need both audio understanding and speech output.

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

Acknowledgement

The current release includes:

  • model weights,
  • Hugging Face custom code,
  • inference pipeline,
  • technical report,
  • demo links,
  • related GitHub repositories.

For exact architectural details, training hyperparameters, Korean benchmark construction, and the Raon-SpeechChat full-duplex extension, consult the technical report included in this repository.

Citation

@misc{raonspeech,
 title = {Raon-Speech Technical Report},
 author = {{KRAFTON}},
 month = {April},
 year = {2026}
}

© 2026 KRAFTON
