Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that handles text, vision, and speech in a single auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities. Alongside established text capabilities, it covers vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, all within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.

Technical Report

HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)

Basic Information

Architecture: Transformer-based omni-model architecture (dense model)
Parameters: 8B
Input Format: Text/Image/Video/Audio (Speech)
Output Format: Text/Image/Audio (Speech)
Context Length: 32K
Knowledge Cutoff: May 2025

Benchmarks

👁 테크니컬 리포트 05_2@2x

Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
Vision-to-Text: SEED-IMG, AI2D, K-MMBench
Text-to-Vision: GenEval, ImgEdit
Audio-to-Text: LibriSpeech, KsponSpeech
Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

[Image: hf_img01]

Text-based Image Editing

[Images: hf_img02, hf_img03, hf_img04]

Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

Inputs: Text, Image, Audio, Video
Outputs: Text, Image, Audio (no video generation)

Requirements

4x NVIDIA A100 80GB
Docker & Docker Compose
NVIDIA Driver 525+, CUDA 12.1+
S3-compatible storage (for image/audio output)
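
The GPU requirement above can be checked before building the containers. The following is a minimal sketch and not part of OmniServe: it parses the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and counts GPUs that report enough memory. The helper names and the 80,000 MiB threshold are assumptions made for illustration.

```python
import subprocess

# Standard nvidia-smi query: one "name, memory_in_MiB" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text):
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def meets_requirement(gpus, count=4, min_mib=80_000):
    """True if at least `count` GPUs report `min_mib` MiB or more.

    The defaults mirror "4x NVIDIA A100 80GB"; the MiB threshold is an
    assumed approximation of 80GB, not an official figure.
    """
    return sum(1 for _, mem in gpus if mem >= min_mib) >= count


def local_gpus():
    """Query the local machine (requires nvidia-smi on PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(out.stdout)
```

On a GPU host, `meets_requirement(local_gpus())` gives a quick yes/no answer. It does not check the driver or CUDA version.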

Installation

# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
 --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
 --input ./models/HyperCLOVAX-SEED-Omni-8B \
 --output ./track_b \
 --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d

Basic Usage

from openai import OpenAI

# Point the OpenAI client at the local OmniServe endpoint (Track B).
client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed",
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)

print(response.choices[0].message.content)
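
The example above passes a public image URL. For a local file, the OpenAI chat format also allows a base64 `data:` URL in the same `image_url` field. Whether OmniServe accepts data URLs is an assumption here, not something this page confirms, and the helper names below are hypothetical.

```python
import base64
import mimetypes


def to_data_url(data: bytes, mime: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, question: str) -> dict:
    """Build a user message pairing a local image file with a text question."""
    # Fall back to a generic MIME type if the extension is unknown.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        url = to_data_url(f.read(), mime)
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": question},
        ],
    }
```

The returned dictionary can be used as the single entry of `messages` in the call shown above.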

Architecture

                             User Request
                       (Image/Audio/Video/Text)
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                OmniServe                                │
│                       POST /b/v1/chat/completions                       │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                        [1] INPUT ENCODING                        │   │
│  │                                                                  │   │
│  │      ┌─────────────────┐             ┌─────────────────┐         │   │
│  │      │ Vision Encoder  │             │  Audio Encoder  │         │   │
│  │      └────────┬────────┘             └────────┬────────┘         │   │
│  │               │                               │                  │   │
│  │               └───────────────┬───────────────┘                  │   │
│  │                               │ embeddings                       │   │
│  └───────────────────────────────┼──────────────────────────────────┘   │
│                                  ▼                                      │
│                           ┌──────────────┐                              │
│                           │   LLM (8B)   │◀──── text                    │
│                           └──────┬───────┘                              │
│                                  │                                      │
│  ┌───────────────────────────────┼──────────────────────────────────┐   │
│  │                       [2] OUTPUT DECODING                        │   │
│  │                               │                                  │   │
│  │                ┌──────────────┼──────────────┐                   │   │
│  │                ▼              ▼              ▼                   │   │
│  │          ┌───────────┐  ┌───────────┐  ┌───────────┐             │   │
│  │          │   Text    │  │  Vision   │  │   Audio   │             │   │
│  │          │           │  │  Decoder  │  │  Decoder  │             │   │
│  │          └───────────┘  └─────┬─────┘  └─────┬─────┘             │   │
│  │                               │              │                   │   │
│  │                               ▼              ▼                   │   │
│  │                           Image URL      Audio URL               │   │
│  │                              (S3)           (S3)                 │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                               Response
                    (Text / Image URL / Audio URL)

Hardware Requirements

Component	GPU	VRAM
Vision Encoder	1x	~8GB
Audio Encoder	(shared)	~4GB
LLM (8B)	1x	~16GB
Vision Decoder	1x	~16GB
Audio Decoder	(shared)	~4GB
Total	3x	~48GB
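
The totals in the last row follow from the rows above it. The short check below only restates the table; it assumes the two "(shared)" components add no GPUs of their own, which is how the 3x total reads.

```python
# Approximate VRAM per component in GB, copied from the table above.
vram_gb = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

# Components listed with their own GPU; "(shared)" rows are excluded.
dedicated_gpus = {"Vision Encoder": 1, "LLM (8B)": 1, "Vision Decoder": 1}

total_vram = sum(vram_gb.values())
total_gpus = sum(dedicated_gpus.values())

print(total_vram, total_gpus)  # prints: 48 3
```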

Key Parameters

Parameter	Description	Default
`chat_template_kwargs.skip_reasoning`	Skip reasoning	`true`
`max_tokens`	Max output tokens	-
`temperature`	Sampling temperature	0.7
`tools`	Required for image generation	-

S3 Configuration

Required for image/audio generation:

NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
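
A missing variable here is easy to overlook until the first image or audio request fails. The following is a minimal sketch of a pre-flight check. The four variable names come from the listing above, `missing_s3_vars` is a hypothetical helper, and how OmniServe itself loads `.env` is not assumed.

```python
import os

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env=None):
    """Return the required S3 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]
```

Calling `missing_s3_vars()` in the shell environment that launches Docker Compose returns an empty list when everything is set.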

For more details, see the OmniServe documentation.

Citation

To be updated (Technical Report)

Questions

For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.

License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.

URL: https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B
