Overview
HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities. Within a 32K context window, it covers established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.
Technical Report
Basic Information
- Architecture: Transformer-based omni-model architecture (Dense Model)
- Parameters: 8B
- Input Format: Text/Image/Video/Audio (Speech)
- Output Format: Text/Image/Audio (Speech)
- Context Length: 32K
- Knowledge Cutoff: May 2025
Benchmarks
- Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
- Vision-to-Text: SEED-IMG, AI2D, K-MMBench
- Text-to-Vision: GenEval, ImgEdit
- Audio-to-Text: Librispeech, Ksponspeech
- Audio-to-Audio: Fleurs en2ko, Fleurs ko2en
Examples
Text-to-Image Generation
Text-based Image Editing
Inference
We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.
Capabilities
- Inputs: Text, Image, Audio, Video
- Outputs: Text, Image, Audio (no video generation)
Requirements
- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
- S3-compatible storage (for image/audio output)
Installation
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe
# Install dependencies
pip install huggingface_hub safetensors torch openai easydict
# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B
# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b
# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials
# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d
# Wait for model loading (~5 minutes)
docker compose logs -f omni
# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
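Watching the logs until the model has loaded can also be scripted. The following is a minimal sketch, not part of OmniServe: `wait_until_ready` is a hypothetical helper that retries an arbitrary probe callable until it stops raising or a timeout expires. What counts as a suitable probe is an assumption left to the caller.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], object],
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Call `probe` repeatedly until it stops raising or the timeout expires.

    Hypothetical helper for illustration; not part of OmniServe.
    Returns True once the probe succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            probe()
            return True
        except Exception:
            # Server not reachable yet: give up on timeout, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

One possible probe is `lambda: client.models.list()` with an OpenAI client pointed at `http://localhost:8000/b/v1`. This assumes the server exposes the OpenAI-style model listing route, which is not confirmed here.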
Basic Usage
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
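When several image questions are sent, the nested message structure above can be factored into a small helper. This is only a sketch: `make_image_message` is a hypothetical name, and it reproduces the same `image_url` plus `text` content layout used in the request above without adding any new fields.

```python
from typing import Any, Dict, List


def make_image_message(image_url: str, prompt: str) -> Dict[str, Any]:
    """Build one user message pairing an image URL with a text prompt.

    Hypothetical helper; mirrors the content layout of the request above.
    """
    content: List[Dict[str, Any]] = [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": prompt},
    ]
    return {"role": "user", "content": content}


messages = [
    make_image_message("https://example.com/image.jpg", "What is in this image?")
]
```

The resulting `messages` list can be passed as the `messages` argument of `client.chat.completions.create`.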
More Examples
Architecture
                            User Request
                      (Image/Audio/Video/Text)
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                             OmniServe                              │
│                    POST /b/v1/chat/completions                     │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      [1] INPUT ENCODING                      │  │
│  │                                                              │  │
│  │      ┌─────────────────┐           ┌─────────────────┐       │  │
│  │      │ Vision Encoder  │           │  Audio Encoder  │       │  │
│  │      └────────┬────────┘           └────────┬────────┘       │  │
│  │               │                             │                │  │
│  │               └──────────────┬──────────────┘                │  │
│  │                              │ embeddings                    │  │
│  └──────────────────────────────┼───────────────────────────────┘  │
│                                 ▼                                  │
│                          ┌──────────────┐                          │
│                          │   LLM (8B)   │◀──── text                │
│                          └──────┬───────┘                          │
│                                 │                                  │
│  ┌──────────────────────────────┼───────────────────────────────┐  │
│  │                     [2] OUTPUT DECODING                      │  │
│  │                              │                               │  │
│  │               ┌──────────────┼──────────────┐                │  │
│  │               ▼              ▼              ▼                │  │
│  │         ┌───────────┐  ┌───────────┐  ┌───────────┐          │  │
│  │         │   Text    │  │  Vision   │  │   Audio   │          │  │
│  │         │           │  │  Decoder  │  │  Decoder  │          │  │
│  │         └───────────┘  └─────┬─────┘  └─────┬─────┘          │  │
│  │                              │              │                │  │
│  │                              ▼              ▼                │  │
│  │                          Image URL      Audio URL            │  │
│  │                             (S3)           (S3)              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                              Response
                   (Text / Image URL / Audio URL)
Hardware Requirements
| Component | GPU | VRAM |
|---|---|---|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| Total | 3x | ~48GB |
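The total in the last row is the sum of the per-component figures. As a quick check, the snippet below copies the approximate VRAM values from the table and adds them up. The dictionary is just a restatement of the table, not an OmniServe data structure.

```python
# Approximate VRAM per component in GB, copied from the table above.
vram_gb = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

total_gb = sum(vram_gb.values())
print(f"Total: ~{total_gb}GB")  # prints "Total: ~48GB"
```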
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| chat_template_kwargs.skip_reasoning | Skip reasoning | true |
| max_tokens | Max output tokens | - |
| temperature | Sampling temperature | 0.7 |
| tools | Required for image generation | - |
S3 Configuration
Required for image/audio generation:
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
For more details, see the OmniServe documentation.
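Because image and audio output depend on all four S3 variables, it can help to check them before starting the services. The sketch below is an illustration only: `missing_s3_vars` is a hypothetical helper that reports which of the variable names listed above are unset or empty in the current environment.

```python
import os
from typing import List, Mapping

# Variable names taken from the S3 configuration above.
REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required S3 variable names that are unset or empty."""
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


missing = missing_s3_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Note that this reads the process environment. Values that exist only in the `.env` file are not visible unless they have been exported or loaded first.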
Citation
To be updated (Technical Report)
Questions
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.
License
The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.
