URL: https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B

naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B · Hugging Face


Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together. It is based on an auto-regressive Transformer architecture, which enables consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window. Alongside established text capabilities, these include vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.


Technical Report


Basic Information

  • Architecture: Transformer-based omni-model architecture (Dense Model)
  • Parameters: 8B
  • Input Format: Text/Image/Video/Audio (Speech)
  • Output Format: Text/Image/Audio (Speech)
  • Context Length: 32K
  • Knowledge Cutoff: May 2025

Benchmarks

[Image: Technical Report 05_2@2x]

  • Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
  • Vision-to-Text: SEED-IMG, AI2D, K-MMBench
  • Text-to-Vision: GenEval, ImgEdit
  • Audio-to-Text: LibriSpeech, KsponSpeech
  • Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

[Image: hf_img01]

Text-based Image Editing

[Image: hf_img02]
[Image: hf_img03]
[Image: hf_img04]


Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

  • Inputs: Text, Image, Audio, Video
  • Outputs: Text, Image, Audio (no video generation)

Requirements

  • 4x NVIDIA A100 80GB
  • Docker & Docker Compose
  • NVIDIA Driver 525+, CUDA 12.1+
  • S3-compatible storage (for image/audio output)
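
Before starting the installation, it can help to confirm that the basic tooling is reachable. The following is a minimal sketch, not part of OmniServe: the helper name `missing_commands` and the choice of `docker` and `nvidia-smi` as the commands to look for are assumptions made for illustration. It only checks that the executables are on PATH; it does not verify driver, CUDA, or Docker Compose versions.

```python
import shutil

# Assumed command names for this sketch; adjust to your environment.
REQUIRED_COMMANDS = ("docker", "nvidia-smi")


def missing_commands(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required commands were found on PATH")
```

The `which` parameter is injectable so the check can be exercised without the tools actually being installed.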

Installation

# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
 --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
 --input ./models/HyperCLOVAX-SEED-Omni-8B \
 --output ./track_b \
 --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d

Basic Usage

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed",
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)

print(response.choices[0].message.content)
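
When sending several image questions, the message structure from the example above can be factored into a small helper. This is a sketch only: `build_image_question` is a hypothetical name, and it simply reproduces the `image_url` plus `text` content layout shown in the request above.

```python
from typing import Any, Dict, List


def build_image_question(image_url: str, question: str) -> List[Dict[str, Any]]:
    """Build a single-turn `messages` list pairing one image URL with a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_image_question(
    "https://example.com/image.jpg",
    "What is in this image?",
)
# The list can then be passed as `messages=` to client.chat.completions.create(...).
```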

More Examples

Architecture

                         User Request
                   (Image/Audio/Video/Text)
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                          OmniServe                          │
│                 POST /b/v1/chat/completions                 │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                  [1] INPUT ENCODING                   │  │
│  │                                                       │  │
│  │    ┌─────────────────┐         ┌─────────────────┐    │  │
│  │    │ Vision Encoder  │         │  Audio Encoder  │    │  │
│  │    └────────┬────────┘         └────────┬────────┘    │  │
│  │             │                           │             │  │
│  │             └─────────────┬─────────────┘             │  │
│  │                           │ embeddings                │  │
│  └───────────────────────────┼───────────────────────────┘  │
│                              ▼                              │
│                       ┌──────────────┐                      │
│                       │   LLM (8B)   │◀──── text            │
│                       └──────┬───────┘                      │
│                              │                              │
│  ┌───────────────────────────┼───────────────────────────┐  │
│  │                  [2] OUTPUT DECODING                  │  │
│  │                           │                           │  │
│  │             ┌─────────────┼─────────────┐             │  │
│  │             ▼             ▼             ▼             │  │
│  │       ┌───────────┐ ┌───────────┐ ┌───────────┐       │  │
│  │       │   Text    │ │  Vision   │ │   Audio   │       │  │
│  │       │           │ │  Decoder  │ │  Decoder  │       │  │
│  │       └───────────┘ └─────┬─────┘ └─────┬─────┘       │  │
│  │                           │             │             │  │
│  │                           ▼             ▼             │  │
│  │                       Image URL     Audio URL         │  │
│  │                          (S3)          (S3)           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                           Response
                (Text / Image URL / Audio URL)

Hardware Requirements

Component        GPU        VRAM
Vision Encoder   1x         ~8GB
Audio Encoder    (shared)   ~4GB
LLM (8B)         1x         ~16GB
Vision Decoder   1x         ~16GB
Audio Decoder    (shared)   ~4GB
Total            3x         ~48GB
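
As a quick check, the Total row follows from the per-component figures. The sketch below copies the approximate VRAM values from the table and counts the components listed with a dedicated GPU. The variable names are illustrative, and the table does not say which GPU hosts the components marked "(shared)", so no placement is assumed.

```python
# Approximate VRAM per component in GB, copied from the table above.
VRAM_GB = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

# Components the table lists with their own GPU ("1x"); the audio
# encoder and audio decoder are marked "(shared)".
DEDICATED_GPU = ("Vision Encoder", "LLM (8B)", "Vision Decoder")

total_vram_gb = sum(VRAM_GB.values())
dedicated_gpus = len(DEDICATED_GPU)

print(f"Total VRAM: ~{total_vram_gb}GB across {dedicated_gpus} GPUs")
```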

Key Parameters

Parameter                             Description                     Default
chat_template_kwargs.skip_reasoning   Skip reasoning                  true
max_tokens                            Max output tokens               -
temperature                           Sampling temperature            0.7
tools                                 Required for image generation   -
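
The parameters above map onto keyword arguments of the OpenAI client call used in Basic Usage. The following is a minimal sketch under assumptions: `build_request_kwargs` is a hypothetical helper, and it passes the documented defaults explicitly rather than relying on the server to apply them. The `tools` parameter is left out because its schema is not described in this section.

```python
from typing import Any, Dict, List, Optional


def build_request_kwargs(
    messages: List[Dict[str, Any]],
    max_tokens: Optional[int] = None,
    temperature: float = 0.7,
    skip_reasoning: bool = True,
) -> Dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create()."""
    kwargs: Dict[str, Any] = {
        "model": "track_b_model",
        "messages": messages,
        "temperature": temperature,
        # chat_template_kwargs is sent through extra_body, as in Basic Usage.
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    # max_tokens has no documented default, so it is only sent when provided.
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    return kwargs


kwargs = build_request_kwargs(
    [{"role": "user", "content": "Hello"}],
    max_tokens=256,
)
# Usage: response = client.chat.completions.create(**kwargs)
```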

S3 Configuration

Required for image/audio generation:

NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
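
Since image and audio outputs depend on these four variables, a small check before launching can catch an incomplete configuration. This is a sketch, not OmniServe code: `missing_s3_vars` is a hypothetical helper, and it inspects the current process environment (or a mapping you pass in), not the `.env` file itself.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the S3 configuration listed above.
REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required S3 variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_s3_vars()
    if missing:
        print("Unset S3 variables:", ", ".join(missing))
```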

For more details, see OmniServe documentation.


Citation

TBU (Technical Report)


Questions

For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.


License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.

Safetensors: model size 11B params, tensor type F32
