AIOne-Agent-52B-A36B-it
A 52B / A36B sparse Mixture-of-Experts multimodal model for Korean reasoning, image understanding, and video understanding.
Model Description
AIOne-Agent-52B-A36B-it is a Korean-tuned multimodal Mixture-of-Experts (MoE) model based on Gemma 4 31B IT. It retains the text, image, and video capabilities of the base model and adds a Korean-domain MoE branch whose router selects 2 of 8 experts for each token at inference time.
- Multimodal. Accepts text, images, and video; produces fluent Korean (and English) responses.
- Sparse MoE (top_k=2 of 8 experts) with an always-on dense shared MLP. ~36 B parameters are active per token in the text backbone, while the full text backbone holds ~52 B parameters worth of capacity.
- Long context. 256K tokens, inherited from the base model.
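To make the "top-2 of 8 experts plus always-on shared MLP" idea concrete, here is a minimal pure-Python sketch of one such layer on scalar toy inputs. It is an illustration only: the function names (`top_k_route`, `moe_layer`), the toy experts, and the choice to normalize weights with a softmax over the selected logits are all assumptions, since the model card does not describe the router's internals.

```python
import math

NUM_EXPERTS = 8  # from the model card
TOP_K = 2        # from the model card


def top_k_route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k largest router logits.

    Weights are a softmax over the selected logits only. This normalization
    is an assumption; the model card does not specify one.
    """
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, router_logits, experts, shared_mlp):
    """Shared MLP output (always computed) plus weighted top-k expert outputs."""
    routed = sum(w * experts[i](x) for i, w in top_k_route(router_logits))
    return shared_mlp(x) + routed


# Toy scalar stand-ins: expert i multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
shared_mlp = lambda x: 0.5 * x

logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
routes = top_k_route(logits)
output = moe_layer(1.0, logits, experts, shared_mlp)
print(routes)
print(output)
```

Only the two selected experts are evaluated for a given token, which is why the active parameter count is lower than the total capacity, while the shared MLP contributes on every token.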
The name follows the Gemma 4 convention (google/gemma-4-26B-A4B-it): the first number is the text backbone parameter count, A{X}B is the per-token active parameter count, and the vision encoder (0.57 B) is reported separately.
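As a small illustration of the naming convention, the sketch below extracts the two numbers from a model name. The helper `parse_moe_name` and its regular expression are hypothetical, written only for this example, and are not part of any library.

```python
import re

# Hypothetical helper for illustration; matches the "{N}B-A{X}B" part of a name.
_NAME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(model_name):
    """Return (text_backbone_B, active_B) parsed from an '{N}B-A{X}B' name."""
    match = _NAME_PATTERN.search(model_name)
    if match is None:
        raise ValueError(f"no '{{N}}B-A{{X}}B' pattern in {model_name!r}")
    return float(match.group(1)), float(match.group(2))


print(parse_moe_name("AIOne-Agent-52B-A36B-it"))    # (52.0, 36.0)
print(parse_moe_name("google/gemma-4-26B-A4B-it"))  # (26.0, 4.0)
```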
Key Capabilities
- Korean reasoning and instruction following.
- Image understanding (caption, VQA, document understanding).
- Video understanding (frame-by-frame reasoning).
- Long-context document QA in Korean.
- Bilingual: Korean (primary) + English.
Quick Start
Transformers
import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"

# Load the weights in bfloat16 and place them across the available devices.
model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            # Prompt: "What do you see in this photo? Please answer in Korean."
            {"type": "text", "text": "이 사진에 무엇이 보이나요? 한국어로 답해주세요."},
        ],
    },
]

# Apply the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding; print only the newly generated tokens.
generated = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(
    processor.tokenizer.decode(
        generated[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
)
Text-only
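# Prompt: "3 apples and 5 pears cost 12,000 won in total. If one apple costs
# 1,500 won, how much is one pear? Please solve it step by step."
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "사과 3개와 배 5개의 가격이 12,000원입니다. "
                "사과 1개가 1,500원이라면 배 1개 가격은? 단계적으로 풀이해주세요.",
            },
        ],
    },
]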
vLLM (recommended for serving)
vllm serve JDONE-Research/AIOne-Agent-52B-A36B-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 32768
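Once the server is up, it can be queried through vLLM's OpenAI-compatible chat-completions endpoint. The sketch below uses only the Python standard library. It assumes the server is reachable at `http://localhost:8000` (vLLM's default port) and that the served model name equals the repository ID; the helper names `build_chat_payload` and `chat` are made up for this example.

```python
import json
import urllib.request

# Assumptions: local server on vLLM's default port, and the served model name
# equal to the repository ID passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"


def build_chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def chat(prompt):
    """POST the payload to the running server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Explain sparse Mixture-of-Experts models in Korean."))
```

Setting `temperature` to 0.0 mirrors the greedy decoding (`do_sample=False`) used in the Transformers example.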
Sample Output
Korean math reasoning (text-only)
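단계별 풀이 과정은 다음과 같습니다.
1단계: 사과 3개의 전체 가격 구하기
사과 1개의 가격이 1,500원이므로, 3개의 가격을 계산합니다.
- 1,500원 × 3개 = 4,500원
2단계: 배 5개의 전체 가격 구하기
전체 금액(12,000원)에서 사과 3개의 가격(4,500원)을 빼면 배 5개의 전체 가격이 나옵니다.
(English translation: The step-by-step solution is as follows. Step 1: Find the total price of 3 apples. One apple costs 1,500 won, so compute the price of 3: 1,500 won × 3 = 4,500 won. Step 2: Find the total price of 5 pears. Subtracting the price of 3 apples (4,500 won) from the total (12,000 won) gives the total price of 5 pears.)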
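The math prompt asks for the price of one pear, given that 3 apples and 5 pears cost 12,000 won in total and one apple costs 1,500 won. The excerpt above stops after step 2, so the arithmetic it sets up can be checked independently with a few lines of Python. This is a plain calculation, not model output.

```python
TOTAL_WON = 12_000
APPLE_PRICE_WON = 1_500
NUM_APPLES = 3
NUM_PEARS = 5

apples_total = APPLE_PRICE_WON * NUM_APPLES  # step 1: price of 3 apples
pears_total = TOTAL_WON - apples_total       # step 2: price of 5 pears
pear_price = pears_total // NUM_PEARS        # remaining step: price of 1 pear

print(apples_total, pears_total, pear_price)  # 4500 7500 1500
```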
Multimodal (image + Korean caption)
다양한 색상의 점들이 섞여 무지개 빛깔의 그라데이션을 이루고 있는 이미지입니다.
(English translation: An image in which dots of various colors blend to form a rainbow-colored gradient.)
Model Specs
| Field | Value |
|---|---|
| Architecture | Gemma4ForConditionalGeneration |
| Base model | google/gemma-4-31B-it |
| Text backbone parameters | 51.51 B → 52 B (in name) |
| Active parameters per token (text) | 35.90 B → A36B (in name) (dense MLP always on + top-2 of 8 experts + attention) |
| Vision tower | 0.57 B (SigLIP-style, 27 layers) |
| MM projector | 0.01 B |
| Total weights on disk | 52.09 B / ~104 GB (BF16) |
| MoE config | num_experts=8, top_k=2, moe_intermediate_size=2688 |
| Modality | Text + Image + Video → Text |
| Precision | bfloat16 |
| Context length | 256K |
| Languages | Korean (primary), English |
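The figures in the table can be cross-checked with simple arithmetic: the three components sum to the total weight count, and BF16 stores each parameter in 2 bytes. The sketch below does that check and adds a rough weights-only estimate per GPU for the tensor-parallel size of 4 used in the vLLM example. The even split across GPUs is a simplifying assumption, and KV cache and activations need additional memory.

```python
# Figures copied from the Model Specs table (billions of parameters).
TEXT_BACKBONE_B = 51.51
VISION_TOWER_B = 0.57
MM_PROJECTOR_B = 0.01
ACTIVE_TEXT_B = 35.90
BYTES_PER_PARAM_BF16 = 2
TENSOR_PARALLEL_SIZE = 4  # value used in the vLLM example

total_b = TEXT_BACKBONE_B + VISION_TOWER_B + MM_PROJECTOR_B
# One billion parameters at 2 bytes each is 2 GB.
disk_gb = total_b * BYTES_PER_PARAM_BF16
active_fraction = ACTIVE_TEXT_B / TEXT_BACKBONE_B
# Assumes an even split of the weights across GPUs (a simplification).
per_gpu_gb = disk_gb / TENSOR_PARALLEL_SIZE

print(f"total parameters:      {total_b:.2f} B")
print(f"BF16 weights on disk:  {disk_gb:.1f} GB")
print(f"active share per token: {active_fraction:.1%}")
print(f"weights per GPU (TP=4): {per_gpu_gb:.1f} GB")
```

The total comes to 52.09 B and roughly 104 GB, matching the "Total weights on disk" row, and about 70% of the text backbone is active for each token.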
Intended Use
- Korean enterprise agent backend (long-context tool use, RAG, multi-turn reasoning).
- Image and video understanding with Korean output.
- Document QA in Korean.
Out-of-Scope Use
- Sole-source decision-making with legal consequences.
- Automated use of force or coercive control based purely on this model's output.
- Any media analysis that infringes on personal privacy, image rights, or applicable data-protection laws.
License
This model is released under the Apache License 2.0.
- Commercial use, redistribution, and modification are permitted with attribution.
- Provided "as is" without warranties or conditions of any kind.
Citation
@misc{aione_agent_52b_a36b_it,
title = {AIOne-Agent-52B-A36B-it: A Korean Sparse-MoE Multimodal Model},
author = {JDONE Research},
year = {2026},
howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Agent-52B-A36B-it}}
}
