URL: https://huggingface.co/JDONE-Research/AIOne-Agent-52B-A36B-it

AIOne-Agent-52B-A36B-it

A 52B / A36B sparse Mixture-of-Experts multimodal model for Korean reasoning, image understanding, and video understanding.


Model Description

AIOne-Agent-52B-A36B-it is a Korean-tuned multimodal Mixture-of-Experts (MoE) model based on Gemma 4 31B IT. It retains the text, image, and video capabilities of the base Gemma 4 model and adds a Korean-domain MoE branch whose router selects the experts to run for each input token at inference time.

  • Multimodal. Accepts text, images, and video; produces fluent Korean (and English) responses.
  • Sparse MoE (top_k=2 of 8 experts) with an always-on dense shared MLP. About 36 B parameters are active per token in the text backbone, out of about 52 B parameters in the full text backbone.
  • Long context. 256K tokens, inherited from the base model.
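
The routing described in the list above (top-2 of 8 experts plus an always-on shared MLP) can be illustrated with a toy example. This is a minimal sketch, not the model's implementation: the scalar "hidden state", the toy experts, and the renormalisation of the two selected weights are assumptions made for illustration, and `route_top_k` and `moe_layer` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Assumption: weights are renormalised over the selected experts.
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, shared_mlp, k=2):
    """Add the weighted outputs of the k routed experts to the shared MLP output."""
    out = shared_mlp(x)  # dense path, runs for every token
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)  # sparse path, only k experts run
    return out


# Toy setup: 8 experts that scale the input by (index + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]
shared_mlp = lambda x: 0.5 * x
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

selected = route_top_k(router_logits)
y = moe_layer(1.0, router_logits, experts, shared_mlp)
```

Only the two selected experts are evaluated for the token, which is why the per-token active parameter count is lower than the total.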

The name follows the Gemma 4 convention (google/gemma-4-26B-A4B-it): the first number is the text backbone parameter count, A{X}B is the per-token active parameter count, and the vision encoder (0.57 B) is reported separately.
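
As a small illustration of the naming convention, the two counts can be read back out of a model ID. This is a sketch under the assumption that the ID contains a `{N}B-A{X}B` segment; `parse_moe_name` is a hypothetical helper, not part of any library.

```python
import re

# Matches "<total>B-A<active>B", e.g. "52B-A36B" or "26B-A4B".
NAME_PATTERN = re.compile(r"(?P<total>\d+(?:\.\d+)?)B-A(?P<active>\d+(?:\.\d+)?)B")


def parse_moe_name(model_id):
    """Return (text_backbone_billions, active_billions) parsed from a model ID."""
    match = NAME_PATTERN.search(model_id)
    if match is None:
        raise ValueError(f"no '<N>B-A<X>B' segment in {model_id!r}")
    return float(match["total"]), float(match["active"])


aione = parse_moe_name("JDONE-Research/AIOne-Agent-52B-A36B-it")
gemma = parse_moe_name("google/gemma-4-26B-A4B-it")
```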


Key Capabilities

  • Korean reasoning and instruction following.
  • Image understanding (caption, VQA, document understanding).
  • Video understanding (frame-by-frame reasoning).
  • Long-context document QA in Korean.
  • Bilingual: Korean (primary) + English.

Quick Start

Transformers

import torch
from transformers import AutoProcessor, Gemma4ForConditionalGeneration

MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"

# Load the weights in bfloat16 and let device_map place them on the available devices.
model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One user turn containing an image and a text prompt (prompt translated from Korean).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "What do you see in this photo? Please answer in Korean."},
        ],
    },
]

# Apply the chat template and tokenize in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
generated = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(
    processor.tokenizer.decode(
        generated[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
)

Text-only

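# Reuses `model` and `processor` from the Transformers example above; pass
# `messages` through the same apply_chat_template and generate calls.
# The prompt is translated from Korean.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Three apples and five pears cost 12,000 won in total. "
                "If one apple costs 1,500 won, how much is one pear? "
                "Please solve it step by step.",
            },
        ],
    },
]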

vLLM (recommended for serving)

vllm serve JDONE-Research/AIOne-Agent-52B-A36B-it \
 --dtype bfloat16 \
 --tensor-parallel-size 4 \
 --max-model-len 32768
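
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch using only the standard library. It assumes the server listens on the default `localhost:8000` and that the served model name equals the model ID; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL_ID = "JDONE-Research/AIOne-Agent-52B-A36B-it"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output, like do_sample=False
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server; chat() does.
request = build_chat_request("Summarize the benefits of sparse MoE models.")
```

With the server up, `chat("...")` returns the model's reply as a string.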

Sample Output

Korean math reasoning (text-only, translated from Korean)

The step-by-step solution is as follows.

Step 1: Find the total price of the 3 apples. One apple costs 1,500 won, so compute the price of 3 apples.

  • 1,500 won × 3 = 4,500 won

Step 2: Find the total price of the 5 pears. Subtracting the price of the 3 apples (4,500 won) from the total amount (12,000 won) gives the total price of the 5 pears.
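
The sample is shown only through step 2. As a reader-side check of the arithmetic the prompt sets up (this is not model output), the remaining step divides the pears' total by 5:

```python
# Numbers taken from the prompt.
total_price = 12_000  # won, 3 apples + 5 pears
apple_price = 1_500   # won per apple
apples, pears = 3, 5

apples_total = apple_price * apples       # step 1: 4,500 won
pears_total = total_price - apples_total  # step 2: 7,500 won
pear_price = pears_total // pears         # final step: price of one pear

print(f"One pear costs {pear_price:,} won")
```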

Multimodal (image + Korean caption, translated from Korean)

The image shows dots of various colors blending into a rainbow-hued gradient.


Model Specs

Field                               Value
Architecture                        Gemma4ForConditionalGeneration
Base model                          google/gemma-4-31B-it
Text backbone parameters            51.51 B → 52 B (in name)
Active parameters per token (text)  35.90 B → A36B (in name); dense MLP always on + top-2 of 8 experts + attention
Vision tower                        0.57 B (SigLIP-style, 27 layers)
MM projector                        0.01 B
Total weights on disk               52.09 B / ~104 GB (BF16)
MoE config                          num_experts=8, top_k=2, moe_intermediate_size=2688
Modality                            Text + Image + Video → Text
Precision                           bfloat16
Context length                      256K
Languages                           Korean (primary), English
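
The figures in the table are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers from the table. The per-GPU figure covers weights only (no KV cache or activations) and assumes an even split across 4 GPUs. The per-expert estimate assumes that every inactive parameter belongs to one of the 6 unselected experts, which the card does not state.

```python
# Parameter counts in billions, from the table.
text_backbone_b = 51.51
vision_tower_b = 0.57
projector_b = 0.01
active_text_b = 35.90

total_b = text_backbone_b + vision_tower_b + projector_b  # 52.09 B

BF16_BYTES = 2
disk_gb = total_b * BF16_BYTES  # about 104 GB (decimal gigabytes)
per_gpu_gb = disk_gb / 4        # weights only, 4-way tensor parallel

active_fraction = active_text_b / text_backbone_b  # share of text params used per token

# Assumption: inactive parameters are exactly the 6 experts not selected by top-2 routing.
inactive_b = text_backbone_b - active_text_b
per_expert_b = inactive_b / (8 - 2)

print(f"total: {total_b:.2f} B, on disk: {disk_gb:.1f} GB, per GPU: {per_gpu_gb:.1f} GB")
print(f"active fraction: {active_fraction:.1%}, est. per expert: {per_expert_b:.2f} B")
```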

Intended Use

  • Korean enterprise agent backend (long-context tool use, RAG, multi-turn reasoning).
  • Image and video understanding with Korean output.
  • Document QA in Korean.

Out-of-Scope Use

  • Sole-source decision-making with legal consequences.
  • Automated use of force or coercive control based purely on this model's output.
  • Any media analysis that infringes on personal privacy, image rights, or applicable data-protection laws.

License

This model is released under the Apache License 2.0.

  • Commercial use, redistribution, and modification are permitted with attribution.
  • Provided "as is" without warranties or conditions of any kind.

Citation

@misc{aione_agent_52b_a36b_it,
 title = {AIOne-Agent-52B-A36B-it: A Korean Sparse-MoE Multimodal Model},
 author = {JDONE Research},
 year = {2026},
 howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-Agent-52B-A36B-it}}
}