Model Details

Model Description

This repository provides Asagi-14B, a large-scale Japanese Vision & Language Model (VLM). Asagi-14B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.

A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).

Importantly, we synthesize the training data without using any LLM whose license terms restrict the use of its outputs (e.g., GPT-4).

Model components	Model / Architecture	Parameters
Vision encoder	siglip-so400m-patch14-384	428M
Projector	2-layer MLP	64M
LLM	llm-jp-3-13b-instruct	13B
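
As a rough sanity check on the table above, the component sizes can be summed. This is a minimal sketch that uses only the rounded figures from the table, so the total is approximate; the names COMPONENT_PARAMS and total_params are illustrative and not part of the model's code.

```python
# Rounded parameter counts copied from the component table (approximate).
COMPONENT_PARAMS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 64_000_000,        # 2-layer MLP
    "llm": 13_000_000_000,          # llm-jp-3-13b-instruct
}


def total_params(components):
    """Sum the per-component parameter counts."""
    return sum(components.values())


total = total_params(COMPONENT_PARAMS)
print(f"about {total / 1e9:.2f}B parameters")
```

With these rounded inputs the sum comes to roughly 13.5B parameters, most of it in the LLM.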

Usage

Requirements

transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
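
Before running the example, it may help to confirm that the installed packages match the pins above. The following is a minimal sketch using the standard-library importlib.metadata module; the helper name find_mismatches and the decision to ignore local build tags such as "+cu121" are assumptions made for illustration.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
REQUIRED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def find_mismatches(required, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build tags such as "+cu121" when comparing versions.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(REQUIRED).items():
        print(f"{name}: want {REQUIRED[name]}, found {installed}")
```

An empty result means every pinned package is installed at the listed version.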

How to use

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-14B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5,
)

# The prompt is in Japanese. In English: "Below is an instruction that describes a task.
# Write a response that appropriately satisfies the request." The instruction asks where
# the photo was taken and for a detailed description of the image.
prompt = ("以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
          "### 指示:\n<image>\nこの画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。\n\n### 応答:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# Overwrite the text inputs with the plain tokenizer output
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# Cast float32 tensors to the model dtype (bfloat16)
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# Move everything to the model device, dropping token_type_ids
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text: この写真は渋谷スクランブル交差点の夜景を撮影したものです。渋谷スクランブル交差点は、東京都渋谷区に位置する有名な交差点で、多くの人々が行き交う場所として知られています。写真には、高層ビル群やネオンサイン、街灯などが写っており、都会の活気とエネルギーを感じることができます。
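
To ask a different question about an image, the prompt template from the example above can be wrapped in a small helper. This is a minimal sketch: the names build_prompt and strip_prompt are hypothetical, and it assumes the same instruction/response template works for other instructions, which the example itself does not confirm.

```python
# Header text copied from the prompt in the usage example above.
SYSTEM_TEXT = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the instruction/response template with one <image> slot."""
    return (
        f"{SYSTEM_TEXT}\n\n"
        f"### 指示:\n<image>\n{instruction}\n\n"
        "### 応答:\n"
    )


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from decoded output, mirroring the example above."""
    return generated_text.replace(prompt.replace("<image>", " "), "")
```

The result of build_prompt can be passed as the text argument to the processor in place of the hard-coded prompt, and strip_prompt applied to the decoded output.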

Example

Prompt	Output
この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。	この写真は渋谷スクランブル交差点の夜景を撮影したものです。渋谷スクランブル交差点は、東京都渋谷区に位置する有名な交差点で、多くの人々が行き交う場所として知られています。写真には、高層ビル群やネオンサイン、街灯などが写っており、都会の活気とエネルギーを感じることができます。

Training Details

Training Data

Dataset	Preprocess	Stage1	Stage2	Size
ROIS (Ours)	Synthesized	✓	✓	8.4M
Japanese image text pairs	Synthesized	✓	✓	4.4M
Wikipedia	Synthesized	✓	✓	2.5M
Open Images	Translated	✓	✓	680K
DCI	Translated	✓	✓	7K
CommonCatalog CC-BY	Translated	✓	✓	3.5M
LLaVA-Pretrain-JA	-	✓	✓	550K
STAIR Captions	-	✓	✓	410K
Flickr-JP	-	✓	✓	160K
YJ Captions	-	✓	✓	130K
Japanese Pascal	-	✓	✓	5K
ArtBench	Synthesized	-	✓	100K
GQA	Translated	-	✓	1.9M
VQA v2	Translated	-	✓	880K
A-OKVQA	Translated	-	✓	34K
OK-VQA	Translated	-	✓	18K
Japanese Visual Genome	Translated	-	✓	1.6M
PangeaInstruct	-	-	✓	93K

Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
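
The Size column can be totalled with a few lines of code. This is a minimal sketch that parses the rounded "K"/"M" figures exactly as they appear in the table; the table does not state the unit being counted, so the total is only an approximate entry count, and the names SIZES and parse_size are illustrative.

```python
from decimal import Decimal

# Size column copied from the Training Data table (rounded figures).
SIZES = {
    "ROIS (Ours)": "8.4M",
    "Japanese image text pairs": "4.4M",
    "Wikipedia": "2.5M",
    "Open Images": "680K",
    "DCI": "7K",
    "CommonCatalog CC-BY": "3.5M",
    "LLaVA-Pretrain-JA": "550K",
    "STAIR Captions": "410K",
    "Flickr-JP": "160K",
    "YJ Captions": "130K",
    "Japanese Pascal": "5K",
    "ArtBench": "100K",
    "GQA": "1.9M",
    "VQA v2": "880K",
    "A-OKVQA": "34K",
    "OK-VQA": "18K",
    "Japanese Visual Genome": "1.6M",
    "PangeaInstruct": "93K",
}

MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert a string such as '8.4M' or '680K' into an integer count."""
    # Decimal avoids floating-point rounding surprises for values like 8.4.
    return int(Decimal(text[:-1]) * MULTIPLIERS[text[-1]])


total = sum(parse_size(size) for size in SIZES.values())
print(f"{len(SIZES)} datasets, about {total / 1e6:.1f}M entries in total")
```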

Evaluation

We evaluated our model on Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500, using the eval-mm library.

Models marked with "†" were not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

Model	LM Size	Heron-Bench (LLM (%))	JA-VLM-Bench-In-the-Wild (ROUGE-L)	JA-VLM-Bench-In-the-Wild (LLM (/5.0))	JA-VG-VQA-500 (ROUGE-L)	JA-VG-VQA-500 (LLM (/5.0))
Japanese InstructBLIP Alpha†	7B	14.0	20.8	2.42	-	-
Japanese Stable VLM†	7B	24.2	23.3	2.47	-	-
LLaVA-CALM2-SigLIP†	7B	43.3	47.2	3.15	17.4	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	41.4	2.92	23.5	2.96
VILA-jp	13B	57.2	52.3	3.69	16.2	3.62
Asagi-2B†	1.8B	44.7	48.8	3.26	53.7	3.69
Asagi-4B†	3.7B	49.3	49.6	3.38	55.6	3.78
Asagi-8B†	7.2B	54.7	49.4	56.43
Asagi-14B†	13B	3.44
GPT-4o	-	87.6	37.6	3.85	12.1	3.58
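
For readers unfamiliar with the ROUGE-L columns, the metric is based on the longest common subsequence (LCS) between a reference answer and a model answer. The sketch below computes a plain ROUGE-L F1 score over pre-tokenized sequences. It is an illustration of the general metric only, not the eval-mm implementation; how the benchmark tokenizes Japanese text and weights precision against recall is not stated here, so both are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 with equal weight on precision and recall (an assumption)."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Character-level tokens are used here purely for illustration.
score = rouge_l_f1(list("abcd"), list("acde"))
print(f"ROUGE-L F1: {score:.2f}")
```

In this toy example the LCS is "acd" (length 3), so precision and recall are both 3/4.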

Risks and Limitations

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Model Card Authors

Kohei Uehara


URL: https://huggingface.co/MIL-UT/Asagi-14B
