Salience 1 (9B)
A 9B multimodal reasoning model, sharpened for code and agentic work, that can see.
Vection Labs
Weights · Benchmarks · Quickstart · Fast inference · Limitations
Abstract
Salience 1 (9B) is a dense, 9-billion-parameter vision-language model built for hard, practical work: writing and debugging real code, driving tools and agents, multi-step mathematical reasoning, and visual understanding over images and video, all inside a single model with a context window of up to 1M tokens.
It is the successor to Maestro1-9B, engineered around a single goal: push the axis users ask for most, code and agentic/tool use, without giving up the deep reasoning, vision, and million-token context the family is known for.
It is designed for people who care less about chat pleasantries and more about whether the model can do the thing: ship the function, find the bug, call the right tool, read the diagram, finish the proof.
Highlights
- Code & agentic first. Built with a coding/DevOps donor on top of a reasoning core; tuned to produce runnable code and well-formed tool calls.
- Reasoning that shows its work. Structured, inspectable chains of thought for math, logic, and code.
- Genuinely multimodal. Images and video are first-class inputs, not bolted-on captioning.
- Long context. Up to 1M tokens via interleaved multimodal RoPE: whole repos, long papers, or long videos in a single prompt.
- Fast on modest hardware. Runs on 2× T4 without needing GGUF (fp16 sharded across both GPUs, or 4-bit on a single T4), with lossless speculative decoding and hybrid-thinking latency control.
- Open weights. Apache-2.0, `transformers`-native, single-file deployment.
Model overview
| Property | Value |
|---|---|
| Parameters | 9B (dense) |
| Modalities | text, image, video → text |
| Context window | up to 1,000,000 tokens (interleaved multimodal RoPE) |
| Precision | bfloat16 master weights |
| Architecture | Qwen3-VL (Qwen3-8B language model, 36 layers) + native vision encoder |
| License | Apache-2.0 |
| Library | 🤗 `transformers` (`AutoModelForImageTextToText`) |
Architecture & capabilities
Salience 1 is a dense Qwen3-VL model: a 36-layer Qwen3-8B language model coupled to a native vision encoder, with interleaved multimodal RoPE carrying the context window from 256K up to 1M tokens.
Its capability profile is built around three pillars:
- Code & agentic execution: runnable code, repo-scale edits, and well-formed tool calls.
- Deep reasoning: structured, inspectable chains of thought for math and logic.
- Multimodal perception: images and video as first-class inputs, not bolted-on captioning.
The vision pathway and long-context behavior are preserved end to end, so the same reasoning that solves an olympiad problem also reads a chart, a UI screenshot, or a short clip.
Intended use
Salience 1 targets technical assistance, coding agents, and research:
- Code generation, explanation, debugging, review, and repo-scale tasks.
- Agentic / tool-using workflows that emit structured calls.
- Step-by-step math and quantitative reasoning.
- Visual question answering and document/diagram/chart understanding.
- Video understanding over short clips, and long-document / long-context analysis.
It is not intended for high-stakes decisions without human review, nor as a source of truth for medical, legal, or financial advice.
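For the tool-using workflows listed above, the calling code usually checks a reply before executing anything. The following is a minimal sketch of that check, not the model's documented output contract: the `get_weather` tool, the `{"name": ..., "arguments": ...}` layout, and the helper name `parse_tool_call` are all assumptions made for illustration.

```python
import json

# Hypothetical tool schema for illustration; it is not part of the model card.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": str, "days": int},
}


def parse_tool_call(raw: str, tool: dict) -> dict:
    """Parse a model reply as strict JSON and check it against one tool schema.

    Raises ValueError when the reply is not a well-formed call.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool name: {call.get('name')!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")

    expected = tool["parameters"]
    missing = expected.keys() - args.keys()
    extra = args.keys() - expected.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")

    for key, typ in expected.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], typ):
            raise ValueError(f"argument {key!r} should be {typ.__name__}")
    return call
```

A reply such as `{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}` passes, while a reply wrapped in prose, missing an argument, or carrying the wrong type raises `ValueError`, which an agent loop can feed back to the model as a retry prompt.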
Benchmarks
All results use a single reproducible evaluation harness with greedy/CoT settings; the Maestro1-9B column is run under the identical protocol for a like-for-like comparison.
Reasoning, math & code
| Benchmark | Setting | Maestro1-9B | Salience-1-9B |
|---|---|---|---|
| GSM8K | 0-shot CoT, exact match | - | - |
| MATH-500 | 0-shot CoT, exact match | - | - |
| HumanEval | 0-shot, pass@1 | - | - |
| MBPP | 3-shot, pass@1 | - | - |
| MMLU | 0-shot | - | - |
Multimodal
| Benchmark | Setting | Maestro1-9B | Salience-1-9B |
|---|---|---|---|
| MMMU (val) | 0-shot | - | - |
| MathVista (testmini) | 0-shot | - | - |
| DocVQA (val) | 0-shot, ANLS | - | - |
The evaluation protocol, prompts, and answer-extraction logic are fixed and reproducible end-to-end.
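The DocVQA row is scored with ANLS (Average Normalized Levenshtein Similarity). The card does not publish its scoring code, so the sketch below only illustrates the commonly used definition of the metric, with the usual 0.5 threshold and case-insensitive matching; the function names are placeholders and the harness may differ in its normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a list of questions.

    predictions: list of answer strings, one per question.
    references: list of lists of accepted answers, one list per question.
    """
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers at or beyond the threshold distance score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The threshold is what separates ANLS from plain edit similarity: a near miss from OCR-style noise keeps partial credit, while an unrelated answer contributes nothing.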
Quickstart
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "vectionlabs/Salience-1-9B"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# One user turn: the image comes first, then the question about it.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/diagram.png"},
        {"type": "text", "text": "Explain what this diagram proves, step by step."},
    ],
}]

# Render the chat template to a prompt string and load the referenced media.
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
imgs, vids = process_vision_info(messages)
inputs = proc(text=[text], images=imgs, videos=vids, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```
Text-only works the same way with a plain {"type": "text", ...} message.
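To make the text-only and multimodal cases concrete, here is a small sketch of a message builder that follows the content layout used in the Quickstart. The helper name `build_messages` is hypothetical, and the `{"type": "video", "video": ...}` entry is an assumption based on the format `qwen_vl_utils` accepts.

```python
def build_messages(question, image=None, video=None):
    """Build a single-turn message list in the Quickstart's content format.

    Media entries, when given, are placed before the text entry.
    """
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]
```

`build_messages("Write a binary search in Rust.")` yields a text-only turn, and `build_messages("What does this chart show?", image="https://example.com/chart.png")` yields an image-first turn. Either result can be passed to `apply_chat_template` in place of the hand-written `messages` list.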
Speed & efficiency
Salience 1 is built to be fast in production, not just accurate:
- Speculative decoding delivers a 1.5-2.5× speedup on code and structured text with no change to outputs: a lightweight draft model proposes tokens and the main model verifies them in a single pass. Supported natively in `transformers` (`assistant_model=`) and in vLLM (`--speculative-model`).
- Adaptive thinking. Append `/no_think` for instant direct answers, or `/think` to unlock deep step-by-step reasoning on hard math and multi-step agentic planning, so you spend latency only when the task is worth it.
- Runs on consumer hardware. 4-bit quantization brings the full model onto a single consumer GPU; bf16/fp16 serves comfortably on one modern accelerator with room for long context.
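The claim that speculative decoding leaves outputs unchanged is easiest to see in the greedy case. The sketch below is a toy illustration, not the `transformers` or vLLM implementation: `toy_target` and `toy_draft` are made-up stand-ins for the large and draft models, and each "model" is just a function from a token list to the next token. Sampling-based variants use a rejection rule instead of the equality check shown here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4):
    """Greedy speculative decoding with a k-token draft window.

    Returns (generated_tokens, number_of_verification_rounds).
    """
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) The target checks every proposed position. A real model does this
        #    in one batched forward pass; the toy version just loops.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, then stop
                break
            accepted.append(tok)
        else:
            # All k drafts matched, so the same pass yields one bonus token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[len(prompt):][:n_new], rounds


def toy_target(seq: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft that disagrees with the target at every 4th position."""
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 11
```

Every accepted token is either a draft token the target would have chosen anyway or the target's own correction, so `speculative_decode(toy_target, toy_draft, prompt, n)` returns the same tokens as `greedy_decode(toy_target, prompt, n)` while using fewer verification rounds than tokens. The speedup in practice depends on how often the draft agrees with the target.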
Prompting tips
- Code: specify language, constraints ("no external libraries"), and the exact I/O contract.
- Agentic / tools: give the tool schema and ask for the call as strict JSON.
- Math/logic: ask it to reason step by step; it is tuned to externalize its work.
- Vision: put the image/video before the question in the message content.
- Sampling (Qwen3 family): thinking: `temperature=0.6, top_p=0.95, top_k=20`; direct answers: `temperature=0.7, top_p=0.8, top_k=20`.
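The sampling presets above pair naturally with the `/think` and `/no_think` switches. A minimal sketch of bundling the two follows; the helper name `prepare_request` is hypothetical, and placing the switch at the end of the user prompt is an assumption based on the card's wording ("append").

```python
# Sampling presets copied from the tip above.
SAMPLING = {
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "no_think": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def prepare_request(prompt: str, think: bool):
    """Return (prompt_with_switch, generate_kwargs) for one request."""
    mode = "think" if think else "no_think"
    text = f"{prompt.rstrip()} /{mode}"
    # do_sample=True is needed for temperature/top_p/top_k to take effect.
    kwargs = {"do_sample": True, **SAMPLING[mode]}
    return text, kwargs
```

The returned text goes into the user message, and the keyword arguments can be forwarded to generation, for example `model.generate(**inputs, max_new_tokens=1024, **kwargs)`.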
Deployment
- Single-GPU: loads in bf16/fp16 with `device_map="auto"` on one modern accelerator; 4-bit quantization fits the model on a single consumer GPU.
- Serving: integrates with standard `transformers` generation and vision-capable serving stacks such as vLLM (with optional speculative decoding) for high-throughput production use.
- Quantized formats: GGUF and other community quantizations are supported.
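A quick back-of-the-envelope check shows why the precision choice decides which GPU is enough. This sketch counts weight storage only, assuming the nominal 9e9 parameter count; activations, the KV cache, the vision encoder's working memory, and quantization metadata all add to it.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 9e9  # nominal "9B" from the model overview; the exact count may differ

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # bf16 has the same footprint
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"fp16/bf16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:     {int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights come to roughly 16.8 GiB and the 4-bit weights to roughly 4.2 GiB, which fits the pattern the card describes: 16-bit needs a larger accelerator or sharding across two 16 GB cards, while 4-bit leaves headroom on a single one.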
Limitations & responsible use
- Salience 1 can be confidently wrong. Verify mathematical and factual claims.
- Generated code may be insecure or incorrect: review before running, and never execute untrusted output.
- Long-context and long-video inputs increase latency and memory substantially.
- It inherits the licenses, biases, and failure modes of all source models. Do not use it for surveillance, manipulation, or any use that violates applicable law or the Apache-2.0 terms.
- No audio modality.
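To put a rough number on the long-context memory cost noted above, the KV cache can be estimated per token. The layer count (36) comes from the model overview; the 8 key/value heads and head dimension of 128 are assumptions based on the publicly documented Qwen3-8B configuration and are not stated in this card. The estimate assumes an unquantized 16-bit cache with no paging or eviction.

```python
def kv_cache_gib(n_tokens, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in GiB for one sequence of n_tokens."""
    # Factor 2: one key and one value vector per layer, per KV head, per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token_bytes / 2**30


for n in (32_768, 262_144, 1_000_000):
    print(f"{n:>9,} tokens: {kv_cache_gib(n):6.1f} GiB")
```

Under these assumptions the cache costs 144 KiB per token, so 32K tokens take about 4.5 GiB and the full 1M-token window well over 100 GiB. Image and video inputs count toward the token total, which is why long videos are as demanding as long documents.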
Citation
```bibtex
@misc{vectionlabs2026salience1,
  title  = {Salience 1 (9B): A Multimodal Reasoning and Coding Model},
  author = {Vection Labs},
  year   = {2026},
  url    = {https://huggingface.co/vectionlabs/Salience-1-9B}
}
```
