What is OpenCompass?
OpenCompass is a platform focused on the understanding of AGI, including large language models and multi-modality models.
We aim to:
- develop high-quality libraries to reduce the difficulty of evaluation
- provide convincing leaderboards to improve the understanding of large models
- create powerful toolchains targeting a variety of abilities and tasks
- build solid benchmarks to support large model research
- research the inference of large models (analysis, reasoning, prompt engineering, etc.)
Toolkit
OpenCompass
VLMEvalKit
Models
CompassVerifier
CompassJudger
Benchmarks and Methods
Pinned
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA-2, Qwen, GLM, Claude, etc.) over 100+ datasets. (Python; 6.8k stars; 754 forks)
- Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks. (Python; 4k stars; 667 forks)
- Official repo of "MMBench: Is Your Multi-modal Model an All-around Player?" (295 stars; 17 forks)
- [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLM Evaluation and Outcome Reward. (Jupyter Notebook; 68 stars; 2 forks)
- The all-in-one judge models introduced by OpenCompass. (119 stars; 6 forks)
- Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate GUI agents in a hierarchical manner across multiple platforms, includi… (Python; 103 stars; 6 forks)
Repositories
Showing 10 of 47 repositories
- GenEditEvalKit (Public): The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks. (Jupyter Notebook; 39 stars; MIT license; 4 forks)
- opencompass (Public): OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA-2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- VLMEvalKit (Public): Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- TextEdit (Public): We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models. (Python; 19 stars; MIT license; 0 forks)
- GTA (Public): [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents. (Python; 138 stars; Apache-2.0 license; 9 forks)
- MiroFlow (Public; forked from MiroMindAI/MiroFlow): MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench. (Python; 0 stars; Apache-2.0 license; 305 forks)
- RePro (Public): [ICLR 2026] Rectifying LLM Thought From Lens of Optimization. (Python; 14 stars; MIT license; 4 forks)