What is OpenCompass?
OpenCompass is a platform focused on the understanding of AGI, including large language models and multi-modality models.
We aim to:
- develop high-quality libraries to reduce the difficulty of evaluation
- provide convincing leaderboards to improve the understanding of large models
- create powerful toolchains targeting a variety of abilities and tasks
- build solid benchmarks to support large model research
- research the inference of large models (analysis, reasoning, prompt engineering, etc.)
Toolkit
OpenCompass
VLMEvalKit
Models
CompassVerifier
CompassJudger
Benchmarks and Methods
Pinned
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA-2, Qwen, GLM, Claude, etc.) over 100+ datasets. (Python; 6.8k stars; 754 forks)
- Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks. (Python; 4k stars; 667 forks)
- Official repo of "MMBench: Is Your Multi-modal Model an All-around Player?" (295 stars; 17 forks)
- [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLM Evaluation and Outcome Reward. (Jupyter Notebook; 68 stars; 2 forks)
- The all-in-one judge models introduced by OpenCompass. (119 stars; 6 forks)
- Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate GUI agents in a hierarchical manner across multiple platforms, includi… (Python; 103 stars; 6 forks)
Repositories
Showing 10 of 47 repositories
- GenEditEvalKit (Public): The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks. (Jupyter Notebook; 39 stars; MIT license; 4 forks)
- opencompass (Public): OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA-2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- VLMEvalKit (Public): Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- TextEdit (Public): We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models. (Python; 19 stars; MIT license; 0 forks)
- GTA (Public): [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents. (Python; 138 stars; Apache-2.0 license; 9 forks)
- MiroFlow (Public; forked from MiroMindAI/MiroFlow): MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench. (Python; 0 stars; Apache-2.0 license; 305 forks)
- RePro (Public): [ICLR 2026] Rectifying LLM Thought From Lens of Optimization. (Python; 14 stars; MIT license; 4 forks)