LLM Evaluation Framework

Production-grade open-source LLM benchmarking. Evaluate GPT-4, Claude, Gemini, Mistral, and Llama side by side on 5 metrics with one command.

What This Is

This is the model card and hub page for the LLM Evaluation Framework. The framework is a Python tool, not a set of model weights; this page is the Hugging Face Hub entry point that links all of its resources together.

Resource	Link
GitHub	https://github.com/vignesh2027/LLM-Evaluation-Framework
Live Demo	https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
Dataset	https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark
Docs	https://vignesh2027.github.io/LLM-Evaluation-Framework/

Quick Start

pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

Output:

╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini              │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
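
As a rough illustration of how the latency and cost rows in a summary like this could be aggregated from per-request records, here is a minimal Python sketch. The record fields, the `percentile` and `summarize` helpers, the per-1K prices, and the SLA threshold are all assumptions made for illustration; they are not the framework's actual API or pricing table.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, price_per_1k_in, price_per_1k_out, sla_ms):
    """Aggregate hypothetical per-request records into summary metrics."""
    latencies = [r["latency_ms"] for r in records]
    # Cost = token counts x price per 1K tokens, summed over all requests.
    total_cost = sum(
        r["prompt_tokens"] / 1000 * price_per_1k_in
        + r["completion_tokens"] / 1000 * price_per_1k_out
        for r in records
    )
    return {
        "avg_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        # Share of requests slower than the SLA threshold.
        "sla_violation_rate": sum(ms > sla_ms for ms in latencies) / len(latencies),
        "total_cost_usd": total_cost,
    }


# Made-up records and prices, for illustration only.
example_records = [
    {"latency_ms": 310, "prompt_tokens": 140, "completion_tokens": 6},
    {"latency_ms": 420, "prompt_tokens": 155, "completion_tokens": 5},
    {"latency_ms": 1300, "prompt_tokens": 160, "completion_tokens": 9},
]
print(summarize(example_records, price_per_1k_in=0.001, price_per_1k_out=0.002, sla_ms=1000))
```

Nearest-rank is only one of several percentile definitions; an interpolating method would give slightly different values on small samples.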

5 Evaluation Metrics

Metric	Description	Output
Accuracy	4-strategy cascade: exact → normalized → multiple-choice (MC) → fuzzy	0.0–1.0
Latency	p50, p75, p90, p95, p99 percentiles + SLA violation rate	ms
Cost	Real token counts × pricing table for 15+ models	$/1K tokens
Hallucination Rate	Linguistic signal analysis (v1); NLI-based detection planned for v2	0.0–1.0
Reasoning Quality	Chain-of-thought depth scoring	1–10
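
To make the accuracy cascade in the table concrete, the sketch below tries each matching strategy in order and stops at the first one that succeeds. It is an assumption-laden illustration, not the framework's implementation: the function names, the A to D choice set, the normalization rules, and the 0.85 fuzzy threshold are all hypothetical, and `difflib.SequenceMatcher` is swapped in as a stand-in for whatever fuzzy matcher the framework uses.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text):
    """Return the first standalone letter A-D in the text, or None.

    Naive on purpose: an answer containing the article "a" would match "A".
    """
    match = re.search(r"\b([A-D])\b", text.upper())
    return match.group(1) if match else None


def score_answer(prediction, reference, fuzzy_threshold=0.85):
    """Return (is_correct, strategy), trying the cheapest strategy first."""
    if prediction == reference:
        return True, "exact"
    if normalize(prediction) == normalize(reference):
        return True, "normalized"
    ref_choice = reference.strip().upper()
    if ref_choice in {"A", "B", "C", "D"} and extract_choice(prediction) == ref_choice:
        return True, "mc"
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    if ratio >= fuzzy_threshold:
        return True, "fuzzy"
    return False, "none"


print(score_answer("The answer is B", "B"))
print(score_answer("photosynthesys", "photosynthesis"))
```

Ordering the strategies from strictest to loosest keeps the loose fuzzy match as a last resort, so it can only rescue answers the stricter checks rejected.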

Supported Models

Provider	Models
OpenAI	GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo
Anthropic	Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
Google	Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash
Mistral	Mistral Large, Mistral Small
Meta	Llama 3 70B, Llama 3 8B (via Together AI)
Local	Ollama, vLLM, HuggingFace TGI

Sample Benchmark Results (MMLU, 100 samples)

Model	Accuracy	Latency	Cost/1K	Hallucination	Reasoning
GPT-4o	88.2%	892ms	$0.0080	1.8%	8.4/10
Claude 3.5 Sonnet	87.6%	1240ms	$0.0090	2.1%	8.6/10
GPT-4o-mini	78.4%	432ms	$0.0003	3.2%	7.2/10
Gemini 1.5 Flash	76.8%	380ms	$0.0001	4.1%	6.8/10
Claude 3 Haiku	74.2%	410ms	$0.0010	4.8%	6.5/10

Key finding: GPT-4o-mini achieves about 89% of GPT-4o's accuracy (78.4% vs 88.2%) at about 4% of the cost ($0.0003 vs $0.0080 per 1K tokens).
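
The ratios behind this finding can be recomputed directly from the table rows. The snippet below is a small worked check; the dictionary layout and the `relative` helper are illustrative assumptions, while the numbers are copied from the table above.

```python
# Values copied from the sample results table (MMLU, 100 samples).
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}


def relative(metric, model, baseline):
    """Ratio of a model's metric to the baseline model's metric."""
    return results[model][metric] / results[baseline][metric]


accuracy_ratio = relative("accuracy", "GPT-4o-mini", "GPT-4o")
cost_ratio = relative("cost_per_1k", "GPT-4o-mini", "GPT-4o")
print(f"accuracy retained: {accuracy_ratio:.2%}")  # 88.89%
print(f"relative cost:     {cost_ratio:.2%}")      # 3.75%
```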

Features

Async parallel evaluation — 10 models at once via asyncio.Semaphore
Streamlit dashboard — radar charts, latency histograms, cost vs quality scatter
FastAPI REST API — 12 endpoints with OpenAPI docs
CLI tool — 7 subcommands with rich terminal output
PDF report generator — professional layout via ReportLab
SQLite persistence — zero-config, file-based storage
Docker ready — multi-stage build, docker-compose up
40+ tests, 95% coverage — pytest, no API keys needed
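
The async parallel evaluation feature mentions `asyncio.Semaphore`; a minimal sketch of that pattern might look like the following. The `evaluate_model` and `evaluate_all` functions are hypothetical stand-ins, and the sleep replaces a real provider call, so this shows only how a semaphore caps concurrency and not how the framework's evaluator is actually written.

```python
import asyncio


async def evaluate_model(name, semaphore):
    """Hypothetical stand-in for one model's evaluation run."""
    async with semaphore:  # at most max_concurrent coroutines pass this point at once
        await asyncio.sleep(0.01)  # simulated API latency; a real run would call the provider
        return name, {"status": "done"}


async def evaluate_all(model_names, max_concurrent=10):
    """Evaluate all models concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [evaluate_model(name, semaphore) for name in model_names]
    # gather returns results in the same order as the input tasks.
    pairs = await asyncio.gather(*tasks)
    return dict(pairs)


print(asyncio.run(evaluate_all(["gpt-4o-mini", "claude-3-5-sonnet", "gemini-1.5-flash"])))
```

Bounding concurrency this way keeps a large model list from opening unlimited simultaneous requests, which helps stay under provider rate limits.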

Architecture

CLI / FastAPI / Streamlit / PDF Generator
                 │
     Core Evaluator (asyncio)
                 │
   ┌─────────────┼─────────────┬─────────────┐
Metrics       Benchmarks    Database      LiteLLM
accuracy      MMLU          SQLite        OpenAI
latency       TruthfulQA                  Anthropic
cost          Custom CSV                  Google
hallucin.                                 Mistral
reasoning                                 Together

Install

# pip
pip install llm-evaluation-framework

# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"

# Docker
docker-compose up -d

License

MIT — free for research and commercial use.

Citation

@software{vigneshwar234_llm_eval_2025,
 author = {Vigneshwar S},
 title = {LLM Evaluation Framework},
 year = {2025},
 url = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
 license = {MIT}
}


URL: https://huggingface.co/vigneshwar234/llm-evaluation-framework
