URL: https://huggingface.co/vigneshwar234/llm-evaluation-framework

LLM Evaluation Framework

๐Ÿ‘ Image
๐Ÿ‘ Image
๐Ÿ‘ Image
๐Ÿ‘ Image
๐Ÿ‘ Image
๐Ÿ‘ Image

Production-grade open-source LLM benchmarking. Evaluate GPT-4, Claude, Gemini, Mistral and Llama on 5 metrics, side by side, in one command.

What This Is

This is the model card / hub page for the LLM Evaluation Framework. The framework itself is a Python tool, not a set of neural network weights; this page serves as the Hugging Face hub entry point that links all resources together.

Quick Start

pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100

Output:

╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini              │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
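
The Avg Latency and P95 Latency rows are aggregate statistics over per-request timings. This page does not show how the framework computes them, so the following is only a minimal sketch of one common approach (a nearest-rank percentile). The function names `percentile` and `latency_summary` and the sample timings are hypothetical.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Aggregate per-request latencies (milliseconds) into summary statistics."""
    return {
        "avg_ms": mean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


# Hypothetical timings for ten requests; not measurements from the framework.
samples = [310.0, 402.0, 388.0, 451.0, 1240.0, 365.0, 420.0, 399.0, 377.0, 415.0]
summary = latency_summary(samples)
print(summary)
```

With only ten samples, a single slow request sets the P95 while barely moving the median, which is why the summary reports a tail percentile next to the average.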

5 Evaluation Metrics

| Metric | Description | Output |
|---|---|---|
| Accuracy | 4-strategy cascade: exact → normalized → multiple-choice (MC) → fuzzy | 0.0–1.0 |
| Latency | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| Cost | Real token counts × pricing table for 15+ models | $/1K tokens |
| Hallucination Rate | Linguistic signal analysis (v1), NLI planned (v2) | 0.0–1.0 |
| Reasoning Quality | Chain-of-thought depth scoring | 1–10 |
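
The accuracy metric is described as a cascade that tries progressively looser matching strategies. The framework's real rules are not documented on this page, so the sketch below is one plausible reading: the helper names, the A–E letter range, and the 0.9 fuzzy threshold are all assumptions, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the framework uses.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Return a leading multiple-choice letter such as 'B', '(B)' or 'B.', if present."""
    match = re.match(r"\s*\(?([A-Ea-e])\)?(?:[.):\s]|$)", text)
    return match.group(1).upper() if match else None


def is_correct(prediction: str, reference: str, fuzzy_threshold: float = 0.9) -> tuple[bool, str]:
    """Try each strategy in order and report which one matched."""
    # 1. Exact string match.
    if prediction == reference:
        return True, "exact"
    # 2. Match after normalization.
    if normalize(prediction) == normalize(reference):
        return True, "normalized"
    # 3. Multiple choice: only applies when the reference is a bare letter.
    ref_letter = reference.strip().upper()
    if re.fullmatch(r"[A-E]", ref_letter) and extract_choice(prediction) == ref_letter:
        return True, "mc"
    # 4. Fuzzy similarity on the normalized strings.
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    if ratio >= fuzzy_threshold:
        return True, "fuzzy"
    return False, "none"
```

For example, `is_correct("(B) Mitochondria", "B")` returns `(True, "mc")`, while `is_correct("London", "Paris")` falls through every strategy and returns `(False, "none")`. Ordering the strategies from strictest to loosest means the reported label reflects the strongest match available.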

Supported Models

| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |

Sample Benchmark Results (MMLU, 100 samples)

| Model | Accuracy | Latency | Cost per 1K tokens | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892 ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240 ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432 ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380 ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410 ms | $0.0010 | 4.8% | 6.5/10 |

Key finding: GPT-4o-mini reaches about 89% of GPT-4o's accuracy (78.4% vs 88.2%) at roughly 4% of the cost ($0.0003 vs $0.0080 per 1K tokens).
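
The ratios behind the key finding can be recomputed from the GPT-4o and GPT-4o-mini rows of the sample results table. The snippet below is plain arithmetic on those published figures; the variable names are illustrative only.

```python
# Figures copied from the sample results table (MMLU, 100 samples).
gpt4o = {"accuracy": 88.2, "cost_per_1k": 0.0080}
gpt4o_mini = {"accuracy": 78.4, "cost_per_1k": 0.0003}

# Share of GPT-4o's accuracy that GPT-4o-mini retains.
relative_accuracy = gpt4o_mini["accuracy"] / gpt4o["accuracy"]
# GPT-4o-mini's cost as a fraction of GPT-4o's cost.
relative_cost = gpt4o_mini["cost_per_1k"] / gpt4o["cost_per_1k"]

print(f"accuracy retained: {relative_accuracy:.1%}")
print(f"relative cost:     {relative_cost:.2%}")
```

The accuracy ratio is 78.4 / 88.2 ≈ 0.889 and the cost ratio is 0.0003 / 0.0080 = 0.0375, so the smaller model keeps close to nine tenths of the accuracy for under one twentieth of the price.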

Features

  • Async parallel evaluation: 10 models at once via asyncio.Semaphore
  • Streamlit dashboard: radar charts, latency histograms, cost vs quality scatter
  • FastAPI REST API: 12 endpoints with OpenAPI docs
  • CLI tool: 7 subcommands with rich terminal output
  • PDF report generator: professional layout via ReportLab
  • SQLite persistence: zero-config, file-based storage
  • Docker ready: multi-stage build, docker-compose up
  • 40+ tests, 95% coverage: run with pytest, no API keys needed
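
The first feature mentions capping parallel evaluations with `asyncio.Semaphore`. As a minimal sketch of that pattern, and not the framework's own code, the example below runs several placeholder evaluations concurrently while never letting more than `max_concurrent` run at once. `evaluate_model`, `evaluate_all`, the `stats` counters, and the model names are all hypothetical; the `asyncio.sleep` call stands in for a real provider API request.

```python
import asyncio

# Shared counters used only to demonstrate that the concurrency cap holds.
stats = {"active": 0, "peak": 0}


async def evaluate_model(name: str, limiter: asyncio.Semaphore) -> tuple[str, str]:
    """Hypothetical stand-in for one model's evaluation run."""
    async with limiter:  # waits while `max_concurrent` evaluations are already running
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the real provider API call
        stats["active"] -= 1
        return name, "done"


async def evaluate_all(models: list[str], max_concurrent: int = 10) -> dict[str, str]:
    """Schedule every model at once; the semaphore bounds how many run concurrently."""
    limiter = asyncio.Semaphore(max_concurrent)
    pairs = await asyncio.gather(*(evaluate_model(m, limiter) for m in models))
    return dict(pairs)


models = [f"model-{i}" for i in range(5)]  # placeholder names
statuses = asyncio.run(evaluate_all(models, max_concurrent=2))
print(statuses, "peak concurrency:", stats["peak"])
```

Creating all tasks up front and letting the semaphore gate them keeps the scheduling code simple, and the limit can be tuned to stay under provider rate limits.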

Architecture

CLI / FastAPI / Streamlit / PDF Generator
               │
   Core Evaluator (asyncio)
               │
    ┌──────────┼──────────┬──────────┐
Metrics    Benchmarks  Database   LiteLLM
accuracy   MMLU        SQLite     OpenAI
latency    TruthfulQA             Anthropic
cost       Custom CSV             Google
hallucin.                         Mistral
reasoning                         Together
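
The Database branch of the diagram is backed by SQLite. The framework's actual schema is not shown on this page, so the sketch below uses a hypothetical `runs` table purely to illustrate zero-config, file-based storage with Python's standard `sqlite3` module. The inserted values are taken from the Quick Start output.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a path such as "results.db" would persist to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per evaluation run.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        model TEXT NOT NULL,
        benchmark TEXT NOT NULL,
        accuracy REAL,
        avg_latency_ms REAL,
        total_cost_usd REAL
    )
    """
)

with conn:  # commits on success, rolls back if the block raises
    conn.execute(
        "INSERT INTO runs (model, benchmark, accuracy, avg_latency_ms, total_cost_usd) "
        "VALUES (?, ?, ?, ?, ?)",
        ("gpt-4o-mini", "mmlu", 0.78, 432.0, 0.0023),
    )

rows = conn.execute(
    "SELECT model, benchmark, accuracy FROM runs ORDER BY accuracy DESC"
).fetchall()
print(rows)
conn.close()
```

Because SQLite needs no server process, the same code works unchanged in a CLI run, inside the API process, or in tests.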

Install

# pip
pip install llm-evaluation-framework

# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"

# Docker
docker-compose up -d

License

MIT: free for research and commercial use.

Citation

@software{vigneshwar234_llm_eval_2025,
 author = {Vigneshwar S},
 title = {LLM Evaluation Framework},
 year = {2025},
 url = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
 license = {MIT}
}