Voozh

1 reaction

6 min read

👁 lanternproton profile

keeper

Jun 19

Stop Asking 'Is GAI Here' — Ask 'At What Layer'

#ai #gai #framework #evaluation

1 reaction

3 min read

👁 saurav_bhattacharya profile

Saurav Bhattacharya

Jun 18

Your Agent Passed Every Eval and Still Cost $4,000 a Day

#ai #agents #evaluation #observability

1 reaction

5 min read

👁 saurav_bhattacharya profile

Saurav Bhattacharya

Jun 17

Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces

#ai #evaluation #observability #typescript

1 reaction

5 min read

👁 arthurpro profile

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

2 reactions

10 min read

👁 saurav_bhattacharya profile

Saurav Bhattacharya

Jun 9

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

#ai #agents #safety #evaluation

2 reactions

11 min read

👁 phylis-data4impact profile

Phylis Korir

Jun 3

Monitoring vs Evaluation — What's the Difference (and Why It Matters)

#monitoring #evaluation #projectmanagement #beginners

5 reactions

6 min read

👁 saurav_bhattacharya profile

Saurav Bhattacharya

Jun 16

Put Your Agent Evals in CI or Stop Calling Them Evals

#ai #agents #evaluation #devops

2 reactions

1 comment

5 min read

👁 saurav_bhattacharya profile

Saurav Bhattacharya

Jun 7

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

#ai #security #evaluation #agents

1 reaction

5 min read

👁 guangda88 profile

guangda

Jun 6

The First Psychiatric Evaluation of AI Agents (第一次对AI Agent的精神病学评估)

#ai #agents #psychology #evaluation

1 reaction

1 min read

👁 balagmadhu profile

Bala Madhusoodhanan

May 25

Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

#aibuilder #powerplatform #evaluation #powerfuldevs

5 reactions

4 min read

👁 guangda88 profile

guangda

Jun 5

The First Psychiatric Evaluation of AI Agents

#ai #agents #psychology #evaluation

3 min read

👁 bhj37193 profile

Bohyeon Jang

May 31

Why I used three different critic roles instead of one (and what the eval taught me)

#llm #python #ai #evaluation

2 comments

6 min read

👁 tech_nuggets profile

Tech_Nuggets

Jun 4

Building a domain-specific LLM evaluation set from scratch

#llm #ai #evaluation #opensource

1 reaction

8 min read

👁 tech_nuggets profile

Tech_Nuggets

Jun 3

What is an LLM evaluation harness? A deep dive into lm-eval-harness

#llm #ai #evaluation #opensource

1 reaction

7 min read

URL: https://dev.to/t/evaluation

⇱ Evaluation - DEV Community

Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output
