URL: https://dev.to/t/evaluation
Evaluation - DEV Community
Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output
Saurav Bhattacharya
Jun 19
#ai #agents #evaluation #observability
1 reaction
6 min read
Stop Asking 'Is GAI Here' - Ask 'At What Layer'
keeper
Jun 19
#ai #gai #framework #evaluation
1 reaction
3 min read
Your Agent Passed Every Eval and Still Cost $4,000 a Day
Saurav Bhattacharya
Jun 18
#ai #agents #evaluation #observability
1 reaction
5 min read
Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces
Saurav Bhattacharya
Jun 17
#ai #evaluation #observability #typescript
1 reaction
5 min read
An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.
Saurav Bhattacharya
Jun 9
#ai #agents #safety #evaluation
2 reactions
11 min read
Monitoring vs Evaluation - What's the Difference (and Why It Matters)
Phylis Korir
Jun 3
#monitoring #evaluation #projectmanagement #beginners
5 reactions
6 min read
Put Your Agent Evals in CI or Stop Calling Them Evals
Saurav Bhattacharya
Jun 16
#ai #agents #evaluation #devops
2 reactions
1 comment
5 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks
Saurav Bhattacharya
Jun 7
#ai #security #evaluation #agents
1 reaction
5 min read
The First Psychiatric Evaluation of AI Agents
guangda
Jun 6
#ai #agents #psychology #evaluation
1 reaction
1 min read
Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions
Bala Madhusoodhanan
May 25
#aibuilder #powerplatform #evaluation #powerfuldevs
5 reactions
4 min read
The First Psychiatric Evaluation of AI Agents
guangda
Jun 5
#ai #agents #psychology #evaluation
3 min read
Why I used three different critic roles instead of one (and what the eval taught me)
Bohyeon Jang
May 31
#llm #python #ai #evaluation
2 comments
6 min read
Building a domain-specific LLM evaluation set from scratch
Tech_Nuggets
Jun 4
#llm #ai #evaluation #opensource
1 reaction
8 min read
What is an LLM evaluation harness? A deep dive into lm-eval-harness
Tech_Nuggets
Jun 3
#llm #ai #evaluation #opensource
1 reaction
7 min read