URL: https://dev.to/marcuswwchen/a-9-point-eval-gain-vanished-when-we-deduped-train-against-test-3baj

A 9-point eval gain vanished when we deduped train against test


TL;DR: We fine-tuned an 8B model for an enterprise ticket-routing task and saw accuracy jump from 71% to 80%. The gain was fake. Roughly 6% of our eval set had near-duplicates in the training data. After MinHash dedup, the real number was 72%. Contamination is the most boring bug in ML and it keeps eating people.

At Nexus Labs my team fine-tunes models for enterprise agent automation. One task: classify inbound support tickets into 40 routing buckets. We had a held-out eval set of 4,000 labeled tickets and a training set of about 90,000.

The fine-tune looked great. Base Qwen3-8B sat at 71.2% exact-match on the eval set. After a QLoRA run on the 90k, we hit 80.4%. Nine points. Everyone wanted to ship Friday.

I didn't believe it. Nine points from a single LoRA pass on a noisy classification task is not how the world usually works.

Where the points came from

The training data and the eval data came from the same Zendesk export. Different time windows, supposedly. But customers paste the same boilerplate. "My SSO login redirects to a blank page" shows up verbatim across dozens of tickets, sometimes months apart.

So the model wasn't generalizing. It was memorizing tickets it had already seen, then getting graded on slightly-reworded copies of them. The eval set was leaking.

Exact-string matching found almost nothing. 38 identical rows out of 4,000. That's why nobody caught it in the first pass. The leakage was near-duplicates, not exact ones: same ticket body with a different greeting, a trimmed signature, one extra sentence.
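For reference, here is a minimal sketch of what an exact-match check looks like and why it misses this kind of leak. The normalization (lowercase, collapsed whitespace) and the names `normalize` and `exact_leaks` are assumptions for illustration, not necessarily the check we ran.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences still match.
    return " ".join(text.lower().split())

def digest(text: str) -> str:
    return hashlib.sha1(normalize(text).encode("utf8")).hexdigest()

def exact_leaks(train_docs, eval_docs):
    """Return indices of eval docs whose normalized text also appears in train."""
    train_hashes = {digest(d) for d in train_docs}
    return [j for j, d in enumerate(eval_docs) if digest(d) in train_hashes]

train = ["My SSO login redirects to a blank page"]
evals = [
    "my sso login   redirects to a blank page",          # same text, different case and spacing
    "Hi team, my SSO login redirects to a blank page.",  # near-duplicate: greeting added
]
print(exact_leaks(train, evals))  # only the first eval row is caught
```

The second row is the interesting one: a greeting and a period are enough to change the hash, so an exact check never sees it.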

Catching near-duplicates with MinHash

We used datasketch MinHash LSH on character 5-grams. The idea is cheap: hash each document into a signature, bucket signatures that collide, then compute Jaccard similarity only inside buckets. You avoid the 90,000 x 4,000 brute-force comparison, which is 360 million pairs.

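from datasketch import MinHash, MinHashLSH

# train_docs and eval_docs are lists of ticket bodies (strings), loaded elsewhere.

def shingles(text, k=5):
    # Lowercase, strip all whitespace, then take overlapping character k-grams.
    text = "".join(text.lower().split())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def signature(text, num_perm=128):
    # Build a MinHash signature from the document's shingle set.
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf8"))
    return m

# Index every training document under a unique key.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
sigs = {}
for i, doc in enumerate(train_docs):
    sig = signature(doc)
    sigs[f"train-{i}"] = sig
    lsh.insert(f"train-{i}", sig)

# An eval row counts as leaked if the index returns at least one training key.
leaked = []
for j, doc in enumerate(eval_docs):
    if lsh.query(signature(doc)):
        leaked.append(j)

print(f"{len(leaked)} / {len(eval_docs)} eval rows leak")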

At a Jaccard threshold of 0.7 this flagged 247 eval rows, about 6.2%, with a near-duplicate somewhere in the training set. We pulled every flagged row out of the eval set and re-scored.
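An LSH query returns candidates whose signatures collide, so each hit is approximate. One way to double-check a flagged row is to compute exact Jaccard on the shingle sets of the eval row and its candidates. The sketch below is pure Python and assumes you already have the candidate training texts in hand; `jaccard` and `confirm_leak` are hypothetical helper names, and this is not necessarily the step that produced the counts above.

```python
def shingles(text, k=5):
    # Same shingling as the MinHash code: lowercase, no whitespace, character k-grams.
    text = "".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    # Exact Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def confirm_leak(eval_doc, candidate_train_docs, threshold=0.7):
    """True if any candidate training doc is at or above the Jaccard threshold."""
    ev = shingles(eval_doc)
    return any(jaccard(ev, shingles(c)) >= threshold for c in candidate_train_docs)

eval_doc = "My SSO login redirects to a blank page after the update"
near_dup = "My SSO login redirects to a blank page after the update."
unrelated = "Please reset my billing address"

print(confirm_leak(eval_doc, [near_dup]))   # trailing period only: still a leak
print(confirm_leak(eval_doc, [unrelated]))  # no shared 5-grams: not a leak
```

Exact Jaccard is cheap here because it only runs on the handful of candidates per eval row, not on every train-eval pair.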

The honest numbers

| Configuration | Eval accuracy | Notes |
| --- | --- | --- |
| Base, full eval set | 71.2% | original baseline |
| Fine-tuned, full eval set | 80.4% | the fake 9-point win |
| Fine-tuned, exact-dedup only | 80.1% | 38 rows removed, barely moves |
| Fine-tuned, MinHash-dedup (0.7) | 72.3% | 247 rows removed |
| Base, MinHash-dedup eval | 70.9% | baseline barely changes |

The base model score barely moved after dedup, from 71.2% to 70.9%. That's the tell. Contamination only inflates the model that trained on the contaminated data. The fine-tune dropped 8 points once it couldn't recite tickets it had memorized. Real lift was about 1.4 points, inside the noise band we measure with bootstrap resampling on this eval.
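If you haven't built a noise band like this before, here is a minimal sketch of a paired bootstrap over eval rows. It assumes you have per-row 0/1 correctness for both models on the same rows. The function name, the 2,000 resamples, and the percentile method are illustrative choices, not a description of our exact tooling.

```python
import random

def bootstrap_lift_ci(base_correct, tuned_correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for (tuned accuracy - base accuracy) over paired eval rows."""
    if len(base_correct) != len(tuned_correct) or not base_correct:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    diffs = []
    for _ in range(n_boot):
        # Resample row indices with replacement, keeping base/tuned paired per row.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(tuned_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: 100 rows, the tuned model gets exactly one extra row right.
base = [1] * 71 + [0] * 29
tuned = [1] * 72 + [0] * 28
lo, hi = bootstrap_lift_ci(base, tuned)
print(f"95% CI for lift: [{lo:.3f}, {hi:.3f}]")
```

If the interval straddles zero, the lift is not distinguishable from resampling noise on that eval set.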

We did not ship Friday.

Threshold tuning is the actual work

The 0.7 threshold isn't magic. Set it too high and you miss paraphrases. Too low and you delete legitimately distinct tickets that happen to share a template. We swept it.

| Jaccard threshold | Eval rows flagged | Fine-tuned acc on clean set |
| --- | --- | --- |
| 0.9 | 71 | 78.0% |
| 0.8 | 156 | 74.6% |
| 0.7 | 247 | 72.3% |
| 0.6 | 489 | 72.0% |

Below 0.7 the accuracy stabilizes around 72%, which told us we'd caught the real contamination and were now just deleting clean rows. We froze at 0.7 and documented it.
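The sweep itself is a few lines once you have, for each eval row, its highest similarity to any training row and whether the model got it right. The sketch below assumes both lists are precomputed; `max_sim`, `correct`, and `sweep` are hypothetical names, and the toy values are made up for illustration.

```python
def sweep(max_sim, correct, thresholds=(0.9, 0.8, 0.7, 0.6)):
    """For each threshold, drop eval rows with similarity >= threshold and re-score the rest."""
    rows = []
    for t in thresholds:
        kept = [c for s, c in zip(max_sim, correct) if s < t]
        flagged = len(correct) - len(kept)
        acc = sum(kept) / len(kept) if kept else float("nan")
        rows.append((t, flagged, acc))
    return rows

# Toy data: six eval rows with their best train similarity and 0/1 correctness.
max_sim = [0.95, 0.85, 0.75, 0.65, 0.20, 0.10]
correct = [1, 1, 1, 0, 1, 0]

for t, flagged, acc in sweep(max_sim, correct):
    print(f"threshold={t:.1f} flagged={flagged} clean_acc={acc:.3f}")
```

What you are looking for is the threshold where clean-set accuracy stops falling: past that point, lowering the threshold mostly removes rows that were never leaking.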

One operational note. We run the post-dedup eval as a batch of LLM-judge calls for the fuzzy-label cases, and route those through Bifrost (https://github.com/maximhq/bifrost) so a single provider rate limit doesn't stall a 4,000-row eval run. It's one config gateway in front of the judge calls, nothing fancy. Failover was the only feature we cared about there.

Trade-offs and Limitations

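MinHash LSH is approximate. At num_perm=128 the similarity estimate has real variance, so a borderline pair near your threshold can land on the wrong side of it. Reruns won't fix that: datasketch seeds its permutations with a fixed default, so the same inputs give the same answer every time. If you need tighter estimates, bump num_perm to 256 and eat the memory cost.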
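To put a rough number on that variance: the textbook MinHash estimator averages num_perm independent match indicators, so its standard error at true similarity J is sqrt(J(1 - J) / num_perm). The sketch below just evaluates that formula. It treats the permutations as ideal and independent, which is an approximation, and `minhash_std_error` is a hypothetical helper name.

```python
import math

def minhash_std_error(jaccard: float, num_perm: int) -> float:
    # Standard error of the MinHash Jaccard estimate under the binomial model.
    return math.sqrt(jaccard * (1.0 - jaccard) / num_perm)

for k in (128, 256):
    se = minhash_std_error(0.7, k)
    print(f"num_perm={k}: standard error ~ {se:.4f}")
```

At a true similarity of 0.7 this works out to roughly 0.04 with 128 permutations and roughly 0.03 with 256, so doubling the signature size narrows the gray zone around the threshold but does not remove it.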

Character 5-grams catch surface paraphrase. They do not catch semantic duplicates that share zero substrings, like a ticket translated into Spanish. For that you need embedding-based dedup, which is slower and brings its own threshold-tuning headache. We accepted the gap because our tickets are English and templated.

Dedup also shrinks your eval set. We went from 4,000 to 3,753 rows. Smaller eval means wider confidence intervals. There's no free lunch: you trade a contaminated big set for a clean smaller one, and the clean smaller one is the only one worth trusting.
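For a sense of scale, here is a back-of-envelope sketch using the normal approximation to a binomial proportion. It assumes independent eval rows and an accuracy near 72%; `ci_half_width` is a hypothetical helper name, and this is a rougher tool than the bootstrap.

```python
import math

def ci_half_width(acc: float, n: int, z: float = 1.96) -> float:
    # Half-width of an approximate 95% interval for accuracy on n independent rows.
    return z * math.sqrt(acc * (1.0 - acc) / n)

for n in (4000, 3753):
    print(f"n={n}: +/- {100 * ci_half_width(0.72, n):.2f} points")
```

Under those assumptions both sets land at about 1.4 points either way, with the deduped set only a few hundredths of a point wider. Losing 6% of rows costs little precision here; the cost grows quickly if dedup removes a large share of a small eval set.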

Last caveat. This only fixes train-eval leakage. If your eval set itself is unrepresentative of production traffic, dedup won't tell you. That's a different audit.

What we changed in the pipeline

Dedup now runs before every train-eval split, not after. The split script refuses to write an eval set if more than 0.5% of rows have a training near-duplicate above 0.7. It's a CI gate. Cheap to run, about 90 seconds on 94k documents, and it has already blocked two contaminated splits since.
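The gate logic is small. Here is a minimal sketch, assuming the near-duplicate scan has already produced a list of leaked eval indices. `ContaminatedSplitError` and `check_split` are hypothetical names, not our actual split script.

```python
class ContaminatedSplitError(RuntimeError):
    """Raised when too many eval rows have a near-duplicate in the training set."""

def check_split(n_eval_rows: int, leaked_indices, max_leak_rate: float = 0.005) -> float:
    """Return the leak rate, or raise if it exceeds max_leak_rate (0.5% by default)."""
    if n_eval_rows <= 0:
        raise ValueError("eval set is empty")
    rate = len(set(leaked_indices)) / n_eval_rows
    if rate > max_leak_rate:
        raise ContaminatedSplitError(
            f"{rate:.2%} of eval rows leak (limit {max_leak_rate:.2%}); refusing to write split"
        )
    return rate

# 247 leaked rows out of 4,000 is about 6.2%, far over the limit.
try:
    check_split(4000, range(247))
except ContaminatedSplitError as err:
    print(f"blocked: {err}")
```

Raising an exception rather than logging a warning is the point: the split script exits non-zero and CI fails before anyone trains on the bad split.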

The model was never the problem here. A clean eval set was.