
URL: https://dev.to/dev48v/sample-your-llm-5-times-and-take-a-majority-vote-accuracy-jumps-35-points-1fhh

Sample Your LLM 5 Times and Take a Majority Vote: Accuracy Jumps 35 Points


๐ŸŒ Live demo (LOOK ยท UNDERSTAND ยท BUILD): https://dev48v.infy.uk/prompt/day3-self-consistency.html

Day 3 of my PromptFromZero series. 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.

Today: self-consistency. The simplest +35-point accuracy lift you can give a small model. Pairs naturally with Chain of Thought (Day 2).


The setup

Same hard math problem. Same model. Five runs.

A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?

Correct answer: 90 min. (First has 30 km lead. Second closes at 20 km/h. 30 ÷ 20 = 1.5 h.)
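The arithmetic behind that answer, written out as a quick sanity check. The variable names are only illustrative:

```javascript
// First train's head start: 30 minutes at 60 km/h.
const headStartKm = 60 * (30 / 60); // 30 km

// The second train gains on the first at the difference of the two speeds.
const closingSpeedKmh = 80 - 60; // 20 km/h

// Time to close the gap, converted from hours to minutes.
const minutes = (headStartKm / closingSpeedKmh) * 60;

console.log(minutes); // 90
```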

A single CoT call on gemini-2.5-flash with temperature 0.7:

Run 1: 90 ✓

Looks right. But run it again:

Run 2: 22 ✗ (the model divided by 80 instead of 20)

You can't tell which run is the right one from a single call. The wrong run looks just as confident as the right one. Single-roll = stuck with whatever the dice say.


The fix

Sample the same prompt N times in parallel, extract each numeric answer, and take the majority.

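import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");

// The train problem from above. The last line (step-by-step cue, plain-number
// ending) is an assumed wording so that extract() has a number to find.
const prompt = `A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How many minutes after
the second train leaves do they meet?
Think step by step, then end with the answer in minutes as a plain number.`;

const N = 5;

// Fire all N calls in parallel; temperature > 0 keeps the samples diverse.
const samples = await Promise.all(
  Array.from({ length: N }, () =>
    generateText({ model, prompt, temperature: 0.7 })
  )
);

// Treat the last number in a response as that sample's answer.
function extract(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? nums[nums.length - 1] : null;
}

const answers = samples.map(s => extract(s.text));
console.log("Samples:", JSON.stringify(answers));

// Count votes, skipping samples that contained no number at all.
const tally = {};
for (const a of answers) {
  if (a !== null) tally[a] = (tally[a] || 0) + 1;
}

// Highest vote count wins; fall back to [null, 0] if nothing was extracted.
const [winner, votes] = Object.entries(tally)
  .sort((a, b) => b[1] - a[1])[0] ?? [null, 0];

console.log(`Final: ${winner} - ${votes}/${N} votes`);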

Now:

Samples: ["90","90","22","90","90"]
Final: 90 - 4/5 votes

The outlier got out-voted. The wrong answer never reaches the user.
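One gap in the tally above: extract() returns strings, so "90" and "90.0" would land in separate buckets and split the vote. A minimal sketch of a fix, assuming the answers are plain decimal numbers. canonicalNumber is a hypothetical helper name, not part of the AI SDK:

```javascript
// Collapse equivalent numeric strings ("90", "90.0", "90.00") into one key.
// Returns null for missing or non-numeric input so it can be filtered out.
function canonicalNumber(raw) {
  if (raw === null) return null;
  const n = Number(raw);
  return Number.isFinite(n) ? String(n) : null;
}

// Illustrative input: three spellings of 90, one outlier, one failed extraction.
const raw = ["90", "90.0", "22", "90.00", null];
const cleaned = raw.map(canonicalNumber).filter(a => a !== null);

console.log(cleaned); // [ '90', '90', '22', '90' ]
```

Run the extracted answers through this before tallying and the three spellings of 90 count as one answer.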


Why it works

Each LLM call is a stochastic sample from the model's probability distribution over outputs. With temperature 0 you'd get the same answer every time, so a wrong answer stays wrong no matter how often you ask. With temperature 0.7 the model takes slightly different reasoning paths: correct paths converge on one answer, while independent errors rarely line up on the same wrong one.

If the model is right 60% of the time on a problem:

  • 1 sample: 60% accuracy
  • 3 samples + majority: 1 - P(2 or 3 wrong) = 1 - (0.4³ + 3·0.4²·0.6) ≈ 65%
  • 5 samples + majority: ≈ 68% on this distribution
  • Stack that on Chain-of-Thought's lift over zero-shot (~+25 points) and you get roughly 95% accuracy on grade-school math.

Numbers depend on the model + problem. The shape is always the same, as long as the correct answer is the model's most likely one: more samples → fewer mistakes, with diminishing returns past N=10.
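The percentages above come from a simple binomial model. Here is a sketch that reproduces them, assuming every sample is independent and simply right or wrong (so all wrong samples agree with each other, which is the worst case for voting). The function names are illustrative:

```javascript
// Binomial coefficient C(n, k), built up iteratively.
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Probability that a strict majority of n samples is correct,
// when each sample is independently correct with probability p.
function majorityAccuracy(p, n) {
  let total = 0;
  for (let k = Math.floor(n / 2) + 1; k <= n; k++) {
    total += choose(n, k) * p ** k * (1 - p) ** (n - k);
  }
  return total;
}

// N=3 prints 64.8%, N=5 prints 68.3%: the 65% and 68% quoted above.
for (const n of [1, 3, 5]) {
  console.log(`N=${n}: ${(majorityAccuracy(0.6, n) * 100).toFixed(1)}%`);
}
```

Plug in a p below 0.5 and the curve flips: voting then amplifies the wrong answer, which is why the per-sample accuracy matters.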


Temperature matters

  • temp = 0 → (near-)deterministic, all 5 samples identical, defeats the point
  • temp = 0.7 → sweet spot, diverse reasoning paths, math stays valid
  • temp = 1.5 → too random, the model starts writing nonsense

You want diversity without losing competence. 0.7 is a common starting point.


Confidence for free

votes / N gives you a free confidence score:

  • 5/5 → trust it, auto-accept
  • 3-4/5 → use but flag for human review
  • ≤2/5 → the model is guessing, refuse to answer

You can build a calibrated AI product on top of this signal alone.
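The three buckets above map onto a small routing function. This is a sketch: decide and its return labels are hypothetical names, and expressing the thresholds as a vote share (so they carry over to other N) is an assumption rather than something the N=5 list spells out:

```javascript
// Turn a vote count into one of three actions.
// For N=5 this matches the list above: 5/5, 3-4/5, and 2/5 or fewer.
function decide(votes, n) {
  const share = votes / n;
  if (share === 1) return "accept";  // unanimous
  if (share > 0.5) return "review";  // clear majority, but not unanimous
  return "refuse";                   // no majority: the model is guessing
}

console.log(decide(5, 5)); // accept
console.log(decide(4, 5)); // review
console.log(decide(2, 5)); // refuse
```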


The trade-off: cost

N=5 means roughly 5× the tokens of a single call. Per request:

  • Single CoT: ~1k tokens, 60% accurate on hard math
  • Self-consistency (N=5): ~5k tokens, ~95% accurate

For high-stakes problems (medical, finance, code review, judgments), you pay 5× to lift accuracy from 60% to 95%. For low-stakes tasks (chat, summarization, creative writing), single-shot CoT is fine.


Non-numeric answers

For text answers (yes/no, multi-class), normalize before tallying. "Yes" / "yes." / " yes" should all count as one bucket.

function normalize(s) {
 return s.toLowerCase().replace(/[^a-z0-9]/g, "").trim();
}
const canonical = answers.map(normalize);
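Putting normalization and voting together for text answers might look like the sketch below. It repeats normalize so the block runs on its own. vote and the tied flag are illustrative additions, and the sample answers are made up:

```javascript
function normalize(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Plurality vote over normalized answers. Returns the winning bucket, its
// vote count, and whether a second bucket tied for first place.
function vote(answers) {
  const tally = new Map();
  for (const a of answers) {
    const key = normalize(a);
    if (key === "") continue; // skip empty or punctuation-only answers
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  const ranked = [...tally.entries()].sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { winner: null, votes: 0, tied: false };
  const [winner, votes] = ranked[0];
  const tied = ranked.length > 1 && ranked[1][1] === votes;
  return { winner, votes, tied };
}

console.log(vote(["Yes", "yes.", " yes", "No", "no!"]));
// { winner: 'yes', votes: 3, tied: false }
```

Keep this normalizer for text labels only: it strips "-" and ".", so a numeric answer like "-1.5" would collapse to "15".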

Build it in 10 minutes

mkdir self-consistency && cd self-consistency
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key" > .env

Get a free Gemini key at https://aistudio.google.com/apikey (no credit card needed).

Drop the JS snippet above into self-consistency.mjs and run it (the --env-file flag needs Node 20.6 or newer):

node --env-file=.env self-consistency.mjs

5 parallel calls. Tally. Winner.


Try it now

Three tabs on one page:
https://dev48v.infy.uk/prompt/day3-self-consistency.html

  • LOOK: animated 5-sample run with live tally bar chart
  • UNDERSTAND: 9 click-through steps on why it works
  • BUILD: full Node script, copy + run

What's next in PromptFromZero

Day 4: Few-shot prompting. Drop 2-3 worked examples in the prompt → the model copies the format and reasoning depth on the actual question. The poor man's fine-tune.

๐ŸŒ All techniques: https://dev48v.infy.uk/promptfromzero.php