Live demo (LOOK · UNDERSTAND · BUILD): https://dev48v.infy.uk/prompt/day3-self-consistency.html
Day 3 of my PromptFromZero series. 50 LLM techniques in 50 days, each visualized with LOOK / UNDERSTAND / BUILD.
Today: self-consistency. The simplest +35-point accuracy lift you can give a small model. Pairs naturally with Chain of Thought (Day 2).
The setup
Same hard math problem. Same model. Five runs.
A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?
Correct answer: 90 min. (The first train has a 30 km lead. The second closes at 20 km/h. 30 ÷ 20 = 1.5 h.)
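If you want to sanity-check that arithmetic, here is a minimal sketch. The helper name `catchUpMinutes` and its parameters are made up for illustration; they are not part of the demo code.

```javascript
// Hypothetical helper: minutes until a faster follower catches a slower leader.
function catchUpMinutes(leadSpeedKmh, chaseSpeedKmh, headStartMin) {
  const leadKm = leadSpeedKmh * (headStartMin / 60); // 60 km/h for 0.5 h = 30 km
  const closingKmh = chaseSpeedKmh - leadSpeedKmh;    // 80 - 60 = 20 km/h
  if (closingKmh <= 0) return Infinity;               // the follower never catches up
  return (leadKm / closingKmh) * 60;                  // convert hours to minutes
}

const correct = catchUpMinutes(60, 80, 30);
```

The step that matters is dividing by the closing speed (20 km/h), not by the second train's own speed.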
A single CoT call on gemini-2.5-flash with temperature 0.7:
Run 1: 90 ✓
Looks right. But run it again:
Run 2: 22 ✗ (the model divided by 80 instead of 20)
You can't tell which run is the right one from a single call. The wrong run looks just as confident as the right one. Single-roll = stuck with whatever the dice say.
The fix
Sample the same prompt N times in parallel, extract each numeric answer, and take the majority.
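import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const model = google("gemini-2.5-flash");
const N = 5;

// The problem from "The setup". The closing instruction is an assumption:
// it asks for step-by-step reasoning and a number on the last line.
const prompt = `A train leaves at 60 km/h. 30 minutes later, a second train
leaves the same station at 80 km/h on the same track. How
many minutes after the second train leaves do they meet?
Think step by step, then give the final answer as a number of minutes on the last line.`;

// Fire N independent samples in parallel.
const samples = await Promise.all(
  Array.from({ length: N }, () =>
    generateText({ model, prompt, temperature: 0.7 })
  )
);

// Treat the last number in the response as the answer.
function extract(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? nums[nums.length - 1] : null;
}

const answers = samples.map(s => extract(s.text));

// Tally votes, skipping samples where no number was found.
const tally = {};
for (const a of answers) if (a !== null) tally[a] = (tally[a] || 0) + 1;

const [winner, votes] = Object.entries(tally)
  .sort((a, b) => b[1] - a[1])[0] ?? [null, 0];

console.log("Samples:", JSON.stringify(answers));
console.log(`Final: ${winner} - ${votes}/${N} votes`);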
Now:
Samples: ["90","90","22","90","90"]
Final: 90 - 4/5 votes
The outlier got out-voted. The wrong answer never reaches the user.
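To see the extract-and-vote step without spending any API calls, you can run it on canned text. This is a sketch only: the `fakeOutputs` strings are invented stand-ins for model responses, and `majorityVote` is a hypothetical wrapper around the same tally logic.

```javascript
// Canned outputs (made up for this sketch) standing in for real model responses.
const fakeOutputs = [
  "Lead is 30 km, closing speed 20 km/h, so 1.5 h. Answer: 90",
  "The gap is 30 km. 30 / 20 = 1.5 hours = 90",
  "30 km divided by 80 km/h is 0.375 h, about 22",
  "They meet after 90",
  "I am not sure.",
];

// Take the last number in the text as the answer, or null if there is none.
function extract(text) {
  const nums = text.match(/-?\d+(?:\.\d+)?/g);
  return nums ? nums[nums.length - 1] : null;
}

// Count votes per answer, ignoring samples that produced no number.
function majorityVote(answers) {
  const tally = {};
  for (const a of answers) {
    if (a === null) continue;
    tally[a] = (tally[a] || 0) + 1;
  }
  const ranked = Object.entries(tally).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { winner: null, votes: 0, tally };
  const [winner, votes] = ranked[0];
  return { winner, votes, tally };
}

const result = majorityVote(fakeOutputs.map(extract));
```

Note that "last number in the text" is a fragile rule: a response that ends by mentioning some other number would be misread, which is one reason to ask the model to finish on the answer.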
Why it works
Each LLM call is a stochastic sample from the model's probability distribution over outputs. With temperature 0 you'd get the SAME (often-wrong) answer every time. With temperature 0.7 the model takes slightly different reasoning paths, and independent errors don't all line up.
If the model is right 60% of the time on a problem:
- 1 sample: 60% accuracy
- 3 samples + majority: 1 - P(2 or 3 wrong) = 1 - (0.4³ + 3·0.4²·0.6) ≈ 65%
- 5 samples + majority: ≈ 68% on this distribution
- Stack Chain-of-Thought's lift over zero-shot (~+25 points) on top: ~95% accuracy on grade-school math.
Numbers depend on the model + problem. The shape is always the same: more samples → fewer mistakes, with diminishing returns past N=10.
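The 65% and 68% figures come from a simple binomial model: every sample is independently right with probability p, and the vote succeeds when more than half are right. Here is a sketch of that calculation. The function names are mine, and real models will not satisfy the independence assumption exactly.

```javascript
// n choose k, computed iteratively to avoid large factorials.
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Probability that more than half of n independent samples are correct,
// when each sample is correct with probability p. Intended for odd n.
function majorityAccuracy(p, n) {
  const need = Math.floor(n / 2) + 1;
  let total = 0;
  for (let k = need; k <= n; k++) {
    total += choose(n, k) * p ** k * (1 - p) ** (n - k);
  }
  return total;
}

// Accuracy for a model that is right 60% of the time, at a few sample counts.
const table = [1, 3, 5, 11].map(n => [n, majorityAccuracy(0.6, n)]);
```

In this simplified model the vote only helps when p is above 0.5. Below that, more samples make the majority more reliably wrong.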
Temperature matters
- temp = 0 → deterministic, all 5 samples identical, defeats the point
- temp = 0.7 → sweet spot, diverse reasoning paths, math stays valid
- temp = 1.5 → too random, the model starts writing nonsense
You want diversity without losing competence. 0.7 is the standard.
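For intuition on what the temperature knob does, here is a toy sketch of the usual textbook formulation: divide the model's raw scores by the temperature before turning them into probabilities. The scores below are invented, and a hosted API may handle details (including temperature 0) differently.

```javascript
// Turn raw scores into probabilities, scaled by temperature.
// Temperature 0 is treated here as "always pick the top score".
function softmaxWithTemperature(logits, temperature) {
  if (temperature === 0) {
    const first = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === first ? 1 : 0));
  }
  const scaled = logits.map(x => x / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Made-up scores for three candidate next tokens.
const logits = [2.0, 1.0, 0.1];
const cold = softmaxWithTemperature(logits, 0);
const mid = softmaxWithTemperature(logits, 0.7);
const hot = softmaxWithTemperature(logits, 1.5);
```

Lower temperature piles probability onto the top candidate; higher temperature spreads it toward the unlikely ones, which is where the nonsense comes from.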
Confidence for free
votes / N gives you a free confidence score:
- 5/5 → trust it, auto-accept
- 3-4/5 → use but flag for human review
- ≤2/5 → the model is guessing, refuse to answer
You can build a calibrated AI product on top of this signal alone.
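Those three tiers fit in a tiny routing function. A sketch, assuming the thresholds generalize from N=5 as "unanimous", "more than half", and "everything else"; the name `decide` and the returned labels are made up here.

```javascript
// Map a vote count to an action, following the 5/5, 3-4/5, <=2/5 tiers.
function decide(votes, n) {
  const share = votes / n;
  if (share === 1) return "accept"; // unanimous, e.g. 5/5
  if (share > 0.5) return "review"; // clear majority, e.g. 3/5 or 4/5
  return "refuse";                  // 2/5 or fewer: the model is guessing
}

const actions = [5, 4, 3, 2, 1].map(v => decide(v, 5));
```

Where you put the cut-offs is a product decision; it is worth checking them against labeled examples before trusting the score.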
The trade-off: cost
N=5 = 5× the tokens of a single call. Per request:
- Single CoT: ~1k tokens, 60% accurate on hard math
- Self-consistency (N=5): ~5k tokens, ~95% accurate
For high-stakes problems (medical, finance, code review, judgments), you pay 5× to lift accuracy from 60 → 95%. For low-stakes tasks (chat, summarization, creative writing), single-shot CoT is fine.
Non-numeric answers
For text answers (yes/no, multi-class), normalize before tallying. "Yes" / "yes." / " yes" should all count as one bucket.
function normalize(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "").trim();
}
const canonical = answers.map(normalize);
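Put together with the tally, a text vote might look like the sketch below. `voteOnText` is a hypothetical helper and the sample answers are invented; it also assumes each response is already trimmed down to just the answer.

```javascript
// Lowercase and strip everything except letters and digits.
function normalize(s) {
  return s.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Bucket answers by their normalized form and return the most common bucket.
function voteOnText(rawAnswers) {
  const tally = {};
  for (const raw of rawAnswers) {
    const key = normalize(raw);
    if (key === "") continue; // nothing usable in this sample
    tally[key] = (tally[key] || 0) + 1;
  }
  const ranked = Object.entries(tally).sort((a, b) => b[1] - a[1]);
  return ranked.length
    ? { winner: ranked[0][0], votes: ranked[0][1] }
    : { winner: null, votes: 0 };
}

const verdict = voteOnText(["Yes", "yes.", " yes", "No", "no!"]);
```

This only works when answers are short labels. Free-form sentences would each land in their own bucket and never reach a majority.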
Build it in 10 minutes
mkdir self-consistency && cd self-consistency
npm init -y
npm install ai @ai-sdk/google
echo "GOOGLE_GENERATIVE_AI_API_KEY=your_key" > .env
Get a free Gemini key at https://aistudio.google.com/apikey (no credit card).
Drop the JS snippet above into self-consistency.mjs and run:
node --env-file=.env self-consistency.mjs
5 parallel calls. Tally. Winner.
Try it now
Three tabs on one page:
https://dev48v.infy.uk/prompt/day3-self-consistency.html
- LOOK: animated 5-sample run with live tally bar chart
- UNDERSTAND: 9 click-through steps on why it works
- BUILD: full Node script, copy + run
What's next in PromptFromZero
Day 4: Few-shot prompting. Drop 2-3 worked examples in the prompt and the model copies the format and reasoning depth on the actual question. The poor man's fine-tune.
All techniques: https://dev48v.infy.uk/promptfromzero.php
