URL: https://dev.to/sendotltd/a-scatter-plot-explorer-for-world-statistics-log-scales-and-hand-rolled-pearson-correlation-2l5m

A Scatter-Plot Explorer for World Statistics: Log Scales and Hand-Rolled Pearson Correlation


"Do countries with higher GDP per capita also have longer life expectancy?" I built a tool that lets you explore questions like that across 48 countries by picking any two of five metrics as scatter-plot axes. Two implementation hinges: (1) metrics that span orders of magnitude (population: Singapore 5.6M to India 1,417M, a 250× range) must be plotted and correlated on a log scale or every point collapses to one corner, and (2) a hand-rolled Pearson correlation coefficient recomputed live as you change axes. Vanilla JS, no chart library, with 34 Node tests on the computation layer.

🌐 Demo: https://sen.ltd/portfolio/global-stats/
πŸ“¦ GitHub: https://github.com/sen-ltd/global-stats

The data model

48 countries × 5 metrics (population, GDP per capita, life expectancy, CO2 per capita, area):

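// Units inferred from the values: population in millions, GDP per capita in USD,
// life expectancy in years, area in thousand km²
{ name: "Japan", code: "JP", region: "アジア", // region label is Japanese for "Asia"
  population: 125.1, gdpPerCapita: 33800, lifeExpectancy: 84.5,
  co2PerCapita: 8.5, area: 378 },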

Metric definitions live in a separate table with a log flag:

export const METRICS = [
 { key: "population", label: "...", log: true },
 { key: "gdpPerCapita", label: "...", log: true },
 { key: "lifeExpectancy", label: "...", log: false },
 { key: "co2PerCapita", label: "...", log: true },
 { key: "area", label: "...", log: true },
];

Only life expectancy is log: false. That distinction does real work.
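
The later snippets call a getMetric(key) helper that is never shown in this post. As a minimal sketch, assuming it is a plain lookup into the METRICS table that fails loudly on an unknown key (the error behaviour and the placeholder labels below are assumptions, not the repository's code), it could look like this:

```javascript
// Stand-in for the METRICS table above; the labels are placeholders.
const METRICS = [
  { key: "population", label: "Population", log: true },
  { key: "gdpPerCapita", label: "GDP per capita", log: true },
  { key: "lifeExpectancy", label: "Life expectancy", log: false },
  { key: "co2PerCapita", label: "CO2 per capita", log: true },
  { key: "area", label: "Area", log: true },
];

// Assumed helper: find a metric definition by key, throwing on typos
// so a misspelled axis name cannot silently fall back to a linear scale.
function getMetric(key) {
  const metric = METRICS.find((m) => m.key === key);
  if (!metric) throw new Error(`Unknown metric: ${key}`);
  return metric;
}

console.log(getMetric("lifeExpectancy").log); // false
console.log(getMetric("population").log); // true
```

Throwing on an unknown key is one design choice among several; returning undefined would also work, but would push the check onto every caller.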

Why log scale is non-negotiable

Plot "population vs GDP" on linear axes and it's a disaster. Population spans 250× (Singapore to India); GDP per capita spans 100× (Ethiopia $1,030 to Norway $106,150). On linear axes:

  • nearly every point collapses into the bottom-left corner
  • China and India alone stick to the right edge
  • the correlation coefficient gets dragged around by the big outliers

The fix is log transformation: equal spacing per order of magnitude, so countries of wildly different size share one viewport. Linear metrics like life expectancy (52–85 years, a mere 1.6× range) stay linear.

export function normalize(value, metric, domainMin, domainMax) {
  if (metric.log) {
    const lv = Math.log10(value);
    const lmin = Math.log10(domainMin);
    const lmax = Math.log10(domainMax);
    if (lmax === lmin) return 0.5;
    return (lv - lmin) / (lmax - lmin);
  }
  if (domainMax === domainMin) return 0.5;
  return (value - domainMin) / (domainMax - domainMin);
}

Tested by asserting the geometric midpoint maps to center:

test("log: geometric midpoint → 0.5", () => {
  const m = getMetric("gdpPerCapita");
  // domain 1000..100000, geometric mean = 10000 → 0.5
  assert.ok(Math.abs(normalize(10000, m, 1000, 100000) - 0.5) < 1e-9);
});

Linear would put 50500 at the center; log puts the geometric mean 10000 there. That difference is what "thinking in orders of magnitude" means.
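
To make the "collapse into one corner" claim concrete, here is a small sketch that places the three populations quoted above on both kinds of axis. It reuses the logic of normalize() but, to stay self-contained, takes the log flag as a plain boolean instead of a metric object; that signature is a simplification for this example only.

```javascript
// Same arithmetic as normalize() above, with the log flag passed directly.
function normalize(value, useLog, domainMin, domainMax) {
  const f = useLog ? Math.log10 : (v) => v;
  const lo = f(domainMin);
  const hi = f(domainMax);
  if (hi === lo) return 0.5; // degenerate domain: park the point mid-axis
  return (f(value) - lo) / (hi - lo);
}

// Populations in millions, as quoted in the article.
const populations = { Singapore: 5.6, Japan: 125.1, India: 1417 };
const min = 5.6;
const max = 1417;

for (const [name, pop] of Object.entries(populations)) {
  const linear = normalize(pop, false, min, max).toFixed(3);
  const log = normalize(pop, true, min, max).toFixed(3);
  console.log(`${name}: linear ${linear}, log ${log}`);
}
```

On the linear axis Japan lands at roughly 0.08, squeezed against Singapore at 0; on the log axis it sits at roughly 0.56, a little past the middle.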

Hand-rolled Pearson correlation

Pick two axes, get a coefficient r. Straight from the definition:

export function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX, dy = ys[i] - meanY;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  if (den === 0) return null; // zero variance → undefined
  return num / den;
}

Returning null for zero variance matters: 0/0 = NaN would corrupt axis labels downstream. Handle undefined explicitly.
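
On the view side, handling the null case explicitly can be as small as one formatting function. The sketch below is hypothetical: the name formatCorrelation and the "n/a" wording are illustrations, not taken from the repository.

```javascript
// Hypothetical view helper: turn the result of pearson() into label text.
// null (or any non-finite number that slipped through) becomes "n/a"
// instead of being printed as "NaN".
function formatCorrelation(r) {
  if (r === null || !Number.isFinite(r)) return "r = n/a";
  return `r = ${r.toFixed(2)}`;
}

console.log(formatCorrelation(0.8412)); // "r = 0.84"
console.log(formatCorrelation(null)); // "r = n/a"
```

Keeping the check in one place means the rest of the rendering code never has to ask whether r is a number.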

test("perfect positive correlation = 1", () => {
  assert.ok(Math.abs(pearson([1, 2, 3], [2, 4, 6]) - 1) < 1e-9);
});
test("no correlation ≈ 0", () => {
  // a shape symmetric around x = 0 (here y = x²) has zero LINEAR correlation
  assert.ok(Math.abs(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) < 1e-9);
});
test("zero variance → null", () => {
  assert.equal(pearson([5, 5, 5], [1, 2, 3]), null);
});

The V-shape test is the important one: zero correlation means zero linear correlation, not "no relationship." A parabola sampled symmetrically around its vertex has Pearson r = 0 even though y is fully determined by x. The test documents that limitation.
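
The zero depends on the sampling being symmetric. As a quick sketch (with a compact copy of pearson() so it runs on its own), the same y = x² relationship sampled on one side of the vertex only gives a coefficient close to 1:

```javascript
// Compact copy of the pearson() shown above.
function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX, dy = ys[i] - meanY;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  return den === 0 ? null : num / den;
}

const square = (x) => x * x;
const symmetric = [-2, -1, 0, 1, 2]; // both arms of the parabola
const oneSided = [0, 1, 2, 3, 4]; // right arm only

const rSymmetric = pearson(symmetric, symmetric.map(square));
const rOneSided = pearson(oneSided, oneSided.map(square));

console.log(rSymmetric); // 0: the two arms cancel
console.log(rOneSided.toFixed(2)); // about 0.96: nearly "linear" on one arm
```

So r says how well a straight line fits the points you happened to sample, not whether a relationship exists.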

Correlate in log space too

The key insight: if you display on a log scale, you must correlate on log-transformed values to match. Power-law relationships (y = ax^b) become straight lines in log-log space (log y = b·log x + log a), so Pearson on the logs captures the true strength:

// pool defaults to the full dataset (assumed here) so callers such as the test below can omit it
export function metricCorrelation(keyX, keyY, pool = COUNTRIES) {
  const mx = getMetric(keyX), my = getMetric(keyY);
  const xs = [], ys = [];
  for (const c of pool) {
    let x = c[keyX], y = c[keyY];
    if (mx.log) x = Math.log10(x); // power-law → linear
    if (my.log) y = Math.log10(y);
    xs.push(x); ys.push(y);
  }
  return pearson(xs, ys);
}

test("GDP vs life expectancy is a strong positive correlation", () => {
  assert.ok(metricCorrelation("gdpPerCapita", "lifeExpectancy") > 0.5);
});

The actual value comes out at r ≈ 0.84: the famous Preston curve (income vs longevity) reproduced from the data. GDP is log, life expectancy is linear, so it's a semi-log correlation, which matches the economics finding that life expectancy scales with the logarithm of income.
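
The power-law argument can be checked on synthetic data. The sketch below uses made-up values (a = 3, b = 0.5, x spanning four orders of magnitude), not the country dataset, and includes a compact copy of pearson() so it runs standalone:

```javascript
// Compact copy of the pearson() shown earlier.
function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX, dy = ys[i] - meanY;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  return den === 0 ? null : num / den;
}

// Synthetic, noise-free power law y = a * x^b (illustrative values only).
const a = 3;
const b = 0.5;
const xs = [1, 10, 100, 1000, 10000];
const ys = xs.map((x) => a * Math.pow(x, b));

const rRaw = pearson(xs, ys); // correlate the raw values
const rLog = pearson(xs.map((x) => Math.log10(x)), ys.map((y) => Math.log10(y)));

console.log(rRaw.toFixed(3)); // below 1, although y is an exact function of x
console.log(rLog.toFixed(3)); // 1.000: a straight line in log-log space
```

On raw values the largest point dominates and the curve still bends, so r falls short of 1. After the log transform the relationship is exactly linear and r reaches 1 up to floating-point error.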

Y-axis inversion

SVG's origin is top-left, so "bigger value = higher up" needs a Y flip:

return pool.map((c) => ({
 country: c,
 cx: normalize(c[keyX], mx, dx.min, dx.max),
 cy: 1 - normalize(c[keyY], my, dy.min, dy.max), // invert
}));

Guarded by a test:

test("y is inverted: highest life-expectancy country has smallest cy", () => {
  const top = pts.reduce((a, b) => (b.country.lifeExpectancy > a.country.lifeExpectancy ? b : a));
  for (const p of pts) {
    if (p.country.code !== top.country.code) assert.ok(p.cy >= top.cy - 1e-9);
  }
});
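
The cx and cy values above live in the unit square. Turning them into SVG pixel coordinates is a separate, purely linear step. The helper below is a hypothetical sketch (the name toPixels, the 600×400 viewport and the 40px padding are all assumptions) showing that once cy is already inverted, no second flip is needed:

```javascript
// Hypothetical helper: map unit-square coordinates to SVG pixels,
// leaving `pad` pixels of margin on every side for axis labels.
function toPixels(cx, cy, width, height, pad) {
  return {
    x: pad + cx * (width - 2 * pad),
    y: pad + cy * (height - 2 * pad), // cy is already flipped, so just scale
  };
}

// Smallest value on both axes: normalize() gives 0, so cx = 0 and cy = 1 - 0 = 1.
const bottomLeft = toPixels(0, 1, 600, 400, 40);
// Largest value on both axes: cx = 1 and cy = 1 - 1 = 0.
const topRight = toPixels(1, 0, 600, 400, 40);

console.log(bottomLeft); // { x: 40, y: 360 }
console.log(topRight); // { x: 560, y: 40 }
```

Doing the flip once, in data space, keeps the pixel mapping identical for both axes.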

Data integrity tests

Hardcoded public data deserves integrity checks, and for a log-scale tool, "all metrics positive" is a precondition, not a nicety (log10(0) = -∞, log10(negative) = NaN):

test("no duplicate ISO codes", () => { /* ... */ });
test("every metric field is present and positive", () => {
  for (const c of COUNTRIES)
    for (const m of METRICS)
      assert.ok(typeof c[m.key] === "number" && c[m.key] > 0);
});
test("life expectancy in a sane range (40-90)", () => { /* ... */ });
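
Two small illustrations of the checks above. The first lines show why non-positive values are fatal on a log axis. The second part is one possible way to detect duplicate codes; findDuplicateCodes and the sample rows are hypothetical and do not reproduce the elided test body.

```javascript
// log10 has no finite value at or below zero.
console.log(Math.log10(0)); // -Infinity
console.log(Math.log10(-1)); // NaN

// Hypothetical helper: return every code that appears more than once.
function findDuplicateCodes(countries) {
  const seen = new Set();
  const dupes = new Set();
  for (const c of countries) {
    if (seen.has(c.code)) dupes.add(c.code);
    seen.add(c.code);
  }
  return [...dupes];
}

// Made-up rows, only for demonstration.
const sample = [{ code: "JP" }, { code: "NO" }, { code: "JP" }];
console.log(findDuplicateCodes(sample)); // [ 'JP' ]
```

Returning the offending codes rather than a boolean makes a failing test message point straight at the bad row.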

Architecture

data.js ← 48 countries × 5 metrics (World Bank / UN / OWID ~2022)
core.js ← pearson, normalize (log-aware), scatter scaling, region aggregation (DOM-free, 34 tests)
app.js ← SVG scatter + sortable table

Try it

Set the axes to "CO2 vs GDP" for a clear positive correlation (richer = more emissions). Set "population vs life expectancy" for near-zero (big and small countries live equally long). Colors encode region.

Takeaways

  • Order-of-magnitude metrics (population, GDP, area) need log scales or points collapse. Linear metrics (life expectancy) stay linear. A per-metric log flag toggles both.
  • Pearson is implementable from the definition. Return null for zero variance β€” don't leak NaN into the view.
  • Display on log β†’ correlate on log. Power laws straighten out in log-log space.
  • Pearson measures linear correlation only. A V-shape test documents that.
  • With log scales, "all values positive" is a precondition, not a check. Test it.
  • GDP vs life expectancy gives r β‰ˆ 0.84 β€” the Preston curve, straight from the data.

This is OSS portfolio #262 from SEN LLC (Tokyo). https://sen.ltd/portfolio/