URL: https://dev.to/dev48v/loss-functions-mse-vs-mae-vs-cross-entropy-visualized-3bkn

Loss Functions: MSE vs MAE vs Cross-Entropy, Visualized


Pick the wrong loss function and your model optimises the wrong thing — perfectly. The loss is the single number training tries to shrink, so it quietly defines what "wrong" even means. I built an interactive visualiser of MSE, MAE, and cross-entropy so you can see why the choice matters.

🎯 Drag the prediction: https://dev48v.infy.uk/dl/day6-loss-functions.html

This is Day 6 of DeepLearningFromZero.

Loss = one number for "how wrong"

The network's output is compared to the truth and collapsed into one scalar. Everything in training exists to make that number smaller. Choose the loss and you've defined the network's entire goal.

MSE — square the error (regression)

const mse = (pred, y) => (pred - y) ** 2;

Squaring means off-by-4 hurts 16×, off-by-1 hurts 1×. MSE obsesses over large errors — great when big misses are unacceptable, risky when outliers will drag the model around.
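To make the "obsesses over large errors" point concrete, here is a minimal sketch that averages MSE over a small batch. The data is made up for illustration, and `meanLoss` is a hypothetical helper, not something from the demo.

```javascript
// Squared error for one prediction, as defined above.
const mse = (pred, y) => (pred - y) ** 2;

// Hypothetical helper: average a per-sample loss over a batch.
const meanLoss = (loss, preds, ys) =>
  preds.reduce((sum, p, i) => sum + loss(p, ys[i]), 0) / preds.length;

// Made-up data: four small misses and one larger one.
const targets = [10, 10, 10, 10, 10];
const preds = [11, 9, 11, 9, 14]; // errors: 1, -1, 1, -1, 4

const perSample = preds.map((p, i) => mse(p, targets[i])); // [1, 1, 1, 1, 16]
const total = perSample.reduce((a, b) => a + b, 0); // 20
const outlierShare = perSample[4] / total; // share of the loss from the off-by-4 miss

console.log(meanLoss(mse, preds, targets)); // 4
console.log(outlierShare); // 0.8
```

One miss out of five accounts for 80% of the total loss, so that is where the optimiser will spend most of its effort.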

MAE — absolute error, outlier-robust

const mae = (pred, y) => Math.abs(pred - y);

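Linear penalty: off-by-4 hurts exactly 4× as much as off-by-1, so one wild outlier can't dominate. The trade-off is a gradient of constant size (±1, and undefined at exactly zero error): the update doesn't shrink as the prediction closes in, so with a fixed learning rate training can hover around the answer instead of settling on it.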
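One way to see the outlier-robustness is to ask which single constant prediction each loss prefers for a whole dataset. The sketch below brute-forces that over integer candidates; the targets are made up, and `totalLoss` and `bestConstant` are hypothetical helpers for illustration only.

```javascript
const mse = (pred, y) => (pred - y) ** 2;
const mae = (pred, y) => Math.abs(pred - y);

// Hypothetical helper: total loss of one constant prediction over all targets.
const totalLoss = (loss, pred, ys) =>
  ys.reduce((sum, y) => sum + loss(pred, y), 0);

// Hypothetical helper: brute-force search for the candidate with the lowest total loss.
const bestConstant = (loss, ys, candidates) =>
  candidates.reduce((best, c) =>
    totalLoss(loss, c, ys) < totalLoss(loss, best, ys) ? c : best);

const ys = [1, 2, 3, 4, 100]; // made-up targets with one wild outlier
const candidates = Array.from({ length: 101 }, (_, i) => i); // 0..100

console.log(bestConstant(mse, ys, candidates)); // 22, the mean
console.log(bestConstant(mae, ys, candidates)); // 3, the median
```

MSE gets pulled to the mean, which the outlier has dragged far from the typical values, while MAE stays at the median.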

Cross-entropy — for classification

When the output is a probability, MSE is a poor fit. Cross-entropy rewards confident-and-right and brutally punishes confident-and-wrong:

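const clip = (p, eps = 1e-12) => Math.min(Math.max(p, eps), 1 - eps); // keep p off exactly 0 and 1 (eps is an assumed value) so 0 * Math.log(0) never yields NaN
const bce = (p, y) => -(y * Math.log(clip(p)) + (1 - y) * Math.log(1 - clip(p)));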

Predict 1% for the true class and the loss is already about 4.6 (that is, -ln 0.01), and it grows without bound as the prediction approaches 0. In the demo, switch to Classification and slide p toward 0 to watch it explode.
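A quick table makes the growth visible. This is a small sketch using a plain binary cross-entropy with the true label fixed at 1; the probabilities are arbitrary sample points kept strictly between 0 and 1.

```javascript
// Plain binary cross-entropy (natural log); inputs here stay strictly inside (0, 1).
const bce = (p, y) => -(y * Math.log(p) + (1 - y) * Math.log(1 - p));

// Arbitrary sample predictions for a sample whose true label is 1.
const probs = [0.99, 0.9, 0.5, 0.1, 0.01, 0.001];
const rows = probs.map((p) => ({ p, loss: bce(p, 1) }));

for (const { p, loss } of rows) console.log(p, loss.toFixed(3));
// 0.99 0.010
// 0.9 0.105
// 0.5 0.693
// 0.1 2.303
// 0.01 4.605
// 0.001 6.908
```

Each tenfold drop in the predicted probability adds about 2.3 (ln 10) to the loss, so there is no ceiling as p heads toward 0.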

The slope is what learning actually uses

Training doesn't follow the loss value itself: backprop computes the loss's gradient (slope), and gradient descent steps downhill against it. That's why the shape matters: cross-entropy's steep slope when very wrong gives a strong corrective push, helping classifiers learn faster than MSE would.

grad = dLoss / dPred; // slope of the loss w.r.t. the prediction; gradient descent steps opposite to it
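The slopes can be written out directly. The sketch below uses the standard derivatives of the three losses with respect to the prediction, plus two hypothetical helpers (`numGrad` for a finite-difference check and `descend` for a bare-bones gradient descent loop on a single number). The learning rate and step count are arbitrary choices, and a real network would update weights rather than the prediction itself.

```javascript
// Binary cross-entropy per sample; inputs here stay strictly inside (0, 1).
const bce = (p, y) => -(y * Math.log(p) + (1 - y) * Math.log(1 - p));

// Analytic derivatives with respect to the prediction.
const dMse = (pred, y) => 2 * (pred - y);
const dMae = (pred, y) => Math.sign(pred - y); // always -1, 0, or +1
const dBce = (p, y) => (p - y) / (p * (1 - p)); // valid for 0 < p < 1

// Hypothetical helper: central finite difference, to sanity-check a derivative.
const numGrad = (loss, pred, y, h = 1e-6) =>
  (loss(pred + h, y) - loss(pred - h, y)) / (2 * h);

// True label is 1; compare slopes as the prediction gets more wrong.
for (const p of [0.9, 0.5, 0.01]) {
  console.log(p, dMse(p, 1).toFixed(2), dBce(p, 1).toFixed(2));
}
// 0.9 -0.20 -1.11
// 0.5 -1.00 -2.00
// 0.01 -1.98 -100.00

// Hypothetical helper: plain gradient descent on a single prediction.
const descend = (grad, pred, y, learningRate = 0.1, steps = 50) => {
  for (let i = 0; i < steps; i++) pred -= learningRate * grad(pred, y);
  return pred;
};

console.log(descend(dMse, 0, 3)); // ends very close to 3: steps shrink with the error
console.log(descend(dMae, 0, 3.05)); // ends about 0.05 away: fixed-size steps keep hopping over the target
```

At p = 0.01 the cross-entropy slope is roughly fifty times the MSE slope, which is the "strong corrective push" in numbers.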

Choosing the loss is a design decision

Predicting a price? MSE or MAE. Yes/no? Binary cross-entropy. One-of-many? Categorical cross-entropy. Same network, different loss, genuinely different behaviour — because the loss encodes what you actually care about.
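Categorical cross-entropy is the only loss named here without code, so here is a minimal sketch for a single sample. It assumes the network emits raw scores (logits) that are turned into probabilities with softmax; the scores and function names are made up for illustration.

```javascript
// Softmax: turn raw scores (logits) into probabilities that sum to 1.
// Subtracting the max first is a common trick to avoid overflow in Math.exp.
const softmax = (logits) => {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
};

// Categorical cross-entropy for one sample: -log of the probability
// assigned to the true class (trueIndex is the class label as an array index).
const categoricalCrossEntropy = (probs, trueIndex) => -Math.log(probs[trueIndex]);

const logits = [2.0, 1.0, 0.1]; // made-up scores for three classes
const probs = softmax(logits);

console.log(categoricalCrossEntropy(probs, 0)); // smaller: class 0 has the highest score
console.log(categoricalCrossEntropy(probs, 2)); // larger: class 2 has the lowest score
```

With two classes this reduces to the binary cross-entropy above, which is why the two are usually treated as the same idea at different sizes.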

The takeaway

The loss is the goal. Match it to the task, and remember its slope is what drives the learning. Drag the prediction in the demo and watch MSE's parabola tower over MAE's gentle V once the error grows past 1.