
Loss Functions: How a Neural Network Knows It's Wrong

We now have a network that can make predictions. But the weights start random, so those predictions are wrong. To fix them, we first need a way to measure how wrong. That measurement is the loss function.


Why it exists

Imagine you’re training a model to predict house prices. It predicts $320,000 for a house that sold for $300,000. How wrong is that? You need one number that captures the error - something you can minimise.

That number is the loss (also called the cost or error). The lower the loss, the closer the predictions are to the correct answers. Training is the process of adjusting weights to push the loss down.

The loss function is the formula that computes this number. Different tasks need different formulas - predicting a continuous number (price, temperature) needs a different formula than predicting a category (spam/not-spam, cat/dog).

A number line showing actual vs predicted value with the error gap labelled, illustrating what MSE measures.


How it works

MSE - for regression

MSE (Mean Squared Error): MSE = (1/n) × Σ (predicted - actual)²

For each prediction, compute the difference from the actual value, square it, then average across all examples. Squaring does two things: it makes all errors positive (so over- and under-predictions don’t cancel), and it punishes large errors more than small ones - a $100k error is not just 10× worse than a $10k error, it’s 100× worse in the loss.
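
To put numbers on that, here is a tiny sketch (plain Python, with illustrative errors of $10k and $100k):

errors = [10_000, 100_000]            # a $10k miss and a $100k miss
squared = [e ** 2 for e in errors]    # the per-example terms that MSE averages
print(squared[1] / squared[0])        # 100.0 - the larger miss contributes 100x more loss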

Pro: Smooth and differentiable everywhere - the gradient always exists, which is essential for backpropagation (post 1.4). Heavily penalises large outlier errors, so the model is pushed hard to fix its worst predictions.

Con: That same heavy penalty on outliers can be a problem if your data has genuine outliers you don’t care about - a single extreme value can dominate the loss and pull training in the wrong direction.

MAE - a simpler alternative

MAE (Mean Absolute Error): MAE = (1/n) × Σ |predicted - actual|

Instead of squaring the error, take the absolute value. A $100k error is exactly 10× worse than a $10k error - no amplification.

Pro: More robust to outliers than MSE. Each error contributes proportionally, so extreme values don’t dominate training.

Con: Not differentiable at zero. When predicted = actual, the gradient of |x| is undefined - there’s a sharp corner. In practice this is handled with a subgradient, but it makes training less stable than MSE, especially near the minimum.

This is a preview of why differentiability matters - gradient descent (post 1.4) needs a smooth, well-defined gradient at every point to work reliably.
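
You can see the sharp corner directly with autograd. A minimal check (illustrative values; it relies on PyTorch returning 0 as the chosen subgradient of |x| at exactly zero):

import torch

actual = torch.tensor(3.0)

for value in [2.999, 3.000, 3.001]:    # just below, at, and just above the target
    pred = torch.tensor(value, requires_grad=True)
    (pred - actual).abs().backward()
    print(value, pred.grad.item())     # gradient jumps from -1 to +1, with 0 at the corner itself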

Huber loss - best of both

Huber loss combines MSE and MAE: it behaves like MSE for small errors (smooth, stable gradient) and like MAE for large errors (less sensitive to outliers).

Huber(δ) = 0.5 × error² if |error| ≤ δ, else δ × (|error| - 0.5 × δ)

The threshold δ controls where the switch happens. A common default is δ = 1.

Pro: Gets the stability of MSE near zero and the outlier robustness of MAE. Often a better default than either for regression tasks.

Con: One extra hyperparameter (δ) to tune. In practice the default works well enough that this is rarely an issue.
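
To make the switch at δ concrete, here is the piecewise formula written out by hand (an illustrative sketch with made-up error values), next to what the quadratic branch alone would give:

import torch

def huber(error, delta=1.0):
    # quadratic below the threshold, linear above it
    abs_err = error.abs()
    return torch.where(abs_err <= delta,
                       0.5 * error ** 2,
                       delta * (abs_err - 0.5 * delta))

errors = torch.tensor([0.1, 0.5, 1.0, 5.0, 50.0])
print(huber(errors))        # 0.005, 0.125, 0.5, 4.5, 49.5 - grows linearly past delta
print(0.5 * errors ** 2)    # 0.005, 0.125, 0.5, 12.5, 1250 - the pure quadratic explodes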

Cross-entropy - for classification

Cross-entropy loss is used when you’re predicting a category.

Before computing the loss, the network outputs a score for each class. A softmax function converts those scores into probabilities that sum to 1. Cross-entropy then measures how far that predicted probability distribution is from the true one (where the correct class has probability 1 and everything else has 0).

Cross-entropy = -log(predicted probability of the correct class)

The key insight is the -log. If the model assigns 0.9 probability to the right class, loss = -log(0.9) ≈ 0.1 - low, good. If it assigns 0.01 probability, loss = -log(0.01) ≈ 4.6 - very high. The log heavily penalises confident wrong answers.

Think of it like a confidence penalty: being unsure costs you a little, being confidently wrong costs you a lot.
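
A quick way to feel this is to sweep the probability the model assigns to the correct class (illustrative values) and print the loss:

import math

for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p(correct) = {p:.2f}  ->  loss = {-math.log(p):.2f}")

# losses: 0.01, 0.11, 0.69, 2.30, 4.61 - cheap while merely unsure, steep once confidently wrong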

Binary cross-entropy is the two-class version (spam/not-spam). Categorical cross-entropy handles more than two classes (cat/dog/bird).
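
For the two-class case the same -log idea collapses to one formula: -(y·log(p) + (1-y)·log(1-p)), where p is the predicted probability of the positive class. A small sketch with made-up spam probabilities, checked against PyTorch's built-in version:

import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.6, 0.1])   # predicted probability of "spam" for three emails
y = torch.tensor([1.0, 1.0, 0.0])   # true labels: spam, spam, not spam

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
print(manual)                                 # per-email losses: ~0.11, ~0.51, ~0.11
print(F.binary_cross_entropy(p, y).item())    # ~0.24, the mean of the three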

Pro: Directly optimises for confident correct predictions. Differentiable everywhere (no sharp corners). Works naturally with softmax outputs.

Con: Sensitive to class imbalance - if 99% of your data is one class, the model can get low loss by always predicting that class while being useless on the rare class. Needs techniques like class weighting to handle this.
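
To see the imbalance problem in numbers, here is an illustrative sketch (my own made-up dataset: 99 examples of class 0 and one of class 1) with a "lazy" model that always predicts the majority class, plus the weight argument that PyTorch's CrossEntropyLoss exposes for class weighting:

import torch
import torch.nn as nn

# 100 labels: 99 of class 0, one of class 1
targets = torch.cat([torch.zeros(99, dtype=torch.long),
                     torch.ones(1, dtype=torch.long)])

# a lazy model: always very confident the answer is class 0
logits = torch.tensor([[4.0, -4.0]]).repeat(100, 1)

plain    = nn.CrossEntropyLoss()
weighted = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 99.0]))  # up-weight the rare class

print(plain(logits, targets).item())     # ~0.08 - looks great despite never finding class 1
print(weighted(logits, targets).item())  # ~4.0  - the missed rare example now dominates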

The plot below shows the loss for three scenarios - confident and correct, uncertain, and confident and wrong:

Cross-entropy loss for three scenarios: confident and correct (0.02), uncertain (0.91), confident and wrong (4.52). Being confidently wrong is roughly 200x worse than being confidently right.

Summary

| Loss | Task | Formula | Pro | Con |
| --- | --- | --- | --- | --- |
| MSE | Regression | mean((ŷ - y)²) | Smooth gradient, penalises large errors | Sensitive to outliers |
| MAE | Regression | mean(\|ŷ - y\|) | Robust to outliers | Not differentiable at zero |
| Huber | Regression | MSE near 0, MAE far from 0 | Best of MSE + MAE | Extra hyperparameter δ |
| Cross-entropy | Classification (multi-class) | -log(p_correct) | Optimises confidence, smooth | Sensitive to class imbalance |
| Binary cross-entropy | Classification (2 classes) | -(y·log(p) + (1-y)·log(1-p)) | Same as above, simpler | Same as above |

Why the choice of loss function matters

The loss function is the signal the network learns from. If you pick the wrong one, the network optimises for the wrong thing.

A concrete example: if you use MSE for a classification task, the network learns to minimise the numeric difference between its output scores and the labels (0 or 1). This works, but it doesn’t directly optimise for getting the right class with high confidence. Cross-entropy is specifically designed for that, which is why it trains faster and more reliably on classification tasks.
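
One way to see the difference is to compare gradients when the model is confidently wrong. A minimal sketch with a single sigmoid output (the logit of -6 and label 1 are my own illustrative numbers): squared error barely pushes back because the sigmoid is saturated, while binary cross-entropy still produces a strong signal.

import torch
import torch.nn.functional as F

target = torch.tensor([1.0])                       # the true class is 1
logit  = torch.tensor([-6.0], requires_grad=True)  # the model is very confident it's 0

# squared error on the output probability
((torch.sigmoid(logit) - target) ** 2).mean().backward()
print(logit.grad.item())   # ~ -0.005: the saturated sigmoid almost kills the learning signal

logit.grad = None

# binary cross-entropy on the same output
F.binary_cross_entropy(torch.sigmoid(logit), target).backward()
print(logit.grad.item())   # ~ -1.0: cross-entropy still pushes hard toward the correct class

The gradient of cross-entropy through a sigmoid works out to (p - y), so it never vanishes just because the model is confidently wrong.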

Getting the architecture right but the loss wrong is a common mistake. The model trains, the loss goes down, and it still performs poorly because it was optimising the wrong objective.


Code

Step 1 - MSE, MAE, and Huber on the same data:

import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 2.0, 8.0])
actual    = torch.tensor([3.0, -0.5, 2.0, 7.0])

mse_fn   = nn.MSELoss()
mae_fn   = nn.L1Loss()
huber_fn = nn.HuberLoss(delta=1.0)

print("MSE:  ", mse_fn(predicted, actual).item())
print("MAE:  ", mae_fn(predicted, actual).item())
print("Huber:", huber_fn(predicted, actual).item())
MSE:   0.375
MAE:   0.5
Huber: 0.1875

Now add an outlier and watch MSE blow up while MAE stays stable:

predicted_with_outlier = torch.tensor([2.5, 0.0, 2.0, 50.0])  # last value is way off

print("MSE:  ", mse_fn(predicted_with_outlier, actual).item())
print("MAE:  ", mae_fn(predicted_with_outlier, actual).item())
print("Huber:", huber_fn(predicted_with_outlier, actual).item())
MSE:   462.375
MAE:   11.0
Huber: 10.6875

One extreme prediction pushed MSE from 0.375 to over 460. MAE moved from 0.5 to 11 - bad, but not catastrophic. Huber grows linearly once the error passes δ, so it stays close to MAE instead of exploding like MSE.

Step 2 - cross-entropy by hand on a 3-class example:

import torch
import torch.nn as nn
import torch.nn.functional as F

# raw scores (logits) from the network for 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1]])
# correct class is 0
target = torch.tensor([0])

# by hand: softmax then -log of correct class probability
probs     = F.softmax(logits, dim=1)
loss_manual = -torch.log(probs[0, target[0]])
print("Probabilities:", probs.round(decimals=3))
print("Manual cross-entropy:", loss_manual.item())

# with PyTorch (takes raw logits, applies softmax internally)
ce_fn = nn.CrossEntropyLoss()
print("PyTorch cross-entropy:", ce_fn(logits, target).item())
Probabilities: tensor([[0.659, 0.242, 0.099]])
Manual cross-entropy: 0.417
PyTorch cross-entropy: 0.417

The model assigned 65.9% probability to the correct class - not confident, but right. Loss is 0.417.

Step 3 - show how loss changes with confidence:

import matplotlib.pyplot as plt
import torch
import torch.nn as nn

scenarios = {
    "Confident\n& correct":  torch.tensor([[5.0, 0.5, 0.5]]),
    "Uncertain":              torch.tensor([[1.2, 1.0, 0.8]]),
    "Confident\n& wrong":    torch.tensor([[0.5, 0.5, 5.0]]),
}
target = torch.tensor([0])  # correct class is always 0
ce_fn  = nn.CrossEntropyLoss()

labels = list(scenarios.keys())
losses = [ce_fn(logits, target).item() for logits in scenarios.values()]

plt.figure(figsize=(7, 4))
bars = plt.bar(labels, losses, color=["#2dc653", "#f4a261", "#e63946"], width=0.5)
plt.ylabel("Cross-entropy loss")
plt.title("How confidence affects loss")
for bar, val in zip(bars, losses):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.05,
             f"{val:.2f}", ha="center", fontsize=11, fontweight="bold")
plt.tight_layout()
plt.savefig("assets/images/1.3/cross-entropy-scenarios.png", dpi=150)
plt.show()

for label, val in zip(labels, losses):
    name = label.replace("\n", " ")
    print(f"{name:20}: {val:.2f}")
Confident & correct : 0.02
Uncertain           : 0.91
Confident & wrong   : 4.52

Being confidently wrong is not just a little bad - it’s roughly 200× worse than being confidently right.


Key takeaways

- The loss function turns "how wrong are the predictions?" into a single number that training pushes down.
- MSE, MAE, and Huber are for regression: MSE punishes large errors hardest, MAE is robust to outliers, Huber blends the two.
- Cross-entropy is for classification: it heavily penalises confident wrong answers and pairs naturally with softmax.
- Match the loss to the task - a model can drive the wrong loss down and still perform poorly.

Now we can measure how wrong the network is. The next question is: how do we use that measurement to actually fix the weights? That’s the job of gradient descent and backpropagation.

Post 1.4 - Gradient Descent and Backpropagation

