Skip to content
Amardeep Kumar
Go back

Bias, Variance, and the Tradeoff Every Model Faces

In post 1.7 we saw that a model can memorise training data and fail on anything new. Dropout was the fix. But there’s a deeper question underneath: why do models fail in the first place? It turns out there are exactly two ways a model can be wrong, and they pull in opposite directions. Understanding them tells you what to do when training isn’t working.


Why it exists

Every model makes errors on data it hasn’t seen. Some of that error is unavoidable — the world is noisy. But the rest comes from one of two failure modes.

The first is being too simple. A model that assumes a straight-line relationship between input and output will be wrong even if you gave it infinite training data, because reality isn’t a straight line. It doesn’t matter which samples you train on — the model is fundamentally wrong. This is bias. High bias means the model has wrong assumptions baked in. It underfits — it misses the true pattern entirely.

The second is being too sensitive. A model that memorises every training sample perfectly will perform well on those exact samples and poorly on everything else. Change which samples you train on and you get a completely different model. This is variance. High variance means the model is too sensitive to the specific data it was trained on. It overfits — it learns the noise, not the signal.

These aren’t just abstract categories. They tell you what to fix when something goes wrong:

SymptomLikely causeFix
High training error AND high validation errorHigh bias (underfitting)Use a bigger model
Low training error, high validation errorHigh variance (overfitting)Regularise, get more data

Three panels showing the same scatter of data points. Left: a straight line that misses the curve — underfitting. Middle: a smooth curve that follows the trend — good fit. Right: a wiggly line threading through every point — overfitting.


How it works

Bias

Bias is the error that comes from the model’s assumptions being wrong.

A linear model (y = wx + b) can only fit straight lines. Give it data that follows a curve and it will always miss — not because it hasn’t trained enough, but because its architecture won’t let it represent the right answer. This is a high-bias model. More data doesn’t help. More training steps don’t help. The model needs to be more expressive.

Think of integer division in code. 7 / 2 gives 3, always. No matter how many examples you show it, it will never give you 3.5. The bias is built into the operation itself.

Variance

Variance is the error that comes from the model being too sensitive to the specific training samples.

A very large network trained on a small dataset will find a set of weights that fits every training point exactly — including the noise in those measurements. Those noisy details don’t generalise to new data. If you retrained the same model on a slightly different set of 60 samples, you’d get completely different predictions. The model’s output is unstable — it varies with the data.

This is a lookup table problem. A lookup table that maps every training input to its exact output will hit 100% on training. But ask it about an input it’s never seen and it has nothing to return. All it learned was the exact keys it was given.

The tradeoff

As you make a model more complex — more layers, more neurons — bias falls. The model can represent more shapes, fit more patterns. But variance rises. A model with more capacity can memorise more, and memorisation increases sensitivity to training data.

The plot below shows this directly. Train a two-layer network on 25 points from a sine curve. Vary only the width of the hidden layers — from a single linear layer up to 256 neurons:

A line chart with two curves: train error (blue) and validation error (orange) plotted against model width (linear, 2, 4, 8, 16, 32, 64, 128, 256 neurons). The left region is shaded red and labelled "high bias (underfitting)" — both errors are high. As width increases, both errors drop and val error reaches its minimum around width=4. Beyond that, train error keeps falling but val error rises steadily into the purple-shaded region labelled "high variance (overfitting)".

The linear model (far left) has both train and validation error high — it can’t fit the sine curve at all. Too simple. As we add neurons, both errors drop and the model starts capturing the real pattern. Validation error bottoms out around width 4. Past that, train error keeps falling — the model memorises more — while validation error climbs, because what it’s memorising is noise.

The sweet spot is in the middle: complex enough to capture the signal, simple enough not to memorise the noise.

Where this matters in practice

Diagnosing training problems:

When a model isn’t working, the first thing to check is where the error is. If training error is high, you have a bias problem — the model can’t fit the data. Add capacity: more layers, more neurons, a different architecture. If training error is low but validation error is high, you have a variance problem — add regularisation (dropout, weight decay), get more data, or simplify the model.

Modern deep learning changed the picture:

The classical tradeoff suggests you have to pick a point on the bias-variance curve. But large neural networks with good regularisation can break this intuition — they have low bias (they’re expressive enough to fit anything) AND controlled variance (regularisation prevents them from memorising too specifically).

GPT-4, LLaMA, and similar models have billions of parameters — high capacity, should overfit. But they’re trained on so much data, with enough regularisation, that the variance is managed. The tradeoff still exists. It’s just managed with better tools than simply choosing a small model.

The practical rule of thumb: start with a model large enough to fit the training data (get bias low), then add regularisation to pull variance back in. Don’t start with a tiny model hoping to avoid overfitting — an underfit model is just as broken as an overfit one, and harder to diagnose.


Code

Step 1 - reproduce the complexity curve:

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(42)

N = 25
X_train = torch.linspace(0, 1, N).unsqueeze(1)
y_train = torch.sin(2 * np.pi * X_train) + 0.15 * torch.randn_like(X_train)
X_val   = torch.linspace(0, 1, 300).unsqueeze(1)
y_val   = torch.sin(2 * np.pi * X_val)   # clean sine — no noise

def make_model(width):
    if width == 0:
        return nn.Linear(1, 1)  # purely linear — maximum bias
    return nn.Sequential(
        nn.Linear(1, width), nn.Tanh(),
        nn.Linear(width, width), nn.Tanh(),
        nn.Linear(width, 1),
    )

widths = [0, 2, 4, 8, 16, 32, 64, 128, 256]
loss_fn = nn.MSELoss()
train_errors, val_errors = [], []

for w in widths:
    torch.manual_seed(0)
    model = make_model(w)
    opt   = torch.optim.Adam(model.parameters(), lr=3e-3)

    for _ in range(3000):
        model.train()
        loss = loss_fn(model(X_train), y_train)
        opt.zero_grad(); loss.backward(); opt.step()

    model.eval()
    with torch.no_grad():
        train_errors.append(loss_fn(model(X_train), y_train).item())
        val_errors.append(loss_fn(model(X_val), y_val).item())

for w, tr, vl in zip(widths, train_errors, val_errors):
    print(f"width={str(w):>4}  train={tr:.4f}  val={vl:.4f}")
width=   0  train=0.2400  val=0.2005
width=   2  train=0.0362  val=0.0107
width=   4  train=0.0177  val=0.0090
width=   8  train=0.0102  val=0.0180
width=  16  train=0.0084  val=0.0231
width=  32  train=0.0067  val=0.0266
width=  64  train=0.0032  val=0.0364
width= 128  train=0.0014  val=0.0423
width= 256  train=0.0027  val=0.0330

Train error falls monotonically as width grows. Validation error reaches its minimum at width 4, then rises — the model is memorising the noise in the 25 training points. Width 4 is the sweet spot for this particular problem.

Step 2 - diagnose a model using train vs val error:

# high bias: both errors are high
print(f"Linear model  — train: {train_errors[0]:.4f}  val: {val_errors[0]:.4f}")

# good fit: both errors are low and close
print(f"Width 4       — train: {train_errors[2]:.4f}  val: {val_errors[2]:.4f}")

# high variance: train low, val high
print(f"Width 256     — train: {train_errors[-1]:.4f}  val: {val_errors[-1]:.4f}")
Linear model  — train: 0.2400  val: 0.2005
Width 4       — train: 0.0177  val: 0.0090
Width 256     — train: 0.0027  val: 0.0330

The diagnostic pattern:


Key takeaways


We’ve covered all the core building blocks of deep learning. The next two posts step back to look at the architectures that came before Transformers — and why each one hit a wall that eventually made a new approach necessary.

Post 1.9 - CNN, RNN, LSTM: The Road to Transformers


Share this post on:

Next Post
Dropout and Overfitting: Teaching a Network Not to Cheat