Archives

All the articles I've archived.

2026 ¹²

July ¹

Distributed Training at Scale: Parallelism, Memory, and the Optimizer Stack

17 Jul, 2026

How DDP, tensor/pipeline/sequence parallelism, ZeRO/FSDP, activation recomputation, mixed precision, and the optimizer stack combine to train models no single GPU could hold.

May ¹¹

Bias, Variance, and the Tradeoff Every Model Faces

8 May, 2026

Why models fail in two opposite ways, being too rigid or too sensitive, and how to find the sweet spot between them.

Dropout and Overfitting: Teaching a Network Not to Cheat

7 May, 2026

What overfitting is, why it happens, and how dropout stops a network from memorising the training data.

Transformer Architecture & Key Design Decisions

6 May, 2026

A deep dive into the transformer architecture, why decoder-only models won, and the key design decisions (RoPE, GQA, Flash Attention, MoE) that define every modern LLM.

Normalization: BatchNorm, LayerNorm, and Why Transformers Need a Different One

6 May, 2026

Why activations drift as they pass through deep networks, and how BatchNorm and LayerNorm fix it in different ways.

Optimizers: SGD, Momentum, Adam, and AdamW

5 May, 2026

Why plain gradient descent isn't enough, and how SGD, momentum, Adam, and AdamW each fix a problem the previous one had.

Gradient Descent and Backpropagation: How a Network Actually Learns

5 May, 2026

How gradient descent uses the loss to update weights, and how backpropagation computes the gradients that make it possible.

Loss Functions: How a Neural Network Knows It's Wrong

4 May, 2026

What loss functions are, how MSE and cross-entropy work, and why picking the wrong one breaks your model even if everything else is right.

Activation Functions: Why ReLU, GELU, and SiLU Exist

3 May, 2026

Why stacking linear layers isn't enough, and how activation functions like ReLU, GELU, and SiLU give neural networks their power.

What is a Neural Network?

2 May, 2026

A neural network explained from scratch - neurons, weights, layers, and the forward pass - no ML background required.

GenZ to AI Enz: Series Index

1 May, 2026

Full table of contents for the GenZ to AI Enz series - every post and walkthrough in order.

GenZ to AI Enz: A Roadmap for CS Grads Breaking into AI

30 Apr, 2026

A complete series taking CS students and early-career engineers from zero ML knowledge to building real AI systems with LLMs and agents.

2024 ¹

March ¹

Fine-tuning Phi-2 with DPO on the Anthropic HH Dataset

29 Feb, 2024

Fine-tuning Microsoft's Phi-2 using Direct Preference Optimization (DPO) on the Anthropic Helpful and Harmless dataset with LoRA and 8-bit quantization.

2023 ¹

June ¹

How We Cut ML Inference Latency by 40% on Kubernetes

31 May, 2023

The architecture behind our async model serving platform at Instabase — async workers, RabbitMQ, multi-level caching, and sticky routing to cut inference time by 40%.

2021 ¹

November ¹

GupShup: Summarizing Code-Switched Conversations

6 Nov, 2021

Our EMNLP 2021 paper on abstractive summarization of Hindi-English code-switched conversations — introducing the GupShup dataset.
