
How We Cut ML Inference Latency by 40% on Kubernetes

When I joined Instabase’s Model Service team, our ML inference pipeline was synchronous, single-level, and running on compute-limited Kubernetes nodes. Under load, latency was rough. Here’s what we built to fix it.

The Problem

Document AI inference is expensive. Each request runs a large model (sometimes LLM-scale) against a document that might be dozens of pages. In a Kubernetes environment with limited GPU resources, synchronous request handling meant every request held a connection open while a worker blocked on the model, so queues backed up behind slow documents and p99 latency spiked under load.

The Architecture

We built a fully async serving platform with four key components:

1. Async Workers

Instead of blocking on each inference request, we moved to an async worker pool. Requests are dispatched to available workers without holding a connection. This alone reduced p99 latency significantly under load.
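A minimal asyncio sketch of the pattern, where the queue size, worker count, and the `run_inference` stand-in are illustrative rather than our production code:

```python
import asyncio

def run_inference(request: dict) -> dict:
    # Placeholder for the actual (blocking, GPU-bound) model call.
    return {"doc_id": request["doc_id"], "pages": "..."}

async def worker(queue: asyncio.Queue) -> None:
    # Each worker pulls requests off a shared queue; the API layer has
    # already returned, so no client connection is held while we wait.
    while True:
        request = await queue.get()
        try:
            # Run the blocking model call in a thread so the event loop
            # stays free to dispatch other requests.
            result = await asyncio.to_thread(run_inference, request)
            await request["reply"].put(result)
        finally:
            queue.task_done()

async def start_pool(num_workers: int = 8) -> asyncio.Queue:
    # A bounded queue gives natural backpressure when workers fall behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=1024)
    for _ in range(num_workers):
        asyncio.create_task(worker(queue))
    return queue
```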

2. RabbitMQ for Request Queuing

We introduced RabbitMQ as the message broker between the API layer and inference workers. This decouples request ingestion from processing, enabling backpressure handling and reliable retries without dropping requests.
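In rough terms the producer/consumer split looks like the sketch below, using pika; the queue name, host, and message shape are illustrative assumptions, not our actual schema:

```python
import json
import pika

# --- API layer (producer): enqueue the request and return immediately ---
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="inference_requests", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="inference_requests",
    body=json.dumps({"doc_id": "abc123", "model": "layout-v2"}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist across broker restarts
)

# --- Inference worker (consumer) ---
def handle(ch, method, properties, body):
    request = json.loads(body)
    run_inference(request)  # placeholder for the actual model call
    # Ack only after success, so a crashed worker's message is redelivered.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # one in-flight message per worker = backpressure
channel.basic_consume(queue="inference_requests", on_message_callback=handle)
channel.start_consuming()
```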

3. Two-Level Caching

We put a two-level cache in front of the inference workers, so repeated requests for the same model and document skip inference entirely.

For document-heavy workflows, cache hit rates above 40% were common, especially for repeat documents in audit/compliance pipelines.
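A common shape for this is a per-worker in-memory map backed by a shared cache such as Redis, keyed on a hash of the model and document. A hedged sketch of that idea; the key format, TTL, and Redis tier are my assumptions, not a description of our exact tiers:

```python
import hashlib
import json

import redis

r = redis.Redis(host="redis", port=6379)

# Level 1: per-worker in-memory cache (fast, but not shared across pods).
_local: dict[str, dict] = {}

def cache_key(model_id: str, doc_bytes: bytes) -> str:
    # Key on the model and the document content, so identical re-uploads hit.
    return f"{model_id}:{hashlib.sha256(doc_bytes).hexdigest()}"

def get_or_infer(model_id: str, doc_bytes: bytes) -> dict:
    key = cache_key(model_id, doc_bytes)
    if key in _local:                                  # L1 hit
        return _local[key]
    cached = r.get(key)                                # L2: shared across workers
    if cached is not None:
        result = json.loads(cached)
    else:
        result = run_inference(model_id, doc_bytes)    # placeholder model call
        r.setex(key, 3600, json.dumps(result))         # TTL keeps the cache bounded
    _local[key] = result
    return result
```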

4. Sticky Routing

Model loading is expensive. We implemented sticky routing so that requests for the same model are preferentially routed to workers that already have it loaded in memory. This dramatically reduced model cold-start overhead.
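One simple way to get stickiness is to hash the model ID onto the worker pool, so requests for a given model always land on the same worker while it stays loaded. A sketch of just the core idea; the real routing logic can be richer (load-aware, with failover):

```python
import hashlib

def pick_worker(model_id: str, workers: list[str]) -> str:
    # Requests for the same model hash to the same worker, so the model
    # stays warm in that worker's memory instead of being reloaded.
    digest = hashlib.sha1(model_id.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["worker-0", "worker-1", "worker-2", "worker-3"]
print(pick_worker("layout-v2", workers))  # deterministic for a given pool size
```

Note that plain modulo hashing reshuffles assignments whenever the pool scales; a consistent-hash ring keeps most model-to-worker assignments stable across scaling events.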

Hardware Acceleration

On top of the architecture changes, we applied ONNX export and model pruning to reduce per-inference compute. This compounded the latency gains.
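For context, the two steps look roughly like this in PyTorch; the stand-in model, input shape, opset, and 30% pruning amount are illustrative, not the values we shipped:

```python
import torch
import torch.nn.utils.prune as prune

# Stand-in model; the real served model is much larger.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 128),
)
model.eval()

# Prune the smallest-magnitude weights in each Linear layer, then bake it in.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # 30% is illustrative
        prune.remove(module, "weight")

# Export to ONNX so inference runs on ONNX Runtime instead of eager PyTorch.
dummy_input = torch.randn(1, 512)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)
```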

Results

A 40% reduction in inference time in a compute-limited Kubernetes environment. During peak load periods the improvement was even more pronounced, thanks to the queueing and caching effects.

Lessons

The biggest wins came from the architecture, not the models themselves: decoupling ingestion from processing with a queue, caching aggressively, and routing for model locality. Hardware-level optimizations like ONNX export and pruning compounded those gains rather than replacing them.

If you’re building ML inference infrastructure and want to talk through the details, reach out.


