LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

June 2, 2025 · 5 min read

Lead ML Engineer @ Meesho

BharatMLStack

Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack, from memory management to kernel execution.

1. Advanced Memory Management: Paged & Prefix KV Caching

The most significant bottleneck in LLM inference is often memory rather than compute: specifically, the capacity and bandwidth consumed by the Key-Value (KV) cache.

Paged KV caching

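Standard KV caching reserves one contiguous region per request, sized for the maximum sequence length, so memory fragments and much of it sits unused. We use Paged KV caching, which operates similarly to an operating system's virtual memory: the KV cache is divided into fixed-size blocks that need not be contiguous, and each sequence is given blocks on demand as it grows. This lets us serve larger batch sizes without running out of memory.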
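To make the virtual-memory analogy concrete, here is a minimal sketch of a block allocator. The class name, its methods, and the block size are hypothetical and chosen for illustration; this is not the TensorRT-LLM implementation, only the bookkeeping idea behind it.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> its blocks
        self.seq_lens: dict[str, int] = {}            # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # any free block will do
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

A sequence only ever wastes the unused tail of its last block, which is why more concurrent sequences fit in the same memory.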

KV cache quantization

To further maximize available memory, we implement KV cache quantization (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality.
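The capacity gain can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-3.1-8B-like shape (32 layers, 8 KV heads, head dimension 128) and a hypothetical 8 GiB KV cache budget; neither figure comes from our deployment configuration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed model shape (Llama-3.1-8B-like), for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

budget = 8 * 1024**3  # hypothetical 8 GiB reserved for the KV cache
print(f"16-bit: {fp16 // 1024} KiB/token, {budget // fp16} tokens fit")
print(f" 8-bit: {fp8 // 1024} KiB/token, {budget // fp8} tokens fit")
```

In practice the gain is slightly below 2x because quantization scaling factors and other runtime buffers also occupy memory.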

Prefix caching (the "voice bot" optimizer)

For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable prefix caching.

Impact: By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces Time To First Token (TTFT) by skipping redundant computation of the system prompt.
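The reuse logic can be sketched as a lookup keyed by block-aligned token prefixes. The class below is a toy with hypothetical names; a string placeholder stands in for the cached KV tensors, and real runtimes hash blocks rather than storing whole token tuples.

```python
class PrefixCache:
    """Toy prefix cache: one entry per full block, keyed by the entire
    token prefix up to and including that block."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.blocks: dict[tuple[int, ...], str] = {}  # prefix -> KV placeholder

    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        matched = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.blocks:
                break  # the first miss ends the reusable prefix
            matched = end
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record a KV placeholder for every full block of this prompt."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            start = end - self.block_size
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[{start}:{end}]")
```

With a long static system prompt, almost all prompt tokens fall inside matched blocks, so prefill only has to process the user-specific tail.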

2. Aggressive Quantization (INT4 AWQ & FP8)

Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights while keeping accuracy loss minimal.

INT4 AWQ (Activation-aware Weight Quantization)

For the Llama 3 family, we use AWQ to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed.
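The ~75% figure follows from the bit widths. The sketch below shows plain group-wise 4-bit round-to-nearest quantization, which is only the storage format; it is not AWQ itself, which additionally rescales salient weight channels using activation statistics before quantizing. The group size of 128 and the per-group metadata layout (16-bit scale, 4-bit zero point) are assumptions.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, int]:
    """Asymmetric 4-bit quantization of one weight group to integers in [0, 15]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    zero = round(-lo / scale)      # integer zero point
    q = [min(15, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q: list[int], scale: float, zero: int) -> list[float]:
    """Recover approximate weights from 4-bit codes."""
    return [(v - zero) * scale for v in q]

# Storage cost per weight, assuming groups of 128 weights that each carry
# a 16-bit scale and a 4-bit zero point (assumed layout).
GROUP_SIZE = 128
bits_per_weight = 4 + (16 + 4) / GROUP_SIZE
size_reduction = 1 - bits_per_weight / 16
print(f"{bits_per_weight:.3f} bits/weight, {size_reduction:.1%} smaller than 16-bit")
```

Each weight is reconstructed to within half a quantization step of its original value; AWQ's contribution is choosing scales so that this error lands on the weights that matter least.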

FP8 precision

For NVIDIA Hopper (H100) architectures, we are exploring FP8 quantization, leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization.

Verification: We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving >99% similarity.
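A check of this kind can be sketched as follows, reading "dot-product similarity" as cosine similarity (the dot product of L2-normalized vectors), which is an assumption. The function names and the way embeddings are obtained are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after L2 normalization."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def passes_similarity_check(baseline: list[list[float]],
                            quantized: list[list[float]],
                            threshold: float = 0.99) -> bool:
    """True if every quantized embedding stays above the threshold
    relative to the baseline embedding for the same input."""
    return all(cosine_similarity(b, q) > threshold
               for b, q in zip(baseline, quantized))
```

Running this over a fixed evaluation set gives a cheap regression gate before a quantized engine is promoted.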

3. Kernel Fusion & Custom Plugins

To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins.

Flash attention & FMHA: We enable Fused Multi-Head Attention (FMHA) combined with flash attention to reduce memory reads/writes.
GEMM plugins: We use specialized GEMM plugins to accelerate transformer linear layers.
Removing input padding: Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens.
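The padding removal in the last item can be pictured as packing all sequences into one flat token buffer plus an offsets array. This is a sketch of the data layout only, with hypothetical function names; the actual kernels consume the packed tensor on the GPU.

```python
from itertools import accumulate

def pack_sequences(seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate sequences; offsets[i]:offsets[i+1] delimits sequence i."""
    tokens = [tok for seq in seqs for tok in seq]
    offsets = [0, *accumulate(len(seq) for seq in seqs)]
    return tokens, offsets

def padded_token_count(seqs: list[list[int]]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return len(seqs) * max(len(seq) for seq in seqs)

batch = [[11, 12, 13], [21], [31, 32]]
tokens, offsets = pack_sequences(batch)
print(len(tokens), "packed tokens vs", padded_token_count(batch), "padded")
```

The more the sequence lengths in a batch differ, the larger the share of compute that padding would have wasted.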

4. Inflight (Continuous) Batching

Traditional static batching waits for all requests in a batch to finish before returning results—so one long response delays everyone else.

We implement inflight batching: as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones.
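The difference between the two scheduling policies shows up in a small simulation. The sketch below is a toy model under simplifying assumptions: every request needs a known number of decode steps (at least one), all requests are queued at time zero, and one step advances every active request by one token.

```python
from collections import deque

def static_batching(lengths: list[int], slots: int) -> list[int]:
    """Completion step per request when a batch returns only after
    its longest member has finished."""
    done, clock = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)
        done.extend([clock] * len(batch))
    return done

def inflight_batching(lengths: list[int], slots: int) -> list[int]:
    """Completion step per request when freed slots are refilled immediately."""
    queue = deque(enumerate(lengths))
    active: dict[int, int] = {}  # request index -> remaining decode steps
    done = [0] * len(lengths)
    clock = 0
    while queue or active:
        while queue and len(active) < slots:  # fill every free slot
            idx, steps = queue.popleft()
            active[idx] = steps
        clock += 1                            # one decode step for the batch
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:              # finished: free the slot now
                done[idx] = clock
                del active[idx]
    return done

lengths = [10, 2, 2, 2]  # one long request followed by three short ones
print("static:  ", static_batching(lengths, slots=2))
print("inflight:", inflight_batching(lengths, slots=2))
```

In this toy run the short requests finish at steps 2, 4, and 6 under inflight batching instead of 10, 12, and 12 under static batching, while the long request is unaffected.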

5. Parallelism Strategies: Scaling Beyond One GPU

For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies.

Tensor parallelism (TP): Split weight matrices across multiple GPUs (e.g., 4× L4 or 8× A100). Each GPU computes a shard and outputs are reduced at every layer.
Pipeline parallelism (PP): Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B).
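The "shard and reduce" step of tensor parallelism can be illustrated with a row-parallel linear layer. The sketch below simulates the GPUs with a Python loop and performs the all-reduce as a plain sum on the host; the function names are hypothetical and the input dimension is assumed to divide evenly across shards.

```python
def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """y[j] = sum_i x[i] * w[i][j], with w stored as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def tensor_parallel_matvec(x: list[float], w: list[list[float]],
                           num_shards: int) -> list[float]:
    """Row-parallel linear layer: each shard owns a slice of the input
    dimension and produces a partial output of full width."""
    step = len(x) // num_shards  # assumes an even split
    partials = [
        matvec(x[r * step:(r + 1) * step], w[r * step:(r + 1) * step])
        for r in range(num_shards)  # one iteration per simulated GPU
    ]
    # All-reduce: sum the partial outputs element-wise across shards.
    return [sum(vals) for vals in zip(*partials)]

x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(matvec(x, w), tensor_parallel_matvec(x, w, num_shards=2))
```

Each shard stores only its slice of the weight matrix, which is what lets a model too large for one GPU fit; the price is the communication needed for the reduction.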

6. Speculative Decoding

To reduce inter-token latency (ITL), we explore speculative decoding.

Mechanism: A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).
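Verification: The larger target model verifies those tokens in one parallel forward pass. Every draft token up to the first mismatch is accepted, so a single large-model step can yield several tokens; from the first mismatch onward the draft tokens are discarded and generation resumes from the target model's own prediction. This is most effective for predictable text, where the draft model is usually right, and it improves perceived generation speed.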
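The draft-and-verify loop above can be sketched for the greedy case as follows. The two model callables are hypothetical stand-ins that map a token context to the next token, and the verification loop stands in for what a real system does in one batched forward pass; sampling-based decoding uses a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]  # context -> greedy next token

def speculative_step(prefix: list[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 5) -> list[int]:
    """One greedy speculative step; returns the tokens emitted this step."""
    # 1) The draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) The target model checks each position (a loop here for clarity).
    emitted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # replace the first mismatch and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3) All k drafts accepted: the target's next prediction is a bonus token.
    emitted.append(target_next(ctx))
    return emitted
```

Every step emits at least one token that the target model itself would have produced, so greedy output is unchanged; only the number of large-model steps drops.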

A Few Benchmarks

Below are a couple of representative use cases and performance numbers.

Search query rewriting

LLM: Fine-tuned llama-3.2-1B
Input & output token length: ~10–20
Response type: Non-streaming

Inference runtime	Hardware	Max requests/sec	Max p99 latency
TensorRT-LLM	4 × L4 GPUs (multi-GPU)	1000	95 ms
TensorRT-LLM	1 × A100 40 GB GPU	1000	69 ms

Voice bot query

LLM: Llama-3.1-8B
Input token length: ~1900–2000
Output token length: ~200
Response type: Streaming

Inference runtime	Concurrency	p99 TTFT (ms)	p99 ITL (ms)	Token throughput (tokens/sec)	Request throughput (req/sec)	Hardware
TensorRT-LLM	1	36.27	22.78	45.66	0.23	L4
TensorRT-LLM	2	49.81	23.21	89.37	0.45	L4
TensorRT-LLM	4	55.33	36.62	153.39	0.78	L4
TensorRT-LLM	8	66.5	39.11	279.88	1.47	L4
TensorRT-LLM	16	131.8	30.39	547.8	2.77	L4
TensorRT-LLM	32	277.22	48.02	925.7	4.78	L4
TensorRT-LLM	64	498.52	71.62	1,164.40	6.2	L4
TensorRT-LLM	128	677.31	120.37	1,445.18	7.69	L4
TensorRT-LLM	256	1,926.31	216.88	1,600.81	8.52	L4
TensorRT-LLM	1	21.17	9.24	130.05	0.68	A100
TensorRT-LLM	2	25.78	9.21	264.5	1.35	A100
TensorRT-LLM	4	28.52	10.99	437.69	2.27	A100
TensorRT-LLM	8	34.4	12.61	760.49	3.96	A100
TensorRT-LLM	16	68.03	14.32	1,343.80	7.01	A100
TensorRT-LLM	32	185.96	16.82	2,287.30	11.92	A100
TensorRT-LLM	64	136.87	21.17	3,625.22	18.89	A100
TensorRT-LLM	128	463.78	34.15	4,456.51	23.24	A100
TensorRT-LLM	256	890.12	59.18	5,188.24	27.05	A100

Conclusion

High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.
