
LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

· 5 min read
Jaya Kumar
Lead ML Engineer @ Meesho

Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack, from memory management to kernel execution.

1. Advanced Memory Management: Paged & Prefix KV Caching

The most significant bottleneck in LLM inference is not always compute, but memory bandwidth—specifically managing the Key-Value (KV) cache.

Paged KV caching

Standard caching suffers from fragmentation. We use Paged KV caching, which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory.
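The block-table idea can be sketched in a few lines. This is a toy illustration (class and method names are ours, not vLLM's or TensorRT-LLM's): a pool of fixed-size physical blocks, with each sequence holding a table that maps its logical blocks to whichever physical blocks were free.

```python
BLOCK_SIZE = 16  # tokens per physical cache block

class PagedKVCache:
    """Toy paged KV cache: non-contiguous blocks handed out on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}  # seq_id -> [physical block ids]
        self.seq_lens = {}      # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token; returns (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block is full: grab a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def free(self, seq_id):
        """Request finished: its blocks return to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are only allocated as tokens arrive and are reclaimed the moment a request completes, short sequences never reserve worst-case space, which is what lets larger batches fit.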

KV cache quantization

To further maximize available memory, we implement KV cache quantization (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality.
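The trade-off is easy to see with a scale-based 8-bit scheme. Note this is a deliberately simplified int8-with-scale sketch, not the hardware FP8 (E4M3/E5M2) format we actually use; it only illustrates why halving the bytes per cached key/value costs so little accuracy.

```python
def quantize_int8(values):
    """Compress a list of floats to int8 codes plus one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_int8(codes, scale):
    """Recover approximate floats; error is bounded by scale / 2."""
    return [c * scale for c in codes]

keys = [0.12, -1.5, 0.73, 2.0]      # a slice of cached attention keys
codes, scale = quantize_int8(keys)
restored = dequantize_int8(codes, scale)
# each entry now occupies 1 byte instead of 2 (fp16): ~2x more cache fits
```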

Prefix caching (the "voice bot" optimizer)

For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable prefix caching.

  • Impact: By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces Time To First Token (TTFT) by skipping redundant computation of the system prompt.

2. Aggressive Quantization (INT4 AWQ & FP8)

Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights with minimal accuracy loss.

INT4 AWQ (Activation-aware Weight Quantization)

For the Llama 3 family, we use AWQ to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed.
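The core of 4-bit weight compression is group-wise scaling. The sketch below is plain round-to-nearest int4 with a shared scale per group; real AWQ additionally rescales salient channels using activation statistics before quantizing, which this toy omits.

```python
GROUP = 4  # weights per shared scale (production kernels use e.g. 128)

def quantize_int4(weights):
    """Group-wise 4-bit quantization: each group stores int4 codes + a scale."""
    groups = []
    for i in range(0, len(weights), GROUP):
        g = weights[i:i + GROUP]
        scale = max(abs(w) for w in g) / 7 or 1.0   # int4 range is -8..7
        codes = [max(-8, min(7, round(w / scale))) for w in g]
        groups.append((codes, scale))
    return groups

def dequantize_int4(groups):
    return [c * s for codes, s in groups for c in codes]
```

Each weight now costs 4 bits plus a small per-group overhead for the scale, which is where the ~75% size reduction comes from.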

FP8 precision

For NVIDIA Hopper (H100) architectures, we are exploring FP8 quantization, leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization.

  • Verification: We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving >99% similarity.
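The verification check itself is simple cosine (normalized dot-product) similarity; the vectors below are made-up stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Normalized dot product between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

baseline = [0.12, -0.98, 0.33, 0.71]    # FP16 model's embedding
quantized = [0.11, -0.97, 0.34, 0.70]   # slightly perturbed by quantization
sim = cosine_similarity(baseline, quantized)
assert sim > 0.99  # accept the quantized model only above the threshold
```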

3. Kernel Fusion & Custom Plugins

To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins.

  • Flash attention & FMHA: We enable Fused Multi-Head Attention (FMHA) combined with flash attention to reduce memory reads/writes.
  • GEMM plugins: We use specialized GEMM plugins to accelerate transformer linear layers.
  • Removing input padding: Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens.
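Padding removal amounts to packing variable-length sequences into one flat buffer with offsets, similar in spirit to ragged/packed batching (the function below is our own illustration, not the engine's API).

```python
def pack_sequences(seqs):
    """Concatenate valid tokens; offsets delimit each sequence in the buffer."""
    tokens, offsets = [], [0]
    for s in seqs:
        tokens.extend(s)
        offsets.append(len(tokens))
    return tokens, offsets

batch = [[1, 2, 3], [4], [5, 6]]
tokens, offsets = pack_sequences(batch)
# a padded batch would hold 3 * 3 = 9 slots; the packed buffer holds
# only the 6 valid tokens, so the GPU does no wasted work on padding
```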

4. Inflight (Continuous) Batching

Traditional static batching waits for all requests in a batch to finish before returning results—so one long response delays everyone else.

We implement inflight batching: as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones.
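A minimal scheduler loop shows the idea (the structure is a hypothetical simplification, not TensorRT-LLM's actual scheduler): each decode step, finished requests leave the batch immediately and queued requests take their slots.

```python
from collections import deque

def inflight_batching(requests, max_batch):
    """Toy continuous-batching loop; requests are (id, tokens_to_generate)."""
    queue = deque(requests)
    active, finished = {}, []
    while queue or active:
        # fill any free slots from the waiting queue
        while queue and len(active) < max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # one decode step: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:        # done -> its slot frees this step
                finished.append(rid)
                del active[rid]
    return finished  # completion order

order = inflight_batching([("short", 2), ("long", 8), ("mid", 3)], max_batch=2)
```

Note that "short" and "mid" finish while "long" is still decoding: a static batch would have held all three results until "long" completed.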

5. Parallelism Strategies: Scaling Beyond One GPU

For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies.

  • Tensor parallelism (TP): Split weight matrices across multiple GPUs (e.g., 4× L4 or 8× A100). Each GPU computes a shard and outputs are reduced at every layer.
  • Pipeline parallelism (PP): Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B).
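Tensor parallelism can be sketched with a sharded matrix-vector product, where plain lists stand in for GPUs: each rank holds a column slice of the weight matrix and the matching slice of the input, computes a partial output, and the partials are summed, which is the per-layer reduce mentioned above.

```python
def matvec(W, x):
    """Reference single-device matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, num_gpus):
    """Shard the input dimension across 'GPUs', then all-reduce (sum)."""
    d = len(x)
    chunk = (d + num_gpus - 1) // num_gpus
    partials = []
    for rank in range(num_gpus):
        cols = range(rank * chunk, min((rank + 1) * chunk, d))
        # each rank multiplies only its columns of W by its slice of x
        partials.append([sum(row[c] * x[c] for c in cols) for row in W])
    return [sum(vals) for vals in zip(*partials)]  # the all-reduce

W = [[1, 0], [0, 1], [2, 2], [3, -1]]
x = [1, 2]
assert tensor_parallel_matvec(W, x, num_gpus=2) == matvec(W, x)
```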

6. Speculative Decoding

To reduce inter-token latency (ITL), we explore speculative decoding.

  • Mechanism: A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).
  • Verification: The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. This is effective for predictable text, improving perceived generation speed.
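The draft-then-verify loop looks roughly like this; both "models" below are trivial stand-in functions (not real LLMs), so the sketch only demonstrates the control flow and the guarantee that output matches what the target model alone would produce.

```python
def speculative_decode(draft_next, target_next, prompt, k, max_len):
    """Toy speculative decoding: draft proposes k tokens, target verifies."""
    out = list(prompt)
    while len(out) < max_len:
        # draft model speculates k tokens ahead of the current context
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # target model checks all k in one pass; keep the agreed prefix
        accepted, ctx = [], list(out)
        for tok in draft:
            if target_next(ctx) == tok:
                accepted.append(tok)
                ctx.append(tok)
            else:
                break
        out.extend(accepted)
        if len(out) < max_len:
            out.append(target_next(out))  # at least one token per step
    return out[:max_len]
```

When the draft model agrees often (predictable text), each target-model step yields several tokens; when it never agrees, the loop degrades gracefully to one token per step with the same final output.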

A Few Benchmarks

Below are a couple of representative use cases and performance numbers.

Search query rewriting

  • LLM: Fine-tuned llama-3.2-1B
  • Input & output token length: ~10–20
  • Response type: Non-streaming
| Inference runtime | Hardware | Max requests/sec | Max p99 latency |
| --- | --- | --- | --- |
| TensorRT-LLM | 4 × L4 GPUs (multi-GPU) | 1000 | 95 ms |
| TensorRT-LLM | 1 × A100 40 GB GPU | 1000 | 69 ms |

Voice bot query

  • LLM: Llama-3.1-8B
  • Input token length: ~1900–2000
  • Output token length: ~200
  • Response type: Streaming
| Inference runtime | Concurrency | p99 TTFT (ms) | p99 ITL (ms) | Token throughput (tokens/sec) | Request throughput (req/sec) | Hardware |
| --- | --- | --- | --- | --- | --- | --- |
| TensorRT-LLM | 1 | 36.27 | 22.78 | 45.66 | 0.23 | L4 |
| TensorRT-LLM | 2 | 49.81 | 23.21 | 89.37 | 0.45 | L4 |
| TensorRT-LLM | 4 | 55.33 | 36.62 | 153.39 | 0.78 | L4 |
| TensorRT-LLM | 8 | 66.5 | 39.11 | 279.88 | 1.47 | L4 |
| TensorRT-LLM | 16 | 131.8 | 30.39 | 547.8 | 2.77 | L4 |
| TensorRT-LLM | 32 | 277.22 | 48.02 | 925.7 | 4.78 | L4 |
| TensorRT-LLM | 64 | 498.52 | 71.62 | 1,164.40 | 6.2 | L4 |
| TensorRT-LLM | 128 | 677.31 | 120.37 | 1,445.18 | 7.69 | L4 |
| TensorRT-LLM | 256 | 1,926.31 | 216.88 | 1,600.81 | 8.52 | L4 |
| TensorRT-LLM | 1 | 21.17 | 9.24 | 130.05 | 0.68 | A100 |
| TensorRT-LLM | 2 | 25.78 | 9.21 | 264.5 | 1.35 | A100 |
| TensorRT-LLM | 4 | 28.52 | 10.99 | 437.69 | 2.27 | A100 |
| TensorRT-LLM | 8 | 34.4 | 12.61 | 760.49 | 3.96 | A100 |
| TensorRT-LLM | 16 | 68.03 | 14.32 | 1,343.80 | 7.01 | A100 |
| TensorRT-LLM | 32 | 185.96 | 16.82 | 2,287.30 | 11.92 | A100 |
| TensorRT-LLM | 64 | 136.87 | 21.17 | 3,625.22 | 18.89 | A100 |
| TensorRT-LLM | 128 | 463.78 | 34.15 | 4,456.51 | 23.24 | A100 |
| TensorRT-LLM | 256 | 890.12 | 59.18 | 5,188.24 | 27.05 | A100 |

Conclusion

High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.