Skip to main content

5 posts tagged with "meesho"

View All Tags

LLM Inference Optimization Techniques: Engineering Sub-Second Latency at Scale

· 5 min read
Jaya Kumar
Lead ML Engineer @ Meesho

BharatMLStack Raw execution of Large Language Models is inherently expensive and memory-intensive. To achieve sub-second latency and high throughput, we implement a multi-layered optimization strategy that targets the entire inference stack—from memory management to kernel execution.

1. Advanced Memory Management: Paged & Prefix KV Caching

The most significant bottleneck in LLM inference is not always compute, but memory bandwidth—specifically managing the Key-Value (KV) cache.

Paged KV caching

Standard caching suffers from fragmentation. We use Paged KV caching, which operates similarly to an operating system's virtual memory: the KV cache is divided into non-contiguous blocks. This lets us serve larger batch sizes without running out of memory.

KV cache quantization

To further maximize available memory, we implement KV cache quantization (e.g., FP8). By compressing stored attention keys and values from 16-bit to 8-bit, we nearly double the effective context window capacity of the GPU, allowing longer conversations or larger batches without materially degrading quality.

Prefix caching (the "voice bot" optimizer)

For use cases like GenAI voice bots where the system prompt (e.g., "You are a helpful assistant...") is static across thousands of requests, we enable prefix caching.

  • Impact: By reusing pre-computed KV states for common prefixes, we achieve a cache hit rate of ~90%. This reduces Time To First Token (TTFT) by skipping redundant computation of the system prompt.

2. Aggressive Quantization (INT4 AWQ & FP8)

Running models in their native 16-bit precision (BF16) restricts maximum batch size and throughput. We use quantization to shrink model weights without sacrificing accuracy.

INT4 AWQ (Activation-aware Weight Quantization)

For the Llama 3 family, we use AWQ to compress weights to 4 bits. This reduces model size by ~75%, allowing larger models to fit into L4 GPU memory and significantly improving token generation speed.

FP8 precision

For NVIDIA Hopper (H100) architectures, we are exploring FP8 quantization, leveraging native FP8 tensor cores to accelerate matrix multiplications while maintaining a higher dynamic range than integer quantization.

  • Verification: We validate quantized models by comparing dot-product similarity of embeddings against the FP16 baseline, consistently achieving >99% similarity.

3. Kernel Fusion & Custom Plugins

To minimize overhead from launching thousands of small GPU operations, we fuse them into monolithic kernels using NVIDIA TensorRT plugins.

  • Flash attention & FMHA: We enable Fused Multi-Head Attention (FMHA) combined with flash attention to reduce memory reads/writes.
  • GEMM plugins: We use specialized GEMM plugins to accelerate transformer linear layers.
  • Removing input padding: Instead of padding short sequences to match the longest, we remove input padding so the GPU processes only valid tokens.

4. Inflight (Continuous) Batching

Traditional static batching waits for all requests in a batch to finish before returning results—so one long response delays everyone else.

We implement inflight batching: as soon as one request completes, its slot is freed and filled by a new request from the queue. This keeps GPUs saturated and decouples latency of short queries from long ones.

5. Parallelism Strategies: Scaling Beyond One GPU

For large models (e.g., 70B+ parameters) that cannot fit into the VRAM of a single GPU, we use parallelism strategies.

  • Tensor parallelism (TP): Split weight matrices across multiple GPUs (e.g., 4× L4 or 8× A100). Each GPU computes a shard and outputs are reduced at every layer.
  • Pipeline parallelism (PP): Split model layers across GPUs to pipeline compute (e.g., while one GPU computes later layers for Request A, another starts early layers for Request B).

6. Speculative Decoding

To reduce inter-token latency (ITL), we explore speculative decoding.

  • Mechanism: A smaller, faster "draft" model speculatively generates a short token sequence (e.g., 5 tokens).
  • Verification: The larger target model verifies those tokens in one parallel forward pass. If correct, we effectively generate multiple tokens per large-model step; if not, we discard and regenerate. This is effective for predictable text, improving perceived generation speed.

Few Benchmarks

Below are a couple of representative use cases and performance numbers.

Search query rewriting

  • LLM: Fine-tuned llama-3.2-1B
  • Input & output token length: ~10–20
  • Response type: Non-streaming
Inference runtimeHardwareMax requests/secMax p99 latency
TensorRT-LLM4 × L4 GPUs (multi-GPU)100095 ms
TensorRT-LLM1 × A100 40 GB GPU100069 ms

Voice bot query

  • LLM: Llama-3.1-8B
  • Input token length: ~1900–2000
  • Output token length: ~200
  • Response type: Streaming
Inference runtimeConcurrencyp99 TTFT (ms)p99 ITL (ms)Token throughput (tokens/sec)Request throughput (req/sec)Hardware
TensorRT-LLM136.2722.7845.660.23L4
TensorRT-LLM249.8123.2189.370.45L4
TensorRT-LLM455.3336.62153.390.78L4
TensorRT-LLM866.539.11279.881.47L4
TensorRT-LLM16131.830.39547.82.77L4
TensorRT-LLM32277.2248.02925.74.78L4
TensorRT-LLM64498.5271.621,164.406.2L4
TensorRT-LLM128677.31120.371,445.187.69L4
TensorRT-LLM2561,926.31216.881,600.818.52L4
TensorRT-LLM121.179.24130.050.68A100
TensorRT-LLM225.789.21264.51.35A100
TensorRT-LLM428.5210.99437.692.27A100
TensorRT-LLM834.412.61760.493.96A100
TensorRT-LLM1668.0314.321,343.807.01A100
TensorRT-LLM32185.9616.822,287.3011.92A100
TensorRT-LLM64136.8721.173,625.2218.89A100
TensorRT-LLM128463.7834.154,456.5123.24A100
TensorRT-LLM256890.1259.185,188.2427.05A100

Conclusion

High-performance LLM inference is fundamentally a systems engineering problem: memory efficiency, kernel execution, batching strategy, and parallelism determine real-world latency and throughput. Techniques such as paged KV caching, aggressive quantization, kernel fusion, and inflight batching improve GPU utilization while reducing latency and memory pressure.

These optimizations enable the platform to deliver sub-second responses, sustain high concurrency, and efficiently serve both lightweight and long-context workloads. By continuously optimizing across the full inference stack, we keep LLM serving scalable, cost-efficient, and production-ready for real-time AI applications.

Designing a Production-Grade LLM Inference Platform: From Model Weights to Scalable GPU Serving

· 14 min read
Jaya Kumar
Lead ML Engineer @ Meesho

BharatMLStack Serving large language models in production introduces new challenges across infrastructure, performance optimization, and operational lifecycle management. The LLM Inference Platform addresses these challenges by providing a unified system for deploying and managing open-source and fine-tuned LLMs at scale.

The platform implements a complete LLMOps lifecycle — from model registration and automated compilation to deployment, runtime optimization, and monitoring. Designed as a self-service environment, users can onboard models directly from open repositories such as Hugging Face or upload custom fine-tuned models, and deploy them using a single-click workflow with no manual infrastructure or configuration steps required.

In addition to fully automated deployment, the platform allows users to select and apply custom inference optimization techniques — such as quantization strategies, batching configurations, and runtime-specific performance enhancements — enabling teams to balance latency, throughput, and cost based on their use case. The goal is to reduce operational friction while enabling high-performance, production-grade LLM inference.

Why LLM Inference Is not just bigger ML model serving

Large language model (LLM) inference introduces a fundamentally different set of challenges compared to traditional machine learning inference. While classical ML models typically perform a single forward pass to produce a fixed prediction, LLMs operate as autoregressive systems, generating outputs token by token based on previously generated context. This difference dramatically changes how inference systems must be designed, optimized, and scaled.

Autoregressive Generation and Sequential Computation:

Unlike traditional models such as classifiers or recommenders — where inference cost is relatively constant — LLMs generate responses incrementally. Each new token depends on all previously generated tokens, making inference inherently sequential and dynamic. This means latency and compute requirements vary significantly depending on prompt length and output size, introducing complexity in scheduling and resource allocation. Because tokens cannot be generated fully in parallel during decoding, GPUs may become underutilized without specialized batching and scheduling strategies. This has led to the development of dedicated LLM inference engines optimized for token-level execution.

Prefill and Decode Phases:

LLM inference typically consists of two distinct stages:

  • Prefill phase — the model processes the input prompt and builds internal representations. This stage is compute-heavy and highly parallelizable.
  • Decode phase — the model generates tokens sequentially, predicting one token at a time using previously generated context.

The decode stage often becomes memory-bound rather than compute-bound, which creates new performance bottlenecks compared to traditional ML workloads.

Context Management and KV Caching:

Another fundamental difference lies in how LLMs maintain context. Transformer-based models rely on attention mechanisms that require access to past token representations. To avoid recomputing these representations repeatedly, inference engines use key-value (KV) caching, which stores intermediate activations from previous tokens. KV caching significantly improves performance by eliminating redundant computation, but it introduces new challenges:

  • Memory consumption grows with sequence length and batch size
  • GPU memory becomes a critical bottleneck
  • Efficient memory management becomes essential for scaling concurrent requests

This tradeoff between compute efficiency and memory usage is unique to LLM inference workloads.

Dynamic and Irregular Workloads:

Traditional ML inference typically operates on fixed-size inputs with predictable latency. In contrast, LLM requests vary widely in prompt length, output length, and runtime behavior. As a result:

  • Batch sizes must be dynamic rather than static
  • Requests may enter and leave batches asynchronously
  • Scheduling systems must continuously rebalance workloads to maximize GPU utilization

These characteristics require specialized serving architectures that differ significantly from standard ML serving pipelines.

Streaming and User Experience Constraints:

Another distinguishing factor is the expectation of real-time streaming responses. Instead of returning a single output, LLM systems often stream tokens to users as they are generated. Because of these differences — sequential generation, growing memory requirements, dynamic workloads, and streaming constraints — LLM inference cannot be treated as a simple extension of existing ML serving systems. Production platforms must incorporate specialized runtime engines, advanced optimization techniques, and observability tailored specifically to LLM workloads.

LLMOps: High-Level Architecture

LLM Architecture

The LLM Inference Framework is designed as a fully automated, end-to-end system for deploying and operating open-source and fine-tuned large language models at scale. The architecture abstracts the complexity of model optimization, hardware selection, deployment, and runtime management into a unified workflow that enables users to move from raw model weights to production-ready inference endpoints with minimal manual intervention.

Our LLM Inference Framework is architected not just as a serving engine, but as a complete lifecycle management system. As illustrated in the high-level design below, the platform automates the journey of a model through seven distinct stages, ensuring reproducibility, performance, and scalability.

  1. Onboarding & Registration (The Source of Truth)

    The lifecycle begins with the Data Scientist or engineer.

    • Model Ingestion: Users onboard models—whether open-source (Hugging Face, NeMo) or internally fine-tuned—via the Truffle Box SDK/UI.
    • LLM + Prompt Registry: Unlike traditional systems that only track model weights, our registry is a unified control plane. It stores both the Model Artifacts and the Prompt Templates. This allows Data Scientists to register and version-control prompts (e.g., "customer_support_v2") independently of the application code.
  2. The "Black Box" Build Engine

    Once a model is registered, the Automated LLM Compiler + Quantizer Module kicks off a background job on ephemeral GPU resources.

    • Transformation: The raw model is converted into a TRT-LLM Checkpoint.
    • Quantization: The system automatically applies quantization algorithms (like INT4 AWQ or FP8) to reduce memory footprint.
    • Engine Building: Finally, it compiles a highly optimized TRT Engine specifically tuned for the target hardware.
  3. Intelligent Profiling & Validation

    Before deployment, the new engine passes through the Hardware & Inference Runtime Profiler.

    • Benchmarking: This module empirically tests the engine against various hardware configurations (L4 vs. A100) and runtimes (TRT-LLM vs. vLLM).
    • Optimization: It recommends the optimal configuration that meets latency SLAs (Time-To-First-Token) while minimizing cost.
  4. Smart Artifact Generation & Distribution

    To solve the Kubernetes "Cold Start" problem, the LLM Serving Artifacts Generation module packages the model using a bifurcated strategy:

    • Standard Models: Artifacts are uploaded to Cloud Storage (GCS) and downloaded by pods at startup.
    • Very Large Models: For massive models (>8GB) where network downloads are too slow, the system pre-caches the model onto Secondary Boot Disks. These disks are attached directly to new GPU nodes during autoscaling, eliminating download wait times.
  5. Image Streaming & Deployment

    Simultaneously, the inference runtime container images are pulled from the Artifact Registry.

    • Image Streaming: We utilize container image streaming to allow pods to start initializing while the massive Triton/Dynamo container layers are still downloading, further shaving seconds off the startup time. link
  6. The Inference Runtime (Kubernetes)

    The workload lands on Kubernetes with Autoscaling.

    • Dynamic Backends: Depending on the profile generated in Stage 3, the pod initializes either TensorRT-LLM (for throughput) or vLLM (for flexibility), or spins up a Dynamo worker for distributed inference.
    • Data Loading: The pod either downloads the model from Cloud Storage or mounts the pre-warmed Secondary Boot Disk ("Pull from Disk").
  7. Client Interaction & Observability

    Finally, the LLM Inference Client executes the request.

    • Prompt Injection: The client pulls the specific prompt template ID from the Registry, ensuring the exact versioned instructions are used.
    • Streaming Response: The request is sent via gRPC, and tokens are streamed back to the user in real-time.
  8. Observability: Monitoring the Pulse of GenAI

    In traditional microservices, success is measured by CPU utilization and request latency (p99). For Large Language Models, these metrics are insufficient. A user doesn't care if the GPU is at 80% utilization; they care about how fast the first word appears and how smoothly the rest of the sentence follows.

    To capture the true user experience, our platform instrumentation focuses on three critical LLM-specific metrics:

    1. Time to First Token (TTFT)

      • Definition: TTFT measures the time elapsed from the moment a request is received until the very first token is generated and streamed back to the user.
      • Why it matters: This represents the "Prefill Phase" latency—the time the model takes to process the input prompt and load weights. A high TTFT makes the application feel unresponsive or "hung."
      • Optimization: We closely monitor TTFT to ensure our Prefix Caching is effective (aiming for high cache hitrates), which drastically lowers this metric by skipping redundant prompt processing.
    2. Inter-Token Latency (ITL)

      • Definition: ITL measures the average time interval between the generation of consecutive tokens during the "Decode Phase".
      • Why it matters: This defines the "perceived speed" of reading. Even if the first token is fast (low TTFT), high ITL makes the text generation look "jerky" or slow to the user.
      • Benchmarks: In our testing with Llama 3.1, we track p99 ITL to ensure it stays below human reading speeds to maintain a natural conversational flow.
    3. Token Throughput vs. Request Throughput

      • We distinguish between two types of throughput to balance system efficiency with user load:
      • Token Throughput (tokens/sec): The total number of tokens generated across all concurrent requests. This measures the raw compute efficiency of the GPU and the effectiveness of batching.
      • Request Throughput (req/sec): The number of distinct user queries served per second. We use this to determine autoscaling thresholds, ensuring we scale out before the queue depth impacts ITL.
    4. The Monitoring Stack

      • Real-time Dashboards: We utilize Grafana to visualize these streaming metrics in real-time, allowing on-call engineers to spot "slow generation" incidents that generic "500 error" alerts would miss.
      • Request Tracing: Since Triton Inference Server does not log request payloads by default, we integrate a Helix Client to asynchronously publish request logs to Log Tables. This allows us to trace a specific "slow" request back to its prompt to understand if a complex input caused the latency spike.

Supported Inference backends (TensorRT LLM, Dynamo & vLLM)

Tailored for the Use Case: We do not believe in a "one-size-fits-all" approach to inference. Different use cases—whether a real-time voice bot requiring ultra-lowsub-second latency or a massive reasoning task requiring huge context windows—demand different runtime characteristics. Our platform is designed to be runtime-agnostic, allowing us to automatically select and tailor the best engine based on the specific requirements of the application:

  1. TensorRT-LLM: The High-Performance Standard

    Suitable for: High-throughput production workloads where latency is critical (e.g., customer support chat, real-time voice bots).

    TensorRT-LLM serves as our default backend for these scenarios. Our internal benchmarks on Llama 3.1 and 3.2 models demonstrated that a tuned TensorRT-LLM engine significantly outperforms standard runtimes, especially when utilizing INT4 AWQ and FP8 quantization .

    Key optimizations we tailor for these high-load cases include:

    • Optimized execution via TensorRT engine compilation
    • Quantization-aware execution for reduced memory usage and improved throughput
    • Inflight Batching: Allowing requests to be processed continuously without waiting for the entire batch to finish, drastically improving GPU utilization .
    • Custom Plugins: Enabling specific NVIDIA plugins like the GEMM plugin and GPT Attention plugin to accelerate matrix multiplications and attention mechanisms .
  2. Dynamo: Distributed Inference for Reasoning Models

    Suitable for: Very large "reasoning" models (70B+) or scenarios requiring massive context windows where a single GPU's memory is insufficient.

    For these memory-bound tasks, we utilize Dynamo, a low-latency distributed inference framework . Unlike monolithic servers, Dynamo disaggregates the inference process to scale resources horizontally:

    • KV Aware Routing: A specialized router directs requests to workers that already hold the relevant Key-Value (KV) cache, minimizing redundant computation .
    • Prefill vs. Decode Split: The workload is divided into Prefill Workers (processing the prompt) and Decode Workers (generating tokens), allowing us to scale the compute-heavy "reading" phase independently from the memory-heavy "writing" phase .
    • Distributed execution across multiple GPU resources
  3. vLLM: The Flexible Baseline

    Suitable for: Rapid prototyping, testing new model architectures, or low-traffic internal tools where ease of deployment outweighs raw throughput.

    While TensorRT-LLM is optimized for maximum speed, vLLM provides a robust and flexible baseline .

    • High throughput through dynamic batching and efficient memory utilization
    • Paged KV cache management for handling long contexts and concurrent requests
    • Strong support for open-source model ecosystems
    • Rapid Adoption: It allows us to onboard new model architectures immediately without waiting for a custom TensorRT build.
    • Benchmarking Insight: In our internal tests, vLLM provided a strong baseline but often lacked the specific max-token optimizations present in our custom TRT engines . We use it strategically for initial testing before committing to a full TensorRT optimization pipeline.

Conclusion

Large language model inference introduces a fundamentally new class of infrastructure challenges—where performance is governed not just by raw compute, but by memory efficiency, intelligent scheduling, runtime specialization, and lifecycle automation. Unlike traditional ML serving, LLM inference requires systems that understand token-level execution, manage rapidly growing context state, and continuously balance latency, throughput, and cost under highly dynamic workloads.

The LLM Inference Framework addresses these challenges by transforming inference into a fully automated, reproducible lifecycle—from model onboarding and compilation to deployment, optimization, and observability. By integrating automated quantization and engine compilation, intelligent runtime selection, cold-start mitigation strategies, and LLM-specific observability metrics such as Time-to-First-Token and Inter-Token Latency, the platform ensures both high performance and operational simplicity.

Equally important, the framework is designed with flexibility and future evolution in mind. Its runtime-agnostic architecture enables seamless adoption of emerging inference engines, hardware accelerators, and optimization techniques without requiring platform redesign. This ensures that teams can continuously leverage advancements in the rapidly evolving LLM ecosystem while maintaining consistent operational workflows.

Ultimately, the goal of the platform is to make production-scale LLM deployment as seamless and reliable as traditional software deployment—allowing teams to focus on building intelligent applications rather than managing infrastructure complexity. By combining lifecycle automation, runtime optimization, and deep observability, the LLM Inference Framework provides a scalable foundation for delivering fast, cost-efficient, and production-ready LLM experiences.

Future Explorations

While we have achieved significant milestones in latency and throughput, the landscape of GenAI is evolving rapidly. Our roadmap focuses on increasing flexibility, reducing costs, and enhancing reliability for enterprise-grade workloads. Here is what we are building next:

  • TPU Support: To diversify our hardware supply chain and further optimize cost-per-token, we are evaluating Google Cloud TPUs to bake it into our platform. By leveraging the JAX and PyTorch/XLA ecosystems, we aim to unlock the massive throughput potential of TPU v5e chips, particularly for our open-source Llama models. This will allow the hardware profiler to dynamically choose between NVIDIA GPUs and Google TPUs based on real-time availability and price-performance metrics.
  • Multi-LoRA Serving (Serverless Experience): Currently, deploying a fine-tuned model requires a dedicated GPU. We are building support for Multi-LoRA serving, which will allow us to serve hundreds of unique, fine-tuned adapters on top of a single frozen base model. This will drastically reduce costs for multi-tenant applications, enabling a "serverless" experience where specific fine-tunes are hot-swapped instantly per request.
  • Spot Instance Orchestration: To further optimize cloud costs, we are developing fault-tolerant mechanisms to run inference workloads on Spot Instances. By implementing aggressive checkpointing and seamless request draining, we aim to leverage cheaper, preemptible compute capacity without interrupting the user's streaming experience.
  • Semantic Caching Layer: We plan to move beyond standard Prefix Caching to implement Semantic Caching. By using a vector database to fetch responses for semantically similar queries (e.g., "How do I reset my password?" vs. "Password reset steps"), we can bypass the GPU entirely for repetitive queries, reducing latency to near-zero.
  • Context-Aware Autoscaling: Standard CPU/GPU utilization metrics are often insufficient signals for scaling LLMs. We are working on KV-cache pressure metrics for autoscaling. This ensures that we scale out before the memory fills up, preventing eviction-based slowdowns during traffic spikes.
  • Online Evaluation & Guardrails: We are integrating a lightweight "Trust Layer" into the proxy. This will allow for low-latency input/output filtering (Guardrails) and asynchronous "LLM-as-a-Judge" evaluation pipelines to monitor response quality in production, not just system health.

Cracking the Code: Scaling Model Inference & Real-Time Embedding Search

· 4 min read
Aditya Kumar
Lead Software Engineer @ Meesho
Jaya Kumar
Lead ML Engineer @ Meesho
Adarsha Das
Senior Architect @ Meesho

BharatMLStack By mid-2023, we had transformed our ML stack—building a real-time feature store, optimizing model retrieval, and fine-tuning ranking. But two critical gaps remained:

  • 🔹 Scaling model inference without hitting infrastructure roadblocks
  • 🔹 Moving embedding search from batch to real-time for candidate generation

Here’s how we tackled these last-mile challenges, broke free from infrastructure constraints, and built a cost-efficient, high-performance system.

Breaking Free from the Scalability Ceiling

The Model Serving Bottleneck—A Wake-Up Call

July 2023. With just months left for the Mega Blockbuster Sale (MBS), we noticed a serious issue—scaling our model-serving infrastructure was taking 10–15 minutes. In real-time ML, that’s an eternity. In one of our war rooms, we ran a quick experiment:

  • 🚀 We deployed an XGBoost model on a self-hosted Triton Inference Server running on a 16-core machine.
  • 🚀 Fired requests and compared the outputs with our existing cloud-hosted setup.
  • 🚀 The results matched—perfectly.

That moment changed everything. We prepped a backup Triton setup on EKS, just in case our cloud provider couldn't allocate enough compute resources in time. Luckily, they did—but the seed was planted. Then in October, just two weeks before MBS, we got an alarming response from our infrastructure team: "Node availability may be an issue." With no time to waste, we moved 30% of real-time ML traffic to our self-hosted Triton cluster. The results?

  • ✅ p99 latency dropped from 90–100ms to 30–40ms
  • ✅ Triton handled significantly higher throughput on fewer resources
  • ✅ No model changes were needed

MBS ran without a hitch, proving that self-hosted inference was the way forward.

Scaling Triton on GKE

This left us with two choices:

  • 1️⃣ Port models to a managed cloud inference service, investing time in learning a new deployment stack
  • 2️⃣ Scale our existing Triton setup on GKE, optimizing for cost and performance

We went with Option 2—and it slashed inference costs to 35% of what we previously paid, while giving us full control over scaling and optimizations.

Fixing the Cold Start Problem

As we onboarded more deep learning (DL) models, we hit a new bottleneck, new inference pods took 7–9 minutes to spin up.

After profiling, we found the culprits:

  • Triton’s base image—a massive 5GB
  • Model binaries—often 1GB+
  • Startup delay—mostly due to downloading and initializing these assets

To fix this, we built a lightweight Triton image, stripping unused components and shrinking the size to 900MB. This cut cold start times drastically, making auto-scaling faster and smoother.

Embedding Search: The Last Piece of the Puzzle

By mid-2023, most of our ML stack had gone real-time—except for Candidate Generation (CG), which still ran in batch mode. To truly power real-time recommendations, we needed an online embedding search system.

Choosing the Right Vector Database

We benchmarked three production-ready vector DBs across key parameters:

  • Milvus
  • Qdrant
  • Weaviate

After extensive POCs, Qdrant stood out for its:

  • ✅ Blazing-fast search latency on high-dimensional vectors
  • ✅ Efficient memory usage, crucial for in-memory workloads
  • ✅ Support for upserts and soft deletes, vital for Ads use cases
  • ✅ gRPC + REST APIs, making integration seamless
  • ✅ Powerful filtering, allowing fine-tuned retrieval (e.g., filtering Ads by category, active status, etc.)

At its core, Qdrant uses HNSW indexing, delivering both high recall and low-latency nearest-neighbor search—a perfect fit for our needs.

Embedding Freshness & Real-Time Updates

To ensure embeddings stayed up to date, we built a dual ingestion pipeline:

  • 📌 Daily Refresh: A bulk pipeline updated embeddings overnight
  • 📌 Real-Time Updates: Ads events triggered immediate upserts/deletes

This setup powered real-time "Similar Products" recommendations on the product page and became the foundation for Ads Candidate Generation, ensuring the right ads surfaced in milliseconds.

Skye

Final Takeaways: Scaling Smartly for Real-Time ML

  • 🚀 Self-hosted inference on Triton gave us lower cost, faster scaling, and better performance than managed services
  • 🚀 Building a custom Triton image reduced cold starts, improving responsiveness
  • 🚀 Qdrant-based embedding search enabled real-time personalization at scale
  • 🚀 Real-time updates for embeddings unlocked dynamic, up-to-date recommendations

By early 2024, Meesho’s ML stack had evolved into a fully real-time, scalable, and cost-efficient system, setting the foundation for even bigger leaps ahead.

Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

· 7 min read
Bhawani Singh
Architect @ Meesho
Jigar Dave
Lead Software Engineer @ Meesho
Adarsha Das
Senior Architect @ Meesho

BharatMLStack By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation. And it worked. Mostly. But soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn’t built for scale. This is the story of how we tackled these challenges—building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS.

The Cost of Success

Every new Ranker model required its own feature set, often pulling from different entities. Each addition meant:

  • Adding new DAG nodes in IOP
  • Writing custom logic to fetch features from multiple sources (e.g., user, product, user × category)
  • Inferring intermediate features (e.g., extracting category from a product to fetch user × category data)
  • Optimizing I/O and dealing with the inevitable bugs

What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations.

Scaling Pains (and Cassandra’s Limits)

At some point, we were hitting:

  • 250–300K reads/sec
  • 1M writes/sec (during lean hours)

All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership.

Interaction Store Woes

Our interaction store was another ticking time bomb:

  • 🚨 Clusters kept growing in size and cost
  • 🚨 Latency spikes became increasingly frequent
  • 🚨 The DMC proxy occasionally lost locality of nodes against shards, causing cross-node communication and degraded performance

Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale.

Silver Linings

Despite the chaos, the system was live and delivering value:

  • Real-time infrastructure was in production
  • Costs dropped by 60–70% compared to offline personalization
  • New experiments rolled out faster and more successfully
  • User engagement metrics improved

It wasn’t perfect. It was far from easy. But it worked—and that counted for a lot.

Round Two: Solving the Top 2 Bottlenecks

With the first-gen system stretched to its limits, we stepped back. Conversations with data scientists and backend engineers revealed three recurring pain points:

  1. Coding feature retrieval logic for every new model was becoming unsustainable
  2. ML scale was exploding—bringing rising infra costs with it
  3. Real-time embedding search was the next big unlock

We tackled them one by one—starting with the biggest pain point.

Problem 1: No-Code Feature Retrieval for Model Inference

We noticed a pattern: for personalized ranking, models needed features from:

  • ✅ Product
  • ✅ User
  • ✅ User × Category
  • ✅ Region, cohort, sub-category, etc.

A key insight emerged: Entities that contribute features for a model always map back to the context entities.

MP Dag

With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:

  • 1️⃣ Inferflow takes a modelId and context IDs (e.g., userId, productIds)
  • 2️⃣ Loads a pre-defined feature retrieval graph from ZooKeeper
  • 3️⃣ Executes the graph to resolve entity relationships dynamically
  • 4️⃣ Outputs a 2D matrix of feature vectors

💡 The impact?

  • 🚀 No more custom feature retrieval code—just graph updates in config
  • 🚀 Feature consistency across experiments
  • 🚀 Faster iteration cycles for ranking, fraud detection, and beyond

Here’s a visual example that shows how this graph plays out during execution. We further extended the graph to call multiple models as needed: MP matrix We built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency.

Problem 2: Scaling Without Breaking the Bank

With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:

  • 🔹 Online Feature Store
  • 🔹 Interaction Store

Optimizing the Online Feature Store

Our costs were concentrated in:

  • 📌 Database (Cassandra)
  • 📌 Cache (Redis)
  • 📌 Running Pods (Java services)

1️⃣ Replacing Cassandra with ScyllaDB As we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. The switch brought significant benefits:

  • Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency.
  • Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization.
  • Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes.

2️⃣ Finding the Right Cache To reduce backend load and improve response times, we benchmarked multiple caching solutions—Memcached, KeyDB, and Dragonfly—under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:

  • Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation.
  • Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access.
  • Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility—no changes needed in application code or client libraries.

3️⃣ Moving to GoLang for Cost-Efficient Serving Java services were memory-heavy—so we rewrote core services in GoLang. The results?

✅ Memory usage dropped by ~80% ✅ CPU utilization was significantly lower ✅ Faster, more efficient deployments

Optimizing the Interaction Store

We realized that we only need a user’s interaction data in Redis when they open the app. So, we implemented a tiered storage approach:

  • 📌 Cold Tier (ScyllaDB)—Stores click, order, wishlist events
  • 📌 Hot Tier (Redis)—Loads a user’s past interactions only when they open the app

Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes.

InteractionStore

Results

  • Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale—without breaking a sweat
  • Infra costs for Online Feature Store and Interaction Store dropped by ~60%

The Catch: Our ML Hosting Hit a Hard Limit

While planning for 2023 MBS, we ran into a critical scalability bottleneck:

  • ❌ Insufficient compute availability in our region for ML instances
  • ❌ Couldn’t provision enough nodes to handle real-time inference at scale

This forced us to rethink where and how we hosted our models. The existing setup was great for prototyping—but it wasn’t built to handle the bursty, high-QPS demands of real-world production workloads.

Conclusion: From Firefighting to Future-Proofing

What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency—driving down costs while improving experimentation velocity. But new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That’s the next piece of the puzzle—and the story of Part 3.

Building Meesho’s ML Platform: From Chaos to Cutting-Edge (Part 1)

· 11 min read
Adarsha Das
Senior Architect @ Meesho
Aditya Kumar
Lead Software Engineer @ Meesho
Bhawani Singh
Architect @ Meesho
Jigar Dave
Lead Software Engineer @ Meesho

BharatMLStack It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting—until one remark hit a little too close to home:

"Why are we still crunching data for Monthly Active Users (MAU) when the next day it’s all about Daily Active Users (DAU)?"

The laughter died down, and the question lingered. When we regrouped on Monday—clear-headed and slightly reflective—we decided to dig into the numbers. What they discovered was quite revealing: a large portion of compute resources wasn’t being put to good use. Much of the system’s effort was spent supporting users who weren’t actively engaging, and even for new users, the experience wasn’t optimized to make a meaningful impact.

At the same time, Meesho had just launched a company-wide initiative to reduce costs—and every team had to contribute. This realization sparked the journey that would eventually lead to the Meesho ML Platform, known today as BharatMLStack.

Alt Text

Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:

  • Data Ingestion: The Data Platform team executed ETL jobs to ingest raw user data—including user profiles, interaction logs, and product impressions—into designated S3 buckets.
  • Layer 1: Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format.
  • Layer 2: Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. These candidate lists were subsequently written to S3.
  • Layer 3: Ranking and Merging – A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system.
  • Serving: A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).

This approach held up well—until Meesho started seeing a significant surge in traffic.

The Turning Point: From Batch to Real-Time

At this time, the team was iterating on new Ranker models, and real-time inference seemed like the next logical step. But Rankers needed real-time feature retrieval, which meant an online feature store had to be built first.

Exploring open-source options led to cost vs. performance trade-offs, but Meesho’s surging traffic meant that latency and stability were non-negotiable. After multiple debates and stakeholder discussions, a bold decision was made:

We would build our own feature store.

Meanwhile, efforts began to bring Candidate Generators (CGs) to real-time. The challenge? Storing and retrieving user interactions quickly enough to power real-time recommendations.

As the team dove deeper, a new roadblock emerged:
Our ML jobs were orchestrated using Airflow DAGs, giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, slowing down iteration cycles.

That’s when the idea struck:
We needed a framework for real-time DAG execution—one that preserved the same flexibility as Airflow but worked for streaming data.

This moment shaped the next phase of our journey.

First Generation Design

Alt Text

Laying the Groundwork: The First-Gen ML Platform

To solve these challenges, the team built three foundational components:

1. IOP Framework: A Real-Time DAG Executor

  • Reusable Nodes: Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. After that, it could be reused across any workflow by referencing it in config.
  • Config-driven Dynamic Graphs: Execution graphs were defined as adjacency lists stored in ZooKeeper, allowing teams to modify the sequence or structure of operations without touching application code.
  • Plug-and-play CGs: The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing cg_name in the request. This drastically reduced the code surface area and improved maintainability.
  • Production-Grade DAGs: DAGs were designed to execute in low-latency real-time environments, with support for parallel execution, retries, and branching.
More about IOP DAG

2. Online Feature Store - 0th Version

  • Used Cassandra and Redis for low-latency feature serving.
  • Maintained feature consistency using Feature Groups with TTL-based expiry.
  • A hybrid schema was used: feature keys stored in ZooKeeper, data stored in compact arrays.

3. Interaction Store - 0th Version

  • Captured real-time user interactions like clicks, orders, and add-to-cart events.
  • Stored event data in Redis ZSETs (sorted sets) to enable fast lookups for recommendation engines.
  • Provided an API to fetch a user's last k interactions or interactions within a time window.

With these components in place, real-time ML at Meesho became a reality.

This was just the beginning.

Building the Online Feature Store - 0th Version

Alt text

Choosing the Right Tech Stack

We spent considerable time evaluating various databases, caches, and communication protocols for our online feature store. After carefully weighing cost, latency, throughput, and operational stability, we settled on a combination of:

  • Cassandra and Redis for storage
  • gRPC + Proto3 as our communication layer

Streamlining the Data Flow

To keep things simple in the initial version:

  • Feature engineering jobs wrote raw outputs to an S3 bucket
  • A daily feature push job:
    • Read from S3
    • Grouped related features into Feature Groups (ensuring consistency)
    • Pushed them to Kafka

For features requiring frequent updates:

  • Ad-hoc jobs computed features in higher frequency
  • These jobs pushed to both Kafka and S3 (S3 preserved historical data for future model training)

The Challenges: Data Format and Storage

One of the most critical design challenges was how to store feature data efficiently and consistently, especially in databases like Cassandra and Redis, which come with unique storage constraints.

We had to solve for three key requirements:

  • Feature Consistency

    When a feature group contains features like order_count_1h and click_count_1h, both must reflect the same time window. Inconsistent updates would lead to unreliable model predictions.

  • TTL Granularity

    Each feature group required an expiry timestamp, so that all features within it expired together—preserving consistency during reads.

  • Extensibility Across Databases

    We anticipated that infra needs would evolve. To future-proof our system, the data format was designed to be decoupled from DB-specific layouts, enabling portability to systems like ScyllaDB, DynamoDB, HBase, or BigTable.


Overcoming Technical Constraints

At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly in memory constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.

The Solution: Schema Separation

We introduced the concept of Feature Groups—logical groupings of features that must remain consistent with one another. To represent these groups efficiently, we adopted a layered storage approach:

  • Feature Labels (Keys) were stored in ZooKeeper, serving as the schema.
  • Feature Values were stored as a comma-separated string array in Cassandra or Redis.
  • Expiry Timestamp and Schema Version were appended using a semi-colon delimiter at the end of the string.

Example:

feature_1_value,feature_2_value,feature_3_value;expiry_ts

This format allowed:

  • Consistent writes and reads at the group level
  • Easy parsing of feature values using the schema lookup from ZooKeeper
  • Efficient storage with minimal DB column usage
  • Support for per-group TTLs and schema evolution

Tracking Changes in Feature Groups

Feature groups don’t stay static. As models evolve, features get added, renamed, or removed. But schema changes often go live before the data is ready—and stopping ingestion just to wait for everything to align isn't feasible.

Common Real-World Scenarios:

  • A new feature is added to the schema, but ingestion jobs still use the older schema version.
  • Ongoing writes don’t include the newly added feature, and stopping ingestion would break freshness for existing features.
  • During serving, models request a mix of old and new features, depending on rollout stages.

The Solution: Schema Versioning

We solved this with versioned feature group schemas, which unlocked several capabilities:

  • Backward Compatibility

    Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly.
  • Partial Availability Handling

    During inference, if some features in the request aren’t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn’t fail.
  • Safe Writes Without Pipeline Pauses

    With schema versioning, we no longer had to stop ingestion pipelines for schema updates. Writes using previous versions can continue safely, and downstream consumers evolve independently. This design gave us the flexibility to move fast without breaking things—preserving data quality, enabling experimentation, and ensuring reliability at scale.

Alt Text

Interaction Store - 0th Version

Alt Text

To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals—like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as Similar Products, People Also Viewed, or Recently Ordered Again. For the 0th version of the Interaction Store, we focused on a design that was simple, fast, and reliable — optimized for high-throughput ingestion and low-latency lookups.

Event Ingestion

We instrumented our backend services to emit key user interaction events to Kafka in real time. These included:

  • Click
  • Order
  • Add to Cart
  • Wishlist
  • Share

Each event carried essential metadata:

  • userId — uniquely identifies the user
  • productId — the item being interacted with
  • timestamp — the moment the interaction occurred

This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently.

Storage Design

To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure.

Why Redis?

Redis gave us:

  • Low-latency reads and writes
  • Time-ordered data using ZSETs (via score = timestamp)
  • Native TTL support, if needed in later versions
  • In-memory performance —ideal for real-time CGs

Storage Structure

Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

userId_eventType → ZSET[...(pid, ts)...]

Within each ZSET:

  • The timestamp served as the score, maintaining temporal order
  • The productId (optionally with metadata) was the value

This allowed us to efficiently retrieve the interactions with HTTP-based API server with two query modes:

  • Fetch the last k interactions of a specific type for a given user with ZREVRANGE(userId_eventType, count)
  • Retrieve all interactions within a time range (e.g., last 24 hours) with ZREVRANGEBYSCORE(userId_eventType, timeRange)

Built-in Guardrails

Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. To optimize memory usage, we also enforced size limits per event type—only storing the last k interactions per user, with older entries getting truncated.

Conclusion: Laying the Foundation for Real-Time ML

In this first phase, we tackled the fundamentals—shifting from batch-based recommendations to a real-time Recommendation using ML platform that could keep up with Meesho’s growth.

With the IOP Framework, Online Feature Store, and Interaction Store, we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:

  • ✅ Faster, more dynamic recommendations for millions of users.
  • ✅ Better infrastructure efficiency, reducing wasted compute power.
  • ✅ A flexible, modular system that allows for further experimentation.

But this is just the beginning. While we've solved key challenges, certain roadblocks remain —from optimizing cost-performance trade-offs to seamlessly evolving schemas.

This foundational work laid the path for a reliable and scalable real-time feature serving layer.