
2 posts tagged with "interaction-store"


Building Meesho’s ML Platform: Lessons from the First-Gen System (Part 2)

· 7 min read
Bhawani Singh
Architect @ Meesho
Jigar Dave
Lead Software Engineer @ Meesho
Adarsha Das
Senior Architect @ Meesho

By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation. And it worked. Mostly. But soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn’t built for scale. This is the story of how we tackled these challenges—building Inferflow for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS.

The Cost of Success

Every new Ranker model required its own feature set, often pulling from different entities. Each addition meant:

  • Adding new DAG nodes in IOP
  • Writing custom logic to fetch features from multiple sources (e.g., user, product, user × category)
  • Inferring intermediate features (e.g., extracting category from a product to fetch user × category data)
  • Optimizing I/O and dealing with the inevitable bugs

What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations.

Scaling Pains (and Cassandra’s Limits)

At some point, we were hitting:

  • 250–300K reads/sec
  • 1M writes/sec (during lean hours)

All of this ran on Cassandra. While its distributed architecture had been proven in production, operating large-scale clusters came with considerable infrastructure overhead. Our proof-of-concept (POC) demonstrated throughput of around 100K ops/sec, but as we scaled further, the challenges grew. Ensuring node health, optimizing compaction, and maintaining storage balance became increasingly demanding. We also observed latency spikes under heavy load, alongside a sharp increase in total cost of ownership.

Interaction Store Woes

Our interaction store was another ticking time bomb:

  • 🚨 Clusters kept growing in size and cost
  • 🚨 Latency spikes became increasingly frequent
  • 🚨 The DMC proxy occasionally lost track of node-to-shard locality, causing cross-node communication and degraded performance

Each time this happened, we had to manually rebalance shards just to restore stable latency, making operations unsustainable at scale.

Silver Linings

Despite the chaos, the system was live and delivering value:

  • Real-time infrastructure was in production
  • Costs dropped by 60–70% compared to offline personalization
  • New experiments rolled out faster and more successfully
  • User engagement metrics improved

It wasn’t perfect. It was far from easy. But it worked—and that counted for a lot.

Round Two: Solving the Top 2 Bottlenecks

With the first-gen system stretched to its limits, we stepped back. Conversations with data scientists and backend engineers revealed three recurring pain points:

  1. Coding feature retrieval logic for every new model was becoming unsustainable
  2. ML scale was exploding—bringing rising infra costs with it
  3. Real-time embedding search was the next big unlock

We tackled them one by one—starting with the biggest pain point.

Problem 1: No-Code Feature Retrieval for Model Inference

We noticed a pattern: for personalized ranking, models needed features from:

  • ✅ Product
  • ✅ User
  • ✅ User × Category
  • ✅ Region, cohort, sub-category, etc.

A key insight emerged: Entities that contribute features for a model always map back to the context entities.


With this, we designed Inferflow, a graph-driven feature retrieval and model orchestration system:

  • 1️⃣ Inferflow takes a modelId and context IDs (e.g., userId, productIds)
  • 2️⃣ Loads a pre-defined feature retrieval graph from ZooKeeper
  • 3️⃣ Executes the graph to resolve entity relationships dynamically
  • 4️⃣ Outputs a 2D matrix of feature vectors

💡 The impact?

  • 🚀 No more custom feature retrieval code—just graph updates in config
  • 🚀 Feature consistency across experiments
  • 🚀 Faster iteration cycles for ranking, fraud detection, and beyond

Here’s a visual example of how this graph plays out during execution—we further extended the graph to call multiple models as needed. We built Inferflow in GoLang, using gRPC and Proto3 serialization for efficiency.
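
To make the request path concrete, here is a minimal Go sketch of an Inferflow-style retrieval: given a model's pre-defined graph and the context IDs, it resolves an entity key per node, fetches that entity's features, and assembles the 2D feature matrix. The type names, the hard-coded key resolution, and the FeatureStore interface are illustrative assumptions—the production system loads its graphs from ZooKeeper and resolves entity relationships dynamically.

```go
package inferflow

import (
	"context"
	"fmt"
)

// Node in the feature retrieval graph. DerivedFrom captures relationships such as
// user × category, where the category key is inferred from the product entity.
// All names here are illustrative, not the production schema.
type Node struct {
	Entity      string   // e.g. "user", "product", "user_x_category"
	DerivedFrom []string // parent entities whose IDs are needed to build this key
	Features    []string // feature labels to fetch for this entity
}

// RetrievalGraph is the pre-defined, per-model graph that would be loaded from ZooKeeper.
type RetrievalGraph struct {
	ModelID string
	Nodes   []Node
}

// FeatureStore abstracts the online feature store lookup.
type FeatureStore interface {
	Fetch(ctx context.Context, entity, key string, features []string) ([]float32, error)
}

// Infer resolves the graph for one (user, candidate products) request and returns
// a 2D matrix: one feature vector per candidate product.
func Infer(ctx context.Context, g RetrievalGraph, fs FeatureStore, userID string, productIDs []string) ([][]float32, error) {
	matrix := make([][]float32, 0, len(productIDs))
	for _, pid := range productIDs {
		row := []float32{}
		for _, n := range g.Nodes {
			key, err := resolveKey(n, userID, pid)
			if err != nil {
				return nil, err
			}
			vec, err := fs.Fetch(ctx, n.Entity, key, n.Features)
			if err != nil {
				return nil, err
			}
			row = append(row, vec...)
		}
		matrix = append(matrix, row)
	}
	return matrix, nil
}

// resolveKey maps context IDs to an entity key; real resolution (e.g. product -> category)
// would consult metadata rather than this hard-coded switch.
func resolveKey(n Node, userID, productID string) (string, error) {
	switch n.Entity {
	case "user":
		return userID, nil
	case "product":
		return productID, nil
	case "user_x_category":
		return userID + ":" + categoryOf(productID), nil
	default:
		return "", fmt.Errorf("unknown entity %q", n.Entity)
	}
}

// categoryOf is a stand-in for the intermediate-feature inference step.
func categoryOf(productID string) string { return "category-of-" + productID }
```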

Problem 2: Scaling Without Breaking the Bank

With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:

  • 🔹 Online Feature Store
  • 🔹 Interaction Store

Optimizing the Online Feature Store

Our costs were concentrated in:

  • 📌 Database (Cassandra)
  • 📌 Cache (Redis)
  • 📌 Running Pods (Java services)

1️⃣ Replacing Cassandra with ScyllaDB

As we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, which offered a seamless drop-in replacement without major code changes. The switch brought significant benefits:

  • Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency.
  • Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization.
  • Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes.

2️⃣ Finding the Right Cache

To reduce backend load and improve response times, we benchmarked multiple caching solutions—Memcached, KeyDB, and Dragonfly—under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:

  • Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation.
  • Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access.
  • Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility—no changes needed in application code or client libraries.

3️⃣ Moving to GoLang for Cost-Efficient Serving

Java services were memory-heavy—so we rewrote core services in GoLang. The results?

  • ✅ Memory usage dropped by ~80%
  • ✅ CPU utilization was significantly lower
  • ✅ Faster, more efficient deployments

Optimizing the Interaction Store

We realized that we only need a user’s interaction data in Redis when they open the app. So, we implemented a tiered storage approach:

  • 📌 Cold Tier (ScyllaDB)—Stores click, order, wishlist events
  • 📌 Hot Tier (Redis)—Loads a user’s past interactions only when they open the app

Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes.
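
Here is a minimal Go sketch of how the tiering and smart offloading could fit together, assuming hypothetical HotTier/ColdTier interfaces in front of Redis and ScyllaDB—the real inactivity tracker and flush path are more involved:

```go
package interactionstore

import (
	"context"
	"time"
)

// Tier abstractions; the real system used Redis for the hot tier and ScyllaDB for the
// cold tier. The method names are illustrative.
type HotTier interface {
	LoadUser(ctx context.Context, userID string, events []Event) error
	DrainUser(ctx context.Context, userID string) ([]Event, error)
}
type ColdTier interface {
	ReadUser(ctx context.Context, userID string) ([]Event, error)
	AppendUser(ctx context.Context, userID string, events []Event) error
}

type Event struct {
	ProductID string
	Type      string // click, order, wishlist, ...
	TS        time.Time
}

type Tiering struct {
	Hot        HotTier
	Cold       ColdTier
	lastActive map[string]time.Time
	idleAfter  time.Duration
}

func NewTiering(hot HotTier, cold ColdTier, idleAfter time.Duration) *Tiering {
	return &Tiering{Hot: hot, Cold: cold, lastActive: map[string]time.Time{}, idleAfter: idleAfter}
}

// OnAppOpen hydrates the hot tier with the user's history from the cold tier.
func (t *Tiering) OnAppOpen(ctx context.Context, userID string) error {
	events, err := t.Cold.ReadUser(ctx, userID)
	if err != nil {
		return err
	}
	t.lastActive[userID] = time.Now()
	return t.Hot.LoadUser(ctx, userID, events)
}

// Sweep plays the role of the inactivity tracker: users idle longer than idleAfter are
// flushed back to the cold tier so Redis only holds active sessions.
func (t *Tiering) Sweep(ctx context.Context) error {
	now := time.Now()
	for userID, last := range t.lastActive {
		if now.Sub(last) < t.idleAfter {
			continue
		}
		events, err := t.Hot.DrainUser(ctx, userID)
		if err != nil {
			return err
		}
		if err := t.Cold.AppendUser(ctx, userID, events); err != nil {
			return err
		}
		delete(t.lastActive, userID)
	}
	return nil
}
```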


Results

  • Online Feature Store hit 1M QPS for the first time during the 2023 Mega Blockbuster Sale—without breaking a sweat
  • Infra costs for Online Feature Store and Interaction Store dropped by ~60%

The Catch: Our ML Hosting Hit a Hard Limit

While planning for 2023 MBS, we ran into a critical scalability bottleneck:

  • ❌ Insufficient compute availability in our region for ML instances
  • ❌ Couldn’t provision enough nodes to handle real-time inference at scale

This forced us to rethink where and how we hosted our models. The existing setup was great for prototyping—but it wasn’t built to handle the bursty, high-QPS demands of real-world production workloads.

Conclusion: From Firefighting to Future-Proofing

What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Inferflow, and rebuilt our infra stack for efficiency—driving down costs while improving experimentation velocity. But new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That’s the next piece of the puzzle—and the story of Part 3.

Building Meesho’s ML Platform: From Chaos to Cutting-Edge (Part 1)

· 11 min read
Adarsha Das
Senior Architect @ Meesho
Aditya Kumar
Lead Software Engineer @ Meesho
Bhawani Singh
Architect @ Meesho
Jigar Dave
Lead Software Engineer @ Meesho

It all started in early 2022, over a casual Friday evening catch-up. Like many great origin stories, this one began with friendly banter between a group of backend engineers and data scientists. As the conversations unfolded, so did the roasting—until one remark hit a little too close to home:

"Why are we still crunching data for Monthly Active Users (MAU) when the next day it’s all about Daily Active Users (DAU)?"

The laughter died down, and the question lingered. When we regrouped on Monday—clear-headed and slightly reflective—we decided to dig into the numbers. What we discovered was quite revealing: a large portion of compute resources wasn’t being put to good use. Much of the system’s effort was spent supporting users who weren’t actively engaging, and even for new users, the experience wasn’t optimized to make a meaningful impact.

At the same time, Meesho had just launched a company-wide initiative to reduce costs—and every team had to contribute. This realization sparked the journey that would eventually lead to the Meesho ML Platform, known today as BharatMLStack.


Before the ML Platform, our recommendation and ranking pipelines followed a batch processing approach:

  • Data Ingestion: The Data Platform team executed ETL jobs to ingest raw user data—including user profiles, interaction logs, and product impressions—into designated S3 buckets.
  • Layer 1: Embedding Generation: On the Data Science side, Spark jobs pulled data from multiple S3 sources, cleaned and preprocessed it, and applied matrix factorization to generate user and item embeddings. The processed data and embeddings were then stored back in S3 in a structured format.
  • Layer 2: Candidate Generation (CG): In this stage, Spark jobs leveraged embeddings and historical interaction data to generate candidate recommendations for users. These candidate lists were subsequently written to S3.
  • Layer 3: Ranking and Merging: A final round of processing ranked the generated candidates using ML models, combined different candidate lists, and stored the final ranked recommendations in a caching system.
  • Serving: A microservice retrieved ranked recommendations from an in-memory data store via exposed APIs, delivering personalized listings across key surfaces such as "For You" and Category Landing Pages (CLP).

This approach held up well—until Meesho started seeing a significant surge in traffic.

The Turning Point: From Batch to Real-Time

At this time, the team was iterating on new Ranker models, and real-time inference seemed like the next logical step. But Rankers needed real-time feature retrieval, which meant an online feature store had to be built first.

Exploring open-source options led to cost vs. performance trade-offs, but Meesho’s surging traffic meant that latency and stability were non-negotiable. After multiple debates and stakeholder discussions, a bold decision was made:

We would build our own feature store.

Meanwhile, efforts began to bring Candidate Generators (CGs) to real-time. The challenge? Storing and retrieving user interactions quickly enough to power real-time recommendations.

As the team dove deeper, a new roadblock emerged:
Our ML jobs were orchestrated using Airflow DAGs, giving data scientists flexibility in experimentation. But transitioning to real-time execution threatened this agility. Every change would now require backend engineering support, slowing down iteration cycles.

That’s when the idea struck:
We needed a framework for real-time DAG execution—one that preserved the same flexibility as Airflow but worked for streaming data.

This moment shaped the next phase of our journey.

First Generation Design


Laying the Groundwork: The First-Gen ML Platform

To solve these challenges, the team built three foundational components:

1. IOP Framework: A Real-Time DAG Executor

  • Reusable Nodes: Each DAG node (e.g., an invocation to a CG service, a ranker, or a filter) had to be implemented only once. After that, it could be reused across any workflow by referencing it in config.
  • Config-driven Dynamic Graphs: Execution graphs were defined as adjacency lists stored in ZooKeeper, allowing teams to modify the sequence or structure of operations without touching application code.
  • Plug-and-play CGs: The Candidate Generator interface was preserved, so a single CG node could call any CG service by passing cg_name in the request. This drastically reduced the code surface area and improved maintainability.
  • Production-Grade DAGs: DAGs were designed to execute in low-latency real-time environments, with support for parallel execution, retries, and branching.
More about IOP DAG
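
For illustration, here is a rough Go sketch of a config-driven executor in this spirit: an adjacency list (as it might be stored in ZooKeeper) is decoded into a graph, and reusable nodes are looked up from a registry by name. The JSON shape, the interfaces, and the purely sequential walk are simplifications, not the actual IOP implementation.

```go
package iop

import (
	"context"
	"encoding/json"
	"fmt"
)

// GraphConfig mirrors the kind of adjacency list that could live in ZooKeeper;
// the JSON shape is illustrative, not the production format.
type GraphConfig struct {
	Nodes map[string][]string `json:"nodes"` // node name -> downstream node names
}

// Node is a reusable unit of work (a CG call, a ranker, a filter, ...), implemented
// once and referenced from any workflow by name.
type Node interface {
	Run(ctx context.Context, in map[string]any) (map[string]any, error)
}

// Execute walks the adjacency list from root, running each node once and passing its
// output to its children. Real IOP DAGs also support parallel branches, retries, and
// branching, which are omitted here for brevity.
func Execute(ctx context.Context, cfg GraphConfig, registry map[string]Node, root string, in map[string]any) error {
	visited := map[string]bool{}
	var walk func(name string, in map[string]any) error
	walk = func(name string, in map[string]any) error {
		if visited[name] {
			return nil
		}
		visited[name] = true
		node, ok := registry[name]
		if !ok {
			return fmt.Errorf("node %q not registered", name)
		}
		out, err := node.Run(ctx, in)
		if err != nil {
			return err
		}
		for _, child := range cfg.Nodes[name] {
			if err := walk(child, out); err != nil {
				return err
			}
		}
		return nil
	}
	return walk(root, in)
}

// ParseConfig decodes the raw adjacency list fetched from ZooKeeper.
func ParseConfig(raw []byte) (GraphConfig, error) {
	var cfg GraphConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}
```

With this shape, a config such as {"nodes": {"cg": ["ranker"], "ranker": ["filter"], "filter": []}} would chain a CG call into a ranker and then a filter purely through configuration, with no application code changes.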

2. Online Feature Store - 0th Version

  • Used Cassandra and Redis for low-latency feature serving.
  • Maintained feature consistency using Feature Groups with TTL-based expiry.
  • A hybrid schema was used: feature keys stored in ZooKeeper, data stored in compact arrays.

3. Interaction Store - 0th Version

  • Captured real-time user interactions like clicks, orders, and add-to-cart events.
  • Stored event data in Redis ZSETs (sorted sets) to enable fast lookups for recommendation engines.
  • Provided an API to fetch a user's last k interactions or interactions within a time window.

With these components in place, real-time ML at Meesho became a reality.

This was just the beginning.

Building the Online Feature Store - 0th Version


Choosing the Right Tech Stack

We spent considerable time evaluating various databases, caches, and communication protocols for our online feature store. After carefully weighing cost, latency, throughput, and operational stability, we settled on a combination of:

  • Cassandra and Redis for storage
  • gRPC + Proto3 as our communication layer

Streamlining the Data Flow

To keep things simple in the initial version:

  • Feature engineering jobs wrote raw outputs to an S3 bucket
  • A daily feature push job:
    • Read from S3
    • Grouped related features into Feature Groups (ensuring consistency)
    • Pushed them to Kafka

For features requiring frequent updates:

  • Ad-hoc jobs computed features in higher frequency
  • These jobs pushed to both Kafka and S3 (S3 preserved historical data for future model training)
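
As a sketch of the push step—assuming a JSON payload, an illustrative topic name, and the sarama Kafka client rather than the actual job—grouping at write time ensures every feature in a group lands in a single message:

```go
package featurepush

import (
	"encoding/json"

	"github.com/IBM/sarama"
)

// FeatureGroupUpdate is one consistent unit of write: every feature in the group is
// pushed together so readers never see a partially updated window. The field names
// and topic are illustrative, not the production message format.
type FeatureGroupUpdate struct {
	EntityKey string            `json:"entity_key"` // e.g. userId or productId
	Group     string            `json:"group"`      // e.g. "user_activity_1h"
	Features  map[string]string `json:"features"`   // label -> value
	ExpiryTS  int64             `json:"expiry_ts"`
}

// PushGroups publishes one Kafka message per (entity, feature group). In the real
// pipeline these updates would come from the daily job reading S3 outputs.
func PushGroups(brokers []string, topic string, updates []FeatureGroupUpdate) error {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by the sync producer
	producer, err := sarama.NewSyncProducer(brokers, cfg)
	if err != nil {
		return err
	}
	defer producer.Close()

	for _, u := range updates {
		payload, err := json.Marshal(u)
		if err != nil {
			return err
		}
		msg := &sarama.ProducerMessage{
			Topic: topic,
			Key:   sarama.StringEncoder(u.EntityKey), // keeps an entity's updates ordered per partition
			Value: sarama.ByteEncoder(payload),
		}
		if _, _, err := producer.SendMessage(msg); err != nil {
			return err
		}
	}
	return nil
}
```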

The Challenges: Data Format and Storage

One of the most critical design challenges was how to store feature data efficiently and consistently, especially in databases like Cassandra and Redis, which come with unique storage constraints.

We had to solve for three key requirements:

  • Feature Consistency

    When a feature group contains features like order_count_1h and click_count_1h, both must reflect the same time window. Inconsistent updates would lead to unreliable model predictions.

  • TTL Granularity

    Each feature group required an expiry timestamp, so that all features within it expired together—preserving consistency during reads.

  • Extensibility Across Databases

    We anticipated that infra needs would evolve. To future-proof our system, the data format was designed to be decoupled from DB-specific layouts, enabling portability to systems like ScyllaDB, DynamoDB, HBase, or BigTable.


Overcoming Technical Constraints

At the time, we were using Cassandra, which not only imposed a soft limit of 75 columns per row, but also exhibited significant performance degradation as the number of columns increased further, particularly on memory-constrained machines. Wide rows caused high memory usage during reads, unpredictable latencies due to heavy deserialization overhead, and inefficiencies during compactions and repairs. This ruled out the naive "one column per feature" approach. We needed a format that was compact, minimized the number of columns, and remained efficient and portable across different storage systems.

The Solution: Schema Separation

We introduced the concept of Feature Groups—logical groupings of features that must remain consistent with one another. To represent these groups efficiently, we adopted a layered storage approach:

  • Feature Labels (Keys) were stored in ZooKeeper, serving as the schema.
  • Feature Values were stored as a comma-separated string array in Cassandra or Redis.
  • Expiry Timestamp and Schema Version were appended using a semi-colon delimiter at the end of the string.

Example:

feature_1_value,feature_2_value,feature_3_value;expiry_ts

This format allowed:

  • Consistent writes and reads at the group level
  • Easy parsing of feature values using the schema lookup from ZooKeeper
  • Efficient storage with minimal DB column usage
  • Support for per-group TTLs and schema evolution
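
A minimal Go sketch of this encoding, assuming string feature values and leaving out the schema-version suffix for now (versioning is covered in the next section); the labels would come from the schema stored in ZooKeeper:

```go
package featurestore

import (
	"fmt"
	"strconv"
	"strings"
)

// EncodeGroup serializes a feature group's values in schema order, appending the
// expiry timestamp after a semicolon, matching the layout described above. Only this
// compact string is stored in Cassandra/Redis; the labels live in ZooKeeper.
func EncodeGroup(labels []string, values map[string]string, expiryTS int64) (string, error) {
	ordered := make([]string, 0, len(labels))
	for _, label := range labels {
		v, ok := values[label]
		if !ok {
			return "", fmt.Errorf("missing value for feature %q", label)
		}
		ordered = append(ordered, v)
	}
	return strings.Join(ordered, ",") + ";" + strconv.FormatInt(expiryTS, 10), nil
}

// DecodeGroup parses a stored string back into label -> value using the schema
// fetched from ZooKeeper, and returns the group's expiry timestamp.
func DecodeGroup(labels []string, raw string) (map[string]string, int64, error) {
	parts := strings.SplitN(raw, ";", 2)
	if len(parts) != 2 {
		return nil, 0, fmt.Errorf("malformed feature group value")
	}
	values := strings.Split(parts[0], ",")
	if len(values) != len(labels) {
		return nil, 0, fmt.Errorf("schema/value length mismatch: %d labels, %d values", len(labels), len(values))
	}
	expiry, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return nil, 0, err
	}
	out := make(map[string]string, len(labels))
	for i, label := range labels {
		out[label] = values[i]
	}
	return out, expiry, nil
}
```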

Tracking Changes in Feature Groups

Feature groups don’t stay static. As models evolve, features get added, renamed, or removed. But schema changes often go live before the data is ready—and stopping ingestion just to wait for everything to align isn't feasible.

Common Real-World Scenarios:

  • A new feature is added to the schema, but ingestion jobs still use the older schema version.
  • Ongoing writes don’t include the newly added feature, and stopping ingestion would break freshness for existing features.
  • During serving, models request a mix of old and new features, depending on rollout stages.

The Solution: Schema Versioning

We solved this with versioned feature group schemas, which unlocked several capabilities:

  • Backward Compatibility

    Older ingestion jobs can continue writing using older schema versions. During reads, the system uses the schema version embedded in the value to interpret the data correctly.
  • Partial Availability Handling

    During inference, if some features in the request aren’t available (due to rollout delays or missing data), the system serves default values, ensuring the inference call doesn’t fail.
  • Safe Writes Without Pipeline Pauses

    With schema versioning, we no longer had to stop ingestion pipelines for schema updates. Writes using previous versions can continue safely, and downstream consumers evolve independently.

This design gave us the flexibility to move fast without breaking things—preserving data quality, enabling experimentation, and ensuring reliability at scale.
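
Here is a small, hypothetical sketch of what a version-aware read could look like, building on the DecodeGroup helper above; the SchemaRegistry interface and the default handling are illustrative, not the production API:

```go
package featurestore

// SchemaRegistry returns the ordered feature labels for a given group and version;
// in our setup this mapping would come from ZooKeeper.
type SchemaRegistry interface {
	Labels(group string, version int) ([]string, error)
}

// ReadWithVersion interprets a stored value using the schema version it was written
// with, then projects it onto the features the model is requesting. Features that the
// older version doesn't contain fall back to a default instead of failing the call.
func ReadWithVersion(reg SchemaRegistry, group string, writtenVersion int, raw string,
	requested []string, defaultValue string) (map[string]string, error) {

	labels, err := reg.Labels(group, writtenVersion)
	if err != nil {
		return nil, err
	}
	stored, _, err := DecodeGroup(labels, raw) // from the previous sketch
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(requested))
	for _, f := range requested {
		if v, ok := stored[f]; ok {
			out[f] = v
		} else {
			out[f] = defaultValue // partial availability: never fail the inference call
		}
	}
	return out, nil
}
```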


Interaction Store - 0th Version


To power real-time Candidate Generators (CGs), we needed fast access to user behavior signals—like what a user recently clicked, ordered, or added to their cart. These interactions form the basis for many real-time recommendations, such as Similar Products, People Also Viewed, or Recently Ordered Again. For the 0th version of the Interaction Store, we focused on a design that was simple, fast, and reliable—optimized for high-throughput ingestion and low-latency lookups.

Event Ingestion

We instrumented our backend services to emit key user interaction events to Kafka in real time. These included:

  • Click
  • Order
  • Add to Cart
  • Wishlist
  • Share

Each event carried essential metadata:

  • userId — uniquely identifies the user
  • productId — the item being interacted with
  • timestamp — the moment the interaction occurred

This decoupled the interaction logging from storage, allowing ingestion and consumption to scale independently.

Storage Design

To store these events, we built Kafka consumers that processed the incoming streams and wrote the data into Redis, using sorted sets (ZSETs) as the primary data structure.

Why Redis?

Redis gave us:

  • Low-latency reads and writes
  • Time-ordered data using ZSETs (via score = timestamp)
  • Native TTL support, if needed in later versions
  • In-memory performance—ideal for real-time CGs

Storage Structure

Each user’s interactions were stored using a composite key format, uniquely identifying the user and interaction type. This structure allowed efficient organization and quick retrieval of recent activity for recommendation generation:

userId_eventType → ZSET[...(pid, ts)...]

Within each ZSET:

  • The timestamp served as the score, maintaining temporal order
  • The productId (optionally with metadata) was the value

This allowed us to efficiently retrieve interactions through an HTTP-based API server, with two query modes:

  • Fetch the last k interactions of a specific type for a given user with ZREVRANGE(userId_eventType, count)
  • Retrieve all interactions within a time range (e.g., last 24 hours) with ZREVRANGEBYSCORE(userId_eventType, timeRange)

Built-in Guardrails

Since Redis was the sole store, we implemented High Availability (HA) to prevent data loss. To optimize memory usage, we also enforced size limits per event type—only storing the last k interactions per user, with older entries getting truncated.
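
Putting the key format, the two query modes, and the size cap together, here is a minimal sketch using the go-redis client; the function names and the trim threshold are illustrative, not the production API:

```go
package interactions

import (
	"context"
	"fmt"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// key builds the composite "userId_eventType" key described above.
func key(userID, eventType string) string {
	return userID + "_" + eventType
}

// RecordInteraction is what a Kafka consumer would call per event: the timestamp is
// the ZSET score and the productId is the member. The trailing ZRemRangeByRank keeps
// only the most recent maxPerUser entries, implementing the per-event-type size cap.
func RecordInteraction(ctx context.Context, rdb *redis.Client, userID, eventType, productID string, ts time.Time, maxPerUser int64) error {
	k := key(userID, eventType)
	if err := rdb.ZAdd(ctx, k, redis.Z{Score: float64(ts.Unix()), Member: productID}).Err(); err != nil {
		return err
	}
	// Drop everything except the maxPerUser highest (most recent) scores.
	return rdb.ZRemRangeByRank(ctx, k, 0, -(maxPerUser + 1)).Err()
}

// LastK fetches the user's last k interactions of a type, newest first (ZREVRANGE).
func LastK(ctx context.Context, rdb *redis.Client, userID, eventType string, k int64) ([]string, error) {
	return rdb.ZRevRange(ctx, key(userID, eventType), 0, k-1).Result()
}

// Since fetches all interactions of a type newer than `since`, newest first
// (ZREVRANGEBYSCORE over a time window).
func Since(ctx context.Context, rdb *redis.Client, userID, eventType string, since time.Time) ([]string, error) {
	return rdb.ZRevRangeByScore(ctx, key(userID, eventType), &redis.ZRangeBy{
		Min: strconv.FormatInt(since.Unix(), 10),
		Max: "+inf",
	}).Result()
}

func ExampleUsage() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()
	_ = RecordInteraction(ctx, rdb, "u42", "click", "p123", time.Now(), 200)
	recent, _ := LastK(ctx, rdb, "u42", "click", 10)
	fmt.Println(recent)
}
```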

Conclusion: Laying the Foundation for Real-Time ML

In this first phase, we tackled the fundamentals—shifting from batch-based recommendations to a real-time, ML-powered recommendation platform that could keep up with Meesho’s growth.

With the IOP Framework, Online Feature Store, and Interaction Store, we built the core infrastructure to support real-time personalization at scale. These wins have already unlocked:

  • ✅ Faster, more dynamic recommendations for millions of users.
  • ✅ Better infrastructure efficiency, reducing wasted compute power.
  • ✅ A flexible, modular system that allows for further experimentation.

But this is just the beginning. While we've solved key challenges, certain roadblocks remain—from optimizing cost-performance trade-offs to seamlessly evolving schemas.

This foundational work laid the path for a reliable and scalable real-time feature serving layer.