BharatMLStack - Numerix
Numerix is a Rust-based compute service in BharatMLStack designed for low-latency evaluation of mathematical expressions over feature matrices. Each request carries a compute_id and a matrix of features; Numerix fetches the corresponding postfix expression, maps variables to feature columns (treated as vectors), and evaluates the expression with a stack-based SIMD-optimized runtime.
High-Level Components
- Tonic gRPC server (Rust): exposes
Numerix/Computefor low-latency requests.- Accepts feature data as strings (for ease of use) or byte arrays (for efficient transmission).
- All input data is converted internally to fp32 or fp64 vectors for evaluation.
- Compute Registry (etcd): stores
compute_id (int) → postfix expressionmappings. - Stack-based Evaluator: Runs postfix expressions in linear time using a stack based approach over aligned vectors.
- Vectorized Math Runtime: No handwritten SIMD intrinsics; relies on LLVM autovectorization.
- Operations are intentionally simple and memory-aligned.
- Compiler emits SIMD instructions automatically.
- Portable across CPU architectures (ARM & AMD).
- Metrics and Health
- Latency, RPS, and error rates via Datadog/DogStatsD UDP client.
- Minimal HTTP endpoints (
/health, optional/metrics) for diagnostics.
What is SIMD?
SIMD (Single Instruction, Multiple Data) is a CPU feature that allows a single instruction to operate on multiple data points at once. In Numerix, this means that operations on feature vectors can be executed in parallel, making evaluation of mathematical expressions faster and more predictable.
Why SIMD Matters for Numerix
- Postfix expressions operate on vectors (columns of the input matrix).
- SIMD allows multiple elements of these vectors to be processed in one CPU instruction, rather than element-by-element.
- This results in low-latency, high-throughput computation without the need for handwritten intrinsics — the compiler handles the vectorization automatically.
Why ARM, Why LLVM
During design exploration, we tested SIMD on different architectures and found ARM (AArch64) with NEON/SVE/SVE2 provided excellent performance for our workloads.
Instead of writing custom intrinsics, Numerix compiles with SIMD flags and lets LLVM handle vectorization:
RUSTFLAGS="-C target-feature=+neon,+sve,+sve2" \
cargo build --release --target aarch64-unknown-linux-gnu
-
This approach works well because operations are straightforward, data is aligned, and compiler auto-vectorization is reliable.
-
AMD/x86 builds are equally supported — enabling their SIMD extensions is just a matter of changing build flags.
Request Model and Flow
- Client calls gRPC
numerix.Numerix/Computewith:schema: ordered feature namesentity_scores: per-entity vectors (string or bytes)compute_id: integer identifier for the expressiondata_type(optional): e.g.,fp32orfp64
- Service fetches the postfix expression for
compute_idwhich was pre-fetched frometcd. - Request is validated for schema and data shape.
- The stack-based evaluator executes the expression in O(n) over tokens, with vectorized inner operations.
- Response returns
computation_score_dataor a structurederror.
Why Postfix Expressions
- Stored in etcd as postfix (Reverse Polish) notation.
- Postfix makes evaluation parser-free and linear time.
- Execution uses a stack machine:
- Push operands (feature vectors).
- Pop, compute, and push results for each operator.
- Benefits: predictable runtime, compiler-friendly loops, cache efficiency.
gRPC Interface
- Service:
numerix.Numerix - RPC:
Compute(NumerixRequestProto) → NumerixResponseProto - Request fields:
schema,entity_scores,compute_id, optionaldata_type - Response fields:
computation_score_dataorerror
Example (grpcurl):
grpcurl -plaintext \
-import-path ./numerix/src/protos/proto \
-proto numerix.proto \
-d '{
"entityScoreData": {
"schema": ["feature1", "feature2"],
"entityScores": [ { "stringData": { "values": ["1.0", "2.0"] } } ],
"computeId": "1001",
"dataType": "fp32"
}
}' \
localhost:8080 numerix.Numerix/Compute
Observability
- Datadog (DogStatsD) metrics publication via UDP client:
- Latency (P50/P95/P99), error rate, RPS, internal failures
- Configurable sampling rate via environment variables
- Optional
/metricsHTTP endpoint can be enabled for local debugging.
Environments
- Kubernetes (K8s), including GKE and EKS
- Multi-arch builds: amd64, arm64.
- ARM builds ship with NEON/SVE/SVE2 enabled.
Key Takeaways
- Minimal service surface: gRPC + etcd.
- No custom intrinsics — portable across ARM & AMD via compiler flags.
- Supports both string and byte input, internally converted to aligned fp32/fp64 vectors.
- Stack-based postfix evaluation : linear time, cache-friendly.
- Predictable, ultra-low-latency performance.
Contributing
We welcome contributions from the community! Please see our Contributing Guide for details on how to get started.
Community & Support
- 💬 Discord: Join our community chat
- 🐛 Issues: Report bugs and request features on GitHub Issues
- 📧 Email: Contact us at ml-oss@meesho.com
License
BharatMLStack is open-source software licensed under the BharatMLStack Business Source License 1.1.