Benchmarks (PoC)

This PoC measures the performance of vector addition in Rust with and without compiler SIMD optimizations. The workload consists of repeated fixed-size vector additions processed in parallel across all CPU cores. The results indicate how much SIMD speeds up vectorized computations, and similar improvements are expected for other vectorized operations in Numerix.
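For orientation, the kind of kernel being measured can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name `add_vectors`, the `f32` element type, and the output-slice signature are all assumptions. The loop is written over zipped iterators of equal-length slices, a form the Rust compiler can typically auto-vectorize when SIMD code generation is enabled.

```rust
/// Element-wise addition: out[i] = a[i] + b[i].
///
/// Hypothetical kernel for illustration; the element type (f32) and the
/// signature are assumptions, not taken from the benchmark source.
pub fn add_vectors(a: &[f32], b: &[f32], out: &mut [f32]) {
    // Checking the lengths up front lets the compiler reason about a
    // single trip count for the loop below.
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());

    // Zipped iterators avoid per-element bounds checks, which keeps the
    // loop body simple enough for auto-vectorization.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = *x + *y;
    }
}
```

Whether the compiler actually emits SIMD instructions for this loop depends on the optimization level and the target features in effect at build time.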

System Configuration

Instance Type: c4a-highcpu-16
Processor: Google Axion (ARMv9, 64-bit)
SIMD Extension: SVE2
OS: Linux (Ubuntu 22.04)
Rust Version: rustc 1.80.0
Target Triple: aarch64-unknown-linux-gnu

Vector Addition Performance

With SIMD

Vector Dim	ns per op	Iterations	Throughput (GiB/s)	Total CPU (raw)	Total CPU (normalized)
10	0.39626 ns	170,057,457,941	376.04	1564%	97.75%
50	0.6641 ns	94,342,709,095	1121.9	1590%	99.38%
100	1.1522 ns	51,705,835,397	1286.9	1560%	97.50%
500	5.0649 ns	12,061,753,661	1471	1538%	96.12%
1000	9.648 ns	6,488,848,705	1544.5	1570%	98.12%
5000	52.925 ns	1,169,316,813	1407.8	1590%	99.38%
10000	114.68 ns	555,779,981	1299.4	1592%	99.50%
50000	644.60 ns	94,372,153	1155.9	1560%	97.50%
100000	1,453.0 ns	42,502,201	1025.5	1526%	95.38%

Without SIMD

Vector Dim	ns per op	Iterations	Throughput (GiB/s)	Total CPU (raw)	Total CPU (normalized)
10	3.196 ns	1,000,000,000	25.03	1313%	82.06%
50	3.866 ns	1,000,000,000	103.46	1417%	88.56%
100	5.867 ns	1,000,000,000	136.35	1495%	93.44%
500	19.25 ns	1,000,000,000	207.81	1600%	100.00%
1000	33.91 ns	1,000,000,000	235.92	1600%	100.00%
5000	162.1 ns	448,785,386	246.71	1600%	100.00%
10000	332.0 ns	208,428,151	240.94	1600%	100.00%
50000	1,740 ns	39,247,646	229.93	1600%	100.00%
100000	3,401 ns	19,598,293	235.24	1600%	100.00%

Normalization: Total CPU (normalized) = Total CPU (raw) / 16, since 1600% equals full utilization on a 16‑core machine.
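The normalization rule above is simple enough to express as a one-line helper. The sketch below uses a hypothetical function name, `normalize_cpu`, and takes the core count as a parameter so the same rule applies to other machine sizes.

```rust
/// Converts a raw CPU percentage (summed over all cores) into a
/// percentage of the whole machine.
///
/// `normalize_cpu` is a hypothetical helper name used for illustration.
pub fn normalize_cpu(raw_percent: f64, cores: u32) -> f64 {
    raw_percent / f64::from(cores)
}
```

For example, the 1564% raw figure in the first With SIMD row corresponds to `normalize_cpu(1564.0, 16)`, which is 97.75.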

Observations

SIMD provides large speedups across all vector sizes: overall throughput improvements range from roughly 4–15× versus Without SIMD.
For small vectors (10–100), throughput gains are about 9–15×, while ns/op drops by about 5–8×.
For larger vectors (500–100000), speedups settle at about 4–7× as memory bandwidth pressure increases.
CPU saturation: Without SIMD reaches 100% normalized CPU at 500 elements and above, whereas With SIMD runs at about 95–99% normalized CPU while delivering substantially higher throughput.
Per‑CPU efficiency: With SIMD, throughput per unit of CPU is much higher, reflecting better vector unit utilization and fewer instructions per element.

⚠ Note: Absolute numbers depend on hardware, CPU frequency, memory locality, and system load. These results are meant to show the relative benefit of compiler SIMD optimizations.
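The speedup ranges quoted in the observations can be reproduced from the Throughput (GiB/s) columns of the two tables. The following sketch divides the With SIMD figure by the Without SIMD figure for each vector dimension; the names `THROUGHPUT`, `speedup`, and `speedups` are hypothetical, and the numbers are copied from the tables above.

```rust
/// (vector dim, With SIMD GiB/s, Without SIMD GiB/s), copied from the tables.
pub const THROUGHPUT: [(u32, f64, f64); 9] = [
    (10, 376.04, 25.03),
    (50, 1121.9, 103.46),
    (100, 1286.9, 136.35),
    (500, 1471.0, 207.81),
    (1000, 1544.5, 235.92),
    (5000, 1407.8, 246.71),
    (10000, 1299.4, 240.94),
    (50000, 1155.9, 229.93),
    (100000, 1025.5, 235.24),
];

/// Ratio of SIMD throughput to non-SIMD throughput.
pub fn speedup(with_simd: f64, without_simd: f64) -> f64 {
    with_simd / without_simd
}

/// Throughput speedup for every benchmarked vector dimension.
pub fn speedups() -> Vec<(u32, f64)> {
    THROUGHPUT
        .iter()
        .map(|&(dim, with, without)| (dim, speedup(with, without)))
        .collect()
}
```

Calling `speedups()` gives a ratio of about 15.0× at dimension 10 and about 4.4× at dimension 100000, matching the 4–15× range stated above.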
