Benchmarks (PoC)

This PoC measures the performance of vector addition in Rust with and without compiler SIMD optimizations. Each benchmark run consists of repeated fixed-size vector addition operations processed in parallel across the CPU cores. The results give a sense of how much SIMD accelerates vectorized computations; similar improvements are expected for other vectorized operations in Numerix.
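The exact kernel Numerix benchmarks is not reproduced here, but the operation under test is element-wise vector addition. A minimal sketch of that kind of loop (hypothetical function name `vec_add`) — a plain indexed loop that rustc/LLVM can auto-vectorize into SVE2 instructions when SIMD optimizations are enabled:

```rust
/// Element-wise addition of two equal-length slices into `out`.
/// With SIMD optimizations enabled, the compiler can turn this
/// scalar loop into wide vector instructions (e.g. SVE2 on Axion).
fn vec_add(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for i in 0..a.len() {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = vec![1.0_f32; 1000];
    let b = vec![2.0_f32; 1000];
    let mut out = vec![0.0_f32; 1000];
    vec_add(&a, &b, &mut out);
    assert!(out.iter().all(|&x| x == 3.0));
}
```

Bounds-check-friendly patterns like this (equal-length asserts up front, simple indexing) make it easier for the compiler to prove vectorization is safe.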

System Configuration

  • Instance Type: c4a-highcpu-16
  • Processor: Google Axion (ARMv9, 64-bit)
  • SIMD Extension: SVE2
  • OS: Linux (Ubuntu 22.04)
  • Rust Version: rustc 1.80.0
  • Target Triple: aarch64-unknown-linux-gnu

Vector Addition Performance

With SIMD

| Vector Dim | Time per op | Iterations | Throughput (GiB/s) | Total CPU (raw) | Total CPU (normalized) |
|---|---|---|---|---|---|
| 10 | 0.39626 ns | 170,057,457,941 | 376.04 | 1564% | 97.75% |
| 50 | 0.6641 ns | 94,342,709,095 | 1121.9 | 1590% | 99.38% |
| 100 | 1.1522 ns | 51,705,835,397 | 1286.9 | 1560% | 97.50% |
| 500 | 5.0649 ns | 12,061,753,661 | 1471 | 1538% | 96.12% |
| 1000 | 9.648 ns | 6,488,848,705 | 1544.5 | 1570% | 98.12% |
| 5000 | 52.925 ns | 1,169,316,813 | 1407.8 | 1590% | 99.38% |
| 10000 | 114.68 ns | 555,779,981 | 1299.4 | 1592% | 99.50% |
| 50000 | 644.60 ns | 94,372,153 | 1155.9 | 1560% | 97.50% |
| 100000 | 1.4530 µs | 42,502,201 | 1025.5 | 1526% | 95.38% |

Without SIMD

| Vector Dim | Time per op | Iterations | Throughput (GiB/s) | Total CPU (raw) | Total CPU (normalized) |
|---|---|---|---|---|---|
| 10 | 3.196 ns | 1,000,000,000 | 25.03 | 1313% | 82.06% |
| 50 | 3.866 ns | 1,000,000,000 | 103.46 | 1417% | 88.56% |
| 100 | 5.867 ns | 1,000,000,000 | 136.35 | 1495% | 93.44% |
| 500 | 19.25 ns | 1,000,000,000 | 207.81 | 1600% | 100.00% |
| 1000 | 33.91 ns | 1,000,000,000 | 235.92 | 1600% | 100.00% |
| 5000 | 162.1 ns | 448,785,386 | 246.71 | 1600% | 100.00% |
| 10000 | 332.0 ns | 208,428,151 | 240.94 | 1600% | 100.00% |
| 50000 | 1,740 ns | 39,247,646 | 229.93 | 1600% | 100.00% |
| 100000 | 3,401 ns | 19,598,293 | 235.24 | 1600% | 100.00% |

Normalization: Total CPU (normalized) = Total CPU (raw) / 16, since 1600% equals full utilization on a 16‑core machine.
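As a worked check of the normalization formula (hypothetical helper name `normalize_cpu`), dividing the first With-SIMD row's raw utilization by the core count reproduces the table's normalized value:

```rust
/// Normalized CPU utilization: raw percentage divided by core count.
/// On a 16-core machine, 1600% raw corresponds to 100% normalized.
fn normalize_cpu(raw_pct: f64, cores: u32) -> f64 {
    raw_pct / cores as f64
}

fn main() {
    // First "With SIMD" row: 1564% raw on 16 cores.
    assert_eq!(normalize_cpu(1564.0, 16), 97.75);
    // Fully saturated: 1600% raw on 16 cores.
    assert_eq!(normalize_cpu(1600.0, 16), 100.0);
}
```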

Observations

  • SIMD provides large speedups across all vector sizes: overall throughput improvements range from roughly 4–15× versus Without SIMD.
  • For small vectors (10–100), throughput gains are about 9–15×, with ns/op reduced proportionally.
  • For larger vectors (500–100000), speedups stabilize around ~4–7× as memory bandwidth pressure increases.
  • CPU saturation: Without SIMD reaches 100% normalized CPU at and beyond 500 elements, whereas With SIMD typically operates at ~95–99% normalized CPU yet delivers substantially higher throughput at similar CPU.
  • Per‑CPU efficiency: With SIMD, throughput per unit of CPU is much higher, reflecting better vector unit utilization and fewer instructions per element.
  • Absolute values depend on hardware and load; the relative differential reflects the benefit of compiler SIMD optimizations.
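The speedup figures above are throughput ratios between the two tables. A small sketch (hypothetical function name `speedup`) showing how the ~15× small-vector and ~4× large-vector numbers fall out of the data:

```rust
/// Speedup = SIMD throughput / non-SIMD throughput (both in GiB/s),
/// taken from matching rows of the two tables above.
fn speedup(simd_gibs: f64, scalar_gibs: f64) -> f64 {
    simd_gibs / scalar_gibs
}

fn main() {
    // Dim 10: 376.04 vs 25.03 GiB/s -> roughly 15x.
    assert!((speedup(376.04, 25.03) - 15.02).abs() < 0.01);
    // Dim 100000: 1025.5 vs 235.24 GiB/s -> roughly 4.4x.
    assert!((speedup(1025.5, 235.24) - 4.36).abs() < 0.01);
}
```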

⚠ Note: Absolute numbers depend on CPU frequency, memory locality, and system load. These results are meant to show relative SIMD benefits.