TL;DR
Your STREAM benchmark results look alright, but careful interpretation is needed to reach the correct conclusion. In your particular case (and it's usually the case whenever GCC is used), to obtain the actual memory bandwidth, the Scale result must be multiplied by a factor of 1.5, and the Add and Triad results must be multiplied by a factor of 1.33.
Before we continue, note that these multiplication factors should only be used when Copy is significantly faster than Scale, Add, and Triad. In other situations, a case-by-case analysis is required. Now back to the subject.
With these correction factors, your test results would be:
- Copy: 163585.1 x 1.0 = 163585.1 MB/s
- Scale: 102983.4 x 1.5 = 154475.1 MB/s
- Add: 113216.5 x 1.33 = 150577.945 MB/s
- Triad: 113144.9 x 1.33 = 150482.717 MB/s
This is now a self-consistent result. As another reality check, consider that 8-channel DDR4 @ 3200 MT/s has a theoretical bandwidth of 8 channels × 8 bytes × 3200 MT/s = 204.8 GB/s, so a peak Copy bandwidth of 163.6 GB/s shows that your CPU is operating at almost 80% of its theoretical memory bandwidth. For Triad, the efficiency is around 73.5%. Both are typical results, so your CPU is working alright.
A note for bandwidth comparison: my personal experience is that other memory benchmark tools popular among PC enthusiasts may report higher numbers. In particular, AIDA64 often reports a number very close to the hardware's theoretical maximum. Since it's proprietary, how the measurements are made is anyone's guess; I suspect it uses highly tuned assembly kernels and thread arrangements to touch as many cachelines as possible. STREAM, on the other hand, better reflects the upper performance limit of practical memory-bound applications, so AIDA64's results should not be compared directly with STREAM's. SiSoftware Sandra's memory bandwidth test appears to be STREAM-like, however, so you can probably compare your STREAM results with the SiSoftware Sandra memory bandwidth numbers that occasionally appear in online reviews.
Beware that by convention, memory bandwidth is almost always measured in GB/s, not GiB/s, unless otherwise noted (it's often quoted alongside GFLOPS, so it's more convenient to use the same base-10 SI prefix). For example, your Copy result of 163585.1 MB/s is 163.6 GB/s, but only about 152.4 GiB/s.
Discussion
STREAM is a simple, crude, yet effective 100-line C program designed for programmers working in the field of supercomputing and high-performance computing. Without prior knowledge of the technical details involved, it's extremely easy to misunderstand the tests. It's not an automatic, foolproof benchmark system for the general public: one can't just press a single button and expect to see the correct results.
What follows is my personal explanation, but to learn more, take a look at the following reference. It covers some points that I don't mention in the answer below, such as NUMA affinity.
Understanding Memory and Cache
To learn the origin of the magic numbers 1.5 and 1.33, one needs to first understand the concepts of write-allocate and non-temporal stores in the context of a CPU's memory and cache subsystem.
It boils down to the following facts:
Cacheline: On most modern CPUs, all memory reads and writes are done at the basic unit of a cacheline - usually 64 bytes.
Write-Allocate: When memory is accessed, usually, the entire cacheline is first read into CPU's last-level cache, including write-only memory traffic. A memory write request is often secretly a read + write request. This is done because most memory accesses exhibit temporal locality - more likely than not, you're going to reuse the data you've just written, so it's better to issue an extra read to bring it into cache.
Non-Temporal Stores: What if your memory access does not have temporal locality (in other words, after the data is written to memory, you're not going to reuse it anytime soon)? For the experts, most CPUs provide an escape hatch in the form of special non-temporal store instructions. When a non-temporal store instruction is issued, the CPU writes the cacheline involved directly into memory, bypassing the cache entirely. On x86, the SSE2 instruction MOVNTDQ is used for this purpose, which is known as a Streaming Store.
memcpy(): Most compilers don't automatically use these non-temporal instructions (even Intel's icc won't do it unless explicitly told to). On the other hand, memcpy() is often implemented by experts for a particular architecture in hand-written assembly, and for large memory copies, non-temporal stores are often used to improve performance. Most compilers can also often automatically transform a memcpy()-like loop into a true memcpy() call. Thus, copying memory is often the fastest memory operation possible.
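To make the difference concrete, here is a minimal sketch using x86 SSE2 intrinsics (my own illustration, not part of STREAM; it assumes dst is 16-byte aligned and n is even, and uses _mm_stream_pd, which emits MOVNTPD, the double-precision sibling of MOVNTDQ):
#include <emmintrin.h> /* SSE2 intrinsics */

/* Plain stores: with a write-allocate cache, each cacheline of dst is
   first read into the cache, so the DRAM traffic is read + write. */
void fill_cached(double *dst, double value, long n)
{
    for (long j = 0; j < n; j++)
        dst[j] = value;
}

/* Non-temporal stores: the data goes straight to memory, bypassing
   the cache, so no extra read traffic is generated. */
void fill_streaming(double *dst, double value, long n)
{
    __m128d v = _mm_set1_pd(value);
    for (long j = 0; j < n; j += 2) /* 2 doubles = 16 bytes per store */
        _mm_stream_pd(&dst[j], v);
    _mm_sfence(); /* make the streaming stores globally visible */
}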
Understanding STREAM kernels
Armed with this knowledge, let's examine the source code of the 4 STREAM kernels.
Copy
The Copy kernel is extremely simple.
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
c[j] = a[j];
This loop copies every element in array a to array c, so the traffic here is 1 memory read and 1 memory write. In other words, it performs the computation of:
dst[n] = src[n]
Most modern compilers can automatically transform this loop into an optimized memcpy(), often hand-tuned assembly for the target CPU architecture. These optimized memcpy() implementations often use non-temporal store instructions, such as x86 SSE2's MOVNTDQ. Thus, no extra read traffic is generated when writing to array c.
This is why STREAM often shows the fastest result for Copy.
On the other hand, if the compiler does not perform the memcpy() optimization automatically (for GCC/Clang, this can be forced with the option -fno-builtin), the actual traffic would be 2 reads and 1 write due to the write-allocate policy on most CPUs, and one should multiply the result by a factor of:
(2 + 1) / (1 + 1) = 1.5
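If you want to observe this yourself, STREAM can be built both ways (the flags shown are real GCC options; whether the memcpy() transformation actually fires also depends on the GCC version and optimization level):
$ gcc -O3 -fopenmp stream.c -o stream_builtin            # Copy may become an optimized memcpy()
$ gcc -O3 -fopenmp -fno-builtin stream.c -o stream_plain # Copy stays a plain load/store loop
With the second build, the reported Copy bandwidth should drop to roughly the same level as Scale, since both then generate 2 reads and 1 write per element.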
Scale
The Scale kernel is extremely simple.
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
b[j] = scalar*c[j];
It multiplies every element in an array by a constant and writes the result to a new array:
dst[n] = k * src[n]
This loop has 1 memory read, 1 floating-point multiplication, and 1 memory write. Due to the write-allocate policy, on most CPUs it actually performs 2 memory reads. Thus, without non-temporal store instructions, the memory bandwidth is underreported by a factor of:
(2 + 1) / (1 + 1) = 1.5
Add
The Add kernel is extremely simple.
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
c[j] = a[j]+b[j];
It adds two elements together and writes the result into a new element in another array:
dst[i] = src1[i] + src2[i]
The loop has 2 memory reads, 1 floating-point add, and 1 memory write. Due to the write-allocate policy, on most CPUs it actually performs 3 memory reads. Thus, one should multiply the result by a factor of:
(3 + 1) / (2 + 1) = 1.33
Triad
The Triad kernel is extremely simple.
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
a[j] = b[j]+scalar*c[j];
It multiplies an array element by a constant, adds it to an element of another array, and writes the result into a new element of the final array:
dst[i] = src1[i] + k * src2[i]
This loop has 2 memory reads, 1 floating-point multiplication, 1 floating-point add, and 1 memory write. Due to the write-allocate policy, on most CPUs there are actually 3 memory reads, not 2. Thus, one should multiply the result by a factor of:
(3 + 1) / (2 + 1) = 1.33
This is the origin of all the magical numbers.
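To make the bookkeeping explicit, here is a minimal C sketch (my own, not part of STREAM) that applies the write-allocate correction to the numbers from the question; note that the exact Add/Triad factor is 4/3, of which 1.33 is a rounding:
#include <stdio.h>

/* Write-allocate turns each counted write into a read + write, so the
   true traffic is (reads + 2*writes) / (reads + writes) times larger. */
static double corrected(double reported_mbs, int reads, int writes)
{
    return reported_mbs * (reads + 2.0 * writes) / (reads + writes);
}

int main(void)
{
    printf("Scale: %.1f MB/s\n", corrected(102983.4, 1, 1)); /* x 1.5 */
    printf("Add:   %.1f MB/s\n", corrected(113216.5, 2, 1)); /* x 4/3 */
    printf("Triad: %.1f MB/s\n", corrected(113144.9, 2, 1)); /* x 4/3 */
    return 0;
}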
In HPC, STREAM Triad is usually the standard efficiency test for a CPU and its memory controller, and is reported by many research papers. It measures the gap between the hardware's theoretical bandwidth and the bandwidth realized by the simplest possible software, with 2 reads, 1 write, and a fused multiply-add per element.
From experience, the throughput is around 80% of the CPU's theoretical peak. This roughly represents the fastest speed achievable by any practical software.
Why doesn't STREAM correct its results automatically?
Why doesn't STREAM perform the aforementioned corrections automatically?
First, because it's a high-level program written in C (and Fortran), and it's meant to be, and indeed has been, used across many hardware platforms, from single-board computers to supercomputers. It's not possible to predict the exact behavior of every compiler and CPU (such as whether non-temporal stores are used or not). Sure, clang has __builtin_nontemporal_store(), but it's clang-specific and not portable. Thus, it's better to faithfully report exactly what the software sees, without making any assumptions. The interpretation is left as an exercise to the reader.
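For illustration, a Scale-style kernel written with clang's builtin might look like this (a sketch; the builtin is clang-only, and whether it actually compiles down to MOVNT instructions depends on vectorization):
/* Clang-only: request non-temporal stores without raw intrinsics. */
void scale_nt(double *restrict b, const double *restrict c,
              double scalar, long n)
{
    for (long j = 0; j < n; j++)
        __builtin_nontemporal_store(scalar * c[j], &b[j]);
}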
Hardware behavior can be unpredictable, too. For example, an optimization called Write-Allocate Evasion is available on some ARM CPUs used in smartphones, and on Ice Lake and newer Intel CPUs. It allows the CPU to heuristically bypass the initial cacheline read to save memory bandwidth under certain conditions. This optimization is known as "SpecI2M" on Intel CPUs, and the exact heuristics involved are anyone's guess and appear to be very inconsistent.
As a result, STREAM may show higher bandwidth results on these CPUs, but only under certain conditions, which would invalidate any pre-applied correction factor. For another example, when using the Intel C compiler, non-temporal stores can be force-enabled using the command-line option -qopt-streaming-stores always.
Next, because it's a simple and well-understood C program in 100 lines, and it has been the standard since the late 1990s. Any automatic detection would undermine confidence in its results and potentially make the test more fragile.
Alternatives
Because of the error-prone nature of STREAM, I think this test is starting to show its age. Sure, STREAM is the industry standard and has stood the test of time, and one may still want to run it for comparison with other people's benchmarks. But because there are so many pitfalls related to NUMA, the compiler, and the CPU's write-allocate behavior - none of which are handled automatically, and all of which can generate misleading results - it's time to consider some alternatives.
To test CPU memory bandwidth, I prefer to use LIKWID. LIKWID is the Swiss army knife of performance tuning for HPC applications. Its primary use is collecting statistics from low-level hardware performance counters to understand and optimize the performance characteristics of your code, but it also includes a micro-benchmark tool called likwid-bench.
likwid-bench can automatically launch multiple instances of a benchmark program and pin the process and memory of each instance to its corresponding NUMA node. It also provides many pre-written benchmark kernels (including the equivalent of the STREAM Triad benchmark) in hand-written assembly, allowing accurate and consistent results:
$ likwid-bench -a
copy - Double-precision vector copy, only scalar operations
copy_avx - Double-precision vector copy, optimized for AVX
copy_avx512 - Double-precision vector copy, optimized for AVX-512
copy_mem - Double-precision vector copy, only scalar operations but with non-temporal stores
copy_mem_avx - Double-precision vector copy, uses AVX and non-temporal stores
copy_mem_avx512 - Double-precision vector copy, uses AVX-512 and non-temporal stores
copy_mem_sse - Double-precision vector copy, uses SSE and non-temporal stores
copy_sse - Double-precision vector copy, optimized for SSE
stream - Double-precision stream triad A(i) = B(i)*c + C(i), only scalar operations
stream_avx - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX
stream_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512
stream_avx512_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512 FMAs
stream_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs
stream_mem - Double-precision stream triad A(i) = B(i)*c + C(i), only scalar operations but with non-temporal stores
stream_mem_avx - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX and non-temporal stores
stream_mem_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX-512 and non-temporal stores
stream_mem_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
stream_mem_sse - Double-precision stream triad A(i) = B(i)*c + C(i), uses SSE and non-temporal stores
stream_mem_sse_fma - Double-precision stream triad A(i) = B(i)*c + C(i), uses SSE FMAs and non-temporal stores
Here's an example of using likwid-bench to measure memory bandwidth with the STREAM Triad kernel, using AVX, FMA, and non-temporal stores, with 4 instances, each optimally pinned to its own NUMA node.
For reference, this is a dual-socket Intel Xeon E5-2680 v4 machine with 4-channel DDR4 ECC RDIMMs @ 2400 MT/s on each socket, so it should behave like an 8-channel machine. Cluster-on-Die mode is enabled in the BIOS, so each CPU is seen as two NUMA nodes: the CPU's silicon is physically partitioned into two halves, each with half the cores and its own 2-channel memory controller and ring bus, with the two ring buses interconnected by bridges. The entire machine thus has 4 NUMA nodes. For NUMA-aware applications, this offers the highest memory bandwidth.
$ likwid-bench -t stream_mem_avx_fma -w M0:1GB -w M1:1GB -w M2:1GB -w M3:1GB
--------------------------------------------------------------------------------
Cycles: 22089219858
CPU Clock: 2394502471
Cycle Clock: 2394502471
Time: 9.224973e+00 sec
Iterations: 14336
Iterations per thread: 256
Inner loop executions: 186011
Size (Byte): 3999980544
Size per thread: 71428224
Number of Flops: 85332918272
MFlops/s: 9250.21
Data volume (Byte): 1023995019264
MByte/s: 111002.50
Cycles per update: 0.517719
Cycles per cacheline: 4.141749
Loads per update: 2
Stores per update: 1
Load bytes per element: 16
Store bytes per elem.: 8
Load/store ratio: 2.00
Instructions: 39999805457
UOPs: 58666381312
According to likwid-bench, the STREAM Triad performance of this machine is thus 111 GB/s. Now compare this result with the original STREAM benchmark:
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 96094.9 0.017320 0.016650 0.030050
Scale: 74934.2 0.022183 0.021352 0.033086
Add: 84007.2 0.029319 0.028569 0.041626
Triad: 85827.9 0.028947 0.027963 0.039657
-------------------------------------------------------------
Multiplying 85827.9 MB/s by 1.33 gives 114151.1 MB/s. This shows that both results are consistent and within the margin of error.
Finally, note that 8-channel DDR4 @ 2400 MT/s has a theoretical bandwidth of 153.6 GB/s. Thus, this Intel Broadwell-EP machine's hardware efficiency is 72.3% - a bit on the low side, but this is not surprising for a server CPU.