**Describe the bug**
The Raspberry Pi 5 4GB performs slightly better (0-10%) than the 8GB version at the default 2.4 GHz, and the gap widens to >100% for certain workloads when overclocked. These workloads see a dramatic _reduction_ in performance when the 8GB board is overclocked. This reduction is not present at all on the 4GB board.
It's unclear to me whether the small performance difference at default clock frequency has the same root cause as the more dramatic one that emerges as the ARM core frequency is increased. For the time being I'm treating them as related and reporting on both in this issue.
**To reproduce**
I have found two workloads in particular that expose the issue: the Geekbench 5 "Text Rendering" multi-core sub-test and the stress-ng "numa" stressor. To reproduce this issue, I suggest benchmarking the 4GB and 8GB boards at both 2.4 GHz and 2.8 GHz.
Geekbench 5 is available here: https://www.geekbench.com/preview/
To run stress-ng with profiling info:
1. Install stress-ng and the "perf" tool: `sudo apt install stress-ng linux-perf linux-perf-dbgsym`
2. Run the following to enable perf to work with stress-ng: `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`
3. Perform the test and note the bogo ops/s value: `stress-ng --numa 4 --numa-ops 1000 --metrics --perf`
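To compare runs without reading the value off by hand, the bogo ops/s figure can be pulled out of the `metrc` row of the captured output. The following is only a minimal sketch: the `parse_bogo_ops` helper and its regular expression are my own, written against the output format shown in this report, and other stress-ng versions may lay the columns out differently.

```python
import re

# Matches a stress-ng "metrc" data row such as:
#   stress-ng: metrc: [3278] numa 1000 8.44 33.33 0.05 118.48 29.96 98.85 7040
# Columns: stressor, bogo ops, real, usr, sys, bogo ops/s (real), bogo ops/s (usr+sys)
METRIC_ROW = re.compile(
    r"stress-ng: metrc: \[\d+\]\s+(?P<stressor>\S+)\s+(?P<bogo_ops>\d+)\s+"
    r"(?P<real>[\d.]+)\s+(?P<usr>[\d.]+)\s+(?P<sys>[\d.]+)\s+"
    r"(?P<ops_real>[\d.]+)\s+(?P<ops_cpu>[\d.]+)"
)


def parse_bogo_ops(output: str) -> dict[str, float]:
    """Return {stressor: bogo ops/s (real time)} for every metrc data row."""
    results = {}
    for line in output.splitlines():
        match = METRIC_ROW.search(line)
        # The two header rows have no integer "bogo ops" column, so they do not match.
        if match:
            results[match.group("stressor")] = float(match.group("ops_real"))
    return results


sample = """\
stress-ng: metrc: [3278] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [3278] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [3278] numa 1000 8.44 33.33 0.05 118.48 29.96 98.85 7040
"""
print(parse_bogo_ops(sample))  # {'numa': 118.48}
```

The sample rows are copied from the 4GB 2.4 GHz run further down in this report.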
**Expected behaviour**
Ideally the 4GB and 8GB boards would perform close to the same. The smaller difference at stock frequencies could be deemed normal/expected (for example due to different RAM ICs being used), but the dramatically increasing gap in performance as ARM core frequency increases suggests something may be misbehaving.
**Actual behaviour**
The 4GB board is anywhere from a few percent to more than 100% faster than the 8GB board, depending on clock frequency and workload. Below is a summary of the tests I've run. As can be seen, the 4GB board uses Samsung RAM and the 8GB board uses Micron RAM.
![Pi 5 4GB vs 8GB v2](https://github.com/raspberrypi/firmware/assets/20064358/d43d87f4-c5f0-4ee2-af86-54bdc16e8e65)
These benchmark results are completely reproducible. I've also looked at other people's Geekbench 5 submissions and can see the same reduction in "Text Rendering" scores on overclocked 8GB boards (but not on overclocked 4GB boards), so this is not limited to my specimen.
Below are the Geekbench 5 results at 2.4 and 2.8 GHz for the runs listed in the table above.
4GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028307
8GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028116
4GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028479
8GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028225
Below is the "perf" tool output for the stress-ng runs at 2.4 and 2.8 GHz for both boards:
**4GB (2.4 GHz):**
```
stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info: [3278] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [3278] dispatching hogs: 4 numa
stress-ng: info: [3279] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [3278] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [3278] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [3278] numa 1000 8.44 33.33 0.05 118.48 29.96 98.85 7040
stress-ng: info: [3278] numa:
stress-ng: info: [3278] 81,220,668,668 CPU Cycles 9.480 B/sec
stress-ng: info: [3278] 4,757,399,712 Instructions 0.555 B/sec (0.059 instr. per cycle)
stress-ng: info: [3278] 258,736 Branch Misses 30.199 K/sec ( 0.000%)
stress-ng: info: [3278] 122,043,476 Stalled Cycles Frontend 14.244 M/sec
stress-ng: info: [3278] 76,572,843,124 Stalled Cycles Backend 8.937 B/sec
stress-ng: info: [3278] 81,230,134,176 Bus Cycles 9.481 B/sec
stress-ng: info: [3278] 4,349,256,396 Cache References 0.508 B/sec
stress-ng: info: [3278] 1,719,788 Cache Misses 0.201 M/sec ( 0.040%)
stress-ng: info: [3278] 4,436,001,520 Cache L1D Read 0.518 B/sec
stress-ng: info: [3278] 1,745,244 Cache L1D Read Miss 0.204 M/sec ( 0.039%)
stress-ng: info: [3278] 1,235,273,488 Cache L1I Read 0.144 B/sec
stress-ng: info: [3278] 665,872 Cache L1I Read Miss 77.718 K/sec
stress-ng: info: [3278] 1,549,288 Cache LL Read 0.181 M/sec
stress-ng: info: [3278] 1,187,048 Cache LL Read Miss 0.139 M/sec (76.619%)
stress-ng: info: [3278] 4,343,855,792 Cache DTLB Read 0.507 B/sec
stress-ng: info: [3278] 5,006,956 Cache DTLB Read Miss 0.584 M/sec ( 0.115%)
stress-ng: info: [3278] 1,160,470,704 Cache BPU Read 0.135 B/sec
stress-ng: info: [3278] 220,720 Cache BPU Read Miss 25.762 K/sec ( 0.019%)
stress-ng: info: [3278] 33,889,417,812 CPU Clock 3.955 B/sec
stress-ng: info: [3278] 33,889,294,388 Task Clock 3.955 B/sec
stress-ng: info: [3278] 1,060 Page Faults Total 123.720 /sec
stress-ng: info: [3278] 1,060 Page Faults Minor 123.720 /sec
stress-ng: info: [3278] 0 Page Faults Major 0.000 /sec
stress-ng: info: [3278] 228 Context Switches 26.611 /sec
stress-ng: info: [3278] 160 Cgroup Switches 18.675 /sec
stress-ng: info: [3278] 0 CPU Migrations 0.000 /sec
stress-ng: info: [3278] 0 Alignment Faults 0.000 /sec
stress-ng: info: [3278] 0 Emulation Faults 0.000 /sec
stress-ng: info: [3278] successful run completed in 8.57s
```
**8GB (2.4 GHz):**
```
stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info: [2825] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [2825] dispatching hogs: 4 numa
stress-ng: info: [2826] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2825] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [2825] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [2825] numa 1000 9.33 37.08 0.06 107.20 26.93 99.51 7072
stress-ng: info: [2825] numa:
stress-ng: info: [2825] 87,844,236,208 CPU Cycles 9.284 B/sec
stress-ng: info: [2825] 4,775,378,096 Instructions 0.505 B/sec (0.054 instr. per cycle)
stress-ng: info: [2825] 199,680 Branch Misses 21.103 K/sec ( 0.000%)
stress-ng: info: [2825] 148,631,000 Stalled Cycles Frontend 15.708 M/sec
stress-ng: info: [2825] 83,139,642,756 Stalled Cycles Backend 8.786 B/sec
stress-ng: info: [2825] 87,849,436,396 Bus Cycles 9.284 B/sec
stress-ng: info: [2825] 4,360,769,624 Cache References 0.461 B/sec
stress-ng: info: [2825] 3,826,552 Cache Misses 0.404 M/sec ( 0.088%)
stress-ng: info: [2825] 4,326,723,104 Cache L1D Read 0.457 B/sec
stress-ng: info: [2825] 3,818,744 Cache L1D Read Miss 0.404 M/sec ( 0.088%)
stress-ng: info: [2825] 1,205,851,084 Cache L1I Read 0.127 B/sec
stress-ng: info: [2825] 688,460 Cache L1I Read Miss 72.759 K/sec
stress-ng: info: [2825] 3,388,484 Cache LL Read 0.358 M/sec
stress-ng: info: [2825] 2,846,120 Cache LL Read Miss 0.301 M/sec (83.994%)
stress-ng: info: [2825] 4,366,253,912 Cache DTLB Read 0.461 B/sec
stress-ng: info: [2825] 5,043,928 Cache DTLB Read Miss 0.533 M/sec ( 0.116%)
stress-ng: info: [2825] 1,179,947,616 Cache BPU Read 0.125 B/sec
stress-ng: info: [2825] 194,724 Cache BPU Read Miss 20.579 K/sec ( 0.017%)
stress-ng: info: [2825] 36,669,122,012 CPU Clock 3.875 B/sec
stress-ng: info: [2825] 36,668,453,648 Task Clock 3.875 B/sec
stress-ng: info: [2825] 1,064 Page Faults Total 112.447 /sec
stress-ng: info: [2825] 1,064 Page Faults Minor 112.447 /sec
stress-ng: info: [2825] 0 Page Faults Major 0.000 /sec
stress-ng: info: [2825] 360 Context Switches 38.046 /sec
stress-ng: info: [2825] 360 Cgroup Switches 38.046 /sec
stress-ng: info: [2825] 0 CPU Migrations 0.000 /sec
stress-ng: info: [2825] 0 Alignment Faults 0.000 /sec
stress-ng: info: [2825] 0 Emulation Faults 0.000 /sec
stress-ng: info: [2825] successful run completed in 9.46s
```
**4GB (2.8 GHz):**
```
stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info: [2897] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [2897] dispatching hogs: 4 numa
stress-ng: info: [2898] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2897] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [2897] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [2897] numa 1000 8.43 33.43 0.06 118.66 29.86 99.35 7024
stress-ng: info: [2897] numa:
stress-ng: info: [2897] 94,804,086,272 CPU Cycles 11.088 B/sec
stress-ng: info: [2897] 4,743,732,748 Instructions 0.555 B/sec (0.050 instr. per cycle)
stress-ng: info: [2897] 171,072 Branch Misses 20.008 K/sec ( 0.000%)
stress-ng: info: [2897] 141,005,824 Stalled Cycles Frontend 16.491 M/sec
stress-ng: info: [2897] 90,175,677,856 Stalled Cycles Backend 10.546 B/sec
stress-ng: info: [2897] 94,811,739,556 Bus Cycles 11.089 B/sec
stress-ng: info: [2897] 4,351,381,748 Cache References 0.509 B/sec
stress-ng: info: [2897] 2,356,576 Cache Misses 0.276 M/sec ( 0.054%)
stress-ng: info: [2897] 4,363,258,428 Cache L1D Read 0.510 B/sec
stress-ng: info: [2897] 2,354,384 Cache L1D Read Miss 0.275 M/sec ( 0.054%)
stress-ng: info: [2897] 1,205,560,624 Cache L1I Read 0.141 B/sec
stress-ng: info: [2897] 579,892 Cache L1I Read Miss 67.821 K/sec
stress-ng: info: [2897] 2,040,828 Cache LL Read 0.239 M/sec
stress-ng: info: [2897] 1,692,944 Cache LL Read Miss 0.198 M/sec (82.954%)
stress-ng: info: [2897] 4,372,534,272 Cache DTLB Read 0.511 B/sec
stress-ng: info: [2897] 5,064,992 Cache DTLB Read Miss 0.592 M/sec ( 0.116%)
stress-ng: info: [2897] 1,175,805,832 Cache BPU Read 0.138 B/sec
stress-ng: info: [2897] 140,216 Cache BPU Read Miss 16.399 K/sec ( 0.012%)
stress-ng: info: [2897] 33,905,985,500 CPU Clock 3.965 B/sec
stress-ng: info: [2897] 33,905,692,244 Task Clock 3.965 B/sec
stress-ng: info: [2897] 1,064 Page Faults Total 124.440 /sec
stress-ng: info: [2897] 1,064 Page Faults Minor 124.440 /sec
stress-ng: info: [2897] 0 Page Faults Major 0.000 /sec
stress-ng: info: [2897] 224 Context Switches 26.198 /sec
stress-ng: info: [2897] 204 Cgroup Switches 23.859 /sec
stress-ng: info: [2897] 0 CPU Migrations 0.000 /sec
stress-ng: info: [2897] 0 Alignment Faults 0.000 /sec
stress-ng: info: [2897] 0 Emulation Faults 0.000 /sec
stress-ng: info: [2897] successful run completed in 8.55s
```
**8GB (2.8 GHz):**
```
stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info: [2065] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [2065] dispatching hogs: 4 numa
stress-ng: info: [2066] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2065] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [2065] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [2065] numa 1000 19.53 77.52 0.19 51.19 12.87 99.45 7024
stress-ng: info: [2065] numa:
stress-ng: info: [2065] 213,041,661,036 CPU Cycles 10.802 B/sec
stress-ng: info: [2065] 4,850,988,464 Instructions 0.246 B/sec (0.023 instr. per cycle)
stress-ng: info: [2065] 252,376 Branch Misses 12.796 K/sec ( 0.000%)
stress-ng: info: [2065] 635,563,820 Stalled Cycles Frontend 32.225 M/sec
stress-ng: info: [2065] 207,857,257,028 Stalled Cycles Backend 10.539 B/sec
stress-ng: info: [2065] 213,076,738,404 Bus Cycles 10.804 B/sec
stress-ng: info: [2065] 4,413,222,720 Cache References 0.224 B/sec
stress-ng: info: [2065] 67,889,660 Cache Misses 3.442 M/sec ( 1.538%)
stress-ng: info: [2065] 4,424,491,460 Cache L1D Read 0.224 B/sec
stress-ng: info: [2065] 67,486,816 Cache L1D Read Miss 3.422 M/sec ( 1.525%)
stress-ng: info: [2065] 1,239,622,236 Cache L1I Read 62.852 M/sec
stress-ng: info: [2065] 1,194,376 Cache L1I Read Miss 60.558 K/sec
stress-ng: info: [2065] 73,587,652 Cache LL Read 3.731 M/sec
stress-ng: info: [2065] 67,162,612 Cache LL Read Miss 3.405 M/sec (91.269%)
stress-ng: info: [2065] 4,374,670,156 Cache DTLB Read 0.222 B/sec
stress-ng: info: [2065] 5,522,360 Cache DTLB Read Miss 0.280 M/sec ( 0.126%)
stress-ng: info: [2065] 1,194,694,580 Cache BPU Read 60.574 M/sec
stress-ng: info: [2065] 210,600 Cache BPU Read Miss 10.678 K/sec ( 0.018%)
stress-ng: info: [2065] 76,637,242,788 CPU Clock 3.886 B/sec
stress-ng: info: [2065] 76,635,993,260 Task Clock 3.886 B/sec
stress-ng: info: [2065] 1,064 Page Faults Total 53.948 /sec
stress-ng: info: [2065] 1,064 Page Faults Minor 53.948 /sec
stress-ng: info: [2065] 0 Page Faults Major 0.000 /sec
stress-ng: info: [2065] 556 Context Switches 28.191 /sec
stress-ng: info: [2065] 536 Cgroup Switches 27.177 /sec
stress-ng: info: [2065] 0 CPU Migrations 0.000 /sec
stress-ng: info: [2065] 0 Alignment Faults 0.000 /sec
stress-ng: info: [2065] 0 Emulation Faults 0.000 /sec
stress-ng: info: [2065] successful run completed in 19.72s
```
The 8GB 2.8 GHz result sticks out. The figures below compare it with the 4GB board at 2.8 GHz; the 8GB board's own 2.4 GHz run is close to that baseline (0.054 instr. per cycle, 0.404 M/sec L1D read misses), so the regression appears only when the 8GB board is overclocked:
- Instructions per cycle drops to less than half: 0.050 vs 0.023 instr. per cycle
- Stalled Cycles Frontend (per second) roughly doubles: 16.491 M/sec vs 32.225 M/sec
- Cache L1D Read Miss increases from 275 000 per second to 3 422 000 per second
- Cache LL Read & Cache LL Read Miss increase from 0.239 M/sec & 0.198 M/sec to 3.731 M/sec & 3.405 M/sec
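The size of the gap can be recomputed directly from the raw figures in the logs above. This is a small sketch rather than part of the original test procedure: the `RUNS` table, `percent_faster` and `ipc` are names I made up for illustration, and the numbers are copied by hand from the four stress-ng runs.

```python
# Figures copied from the four stress-ng runs in this report.
RUNS = {
    ("4GB", 2400): {"bogo_ops_s": 118.48, "cycles": 81_220_668_668, "instructions": 4_757_399_712},
    ("8GB", 2400): {"bogo_ops_s": 107.20, "cycles": 87_844_236_208, "instructions": 4_775_378_096},
    ("4GB", 2800): {"bogo_ops_s": 118.66, "cycles": 94_804_086_272, "instructions": 4_743_732_748},
    ("8GB", 2800): {"bogo_ops_s": 51.19, "cycles": 213_041_661_036, "instructions": 4_850_988_464},
}


def percent_faster(fast: float, slow: float) -> float:
    """How much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


def ipc(run: dict) -> float:
    """Instructions per cycle, derived from the raw perf counters."""
    return run["instructions"] / run["cycles"]


for mhz in (2400, 2800):
    small, large = RUNS[("4GB", mhz)], RUNS[("8GB", mhz)]
    gap = percent_faster(small["bogo_ops_s"], large["bogo_ops_s"])
    print(f"{mhz} MHz: 4GB is {gap:.1f}% faster; IPC {ipc(small):.3f} vs {ipc(large):.3f}")
```

With these inputs the 4GB board comes out about 10.5% faster at 2400 MHz and about 132% faster at 2800 MHz, which matches the "0-10%" and ">100%" ranges given at the top of this report.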
Finally, I should mention that RAM bandwidth and latency tests do not show any issues.
**System**
raspinfo output: https://gist.github.com/Brunnis/4d8242cf757f28e1d5331b3f73b3a446