
Commit 68de789

0D: Instruction caching and better benchmark function.
The previous benchmark function had a few flaws. First, it wasn't idiomatic Rust: it used a loop construct you would expect in C. That has been revamped to use an iterator. Second, the previous benchmark was heavily optimized by the compiler, which unrolled the inner loop into a huge sequence of consecutive loads and stores, resulting in lots of instructions that had to be fetched from DRAM. Additionally, instruction caching was not turned on. The new code compiles into two tight loops, fully leveraging the power of the I and D caches and providing a great showcase.
1 parent c65e2e5 commit 68de789

File tree

8 files changed

+19
-20
lines changed


0C_virtual_memory/kernel8

0 Bytes
Binary file not shown.

0C_virtual_memory/kernel8.img

0 Bytes
Binary file not shown.

0C_virtual_memory/src/mmu.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -219,8 +219,8 @@ pub unsafe fn init() {
219219
// First, force all previous changes to be seen before the MMU is enabled.
220220
barrier::isb(barrier::SY);
221221

222-
// Enable the MMU and turn on caching
223-
SCTLR_EL1.modify(SCTLR_EL1::M::Enable + SCTLR_EL1::C::Cacheable);
222+
// Enable the MMU and turn on data and instruction caching.
223+
SCTLR_EL1.modify(SCTLR_EL1::M::Enable + SCTLR_EL1::C::Cacheable + SCTLR_EL1::I::Cacheable);
224224

225225
// Force MMU init to complete before next instruction
226226
barrier::isb(barrier::SY);
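The `modify` call above ORs additional fields into `SCTLR_EL1`. As an illustrative sketch (not part of the commit), the three fields being set can be written out as raw bit positions, taken from the ARMv8-A reference manual: `M` (MMU enable) is bit 0, `C` (data/unified cacheability) is bit 2, and `I` (instruction cacheability) is bit 12.

```rust
// Bit positions of the SCTLR_EL1 fields used above (ARMv8-A reference manual).
const SCTLR_EL1_M: u64 = 1 << 0; // MMU enable
const SCTLR_EL1_C: u64 = 1 << 2; // Data/unified cacheability
const SCTLR_EL1_I: u64 = 1 << 12; // Instruction cacheability

fn main() {
    // The register crate's `modify` effectively ORs this mask into SCTLR_EL1.
    let mask = SCTLR_EL1_M | SCTLR_EL1_C | SCTLR_EL1_I;
    println!("bits OR-ed into SCTLR_EL1: {:#06x}", mask); // prints 0x1005
}
```

The old code set only `M` and `C`, so instruction fetches bypassed the I-cache; adding `I::Cacheable` is the one-bit change that this commit is about.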

0D_cache_performance/README.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ performance.
1313
## Benchmark
1414

1515
Let's write a tiny, arbitrary micro-benchmark to showcase the performance of
16-
operating on the same DRAM with caching enabled and disabled.
16+
operating with data on the same DRAM with caching enabled and disabled.
1717

1818
### mmu.rs
1919

@@ -31,7 +31,7 @@ block). This time, the block is configured as cacheable.
3131
We write a little function that iteratively reads memory of five times the size
3232
of a `cacheline`, in steps of 8 bytes, aka one processor register at a time. We
3333
read the value, add 1, and write it back. This whole process is repeated
34-
`100_000` times.
34+
`20_000` times.
3535

3636
### main.rs
3737

@@ -46,12 +46,12 @@ On my Raspberry, I get the following results:
4646

4747
```text
4848
Benchmarking non-cacheable DRAM modifications at virtual 0x00200000, physical 0x00400000:
49-
664 miliseconds.
49+
1040 milliseconds.
5050
5151
Benchmarking cacheable DRAM modifications at virtual 0x00400000, physical 0x00400000:
52-
148 miliseconds.
52+
53 milliseconds.
5353
54-
With caching, the function is 348% faster!
54+
With caching, the function is 1862% faster!
5555
```
5656

5757
Impressive, isn't it?
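The percentages in the README follow from simple integer arithmetic: a function that takes `slow` ms without caching and `fast` ms with caching is `(slow / fast - 1) * 100` percent faster. A small sketch (helper name is illustrative, not from the tutorial):

```rust
// Speedup in percent: how much faster the cached run is than the uncached one.
// Scaled by 100 before dividing to stay in integer arithmetic.
fn speedup_percent(slow_ms: u64, fast_ms: u64) -> u64 {
    (slow_ms * 100) / fast_ms - 100
}

fn main() {
    // Old benchmark numbers: 664 ms vs 148 ms.
    println!("{}%", speedup_percent(664, 148)); // prints 348%
    // New benchmark numbers: 1040 ms vs 53 ms.
    println!("{}%", speedup_percent(1040, 53)); // prints 1862%
}
```

Note that the uncached run got slower (664 → 1040 ms) because the new tight loop no longer hides DRAM latency behind unrolled code, which is exactly why the cached-vs-uncached gap widened so dramatically.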

0D_cache_performance/kernel8

48 Bytes
Binary file not shown.

0D_cache_performance/kernel8.img

-896 Bytes
Binary file not shown.

0D_cache_performance/src/benchmark.rs

Lines changed: 10 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -3,26 +3,25 @@ use cortex_a::{barrier, regs::*};
33

44
/// We assume that addr is cacheline aligned
55
pub fn batch_modify(addr: u64) -> u32 {
6-
const CACHELINE_SIZE_BYTES: u64 = 64; // TODO: retrieve this from a system register
7-
const NUM_CACHELINES_TOUCHED: u64 = 5;
8-
const BYTES_PER_U64_REG: usize = 8;
9-
const NUM_BENCH_ITERATIONS: u64 = 100_000;
6+
const CACHELINE_SIZE_BYTES: usize = 64; // TODO: retrieve this from a system register
7+
const NUM_CACHELINES_TOUCHED: usize = 5;
8+
const NUM_BENCH_ITERATIONS: usize = 20_000;
109

11-
const NUM_BYTES_TOUCHED: u64 = CACHELINE_SIZE_BYTES * NUM_CACHELINES_TOUCHED;
10+
const NUM_BYTES_TOUCHED: usize = CACHELINE_SIZE_BYTES * NUM_CACHELINES_TOUCHED;
1211

12+
let mem = unsafe { core::slice::from_raw_parts_mut(addr as *mut u64, NUM_BYTES_TOUCHED) };
13+
14+
// Benchmark starts here
1315
let t1 = CNTPCT_EL0.get();
1416

1517
compiler_fence(Ordering::SeqCst);
1618

17-
let mut data_ptr: *mut u64;
1819
let mut temp: u64;
1920
for _ in 0..NUM_BENCH_ITERATIONS {
20-
for i in (addr..(addr + NUM_BYTES_TOUCHED)).step_by(BYTES_PER_U64_REG) {
21-
data_ptr = i as *mut u64;
22-
21+
for qword in mem.iter_mut() {
2322
unsafe {
24-
temp = core::ptr::read_volatile(data_ptr);
25-
core::ptr::write_volatile(data_ptr, temp + 1);
23+
temp = core::ptr::read_volatile(qword);
24+
core::ptr::write_volatile(qword, temp + 1);
2625
}
2726
}
2827
}
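The same access pattern can be sketched as a host-runnable program (names and timing source are assumptions, not the kernel code): iterate a `u64` slice with `iter_mut`, volatile-read each quadword, add 1, volatile-write it back, and repeat the whole pass. One detail worth noting: `from_raw_parts_mut` takes an element count, so a byte total should be divided by `size_of::<u64>()` when building the slice.

```rust
use std::time::Instant;

// Host-side sketch of the benchmark loop: volatile accesses keep the
// compiler from folding the passes away, mirroring the kernel version.
fn batch_modify(mem: &mut [u64], iterations: usize) -> u128 {
    let t1 = Instant::now();
    for _ in 0..iterations {
        for qword in mem.iter_mut() {
            unsafe {
                let temp = std::ptr::read_volatile(qword);
                std::ptr::write_volatile(qword, temp + 1);
            }
        }
    }
    t1.elapsed().as_millis()
}

fn main() {
    // Five 64-byte cachelines, one u64 register-width at a time = 40 quadwords.
    let mut mem = vec![0u64; (5 * 64) / std::mem::size_of::<u64>()];
    let ms = batch_modify(&mut mem, 20_000);
    assert!(mem.iter().all(|&q| q == 20_000));
    println!("{} ms", ms);
}
```

Because the body is two nested loops over a tiny slice, the compiled inner loop fits comfortably in the I-cache, which is what makes the I-cache bit from the mmu.rs change visible in the numbers.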

0D_cache_performance/src/mmu.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -211,8 +211,8 @@ pub unsafe fn init() {
211211
// First, force all previous changes to be seen before the MMU is enabled.
212212
barrier::isb(barrier::SY);
213213

214-
// Enable the MMU and turn on caching
215-
SCTLR_EL1.modify(SCTLR_EL1::M::Enable + SCTLR_EL1::C::Cacheable);
214+
// Enable the MMU and turn on data and instruction caching.
215+
SCTLR_EL1.modify(SCTLR_EL1::M::Enable + SCTLR_EL1::C::Cacheable + SCTLR_EL1::I::Cacheable);
216216

217217
// Force MMU init to complete before next instruction
218218
barrier::isb(barrier::SY);
