Memory Interference Characterization Between CPU Cores and Integrated GPUs in Mixed-Criticality Platforms
Abstract—Most of today's mixed-criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphics Processing Unit (iGPU, the device) for access to central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access for both clients through often undisclosed or non-priority-driven mechanisms. This aspect becomes critical when the iGPU is a high-performance, massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel paradigms for memory-centric scheduling mechanisms. We analyzed different well-known and commercially available platforms in order to estimate variations in throughput and latency under various memory access patterns, both at the host and at the device side.
I. INTRODUCTION

Modern Systems on Chip (SoCs) integrate within a single chip substrate many functionalities that are usually fabricated as distinct entities in more traditional designs, such as laptops or desktop computers. Examples of these integrated functionalities commonly featured in embedded boards are the CPU complex (i.e., multi-core processors), the integrated GPU, and their respective memory interfaces. Each core of the CPU complex and the iGPU can process tasks in parallel, as they are independent compute units. However, contention may occur at the memory interface level. More specifically, CPU cores and the iGPU might share common cache levels, hence experiencing self-eviction phenomena. In addition, the system memory (usually DRAM) also represents a contended resource for all memory controller clients experiencing cache misses at their Last Level Cache (LLC). It is mandatory to accurately measure the impact of such contention in mixed-criticality platforms, as memory contention poses a significant threat to the Worst Case Execution Times (WCETs) of memory-bounded applications, as will be shown in this study. The contribution of this paper is to provide accurate measurements of both intra-CPU-complex memory interference and iGPU activity. We will show how such interference impacts throughput and latency both on the host side and on the iGPU device.

The ultimate purpose of this analysis is to highlight the need for accurate memory-centric scheduling mechanisms to be set up for guaranteeing prioritized memory accesses to the Real-Time critical parts of the system. Special emphasis will be put on the memory traffic originated by the iGPU, as it represents a very popular architectural paradigm for computing massively parallel workloads at impressive performance-per-Watt ratios [1]. This architectural choice (commonly referred to as General Purpose GPU computing, GPGPU) is one of the reference architectures for future embedded mixed-criticality applications, such as autonomous driving and unmanned aerial vehicle control.

In order to maximize the validity of the presented results, we considered different platforms featuring different memory controllers, instruction sets, data bus widths, cache hierarchy configurations and programming models:
i) the NVIDIA Tegra K1 SoC (TK1), using CUDA 6.5 [2] for GPGPU applications;
ii) the NVIDIA Tegra X1 SoC (TX1), using CUDA 8.0; and
iii) the Intel i7-6700 SoC featuring an HD 530 integrated GPU, using OpenCL 2.0 [3].

This paper is organized as follows: Section II presents a brief, up-to-date review of previous studies regarding memory contention in integrated devices. Section III includes a thorough description of the platforms being characterized. Section IV describes the experimental framework and the obtained results. Section VI concludes the paper.
II. RELATED WORK

As soon as the processor industry introduced the concept of multi-core CPUs, memory contention was observed to become a potential bottleneck, mostly due to bus contention and cache pollution phenomena [4]. Later studies [5], [6], [7] successfully identified methodologies to bound the delays experienced by Real-Time tasks due to memory access contention. Memory arbitration mechanisms have recently been proposed to decrease the impact of memory contention in the design of critical applications, implementing co-scheduling mechanisms for the memory and processing bandwidth of multiple host cores [8]. Examples of such memory arbitration mechanisms are MEMGUARD [9], BWLOCK [10] and the PREM execution model [11]. The previously cited contributions are instrumental to our work, because they aim at understanding the location of the contention points within the memory hierarchy with respect to the platforms we used in our tests. Moreover, such contention points might differ according to the analyzed COTS (Commercial Off The Shelf)
systems, hence the need to qualitatively characterize the recently commercialized platforms we used in our experiments (detailed in the next section). Since the integrated GPUs of the observed SoCs are most likely used to perform computations with strict Real-Time requirements, it is important to estimate the impact of unregulated CPU memory accesses during the execution of Real-Time GPU applications. It is also easy to see that, in a mixed-criticality system, the iGPU can execute non-critical applications during the same time windows in which one or more CPU cores are executing Real-Time tasks, hence the need to observe the impact of unregulated GPU activity on CPU memory access latencies. CPU-GPU co-run interference is a topic that was not treated in the previously cited works, but was briefly explored in [12]. In a simulated environment, the authors of this latter contribution highlighted a decrease in instructions per cycle for CPU activity, together with a decrease in memory bandwidth on the GPU side. In our contribution, we therefore provide more accurate measurements on commercially available SoCs.
III. SOCs SPECIFICATIONS AND CONTENTION POINTS

In this section, we provide the necessary details on the platforms we selected for our analysis. This is instrumental for understanding where memory access contention might happen on each SoC. The preliminary investigation of plausible contention points will help us better understand the experimental results and identify future solutions [13]. The delays due to the private memory of the iGPU are not considered in this study, which focuses only on the memory shared with the CPU complex. Private memory contention may be a significant issue for discrete GPUs that feature a large number of execution units (i.e., Streaming Multiprocessors in NVIDIA GPUs). Today's iGPUs, however, are much smaller than their discrete counterparts. As a consequence, parallel kernels are typically executed one at a time, without generating contention at the private memory level.

Another important abstract component on the iGPU side is the Copy Engine (CE), i.e., a DMA engine used to bring data from CPU to GPU memory space. In discrete devices, this basically translates into copying memory from system DRAM through PCIe towards the on-board RAM of the graphics adapter (Video RAM, VRAM). In the case of embedded platforms with shared system DRAM, using the CE basically means duplicating the same buffer on the same memory device. Both the CUDA and OpenCL programming models specify alternatives to the CE approach that avoid explicit memory transfers and unnecessary buffer replications, such as CUDA UVM (Unified Virtual Memory [14]) and OpenCL 2.0 SVM (Shared Virtual Memory [15]). However, these approaches introduce CPU-iGPU memory coherency problems when accessing the same shared memory buffer, so that avoiding copy engines does not necessarily lead to performance improvements.¹ For this reason, we will characterize the contention originated in both CE- and non-CE-based models.

¹ Visible in the results of these experiments: https://github.com/Sarahild/CudaMemoryExperiments/tree/master/MemCpyExperiments
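To make the two models contrasted above concrete, the following minimal CUDA sketch shows an explicit CE-based round trip next to a UVM allocation; the kernel, sizes and launch geometry are illustrative assumptions, not the benchmark code used in this paper.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Trivial kernel standing in for any GPGPU workload.
__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // CE-based model: the buffer exists twice in the same system DRAM,
    // and the Copy Engine replicates it via DMA in both directions.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CE DMA transfer
    scale<<<(N + 255) / 256, 256>>>(d, N);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // CE DMA transfer

    // Non-CE model (CUDA UVM): one allocation visible to host and device;
    // coherence is handled by the driver, with the caveats noted above.
    float *u;
    cudaMallocManaged(&u, bytes);
    scale<<<(N + 255) / 256, 256>>>(u, N);
    cudaDeviceSynchronize();  // make the result visible to the CPU
    printf("%f\n", u[0]);

    cudaFree(d);
    cudaFree(u);
    free(h);
    return 0;
}
```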
A. NVIDIA Tegra K1

The NVIDIA Tegra K1 [16] is a hybrid SoC featured in the NVIDIA Jetson development board. It is the first mobile processor to offer the same advanced features and architecture as a modern desktop GPU while retaining the low power draw of a mobile chip (365 GFlops single-precision peak performance at < 11 W). The most relevant parts of this platform and its notable contention points are visible in Figure 1(a).

The K1 SoC consists of a quad-core 2.3 GHz ARM Cortex-A15 CPU (32 KB I-cache + 32 KB D-cache L1 per core, 2 MB L2 cache common to all cores); the ARM A15 belongs to the ARMv7-A family of RISC processors and features a 32-bit architecture. Although not shown in the picture, an ARM Cortex-A15 shadow core is also present for power-saving policies. We consider all cores to be clocked at their maximum operative frequencies. A single CPU core can utilize the maximum bandwidth available to the whole CPU complex, which amounts to almost 3.5 GB/s for sequential read operations. The iGPU is a Kepler-generation “GK20a” with 192 CUDA cores grouped in a single Streaming Multiprocessor (SM). As visible from Figure 1(a), the compute pipeline of an NVIDIA GPU includes engines responsible for computations (Execution Engine, EE) and engines responsible for high-bandwidth DMA memory transfers (Copy Engine, CE), as explained in [2] and [17]. In the TK1 case, the EE is composed of a single SM, which is able to access system memory in case of L2 cache misses. The CE is meant to replicate CPU-visible buffers to another area of the same system DRAM that is visible only to the GPU device. It does so by exploiting high-bandwidth DMA transfers that can reach up to 12 GB/s out of the 14 GB/s theoretical bandwidth offered by the system DRAM. Therefore, the GPU alone is able to saturate the available system DRAM bandwidth. Specifically regarding the system DRAM, the K1 features 2 GB of LPDDR3 64-bit SDRAM working at (maximum) 933 MHz.
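A transfer-bandwidth figure like the 12 GB/s quoted above can be estimated by timing a large pinned-memory copy with CUDA events; the sketch below is a minimal version of such a probe, with the buffer size chosen arbitrarily rather than taken from this paper's experimental setup.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256UL << 20;  // 256 MiB test buffer (assumption)
    float *h, *d;
    cudaMallocHost(&h, bytes);         // pinned, so the CE can DMA it
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CE at work
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("CE bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```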
The first contention point is represented by the LLC of the CPU complex. This cache is 16-way set-associative with a line size of 64 B. Contention may happen when more than one core fills the L2 cache, evicting pre-existing cache lines used by other cores. Another contention point for the LLC is indicated as point 3 in Figure 1(a), and it is due to the coherency mechanisms between CPU and iGPU when they share the same address space. In the GK20a iGPU, such coherence is taken care of by unclear and undisclosed software mechanisms within the (NVIDIA proprietary) GPU driver. Hardware cache coherence mechanisms take place only at the CPU complex level.

The remaining contention points (2 and 4 in Figure 1(a)) are represented by the memory bus and the EMC (Embedded Memory Controller) arbitrating accesses to the underlying DRAM banks. Such contention is of utmost importance, as it is caused by parallel memory accesses from individual CPU cores and the iGPU. We refer to [18] for a finer-grained discussion of the effects of bank parallelism in DRAM devices.

B. NVIDIA Tegra X1

The NVIDIA Tegra X1 [19] is a hybrid System on Module (SoM) featured in the newest NVIDIA Jetson development board. It is the first mobile processor to feature a chip powerful enough to sustain the visual computing load of autonomous and assisted driving applications while still presenting a contained power consumption (1 TFlops single-precision peak performance, drawing from 6 to 15 W).
Fig. 1: A simplified overview of the Tegra K1 (a) and X1 (b) SoCs, with notable memory access contention points numbered from 1 to 4: (1) contention on the L2 cache shared by the 4 cores; (2) contention on the bus to central memory from different clients; (3) coherency protocol on the LLC; (4) access arbitration on the main memory controller.
Also for this platform, the most relevant components and notable contention points are shown in Figure 1(b).

The X1 CPU and GPU complex consists of a quad-core 1.9 GHz ARM Cortex-A57 CPU (48 KB I-cache + 32 KB D-cache L1 per core, 2 MB L2 cache common to all cores); the ARM A57 belongs to the ARMv8-A family of RISC processors and features a 64-bit architecture. Even if not visible in the figure, the CPU complex features a big.LITTLE architecture, with four additional ARM Cortex-A53 little cores for power-saving purposes. As for the K1, we will not analyze the performance of this board under power-saving regimes. A single CPU core can utilize the maximum bandwidth available to the whole CPU complex, which amounts to almost 4.5 GB/s for sequential read operations. The iGPU is a second-generation Maxwell “GM20b” with 256 CUDA cores grouped in two Streaming Multiprocessors (SMs). Its L2 is twice the size of its Kepler-based predecessor's. The EE and CE can access central memory with a maximum bandwidth close to 20 GB/s. As with the K1, this high-performance iGPU can also saturate the whole DRAM bandwidth. The system DRAM consists of 4 GB of LPDDR4 64-bit SDRAM working at (maximum) 1.6 GHz, reaching a peak ideal bandwidth of 25.6 GB/s. With relation to the contention points, there are no substantial differences with respect to the K1.
C. Intel i7-6700

The Intel i7-6700 processor presents noticeable differences with respect to the two boards described in the previous paragraphs. This SoC features a quad-core CPU complex with Hyper-Threading (HT) technology built on the well-known x86 64-bit CISC architecture. Technical points of interest for this platform are depicted in Figure 2.

This specific processor belongs to the Skylake (6th) generation of Intel CPUs. Such processors are common in desktop or laptop configurations, with a SoC power consumption around 65 W, higher than in the previously described boards. Another peculiar difference concerns the cache hierarchy, where, unlike the ARM-based designs, the L2 cache is not shared among physical cores. This is an architectural choice that Intel has adopted since the Nehalem generation (released in late 2008). Instead, each L2 cache is shared between the two logical cores of the same physical computing unit. According to Intel HyperThreading's proprietary design [20], a number of logical cores twice the number of physical cores is made available to the operating system. Hence, 8 threads can be scheduled to run concurrently. This is made possible by exploiting aggressive out-of-order execution policies and other optimization techniques aimed at improving the instruction-level parallelism of a single physical core. However, HyperThreading is also known to increase the competition for shared memory resources [21].
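In latency experiments on such a part, it matters which logical core an interfering thread lands on; the host-side sketch below shows one way to pin a thread to a chosen logical core. The assumption that logical cores i and i+4 are HT siblings reflects a common Linux enumeration on 4-core/8-thread CPUs and should be verified via /sys/devices/system/cpu/cpu*/topology.

```cuda
// Host-only code: no device kernels involved.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one logical core so that interference can be
// attributed to a known physical core (or to its HT sibling).
static int pin_to_logical_core(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    // Logical core 4 is assumed to be the HT sibling of physical core 0.
    int err = pin_to_logical_core(4);
    if (err != 0) {
        fprintf(stderr, "affinity error: %d\n", err);
        return 1;
    }
    printf("interfering thread pinned; run its read loop from here\n");
    return 0;
}
```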
Figure 2 shows the interconnections between the iGPU, each CPU core, and the interfaces to the Memory Controller (MC), the Display Controller (DC) and the PCI Express bus (PCIe). In our experimental setup, the i7 interfaces with 16 GB of 64-bit DDR4 DRAM clocked at 2133 MHz, allowing the CPU complex to reach a theoretical bandwidth of 34 GB/s and a measured bandwidth of 29 GB/s for sequential reads using Intel MLC². Both the CPU complex and the iGPU reach the memory controller through the SoC ring interconnection fabric depicted in Figure 2. The iGPU in this SoC is an Intel HD 530 belonging to the Gen9 Intel graphics architecture [22]. Differently from the NVIDIA architectures, the compute pipeline consists of an execution engine composed of a slice, divided into three sub-slices. Each of these partitions relies on 8 Execution Units (EU 0 to 7, as shown in Figure 2). Another difference w.r.t. the previous solutions is represented by the cache hierarchy, where the LLC is shared between the CPU and the GPU, with related HW coherency mechanisms. A cache miss in the GPU L3 implies using the SoC ring interconnect to access the CPU L3 and, if needed, the external memory controller. Moreover, as detailed in [22], the GPU L1 and L2 caches are read-only caches used exclusively by the graphics pipeline, and thus do not represent a significant contention point. The HD 530 also has an abstraction of a Copy Engine used to replicate data from a CPU-only visible address space to a space accessible by the iGPU. However, thanks to the CPU-GPU shared cache level, the best practice for using the iGPU in such designs is to exploit the unified memory architecture (SVM model in OpenCL 2.0).

² Intel Memory Latency Checker v. 3.1, available at https://software.intel.com/en-us/articles/intelr-memory-latency-checker
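The sequential-read figure quoted above can be approximated, far less rigorously than with MLC, by streaming through a buffer much larger than the LLC; the sketch below is a single-threaded, host-side approximation with an arbitrary buffer size, so it will underestimate the multi-channel peak.

```cuda
// Host-only code: a crude sequential-read bandwidth probe.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 512UL << 20;  // 512 MiB, far beyond any LLC
    const size_t n = bytes / sizeof(uint64_t);
    uint64_t *buf = (uint64_t *)malloc(bytes);
    memset(buf, 1, bytes);             // touch all pages before timing

    uint64_t sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; i++) sum += buf[i];  // sequential reads
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    // Print sum so the compiler cannot elide the read loop.
    printf("%.2f GB/s (sum=%llu)\n", bytes / 1e9 / s,
           (unsigned long long)sum);
    free(buf);
    return 0;
}
```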
Fig. 2: A simplified overview of the Intel i7-6700 Skylake with notable contention points: (1) represents contention on shared
L3 cache among all CPU cores; (2) contention on the shared CPU-iGPU LLC; (3) parallel access to system DRAM by all the
presented actors. (1) and (2) also include coherency overhead for CPU and iGPU traffic.
Fig. 3: Test results for Tegra K1: WSS [Byte] (log) vs. Latencies [ns]. Vertical lines in (a)-(f) correspond to L1 and L2 sizes; panels (g) and (h) show test cases C1 and C2.
situations, i.e., sequential non-interfered reads (15-18 ns) and random reads with CUDA memset interference (463 ns).

In Figure 4 (g) and (h), we see the effect of CPU activity on iGPU-related tasks (Test Case C). The relative impact on iGPU execution times is larger than the one observed on the K1. In the case of three sequentially interfering cores, CUDA memset and memcpy execution times increase by 52% (it was 30% in the K1 case). This is mostly due to the enhanced memory bandwidth of the CPU cluster in the X1 platform, together with the more aggressive prefetching mechanisms of the A57s, as will be shown in Section V-D. With random interference, the deterioration is somewhat smaller (12%), although not as small as it was in the K1 case (24%), confirming the larger memory bandwidth utilized by the CPU cluster in the X1.
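A measurement of this kind can be obtained by timing the device-side operation with CUDA events while interfering read loops run on the CPU cores; the sketch below times a CUDA memset under the assumption that the interference threads are started (and pinned) separately, with an illustrative buffer size.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64UL << 20;  // 64 MiB device buffer (assumption)
    void *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // CPU interference (sequential or random reads on 1-3 cores) is
    // assumed to be already running in separate pinned threads here.
    cudaEventRecord(start);
    cudaMemset(d, 0, bytes);          // the iGPU activity under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemset of %zu MiB took %.3f ms\n", bytes >> 20, ms);

    cudaFree(d);
    return 0;
}
```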
C. Latencies on i7-6700

Interpreting the results on the i7 is much more difficult than on the previous solutions: the x86 architecture is substantially more complex than the RISC-based architectures analyzed so far. This implies much more aggressive and counter-intuitive optimization strategies for prefetching, speculative execution and other mechanisms (e.g., System Management Mode [24] and non-trivial cache mapping policies [25]) that are not analyzed in this paper but that might affect the results.

Still, some meaningful aspects can be identified. Experiments for Test Case A are depicted in Figure 5. The latencies with no interference are significantly smaller than in the previous SoCs. For sequential reads, increasing the working set size beyond L3 only adds a few ns with respect to the latency measured within L1 (from 1 to 4-5 ns). For random reads, a significant performance decrease (about 11x) can be noticed for WSS beyond L3.
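Latency-versus-WSS curves of the kind plotted in Figures 3-5 are typically produced with a pointer-chasing loop, where each load depends on the previous one so that prefetchers cannot hide DRAM latency in the random case. The host-side sketch below illustrates the idea for a single working set size; the sizes and iteration counts are assumptions, not this paper's actual benchmark parameters.

```cuda
// Host-only code: pointer-chasing latency probe for one WSS value.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

int main() {
    const size_t wss = 8UL << 20;              // working set size: 8 MiB
    const size_t n = wss / sizeof(size_t);
    std::vector<size_t> next(n);

    // Build a random single-cycle permutation (Sattolo's algorithm), so
    // the chase visits every element exactly once per cycle.
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i;                 // j in [0, i-1]
        std::swap(next[i], next[j]);
    }

    // Chase: every load address depends on the previous load's value.
    const size_t iters = 10 * n;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t k = 0; k < iters; k++) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Print idx so the compiler cannot optimize the chase away.
    printf("avg load latency: %.2f ns (idx=%zu)\n", ns / iters, idx);
    return 0;
}
```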
Fig. 4: Test results for Tegra X1: WSS [Byte] (log) vs. Latencies [ns]. Vertical lines in (a)-(f) correspond to L1 and L2 sizes.
Increasing the number of interfering cores, contention on L1 and L2 is almost absent in the case of random interference (A3 and A4), while it is more noticeable in the case of sequential interference (A1 and A2). When reaching the L3 boundaries, a dramatic performance deterioration takes place, with latencies increasing up to 17x for sequential reads and up to 6x for random reads, both with sequentially interfering cores. A smaller impact on relative performance deterioration is noticed when the interfering cores perform random reads.

The uneven spacing between the lines in the graph is due to the fact that half of the cores are not physical cores, but logical cores enabled by Intel HyperThreading technology. These logical cores share resources with a corresponding physical core, competing for L1 and L2 access. Therefore, logical cores do not add significant contention on system DRAM w.r.t. their physical counterparts [20], [21].

Experiments for Test Case B, related to iGPU interference, are shown in Figure 5 (c) and (f). As already pointed out in Section III-C, a larger impact on latencies is expected due to the cache shared between the CPU complex and the iGPU. Somewhat unexpectedly, a performance deterioration is observed already with WSS smaller than L2. However, as pointed out in Section III-C, L2 is not shared with the GPU. The observed deterioration is instead due to the HW coherency mechanisms between L2 and L3, this latter cache level being shared with the interfering GPU. Such an effect was not observed on the Tegra-based platforms, as NVIDIA solutions rely on different, SW-controlled cache coherency mechanisms, and there are no caches shared between the CPU complex and the iGPU.

Test Case B also shows that the interference at the L2 boundaries is already so severe that it almost matches the interference experienced when accessing system DRAM, especially for sequential reads (B1). Differently from the previous boards,
Fig. 5: Test results for i7-6700. Black vertical lines in (a)-(f) correspond to the L1, L2 and L3 CPU cache sizes.
Fig. 6: (a) HW prefetch impact for the tested platforms. (b) Combined interference in the X1.
[10] H. Yun, S. Gondi, and S. Biswas, "Protecting memory-performance critical sections in soft real-time applications," arXiv preprint arXiv:1502.02287, 2015.
[11] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and R. Kegley, "A predictable execution model for COTS-based embedded systems," in 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE, 2011, pp. 269–279.
[12] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, "A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC," in Proceedings of the 49th Annual Design Automation Conference. ACM, 2012, pp. 850–855.
[13] L. Sha, M. Caccamo, R. Mancuso, J.-E. Kim, M.-K. Yoon, R. Pellizzoni, H. Yun, R. Kegley, D. Perlman, G. Arundale et al., "Single core equivalent virtual machines for hard real-time computing on multicore processors," Tech. Rep., 2014.
[14] A. Rao, A. Srivastava, K. Yogesh, A. Douillet, G. Gerfin, M. Kaushik, N. Shulga, V. Venkataraman, D. Fontaine, M. Hairgrove et al., "Unified memory systems and methods," Jan. 20 2015, US Patent App. 14/601,223.
[15] B. A. Hechtman and D. J. Sorin, "Evaluating cache coherent shared virtual memory for heterogeneous multicore chips," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 118–119.
[16] NVIDIA, "NVIDIA Tegra K1 white paper: a new era in mobile computing," NVIDIA Corporation, 2014. [Online]. Available: http://www.nvidia.com/content/pdf/tegra_white_papers/tegra_k1_whitepaper_v1.0.pdf
[17] G. A. Elliott, B. C. Ward, and J. H. Anderson, "GPUSync: A framework for real-time GPU management," in Real-Time Systems Symposium (RTSS), 2013 IEEE 34th. IEEE, 2013, pp. 33–44.
[18] S. Goossens, B. Akesson, K. Goossens, and K. Chandrasekar, Memory Controllers for Mixed-Time-Criticality Systems. Springer, 2016.
[19] NVIDIA, "NVIDIA Tegra X1 white paper: NVIDIA's new mobile superchip," NVIDIA Corporation, 2015. [Online]. Available: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
[20] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty et al., "Hyper-threading technology in the NetBurst® microarchitecture," 14th Hot Chips, 2002.
[21] S. Saini, H. Jin, R. Hood, D. Barker, P. Mehrotra, and R. Biswas, "The impact of hyper-threading on processor resource utilization in production applications," in High Performance Computing (HiPC), 2011 18th International Conference on. IEEE, 2011, pp. 1–10.
[22] Intel, "The compute architecture of Intel processor graphics Gen9, v. 1.0," Intel White Paper, 2015. [Online]. Available: https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
[23] L. W. McVoy, C. Staelin et al., "lmbench: Portable tools for performance analysis," in USENIX Annual Technical Conference. San Diego, CA, USA, 1996, pp. 279–294.
[24] R. A. Starke and R. S. de Oliveira, "Impact of the x86 system management mode in real-time systems," in Computing System Engineering (SBESC), 2011 Brazilian Symposium on. IEEE, 2011, pp. 151–157.
[25] C. Maurice, N. Le Scouarnec, C. Neumann, O. Heen, and A. Francillon, "Reverse engineering Intel last-level cache complex addressing using performance counters," in International Workshop on Recent Advances in Intrusion Detection. Springer, 2015, pp. 48–65.
[26] Intel, "Intel 64 and IA-32 architectures optimization reference manual," Intel Corporation, 2016. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf