Memory Interference Characterization Between CPU Cores and Integrated GPUs in Mixed-Criticality Platforms
Abstract—Most of today's mixed-criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphics Processing Unit (iGPU, the device) for access to central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access for both clients through often undisclosed or non-priority-driven mechanisms. This aspect becomes critical when the iGPU is a high-performance, massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel paradigms for memory-centric scheduling mechanisms. We analyzed different well-known and commercially available platforms in order to estimate variations in throughput and latency under various memory access patterns, both at the host and at the device side.
I. INTRODUCTION

Modern Systems on Chip (SoCs) integrate within a single chip substrate many functionalities that are usually fabricated as distinct entities in more traditional designs, such as laptops or desktop computers. Examples of these integrated functionalities commonly featured in embedded boards are the CPU complex (i.e., multi-core processors), the integrated GPU, and their respective memory interfaces. Each core of the CPU complex and the iGPU can process tasks in parallel, as they are independent compute units. However, contention may occur at the memory interface level. More specifically, CPU cores and the iGPU might share common cache levels, hence experiencing self-eviction phenomena. In addition, the system memory (usually DRAM) also represents a contended resource for all memory controller clients experiencing cache misses at their Last Level Cache (LLC). It is mandatory to accurately measure the impact of such contention in mixed-criticality platforms, as memory contention poses a significant threat to the Worst Case Execution Times (WCETs) of memory-bounded applications, as will be shown in this study. The contribution of this paper is to provide accurate measurements of both intra-CPU-complex memory interference and iGPU activity. We will show how such interference impacts throughput and latency both on the host side and on the iGPU device.

The ultimate purpose of this analysis is to highlight the need for accurate memory-centric scheduling mechanisms to be set up for guaranteeing prioritized memory accesses to the Real-Time critical parts of the system. Special emphasis will be put on the memory traffic originated by the iGPU, as it represents a very popular architectural paradigm for computing massively parallel workloads at impressive performance-per-Watt ratios [1]. This architectural choice (commonly referred to as General Purpose GPU computing, GPGPU) is one of the reference architectures for future embedded mixed-criticality applications, such as autonomous driving and unmanned aerial vehicle control.

In order to maximize the validity of the presented results, we considered different platforms featuring different memory controllers, instruction sets, data bus widths, cache hierarchy configurations and programming models:
i) the NVIDIA Tegra K1 SoC (TK1), using CUDA 6.5 [2] for GPGPU applications;
ii) the NVIDIA Tegra X1 SoC (TX1), using CUDA 8.0; and
iii) the Intel i7-6700 SoC featuring an HD 530 integrated GPU, using OpenCL 2.0 [3].

This paper is organized as follows: Section II presents a brief, up-to-date review of previous studies regarding memory contention in integrated devices. Section III includes a thorough description of the platforms being characterized. Section IV describes the experimental framework and the obtained results. Section VI concludes the paper.
II. RELATED WORK

As soon as the processor industry introduced the concept of multi-core CPUs, memory contention was observed to become a potential bottleneck, mostly due to bus contention and cache pollution phenomena [4]. Later studies [5], [6], [7] successfully identified methodologies to bound the delays experienced by Real-Time tasks due to memory access contention. Memory arbitration mechanisms have recently been proposed to decrease the impact of memory contention in the design of critical applications, implementing co-scheduling mechanisms for the memory and processing bandwidth of multiple host cores [8]. Examples of such memory arbitration mechanisms are MEMGUARD [9], BWLOCK [10] and the PREM execution model [11]. The previously cited contributions are instrumental to our work, because they aim at understanding the location of the contention points within the memory hierarchy with respect to the platforms we used in our tests. Moreover, such contention points might differ according to the analyzed COTS (Commercial Off The Shelf)
systems, hence the need to qualitatively characterize the recently commercialized platforms we used in our experiments (detailed in the next section). Since the integrated GPUs of the observed SoCs are most likely used to perform computations with strict Real-Time requirements, it is important to estimate the impact of unregulated CPU memory accesses during the execution of Real-Time GPU applications. It is also easy to see that, in a mixed-criticality system, the iGPU can execute non-critical applications during the same time windows in which one or more CPU cores are executing Real-Time tasks, hence the need to observe the impact of unregulated GPU activity on CPU memory access latencies. CPU-GPU co-run interference is a topic that was not treated in the previously cited works, but was briefly explored in [12]. In a simulated environment, the authors of this latter contribution highlighted a decrease in instructions per cycle for CPU activity, together with a decrease in memory bandwidth on the GPU side. In our contribution, we therefore provide more accurate measurements on commercially available SoCs.
III. SOCs SPECIFICATIONS AND CONTENTION POINTS

In this section, we provide the necessary details on the platforms we selected for our analysis. This is instrumental for understanding where memory access contention might happen on each SoC. The preliminary investigation of plausible contention points will help us better understand the experimental results and identify future solutions [13]. The delays due to the private memory of the iGPU are not considered in this study, which focuses only on the memory shared with the CPU complex. Private memory contention may be a significant issue for discrete GPUs that feature a large number of execution units (i.e., Streaming Multiprocessors in NVIDIA GPUs). Today's iGPUs, however, are much smaller than their discrete counterparts. As a consequence, parallel kernels are typically executed one at a time, without generating contention at the private memory level.

Another important abstract component on the iGPU side is the Copy Engine (CE), i.e., a DMA engine used to bring data from CPU to GPU memory space. In discrete devices, this basically translates into copying memory from system DRAM through PCIe towards the on-board RAM of the graphics adapter (Video RAM, VRAM). In the case of embedded platforms with shared system DRAM, using the CE basically means duplicating the same buffer on the same memory device. Both the CUDA and OpenCL programming models specify alternatives to the CE approach that avoid explicit memory transfers and unnecessary buffer replications, such as CUDA UVM (Unified Virtual Memory [14]) and OpenCL 2.0 SVM (Shared Virtual Memory [15]). However, these approaches introduce CPU-iGPU memory coherency problems when accessing the same shared memory buffer, so that avoiding copy engines does not necessarily lead to performance improvements.¹ For this reason, we will characterize the contention originated in both CE- and non-CE-based models.

¹ Visible in the results of these experiments: https://github.com/Sarahild/CudaMemoryExperiments/tree/master/MemCpyExperiments
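To make the two models contrasted above concrete, the following minimal CUDA sketch shows an explicit CE-based round trip next to a UVM allocation; the kernel, sizes and launch geometry are illustrative assumptions, not the benchmark code used in this paper.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Trivial kernel standing in for any GPGPU workload.
__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // CE-based model: the buffer exists twice in the same system DRAM,
    // and the Copy Engine replicates it via DMA in both directions.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CE DMA transfer
    scale<<<(N + 255) / 256, 256>>>(d, N);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // CE DMA transfer

    // Non-CE model (CUDA UVM): one allocation visible to host and device;
    // coherence is handled by the driver, with the caveats noted above.
    float *u;
    cudaMallocManaged(&u, bytes);
    scale<<<(N + 255) / 256, 256>>>(u, N);
    cudaDeviceSynchronize();  // make the result visible to the CPU
    printf("%f\n", u[0]);

    cudaFree(d);
    cudaFree(u);
    free(h);
    return 0;
}
```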
A. NVIDIA Tegra K1

The NVIDIA Tegra K1 [16] is a hybrid SoC featured in the NVIDIA Jetson development board. It is the first mobile processor to offer the same advanced features and architecture as a modern desktop GPU while retaining the low power draw of a mobile chip (365 GFlops single-precision peak performance at < 11 W). The most relevant parts of this platform and its notable contention points are visible in Figure 1(a).

The K1 SoC consists of a quad-core 2.3 GHz ARM Cortex-A15 CPU (32 KB I-cache + 32 KB D-cache L1 per core, 2 MB L2 cache common to all cores); the ARM A15 belongs to the ARMv7-A family of RISC processors and features a 32-bit architecture. Although not shown in the picture, an ARM Cortex-A15 shadow core is also present for power-saving policies. We consider all cores to be clocked at their maximum operative frequencies. A single CPU core can utilize the maximum bandwidth available to the whole CPU complex, which amounts to almost 3.5 GB/s for sequential read operations. The iGPU is a Kepler-generation “GK20a” with 192 CUDA cores grouped in a single Streaming Multiprocessor (SM). As visible from Figure 1(a), the compute pipeline of an NVIDIA GPU includes engines responsible for computations (Execution Engine, EE) and engines responsible for high-bandwidth DMA memory transfers (Copy Engine, CE), as explained in [2] and [17]. In the TK1 case, the EE is composed of a single SM, which is able to access system memory in case of L2 cache misses. The CE is meant to replicate CPU-visible buffers to another area of the same system DRAM that is visible only to the GPU device. It does so by exploiting high-bandwidth DMA transfers that can reach up to 12 GB/s out of the 14 GB/s theoretical bandwidth offered by the system DRAM. Therefore, the GPU alone is able to saturate the available system DRAM bandwidth. Specifically regarding the system DRAM, the K1 features 2 GB of LPDDR3 64-bit SDRAM working at (maximum) 933 MHz.
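A transfer-bandwidth figure like the 12 GB/s quoted above can be estimated by timing a large pinned-memory copy with CUDA events; the sketch below is a minimal version of such a probe, with the buffer size chosen arbitrarily rather than taken from this paper's experimental setup.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256UL << 20;  // 256 MiB test buffer (assumption)
    float *h, *d;
    cudaMallocHost(&h, bytes);         // pinned, so the CE can DMA it
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CE at work
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("CE bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```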
The first contention point is represented by the LLC of the CPU complex. This cache is 16-way set-associative with a line size of 64 B. Contention may happen when more than one core fills the L2 cache, evicting pre-existing cache lines used by other cores. Another contention point for the LLC is indicated as point 3 in Figure 1(a), and it is due to the coherency mechanisms between CPU and iGPU when they share the same address space. In the GK20a iGPU, such coherence is taken care of by unclear and undisclosed software mechanisms within the (NVIDIA proprietary) GPU driver. Hardware cache coherence mechanisms take place only at the CPU complex level.

The remaining contention points (2 and 4 in Figure 1(a)) are represented by the memory bus and the EMC (Embedded Memory Controller) arbitrating accesses to the underlying DRAM banks. Such contention is of utmost importance, as it is caused by parallel memory accesses from individual CPU cores and the iGPU. We refer to [18] for a finer-grained discussion of the effects of bank parallelism in DRAM devices.

B. NVIDIA Tegra X1

The NVIDIA Tegra X1 [19] is a hybrid System on Module (SoM) featured in the newest NVIDIA Jetson development board. It is the first mobile processor to feature a chip powerful enough to sustain the visual computing load of autonomous and assisted driving applications while still presenting a contained power consumption (1 TFlops single-precision peak performance, drawing from 6 to 15 W).
Fig. 1: A simplified overview of the Tegra K1 (a) and X1 (b) SoCs, with notable memory access contention points numbered from 1 to 4: (1) contention on the L2 cache shared by the 4 cores; (2) contention on the bus to central memory from different clients; (3) coherency protocol on the LLC; (4) access arbitration on the main memory controller.
Also for this platform, the most relevant components and notable contention points are shown in Figure 1(b).

The X1 CPU and GPU complex consists of a quad-core 1.9 GHz ARM Cortex-A57 CPU (48 KB I-cache + 32 KB D-cache L1 per core, 2 MB L2 cache common to all cores); the ARM A57 belongs to the ARMv8-A family of RISC processors and features a 64-bit architecture. Even if not visible in the figure, the CPU complex features a big.LITTLE architecture, with four additional ARM Cortex-A53 little cores for power-saving purposes. As for the K1, we will not analyze the performance of this board under power-saving regimes. A single CPU core can utilize the maximum bandwidth available to the whole CPU complex, which amounts to almost 4.5 GB/s for sequential read operations. The iGPU is a second-generation Maxwell “GM20b” with 256 CUDA cores grouped in two Streaming Multiprocessors (SMs). Its L2 is twice the size of its Kepler-based predecessor's. The EE and CE can access central memory with a maximum bandwidth close to 20 GB/s. As with the K1, this high-performance iGPU can also saturate the whole DRAM bandwidth. The system DRAM consists of 4 GB of LPDDR4 64-bit SDRAM working at (maximum) 1.6 GHz, reaching a peak ideal bandwidth of 25.6 GB/s. With relation to the contention points, there are no substantial differences with respect to the K1.
C. Intel i7-6700

The Intel i7-6700 processor presents noticeable differences with respect to the two boards described in the previous paragraphs. This SoC features a quad-core CPU complex with Hyper-Threading (HT) technology built on the well-known x86 64-bit CISC architecture. Technical points of interest for this platform are depicted in Figure 2.

This specific processor belongs to the Skylake (6th) generation of Intel CPUs. Such processors are common in desktop or laptop configurations, with a SoC power consumption around 65 W, higher than in the previously described boards. Another peculiar difference concerns the cache hierarchy, where, unlike the ARM-based designs, the L2 cache is not shared among physical cores. This is an architectural choice that Intel has adopted since the Nehalem generation (released in late 2008). Instead, each L2 cache is shared between the two logical cores of the same physical computing unit. According to Intel HyperThreading's proprietary design [20], a number of logical cores twice the number of physical cores is made available to the operating system. Hence, 8 threads can be scheduled to run concurrently. This is made possible by exploiting aggressive out-of-order execution policies and other optimization techniques aimed at improving the instruction-level parallelism of a single physical core. However, HyperThreading is also known to increase the competition for shared memory resources [21].
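In latency experiments on such a part, it matters which logical core an interfering thread lands on; the host-side sketch below shows one way to pin a thread to a chosen logical core. The assumption that logical cores i and i+4 are HT siblings reflects a common Linux enumeration on 4-core/8-thread CPUs and should be verified via /sys/devices/system/cpu/cpu*/topology.

```cuda
// Host-only code: no device kernels involved.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one logical core so that interference can be
// attributed to a known physical core (or to its HT sibling).
static int pin_to_logical_core(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    // Logical core 4 is assumed to be the HT sibling of physical core 0.
    int err = pin_to_logical_core(4);
    if (err != 0) {
        fprintf(stderr, "affinity error: %d\n", err);
        return 1;
    }
    printf("interfering thread pinned; run its read loop from here\n");
    return 0;
}
```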
Figure 2 shows the interconnections between the iGPU, each CPU core, and the interfaces to the Memory Controller (MC), the Display Controller (DC) and the PCI Express bus (PCIe). In our experimental setup, the i7 interfaces with 16 GB of 64-bit DDR4 DRAM clocked at 2133 MHz, allowing the CPU complex to reach a theoretical bandwidth of 34 GB/s and a measured bandwidth of 29 GB/s for sequential reads using Intel MLC². Both the CPU complex and the iGPU reach the memory controller through the SoC ring interconnection fabric depicted in Figure 2. The iGPU in this SoC is an Intel HD 530 belonging to the Gen9 Intel graphics architecture [22]. Differently from the NVIDIA architectures, the compute pipeline consists of an execution engine composed of a slice, divided into three sub-slices. Each of these partitions relies on 8 Execution Units (EU 0 to 7, as shown in Figure 2). Another difference w.r.t. the previous solutions is represented by the cache hierarchy, where the LLC is shared between the CPU and the GPU, with related HW coherency mechanisms. A cache miss in the GPU L3 implies using the SoC ring interconnect to access the CPU L3 and, if needed, the external memory controller. Moreover, as detailed in [22], the GPU L1 and L2 caches are read-only caches used exclusively by the graphics pipeline, and thus do not represent a significant contention point. The HD 530 also has an abstraction of a Copy Engine used to replicate data from a CPU-only visible address space to a space accessible by the iGPU. However, thanks to the CPU-GPU shared cache level, the best practice for using the iGPU in such designs is to exploit the unified memory architecture (SVM model in OpenCL 2.0).

² Intel Memory Latency Checker v. 3.1, available at https://software.intel.com/en-us/articles/intelr-memory-latency-checker
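The sequential-read figure quoted above can be approximated, far less rigorously than with MLC, by streaming through a buffer much larger than the LLC; the sketch below is a single-threaded, host-side approximation with an arbitrary buffer size, so it will underestimate the multi-channel peak.

```cuda
// Host-only code: a crude sequential-read bandwidth probe.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 512UL << 20;  // 512 MiB, far beyond any LLC
    const size_t n = bytes / sizeof(uint64_t);
    uint64_t *buf = (uint64_t *)malloc(bytes);
    memset(buf, 1, bytes);             // touch all pages before timing

    uint64_t sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; i++) sum += buf[i];  // sequential reads
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    // Print sum so the compiler cannot elide the read loop.
    printf("%.2f GB/s (sum=%llu)\n", bytes / 1e9 / s,
           (unsigned long long)sum);
    free(buf);
    return 0;
}
```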
Fig. 2: A simplified overview of the Intel i7-6700 Skylake with notable contention points: (1) represents contention on shared
L3 cache among all CPU cores; (2) contention on the shared CPU-iGPU LLC; (3) parallel access to system DRAM by all the
presented actors. (1) and (2) also include coherency overhead for CPU and iGPU traffic.
Fig. 3: Test results for Tegra K1: WSS [Byte] (log) vs. Latencies [ns]. Vertical lines in (a)-(f) correspond to L1 and L2 sizes; panels (g) and (h) show test cases C1 and C2.
situations, i.e., sequential non-interfered reads (15-18 ns) and random reads with CUDA memset interference (463 ns).

In Figure 4 (g) and (h), we see the effect of CPU activity on iGPU-related tasks (Test Case C). The relative impact on iGPU execution times is larger than the one observed on the K1. In the case of three sequentially interfering cores, CUDA memset and memcpy execution times increase by 52% (it was 30% in the K1 case). This is mostly due to the enhanced memory bandwidth of the CPU cluster in the X1 platform, together with the more aggressive prefetching mechanisms of the A57s, as will be shown in Section V-D. With random interference, the deterioration is somewhat smaller (12%), although not as small as it was in the K1 case (24%), confirming the larger memory bandwidth utilized by the CPU cluster in the X1.
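A measurement of this kind can be obtained by timing the device-side operation with CUDA events while interfering read loops run on the CPU cores; the sketch below times a CUDA memset under the assumption that the interference threads are started (and pinned) separately, with an illustrative buffer size.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64UL << 20;  // 64 MiB device buffer (assumption)
    void *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // CPU interference (sequential or random reads on 1-3 cores) is
    // assumed to be already running in separate pinned threads here.
    cudaEventRecord(start);
    cudaMemset(d, 0, bytes);          // the iGPU activity under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemset of %zu MiB took %.3f ms\n", bytes >> 20, ms);

    cudaFree(d);
    return 0;
}
```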
C. Latencies on i7-6700

Interpreting the results on the i7 is much more difficult than on the previous solutions: the x86 architecture is substantially more complex than the RISC-based architectures analyzed so far. This implies much more aggressive and counter-intuitive optimization strategies for prefetching, speculative execution and other mechanisms (e.g., System Management Mode [24] and non-trivial cache mapping policies [25]) that are not analyzed in this paper but that might affect the results.

Still, some meaningful aspects can be identified. Experiments for Test Case A are depicted in Figure 5. The latencies with no interference are significantly smaller than in the previous SoCs. For sequential reads, increasing the working set size beyond L3 only adds a few ns with respect to the latency measured within L1 (from 1 to 4-5 ns). For random reads, a significant performance decrease (about 11x) can be noticed for WSS beyond L3.
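Latency-versus-WSS curves of the kind plotted in Figures 3-5 are typically produced with a pointer-chasing loop, where each load depends on the previous one so that prefetchers cannot hide DRAM latency in the random case. The host-side sketch below illustrates the idea for a single working set size; the sizes and iteration counts are assumptions, not this paper's actual benchmark parameters.

```cuda
// Host-only code: pointer-chasing latency probe for one WSS value.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

int main() {
    const size_t wss = 8UL << 20;              // working set size: 8 MiB
    const size_t n = wss / sizeof(size_t);
    std::vector<size_t> next(n);

    // Build a random single-cycle permutation (Sattolo's algorithm), so
    // the chase visits every element exactly once per cycle.
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i;                 // j in [0, i-1]
        std::swap(next[i], next[j]);
    }

    // Chase: every load address depends on the previous load's value.
    const size_t iters = 10 * n;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t k = 0; k < iters; k++) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Print idx so the compiler cannot optimize the chase away.
    printf("avg load latency: %.2f ns (idx=%zu)\n", ns / iters, idx);
    return 0;
}
```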
Fig. 4: Test results for Tegra X1: WSS [Byte] (log) vs. Latencies [ns]. Vertical lines in (a)-(f) correspond to L1 and L2 sizes.
Increasing the number of interfering cores, contention on L1 and L2 is almost absent in the case of random interference (A3 and A4), while it is more noticeable in the case of sequential interference (A1 and A2). When reaching the L3 boundaries, a dramatic performance deterioration takes place, with latencies increasing up to 17x for sequential reads and up to 6x for random reads, both with sequentially interfering cores. A smaller impact on relative performance deterioration is noticed when the interfering cores perform random reads.

The uneven spacing between the lines in the graph is due to the fact that half of the cores are not physical cores, but logical cores enabled by Intel HyperThreading technology. These logical cores share resources with a corresponding physical core, competing for L1 and L2 access. Therefore, logical cores do not add significant contention on system DRAM w.r.t. their physical counterparts [20], [21].

Experiments for Test Case B, related to iGPU interference, are shown in Figure 5 (c) and (f). As already pointed out in Section III-C, a larger impact on latencies is expected due to the cache shared between the CPU complex and the iGPU. Somewhat unexpectedly, a performance deterioration is observed already with WSS smaller than L2. However, as pointed out in Section III-C, L2 is not shared with the GPU. The observed deterioration is instead due to the HW coherency mechanisms between L2 and L3, this latter cache level being shared with the interfering GPU. Such an effect was not observed on the Tegra-based platforms, as NVIDIA solutions rely on different, SW-controlled cache coherency mechanisms, and there are no caches shared between the CPU complex and the iGPU.

Test Case B also shows that the interference at the L2 boundaries is already so severe that it almost matches the interference experienced when accessing system DRAM, especially for sequential reads (B1). Differently from the previous boards,
Fig. 5: Test results for i7-6700. Black vertical lines in (a)-(f) correspond to the L1, L2 and L3 CPU cache sizes.
Fig. 6: (a) HW prefetch impact for the tested platforms. (b) Combined interference in the X1.
[10] H. Yun, S. Gondi, and S. Biswas, "Protecting memory-performance critical sections in soft real-time applications," arXiv preprint arXiv:1502.02287, 2015.
[11] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and R. Kegley, "A predictable execution model for COTS-based embedded systems," in 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE, 2011, pp. 269–279.
[12] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, "A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC," in Proceedings of the 49th Annual Design Automation Conference. ACM, 2012, pp. 850–855.
[13] L. Sha, M. Caccamo, R. Mancuso, J.-E. Kim, M.-K. Yoon, R. Pellizzoni, H. Yun, R. Kegley, D. Perlman, G. Arundale et al., "Single core equivalent virtual machines for hard real-time computing on multicore processors," Tech. Rep., 2014.
[14] A. Rao, A. Srivastava, K. Yogesh, A. Douillet, G. Gerfin, M. Kaushik, N. Shulga, V. Venkataraman, D. Fontaine, M. Hairgrove et al., "Unified memory systems and methods," Jan. 20 2015, US Patent App. 14/601,223.
[15] B. A. Hechtman and D. J. Sorin, "Evaluating cache coherent shared virtual memory for heterogeneous multicore chips," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 118–119.
[16] NVIDIA, "NVIDIA Tegra K1 white paper: a new era in mobile computing," NVIDIA Corporation, 2014. [Online]. Available: http://www.nvidia.com/content/pdf/tegra_white_papers/tegra_k1_whitepaper_v1.0.pdf
[17] G. A. Elliott, B. C. Ward, and J. H. Anderson, "GPUSync: A framework for real-time GPU management," in Real-Time Systems Symposium (RTSS), 2013 IEEE 34th. IEEE, 2013, pp. 33–44.
[18] S. Goossens, B. Akesson, K. Goossens, and K. Chandrasekar, Memory Controllers for Mixed-Time-Criticality Systems. Springer, 2016.
[19] NVIDIA, "NVIDIA Tegra X1 white paper: NVIDIA's new mobile superchip," NVIDIA Corporation, 2015. [Online]. Available: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
[20] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty et al., "Hyper-threading technology in the NetBurst® microarchitecture," 14th Hot Chips, 2002.
[21] S. Saini, H. Jin, R. Hood, D. Barker, P. Mehrotra, and R. Biswas, "The impact of hyper-threading on processor resource utilization in production applications," in High Performance Computing (HiPC), 2011 18th International Conference on. IEEE, 2011, pp. 1–10.
[22] Intel, "The compute architecture of Intel processor graphics Gen9, v. 1.0," Intel White Paper, 2015. [Online]. Available: https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
[23] L. W. McVoy, C. Staelin et al., "lmbench: Portable tools for performance analysis," in USENIX Annual Technical Conference. San Diego, CA, USA, 1996, pp. 279–294.
[24] R. A. Starke and R. S. de Oliveira, "Impact of the x86 system management mode in real-time systems," in Computing System Engineering (SBESC), 2011 Brazilian Symposium on. IEEE, 2011, pp. 151–157.
[25] C. Maurice, N. Le Scouarnec, C. Neumann, O. Heen, and A. Francillon, "Reverse engineering Intel last-level cache complex addressing using performance counters," in International Workshop on Recent Advances in Intrusion Detection. Springer, 2015, pp. 48–65.
[26] Intel, "Intel 64 and IA-32 architectures optimization reference manual," Intel Corporation, 2016. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf