Pattern Based Cache Coherency Architecture for embedded manycores
J. Marandola, S. Louise, L. Cudennec
Abstract
Modern parallel programming frameworks like OpenMP often rely on shared memory concepts
to harness the processing power of parallel systems. But for embedded devices, memory
coherence protocols tend to account for a sizable portion of the chip's power consumption.
This is why any means to lower this impact is important.
Our idea for this issue is to exploit the fact that most usual workloads display regular
behavior with regard to their memory accesses, in order to prefetch the relevant memory
lines into the local caches of the execution cores of a manycore system.
Our contributions are, on the one hand, the specification of a hardware IP for prefetching
along memory access patterns and, on the other hand, a hybrid protocol that extends the
classic MESI/baseline architecture to reduce the control- and coherence-related traffic by
at least an order of magnitude. Evaluations on several benchmark programs show the
potential of this approach.
Keywords: Chip Multi-Processor, Cache Coherence Protocol, Memory Access Patterns
1 Introduction
Today, the best way to improve processor performance without worsening power consumption
and heat dissipation is to increase the number of processing cores on the chip. But this
requires good parallelization of applications, and only a few programming models can cope
easily with manycores. Among them, one of the most popular is OpenMP.
Nonetheless, OpenMP relies on an execution model which states that memory is shared in a
consistent state (memory coherence) between cores. For embedded computing this is an
issue: memory coherence is a cause of timing hazards, which makes real-time applications
difficult to implement, and also a cause of increased power consumption.
Our approach is based on hardware/software co-design to optimize the cache coherence
protocol. We worked with the baseline protocol, on top of a Chip MultiProcessing (CMP)
architecture, to deploy a shared memory model on the manycore. In terms of performance, the
Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2016.
© The Authors. Published by Elsevier B.V. doi:10.1016/j.procs.2016.05.481
main goal was to obtain a better speedup in applications such as video compression and
other video and signal processing applications, as often found in embedded systems. Our
approach is based on the fact that most of these applications display very regular accesses
to shared memory, so we can use pre-compiled memory access pattern tables to achieve this
speedup.
Our optimized architecture focuses on these memory access patterns, locality, cache write
policies, and the optimization of cache coherency systems. The main contributions of our
Co-Design Cache Coherent Architecture (a.k.a. CoCCA) are:
• An Integrated Part (IP) to manage memory access patterns; the pattern table optimizes
the number of messages sent during cache coherency transactions;
• A first example of pattern format specification and pattern table structure;
• A model of transaction messages and read/write operations.
The remainder of this paper is organized as follows: Section 2 presents similar projects
in a brief state of the art, along with our motivations; Section 3 explains our architecture,
how we optimized it to take the best advantage of the regular memory access patterns that
occur in applications, its principles, and the associated protocol; Section 4 shows our
experimental evaluations; Section 5 gives an overview of related works before the conclusion.
Most of this work was done as part of a PhD thesis at CEA, LIST.
Data coherence issues occur when data are replicated in the cache memories of different
cores due to concurrent read and write operations: versions of the data may differ between
cores and with main memory. To maintain consistency, one popular approach is a four-state,
directory-based cache coherence protocol. This protocol, called the baseline protocol,
derives from the Lazy Release Consistency protocol [7], as found in some Distributed
Shared Memory systems.
For the sake of clarity, we illustrate the baseline protocol through a very simple example
involving one write request transaction. This transaction, as shown in Figure 1, triggers a
sequence of messages transmitted between the different cores. 1) The requester sends a
message to the home node in charge of keeping track of the coherency information. 2) The
home node checks the presence vector and sends an exclusive access request to all the cores
owning a copy of the data. 3) All these cores then invalidate their own copy and send an
acknowledgment back to the home node. 4) Finally, the home node grants the write permission
to the requester and possibly transfers an up-to-date version of the data.
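The four steps above can be sketched as a small simulation that enumerates the messages of one write-request transaction; the message names and node labels are illustrative, not the actual hardware encoding:

```python
# Illustrative sketch (hypothetical message names) of the baseline
# write-request transaction: requester -> home node -> sharers -> home
# node -> requester, as described in steps 1) to 4).

def baseline_write_request(sharers):
    """Return the ordered list of messages for one write request,
    given the cores currently holding a copy of the data."""
    msgs = [("requester", "home", "WRITE_REQ")]        # 1) request to HN
    for core in sharers:                               # 2) exclusive-access
        msgs.append(("home", core, "INVALIDATE"))      #    requests to sharers
    for core in sharers:                               # 3) invalidation acks
        msgs.append((core, "home", "INV_ACK"))
    msgs.append(("home", "requester", "WRITE_GRANT"))  # 4) grant (+ data)
    return msgs

# With two sharers, the transaction costs 1 + 2 + 2 + 1 = 6 messages.
print(len(baseline_write_request(["core1", "core2"])))  # -> 6
```

The cost thus grows with the number of sharers, which is why avoiding superfluous transactions matters on a manycore.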
Figure 2: Principle of the hybrid architecture with both baseline and CoCCA protocols.
The home node (HN) keeps track of the consistency state of the data so that, when a
processor needs a piece of data, it first checks the coherency state with the HN. This
task is handled by the coherence engine, which sends and receives all messages related to
shared memory accesses. The most usual way of allocating home nodes is round robin, by
taking a modulo over a subset of the low-order bits of the address of the accessed data
element.
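This round-robin allocation can be sketched as follows; the core count and line size are illustrative parameters, not values mandated by the architecture:

```python
# Sketch of round-robin home-node allocation: the HN index is obtained by
# a modulo on low-order bits of the line-aligned address. NUM_CORES and
# LINE_BITS are illustrative assumptions.

NUM_CORES = 16   # number of potential home nodes
LINE_BITS = 6    # 64-byte cache lines -> drop the 6 offset bits

def home_node(addr):
    """Home node responsible for the data element at this address."""
    return (addr >> LINE_BITS) % NUM_CORES

# Consecutive cache lines are spread round robin over the cores:
print([home_node(a) for a in range(0, 64 * 4, 64)])  # -> [0, 1, 2, 3]
```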
One question is the granularity of state tracking and HN attribution. Since memory accesses
are not distributed homogeneously for most programs, this leads to uneven bandwidth
consumption, where some cores can become hot-spots. Therefore, to reduce hot-spot issues
in cache management and to limit the number of messages of the cache coherence protocol,
we used different granularities for the baseline protocol and our pattern-based CoCCA
protocol. In our study, the baseline protocol is fine-grained (64 B, i.e. line-sized),
whereas ours is coarse-grained, at the 2 kB page level.
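The two granularities can be illustrated by the indices each protocol derives from the same address (a minimal sketch using the sizes given above):

```python
# Minimal sketch of the two coherence granularities: the baseline protocol
# tracks 64-byte lines while the CoCCA pattern protocol works on 2 kB pages.

LINE_SIZE = 64          # baseline granularity (bytes)
PAGE_SIZE = 2 * 1024    # CoCCA pattern granularity (bytes)

def line_index(addr):
    """Unit tracked by the baseline protocol."""
    return addr // LINE_SIZE

def page_index(addr):
    """Unit tracked by the CoCCA pattern protocol."""
    return addr // PAGE_SIZE

addr = 5000
print(line_index(addr), page_index(addr))  # -> 78 2
# One 2 kB page covers 2048 / 64 = 32 consecutive cache lines, so one
# coarse-grained pattern entry can stand for many fine-grained entries.
```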
several levels of the memory hierarchy. The main difference between CoCCA and other
architectures is that our IP stores the addresses in the form of a table of patterns. The
baseline protocol is used as a fallback.
Another contribution is the specification of the CoCCA protocol, with its model of
transaction messages that provides support for managing regular memory access patterns.
The associated messages are called speculative messages. The CoCCA protocol is a hybrid
protocol designed to interleave speculative messages and baseline messages through an IP
that has two purposes: storing patterns and controlling transaction messages. The
optimization of the CoCCA protocol is based on finding memory addresses of the application
that match a stored pattern. The requester sends a speculative message to the CoCCA Home
Node (CHN) if the address matches a stored pattern; otherwise, it sends a baseline message
to the ordinary Baseline Home Node (BHN). Each CPU has the following components in its
cache hierarchy: L1 caches, a shared inclusive L2 cache, a cache coherence directory, and
the “CoCCA Pattern Table” (see Fig. 2).
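The hybrid dispatch between the two home nodes can be sketched as follows; the pattern representation (base address, stride, count) and the table contents are assumptions for illustration, not the actual table format:

```python
# Hedged sketch of the hybrid dispatch: a request matching a stored pattern
# goes to the CoCCA Home Node (CHN) as a speculative message; otherwise it
# falls back to the Baseline Home Node (BHN). The (base, stride, count)
# pattern format is an assumption made for this sketch.

PATTERN_TABLE = [
    {"base": 0x1000, "stride": 64, "count": 8},   # hypothetical entry
]

def match_pattern(addr):
    """Return the matching pattern entry, or None."""
    for p in PATTERN_TABLE:
        offset = addr - p["base"]
        if 0 <= offset < p["stride"] * p["count"] and offset % p["stride"] == 0:
            return p
    return None

def route_request(addr):
    """Pick the target home node and the kind of message for a request."""
    if match_pattern(addr) is not None:
        return ("CHN", "SPECULATIVE")
    return ("BHN", "BASELINE")

print(route_request(0x1080))  # -> ('CHN', 'SPECULATIVE')  (0x1000 + 2*64)
print(route_request(0x9000))  # -> ('BHN', 'BASELINE')
```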
Figure 3: CoCCA Pattern Table lookup principle.
Figure 4: Message flow for a request on addresses @x, @y and @z: baseline protocol (9
messages exchanged in total) versus pattern-based protocol (7 messages exchanged in total).
In case of a cache miss, the modified MESI (baseline) protocol defines the flow of
transaction messages sent for the requested addresses. When the pattern table and the
CoCCA-specific speculative protocol are added and a pattern is triggered, this flow can be
optimized in terms of number of messages, because the pattern provides a means to use
speculative messages, which are fewer than in the baseline protocol. In conclusion, when a
pattern is detected with our approach, the number of transaction messages is reduced by
speculation, leading to better memory access times, lower power consumption, and an
optimized cache coherency protocol.
In Figure 4, we compare the two cache coherency approaches for a cache miss. We present
two scenarios: the baseline-only approach, where the requesting node sends requests for
the sequential addresses x, y and z, for a total of 9 messages; and the pattern approach,
where the node speculatively sends the pattern xyz, for a total of only 7 messages and an
early (speculative) prefetching of data into the cache. It is worth noting that the benefit
of the CoCCA protocol grows linearly with the number of fetches in the pattern.
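A simple cost model makes this linear benefit explicit. The per-address costs below are our assumption, chosen to be consistent with the 9-versus-7 totals of Figure 4, not figures taken from the hardware specification:

```python
# Simplified cost model (an assumption consistent with Figure 4): in the
# baseline case each of the n addresses costs 3 messages, while with a
# matching pattern a single speculative request replaces the n individual
# requests, leaving 2 messages per address plus that one request.

def baseline_messages(n):
    return 3 * n          # n independent transactions

def cocca_messages(n):
    return 1 + 2 * n      # one speculative request, then per-address replies

# Figure 4's example, n = 3 addresses (x, y, z):
print(baseline_messages(3), cocca_messages(3))  # -> 9 7
# The saving, n - 1 messages, grows linearly with the pattern length:
print([baseline_messages(n) - cocca_messages(n) for n in (3, 8, 16)])  # -> [2, 7, 15]
```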
granularity). For write sequences, the principles are the same and can be extended in the
same way. Another case, shown in Figure 7a, is a read request handled at the CoCCA HN.
Conversely, the baseline home node of Figure 7b can hand a request over to the CoCCA HN
in case of a miss.
4 Experimental evaluation
We looked at several parallel applications, mainly from the StreamIt benchmark suite [13]
(Filterbank), which are fine-grained, the SDF3 benchmark suite [12] (H.263 decoder, H.263
encoder, Rate converter), which is really coarse-grained, and a few in-house applications
(e.g. a Laplacian filter) from the ΣC programming tutorial, which are somewhat in-between.
In the following, we discuss the case of the Laplacian filter from the introduction to the
Sigma-C programming language, because it has all the characteristics of a typical embedded
parallel application: it displays a good level of parallelism (both data parallelism and
pipeline parallelism) and a heterogeneous task set, whilst remaining simple enough to
explain (at least in its simplest implementation).
The filtering is done in an unusual way here: accesses are always performed along lines
(except for the transpositions), and each of the two parallel paths works on every other
line. This implementation choice has a peculiar impact on the patterns, as we will see.
After this step, all that remains is to discard patterns corresponding to the same cache
line, as it would already have been loaded, and to check against the write patterns to
avoid prematurely reloading a pattern whose data values are unchanged.
For the Laplacian program, the line accesses of the LineFilter tasks and the (transposed)
ColumnFilter tasks touch every other line; then, for a 256 × 256 image, we have 2 times
(one for reads and one for writes) 128 short (4 cache-line) patterns. Such a case is a good
candidate for 2D stride definitions but, if required, we can use a trick to drastically
lower the number of 1D patterns: instead of defining patterns for lines (which are short
and numerous), we can define patterns for columns (i.e. the transposed patterns) and use
the fact that cache-line loading can factorize a handful of accesses, thereby lowering the
effective number of required patterns. The trick can easily be generalized: once a set of
patterns is identified, it is possible to try to lower the number of patterns by choosing
another set that is equivalent (in memory accesses) but has a lower cardinality. Accessing
128 lines of 8 contiguous cache lines is thus equivalent to accessing 8 contiguous columns
of 128 cache lines. One drawback is that the accesses are not performed along a column, so
the first line accessed may be plagued with cache misses. This can easily be overcome by
using another pattern for the first line. This way, we can transform a complex 1D pattern
set with more than 256 elements into another, simpler 1D set with only 18 elements. The
results are shown in Table 2.
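The arithmetic behind these two pattern counts can be written out explicitly. The exact breakdown of the 18-entry set (8 column patterns per direction plus one first-line pattern per direction) is our reading of the text above, so it should be taken as a hedged reconstruction:

```python
# Pattern-count arithmetic for the 256x256 Laplacian example. The
# breakdown of the 18-entry transposed set is our interpretation of the
# text, not an authoritative specification.

LINES = 256
ACTIVE_LINES = LINES // 2   # the filter tasks touch every other line
COLUMNS = 8                 # 8 contiguous columns of 128 cache lines each

# Row-wise patterns: one read and one write pattern per active line.
row_patterns = 2 * ACTIVE_LINES
print(row_patterns)  # -> 256

# Column-wise (transposed) patterns: the 8 columns for reads and writes,
# plus one extra pattern per direction to cover the first-line misses.
col_patterns = 2 * COLUMNS + 2
print(col_patterns)  # -> 18
```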
Most of the time, a few 1D patterns are enough to account for the majority of the
shared-memory accesses we experimented with, but a few 2D entries would help a lot in
reducing the number of required entries. The conclusion we can draw for streaming embedded
applications is that a mix of 2D and 1D patterns in the CoCCA IP, with 8 to 16 entries of
each kind, would easily cover most of the usual cases; for the cases it does not, we would
have been limited by the size of the cache anyway.
To conclude this study, for usual streaming applications, a few dozen entries (e.g. 32) in
our prefetching, pattern-based augmented coherence protocol account for the vast majority
of cases, even with simple 1D pattern definitions. When using 2D memory access patterns,
the number can be reduced even a bit further.
We evaluated, on Xeon Nehalem cores, the acceleration of kernels provided by the protocol
by comparing execution with and without prefetching. The prefetching induced by our
protocol can enhance performance in significant ways: 70% for the Gaussian blur, which is
heavily memory bound (the Gaussian task runs in a bit less than 4490×10³ cycles on this
core with preloaded caches, i.e. no cache miss, for 965328 shared memory accesses including
37128 write accesses and 928200 read accesses; without prefetching we have 17283 read
misses with memory and about 13000 cache sharing requests). Even for the Laplacian, which
is much less memory bound, we get around 30% speedup (limited by the worst-case tasks
LineFilter and ColumnFilter) thanks to the prefetching of our protocol. The other advantage
is that the baseline protocol (acting as fallback) is not used at all in these programs.
We expect that it would
translate into lower power consumption for the CMP equipped with our IP.
One important point is that we can limit the number of entries in the pattern tables;
therefore, only a limited amount of transistors is needed to implement them and, as the
scheme integrates with the standard MESI protocol, the overhead in both die area and power
consumption should remain low.
6 Conclusion
In this paper, we proposed to manage memory access patterns at both software and hardware
levels. A regular consistency protocol has been modified to handle speculative requests and a
new hardware component has been designed to store and retrieve patterns.
This architecture offers significant advantages for embedded manycores, as we have seen in
the evaluation: a low number of patterns can account for the majority, or even often all,
of the shared memory accesses, thus reducing the number of messages on the NoC and
prefetching into the caches of each core the data that are relevant to its computation,
thereby increasing performance significantly, from more than 20% for most of the programs
to more than 70% speedup for memory-bound programs. At the same time, we can intuitively
expect a reduction of the weight of memory coherence in the power utilization of embedded
CMPs for such
typical embedded applications. Our next step is to further evaluate the model and, if
possible, to find a partnership with a hardware-oriented laboratory to find out what we
can really expect on the power consumption side.
References
[1] Saman Amarasinghe, Michael I. Gordon, Michal Karczmarek, Jasper Lin, David Maze, Rodric M.
Rabbah, and William Thies. Language and compiler design for streaming applications. Interna-
tional Journal of Parallel Programming, 33(2/3):261–278, Jun 2005.
[2] Pascal Aubry, Pierre-Edouard Beaucamps, Frédéric Blanc, Bruno Bodin, Sergiu Carpov, Loïc
Cudennec, Vincent David, Philippe Dore, Paul Dubrulle, Benoit Dupont de Dinechin, Francois
Galea, Thierry Goubier, Michel Harrand, Samuel Jones, Jean-Denis Lesage, Stephane Louise,
Nicolas Morey Chaisemartin, Thanh Hai Nguyen, Xavier Raynaud, and Renaud Sirdey. Extended
cyclostatic dataflow program compilation and execution for an integrated manycore processor. In
Vassil N. Alexandrov, Michael Lees, Valeria V. Krzhizhanovskaya, Jack Dongarra, and Peter M. A.
Sloot, editors, ICCS, volume 18 of Procedia Computer Science, pages 1624–1633. Elsevier, 2013.
[3] Jeffery A. Brown, Rakesh Kumar, and Dean Tullsen. Proximity-aware directory-based coherence
for multi-core processor architectures. In Proceedings of the Nineteenth Annual ACM Symposium
on Parallel Algorithms and Architectures, SPAA ’07, pages 126–134. ACM, 2007.
[4] E. Debes, Y.-K. Chen, M. J. Holliman, and M. M. Yeung. Using data access patterns.
US Patent 7,395,407 B2, International Business Machines Corporation, 2008.
[5] Zhuo Huang, Xudong Shi, Ye Xia, and Jih-Kwon Peir. Apparatus and method for performing
data access in accordance with memory access patterns. Technical report, 2006.
[6] H.H.J. Hum and J.R. Goodman. Forward state for use in cache coherency in a multiprocessor
system, July 26 2005. US Patent 6,922,756.
[7] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Trans.
Comput. Syst., 7(4):321–359, November 1989.
[8] Jussara Marandola, Stephane Louise, Loic Cudennec, Jean-Thomas Acquaviva, and David A.
Bader. Enhancing cache coherent architectures with access patterns for embedded manycore
systems. In System on Chip (SoC), 2012 International Symposium on, pages 1–7, October 2012.
[9] Xudong Shi, Zhen Yang, Jih-Kwon Peir, Lu Peng, Yen-Kuang Chen, Victor Lee, and Bob Liang.
Coterminous locality and coterminous group data prefetching on chip-multiprocessors. In
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS), 2006.
[10] Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. Spatio-temporal
memory streaming. SIGARCH Comput. Archit. News, 37(3):69–80, 2009.
[11] Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas
Moshovos. Spatial memory streaming. SIGARCH Comput. Archit. News, 34(2):252–263, 2006.
[12] Sander Stuijk, Marc Geilen, and Twan Basten. SDF3: SDF for free. In Proceedings of the Sixth
International Conference on Application of Concurrency to System Design, pages 276–278, 2006.
[13] William Thies and Saman Amarasinghe. An empirical characterization of stream programs and its
implications for language and compiler design. In Proceedings of the 19th International Conference
on Parallel Architectures and Compilation Techniques, pages 365–376, 2010.
[14] Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki,
and Babak Falsafi. Temporal streaming of shared memory. SIGARCH Comput. Archit. News,
33(2):222–233, 2005.