Scheduling Threads For Constructive Cache Sharing On CMPs
ABSTRACT
In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many
multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads
share a largely overlapping working set. In this paper, we
compare the performance of two state-of-the-art schedulers
proposed for fine-grained multithreaded programs: Parallel
Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is
a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance
improvement relative to WS for several fine-grain parallel
benchmarks on projected future CMP configurations; we
also report several issues that may limit the advantage of
PDF in certain applications. These results also indicate that
PDF more effectively utilizes off-chip bandwidth, making it
possible to trade off on-chip cache for a larger number of
cores. Moreover, we find that task granularity plays a key
role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based
on a new working set profiling algorithm that is an order
of magnitude faster than previous approaches. This is the
first paper demonstrating the effectiveness of PDF on real
benchmarks, providing a direct comparison between PDF
and WS, revealing the limiting factors for PDF in practice,
and presenting an approach for overcoming these factors.

General Terms: Algorithms, Measurement, Performance

Keywords: Chip Multiprocessors, Scheduling Algorithms, Constructive Cache Sharing, Parallel Depth First, Work Stealing, Thread Granularity, Working Set Profiling
1. INTRODUCTION

Chip multiprocessors (CMPs) are emerging as the design of choice for harnessing performance from future multibillion transistor chips. All the major chip manufacturers
have made the paradigm shift to focus on producing chips
with multiple cores. Unlike wide-issue single-core designs
that have run into power, thermal flux, and instruction-level
parallelism (ILP) limitations, CMPs allow for both performance and power scalability across future-generation semiconductor fabrication processes. Projections indicate that
by 2015, there will be 64 to 128 cores integrated into a single chip [11].
To effectively exploit available parallelism, CMPs must
address contention for shared resources [18, 42]. In particular, CMPs share two precious hardware resources among the
cores: (i) on-chip memory, and (ii) pin bandwidth. CMPs
are limited by a fixed chip area budget that is divided mainly
between processing cores and memory (i.e., cache). Consequently, techniques that improve the efficiency of cache resources enable processor architects to devote less chip area to
caches and more to processing cores, which, in turn, enables
the CMP to exploit more parallelism. Similarly, in CMPs,
reuse of data cached on-chip is especially important in reducing contention for the limited memory bandwidth, which
must be shared among the processing cores. Moreover, with
a continued increase in the processor/memory performance
gap, off-chip accesses incur higher penalties,
making performance increasingly sensitive to effective on-chip caching.
Many multithreaded programs provide opportunities for
constructive cache sharing, where concurrently scheduled
threads share a largely overlapping working set, as opposed
to destructive competition for cache space among threads.
In this paper, we evaluate the impact of thread scheduling algorithms on on-chip cache sharing for multithreaded
programs. Many parallel programming languages and runtime systems use greedy thread scheduling algorithms to
maximize processor utilization and throughput. We compare the performance of two state-of-the-art greedy schedulers: Parallel Depth First (PDF) [5, 6], a recently proposed
scheduler designed for constructive cache sharing, and Work
Stealing (WS), a popular scheduler that takes a more traditional approach. Analytical bounds [5, 10, 8] imply that
PDF should outperform WS on CMPs with shared caches;
however, there has been no direct comparison on real benchmarks. We study a variety of benchmark programs on projected future CMPs through simulations. Our experimental
results show that:
For several application classes, PDF enables significant constructive cache sharing among threads, thus
improving performance and reducing off-chip traffic
compared to WS. In particular, PDF provides a performance speedup of 1.3-1.6X and an off-chip traffic
reduction of 13-41% relative to WS for parallel divide-and-conquer programs and bandwidth-limited irregular programs. Perhaps surprisingly, in such cases, PDF
running on a CMP architecture with a relatively slow
monolithic shared L2 cache retains its advantage over
WS even when WS is run on a CMP architecture with
a faster distributed L2 cache.
For several other application classes, PDF and WS
have similar performance, either because there is only
limited data reuse that can be exploited or because
the programs are not limited by off-chip bandwidth.
In the latter case, PDF reduces the working set size,
which has other benefits (see Section 2).
Task granularity plays a key role in CMP cache performance. To help programmers and system designers
cope with this issue, we present an automatic approach
for selecting effective task grain sizes, based on a new
working set profiling algorithm that is an order of magnitude faster than previous approaches.
This is the first paper demonstrating the effectiveness of
PDF for constructive cache sharing on real benchmarks, providing a direct comparison between PDF and WS, revealing
the limiting factors for PDF in practice, and presenting an
approach for overcoming these factors.
The rest of the paper is organized as follows. Section 2
elaborates on the benefits of constructive cache sharing and discusses related work. Section 3 describes the PDF and WS schedulers in detail. Section 4 describes our experimental methodology, and Section 5 presents our detailed experimental study.
Focusing on the task granularity issue observed in the study,
Section 6 proposes and evaluates a technique to automatically select task granularity for a class of parallel applications. Finally, Section 7 concludes the paper.
2.2 Related Work
Figure 1: Scheduling parallel Mergesort using WS and PDF: Picturing the misses. Each horizontal box is a
sorted array of records, where at each level, pairs of arrays from the previous level are merged until all the
records are sorted. The L2 hits and misses are shown for sorting an array of CP bytes, where CP is the size
of the shared L2 cache, using 8 cores.
time systems, their micro-benchmark results do indicate that
intelligent co-scheduling of cooperative threads can reduce
the number of L2 misses substantially.
Philbin et al. [30] studied the possibility of reducing cache
misses for sequential programs through intelligent scheduling
of fine-grained threads. Their approach relies on memory access hints in the program to identify threads that should execute in close temporal proximity in order to promote cache
re-use. Although the scheduler is not directly applicable
to parallel scheduling, the approach may be a useful preprocessing step for the PDF scheduler, which relies on the
program having a cache-friendly sequential schedule.
Table 2: Default configurations.

  Number of cores         1    2    4    8    16   32
  Technology (nm)         90   90   90   65   45   32
  L2 cache size (MB)      10   8    4    8    20   40
  Associativity           20   16   16   16   20   20
  L2 hit time (cycles)    15   13   11   13   19   23

[Partial data from Table 3 (the 45nm design points used in Section 5.2) also appears here; the recoverable columns are 24 cores (5MB L2, 20-way, 13-cycle hit time) and 26 cores (1MB L2, 16-way, 7-cycle hit time).]
4. METHODOLOGY
In this section, we describe our experimental methodology, focusing on the CMP design space we explore and the benchmarks we use in our study.
4.2 Benchmarks
We studied the effect of scheduling on a number of benchmarks from a variety of domains. However, here we focus on
only three of the benchmarks (LU, Hash Join, and Mergesort), as representative of common classes of benchmarks,
deferring discussions of other benchmarks to Section 5.5.
LU. LU is a representative scientific benchmark, with its
easy parallelization and small working sets. We used the
parallel LU implementation of the Cilk distribution. The
benchmark performs a recursive factorization on a dense
N×N matrix. The input matrix is partitioned into four
quadrants recursively, until the size of the quadrant is equal
to the block size B. A smaller block size creates a larger
number of smaller threads. The block size effectively controls the grain of parallelism, so we did not have to modify
the benchmark. (Due to trace size limitations, the largest
input size we were able to factorize is a 2K×2K matrix of
doubles, or 32MB input data. As this is smaller than the L2
cache in the 32-core default configuration, we only report
LU results for up to 16 cores.)
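The recursion the benchmark performs can be pictured with a short sketch. The following is a minimal, simplified rendering of the quadrant decomposition, not the benchmark's actual code: it omits pivoting, assumes N is a power of two, uses iterative helper kernels where the Cilk version recurses, and all names are ours. The point is the structure: the block size B (here `block`) is the cutoff at which the recursion stops, and the two triangular solves on opposite quadrants are the independent pieces a scheduler can run as parallel tasks.

```cpp
// In-place LU factorization (no pivoting) of an n x n block stored row-major
// with leading dimension lda: the serial kernel used once a quadrant reaches
// the block size.
static void base_lu(double* A, int n, int lda) {
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            A[i * lda + k] /= A[k * lda + k];
            for (int j = k + 1; j < n; j++)
                A[i * lda + j] -= A[i * lda + k] * A[k * lda + j];
        }
}

// B <- L^{-1} * B, where L is the unit lower-triangular factor stored in T.
static void lower_solve(const double* T, double* B, int h, int lda) {
    for (int i = 0; i < h; i++)
        for (int k = 0; k < i; k++)
            for (int j = 0; j < h; j++)
                B[i * lda + j] -= T[i * lda + k] * B[k * lda + j];
}

// B <- B * U^{-1}, where U is the upper-triangular factor stored in T.
static void upper_solve(const double* T, double* B, int h, int lda) {
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            for (int k = 0; k < j; k++)
                B[i * lda + j] -= B[i * lda + k] * T[k * lda + j];
            B[i * lda + j] /= T[j * lda + j];
        }
}

// C <- C - A * B (Schur complement update of the bottom-right quadrant).
static void schur_update(const double* A, const double* B, double* C, int h, int lda) {
    for (int i = 0; i < h; i++)
        for (int k = 0; k < h; k++)
            for (int j = 0; j < h; j++)
                C[i * lda + j] -= A[i * lda + k] * B[k * lda + j];
}

// Recursive quadrant factorization: split until the quadrant size reaches the
// block size, which controls the grain of parallelism.
void lu(double* A, int n, int lda, int block) {
    if (n <= block) { base_lu(A, n, lda); return; }
    int h = n / 2;
    double* A00 = A;
    double* A01 = A + h;
    double* A10 = A + h * lda;
    double* A11 = A + h * lda + h;
    lu(A00, h, lda, block);              // factor the top-left quadrant
    lower_solve(A00, A01, h, lda);       // these two solves are independent and
    upper_solve(A00, A10, h, lda);       // can run as parallel tasks
    schur_update(A10, A01, A11, h, lda);
    lu(A11, h, lda, block);              // factor the updated bottom-right quadrant
}
```

A smaller `block` therefore exposes more, smaller tasks, which is exactly the knob the benchmark uses to control task granularity.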
Hash Join. Hash Join is representative of many (commercial or otherwise) programs that use large irregular data
structures and benefit from large caches. We use a state-of-the-art database hash join code [15]. In an initial I/O partition phase, each of the two large input tables is partitioned
into fragments that fit within the memory buffer allocated
for the join (1GB in our study). We study the second, join
phase, which joins pairs of partitions; in current database
systems, this is the most time-consuming phase. Each partition is divided into sub-partitions that fit within the L2
cache. For each sub-partition, the keys from the build
table are placed in a hash table, which is then probed for
matches from the probe table. The matching build and
probe records are concatenated to produce outputs. While
the original code used one thread per sub-partition, we further divided the probe procedure (which typically dominates
the join phase [15]) for each sub-partition into multiple par-
allel tasks to produce finer-grained threading. Here, we report representative experiments that join a pair of build and
probe partitions that fit tightly in a 1GB memory buffer.
Every build record matches two probe records; each record is 100B and the join attribute is 4B.
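A rough sketch of the join-phase structure described above follows. It is illustrative only: the record layout, the names, and the use of std::unordered_multimap are our assumptions rather than the code of [15], and the task spawning is indicated by a comment instead of an actual runtime call.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative record layout: a 4-byte join key plus payload, roughly 100B total.
struct Record { std::int32_t key; char payload[96]; };

// Join one build/probe sub-partition pair (the build side is sized to fit in L2).
std::vector<std::pair<Record, Record>>
join_subpartition(const std::vector<Record>& build,
                  const std::vector<Record>& probe,
                  int num_probe_tasks) {
    // Build phase: hash the build keys once per sub-partition.
    std::unordered_multimap<std::int32_t, const Record*> ht;
    ht.reserve(build.size());
    for (const Record& r : build) ht.emplace(r.key, &r);

    // Probe phase: divide the probe input into chunks. In the threaded version,
    // each chunk becomes an independent fine-grained task, so the scheduler can
    // run several probe tasks against the same cache-resident hash table.
    std::vector<std::pair<Record, Record>> out;
    if (num_probe_tasks < 1) num_probe_tasks = 1;
    std::size_t chunk = (probe.size() + num_probe_tasks - 1) / num_probe_tasks;
    for (std::size_t start = 0; start < probe.size(); start += chunk) {
        std::size_t end = std::min(probe.size(), start + chunk);
        for (std::size_t i = start; i < end; i++) {        // one probe task
            auto range = ht.equal_range(probe[i].key);
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(*it->second, probe[i]);   // concatenate the matching pair
        }
    }
    return out;
}
```

Because every probe task of a sub-partition reads the same cache-resident hash table, co-scheduling those tasks lets them share that working set constructively.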
Mergesort. Mergesort is representative of many programs that use a recursive divide-and-conquer paradigm.
Our Mergesort benchmark is structured after libpmsort [40],
a parallel recursive mergesort library, but with the serial
merging of two sorted sub-arrays modified to use a parallel
merge instead. We select a total of k splitting points from
the two sorted sub-arrays, choosing k to optimize the cache performance at the given level of recursion (details in Section 5.4 and Section 6.2). For each
chosen value from one array, we locate the closest value in
the other array using binary search. In this way, we create k
pairs of array chunks, which can be merged in parallel. An
example is given in Figure 1.
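The parallel merge can be sketched as follows. This is our own minimal illustration (using ints and the C++ standard library) rather than the libpmsort-based benchmark code, and the splitting points are simplified to evenly spaced positions in one input.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Merge two sorted arrays into out by creating k chunk pairs that can be
// merged independently (and, in the benchmark, in parallel). Splitting points
// are taken from a; a binary search (std::lower_bound) locates the matching
// cut in b, so everything in chunk i is <= everything in chunk i+1.
void parallel_merge(const std::vector<int>& a, const std::vector<int>& b,
                    std::vector<int>& out, int k) {
    out.resize(a.size() + b.size());
    if (k <= 1 || a.empty() || b.empty()) {                 // trivial cases: one chunk
        std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
        return;
    }
    std::size_t a_prev = 0, b_prev = 0, o_prev = 0;
    for (int i = 1; i <= k; i++) {
        std::size_t a_cut = (i == k) ? a.size() : a.size() * i / k;
        std::size_t b_cut = (i == k)
            ? b.size()
            : static_cast<std::size_t>(std::lower_bound(b.begin(), b.end(), a[a_cut]) - b.begin());
        // Each chunk pair below is independent; the benchmark spawns it as a task.
        std::merge(a.begin() + a_prev, a.begin() + a_cut,
                   b.begin() + b_prev, b.begin() + b_cut,
                   out.begin() + o_prev);
        o_prev += (a_cut - a_prev) + (b_cut - b_prev);
        a_prev = a_cut;
        b_prev = b_cut;
    }
}
```

In the benchmark, each of the k chunk-pair merges runs as a separate task, and k is tuned for cache performance as discussed in Sections 5.4 and 6.2.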
5. EXPERIMENTAL STUDY
5.1
[Figures 2 and 3 appear here. Figure 2 plots execution time (cycles) and L2 misses per instruction for (a, b) LU, (c, d) Hash Join, and (e, f) Mergesort under PDF and WS, against the number of cores for the default configurations. Figure 3 plots execution time for Hash Join and Mergesort under PDF and WS against the number of cores (1-26) for the 45nm-technology design points.]
Figure 2(e) shows that PDF outperforms WS for Mergesort with 2-32 cores. Figure 2(f) depicts the L2 misses per instruction ratios. Similar to Hash Join, PDF reduces L2 misses per instruction by 13.8%-40.6% compared to WS. Comparing Figure 2(b), Figure 2(d), and Figure 2(f), we see that the L2 misses per instruction ratio of Mergesort (around 0.1%) is much lower than that of Hash Join (around 0.6%), but is still high enough compared to LU (around 0.01%) to make a difference in performance. We can clearly see the trend that the larger the L2 misses per instruction ratio, the larger the impact constructive cache sharing may have, and therefore the larger the relative performance benefit of PDF over WS. Moreover, unlike Hash Join, Mergesort reaches only up to 71.0% memory utilization due to its lower misses per instruction ratio, and thus its absolute speedup continues to increase dramatically from 16 to 32 cores.
Considering the performance results of the three benchmarks, we conclude that PDF achieves significantly better
performance than WS for a group of important applications that have non-trivially large working sets, as evidenced
by the L2 misses per instruction ratios. Because LU does
not differentiate between PDF and WS, we focus on Hash Join and
Mergesort in the rest of the experimental study.
5.2
[Figure 4: Varying L2 cache hit time with the 16-core default configuration. The panels plot execution time (cycles) under PDF and WS for L2 hit times between 7 and 19 cycles; panel (b) shows Mergesort.]

[A companion figure varies the main memory latency from 100 to 1100 cycles, again plotting execution time under PDF and WS; panel (b) shows Mergesort.]

[Figure 6: Varying the task granularity of parallel Mergesort for default configurations.]

Figure 3 shows the execution time of Hash Join and Mergesort using PDF and WS, for 1-26 cores under the 45nm process technology. As shown previously in Table 3, the L2 cache size decreases from 48MB with 1 core to 1MB with 26 cores. We examine Figure 3 for two purposes: (i) comparing the performance of PDF and WS; and (ii) understanding the impact of PDF on the choice of CMP design points. For the first purpose, as shown in Figure 3, we see that PDF wins across all the CMP configurations, achieving a 1.06-1.64X speedup over WS for Hash Join and a 1.03-1.11X speedup for Mergesort.
For the second purpose, as shown in Figures 3(a) and (c), the major trend is that all the curves generally decrease as the number of cores increases, meaning that application performance generally improves with more cores. With 10 or more cores, the curves seem flat. However, when we zoom in on the 10-26 core performance in Figures 3(b) and (d), we see that the curves actually vary. The Hash Join curves reach their lowest points around 18 cores and then go up, while the Mergesort curves continue to decrease until 24 or 26 cores. With 18 or more cores, Hash Join utilizes over 95% of the main memory bandwidth. Increasing the number of cores while decreasing the cache size only makes the situation worse, leading to worse performance. In contrast, Mergesort is not bounded by memory bandwidth and its performance improves with more cores (see footnote 4).

When making a design choice in the CMP design space, a typical goal is to optimize the performance of a suite of benchmark applications (e.g., SPEC) measured by aggregate performance metrics. Compared to WS, PDF provides greater freedom in the choice of design points in order to achieve a desired level of performance (e.g., more than K times faster than a reference configuration) for a multithreaded application. For example, for Hash Join, 10-26 cores with PDF achieve similar or better performance than the best WS performance. Similarly, 20-26 cores with PDF achieve similar or better performance than the best WS performance for Mergesort. In this way, designers are able to make better trade-offs balancing sequential vs. multithreaded programs, and computation-intensive vs. memory-intensive programs.

Footnote 4: From 24 cores (5MB cache) to 26 cores (1MB cache), Mergesort with PDF experiences a large jump (a 41% increase) in the L2 misses per instruction ratio, which results in the jump in execution time.
5.3
5.4
the working set by the number of cores P does not affect performance, because the aggregate working set still fits in the shared cache, and hence WS matches PDF's performance. As discussed in Section 5.1, the small working set often manifests itself as a low L2 misses per instruction ratio. Our results suggest that this ratio needs to be on the order of 0.1% or more for PDF to make a significant difference in execution time. However, even for these applications, PDF can be valuable. As pointed out in Section 2, PDF's smaller working set can translate into better use of on-chip real estate and reduced power consumption. Additionally, when multiple programs are run at the same time, the PDF version is less of a cache hog and its smaller working set is more likely to remain in the cache across context switches.
Third, PDF achieves significant performance gains over
WS across a large number of CMP design points and architectural parameters as shown in the above subsections.
Because of this, PDF provides greater freedom in the choice
of design points in order to achieve a desired level of benchmark performance.
Finally, most benchmark programs, as written, use such
coarse-grained threading that there is no opportunity for
cache-friendly scheduling. In fact, many programs use a
scheduler only at the beginning of the program: for P processors, P threads are spawned at the beginning of the program and no further spawning is done. Thus, in our study,
we incorporated much finer-grained threading in the programs (e.g. Hash Join and Mergesort) we studied. Our
work quantifies the benefits of more fine-grained threading.
Because of its importance, in the following section, we focus
on the problem of selecting task granularities.
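To make the contrast concrete, the following sketch shows the two threading styles: the coarse one-thread-per-core pattern described above, and a fine-grained recursive decomposition that leaves concurrency decisions to the scheduler. The code is hypothetical (process_slice stands in for whatever per-element work a benchmark does), not taken from any of the benchmarks.

```cpp
#include <thread>
#include <vector>

// Hypothetical per-slice work function standing in for the benchmark kernel.
void process_slice(int lo, int hi) { (void)lo; (void)hi; /* ... work on [lo, hi) ... */ }

// Coarse-grained pattern: spawn exactly P threads once, give each a fixed
// slice of the n-element input, and never spawn again. The scheduler has no
// further decisions to make, so it cannot co-schedule work for cache sharing.
void coarse_grained(int P, int n) {
    std::vector<std::thread> workers;
    for (int t = 0; t < P; t++)
        workers.emplace_back(process_slice, t * n / P, (t + 1) * n / P);
    for (auto& w : workers) w.join();
}

// Fine-grained alternative: expose many small tasks by recursive splitting
// down to a grain-size cutoff, and let the runtime scheduler (WS or PDF)
// decide which tasks run concurrently and on which cores.
void fine_grained(int lo, int hi, int grain) {
    if (hi - lo <= grain) { process_slice(lo, hi); return; }
    int mid = lo + (hi - lo) / 2;
    // A task scheduler would execute these two calls as parallel tasks;
    // here they are written sequentially to keep the sketch self-contained.
    fine_grained(lo, mid, grain);
    fine_grained(mid, hi, grain);
}
```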
6. SELECTING TASK GRANULARITIES

6.2
[A table appears here with columns Calling Location (File, Line), Param, and Threshold; its entries did not survive extraction.]
[A figure appears here comparing three threshold-selection schemes, labeled "previous", "cache/(2*cores) dag", and "cache/(2*cores) actual", on 8, 16, and 32 cores; the plotted values range from 1.00 to 1.10.]
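The details of the profiling-based selection are largely lost in the extraction above, so the following is only a rough illustration of the kind of threshold the figure's labels suggest, not the paper's algorithm: a task-granularity cutoff chosen so that one task's data roughly fits within a cache/(2*cores) budget. All names and the exact formula are our assumptions.

```cpp
#include <cstddef>

// Hypothetical helper: pick a recursion cutoff so that one task's data
// roughly fits within its share of the shared L2 cache. The cache/(2*cores)
// budget mirrors the threshold labels in the figure above; the names and the
// factor of 2 are illustrative only.
std::size_t pick_grain(std::size_t cache_bytes, int cores,
                       std::size_t bytes_per_element) {
    std::size_t budget = cache_bytes / (2 * static_cast<std::size_t>(cores));
    std::size_t grain = budget / bytes_per_element;
    return grain > 0 ? grain : 1;
}

// Usage inside a recursive divide-and-conquer routine: keep splitting (and
// spawning parallel tasks) while the subproblem exceeds the grain, and fall
// back to serial code below that size. For example, with a 20MB L2, 16 cores,
// and 8-byte elements, pick_grain(20u << 20, 16, 8) yields a cutoff of
// 81920 elements per leaf task.
```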
7. CONCLUSION
The advent of Chip Multiprocessor (CMP) platforms requires a re-evaluation of standard practices for parallel computing. While traditional Symmetric Multiprocessor (SMP) configurations (where the main system limitation was often coherence traffic) encouraged coarse-grained parallelism to reduce sharing, the off-chip bandwidth bottleneck of CMP machines encourages fine-grained parallelism to increase resource sharing.
This study demonstrates that the Parallel Depth First
(PDF) scheduler, which was designed to encourage cooperative threads to constructively share cache, either matches or
outperforms the Work Stealing (WS) scheduler on a CMP
machine for all the fine-grained parallel programs studied.
By making more effective use of cache resources, the PDF
scheduler also broadens the design space for microprocessor
designers, potentially enabling the inclusion of more cores
at the expense of cache resources that are less critical given
PDF. Finally, task granularity plays a key role in CMP cache
performance, and we present an automatic approach for selecting effective task grain sizes when using PDF.
8. ACKNOWLEDGMENTS
9. REFERENCES