Challenging Multi-cores
Paul Cockshott
School of Computer Science
Paul Cockshott,
And the SICSA multi-core challenge
Motivation
Moore's Law implies that as the scale of transistors shrinks, the number of gates that can be fitted onto a chip of a standard size, say of the order of 1 cm², will double every two years. Historically this has been used by processor manufacturers to increase the complexity of individual processor cores. A reduction in feature sizes potentially allows the speed of gates to rise, allowing a rise in clock speeds. This rise was pretty continuous until the last few years, since when it has leveled off.
How parallelism is changing
Higher clock speeds increase the heat dissipation per cm² due to capacitive losses. At around 3 GHz the heat losses are at the limit of what can be sustained with air cooling, even with heat pipes etc. As clock speeds rise, clock skew across the die becomes a significant factor which ultimately limits the ability to construct synchronous machines.
A result of these pressures has been that the mode of elaboration of chips has switched from making individual cores more complex to adding multiple cores to each chip. We can now expect the number of cores to grow exponentially: perhaps doubling roughly every two years. This implies that in 10 years' time a mass-produced standard PC chip could contain around 256 or 512 cores.
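The projection above is plain compound doubling. A quick check of the arithmetic, assuming a baseline of 8 or 16 cores per chip today (the baseline and the function name `projected_cores` are assumptions for illustration; the slide does not state a starting point):

```python
def projected_cores(cores_now, years, doubling_period_years=2.0):
    """Cores after `years` if the count doubles every `doubling_period_years`."""
    return int(cores_now * 2 ** (years / doubling_period_years))

# Baselines of 8 and 16 cores are assumed, not taken from the talk.
for baseline in (8, 16):
    print(baseline, "cores now ->", projected_cores(baseline, years=10), "in 10 years")
```

Five doublings in ten years multiply the core count by 32, which is where the range 256 to 512 comes from under these assumed baselines.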
The SCC
Inside the SCC
The Development Board
Need new types of languages
This growth in the number of cores and the problems of communicating between arbitrary processors is going to require a fundamental rethink in the way we design programming languages. In this talk I present Lino, a novel notation for programming arbitrarily large arrays of processors, based on abstractions over patterns of process adjacencies.
Lino Tiles and Tilings

Lino programs describe arrays of square tiles. The figure shows an atomic square tile and an array of tiles. A tile has one input stream and one output stream on each face, with inputs numbered 0..3 and outputs 4..7 in clockwise face order starting at the top (this is the convention for all face orderings). Faces are identified as North, East, South and West.
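The numbering convention can be written down directly. A minimal sketch in Python; the names `Face`, `input_stream` and `output_stream` are hypothetical and are not part of Lino:

```python
from enum import IntEnum

class Face(IntEnum):
    """Tile faces in clockwise order starting at the top."""
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3

def input_stream(face):
    """Inputs are numbered 0..3 in face order."""
    return int(face)

def output_stream(face):
    """Outputs are numbered 4..7 in the same face order."""
    return int(face) + 4

for face in Face:
    print(f"{face.name:5} input={input_stream(face)} output={output_stream(face)}")
```

So the North face carries streams 0 and 4, East 1 and 5, South 2 and 6, and West 3 and 7.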
Syntax of Lino
comm  ::= def | alias                  commands
comms ::= comm [; comms]               command sequence
prog  ::= comms ; main = exp           a program is a sequence of commands ending with a nominated main expression
def   ::= id : faces <- path           define tile id
faces ::= ((ty0,ty4),...(ty3,ty7))     I/O stream types
ty    ::= ...                          type
path  ::= ...                          file path
alias ::= id = exp                     id aliases exp

A command is a tile definition or an aliased expression. A definition provides the tile name, the types of the input and output for each face, and a path to an executable body.
Syntax continued
block     ::= [ redir [; redir] ]              shell block
redir     ::= path dirio [dirio]               redirected shell command
dirio     ::= inout direction
inout     ::= < | >                            standard redirections
direction ::= North | South | East | West      direction names
exp       ::= ...                              expressions
              id                               name of defined tile or aliased expression
More syntax
I               identity
Mirror          redirects face I/O
0               sink
( exp )         bracketing for priority
exp1 | exp2     process row
exp1 _ exp2     process column
exp * int       horizontal replication
exp ^ int       vertical replication
Flip exp        reflection about vertical axis
Rotate exp      rotate 90 degrees clockwise

As in:
[Figure: the special tiles with numbered face streams: a) identity, b) mirror, c) null]
[Figure: the transformations with numbered face streams: a) flip, b) rotate]
Lino programs
A program is a sequence of commands ending with a nominated main expression. A command is a tile definition or an aliased expression. A definition provides the tile name, the types of the input and output for each face, and a path to an executable body.
Transform rules
rule  input          output
1     e*1            e
      e*N            e|(e*(N-1))
2     e^1            e
      e^N            e_(e^(N-1))
3     Flip I         I
      Flip Mirror    Mirror
      Flip 0         0
      Flip (e|f)     (Flip f)|(Flip e)
      Flip (e_f)     (Flip e)_(Flip f)
Transform rules continued
rule  input                            output
4     Rotate I                         I
      Rotate Mirror                    Mirror
      Rotate 0                         0
      Rotate (e|f)                     (Rotate f)_(Rotate e)
      Rotate (e_f)                     (Rotate f)|(Rotate e)
5     Flip Flip e                      e
6     Rotate Rotate Rotate Rotate e    e
7     (a_b)|(c_d)                      (a|c)_(b|d)
What the rules mean

The rules shown apply to expressions. Horizontal and vertical replication apply a fixed (and known) number of times (1 and 2). Flip and rotate preserve identity, mirror and null tiles (3 and 4). Flipping a row creates a row of flipped elements in reverse order; flipping a column creates a column of flipped elements (3). Rotating a row creates a column of rotated elements; rotating a column creates a row of rotated elements in reverse order (4). Two flips cancel (5). Four rotates cancel (6). Columns distribute over rows (7).
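The replication and flip rules (1, 2, 3 and 5) are easy to try out as a term rewriter. A minimal sketch in Python, not the Lino implementation: the tuple encoding and the function names are hypothetical, and vertical replication is assumed to build a column by analogy with rule 1. Rotate is left out.

```python
# Hypothetical encoding of Lino expressions:
#   "I", "Mirror", "0" or a tile name    atomic tile (string)
#   ("row", a, b)    a | b               ("col", a, b)    a _ b
#   ("hrep", e, n)   e * n               ("vrep", e, n)   e ^ n
#   ("flip", e)      Flip e
FIXED_BY_FLIP = ("I", "Mirror", "0")

def expand(e):
    """Rules 1 and 2: unfold replication into explicit rows and columns."""
    if isinstance(e, str):
        return e
    op = e[0]
    if op in ("hrep", "vrep"):
        body, n = expand(e[1]), e[2]
        join = "row" if op == "hrep" else "col"
        result = body                      # e*1 and e^1 are just e
        for _ in range(n - 1):
            result = (join, body, result)  # e*N is e|(e*(N-1))
        return result
    if op == "flip":
        return ("flip", expand(e[1]))
    return (op, expand(e[1]), expand(e[2]))

def flip(e):
    """Flip an already simplified expression (rules 3 and 5)."""
    if isinstance(e, str):
        return e if e in FIXED_BY_FLIP else ("flip", e)
    if e[0] == "flip":
        return e[1]                                 # two flips cancel
    if e[0] == "row":
        return ("row", flip(e[2]), flip(e[1]))      # row order reverses
    return ("col", flip(e[1]), flip(e[2]))          # column order is kept

def simplify(e):
    """Push every Flip inward until it rests on a named tile."""
    if isinstance(e, str):
        return e
    if e[0] == "flip":
        return flip(simplify(e[1]))
    return (e[0], simplify(e[1]), simplify(e[2]))

print(expand(("hrep", "cell", 3)))
print(simplify(("flip", ("row", "a", "b"))))
print(simplify(("flip", ("flip", "a"))))
```

Running `simplify(expand(...))` on an expression such as `Flip (lifecell * 3)` yields a row of three flipped cells, which is the kind of fully expanded form the later code generation stage works from.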
Status
A prototype implementation was completed last autumn. Implementation proceeds in two stages. First, the main expression is fully expanded to column-major order. Then, the overall column of rows drives the generation of an equivalent shell script in which, for each tile position, appropriate executable calls are made with stream redirection to linking FIFOs. This first version runs on standard multi-core Linux. It translates directly into shell script, using & operations to generate the parallelism. A new implementation is to be made, targeted explicitly at the SCC.
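To give a feel for the second stage, here is a much simplified sketch in Python of turning one row of tiles into a shell script. It is not the Lino code generator: it handles only a single West-to-East stream per tile, and the function name `row_to_shell`, the FIFO naming scheme and the tile paths are all assumptions for illustration.

```python
def row_to_shell(tiles):
    """Emit a bash script that runs `tiles` (executable paths) left to right,
    linking each pair of neighbours with a named pipe."""
    lines = ["#!/bin/bash"]
    fifos = [f"fifo_{i}_{i + 1}" for i in range(len(tiles) - 1)]
    if fifos:
        lines.append("mkfifo " + " ".join(fifos))
    for i, tile in enumerate(tiles):
        cmd = tile
        if i > 0:
            cmd += f" < {fifos[i - 1]}"   # read from the west neighbour
        if i < len(tiles) - 1:
            cmd += f" > {fifos[i]}"       # write to the east neighbour
        lines.append(cmd + " &")          # & starts each tile as its own process
    lines.append("wait")                  # block until every tile has finished
    if fifos:
        lines.append("rm " + " ".join(fifos))
    return "\n".join(lines)

# Hypothetical tile executables.
print(row_to_shell(["./a", "./b", "./c"]))
```

Each tile becomes a separate operating system process, so the kernel scheduler is free to spread the tiles over the available cores.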
An example script
lifecell:((int,int),(int,int),(int,int),(int,int)) <- ./
liferow = Mirror|(Flip (lifecell * 3))|I|Mirror;
lifeblock = Flip (liferow ^ 3);
mirrorrow = Mirror * 6;
main = Rotate(mirrorrow _ lifeblock _ mirrorrow)
SICSA Multicore Challenge

Concordance
The aim of the SICSA MultiCore Challenge is to compare several approaches to parallel computation in terms of achieved performance and ease of implementation. We plan to specify one or more representative applications to be implemented and assessed on state-of-the-art multi-core machines. We invite system developers to apply their systems to these benchmark applications. We invite software developers to put forward their applications as benchmarks for this challenge. The first application proposed was a file concordance application.
Specification of the Concordance application
Given:
Text file containing English text in ASCII encoding.
An integer N.
Find:
For all sequences of words, up to length N, occurring in the input file, the number of occurrences of this sequence in the text, together with a list of start indices.
Optionally, sequences with only 1 occurrence should be omitted.
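The specification can be read off almost directly as code. A small reference sketch in Python, under stated assumptions: words are lower-cased runs of letters and apostrophes, and start indices are word positions rather than character offsets (the specification fixes neither). The function name `concordance` is hypothetical.

```python
import re
from collections import defaultdict

def concordance(text, n, omit_singletons=True):
    """Map each word sequence of length 1..n to the word indices where it starts."""
    # Tokenisation rule is an assumption, not part of the specification.
    words = re.findall(r"[a-z']+", text.lower())
    index = defaultdict(list)
    for start in range(len(words)):
        for length in range(1, n + 1):
            if start + length > len(words):
                break
            index[tuple(words[start:start + length])].append(start)
    if omit_singletons:
        return {seq: pos for seq, pos in index.items() if len(pos) > 1}
    return dict(index)

result = concordance("the cat sat on the mat the cat ran", n=2)
for seq, starts in sorted(result.items()):
    print(" ".join(seq), len(starts), starts)
```

For this nine-word text and N = 2 the repeated sequences are "the", "cat" and "the cat"; every other sequence occurs once and is omitted.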
Is it a good parallel problem?
I think this is a very hard programme to get good parallelism out of. This is because a well designed serial programme to do concordance will spend a large part of its time reading in text or printing results. This was not immediately apparent to the proposers, probably because they started out with a poorly written Haskell serial implementation. When looking at any problem the first thing to do is get an estimate of the complexity order of the problem. My intuition was that this was roughly linear in the size of the input text.
Quick Hack
Prior to doing any parallelisation it is advisable to set up a good sequential version. I was initially doubtful that this challenge would provide an effective basis for parallelisation because it seemed such a simple problem. Intuitively it seems like a problem that is likely to be of either linear or at worst log-linear complexity, and for such problems, especially ones involving text files, the time taken to read in the file and print out the results can easily come to dominate the total time taken. If a problem is disk bound, then there is little advantage in expending effort to run it on multiple cores. However, that was only a hypothesis and needed to be verified by experiment. In line with our school motto of programming to an interface not an implementation, the interface above rather than the Haskell implementation was chosen as the starting point. In order to get a bog-standard implementation, C was chosen as the implementation language.
Serial results
The first thing to note is that the C version is much faster than the initial Haskell. The difference in speed is far greater than could be accounted for in terms of the relative efficiencies of the compilers. Instead it indicates that the Haskell version uses a poor algorithm.

Initial runs on Windows:
version   input file size   time
Haskell   3 KB              0.82 sec
C         3 KB              0.24 sec
Haskell   4.9 MB            timed out after 3 hours
C         4.9 MB            3.67 sec
Algorithm Structure
The algorithm used had the following basic structure:
1. Read the input file to a buffer.
2. Tokenize it to a sequence of integers corresponding to words.
3. Make a single pass through the tokenized data building a hashed index.
4. Make a final pass through the index printing out the results.
It is clear that this algorithm is basically of order N, as it has 3 sequential passes, and that steps 1 and 4 are likely to take a significant fraction of the time. How can it be parallelised across cores?
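The four steps can be sketched compactly. This is an illustration in Python, not the C programme used for the timings; the function names are hypothetical, splitting on whitespace is an assumed tokenisation rule, and an in-memory string stands in for the input file.

```python
import io

def tokenize(buffer):
    """Step 2: map each word to a small integer; equal words share an id."""
    ids, tokens = {}, []
    for word in buffer.split():
        tokens.append(ids.setdefault(word, len(ids)))
    return tokens, list(ids)          # list(ids)[i] is the word with id i

def build_index(tokens, n):
    """Step 3: one pass over the tokens, building a hashed index from
    integer sequences of length 1..n to their start positions."""
    index = {}
    for start in range(len(tokens)):
        for length in range(1, min(n, len(tokens) - start) + 1):
            key = tuple(tokens[start:start + length])
            index.setdefault(key, []).append(start)
    return index

def print_index(index, vocab, out):
    """Step 4: one pass over the index, printing the repeated sequences."""
    for seq, starts in index.items():
        if len(starts) > 1:
            print(" ".join(vocab[t] for t in seq), len(starts), starts, file=out)

buffer = io.StringIO("a b a b c").read()   # step 1, in-memory stand-in for the file
tokens, vocab = tokenize(buffer)
out = io.StringIO()
print_index(build_index(tokens, 2), vocab, out)
print(out.getvalue(), end="")
```

Working on integers rather than strings keeps the hashing and comparison in the indexing pass cheap, which leaves the reading and printing steps as the likely bottleneck.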
What you can not do
You can not simply split the file into two halves, produce a concordance for each half and merge them. The aim is to produce a list of words and positions for all repeated words. If you split the file in two halves you are not guaranteed to find repetitions unless you do something smart. So how to proceed? Recall we tokenize the file, mapping it to integers. How about two processes, one of which deals with all the odd words and one with all the even words?
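A small experiment makes the difference concrete. The sketch below, in Python, indexes single words only and reads "odd" and "even" as the parity of the integer token; the function name `index_words` and the sample token list are made up for illustration.

```python
def index_words(tokens, keep):
    """Index the positions of every token id for which keep(id) is true."""
    index = {}
    for pos, tok in enumerate(tokens):
        if keep(tok):
            index.setdefault(tok, []).append(pos)
    return index

tokens = [0, 1, 2, 1, 0, 3, 2]     # a tokenised file: word ids in text order

full = index_words(tokens, lambda t: True)
even = index_words(tokens, lambda t: t % 2 == 0)   # first worker
odd = index_words(tokens, lambda t: t % 2 == 1)    # second worker

# The two workers see disjoint words and together rebuild the full index.
print({**even, **odd} == full)

# Cutting the file by position hides repeats that straddle the cut:
# word 0 occurs at positions 0 and 4, but the left half sees it once.
left = index_words(tokens[:len(tokens) // 2], lambda t: True)
print(full[0], left[0])
```

Each worker still scans the whole token stream, but the indices it builds share no keys, so the two outputs can simply be concatenated.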
Two versions
I tried two versions:
1. Use pthreads: read the file in and tokenize once, then get two threads to go through and build two disjoint indices and print them to two files. Then use cat to join the files.
2. Use the shell & operator to run two copies of the original C, lightly modified to process either even or odd words, writing to two files, and cat them.
How did these perform? Tests were done on Linux using the same processor as the previous example, using the 4.9 MB World English Bible as data. This time the C was optimised with -O3.

example                               runtime
serial version in C                   2.12 secs
dual thread version using Pthreads    2.45 secs
dual process version using bash       1.93 secs
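The timings translate into speedups as follows. The second function applies Amdahl's law, which is brought in here as an interpretive assumption and is not part of the original analysis; it estimates what fraction of the run would have to be parallelisable to produce the observed speedup on two workers.

```python
serial = 2.12      # seconds, serial version in C
pthreads = 2.45    # seconds, dual thread version using Pthreads
bash = 1.93        # seconds, dual process version using bash

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def amdahl_parallel_fraction(s, workers):
    """Solve s = 1 / ((1 - p) + p / workers) for the parallel fraction p."""
    return (1 - 1 / s) / (1 - 1 / workers)

print(f"pthreads speedup: {speedup(serial, pthreads):.2f}")
print(f"bash speedup:     {speedup(serial, bash):.2f}")
p = amdahl_parallel_fraction(speedup(serial, bash), workers=2)
print(f"implied parallel fraction (bash): {p:.2f}")
```

The pthreads version is a slowdown (speedup below 1), and the bash version gains about 10%. Under the Amdahl assumption that corresponds to less than a fifth of the run being parallelised.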
Lessons
The old bash shell is actually better than pthreads

This is good news for Lino, since it compiles to bash shell commands
Overall speedup is minor because the task is I/O limited
What next
On the SCC
Try to run the concordance bash style on, say, 32 cores
Port Lino to the SCC
On the SICSA front

Encourage you all to try your hand at it
Report results in December
Propose better examples to SICSA
  Mandelbrot
  Convolution
  Disparity matcher