Challenging Multi Cores
Challenging Multi Cores
Challenging Multi Cores
Paul Cockshott1
1
Paul Cockshott,
Motivation
Moore's Law implies that as the scale of transistors shrinks, the number of gates that can be tted onto a chip of a standard size, say of the order of 1cm2 , will double every two years. Historically this has been used by processor manufactures to increase the complexity of individual processor cores. A reduction in feature sizes potentially allows the speed of gates to rise, allowing a rise in clock speeds. This rise was pretty continuous until the last few years since when it has leveled o.
Paul Cockshott,
Higher clock speeds increase the heat dissipation per cm2 due to capacitive losses, at around 3Ghz the heat losses are at the limit of what can be sustained with air cooling, even with heat pipes etc. As clock speeds rise, clock skew accross the die becomes a signicant factor which ultimately limits the ability to construct synchronous machines.
A result of these pressures has been that the mode of elaboration of chips has switched from complexifying individual cores, to the adding of multiple cores to each chip. We can now expect the number of cores to grow exponentially: perhaps doubling roughly every two years. This implies that in 10 years time a mass produced standard PC chip could contain around 256 or 512 cores.
Paul Cockshott, And the SICSA multi-core challenge
The SCC
Paul Cockshott,
Paul Cockshott,
Paul Cockshott,
This growth in the number of cores and the problems of communicating between arbitrary processors is going to require a fundamental rethink in the way we design programming languages. In this talk I present Lino, a novel notation for programming arbitrarily large arrays of processors, based on abstractions over patterns of process adjacencies.
Paul Cockshott,
Paul Cockshott,
Syntax of Lino
comm::= dev | alias comms ::= comm[; comms ] prog ::= coms ; main = exp
def ::= id :faces < path faces ::= ((ty0 ,ty4 ),...(ty3 ,ty7 ) I/O stream types ty ::= ... type path::= ... le path alias ::= id = exp id aliases exp A command is a tile denition or an alias ed expression. A denition provides the tile name, the types of the input and output for each face, and a path to an executable body.
commands command seq a program is a sequence of commands ending with a nominated main expression dene tile id
Paul Cockshott,
Syntax continued
block ::= [ redir [; redir ] ] redir ::= path dirio [dirio ] dirio ::= inout direction inout ::= < | > direction::= North | South | East | West exp ::= ... id
shell block redirected shell comman standard redirections direction names expressions name of dened tile or aliased expression
Paul Cockshott,
More syntax
I Mirror 0 ( exp ) exp1 |exp2 exp1 _ exp2 exp * int exp ^int Flip exp Rotate exp
As in
identity redirects face I/O sink bracketing for priority process row process column horizontal replication vertical replication reection about vertical axis rotate 90 degrees clockwise
5 1 0 4 3 7 6 2 3 7 0 1 4
6 a) flip
a) identity
b) mirror
c) null
Paul Cockshott,
b) rotate
Lino programs
A program is a sequence of commands ending with a nominated main expression. A command is a tile denition or an alias ed expression. A denition provides the tile name, the types of the input and output for each face, and a path to an executable body.
Paul Cockshott,
Transform rules
input 1 2 3 e*1 e*N e^1 e^N Flip I Flip Mirror Flip 0 Flip(e|f) Flip (e_f)
Paul Cockshott,
input 4 Rotate I Rotate Mirror Rotate 0 Rotate (e|f) Rotate (e_f) Flip Flip e Rotate Rotate Rotate Rotate e (a_b)|(c_d)
5 6 7
Paul Cockshott,
Paul Cockshott,
Status
A prototype implementation was completed last autumn. Implementation proceeds in two stages. First, the main expression is fully expanded to column-major order. Then, the overall column of rows drives the generation of an equivalent shell script in which, for each tile position, approprite executable calls are made with stream redirection to linking FIFOs. This rst version runs on standard multi-core linux. It translates directly into shell script to generate the parallelism using & operations. A new implementation is to be made targeted explicitly at the SCC.
Paul Cockshott,
An example script
lifecell:((int,int),(int,int),(int,int),(int,int)) <- ./ liferow = Mirror|(Flip (lifecell *3) )|I|Mirror; lifeblock = Flip (liferow ^ 3); mirrorrow = Mirror * 6; main = Rotate(mirrorrow _ lifeblock _ mirrorrow
Paul Cockshott,
The aim of the SICSA MultiCore Challenge is to compare several approaches to parallel computation in terms of achieved performance and ease of implementation. We plan to specify one or more representative applications to be implemented and assessed on state-of-the-art multi-core machines. We invite system developers to apply their systems on these benchmark applications. We invite software developers to put forward their applications as benchmarks for this challenge. The rst application proposed was a le concordance application.
Paul Cockshott,
Given:
Text le containing English text in ASCII encoding. An integer N. Find: For all sequences of words, up to length N, occurring in the input le, the number of occurrences of this sequence in the text, together with a list of start indices. Optionally, sequences with only 1 occurrence should be omitted.
Paul Cockshott,
I think this is a very hard programme to get good parallelism out of. This is because a well designed serial programme to do concordance will spend a large part of its time reading in text or printing results. This was not immediately apparent to the proposers, probably because they started out with a poorly written Haskell serial implementation. When looking at any problem the rst thing to do is get an estimate of the complexity order of the problem. My intuition was that this was roughly O(N).
Paul Cockshott,
Quick Hack
Prior to doing any parallelisation it is advisable to initially set up a good sequential version. I was initially doubtfull that this challenge would provide an eective basis for parallelisation because it seemed such a simple problem. Intutively it seems like a problem that is likely to be of either linear or at worst log linear complexity, and for such problems, especially ones involving text les, the time taken to read in the le and print out the results can easily come to dominate the total time taken. If a problem is disk bound, then there is little advantage in expending eort to run it on multiple cores. However that was only an hypothesis and needed to be veried by experiment. In line with our school motto of programming to an interface not an implementation, the interface above rather than the Haskell implementation was chosen as the starting point. In order to get a bog standard implementation, C was chosen as the implementation language.
Paul Cockshott, And the SICSA multi-core challenge
Serial results
The rst thing to note is the C is much faster than the initial Haskell. The dierence in speed is far greater than could be accounted for in terms of the relative eciencies of the compilers. Instead it indicates that the Haskell is a poor algorithm. Initial runs on windows version input le size time haskell 3kb 0.82 sec c 3kb 0.24 sec haskell 4.9mb timed out after 3 hours c 4.9mb 3.67 sec
Paul Cockshott,
Algorithm Structure
The algorithm used had the following basic structure Read the inut le to a buer. Tokenize it to a sequence of integers corresponding to words. Make a single pass through the tokenized data building a hashed index. Make a nal pass throught the index printing out the results. It is clear that this algorithm is basically or order N as it has 3 sequential passes. andd that steps 1 and 4 are likely to take a signicant fraction of the time. How can it be parallelised accross cores?
1 2 3 4
Paul Cockshott,
You can not simply split the le into two halves, produce a concordance for each half and merge them. The aim is to produce a list of words and positions for all repated words. If you split the le in two halves you are not guaranteed to nd repetitions unless you do something smart. So how to proceed? Recall we tokenize the le mapping it to integers. How about two processes one of which deals with all the odd words and one with all the even words?
Paul Cockshott,
Two versions
I tried two versions Use pthreads, read le in and tokenize once, then get two processes to go through and build two disjoint indices and print them to two les. Then use cat to join the les. Use the shell & operator to run two copies of the original C lightly modied to process either even or odd words to two les and cat them. How did these perform: Tests were done on Linux using the same processor as the previous example, using the 4.9meg World English Bible as data. This time the C was optimised with -O3 example runtime serial version in C 2.12 secs dual thread version using Pthreads 2.45 secs dual process version using bash 1.93 secs
Paul Cockshott, And the SICSA multi-core challenge
Lessons
Paul Cockshott,
What next
On the SCC
Try to run the concordance bash style on say 32 cores Port Lino to the SCC
Paul Cockshott,