Baker CHPT 5 SIMD Good

Download as pdf or txt
Download as pdf or txt
You are on page 1of 94

CS4961 Parallel Programming

Chapter 5
Introduction to SIMD

Johnnie Baker
January 14, 2011

1
Chapter Sources
J. Baker, Chapters 1-2 slides from Fall 2010 class in
Parallel & Distributed Programming, Kent State
University
Mary Hall, Slides from Lectures 7-8 for CS4961,
University of Utah.
Michael Quinn, Parallel Programming in C with MPI
and OpenMP, 2004, McGraw Hill.

2
Some Definitions
Concurrent - Events or processes which seem to
occur or progress at the same time.
Parallel Events or processes which occur or progress
at the same time
Parallel programming (also, unfortunately, sometimes
called concurrent programming), is a computer
programming technique that provides for the parallel
execution of operations , either
within a single parallel computer
or across a number of systems.
In the latter case, the term distributed computing is used.

3
Flynns Taxonomy
(Section 2.6 in Quinns Textbook)
Best known classification scheme for parallel
computers.
Depends on parallelism it exhibits with its
Instruction stream
Data stream
A sequence of instructions (the instruction stream)
manipulates a sequence of operands (the data
stream)
The instruction stream (I) and the data stream (D)
can be either single (S) or multiple (M)
Four combinations: SISD, SIMD, MISD, MIMD
4
SISD
Single Instruction, Single Data
Single-CPU systems
i.e., uniprocessors
Note: co-processors dont count as additional processors
Concurrent processing allowed
Instruction prefetching
Pipelined execution of instructions
Concurrent execution allowed
That is, independent concurrent tasks can execute
different sequences of operations.
Functional parallelism is discussed in later slides in Ch. 1
E.g., I/O controllers are independent of CPU
Most Important Example: a PC
5
SIMD
Single instruction, multiple data
One instruction stream is broadcast to all
processors
Each processor, also called a processing
element (or PE), is usually simplistic and
logically is essentially an ALU;
PEs do not store a copy of the program nor
have a program control unit.
Individual processors can remain idle during
execution of segments of the program (based
on a data test).
6
SIMD (cont.)
All active processor executes the same
instruction synchronously, but on different
data
Technically, on a memory access, all active
processors must access the same location in
their local memory.
This requirement is sometimes relaxed a bit.
The data items form an array (or vector) and
an instruction can act on the complete array in
one cycle.
7
SIMD (cont.)
Quinn also calls this architecture a processor
array. Some examples include
ILLIAC IV (1974) was the first SIMD computer
The STARAN and MPP (Dr. Batcher architect)
Connection Machine CM2, built by Thinking
Machines).
MasPar (for Massively Parallel) computers.

8
How to View a SIMD Machine
Think of soldiers all in a unit.
The commander selects certain soldiers as
active for example, the first row.
The commander barks out an order to all the
active soldiers, who execute the order
synchronously.
The remaining soldiers do not execute orders until
they are re-activated.

9
MISD

Multiple instruction streams, single data


stream
This category does not receive much attention
from most authors, so it will only be
mentioned on this slide.
Quinn argues that a systolic array is an
example of a MISD structure (pg 55-57)
Some authors include pipelined architecture
in this category
10
MIMD
Multiple instruction, multiple data
Processors are asynchronous, since they can
independently execute different programs on
different data sets.
Communications are handled either
through shared memory. (multiprocessors)
by use of message passing (multicomputers)
MIMDs have been considered by most
researchers to include the most powerful and
least restricted computers.
11
MIMD (cont. 2/4)
Have very major communication costs
When compared to SIMDs
Internal housekeeping activities are often overlooked
Maintaining distributed memory & distributed databases
Synchronization or scheduling of tasks
Load balancing between processors
One method for programming MIMDs is for all
processors to execute the same program.
Execution of tasks by processors is still asynchronous
Called SPMD method (single program, multiple data)
Usual method when number of processors are large.
Considered to be a data parallel programming style for
MIMDs.

12
MIMD (cont 3/4)
A more common technique for programming MIMDs is to
use multi-tasking:
The problem solution is broken up into various tasks.
Tasks are distributed among processors initially.
If new tasks are produced during executions, these may
handled by parent processor or distributed
Each processor can execute its collection of tasks
concurrently.
If some of its tasks must wait for results from other tasks or new
data , the processor will focus the remaining tasks.
Larger programs usually run a load balancing algorithm in
the background that re-distributes the tasks assigned to
the processors during execution
Either dynamic load balancing or called at specific times
Dynamic scheduling algorithms may be needed to assign a
higher execution priority to time-critical tasks
13
E.g., on critical path, more important, earlier deadline, etc.
MIMD (cont 4/4)

Recall, there are two principle types of MIMD


computers:
Multiprocessors (with shared memory)
Multicomputers
Both are important and are covered next in
greater detail.

14
Multiprocessors
(Shared Memory MIMDs)
All processors have access to all memory locations .
Two types: UMA and NUMA
UMA (uniform memory access)
Frequently called symmetric multiprocessors or SMPs
Similar to uniprocessor, except additional, identical CPUs
are added to the bus.
Each processor has equal access to memory and can do
anything that any other processor can do.
SMPs have been and remain very popular

15
Multiprocessors (cont.)
NUMA (non-uniform memory access).
Has a distributed memory system.
Each memory location has the same address for all
processors.
Access time to a given memory location varies
considerably for different CPUs.
Normally, fast cache is used with NUMA systems to
reduce the problem of different memory access time
for PEs.
Creates problem of ensuring all copies of the same data in
different memory locations are identical.

16
Multicomputers
(Message-Passing MIMDs)
Processors are connected by a network
Interconnection network connections is one possibility
Also, may be connected by Ethernet links or a bus.
Each processor has a local memory and can only access its
own local memory.
Data is passed between processors using messages, when
specified by the program.
Message passing between processors is controlled by a
message passing language (typically MPI)
The problem is divided into processes or tasks that can be
executed concurrently on individual processors. Each
processor is normally assigned multiple processes.

17
Multiprocessors vs Multicomputers
Programming disadvantages of message-passing
Programmers must make explicit message-
passing calls in the code
This is low-level programming and is error
prone.
Data is not shared between processors but
copied, which increases the total data size.
Data integrity problem: Difficulty to maintain
correctness of multiple copies of data item.

18
Multiprocessors vs Multicomputers (cont)

Programming advantages of message-passing


No problem with simultaneous access to data.
Allows different PCs to operate on the same data
independently.
Allows PCs on a network to be easily upgraded
when faster processors become available.
Mixed distributed shared memory systems exist
Lots of current interest in a cluster of SMPs.
Easier to build systems with a very large number of
processors.
19
Seeking Concurrency
Several Different Ways Exist

Data dependence graphs


Data parallelism
Task parallelism
Sometimes called control parallelism or functional
parallelism.
Pipelining

20
Data Dependence Graph
Directed graph
Vertices = tasks
Edges = dependences
Edge from u to v means
that task u must finish
before task v can start.

21
Data Parallelism
All tasks (or processors) apply the same set of operations to
different data.

Example:
for i 0 to 99 do
a[i] b[i] + c[i]
endfor

Operations may be executed concurrently

Accomplished on SIMDs by having all active processors


execute the operations synchronously.

Can be accomplished on MIMDs by assigning 100/p tasks to


each processor and having each processor to calculated its
share asynchronously. 22
Supporting MIMD Data Parallelism
SPMD (single program, multiple data) programming is
generally considered to be data parallel.
Note SPMD could allow processors to execute different sections
of the program concurrently
A way to more strictly enforce a SIMD-type data parallel
programming using SPMP programming is as follows:
Processors execute the same block of instructions concurrently
but asynchronously
No communication or synchronization occurs within these
concurrent instruction blocks.
Each instruction block is normally followed by a synchronization
and communication block of steps
If processors have multiple identical tasks, the preceding
method can be generalized using virtual parallelism.
Virtual Parallelism is where each processor of a parallel
processor plays the role of several processors.
23
Data Parallelism Features
Each processor performs the same data computation
on different data sets
Computations can be performed either
synchronously or asynchronously
Defn: Grain Size is the average number of
computations performed between communication or
synchronization steps
See Quinn textbook, page 411
Data parallel programming usually results in smaller
grain size computation
SIMD computation is considered to be fine-grain
MIMD data parallelism is usually considered to be medium
grain
MIMD multi-tasking is considered to be course grain
24
Task/Functional/Control/Job
Parallelism
Independent tasks apply different operations to different data
elements

a2
b3
m (a + b) / 2
s (a2 + b2) / 2
v s - m2
First and second statements may execute concurrently
Third and fourth statements may execute concurrently
Normally, this type of parallelism deals with concurrent
execution of tasks, not statements
25
Control Parallelism Features
Problem is divided into different non-identical
tasks
Tasks are divided between the processors so
that their workload is roughly balanced
Parallelism at the task level is considered to be
coarse grained parallelism

26
Data Dependence Graph
Can be used to identify data
parallelism and job parallelism.
See page 11 of textbook.
Most realistic jobs contain both
parallelisms
Can be viewed as branches in
data parallel tasks
- If no path from vertex u to vertex
v, then job parallelism can be
used to execute the tasks u and v
concurrently.
- If larger tasks can be subdivided
into smaller identical tasks, data
parallelism can be used to
execute these concurrently.
27
For example, mow lawn becomes
Mow N lawn
Mow S lawn
Mow E lawn
Mow W lawn
If 4 people are available
to mow, then data parallelism
can be used to do these
tasks simultaneously.
Similarly, if several people
are available to edge lawn
and weed garden, then we
can use data parallelism to
provide more concurrency.

28
Pipelining
Divide a process into stages
Produce several items simultaneously

29
Compute Partial Sums
Consider the for loop:
p[0] a[0]
for i 1 to 3 do
p[i] p[i-1] + a[i]
endfor
This computes the partial sums:
p[0] a[0]
p[1] a[0] + a[1]
p[2] a[0] + a[1] + a[2]
p[3] a[0] + a[1] + a[2] + a[3]
The loop is not data parallel as there are dependencies.
However, we can stage the calculations in order to achieve some
parallelism.
30
Partial Sums Pipeline

p[0] p[1] p[2] p[3]

p[0] p[1] p[2]


= + + +

a[0] a[1] a[2] a[3]


31
Alternate Names for SIMDs
Recall that all active processors of a true SIMD
computer must simultaneously access the same
memory location.
The value in the i-th processor can be viewed as the
i-th component of a vector.
SIMD machines are sometimes called vector
computers [Jordan,et.al.] or processor arrays [Quinn
94,04] based on their ability to execute vector and
matrix operations efficiently.

32
SIMD Computers
SIMD computers that focus on vector
operations
Support some vector and possibly matrix
operations in hardware
Usually limit or provide less support for non-
vector type operations involving data in the
vector components.
General purpose SIMD computers
May also provide some vector and possibly matrix
operations in hardware.
More support for traditional type operations (e.g.,
other than for vector/matrix data types).
33
Why Processor Arrays?
Historically, high cost of control units
Scientific applications have data parallelism

34
Data/instruction Storage
Front end computer
Also called the control unit
Holds and runs program
Data manipulated sequentially
Processor array
Data manipulated in parallel

35
Processor Array Performance
Performance: work done per time unit
Performance of processor array
Speed of processing elements
Utilization of processing elements

36
Performance Example 1
1024 processors
Each adds a pair of integers in 1 sec (1
microsecond or one millionth of second or
10-6 second.)
What is the performance when adding two
1024-element vectors (one per processor)?

Performanc e 10241operations
sec 1.024 109
ops / sec
37
Performance Example 2
512 processors
Each adds two integers in 1 sec
What is the performance when adding two
vectors of length 600?
Since 600 > 512, 88 processor must add two
pairs of integers.
The other 424 processors add only a single
pair of integers.

38
Example of a 2-D Processor Interconnection
Network in a Processor Array

Each VLSI chip has 16 processing elements.


Each PE can simultaneously send a value to a neighbor.

PE =
processor
element

39
SIMD Execution Style
The traditional (SIMD, vector, processor array) execution style
([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]:
The sequential processor that broadcasts the commands
to the rest of the processors is called the front end or
control unit (or sometimes host).
The front end is a general purpose CPU that stores the
program and the data that is not manipulated in parallel.
The front end normally executes the sequential portions of
the program.
Alternately, all PEs needing computation can execute computation
steps synchronously in order to avoid broadcast cost to distribute
results
Each processing element has a local memory that cannot
be directly accessed by the control unit or other processing
elements.
40
SIMD Execution Style
Collectively, the individual memories of the
processing elements (PEs) store the (vector) data
that is processed in parallel.
When the front end encounters an instruction
whose operand is a vector, it issues a command to
the PEs to perform the instruction in parallel.
Although the PEs execute in parallel, some units
can be allowed to skip particular instructions.

41
Masking on Processor Arrays
All the processors work in lockstep except those that
are masked out (by setting mask register).
The conditional if-then-else is different for processor
arrays than sequential version
Every active processor tests to see if its data meets the
negation of the boolean condition.
If it does, it sets its mask bit so those processors will not
participate in the operation initially.
Next the unmasked processors, execute the THEN part.
Afterwards, mask bits (for original set of active processors)
are flipped and unmasked processors perform the the ELSE
part.

42
if (COND) then A else B

43
if (COND) then A else B

44
if (COND) then A else B

45
SIMD Machines
An early SIMD computer designed for vector and
matrix processing was the Illiac IV computer
Initial development at the University of Illinois 1965-70
Moved to NASA Ames, completed in 1972 but not fully
functional until 1976.
See Jordan et. al., pg 7 and Wikepedia
The MPP, DAP, the Connection Machines CM-1 and
CM-2, and MasPars MP-1 and MP-2 are examples of
SIMD computers
See Akl pg 8-12 and [Quinn, 94]
The CRAY-1 and the Cyber-205 use pipelined
arithmetic units to support vector operations and are
sometimes called a pipelined SIMD
See [Jordan, et al, p7], [Quinn 94, pg 61-2], and [Quinn
2004, pg37).
46
SIMD Machines
Quinn [1994, pg 63-67] discusses the CM-2 Connection
Machine (with 64K PEs) and a smaller & updated CM-200.
Professor Batcher was the chief architect for the STARAN
(early 1970s) and the MPP (Massively Parallel Processor, early
1980s) and an advisor for the ASPRO
STARAN was built to do air traffic control (ATC), a real-time application
ASPRO is a small second generation STARAN used by the Navy in the
spy planes for air defense systems, a related area to ATC.
MPP was ordered by NASA-Goddard, in DC area and was used for
multiple scientific computations.
Professor Batcher is best known architecturally for the MPP,
which is at the Smithsonian Institute & currently displayed at
a D.C. airport.

47
Todays SIMDs
SIMD functionality is sometimes embedded in
sequential machines.
Others are being build as part of hybrid
architectures.
Some SIMD and SIMD-like features are included in
some multi/many core processing units
Some SIMD-like architectures have been build as
special purpose machines, although some of these
could classify as general purpose.
Some of this work has been proprietary.
The fact that a parallel computer is SIMD or SIMD-like is
often not advertised by the company building them.
48
An Inexpensive SIMD
ClearSpeed produced a COTS (commodity off the shelf) SIMD
Board
WorldScape has developed some defense and commercial
applications for this computer.
Not a traditional SIMD as the hardware doesnt tightly
synchronize the execution of instructions.
Hardware design supports efficient synchronization
This machine is programmed like a SIMD.
The U.S. Navy observed that ClearSpeed machines process
radar a magnitude faster than others.
Earlier, quite a bit of information about this was at the
websites www.wscape.com and www.clearspeed.com
I have a lot of ClearSpeed information posted at
www.cs.kent.edu/~jbaker/ClearSpeed/
Must log in from a cs.kent.edu account in order to be able to gain
49
access using login-name and password.
Advantages of SIMDs
Reference: [Roosta, pg 10]
Less hardware than MIMDs as they have only one
control unit.
Control units are complex.
Less memory needed than MIMD
Only one copy of the instructions need to be stored
Allows more data to be stored in memory.
Much less time required for communication between
PEs and data movement.

50
Advantages of SIMDs (cont)
Single instruction stream and synchronization of PEs
make SIMD applications easier to program,
understand, & debug.
Similar to sequential programming
Control flow operations and scalar operations can be
executed on the control unit while PEs are executing
other instructions.
MIMD architectures require explicit synchronization
primitives, which create a substantial amount of
additional overhead.

51
Advantages of SIMDs (cont)
During a communication operation between PEs,
PEs send data to a neighboring PE in parallel and in lock
step
No need to create a header with routing information as
routing is determined by program steps.
the entire communication operation is executed
synchronously
SIMDs are deterministic & have much more predictable
running time.
Can normally compute a tight, worst case upper bound for
the time required for both computation & communications
operations.
Less complex hardware in SIMD since no message
decoder is needed in the PEs
MIMDs need a message decoder in each PE.
52
We next consider the usual claims
about SIMD shortcomings

Since SIMDs are not well-understood by most


computer science professionals today, it is
important that we examine these carefully.
Also, due to rapid changes that occur regularly
in computational science, some concerns may
become either more relevant or less relevant
over time.

53
SIMD Shortcoming Claims
(with some rebuttals -- 1/7)

Claims are from Quinns textbook [i.e., Quinn 2004].


Similar statements are found in [Grama, et. al].
Claim 1: SIMDs have a data-parallel orientation, but
not all problems are data-parallel
While not all solutions are data parallel, most problems
seem to have a data parallel solution.
In [Fox, et.al.], the observation was made in their study of
large parallel applications at national labs, that most were
data parallel by nature, but often had points where
significant branching occurred.

54
SIMD Shortcoming Claims
(with some rebuttals 2/7)
Claim 2: Speed drops for conditionally executed
branches
MIMDs processors can execute multiple branches
concurrently.
For an if-then-else statement with execution times for the
then and else parts being roughly equal, about of the
SIMD processors are idle during its execution
With additional branching, the average number of
inactive processors can become even higher.
With SIMDs, only one of these branches can be
executed at a time.
Note that this reason justifies the study of multiple
SIMDs (or MSIMDs), which can reduce or eliminate
branching
On many applications, any branching is quite shallow.
55
SIMD Shortcoming Claims
(with some rebuttals 3/7)
Claim 2 (cont): Speed drops for conditionally
executed code
In [Fox, et.al.], the observation was made that for the real
applications surveyed, the MAXIMUM number of active
branches at any point in time was about 8.
The cost of the simple processors used in a SIMD is small,
so the cost of multiple simple processors being idle may be
less important than the increase in running time.
Programmers used to worry about full utilization of
memory but stopped this after memory cost became
insignificant overall.
Often the much of the code in long THEN and ELSE parts is
identical and the IF-THEN-ELSE construct can be replaced
with multiple short IF-THEN-ELSE so that all processors
can execute the identical code simultaneously.
56
SIMD Shortcoming Claims
(with some rebuttals 4/7)

Claim 3: Dont adapt to multiple users well.


This is true to some degree for all parallel computers.
If usage of a parallel processor is dedicated to a important
problem, it is probably best not to risk compromising its
performance by sharing its time with other applications
This reason also justifies the study of multiple SIMDs (or
MSIMD).
SIMD architecture has not received the research and
development attention that MIMD has received and can
greatly benefit from further attention.

57
SIMD Shortcoming Claims
(with some rebuttals -- 5/7)

Claim 4: Do not scale down well to starter


systems that are affordable.
This point is arguable and its truth is likely to vary
rapidly over time
ClearSpeed produced several very economical
SIMD boards that plugs into a PC with at least 48
processors per chip and 2-3 chips per board.

58
SIMD Shortcoming Claims
(with some rebuttals -- 6/7)

Claim 5: Requires customized VLSI for processors and


expense of control units in PCs has dropped.
Reliance on COTS (Commodity, off-the-shelf parts) has
dramatically dropped the price of MIMDS over time.
Expense of PCs (with control units) has dropped
significantly
However, reliance on COTS has fueled the success of a
low level parallelism provided by clusters and
restricted new innovative parallel architecture research
for well over a decade.
59
SIMD Shortcoming Claims
(with some rebuttals 7/7)
Claim 5 (cont.)
There is strong evidence that the period of continual
dramatic increases in speed of PCs and clusters is
ending.
Continued rapid increases in parallel performance in
the future will be necessary in order to solve
important problems that are beyond our current
capabilities
Additionally, with the appearance of the very
economical COTS SIMDs, this claim no longer
appears to be relevant.
60
Switch here to looking at some of
Mary Halls slides
Her focus is focused more on the occurrence
of SIMD computations in some of the new
GPGPS architectures

61
Review: Predominant Parallel Control Mechanisms

62
SIMD and MIMD Architectures: Whats the Difference?

A typical SIMD architecture and a typical MIMD architecture (b)


Slide source: Grama et al., Introduction to Parallel Computing, 63
http://www.users.cs.umn.edu/~karypis/parbook
SIMD and MIMD Architectures:
Whats the Difference?

Slide source: Grama et al., Introduction to Parallel Computing,


http://www.users.cs.umn.edu/~karypis/parbook
64
Overview of SIMD Programming
Vector architectures
Early examples of SIMD supercomputers
TODAY Mostly
Multimedia extensions such as SSE and AltiVec
Graphics and games processors
Accelerators (e.g., ClearSpeed)
Is there a dominant SIMD programming model
Unfortunately, NO!!!
Why not?
Vector architectures were programmed by scientists
Multimedia extension architectures are programmed by
systems programmers (almost assembly language!)
GPUs are programmed by games developers (domain-
specific libraries) 65
Added Information

SSE
Streaming SIMD Extensions
A series of additional instructions built into the Pentium CPU
An earlier full-screen editor in OS/2
AltiVec
A floating point SIMD instruction set designed and owend
by Apple Computer, IBM, and Motorola
The AIM alliance.

66
Scalar vs. SIMD in Multimedia
Extensions

67
Multimedia Extension Architectures
At the core of multimedia extensions
SIMD parallelism
Variable-sized data fields:
Vector length = register width / type size

68
Multimedia / Scientific Applications
Image
Graphics : 3D games, movies
Image recognition
Video encoding/decoding : JPEG, MPEG4
Sound
Encoding/decoding: IP phone, MP3
Speech recognition
Digital signal processing: Cell phones
Scientific applications
Double precision Matrix-Matrix multiplication
(DGEMM)
Y[] = a*X[] + Y[] (SAXPY)

69
Characteristics of Multimedia
Applications
Regular data access pattern
Data items are contiguous in memory
Short data types
8, 16, 32 bits
Data streaming through a series of processing
stages
Some temporal reuse for such data streams
Sometimes
Many constants
Short iteration counts
Requires saturation arithmetic

70
Saturation Arithmetic
Saturation arithmetic is a version of arithmetic in which all operations such as addition and
multiplication are limited to a fixed range between a minimum and maximum value. If the result of an
operation is greater than the maximum it is set ("clamped") to the maximum, while if it is below the
minimum it is clamped to the minimum. The name comes from how the value becomes "saturated"
once it reaches the extreme values; further additions to a maximum or subtractions from a minimum
will not change the result.
For example, if the valid range of values is from -100 to 100, the following operations produce the
following values:
60 + 43 = 100
(60 + 43) 150 = 50
43 150 = 100
60 + (43 150) = 40
10 11 = 100
99 99 = 100
30 (5 1) = 100
30 5 30 1 = 70
As can be seen from these examples, familiar properties like associativity and distributivity fail in
saturation arithmetic. This makes it unpleasant to deal with in abstract mathematics, but it has an
important role to play in digital hardware and algorithms.
Typically, early computer microprocessors did not implement integer arithmetic operations using
saturation arithmetic; instead, they used the easier-to-implement modular arithmetic, in which
values exceeding the maximum value "wrap around" to the minimum value, like the hours on a clock
passing 71
Why SIMD
More parallelism
When parallelism is abundant
SIMD in addition to ILP
instruction level parallelism)
Simple design
Replicated functional units
Small die area
No heavily ported register files
Die area: +MAX-2(HP): 0.1% +VIS(Sun): 3.0%
Must be explicitly exposed to the hardware
By the compiler or by the programmer

72
Programming Multimedia Extensions
Language extension
Programming interface similar to function call
C: built-in functions, Fortran: intrinsics
Most native compilers support their own
multimedia extensions
GCC: -faltivec, -msse2
AltiVec: dst= vec_add(src1, src2);
SSE2: dst= _mm_add_ps(src1, src2);
BG/L: dst= __fpadd(src1, src2);
No Standard !
Need automatic compilation
73
Programming Complexity Issues
High level: Use compiler
may not always be successful
Low level: Use intrinsics or inline assembly
tedious and error prone
Data must be aligned, and adjacent in memory
Unaligned data may produce incorrect results
May need to copy to get adjacency (overhead)
Control flow introduces complexity and
inefficiency
Exceptions may be masked
74
1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

R R XR 1.08327
G = G + XG * 1.89234
B B XB 1.29835

75
2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

R R
G = G + X[i:i+2]
B B

76
3. Vectorizable Loops

for (i=0; i<100; i+=1)


A[i+0] = A[i+0] + B[i+0]

77
3. Vectorizable Loops

for (i=0; i<100; i+=4)


A[i+0] = A[i+0] + B[i+0]
A[i+1] = A[i+1] + B[i+1]
A[i+2] = A[i+2] + B[i+2]
A[i+3] = A[i+3] + B[i+3]

for (i=0; i<100; i+=4)


A[i:i+3] = B[i:i+3] + C[i:i+3]

78
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
L = A[i+0] B[i+0]
D = D + abs(L)

79
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
L = A[i+0] B[i+0]
D = D + abs(L)
L = A[i+1] B[i+1]
D = D + abs(L)

for (i=0; i<16; i+=2)


L0 = A[i:i+1] B[i:i+1]
L1
D = D + abs(L0)
D = D + abs(L1)
80
Exploiting SLP with SIMD Execution
Benefit:
Multiple ALU ops One SIMD op
Multiple ld/st ops One wide mem op

Cost:
Packing and unpacking
Reshuffling within a register
Alignment overhead

81
Packing/Unpacking Costs

C = A + 2 C A 2
= +
D = B + 3 D B 3

82
Packing/Unpacking Costs
Packing source operands
Copying into contiguous memory

A A
A = f() B B
B = g()
C = A + 2 C A 2
= +
D = B + 3 D B 3

83
Packing/Unpacking Costs
Packing source operands
Copying into contiguous memory
Unpacking destination operands
Copying back to location
A A
A = f() B B
B = g()
C = A + 2 C A 2
= +
D = B + 3 D B 3
E = C / 5
F = D * 7 C C
D D

84
Alignment Code Generation
Aligned memory access
The address is always a multiple of 16 bytes
Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
Va = a[i:i+3];

0 16 32 48

85
Alignment Code Generation (cont.)
Misaligned memory access
The address is always a non-zero constant offset
away from the 16 byte boundaries.
Static alignment: For a misaligned load, issue
two adjacent aligned loads followed by a merge.

float a[64]; float a[64];


for (i=0; i<60; i+=4) for (i=0; i<60; i+=4)
Va = a[i+2:i+5]; V1 = a[i:i+3];
V2 = a[i+4:i+7];
Va = merge(V1, V2, 8);

0 16 32 48

86
Statically align loop
iterations

float a[64];
for (i=0; i<60; i+=4)
Va = a[i+2:i+5];

float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=2; i<62; i+=4)
Va = a[i+2:i+5];
87
Alignment Code Generation (cont.)
Unaligned memory access
The offset from 16 byte boundaries is varying or
not enough information is available.
Dynamic alignment: The merging point is
computed during run time.
float a[64]; float a[64];
for (i=0; i<60; i++) for (i=0; i<60; i++)
Va = a[i:i+3]; V1 = a[i:i+3];
V2 = a[i+4:i+7];
align = (&a[i:i+3])%16;
Va = merge(V1, V2, align);
0 16 32 48

88
SIMD in the Presence of Control Flow

for (i=0; i<16; i++)


if (a[i] != 0)
b[i]++;

for (i=0; i<16; i+=4){


pred = a[i:i+3] != (0, 0, 0, 0);
old = b[i:i+3];
new = old + (1, 1, 1, 1);
b[i:i+3] = SELECT(old, new, pred);
}
Overhead:
Both control flow paths are always executed !
89
An Optimization:
Branch-On-Superword-Condition-Code

for (i=0; i<16; i+=4){


pred = a[i:i+3] != (0, 0, 0, 0);
branch-on-none(pred) L1;
old = b[i:i+3];
new = old + (1, 1, 1, 1);
b[i:i+3] = SELECT(old, new, pred);
L1:
}

90
Control Flow
Not likely to be supported in todays
commercial compilers
Increases complexity of compiler
Potential for slowdown
Performance is dependent on input data
Many are of the opinion that SIMD is not a
good programming model when there is
control flow.
But speedups are possible!

91
Nuts and Bolts
What does a piece of code really look like?
for (i=0; i<100; i+=4)
A[i:i+3] = B[i:i+3] + C[i:i+3]

for (i=0; i<100; i+=4) {


__m128 btmp = _mm_load_ps(float B[I]);
__m128 ctmp = _mm_load_ps(float C[I]);
__m128 atmp = _mm_add_ps(__m128 btmp, __m128 ctmp);
void_mm_store_ps(float A[I], __m128 atmp);
}

92
Wouldnt you rather use a compiler?
Intel compiler is pretty good
icc msse3 vecreport3 <file.c>
Get feedback on why loops were not
vectorized
First programming assignment
Use compiler and rewrite code examples to
improve vectorization
One example: write in low-level intrinsics

93
Next Time
Discuss Red-Blue computation, problem 10
on page 111 (not assigned, just to discuss)
More on Data Parallel Algorithms

94

You might also like