Lecture 1: Introduction
1
Syllabus and Mechanics
• SIC 201, Tuesday & Friday, 3:30–5:00 PM
• Mechanics:
– Attending Classes – Mandatory
– Evaluation
• Quizzes (two): 20%
• Project/Presentation: 30%
• Midterm: 25%
• Final Exam: 30%
2
What we shall study
• Sequential + Concurrency + Distribution
– Nondeterminism – Guarded Command Notation
– Shared-variable concurrency, interleaving semantics
– Mutual exclusion and other classic synchronization problems
– Language abstractions: monitors
– Distributed programming via communication primitives
• Communicating Sequential Processes (CSP) notation
– Understand the subtleties among concurrency, parallelism, and distribution
– Distributed mutual exclusion, remote procedure calls, time synchronization
• Strategies for parallel programming
– Thread parallelism, run-time scheduling, work stealing
• Logging, recovery, crashes, concurrency control, CDNs (content delivery networks)
• PGAS parallel languages: X10 features
• MapReduce
• Consensus: Paxos, BFT, SBFT, PBFT
– Blockchains, cryptocurrencies
3
Whither Programming: Sequential
to Parallel!
4
Software Crisis
“To put it quite bluntly: as long as there were no
machines, programming was no problem at all;
when we had a few weak computers,
programming became a mild problem, and
now we have gigantic computers,
programming has become an equally gigantic
problem.”
Edsger W. Dijkstra, 1972 Turing Award Lecture
5
First Software Crisis: 1960s-70s
6
Second Software Crisis: ’80s and ’90s
Problem:
• Inability to build and maintain complex, robust applications requiring
millions of lines of code developed by hundreds of programmers
– Needed composability, malleability, and maintainability
– High performance was not an issue – left to Moore’s Law
• Programming in the Large vs. Programming in the Small
– Coordination and discipline of software-system management
– Software management needed to shift focus from software requirements to
realization, while at the same time managing migration of systems to new
technologies or new requirements
• Solution:
– Object-Oriented Programming – C++, C#, and Java
– Better tools
• Component libraries, Purify
– Better software-engineering methodology
• Design patterns, specification, testing, code reviews, UML, …
7
Parallel Programming Models
• Data-parallel model:
– The same or similar computations are performed repeatedly on different
data. Image-processing algorithms that apply a filter to each pixel are a
common example of data parallelism.
– OpenMP is a compiler-directive-based API that can express a data-parallel
model.
• Task-parallel model: independent units of work are encapsulated in
functions and mapped to individual threads, which execute asynchronously.
Thread libraries (e.g., the Win32 thread API or POSIX threads) are designed
to express task-level concurrency.
• Hybrid models: sometimes more than one model is applied to a single
problem, resulting in a hybrid algorithm. A database is a good example:
tasks like inserting records, sorting, or indexing can be expressed in the
task-parallel model, while a database query uses the data-parallel model
to perform the same operation on different data.
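The two models above can be sketched in a few lines of Python (the slides do not prescribe a language; the function names and data here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: the same operation (here, a toy "filter" that
# doubles a pixel value) is applied to every element of the data set.
def apply_filter(pixel):
    return pixel * 2

pixels = [1, 2, 3, 4]
with ThreadPoolExecutor() as pool:
    filtered = list(pool.map(apply_filter, pixels))  # one task per element

# Task parallelism: independent pieces of work, encapsulated in
# functions, run as separate asynchronous tasks.
def insert_records():
    return "inserted"

def build_index():
    return "indexed"

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(insert_records)
    f2 = pool.submit(build_index)
    results = (f1.result(), f2.result())

print(filtered)   # [2, 4, 6, 8]
print(results)    # ('inserted', 'indexed')
```

A hybrid model would combine both: submit independent tasks, each of which internally maps the same operation over its own data.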
8
Methodology for Parallelization
• The basis for performance gains via parallel processing lies in
decomposition. Dividing a computation into smaller computations and
assigning these to different processors for execution are the two key
steps in parallel algorithm design. The two most common decomposition
techniques:
• Functional decomposition is used to introduce concurrency in the
problems that can be solved by different independent tasks. All these
tasks can run concurrently.
• Data decomposition works best on an application that has a large data
structure. By partitioning the data on which the computations are
performed, a task is decomposed into smaller tasks to perform
computations on each data partition. The tasks performed on the data
partitions are usually similar. There are different ways to perform data
partitioning: partitioning input/output data or partitioning intermediate
data.
– Static/Dynamic
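A minimal sketch of static data decomposition in Python (the language and names are illustrative, not from the slides): the input array is partitioned into equal-sized chunks, and the same computation is performed on each partition.

```python
# Static data decomposition: partition the input array into roughly
# equal-sized chunks, one per worker.
def partition(data, n_workers):
    size = (len(data) + n_workers - 1) // n_workers  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

chunks = partition(list(range(10)), 3)
# 3 chunks: [0..3], [4..7], [8..9]

# Each chunk would be handed to a separate processor; the task performed
# on each partition is the same (here: summing), matching the observation
# that tasks on data partitions are usually similar.
partials = [sum(c) for c in chunks]
total = sum(partials)
print(total)   # 45
```

A dynamic variant would instead hand out chunks from a shared work queue as workers become free.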
9
How Much Do We Gain in Performance?
Amdahl’s law provides the theoretical basis to assess the potential benefits of
converting a serial application to a parallel one. It predicts the speedup
limit of parallelizing an application:
Tparallel = [(1 – P) + P/N] × Tserial + O(N)
where Tserial: time to run the application in its serial version,
P: parallelizable fraction of the application,
N: number of processors,
O(N): parallel overhead of using N threads.
Predict the speedup by looking at the scalability:
Scalability = Tserial / Tparallel
Theoretical limit of scalability (assuming no parallel overhead):
Scalability = 1 / [(1 – P) + P/N]
As N → ∞, Scalability → 1 / (1 – P)
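The speedup limit can be checked numerically; a small Python sketch of the formula above (assuming no parallel overhead):

```python
# Amdahl's law, ignoring the parallel overhead term O(N):
#   speedup = 1 / ((1 - P) + P / N)
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Example: 90% of the application is parallelizable.
print(round(amdahl_speedup(0.9, 10), 2))      # 5.26 on 10 processors
print(round(amdahl_speedup(0.9, 10**9), 2))   # approaches 1/(1-P) = 10.0
```

Even with effectively unbounded processors, the serial 10% caps the speedup at 10×, which is why shrinking the serial fraction matters more than adding cores.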
10
Challenges
• Decomposing applications into units of
computation that can be executed concurrently.
11
Parallel Programming Languages
• Parallel programming languages have not managed to address, at the same
time, both the natural specification of parallel algorithms and the handling
of architectural features needed to achieve performance.
12
Today: Programmers Are
Oblivious to the Processor
13
Emerging Software Crisis: 2002 – 20??
• Problem: sequential performance has been left behind by
Moore’s Law
– Continuous and reasonable performance improvements
are still needed
• to support new features and larger datasets
• while sustaining portability, malleability, and maintainability
without unduly increasing the complexity faced by the programmer
– Critical to keep up with the current rate of evolution in
software
Saman Amarasinghe, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare)
14
Programmers’ View of Memory
Till 1985:
• 386, ARM, MIPS, SPARC
• 1–10 MHz
• External memory access – 1 cycle
• Most instructions – 1 cycle
• C-language operations – 1 or 2 cycles
Since then:
• On-chip computation (clock speed) sped up faster (until 2005) than
off-chip communication with memory as feature sizes shrank
• The gap was filled by spending transistor budgets on caches, which hid the mismatch until 2005
• Caches, deep pipelining with bypasses, and superscalar execution burned power
to keep the illusion intact until 2005 – the tipping point
16
Innovations in Architecture
Architecture is nothing but a way of thinking about programming
Edsger Dijkstra
17
Where shall we search for the solution?
• Advances in Compilers
• Advances in Tools
18
Computer Architecture: Uniprocessors
• General-purpose uniprocessors stopped their historic
performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism
19
Technology Trends: Microprocessor
Capacity
Moore’s Law
[Chart: transistor counts of microprocessors (i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000) vs. year, 1970–2005, growing per Moore’s Law]
[Figures: processor power density trend approaching nuclear-reactor levels; multicore die diagram with per-core L3 caches and interconnects]
34
Parallel Computation Ideas
• Concepts -- looking at problems with “parallel eyes”
• Algorithms -- different resources; different goals
• Languages -- reduce control flow; increase
independence; new abstractions
• Hardware -- the challenge is communication, not
instruction execution
• Programming -- describe the computation without
saying it sequentially
• Practical wisdom about using parallelism
• Sequential programming is a special case of parallel
programming
35
A Naïve Parallel Execution
• Juggling -- event-based computation
• House construction -- parallel tasks, wiring
and plumbing performed at once
• Assembly line manufacture – pipelining, many
instances in process at once
• Call center -- independent tasks executed
simultaneously
• CRUX: Describe execution of tasks
36
Concurrent vs. Distributed
Programming
• Concurrency: performance/throughput
• Distributed: partial knowledge

             | Parallel | Distributed
Goal         | Speed    | Partial knowledge
Interactions | Frequent | Infrequent
Granularity  | Fine     | Coarse
Reliability  | Assumed  | Not assumed
37
A simple task
• Summing an array of numbers A[0], …, A[n-1]
into sum
• Classical (Sequential):
(…((sum+A[0])+A[1])+…)+A[n-1]
• How do we execute in parallel?
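One standard answer, sketched here in Python (the language and function name are illustrative): add the elements pairwise in a balanced tree, so the additions in each round are independent and n numbers can be summed in about log2(n) parallel steps instead of n-1 sequential ones.

```python
# Tree (pairwise) summation: instead of the left-to-right chain
# (...((sum + A[0]) + A[1]) + ...) + A[n-1], add disjoint pairs in each
# round; all additions within a round could run on separate processors.
def tree_sum(a):
    a = list(a)
    while len(a) > 1:
        if len(a) % 2:          # odd length: pad with the identity 0
            a.append(0)
        a = [a[i] + a[i + 1] for i in range(0, len(a), 2)]
    return a[0] if a else 0

print(tree_sum([1, 2, 3, 4, 5]))   # 15
```

This reordering is only valid because addition is associative, a point that matters once floating-point rounding or non-associative operators enter the picture.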