Lecture1 Introduction PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 43

CS 751: Principles of Concurrent,

Parallel and Distributed Programs


RK Shyamasundar

1
Syllabus and Mechanics
• SIC 201, Tuesday, & Friday 330-500PM

• TA: R. Harikrishnan (harikrishnan@cse.iitb.ac.in)


• OUTLINE

• Mechanics:
– Attending Classes – Mandatory
– Evaluation
• Quiz: Two – 20%
• Project/Presentation: 30%
– Midterm : 25%
– Final Exam: 30%

2
What we shall study
• Sequential + Concurrency + Distribution
– Nondeterminism– Guarded Command Notation
– Shared Variable Concurrency, Interleaving semantics
– Mutual Exclusion and other Classic Synchronization Problems
– Language Abstractions: Monitors
– Distributed Programming via Communication primitives
• Communicating Sequential processes Notation (CSP)
– Understand subtleties among Concurrency, parallelism, distribution
– Distributed Mutual exclusion, Remote Procedure Calls, Time Synchronization
• Strategies for Parallel Programming
– Thread Parallelism, Run Time Scheduling, Work stealing
• Logging, Recovery, Crashes, Concurrency Control, CDN (content delivery networks)
• PGAS Parallel Languages: X10 Features
• Map-reduce
• Consensus: Paxos, BFT, SBFT, PBFT
– Blockchains, Cryptocurrencies

3
Whither Programming: Sequential
to Parallel!

4
Software Crisis
“To put it quite bluntly: as long as there were no
machines, programming was no problem at all;
when we had a few weak computers,
programming became a mild problem, and
now we have gigantic computers,
programming has become an equally gigantic
problem."
Edsgar Dijkstra, 1972, Turing Award Lecture

5
First Software Crisis: 1960s-70s

• Problem: Assembly Language Programming


– Computers could handle larger more complex programs Needed to get
Abstraction and Portability without losing Performance

• Solution: High-level languages for von-Neumann


machines
– FORTRAN and C Provided “common machine language” for uniprocessors
– Single memory image. Single flow of control

6
Second Software Crisis: 80s and ’90s
Problem:
• Inability to build and maintain complex and robust applications requiring
multi-million lines of code developed by hundreds of programmers
– Needed to get Composability, Malleability and Maintainability
– High-performance was not an issue • left to Moore’s Law
• Programming in the Large Vs Programming in the Small
– Coordination, discipline of software system management.
– software management needed to focus from software requirements to realization and
at the same time manage migration of systems to new technology or new requirements.

• Solution:
– Object Oriented Programming - C++, C# and Java
– Better tools:
• Component libraries, Purify
– Better software engineering methodology
– Design patterns, specification, testing, code reviews
• UML ….

7
Parallel Programming Models
• Data parallel model.:
– same or similar computations are performed on different data repeatedly.
Image processing algorithms that apply a filter to each pixel are a common
example of data parallelism.
– OpenMP is an API that is based on compiler directives that can express a data
parallel model.
• Task parallel model. independent works are encapsulated in functions to
be mapped to individual threads, which execute asynchronously. Thread
libraries (e.g., the Win32 thread API or POSIX* threads) are designed to
express task-level concurrency.
• Hybrid models. Sometimes, more than one model may be applied to solve
one problem, resulting in a hybrid algorithm model. A database is a good
example of hybrid models. Tasks like inserting records, sorting, or indexing
can be expressed in a task-parallel model, while a database query uses the
data-parallel model to perform the same operation on different data.

8
Methodology for Parallelization
• Basis for enhancement of performance via parallel processing basically lies
with decomposition techniques. Dividing a computation into smaller
computations and assigning these to different processors for execution are
two key steps in parallel design. Two of the most common decomposition
techniques
• Functional decomposition is used to introduce concurrency in the
problems that can be solved by different independent tasks. All these
tasks can run concurrently.
• Data decomposition works best on an application that has a large data
structure. By partitioning the data on which the computations are
performed, a task is decomposed into smaller tasks to perform
computations on each data partition. The tasks performed on the data
partitions are usually similar. There are different ways to perform data
partitioning: partitioning input/output data or partitioning intermediate
data.
– Static/Dynamic

9
How much do we gain in performance

Amdahl’s law provides the theoretical basis to assess the potential benefits of
converting a serial application to a parallel one. It predicts the speedup
limit of parallelizing an application:
Tparallel = {( 1 – P ) + P / N} * Tserial + O(N)
where Tserial: time to run an application in serial version,
P: parallel portion of the process,
N: number of processors,
O(N): parallel overhead in using N threads.
Predict the speedup by looking at the scalability:
Scalability = Tserial / Tparallel
Theoretical limit of scalability (assuming there is no parallel overhead):
Scalability = 1/ {(1– P) + P/N}
When N→∞, Scalability→1/ (1– P)

10
Challenges
• Decomposing applications to units of
computation that can be execued concurrently.

• Continuing to meet the above as the number of


processors increase – Scalability.
• May not be enough concurrent jobs to keep the CPUs busy
• Shared work queues become a bottleneck
• Need to discover intrinsic parallelism (finding finer-grained
parallelism is a challenge).

11
Parallel Programming Languages
• Parallel programming languages have not been able to address the issues
of natural specification of parallel algorithms and handling of the
architectural features to achieve performance concurrently.

• In other words, there is a large gap between programming languages that


are too low level, requiring specification of many details that obscure the
meaning of the algorithm, and languages that are too high-level, making
the performance implications of various constructs unclear.

• However, in the context of sequential computing standard languages such


as C or Java do a reasonable job of bridging such a gap.

• Main challenge is to bridge the gap between specification and realization


(implementation) that would preserve natural specification of programs
and at the same time enable realization of efficient programs on different
architectures.

12
Today: Programmers are
Oblivious about the Processors

• Solid boundary between Hardware and Software


• Programmers don’t have to know anything about the processor
• High level languages abstract away the processors
– Ex: Java bytecode is machine independent
– Moore’s law does not require the programmers to know anything
about the processors to get good speedups
• Programs are oblivious to the processor
– works on all processors
– A program written in ’70 using C still works and is much faster today
– This abstraction provides a lot of freedom for the programmers

13
Emerging Software Crisis: 2002 – 20??
• Problem: Sequential performance is left behind by
Moore’s law
– Needed continuous and reasonable performance
improvements
• To support new features to support larger datasets
• While sustaining portability, malleability and maintainability
without unduly increasing complexity faced by the programmer
– Critical to keep-up with the current rate of evolution in
software

.
Saman Amarasinghe, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare)

14
Programmers View of Memory

Till 1985
• 386,ARM, MIPS, SPARC
• 1-10 MHZ
• ext mem Cyc – 1 cycle
• Most Inst -- 1 Cycle
• Lang. C -ops- 1 or 2 Cycles

•On chip computation (Clk Sp) – sped up faster (till 2005) than
off-chip communication (with memory) as feature sizes shrank
• Gap filled by transistor budgets on caches - filled the mismatch till 2005
• caches, deep-pipelining with bypasses, superscalar -- burned power to keep
the illusion in tact till 2005 – Tipping point

Scrapping single CPU Pentium 15


Is this an Issue?

• If we don’t get any more performance


– Stuck at the current functionality
– Software will move from a frontier field with a lot of innovation to a
mature one with lot of tweaking and optimization
• If the complexity of programming is too high
– Only a few can be programmers
– Hard to add innovative and novel functionality
• • If the software becomes more unreliable
– Already unreliable.
– Scary consequences if even more unreliable
• If software becomes too costly to produce
– Only a few software companies will be left in the world
– Less innovations

16
Innovations in Architecture
Architecture is nothing but a way of thinking about programming
Edsgar Dijkstra

• 1943 ENIAC has accumulators operating in parallel

• 1965 Dijkstra describes critical regions problem

• 1966 Mike Flynn introduced SIMD/MIMD classification

• 1967 parallelizing FORTRAN compiler

• 1967 Amdahl describes what became the Amdahl’s law

• 1974 conditional critical regions and monitors introduced

• 1978 Tony Hoare introduces CSP


…..

17
Where shall we search for the solution?

• Advances in Computer Architecture

• Advances in Programming Languages

• Advances in Compilers

• Advances in Tools

18
Computer Architecture: Uniprocessors
• General-purpose unicores stopped historic
performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism

19
20
Technology Trends: Microprocessor
Capacity

Moore’s Law

2X transistors/Chip Every 1.5 years


Gordon Moore (co-founder of Intel)
Called “Moore’s Law” predicted in 1965 that the transistor
density of semiconductor chips would
Microprocessors have double roughly every 18 months.
become smaller, denser, and
more powerful.
Slide source: Jack Dongarra
Microprocessor Transistors and Clock Rate
Growth in transistors per chip Increase in clock rate

100,000,000 1000

10,000,000
R10000 100
Pentium

Clock Rate (MHz)


1,000,000
Transistors

10
i80386
i80286 R3000
100,000 R2000

i8086 1
10,000
i8080
i4004
0.1
1,000
1970 1980 1990 2000
1970 1975 1980 1985 1990 1995 2000 2005
Year
Year

Why bother with parallel programming? Just wait a year or two…


Limit #1: Power density
Can soon put more transistors on a chip than can afford to turn on.
-- Patterson ‘07

Scaling clock speed (business as usual) will not work


10000 Sun’s
Surface
Rocket
1000
Nozzle
Power Density (W/cm2)

Nuclear
100
Reactor

8086 Hot Plate


10 4004 P6
8008 8085 386 Pentium®
286 486
8080 Source: Patrick
1 Gelsinger, Intel
1970 1980 1990 2000 2010
Year
24
25
26
What is Parallel Computing?
• Parallel computing: using multiple processors in
parallel to solve problems more quickly than with a
single processor
• Examples of parallel machines:
– A cluster computer that contains multiple PCs combined
together with a high speed network
– A Symmetric Multi Processor (SMP*) by connecting
multiple processors to a single memory system
– A Chip Multi-Processor (CMP) contains multiple
processors (called cores) on a single chip
• Concurrent execution comes from desire for
performance; unlike the inherent concurrency in a
multi-user distributed system
• * Technically, SMP stands for “Symmetric Multi-Processor”
Why writing (fast) parallel programs is
hard

From Kathy Yellick


Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling

All of these things makes parallel programming


even harder than sequential programming.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law
– let s be the fraction of work done sequentially, so
(1-s) is fraction parallelizable
– P = number of processors
Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P)
<= 1/s
• Even if the parallel part speeds up perfectly
performance is limited by the sequential part
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to
getting desired speedup
• Parallelism overheads include:
– cost of starting a thread or process
– cost of communicating shared data
– cost of synchronizing
– extra (redundant) computation
• Each of these can be in the range of milliseconds
(=millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work
to run fast in parallel (I.e. large granularity), but not so
large that there is not enough parallel work
Locality and Parallelism
Conventional
Storage Proc Proc
Hierarchy Proc
Cache Cache Cache
L2 Cache L2 Cache L2 Cache

interconnects
potential
L3 Cache L3 Cache L3 Cache

Memory Memory Memory

• Large memories are slow, fast memories are small


• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast cache
– the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
Load Imbalance
• Load imbalance is the time that some
processors in the system are idle due to
– insufficient parallelism (during that phase)
– unequal size tasks
• Examples of the latter
– adapting to “interesting parts of a domain”
– tree-structured computations
– fundamentally unstructured problems
• Algorithm needs to balance load
Need to Study Parallelism
• The extra power from parallel computers is
enabling in science, engineering, business, …
• Multicore chips present a new opportunity
• Deep intellectual challenges for CS – models,
programming languages, algorithms, HW, …
• Computational Thinking

34
Parallel Computation Ideas
• Concepts -- looking at problems with “parallel eyes”
• Algorithms -- different resources; different goals
• Languages -- reduce control flow; increase
independence; new abstractions
• Hardware -- the challenge is communication, not
instruction execution
• Programming -- describe the computation without
saying it sequentially
• Practical wisdom about using parallelism
• Sequential programming a special case of Parallel
programming

35
A Naïve Parallel Execution
• Juggling -- event-based computation
• House construction -- parallel tasks, wiring
and plumbing performed at once
• Assembly line manufacture – pipelining, many
instances in process at once
• Call center -- independent tasks executed
simultaneously
• CRUX: Describe execution of tasks
36
Concurrent Vs Distributed
Programming
• Concurrency: Performance/Throughput
• Distributed: Partial Knowledge

Parallel Distributed
Goal Speed Partial
Knowledge
Interactions frequent Not frequent
Granularity Fine coarse
Reliable Assumed Not assumed

37
A simple task
• Adding an array of numbers A[0],…,A[n-1]
with sum
• Classical (Sequential):
(…((sum+A[0])+A[1])+…)+A[n-1]
• How do we execute in parallel?

• (…((A[0]+A[1]) + (A[2]+A[3])) + ... + (A[n-


2]+A[n-1]))…)
38
Graphical Depiction

Same No. of operations with different orders


Researchers Dream
• Since 70s (Illiac IV days) the dream has been
to compile sequential programs into parallel
object code
– Four decades of continual, well-funded research
by smart researchers implies it’s hopeless
• For a tight loop summing numbers, its doable
• For other computations it has proved extremely
challenging to generate parallel code, even with
pragmas or other assistance from programmers
Why has it been elusive?
• It’s not likely a compiler will produce parallel code from
a C specification any time soon…
• Fact: For most computations, the “best” (practically,
not theoretically) sequential solution and a “best”
parallel solution are usually fundamentally different …
– Different solution paradigms imply computations are not
“simply” related
– Compiler transformations generally preserve the solution
paradigm
• CRUX: Programmer must discover Parallel Solution
Another example Computation
• Consider computing the prefix sums
for (i=1; i<n; i++) {
A[i]+=A[i-1]
} (* A[i] is the sum of the first i+1 elements *)
• Semantics ...
• A[0] is unchanged
• A[1] = A[1] + A[0]
• A[2] = A[2] + (A[1] + A[0])
• ...
• A[n-1] = A[n-1] + (A[n-2] + ( ... (A[1] + A[0]) … )
• How do we parallize and any adv. in doing so?
Comparison
Sequential Parallel

Computes the prefixes Computes the last

You might also like