Multicore-Architecture-And-Programming Notes


IV YEAR VIII SEM – CS6801 Multicore Architecture and Programming
Multi-core Architectures and Programming (MCAP) (Anna University)


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS6801 MULTI-CORE ARCHITECTURES AND PROGRAMMING

SEM:08
YEAR:04

BATCH 2015-2019


Vision of Institution
To build Jeppiaar Engineering College as an Institution of Academic Excellence in
Technical education and Management education and to become a World Class
University.
Mission of Institution

M1 To excel in teaching and learning, research and innovation by promoting the principles of scientific analysis and creative thinking

M2 To participate in the production, development and dissemination of knowledge and interact with national and international communities

M3 To equip students with values, ethics and life skills needed to enrich their lives and enable them to meaningfully contribute to the progress of society

M4 To prepare students for higher studies and lifelong learning, enrich them with the practical and entrepreneurial skills necessary to excel as future professionals and contribute to Nation’s economy

Program Outcomes (POs)


PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.


PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.

Vision of Department
To emerge as a globally prominent department, developing ethical computer
professionals, innovators and entrepreneurs with academic excellence through quality
education and research.

Mission of Department
M1 To create computer professionals with an ability to identify and formulate the engineering problems and also to provide innovative solutions through effective teaching learning process.

M2 To strengthen the core-competence in computer science and engineering and to create an ability to interact effectively with industries.

M3 To produce engineers with good professional skills, ethical values and life skills for the betterment of the society.

M4 To encourage students towards continuous and higher level learning on technological advancements and provide a platform for employment and self-employment.


Program Educational Objectives (PEOs)

PEO1 To address the real time complex engineering problems using innovative approach
with strong core computing skills.

PEO2 To apply core-analytical knowledge and appropriate techniques and provide solutions to real time challenges of national and global society

PEO3 Apply ethical knowledge for professional excellence and leadership for the
betterment of the society.

PEO4 Develop life-long learning skills needed for better employment and
entrepreneurship

Program Specific Outcomes (PSOs)


Students will be able to

PSO1 An ability to understand the core concepts of computer science and engineering and to enrich problem solving skills to analyze, design and implement software and hardware based systems of varying complexity.

PSO2 To interpret real-time problems with analytical skills and to arrive at cost effective and optimal solution using advanced tools and techniques.

PSO3 An understanding of social awareness and professional ethics with practical proficiency in the broad area of programming concepts by lifelong learning to inculcate employment and entrepreneurship skills.

BLOOM’S TAXONOMY LEVELS (BTL)

BTL1: Remembering
BTL2: Understanding
BTL3: Applying
BTL4: Analyzing
BTL5: Evaluating
BTL6: Creating


JEPPIAAR ENGINEERING COLLEGE


DEPARTMENT OF CSE
QUESTION BANK

CS6801 MULTI-CORE ARCHITECTURES AND PROGRAMMING    L T P C  3 0 0 3

OBJECTIVES:

The student should be made to:


Understand the challenges in parallel and multi-threaded programming.
Learn about the various parallel programming paradigms, and solutions.

UNIT I MULTI-CORE PROCESSORS 9


Single core to Multi-core architectures – SIMD and MIMD systems – Interconnection
networks -Symmetric and Distributed Shared Memory Architectures – Cache coherence -
Performance Issues –Parallel program design.

UNIT II PARALLEL PROGRAM CHALLENGES 9


Performance – Scalability – Synchronization and data sharing – Data races –
Synchronization primitives (mutexes, locks, semaphores, barriers) – deadlocks and livelocks –
communication between threads (condition variables, signals, message queues and pipes).

UNIT III SHARED MEMORY PROGRAMMING WITH OpenMP 9


OpenMP Execution Model – Memory Model – OpenMP Directives – Work-sharing
Constructs – Library functions – Handling Data and Functional Parallelism – Handling Loops -
Performance
Considerations.

UNIT IV DISTRIBUTED MEMORY PROGRAMMING WITH MPI 9


MPI program execution – MPI constructs – libraries – MPI send and receive – Point-to-point and Collective communication – MPI derived datatypes – Performance evaluation

UNIT V PARALLEL PROGRAM DEVELOPMENT 9


Case studies - n-Body solvers – Tree Search – OpenMP and MPI implementations and
comparison.
TOTAL: 45 PERIODS
OUTCOMES:
At the end of the course, the student should be able to:
Program Parallel Processors.
Develop programs using OpenMP and MPI.
Compare and contrast programming for serial processors and programming for parallel
processors.


TEXT BOOKS:
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-Kauffman/Elsevier,
2011.
2. Darryl Gove, “Multicore Application Programming for Windows, Linux, and Oracle Solaris”,
Pearson, 2011 (unit 2)
REFERENCES:
1. Michael J Quinn, “Parallel programming in C with MPI and OpenMP”, Tata McGraw Hill

COURSE OUTCOME

C409.1
Illustrate the challenges in parallel and multi threaded programming
C409.2 Explain the various parallel programming paradigms and solutions.

C409.3 Develop shared memory programs using OpenMP

C409.4 Develop Distributed memory programs using MPI

C409.5 Compare and contrast programming for serial processors and parallel processors.

S.No  UNIT      REF. BOOK                                                                                       PAGE NO.
1     UNIT I    Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-Kauffman/Elsevier, 2011.    1-8
2     UNIT II   Darryl Gove, “Multicore Application Programming for Windows, Linux, and Oracle Solaris”, Pearson, 2011.    8-14
3     UNIT III  Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-Kauffman/Elsevier, 2011.    14-22
4     UNIT IV   Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-Kauffman/Elsevier, 2011.    22-28
5     UNIT V    Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-Kauffman/Elsevier, 2011.    28-36


UNIT –I

MULTI-CORE PROCESSORS

Single core to Multi-core architectures – SIMD and MIMD systems – Interconnection networks -
Symmetric and Distributed Shared Memory Architectures – Cache coherence - Performance
Issues –Parallel program design.
Q. No.   Questions   CO   Bloom’s Level

1. What is a single-core processor? C409.1 BTL1
A single-core processor has only one processing core on the die to process instructions.
2. What are the problems of single-core processors? C409.1 BTL1
As we try to increase the clock speed of the processor, the amount of heat produced by the chip also increases. This is a major hindrance to the continued evolution of single-core processors.
3. What is a multicore processor? C409.1 BTL1
A multi-core processor is a single computing component with two or more independent processing units (called "cores"), which are units that read and execute program instructions. The multiple cores are embedded in the same die. A multicore processor may look like a single processor, but it actually contains two (dual-core), three (tri-core), four (quad-core), six (hexa-core), eight (octa-core) or ten (deca-core) cores. Some processors even have 22 or 32 cores.
4. What are the problems with multicore processors? C409.1 BTL1
According to Amdahl’s law, the performance of parallel computing is limited by its serial components. So increasing the number of cores may not be the best solution; there is also a need to increase the clock speed of the individual cores.

5. Comparison of single-core processor and multi-core processor. C409.1 BTL1
• Number of cores on a die: a single-core processor has a single core; a multi-core processor has multiple cores.
• Instruction execution: a single-core processor can execute a single instruction at a time; a multi-core processor can execute multiple instructions by using multiple cores.
• Gain: a single-core processor speeds up every program or software being executed; a multi-core processor speeds up the programs which are designed for multi-core processors.
• Performance: for a single-core processor it is dependent on the clock frequency of the core; for a multi-core processor it is dependent on the frequency, the number of cores and the program being executed.
• Examples: single-core processors launched before 2005 such as 80386, 486, AMD 29000, AMD K6, Pentium I, II, III etc.; multi-core processors launched after 2005 such as Core 2 Duo, Athlon 64 X2, i3, i5 and i7 etc.

What is meant by Single instruction, multiple data (SIMD) C409.1 BTL1


Single instruction, multiple data (SIMD), is a class of parallel computers
in Flynn's taxonomy. It describes computers with multiple processing
6
elements that perform the same operation on multiple data points
simultaneously. Thus, such machines exploit data level parallelism, but not
concurrency: there are simultaneous (parallel) computations, but only a
single process (instruction) at a given moment
What is meant by multiple instruction, multiple data (MIMD) C409.1 BTL1

In computing, MIMD (multiple instruction, multiple data) is a


technique employed to achieve parallelism. Machines using MIMD have a
number of processors that function asynchronously and independently. At
7 any time, different processors may be executing different instructions on
different pieces of data. MIMD architectures may be used in a number of
application areas such as computer-aided design/computer-aided
manufacturing, simulation, modeling, and as communication switches.
MIMD machines can be of either shared memory or distributed memory
categories.

Define Multistage interconnection networks. C409.1 BTL1

Multistage interconnection networks (MINs) are a class of high-


8 speed computer networks usually composed of processing elements (PEs)
on one end of the network and memory elements (MEs) on the other end,
connected by switching elements (SEs).

9. What is meant by routing? C409.1 BTL1
Routing determines how a message gets from its source to its destination in the interconnection network. Routing can be static or adaptive.

Downloaded by SHARMILA RANI (sharmilarajme@gmail.com)


lOMoARcPSD|30789968

10. What is a network interface? C409.1 BTL1
The network interface connects endpoints (e.g. cores) to the network and decouples computation from communication.
11. What is a centralized shared-memory multiprocessor? C409.1 BTL1
Processors share a single centralized memory, and processors and memory are interconnected by a bus.
• Also known as a “uniform memory access” (UMA) or “symmetric (shared-memory) multiprocessor” (SMP).
– A symmetric relationship to all processors.
– A uniform memory access time from any processor.
What is the concept of Caching in shared-memory machines. C409.1 BTL1
• private data: used by a single processor
– When a private item is cached, its location is migrated to the cache
12 – Since no other processor uses the data, the program behavior is
identical to that in a uniprocessor
• shared data: used by multiple processor
– When shared data are cached, the shared value may be replicated
in multiple caches.
13. What is cache coherence? C409.1 BTL1
Cache coherence ensures that all cached copies of shared data hold consistent values. Two properties are critical to performance when accessing shared data:
• Migration: a data item can be moved to a local cache and used there in a transparent fashion.
• Replication: shared data that are being simultaneously read may be replicated in multiple caches.
What is meant by Snooping Solution (Snoopy Bus). C409.1 BTL1
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond
14
accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
What is meant by Directory-Based Schemes. C409.1 BTL1
– Directory keeps track of what is being shared in a centralized place
(logically)
– Distributed memory => distributed directory for scalability (avoids
15
bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes

When is a memory system coherent? C409.1 BTL1


A memory system is coherent if:
• P writes to X; no other processor writes to X; P reads X and
16
receives the value previously written by P
• P1 writes to X; no other processor writes to X; sufficient time
lapses; P2 reads X and receives value written by P1
• Two writes to the same location by two processors are seen in the


same order by all processors – write serialization


• The memory consistency model defines “time elapsed” before the
effect of a processor is seen by others.
17. What is meant by a distributed-memory system? C409.1 BTL1
A distributed-memory system (often called a multicomputer) consists of multiple independent processing nodes with local memory modules, connected by a general interconnection network. Software DSM systems can be implemented in an operating system or as a programming library, and can be thought of as extensions of the underlying virtual memory architecture.

18. What is the difference between message passing and DSM? C409.1 BTL1
Message passing:
• Variables have to be marshalled.
• The cost of communication is obvious.
• Processes are protected by having a private address space.
• Processes should execute at the same time.
Distributed shared memory:
• Variables are shared directly.
• The cost of communication is invisible.
• Processes could cause errors by altering data.
• Executing the processes may happen with non-overlapping lifetimes.
What are the Advantages of DSM. (Apr/May 2018) C409.1 BTL1

 System scalable
 Hides the message passing
 Can handle complex and large data bases without replication or
sending the data to processes
19  DSM is usually cheaper than using multiprocessor system
 No memory access bottleneck, as no single bus
 DSM provides large virtual memory space
 DSM programs portable as they use common DSM programming
interface
 Shields programmer from sending or receive primitives
DSM can (possibly) improve performance by speeding up data access


What are the issues in implementing DSM software? C409.1 BTL1


 Data is replicated or cached
 Reduce delays
 Semantics for concurrent access must be clearly specified
20  DSM is controlled by memory management software, operating
system, language run-time system
 Locating remote data
 Granularity: how much data is transferred in a single operation

What are the Disadvantages of DSM (Apr/May 2018) C409.1 BTL1

21  Could cause a performance penalty


 Should provide for protection against simultaneous access to shared
data such as lock
 Performance of irregular problems could be difficult
22. What are the methods of achieving DSM? C409.1 BTL1
There are usually two methods of achieving distributed shared memory:
 hardware, such as cache coherence circuits and network interfaces;
 software, which can be applied in different ways, such as modifying the operating system kernel.
23. What is meant by a consistency model? C409.1 BTL1
The memory system tries to behave according to certain rules, and these rules form the system's consistency model.
Define Vector Instruction?(Apr/May2017) C409.1 BTL1

A vector processor or array processor is a central processing unit


24 (CPU) that implements an instruction set containing instructions that
operate on one-dimensional arrays of data called vectors, compared to scalar
processors, whose instructions operate on single data items.

What is meant by Snooping cache coherence? (Apr/May 2017) C409.1 BTL1

Also referred to as a bus-snooping protocol, a protocol for


25 maintaining cache coherency in symmetric multiprocessing environments.
In a snooping system, all caches on the bus monitor (or snoop) the bus to
determine if they have a copy of the block of data that is requested on the
bus.

26. Compare symmetric memory architecture and distributed memory architecture. (Nov/Dec 2017) C409.1 BTL1
1. Symmetric memory architecture: shared-memory, cache-coherent multiprocessor systems; the processors communicate with each other and with shared main memory over a shared bus.
   Distributed memory architecture: a multiprocessor computer system in which each processor has its own private memory; computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors.
2. Symmetric memory architecture: any access from any processor to main memory has equal latency.
   Distributed memory architecture: any access from any processor to main memory may have different latency.

What are multiprocessor systems and give their advantages? (Nov/Dec C409.1 BTL1
2017)

Multiprocessor systems also known as parallel systems or tightly


coupled systems are systems that have more than one processor in close
27
communication, sharing the computer bus, the clock and sometimes
memory & peripheral devices.
Their main advantages are:
1. Increased throughput
2. Economy of scale
3. Increased reliability
28 Define Channel? C409.1 BTL1
A single logical connection between routers/switches
List the pros and cons of distributed system (Apr/May 2018) C409.1 BTL1


System scalable

Hides the message passing

Can handle complex and large data bases without replication or
sending the data to processes
 DSM is usually cheaper than using multiprocessor system
 No memory access bottleneck, as no single bus
29
 DSM provides large virtual memory space
 DSM programs portable as they use common DSM programming
interface
 Shields programmer from sending or receive primitives
DSM can (possibly) improve performance by speeding up data access

 Could cause a performance penalty


 Should provide for protection against simultaneous access to shared
data such as lock


Performance of irregular problems could be difficult

Define the symmetric shared memory (Apr/May 2018, Nov/Dec 2018) C409.1 BTL1
30 Symmetric Shared Memory Architecture consists of several processors
with a single physical memory shared by all processors through a shared
bus
List out the advantages of multicore CPU C409.1 BTL1

 The largest boost in performance will likely be noticed in improved


response time while running CPU intensive processes, like anti-virus
scans, ripping/burning media.
 Assuming that the die can fit into the package, physically, the multi-core
31
CPU designs require much less printed Circuit Board(PCB) space than
multichip SMP designs.
 Also, a dual core processor uses slightly less power than two coupled
single core processors, principally because of the decreased power required
to drive signals external to the chip

Define Vector Registers C409.1 BTL1

32 These are registers capable of storing a vector of operands and operating


simultaneously on their contents. The vector length is fixed by the system, and can
range from 4 to 128 64-bit elements. Vectorized and pipelined functional units.

Define Latency and Bandwidth C409.1 BTL1

 The latency is the time that elapses between the source’s beginning
to transmit the data and the destination’s starting to receive the first
33 byte.
 The bandwidth is the rate at which the destination receives data after
it has started to receive the first byte. So if the latency of an
interconnect is l seconds and the bandwidth is b bytes per second,
then the time it takes to transmit a message of n bytes is
 Message transmission time= l+n/b
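 As a quick worked example (the numbers below are illustrative, not from the syllabus): with latency l = 5 microseconds and bandwidth b = 1 GB/s, sending an n = 2 MB message takes roughly 5×10^-6 + (2×10^6)/(10^9) ≈ 0.002005 s, i.e. about 2 ms, so the latency term is negligible; for a 100-byte message the same formula gives about 5.1 microseconds, so the latency term dominates.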
List out the approaches in cache coherence C409.1 BTL1
34
 Snooping cache coherence
 Directory-based cache coherence.

35. List the steps involved in parallel program design. C409.1 BTL1
1. Partitioning
2. Communication
3. Aggregation (agglomeration)
4. Mapping

Define Directory based cache coherence C409.1 BTL1

It is protocols attempt to solve this problem through the use of a data


36 structure called a directory. The directory stores the status of each cache
line. Typically, this data structure is distributed; in our example, each
core/memory pair might be responsible for storing the part of the structure
that specifies the status of the cache lines in its local memory
37. Define parallel overhead. C409.1 BTL1
Parallel overhead is the extra time a parallel program spends on work the serial program does not do, such as starting threads or processes, communication and synchronization. Hence
Tparallel = Tserial/p + Toverhead.

38. Define scalability. C409.1 BTL1
Suppose we increase the number of processes/threads that are used by the program. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is scalable.
39. Define Amdahl’s law. (Nov/Dec 2018) C409.1 BTL1
Amdahl made an observation that has become known as Amdahl’s law. It says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available.
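A small worked sketch of the law (the 90% parallel fraction and the core counts are illustrative assumptions, not figures from the textbook):

#include <stdio.h>

/* Amdahl's law: if a fraction f of the serial run time can be
   parallelized over p cores and the rest stays serial, the best
   possible speedup is S = 1 / ((1 - f) + f / p).                 */
double amdahl_speedup( double f, int p )
{
    return 1.0 / ( (1.0 - f) + f / p );
}

int main( void )
{
    /* With 90% of the work parallelized, 8 cores give only about
       4.7x speedup, and the limit as p grows is 10x.              */
    printf( "p = 8:    S = %.2f\n", amdahl_speedup( 0.9, 8 ) );
    printf( "p = 1000: S = %.2f\n", amdahl_speedup( 0.9, 1000 ) );
    return 0;
}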
40. Define speedup and efficiency. C409.1 BTL1
If Tserial is the serial run-time and Tparallel the parallel run-time on p cores, the best we can hope for is Tparallel = Tserial/p. The speedup of a parallel program is S = Tserial/Tparallel, and a program with S = p is said to have linear speedup, which is unusual in practice. The efficiency is E = S/p; as p increases, we expect S to become a smaller and smaller fraction of the ideal, linear speedup p, so the efficiency typically falls.


How to parallelize the serial program C409.1 BTL1

 For the first step we might identify two types of tasks: finding the
41 bin to which an element of data belongs and incrementing the
appropriate entry in bin counts.
 For the second step, there must be a communication between the
computation of the appropriate bin and incrementing an element of
bin counts.
List out the different distributed memory interconnects C409.1 BTL1
42 Distributed-memory interconnects are often divided into two groups:
Direct interconnects and Indirect interconnects

Define Direct Interconnects C409.1 BTL1

In a direct interconnect each switch is directly connected to processor


43 memory pair, and the switches are connected to each other.
As before, the circles are switches, the squares are processors, and the lines
are bidirectional links.
A ring is superior to a simple bus since it allows multiple simultaneous
communications.
Define Indirect Interconnects C409.1 BTL1

 They provide an alternative to direct interconnects. In an indirect


44 interconnect the switches may not be directly connected to a
processor.
 They’re often shown with unidirectional links and a collection of
processors, each of which has an outgoing and an incoming link, and
a switching network.
Define Ideal Direct interconnect C409.1 BTL1
45
The ideal direct interconnect is a fully connected network in which each
switch is directly connected to every other switch
Define Hypercube C409.1 BTL1

 It is a highly connected direct interconnect that has been used in


46 actual systems. Hypercubes are built inductively:
A one-dimensional hypercube is a fully connected system with two
processors.
 A two-dimensional hypercube is built from two one-dimensional
hypercubes by joining “corresponding” switches
Define Interleaved memory C409.1 BTL1
47
The memory system consists of multiple “banks” of memory, which can be
accessed more or less independently. After accessing one bank, there will be


a delay before it can be accessed again, but a different bank can be accessed


much sooner. So if the elements of a vector are distributed across multiple
banks, there can be little to no delay in loading/storing successive elements.
Define Strided memory C409.1 BTL1

48 In strided memory access, the program accesses elements of a vector located


at fixed intervals.For example, accessing the first element, the fifth element,
the ninth element, and so on, would be strided access with a stride of four

Define Graphics Processor Pipeline C409.1 BTL1

49 Real-time graphics application programming interfaces, or APIs, use


points,lines,and triangles to internally represent the surface of an object.
They use a graphics processing pipeline to convert the internal
representation into an array of pixels that can be sent to a computer screen
List out the two principal types of MIMD system C409.1 BTL1

50  Shared Memory System

 Distributed Memory system

PART B

Q. No.   Questions   CO   Bloom’s Level

Explain Single core to Multi-core Architectures . C409.1


1 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan- BTL5
Kauffman/Elsevier, 2011 Page No:15-20

Explain SIMD and MIMD systems (Apr/May2017,Nov/Dec 2017) C409.1


2 BTL5
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:29-34
Explain about Interconnection networks? (Apr/May 2017) C409.1 BTL5

3 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No:35-44

Explain with neat diagram Symmetric Shared Memory Architectures C409.1 BTL5
4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:35-42


Explain with neat diagram Distributed Shared Memory Architectures C409.1 BTL5
(Nov/Dec 2018)
5
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:43-45

Explain Cache coherence in Symmetric Shared and Distributed Shared C409.1 BTL5
Memory Architectures
6
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No: 50-55

Explain the performance issues of multicore processor. .(Nov/Dec 2017) C409.1 BTL5
7 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:58-64

Define cache coherence problem. What are the 2 main approaches to C409.1 BTL5
cache coherence? Describe working of snooping cache coherence and
8 explain describe directory based coherence. (Nov/Dec 2017)
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:43-45

Explain parallel program design (Apr/May 2017,Nov/Dec 2018) C409.1 BTL5


9
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:65-69
State and explain Amdahl’s law Outline the steps in designing and C409.1 BTL5

10
building parallel program. Give example (Apr/May 2018)

1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011
Elaborate the classification of computer architecture in parallel C409.1 BTL5

11 computing system (Apr/May 2018)

1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011
Explain Directory Based cache coherence protocol C409.1 BTL4
12
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011


Generalize the snooping protocol briefly C409.1 BTL6


13
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Summarize the Parallelizing the serial program C409.1 BTL5
14
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Explain the Shared memory interconnect C409.1 BTL3
15
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Highlight the limitations of single core processors and outline how C409.1 BTL3
multicore architecture overcome the limitations
16
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011


UNIT II
PARALLEL PROGRAM CHALLENGES

Performance – Scalability – Synchronization and data sharing – Data races – Synchronization


primitives (mutexes, locks, semaphores, barriers) – deadlocks and livelocks – communication
between threads (condition variables, signals, message queues and pipes).

PART A

Q. No.   Questions   CO   Bloom’s Level

What is data race? C409.2 BTL1


1
A data race occurs when multiple threads use the same data item and one or
more of those threads are updating it.
2. How can data races be avoided? C409.2 BTL1
One way to avoid a data race is by using proper synchronization between threads.
3. How to avoid data races. C409.2 BTL1
Although it can be hard to identify data races, avoiding them can be very simple: make sure that only one thread can update the variable at a time. The easiest way to do this is to place a synchronization lock around all accesses to that variable and ensure that before referencing the variable, the thread must acquire the lock.
What is the use of Synchronization Primitives? List out the various C409.2 BTL1
synchronization primitive in parallel programming (Nov/Dec 2018)
Synchronization is used to coordinate the activity of multiple
threads. There are various situations where it is necessary; this might be to
4 ensure that shared resources are not accessed by multiple threads
simultaneously or that all work on those resources is complete before new
work starts.
Process Synchronization
Memory Synchronization
Thread Synchronization
What is the simplest form of synchronization? C409.2 BTL1
The simplest form of synchronization is a mutually exclusive
5
(mutex) lock. Only one thread at a time can acquire a mutex lock, so they
can be placed around a data structure to ensure that the data structure is
modified by only one thread at a time.
6. How to place mutex locks around accesses to variables. C409.2 BTL1
int counter;
mutex_lock mutex;

void Increment()
{
  acquire( &mutex );
  counter++;
  release( &mutex );
}

void Decrement()
{
  acquire( &mutex );
  counter--;
  release( &mutex );
}
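The acquire()/release() calls above are pseudocode. A minimal sketch of the same idea with POSIX threads (assuming a Pthreads environment; the function names simply mirror the pseudocode) is:

#include <pthread.h>

int counter = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void Increment( void )
{
    pthread_mutex_lock( &mutex );    /* only one thread at a time passes this point */
    counter++;
    pthread_mutex_unlock( &mutex );
}

void Decrement( void )
{
    pthread_mutex_lock( &mutex );
    counter--;
    pthread_mutex_unlock( &mutex );
}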
What is meant by contended mutex. C409.2 BTL1
7 If multiple threads are attempting to acquire the same mutex at the
same time, then only one thread will succeed, and the other threads will
have to wait. This situation is known as a contended mutex.
What is critical region. C409.2 BTL1
8 The region of code between the acquisition and release of a mutex
lock is called a critical section, or critical region. Code in this region will be
executed by only one thread at a time.
9. Develop the code for placing a mutex lock around a region of code. C409.2 BTL6
void * threadSafeMalloc( size_t size )
{
  acquire( &mallocMutex );
  void * memory = malloc( size );
  release( &mallocMutex );
  return memory;
}
What is meant by Spin Locks. C409.2
10 BTL1
Spin locks are essentially mutex locks. The thread waiting to acquire a spin
lock will keep trying to acquire the lock without sleeping
Compare spin lock and mutex lock. (Apr/May 2018) C409.2
The difference between a mutex lock and a spin lock is that a thread
11 BTL2
waiting to acquire a spin lock will keep trying to acquire the lock without
sleeping .In comparison; a mutex lock may sleep if it is unable to acquire
the lock.
Write the advantage of spin locks. C409.2 BTL1
12 The advantage of using spin locks is that they will acquire the lock as soon
as it is released, whereas a mutex lock will need to be woken by the
operating system before it can get the lock
What is the disadvantage of spin locks. C409.2 BTL1
The disadvantage is that a spin lock will spin on a virtual CPU
13 monopolizing that resource. In comparison, a mutex lock will sleep and free
the virtual CPU for another thread to use.


What is Semaphores? C409.2 BTL1


Semaphores are counters that can be either incremented or
14 decremented. They can be used in situations where there is a finite limit to a
resource and a mechanism is needed to impose that limit. Semaphores will
also signal or wake up threads that are waiting on them to use available
resources
What is an example for semaphore. C409.2 BTL1
15 An example might be a buffer that has a fixed size. Every time an element
is added to a buffer, the number of available positions is decreased. Every
time an element is removed, the number available is increased
What is wait and release in semaphore. C409.2 BTL1
the method that acquires a semaphore might be called wait, down, or
16
acquire, and the method to release a semaphore might be called post,up,
signal, or release. When the semaphore no longer has resources available,
the threads requesting resources will block until resources are available.
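A minimal sketch using POSIX unnamed semaphores (the buffer limit of 10 and the function names are illustrative assumptions):

#include <semaphore.h>

sem_t free_slots;                     /* counts the free positions in a fixed-size buffer */

void init_buffer( void )
{
    sem_init( &free_slots, 0, 10 );   /* assume the buffer holds at most 10 elements */
}

void add_element( void )
{
    sem_wait( &free_slots );          /* take a slot; blocks when the buffer is full */
    /* ... store the element ... */
}

void remove_element( void )
{
    /* ... take the element out ... */
    sem_post( &free_slots );          /* free a slot and possibly wake a waiting thread */
}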
17. Define readers-writer lock. C409.2 BTL1
A readers-writer lock (or multiple-reader lock) allows many threads to read the shared data but can then lock the reader threads out to allow one thread to acquire a writer lock to modify the data.
18. Show an example of a readers-writer lock. C409.2 BTL1
int readData( int cell1, int cell2 )
{
  acquireReaderLock( &lock );
  int result = data[cell1] + data[cell2];
  releaseReaderLock( &lock );
  return result;
}

void writeData( int cell1, int cell2, int value )
{
  acquireWriterLock( &lock );
  data[cell1] += value;
  data[cell2] -= value;
  releaseWriterLock( &lock );
}
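The acquireReaderLock()/acquireWriterLock() calls are pseudocode; a sketch of the same pattern with a POSIX pthread_rwlock_t (the array size and lock name are illustrative assumptions) is:

#include <pthread.h>

int data[100];
pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

int readData( int cell1, int cell2 )
{
    pthread_rwlock_rdlock( &lock );      /* many readers can hold the lock at once */
    int result = data[cell1] + data[cell2];
    pthread_rwlock_unlock( &lock );
    return result;
}

void writeData( int cell1, int cell2, int value )
{
    pthread_rwlock_wrlock( &lock );      /* a writer gets exclusive access */
    data[cell1] += value;
    data[cell2] -= value;
    pthread_rwlock_unlock( &lock );
}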
19. What is the use of barriers? C409.2 BTL1
There are situations where a number of threads must all complete their work before any of the threads can start on the next task. In these situations, it is useful to have a barrier where the threads will wait until all are present.
20. Show one common example of using a barrier. C409.2 BTL1
One common example of using a barrier arises when there is a dependence between different sections of code. For example, suppose a number of threads compute the values stored in a matrix. The variable total needs to be calculated using the values stored in the matrix. A barrier can be used to ensure that all the threads complete their computation of the matrix before the variable total is calculated. The following example shows a situation using a barrier to separate the calculation of a variable from its use.

Compute_values_held_in_matrix();
Barrier();
total = Calculate_value_from_matrix();
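A runnable sketch of the same idea with a POSIX barrier (the thread count of 4 and the helper function are illustrative assumptions):

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
pthread_barrier_t barrier;

/* stand-in for each thread's share of the matrix computation */
static void compute_values_held_in_matrix( long id )
{
    printf( "thread %ld finished its part of the matrix\n", id );
}

void *worker( void *arg )
{
    long id = (long)arg;
    compute_values_held_in_matrix( id );   /* phase 1: every thread fills its part       */
    pthread_barrier_wait( &barrier );      /* wait until all threads reach this point    */
    /* phase 2: it is now safe to use values produced by the other threads */
    return NULL;
}

int main( void )
{
    pthread_t t[NTHREADS];
    pthread_barrier_init( &barrier, NULL, NTHREADS );
    for ( long i = 0; i < NTHREADS; i++ )
        pthread_create( &t[i], NULL, worker, (void *)i );
    for ( int i = 0; i < NTHREADS; i++ )
        pthread_join( t[i], NULL );
    pthread_barrier_destroy( &barrier );
    return 0;
}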
21 What is data sharing. (Apr/May 2017) C409.2 BTL1
Sharing data between multiple threads is called data sharing.
22. What is the difference between a deadlock and a livelock? (Apr/May 2017)(Nov/Dec 2018) C409.2 BTL1
A deadlock occurs where two or more threads cannot make progress because the resources that they need are held by the other threads. Example: suppose two threads need to acquire mutex locks A and B to complete some task. If thread 1 has already acquired lock A and thread 2 has already acquired lock B, then thread 1 cannot make forward progress because it is waiting for lock B, and thread 2 cannot make progress because it is waiting for lock A. The two threads are deadlocked. In a livelock, by contrast, the threads are not blocked but keep reacting to each other, for example repeatedly releasing and reacquiring locks, without making any forward progress.

What are conditions under which a deadlock situation may arise? C409.2
(Nov/Dec 2017)
Mutual Exclusion: At least one resource is held in a non-sharable
mode that is only one process at a time can use the resource. If
another process requests that resource, the requesting process must
be delayed until the resource has been released.
Hold and Wait:There must exist a process that is holding at least
one resource and is waiting to acquire additional resources that are
23 BTL1
currently being held by other processes.
No Preemption: Resources cannot be preempted; that is, a resource
can only be released voluntarily by the process holding it, after the
process has completed its task.
Circular Wait: There must exist a set {p0, p1, ..., pn} of waiting
processes such that p0 is waiting for a resource which is held by p1,
p1 is waiting for a resource which is held by p2, ..., pn-1 is waiting for
a resource which is held by pn, and pn is waiting for a resource which
is held by p0.
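A sketch of how such a deadlock arises with two Pthreads mutexes acquired in opposite orders (driver code omitted; the lock and function names are illustrative):

#include <pthread.h>

pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

/* Thread 1 takes A then B; thread 2 takes B then A.  If each thread gets
   its first lock before the other releases anything, both block forever:
   mutual exclusion, hold-and-wait, no preemption and circular wait all hold. */
void *thread1( void *arg )
{
    pthread_mutex_lock( &lockA );
    pthread_mutex_lock( &lockB );    /* may wait forever for thread 2 */
    /* ... work ... */
    pthread_mutex_unlock( &lockB );
    pthread_mutex_unlock( &lockA );
    return arg;
}

void *thread2( void *arg )
{
    pthread_mutex_lock( &lockB );
    pthread_mutex_lock( &lockA );    /* may wait forever for thread 1 */
    /* ... work ... */
    pthread_mutex_unlock( &lockA );
    pthread_mutex_unlock( &lockB );
    return arg;
}
/* One standard fix is to impose a global lock order, e.g. always acquire lockA before lockB. */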


Define thread. mention the uses of swapping. (Nov/Dec 2017) C409.2


A thread is the smallest unit of processing that can be performed in an OS.
24 In most modern operating systems, a thread exists within a process - that is, BTL1

a single process may contain multiple threads

Define deadlock. C409.2 BTL1


Deadlock is the situation where two or more threads cannot make progress
because the resources that they need are held by the other threads. It is
easiest to explain this with an example. Suppose two threads need to acquire
mutex locks A and B to complete some task. If thread 1 has already
25
acquired lock A and thread 2 has already acquired lock B, then thread 1 cannot
make forward progress because it is waiting for lock B, and thread 2 cannot
make progress because it is waiting for lock A. The two threads are
deadlocked

How to communicate multiple threads. C409.2 BTL1


The easiest way for multiple threads to communicate is through
memory. If two threads can access the same memory location, the cost of
that access is little more than the memory latency of the system. Of course,
26 memory accesses still need to be controlled to ensure that only one thread
writes to the same memory location at a time. A multithreaded application
will share memory between the threads by default, so this can be a very
low-cost approach. The only things that are not shared between threads are
variables on the stack of each thread (local variables) and thread-local
variables.
Illustrate an example which Use Multiple Barriers. C409.2
Compute_values_held_in_matrix();
27 Barrier(); BTL2
total = Calculate_value_from_matrix();
Barrier();
Perform_next_calculation( total );
Define live lock. C409.2
28 BTL1
A livelock traps threads in an unending loop releasing and acquiring locks.
Livelocks can be caused by code to back out of deadlocks.
What is An atomic operation C409.2
29. An atomic operation is one that will either successfully complete or BTL1
fail; it is not possible for the operation to either result in a “bad” value or
allow other threads on the system to observe a transient value.
30 Write down the performance metrics (Apr/May 2018) C409.2 BTL1


Define Message Queue C409.2


A message queue is a structure that can be shared between multiple
31 BTL1
processes. Messages can be placed into the queue and will be removed in
the same order in which they were added. Constructing a message queue
looks rather like constructing a shared memory segment.
Define Named Pipes C409.2
UNIX uses pipes to pass data from one process to another. For example, the
32 BTL1
output from the command ls, which lists all the files in a directory, could be
piped into the wc command, which counts the number of lines, words, and
characters in the input
33. Mention the mechanism associated with named pipes. C409.2 BTL1
Setting up and writing into a pipe:
  Make Pipe( Descriptor );
  ID = Open Pipe( Descriptor );
  Write Pipe( ID, Message, sizeof(Message) );
  ...
  Close Pipe( ID );
  Delete Pipe( Descriptor );
Opening an existing pipe to receive messages:
  ID = Open Pipe( Descriptor );
  Read Pipe( ID, buffer, sizeof(buffer) );
  Close Pipe( ID );
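The calls above are pseudocode. A runnable sketch of the same mechanism with a POSIX named pipe (FIFO); the pipe path and message text are illustrative assumptions:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>

int main( void )
{
    const char *path = "/tmp/demo_fifo";      /* illustrative pipe name */
    mkfifo( path, 0666 );                     /* create the named pipe  */

    if ( fork() == 0 )                        /* child process: the writer */
    {
        int fd = open( path, O_WRONLY );      /* blocks until a reader opens the pipe */
        const char *msg = "hello through the pipe";
        write( fd, msg, strlen( msg ) + 1 );
        close( fd );
        _exit( 0 );
    }

    char buffer[64];                          /* parent process: the reader */
    int fd = open( path, O_RDONLY );
    read( fd, buffer, sizeof( buffer ) );
    printf( "received: %s\n", buffer );
    close( fd );
    wait( NULL );
    unlink( path );                           /* remove the pipe when done */
    return 0;
}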
34. How to create and place message queues. C409.2 BTL1
Creating and placing messages into a queue:
  ID = Open Message Queue( Descriptor );
  Put Message in Queue( ID, Message );
  Close Message Queue( ID );
  Delete Message Queue( Descriptor );
Using the descriptor for an existing message queue enables two processes to communicate by sending and receiving messages through the queue.
Opening a queue and receiving messages:
  ID = Open Message Queue( Descriptor );
  Message = Remove Message from Queue( ID );
  ...
  Close Message Queue( ID );
35. What are the two fundamental problems that can arise when threads share access to resources? C409.2 BTL1
 Deadlock
 Livelock
36. Give an example of critical regions. C409.2 BTL1
Suppose an operating system does not have an implementation of malloc() that is thread-safe, that is, safe for multiple threads to call at the same time. One way to fix this is to place the call to malloc() in a critical section by surrounding it with a mutex lock.
List out the issues in shared caches C409.2
37 BTL1
 Capacity misses
 Conflict misses
Define False sharing C409.2
False sharing is the situation where multiple threads are accessing items of
38 data held on a single cache line. BTL1
Although the threads are all using separate items of data, the cache line
itself is shared between them so only a single thread can write to it at any
one time.
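A sketch of how padding is commonly used to avoid false sharing (the 64-byte cache-line size, the thread count and the loop bound are assumptions; driver code omitted):

#include <pthread.h>

/* Four threads each update their own counter.  Without the padding, all
   four long values can sit on one 64-byte cache line, so the line bounces
   between the cores even though no data is logically shared.              */
struct padded_counter
{
    long value;
    char pad[64 - sizeof(long)];   /* assume 64-byte cache lines */
};

struct padded_counter counters[4];

void *worker( void *arg )
{
    long id = (long)arg;           /* thread index 0..3 */
    for ( long i = 0; i < 100000000; i++ )
        counters[id].value++;      /* each counter now lives on its own cache line */
    return NULL;
}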
39 List out the Memory Bandwidth Measured on a System with Four C409.2 BTL1
Virtual CPUs


Threads Time 7.437563 s Bandwidth 2.63 GB/s


Threads Time 15.238317 s Bandwidth 2.57 GB/s
Threads Time 24.580981 s Bandwidth 2.39 GB/s
Threads Time 37.457352 s Bandwidth 2.09 GB/s
What are the measures to be taken when bandwidth size reduces C409.2
The threads are interfering on the processor.
 A second interaction effect is if the threads start interfering in the
caches, such as multiple threads attempting to load data to the same
set of cache lines.
 One other effect is the behavior of memory chips when they become
40 saturated. The chips start experiencing queuing latencies where the BTL1
response time for each request increases. Memory chips are arranged
in banks.
 Accessing a particular address will lead to a request to a particular
bank of memory. Each bank needs a gap between returning two
responses. If multiple threads happen to hit the same bank, then the
response time becomes governed by the rate at which the bank can
return memory
41. Write a code to measure memory bandwidth using memset. C409.2 BTL1
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>

#define BLOCKSIZE 1024*1025
int nthreads = 8;
char * memory;

double now()
{
  struct timeval time;
  gettimeofday( &time, 0 );
  return (double)time.tv_sec + (double)time.tv_usec / 1000000.0;
}

void *experiment( void *id )
{
  int count = 20000;
  for( int i = 0; i < count; i++ )
  {
    memset( &memory[BLOCKSIZE * (size_t)id], 0, BLOCKSIZE );
  }
  return 0;
}

/* The main() below is a sketch of the missing driver: it allocates the
   memory, starts the threads, waits for them and prints the elapsed time. */
int main()
{
  pthread_t threads[8];
  memory = malloc( (size_t)BLOCKSIZE * nthreads );
  double start = now();
  for( long i = 0; i < nthreads; i++ )
    pthread_create( &threads[i], 0, experiment, (void *)i );
  for( int i = 0; i < nthreads; i++ )
    pthread_join( threads[i], 0 );
  printf( "Time %f s\n", now() - start );
  return 0;
}

How Bandwidth share between cores C409.2


42 Bandwidth is another resource shared between threads. The bandwidth BTL1
capacity of a system depends on the design of the processor and the memory
system as well as the memory chips and their location in the system


List out the three critical areas to address large difference scaling C409.2
 The amount of bandwidth to cache and the memory will be divided
among the active threads on the system.
43  The design of the caches will determine how much time is lost BTL1
because of capacity and conflict-induced cache misses.
 The way that the processor core pipelines are shared between active
software threads will determine how instruction issue rates change
as the number of active threads increases.
What is the role default malloc( ) C409.2
The default malloc() provides better performance than the alternative
44 implementation. The algorithm that provides improved scaling also adds a BTL1
cost to the single-threaded situation; it can be hard to produce an algorithm
that is fast for the single- threaded case and scales well with multiple
threads.
Define an idea to choose the appropriate data structures C409.2
Choosing the best structure to hold data, such as choosing an algorithm of
45 the appropriate complexity, can have a major impact on overall BTL1
performance.
Some structures will be efficient when data is accessed in one pattern, while
other structures will be more efficient if the access pattern is changed.
46. Define column-major order. C409.2 BTL1
In column-major order the opposite ordering is followed, so elements that are adjacent in the first index are adjacent in memory. Fortran uses column-major order, whereas C uses row-major order, in which adjacent elements of a row are adjacent in memory; accessing elements with a large stride is therefore a common error in codes translated from Fortran into C.
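A small sketch of why loop order matters for row-major C arrays (the array size N is an illustrative assumption):

#define N 1024
double a[N][N];

/* C stores a[i][j] in row-major order: a[i][0], a[i][1], ... are adjacent.
   The first routine walks memory contiguously; the second jumps N doubles
   between consecutive accesses and wastes most of every cache line.       */
void good_order( void )
{
    for ( int i = 0; i < N; i++ )
        for ( int j = 0; j < N; j++ )
            a[i][j] += 1.0;
}

void bad_order( void )
{
    for ( int j = 0; j < N; j++ )
        for ( int i = 0; i < N; i++ )
            a[i][j] += 1.0;
}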
How to select the appropriate array access pattern C409.2
47 One common data access pattern is striding through elements of an array. BTL1
The performance of the application would be better if the array could be
arranged so that the selected elements were contiguous.
List out the techniques to reduce the latency C409.2
48  Out of order execution BTL1
 Hardware prefetching
 Software prefetching
List out the non technical reasons why functionality get placed in C409.2
libraries
 Libraries often represent a convenient product for an organizational
unit. One group of developers might be responsible for a particular
library of code, but that does not automatically imply that a single
49 library represents the best way for that code to be delivered to the BTL1
end users.
 Libraries are also used to group related functionality. For example,
an application might contain a library of string-handling functions.
Such a library might be appropriate if it contains a large body of
code. On the other hand, if it contains only a few small routines, it
might be more appropriate to combine it with another library.


Why Algorithm complexity is important C409.2


Algorithmic complexity represents the expected performance of a section of
50 BTL1
code as the number of elements being processed increases. In the limit, the
code with the greatest algorithmic complexity will dominate the runtime of
the application

PART B

Q. No.   Questions   CO   Bloom’s Level

Explain about Synchronization and data sharing in detail. C409.2 BTL5


1. 2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:121-126

Explain Synchronization primitives mutexes and locks. C409.2 BTL5


2. 2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:126-128

Explain Synchronization primitives in semaphores and barriers in C409.2 BTL5

3.
detail.

2. Darryl Gove, “Multicore Application Programming for Windows, Linux,


and Oracle Solaris”, Pearson, 2011 Pg.No:128-129
Explain the concepts of deadlocks and live locks C409.2 BTL5
4.
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:132-133
Explain communication between threads using condition variables and C409.2 BTL5
signals.
5.
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:133-139

Explain communication between threads using message queues and C409.2 BTL5
6. pipes.
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:138-139
Explain data races and scalability in parallel program. (apr/may2017) C409.2 BTL5
7.
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011 Pg.No:121-126


Explain Synchronization primitives in parallel program challenges. C409.2 BTL5

8.
(Apr/may2017)

2. Darryl Gove, “Multicore Application Programming for Windows, Linux,


and Oracle Solaris”, Pearson, 2011 Pg.No:126-130
Explain the various approaches to parallel programming. .(Nov/Dec C409.2 BTL5
2017)
9
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:43-47

What is a data race? What are the tools used for detecting data races? C409.2 BTL5

10. How to avoid races? (Nov/Dec 2017) (Apr/May 2018)

2. Darryl Gove, “Multicore Application Programming for Windows, Linux,


and Oracle Solaris”, Pearson, 2011 Pg.No:121-125
(i)Discuss in detail about producer consumer synchronization (ii)Write C409.2 BTL5
a simple semaphore to send a message (Apr/May 2018)
11
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011

Write a short notes on deadlocks, livelocks and named pipes C409.2 BTL5
12 2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011

Discuss in detail about the importance of algorithmic complexity C409.2 BTL2


13 2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011

Explain the outline about necessity of structure reflects in performance C409.2 BTL4
14 2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011

Write in detail and summarize about hardware constraints applicable C409.2 BTL5
in improving scaling
15
2. Darryl Gove, “Multicore Application Programming for Windows, Linux,
and Oracle Solaris”, Pearson, 2011


UNIT III

SHARED MEMORY PROGRAMMING WITH OpenMP

OpenMP Execution Model – Memory Model – OpenMP Directives – Work-sharing Constructs –


Library functions – Handling Data and Functional Parallelism – Handling Loops – Performance
Considerations.
Q. No.   Questions   CO   Bloom’s Level

What is OpenMP? BTL1


Like Pthreads, OpenMP is an API for shared-memory parallel
programming. The “MP” in OpenMP stands for “multiprocessing,” a term
1. that is synonymous with shared-memory parallel computing. Thus, OpenMP C409.3
is designed for systems in which each thread or process can potentially have
access to all available memory, and, when we’re programming with
OpenMP, we view our system as a collection of cores or CPUs, all of which
have access to main memory.
What is Pthread. C409.3 BTL1
Pthreads is lower level and provides us with the power to program
2.
virtually any conceivable thread behavior. This power, however, comes with
some associated cost—it’s up to us to specify every detail of the behavior of
each thread.
What the difference between Pthreads and OpenMP. C409.3 BTL1
Pthreads requires that the programmer explicitly specify the
behavior of each thread. OpenMP, on the other hand, sometimes allows the
programmer to simply state that a block of code should be executed in
parallel, and the precise determination of the tasks and which thread should
3. execute them is left to the compiler and the run-time system. This suggests a
further difference between OpenMP and Pthreads, that is, that Pthreads (like
MPI) is a library of functions that can be linked to a C program, so any
Pthreads program can be used with any C compiler, provided the system has
a Pthreads library. OpenMP, on the other hand, requires compiler support
for some operations, and hence it’s entirely possible that you may run across
a C compiler that can’t compile OpenMP programs into parallel programs.
What is the Execution Model of OpenMp. C409.3 BTL1
The OpenMP API uses the fork-join model of parallel execution.
4.
Multiple threads of execution perform tasks defined implicitly or explicitly
by OpenMP directives. The OpenMP API is intended to support programs
that will execute correctly both as parallel programs (multiple threads of

Downloaded by SHARMILA RANI (sharmilarajme@gmail.com)


lOMoARcPSD|30789968

execution and a full OpenMP support library) and as


sequential programs (directives ignored and a simple OpenMP stubs
library).
What is an initial thread in OpenMP? C409.3 BTL1
5.
An OpenMP program begins as a single thread of execution, called
an initial thread.
6. How do we compile and run OpenMP programs? C409.3 BTL1
To compile with gcc we need to include the -fopenmp option:
  $ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
To run the program, we specify the number of threads on the command line. For example, we might run the program with four threads and type:
  $ ./omp_hello 4
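A minimal sketch of what omp_hello.c could look like, along the lines of Pacheco's example (the exact source is not reproduced in this question bank, so the details below are an assumption):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main( int argc, char *argv[] )
{
    /* number of threads taken from the command line, e.g. ./omp_hello 4 */
    int thread_count = ( argc > 1 ) ? (int)strtol( argv[1], NULL, 10 ) : 1;

    #pragma omp parallel num_threads( thread_count )
    {
        int my_rank = omp_get_thread_num();
        int count   = omp_get_num_threads();
        printf( "Hello from thread %d of %d\n", my_rank, count );
    }
    return 0;
}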
What is termed as initial task region. .(Nov/Dec 2017) C409.3 BTL1
The initial task region, that is defined by an implicit inactive parallel
7 region surrounding the whole program.When any thread encounters a
parallel construct, the thread creates a team of itself and zero or
more additional threads and becomes the master of the new team. A set
of implicit tasks, one per thread, is generated
8. Define odd-even transposition sort. (Apr/May 2017) C409.3 BTL1
• A parallelizable version of bubble sort.
• Requires N passes through the array.
• Each pass through the array analyzes either:
  – every pair of odd-indexed elements and the preceding element, or
  – every pair of even-indexed elements and the preceding element.
• Within each pass, elements that are not in order are swapped.

9. Develop a “hello world” program in C that uses OpenMP. (Apr/May 2017) C409.3 BTL1
Hello world program in C using OpenMP:

#include <stdio.h>
#include <omp.h>

int main( void )
{
  #pragma omp parallel
  {
    printf( "Hello world from thread %d of %d\n",
            omp_get_thread_num(), omp_get_num_threads() );
  }
  return 0;
}
Define Message Queue.(Nov/Dec 2017) C409.3 BTL1
Message queues provide an asynchronous communications protocol,
10
meaning that the sender and receiver of the message do not need to interact
with the message queue at the same time. Messages placed onto the queue
are stored until the recipient retrieves them.
What is an initial task region? C409.3 BTL1
11. An initial thread executes sequentially, as if enclosed in an implicit
task region, called an initial task region, that is defined by the implicit
parallel region surrounding the whole program.
Discuss the Structure of the OpenMP Memory Model C409.3 BTL1
The OpenMP API provides a relaxed-consistency, shared-memory
12
model. All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to have its
own temporary view of the memory.
What is threadprivate memory? C409.3 BTL1
13
Each thread also has access to another type of memory that must not
be accessed by other threads, called threadprivate memory.
Show the format of directive in OpenMP. C409.3
#pragma omp directive-name [clause[ [,] clause]...] new-line
14 Each directive starts with #pragma omp. The remainder of the BTL2
directive follows the conventions of the C and C++ standards for compiler
directives. In particular, white space can be used before and after the #, and
sometimes white space must be used to separate the words in a directive.
What is meant by Stand-alone directives? C409.3 BTL1
Stand-alone directives do not have any associated executable user code.
Instead, they represent executable statements that typically do not have
15 succinct equivalent statements in the base languages. There are some
restrictions on the placement of a stand-alone directive within a program. A
stand-alone directive may be placed only at a point where a base language
executable statement is allowed

What is the use of parallel construct? C409.3 BTL1


This fundamental construct starts parallel execution.
16
#pragma omp parallel [clause[ [, ]clause] ...] new-line
structured-block
where clause is one of the following:
if(scalar-expression)
num_threads(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
copyin(list)
reduction(reduction-identifier : list)
proc_bind(master | close | spread)
What is Worksharing Constructs? C409.3
A worksharing construct distributes the execution of the associated
17 region among the members of the team that encounters it. Threads execute BTL1
portions of the region in the context of the implicit tasks each one is
executing. If the team consists of only one thread then the worksharing
region is not executed in parallel.
List some worksharing constructs. C409.3
The OpenMP API defines the following worksharing constructs.
18 • loop construct BTL4
• sections construct
• single construct
• workshare construct
List the the Restrictions apply to worksharing constructs. C409.3
The following restrictions apply to worksharing constructs:
• Each worksharing region must be encountered by all threads in a team or
19 BTL4
by none at all, unless cancellation has been requested for the innermost
enclosing parallel region.
• The sequence of worksharing regions and barrier regions encountered
must be the same for every thread in a team.
Show the syntax of the loop construct. C409.3 BTL1

The syntax of the loop construct is as follows:


#pragma omp for [clause[[,] clause] ... ] new-line
for-loops
where clause is one of the following:
20 private(list)
firstprivate(list)
lastprivate(list)
reduction(reduction-identifier: list)
schedule(kind[, chunk_size])
collapse(n)
ordered
nowait
21 What is meant by binding? C409.3 BTL1
The binding thread set for a loop region is the current team. A loop region
binds to the innermost enclosing parallel region. Only the threads of the
team executing the binding parallel region participate in the execution of
the loop iterations and the implied barrier of the loop region if the barrier is
not eliminated by a nowait clause

What is the use of collapse clause. C409.3


The collapse clause may be used to specify how many loops are
22 associated with the loop construct. The parameter of the collapse clause BTL1
must be a constant positive integer expression. If no collapse clause is
present, the only loop that is associated with the loop construct is the one
that immediately follows the loop directive.
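A small illustrative sketch (an assumed matrix-addition example, not from the question bank): the two perfectly nested loops are merged into a single m*n iteration space that is divided among the threads.

void Mat_add(int m, int n, double a[m][n], double b[m][n], double c[m][n]) {
   /* collapse(2) merges the i and j loops into one iteration space */
#  pragma omp parallel for collapse(2)
   for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
         c[i][j] = a[i][j] + b[i][j];
}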
List the Restrictions to the loop construct. C409.3
Restrictions to the loop construct are as follows:
• All loops associated with the loop construct must be perfectly
nested; that is, there must be no intervening code nor any OpenMP
directive between any two loops.
• The values of the loop control expressions of the loops associated
with the loop construct must be the same for all the threads in the
team.
• Only one schedule clause can appear on a loop directive.
• Only one collapse clause can appear on a loop directive.
• chunk_size must be a loop invariant integer expression with a
23 positive value. BTL4
• The value of the chunk_size expression must be the same for all
threads in the team.
• The value of the run-sched-var ICV must be the same for all
threads in the team.
• When schedule(runtime) or schedule(auto) is specified,
chunk_size must not be specified.
• Only one ordered clause can appear on a loop directive.
• The ordered clause must be present on the loop construct if any
ordered region ever binds to a loop region arising from the loop
construct.
• The loop iteration variable may not appear in a threadprivate directive

How to Determine the Schedule of a Worksharing Loop? C409.3 BTL1


When execution encounters a loop directive, the schedule clause (if
any) on the directive, and the run-sched-var and def-sched-var ICVs are
used to determine how loop iterations are assigned to threads.. If the loop
24
directive does not have a schedule clause then the current value of the def-
sched-var ICV determines the schedule. If the loop directive has a schedule
clause that specifies the runtime schedule kind then the current value of the
run-sched-var ICV determines the schedule. Otherwise, the value of the
schedule clause determines the schedule.
25 What is The sections construct. C409.3 BTL1
The sections construct is a non-iterative work sharing construct that
contains a set of structured blocks that are to be distributed among and
executed by the threads in a team. Each structured block is executed once by
one of the threads in the team in the context of its implicit task.
Show the the syntax of the sections construct. C409.3
The syntax of the sections construct is as follows:

#pragma omp sections [clause[[,] clause] ...] new-line


{[
#pragma omp section new-line]
structured-block
[#pragma omp section new-line
26 BTL1
structured-block ]
...
}
where clause is one of the following
private(list)
firstprivate(list)
lastprivate(list)
reduction(reduction-identifier:list)
nowait
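A hedged sketch of how the sections construct is typically used; task_a and task_b are hypothetical independent functions, each executed exactly once by some thread of the team.

#include <stdio.h>

void task_a(void) { printf("task a\n"); }
void task_b(void) { printf("task b\n"); }

int main(void) {
#  pragma omp parallel sections
   {
#     pragma omp section
      task_a();   /* executed once, by one thread of the team */
#     pragma omp section
      task_b();   /* may run concurrently with task_a on another thread */
   }
   return 0;
}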
27 List the restrictions to the sections construct. C409.3 BTL4
• Orphaned section directives are prohibited. That is, the section directives must
appear within the sections construct and must not be encountered elsewhere in the
sections region.
• The code enclosed in a sections construct must be a structured block.
• Only a single nowait clause can appear on a sections directive.
• A throw executed inside a sections region must cause execution to resume within
the same section of the sections region, and the same thread that threw the
exception must catch it.
Show The syntax of the single construct . C409.3 BTL1

!$omp single [clause[[,] clause] ...]


structured-block
!$omp end single [end_clause[[,] end_clause] ...]
28
where clause is one of the following:
private(list)
firstprivate(list)

and end_clause is one of the following:
copyprivate(list)
nowait
Show the syntax of the workshare construct. C409.3 BTL1
!$omp workshare
structured-block
!$omp end workshare [nowait]
The enclosed structured block must consist of only the following:
29 • array assignments
• scalar assignments
• FORALL statements
• FORALL constructs
• WHERE statements
• WHERE constructs
• atomic constructs
List the restrictions apply to the workshare construct: C409.3
• All array assignments, scalar assignments, and masked array
30 BTL4
assignments must be intrinsic assignments.
• The construct must not contain any user defined function calls
unless the function is ELEMENTAL
31 What is the use of the schedule clause? C409.3 BTL1
The schedule clause specifies how the iterations of the associated loops are divided
among the threads of the team. If more than one loop is associated with the loop
construct, then the iterations of all associated loops are collapsed into one larger
iteration space that is then divided according to the schedule clause. The sequential
execution of the iterations in all associated loops determines the order of the
iterations in the collapsed iteration space.

What is the use of single construct. C409.3


The single construct specifies that the associated structured block is
executed by only one of the threads in the team (not necessarily the master
32 BTL1
thread), in the context of its implicit task. The other threads in the team,
which do not execute the block, wait at an implicit barrier at the end of the
single construct unless a nowait clause is specified
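A minimal sketch of the single construct (illustrative only): the initialization is executed by exactly one thread while the remaining threads wait at the implicit barrier at the end of the construct.

#include <stdio.h>
#include <omp.h>

int main(void) {
   int n = 0;   /* shared variable */

#  pragma omp parallel
   {
#     pragma omp single
      {
         n = 10;   /* done by one thread only (not necessarily the master) */
         printf("single region executed by thread %d\n", omp_get_thread_num());
      }
      /* implicit barrier: every thread sees n == 10 from here on */
      printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
   }
   return 0;
}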

List the Restrictions to the single construct are as follows: C409.3


33 BTL4
• The copyprivate clause must not be used with the nowait clause.
• At most one nowait clause can appear on a single construct.
What is workshare construct? C409.3
The workshare construct divides the execution of the enclosed
34 BTL1
structured block into separate units of work, and causes the threads of the
team to share the work such that each unit is executed only once by one
thread, in the context of its implicit task.
35 Explain Scope of a variable (Apr/May 2018) C409.3 BTL1
In OpenMP, the scope of a variable refers to the set of threads that can access the
variable in a parallel block. A variable that can be accessed by all the threads in the
team has shared scope, while a variable that can only be accessed by a single thread
has private scope.
36 Define Race Condition (Apr/May 2018) C409.3 BTL1
When multiple threads attempt to access and update a shared resource at the same
time, and the outcome depends on the order in which the accesses occur, the program
is said to have a race condition.
State the trapezoidal rule in OpenMP (Nov/Dec 2018) C409.3


for (i = 1; i <= n-1; i++)
{
37 x_i = a + i*dx; BTL1
approx += f(x_i);
}
approx = dx*approx;

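The loop above can be parallelized with a parallel for directive and a reduction clause so that the partial sums added to approx do not race; a hedged sketch follows, assuming f, a, dx, n and thread_count are defined elsewhere.

double Trap(double a, double dx, int n, int thread_count) {
   double approx = 0.0;

#  pragma omp parallel for num_threads(thread_count) reduction(+: approx)
   for (int i = 1; i <= n - 1; i++) {
      double x_i = a + i * dx;   /* private to each thread */
      approx += f(x_i);          /* f is assumed to be defined elsewhere */
   }
   return dx * approx;
}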
Define Coherence and consistency C409.3


 Coherence refers to the behavior of the memory system when a
38 single memory location is accessed by multiple threads. BTL1
 Consistency refers to the ordering of accesses to different memory
locations, observable from various threads in the system
Define data race C409.3
 A data race is defined to be accesses to a single variable by at least
39 two threads, at least one of which is a write, not separated by a BTL1
synchronization operation.
 OpenMP does guarantee certain consistency behavior, however.
That behavior is based on the OpenMP flush operation.
Define OpenMP flush operation
The OpenMP flush operation is applied to a set of variables
called the flush set. Memory operations for variables in the flush
40 set that precede the flush in program execution order must be
firmly lodged in memory and available to all threads before the
flush completes, and memory operations for variables in the flush
set, that follow a flush in program order cannot start until the
flush completes.
Mention the actions to be taken to move the value from one thread to C409.3
another thread
41  The first thread writes the value to the shared variable, BTL1
 The first thread flushes the variable
 The second thread flushes the variable and
 The second thread reads the variable
List out the two methods available for enabling nested parallel regions C409.3
42 BTL1
1.The omp_set_nested() library routine
2. Setting of the OMP_NESTED environment variable to TRUE
What is data dependences C409.3
1. OpenMP compilers don’t check for dependences among iterations in a
43 BTL1
loop that’s being parallelized with a parallel for directive. It’s up to us, the
programmers, to identify these dependences.
2. A loop in which the results of one or more iterations depend on other
iterations cannot, in general, be correctly parallelized by OpenMP.
Define Atomic Directive C409.3
The atomic directive ensures that a specific memory location is updated
atomically, rather than exposing it to the possibility of multiple,
44 BTL1
simultaneous writing threads.
The atomic directive supports no OpenMP clauses.
Syntax
#pragma omp atomic expression
Define Critical Directive C409.3
Specifies that code is only executed on one thread at a time. OpenMP does
45 BTL1
provide the option of adding a name to a critical directive:
Syntax
# pragma omp critical(name)
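A brief hedged sketch contrasting the two directives; the counter and the critical-section name are illustrative only.

#include <stdio.h>
#include <omp.h>

int main(void) {
   int counter = 0;

#  pragma omp parallel
   {
      /* atomic: protects a single, simple update of one memory location */
#     pragma omp atomic
      counter++;

      /* critical: protects an arbitrary block, one thread at a time */
#     pragma omp critical(output)
      printf("thread %d incremented the counter\n", omp_get_thread_num());
   }
   printf("counter = %d\n", counter);
   return 0;
}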
46 List out the two types of locks in OpenMP. C409.3 BTL1
 Simple locks
 Nested Locks
Define Barrier Directive and Master Directive C409.3
A barrier directive will cause the threads in a team to block until all the
threads have reached the directive.
Syntax
47 BTL1
#pragma omp barrier
Specifies that only the master thread should execute a section of the
program.
Syntax
# pragma omp master
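A short hedged sketch using both directives in one parallel region: no thread passes the barrier until all threads reach it, and only the master thread prints the summary.

#include <stdio.h>
#include <omp.h>

int main(void) {
#  pragma omp parallel
   {
      printf("thread %d: phase 1\n", omp_get_thread_num());

#     pragma omp barrier      /* wait here until every thread finishes phase 1 */

#     pragma omp master       /* executed by the master thread (thread 0) only */
      printf("master: all threads reached the barrier\n");
   }
   return 0;
}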
Which are the OpenMP clauses supported by Single directive C409.3
 copyprivate
BTL1
48
 firstprivate
 nowait
 private
Define Synchronization Clauses C409.3
 Critical
 Atomic
49 BTL1
 Ordered
 Barrier
 Nowait

List out the three types of Scheduling? C409.3


50  Static BTL1
 Dynamic
 Guided
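A hedged sketch of how a schedule kind is attached to a parallel loop; schedule(static, 1) gives a cyclic assignment of iterations, while dynamic and guided hand out chunks at run time as threads become free.

void Fill(double a[], int n) {
#  pragma omp parallel for schedule(static, 1)
   for (int i = 0; i < n; i++)
      a[i] = 2.0 * i;   /* stand-in for real per-iteration work */
}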
51 Define Data parallelism C409.3 BTL1
Data parallelism is a form of parallelization across multiple
processors in parallel computing environments. It focuses on
distributing the data across different nodes, which operate on the
data in parallel. It can be applied on regular data structures like
arrays and matrices by working on each element in parallel.
52 List out the four discrete steps to parallelization C409.3 BTL1
1. Partition the problem solution into tasks.
2. Identify the communication channels between the tasks.
3. Aggregate the tasks into composite tasks.
4. Map the composite tasks to cores.
53 What are Loop-Carried Dependencies? C409.3 BTL1
A loop-carried dependence exists when a computation in one iteration of a loop depends
on a value produced in another iteration; such a loop cannot, in general, be correctly
parallelized with a parallel for directive. Data dependence analysis is a very difficult
task, mainly due to the limitations imposed by pointer aliasing and by the overhead of
dynamic data dependence analysis.

PART B
Bloom’s
Q. No. Questions CO
Level

Explain OpenMP Execution Model in detail with example.(Nov/Dec C409.3


2017) (Apr/May 2018)
1. BTL5
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:210-215

Explain the Memory Model of OpenMP. (Apr/May 2018) C409.3 BTL5


2.
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:213-215
Explain the OpenMP Directives . (Apr/may2017) C409.3 BTL5

3. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No:224-231

Explain the Library functions used in OpenMP. C409.3 BTL5


4.
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:
Explain in detail how to Handle Loops in OpenMP (Nov/Dec 2017) C409.3 BTL5

5. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No: 236-240

Explain OpenMP directives. . (Apr/may 2017) C409.3 BTL5

6. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No:224-231

How data and functional parallelism are handled in shared memory C409.3 BTL5
programming with open MP? (Apr/may2017)
7.
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:212-215

Explain in detail about the handling loops in parallel operations C409.3 BTL5
(Nov/Dec 2017)
8.
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:236-240

Write an example program for shared memory programming with C409.3 BTL5
Pthread (Apr/May 2018)
9
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:236-240

Explain in detail about the synchronization primitives in parallel C409.3 BTL5


program challenges (Apr/May 2017)
10
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:236-240
Explain the type shared memory model BTL5

11 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011

Collect the all information about internal control variable BTL1

12 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011

Explain briefly about General Data Parallelism BTL4

13 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011

(i)Explain the NoWait Clause BTL4

(ii)Explain the single pragma


14
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011

(i)Illustrate the runtime library definitions BTL3

(ii)Illustrate the execution environment routines


15
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
UNIT IV

DISTRIBUTED MEMORY PROGRAMMING WITH MPI


MPI program execution – MPI constructs – libraries – MPI send and receive – Point-to-point and
Collective communication – MPI derived datatypes – Performance evaluation.

PART A
Bloom
Q. No. Questions CO ’s
Level

What is MPI?
One process calls a send function and the other calls a receive function. The
implementation of message-passing that we will be using is called MPI
1. (Message-Passing Interface). MPI is not a new programming language. It C409.4 BTL1

defines a library of functions that can be called from C, C++, and


FORTRAN Programs

How to execute MPI programs? BTL1


2. Many systems use the command mpicc to compile MPI programs and C409.4
mpiexec to run them:
mpicc -g -Wall -o mpi_hello mpi_hello.c
mpiexec -n 4 ./mpi_hello
What is the use of Wrapper Script. (Apr/May 2017) BTL1
A wrapper script is a script whose main purpose is to run some
3. C409.4
program. In this case, the program is the C compiler.The wrapper
simplifies the running of the compiler by telling it where to find the
necessary header files and which libraries to link with the object file.
Show the syntax of MPI_init and MPI_finalize BTL1

The syntax of MPI_Init():
4. int MPI_Init( C409.4
int* argc_p /* in/out */,
char*** argv_p /* in/out */);
The syntax of MPI_Finalize():
int MPI_Finalize(void);
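A minimal sketch of the usual MPI program skeleton built around these two calls; MPI_Comm_size and MPI_Comm_rank report the number of processes and the calling process's rank.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int comm_sz, my_rank;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   printf("Process %d of %d\n", my_rank, comm_sz);

   MPI_Finalize();
   return 0;
}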
What is Communicator in MPI? BTL1
In MPI a communicator is a collection of processes that can send
5. messages to each other. One of the purposes of MPI Init is to define a C409.4
communicator that consists of all of the processes started by the user
when she started the program. This communicator is called MPI_
COMM_WORLD.
What is a SPMD program?


A single program is written so that different processes carry out different

6.
actions, and this is achieved by simply having the processes branch on the C409.4 BTL1
basis of their process rank. This approach to parallel programming is called
single program, multiple data, or SPMD

Give the syntax of MPI_Get_count BTL1


The syntax of MPI_Get_count is
7. int MPI_Get_count( C409.4
MPI_Status* status_p /* in */,
MPI_Datatype type /* in */,
int* count_p /* out */);
What is Non-overtaking? BTL1
If process q sends two messages to process r, then the first
message sent by q must be available to r before the second message.
8. There is no restriction on the arrival of messages sent from different C409.4
processes. ie., if q and t both send messages to r, then even if q sends its
message before t sends its message, there is no requirement that q’s
message become available to r before t’s message. This is essentially
because MPI can’t impose performance on a network.
Evaluate the performance evaluation methods in distributed memory
programming. (Nov/Dec 2017)
1.Timing performance
9 C409.4 BTL5
2.performance Results
3.Speedup performance
4.efficiency performance
5.Scalability performance
What is wildcard argument? BTL1
The wildcard arguments:
o Only a receiver can use a wildcard argument. Senders must specify a
process rank and a nonnegative tag. Thus, MPI uses a “push” C409.4
10
communication mechanism rather than a “pull” mechanism.
There is no wildcard for communicator arguments; both senders and
receivers must always specify communicators

11 Show the area of the trapezoid. C409.4 BTL1


Area of one trapezoid = h/2 * [f(xi) + f(xi+1)]
Define collective communications in MPI. BTL1
12 C409.4
In MPI parlance, communication functions that involve all the
processes in a communicator are called collective communications.
BTL1
13 What is Local Variables? Give Examples. C409.4
Local variables are variables whose contents are significant only on
the process that’s using them. Some examples are: local a, local b and
local n.
BTL1
14 What is Global Variables? Give Examples. C409.4
Variables whose contents are significant to all the processes are
sometimes called global variables. Some examples are: a, b and n.
What is point to point communications in MPI? BTL1
15 C409.4
MPI_Send and MPI_Recv are often called point-to-point
communications.
List any two points stating how collective communication differs
from point-to-point communication.
16  All the processes in the communicator must call the same collective C409.4 BTL4
function. Example, a program that attempts to match a call to MPI_Reduce
on one process with a call to MPI_ Recv on another process is erroneous,
and, in all likelihood, the program will hang or crash.
What is a butterfly-structured global sum? BTL1
17 The processes exchange partial results instead of using one-way C409.4
communications. Such a communication pattern is sometimes called a
butterfly-structured global sum.
What is broadcast? BTL1
18 C409.4
A collective communication in which data belonging to a single process
is sent to all of the processes in the communicator is called a broadcast
Show a broadcast function. BTL1
A broadcast function:
int MPI_Bcast(
19 void* data_p /* in/out */, C409.4
int count /* in */,
MPI_Datatype datatype /* in */,
int source_proc /* in */,
MPI_Comm comm /* in */);
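A small hedged usage sketch: process 0 reads a value and MPI_Bcast delivers it to every process in the communicator (the helper Get_n is for illustration only).

#include <stdio.h>
#include <mpi.h>

void Get_n(int my_rank, int* n_p) {
   if (my_rank == 0) {
      printf("Enter n\n");
      scanf("%d", n_p);
   }
   /* After the broadcast, every process has the value read on process 0. */
   MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}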
Define block partition of the vector. BTL1
The work consists of adding the individual components of the vectors,
so we might specify that the tasks are just the additions of corresponding
components. Then there is no communication between the tasks, and the
problem of parallelizing vector addition boils down to aggregating the
20 C409.4
tasks and assigning them to the cores. If the number of components is n
and we have comm_sz cores or processes, let’s assume that n evenly
divides comm_sz and define local n D n=comm sz. Then we can simply
assign blocks of local n consecutive
components to each process. This is often called a block partition of
the vector.
Show MPI Scatter function. BTL1
The communication MPI provides a function:
21 int MPI_Scatter( C409.4
void* send_buf_p /* in */,
int send_count /* in */,
MPI_Datatype send_type /* in */,
void* recv_buf_p /* out */,
int recv_count /* in */,
MPI_Datatype recv_type /* in */,
int src_proc /* in */,
MPI_Comm comm /* in */);
Give MPI Gather function. BTL1
The communication in this function can be carried out by MPI Gather,
int MPI_Gather(
void* send_buf_p /* in */,
int send_count /* in */,
22 C409.4
MPI_Datatype send_type /* in */,
void* recv_buf_p /* out */,
int recv_count /* in */,
MPI_Datatype recv_type /* in */,
int dest_proc /* in */,
MPI_Comm comm /* in */);
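A hedged sketch combining the two functions: a vector stored on process 0 is scattered, each process scales its block, and the blocks are gathered back; n is assumed to be evenly divisible by comm_sz.

#include <stdlib.h>
#include <mpi.h>

void Scale_vector(double* global_x, int n, int my_rank, int comm_sz) {
   int local_n = n / comm_sz;
   double* local_x = malloc(local_n * sizeof(double));

   /* Distribute one block of local_n components to each process. */
   MPI_Scatter(global_x, local_n, MPI_DOUBLE,
               local_x,  local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

   for (int i = 0; i < local_n; i++)
      local_x[i] *= 2.0;

   /* Collect the blocks back into global_x on process 0. */
   MPI_Gather(local_x,  local_n, MPI_DOUBLE,
              global_x, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   free(local_x);
}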
What is Derived data type in MPI? BTL1
23 In MPI, a derived data type can be used to represent any collection of C409.4
data items in memory by storing both the types of the items and their
relative locations in memory.
Show the use MPI_Type_create_struct. BTL1
We can use MPI_Type_create_struct to build a derived datatype that
consists of individual elements that have different basic types:
24 C409.4
int MPI_Type_create_struct(
int count /* in */,
int array_of_blocklengths[] /* in */,
MPI_Aint array_of_displacements[] /* in */,
MPI_Datatype array_of_types[] /* in */,
MPI_Datatype* new_type_p /* out */);
Show the ratio of the serial and parallel runtime. BTL1
The most widely used measure of the relation between the serial and the
parallel run-times is the speedup. It’s just the ratio of the serial run-time
25 to the parallel run-time: C409.4
S(n, p) = Tserial(n) / Tparallel(n, p)
Show "per process" speedup? BTL1
26 The "per process" speedup is the parallel efficiency: C409.4
E(n, p) = S(n, p) / p = Tserial(n) / (p * Tparallel(n, p))
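For example, if Tserial(n) = 40 seconds and Tparallel(n, p) = 5 seconds with p = 8 processes, then S(n, 8) = 40/5 = 8 and E(n, 8) = 8/8 = 1.0, i.e., linear speedup and perfect efficiency.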
What is a wrapper script? (Apr/May 2017) BTL1
27 A wrapper script is a script whose main purpose is to run some program, C409.4
in this case the C compiler. The wrapper simplifies the running of the compiler by telling it
where to find the necessary header files and which libraries to link with the object file.
How to design a parallel program? BTL1
A parallel program can be designed using four basic steps:
28 1. Partition the problem solution into tasks. C409.4
2. Identify the communication channels between the tasks.
3. Aggregate the tasks into composite tasks.
4. Map the composite tasks to cores.
What is linear speedup? BTL1
29 The ideal value for S (n,p) is p. If S (n,p) = p, then our parallel program C409.4
with comm_ sz = p processes is running p times faster than the serial
program. This speedup, sometimes called linear speedup.
What is block-cyclic partition? BTL1
Instead of using a cyclic distribution of individual components, use a cyclic
30 distribution of blocks of components, so a block-cyclic distribution isn’t C409.4

fully specified until we decide how large the blocks are

What are the possibilities for choosing a destination when sending BTL1
requests for work with MPI
31 MPI is designed to allow users to create programs that can run efficiently on most C409.4
parallel architectures. The design process included vendors (such as IBM, Intel,
TMC, Cray, Convex, etc.), parallel library authors (involved in the development of
PVM, Linda, etc.), and applications specialists
List the restrictions work sharing constructs BTL1
32 C409.4
The sequence of work-sharing constructs and barrier directives encountered must
be the same for every thread in a team
33 Write the evaluation methods in distributed memory programming C409.4 BTL1
Timing performance, speedup, efficiency, and scalability.
34 Give the commands for MPI C409.4 BTL1
Compile: mpicc -g -Wall -o mpi_hello mpi_hello.c
Run: mpiexec -n <number of processes> ./mpi_hello
35 Define broadcast and butterfly in MPI C409.4 BTL1
A broadcast is a collective communication in which data belonging to a single process is
sent to all of the processes in the communicator. In a butterfly, the processes exchange
partial results instead of using one-way communications; this pattern is called a
butterfly-structured global sum.
36 What is MPI_Wtime? C409.4 BTL1
MPI provides a function, MPI_Wtime, that returns the number of seconds
that have elapsed since some time in the past:
double MPI_Wtime(void);
Define MPI_Barrier BTL1
37 The MPI collective communication function MPI_Barrier insures that no C409.4
process will return from calling it until every process in the communicator
has started calling it.
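The two functions are usually combined when timing MPI code; a hedged sketch of the standard pattern (reporting the slowest process's elapsed time) is shown below.

#include <stdio.h>
#include <mpi.h>

void Time_section(int my_rank, MPI_Comm comm) {
   double local_start, local_finish, local_elapsed, elapsed;

   MPI_Barrier(comm);                 /* roughly synchronize before timing */
   local_start = MPI_Wtime();
   /* ... code to be timed goes here ... */
   local_finish = MPI_Wtime();
   local_elapsed = local_finish - local_start;

   /* Report the maximum elapsed time over all processes. */
   MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
   if (my_rank == 0)
      printf("Elapsed time = %e seconds\n", elapsed);
}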
Define Speed-Up and Efficiency BTL1


The most widely used measure of the relation between the serial and the
38 parallel run-times is the speedup. It‘s just the ratio of the serial run-time to C409.4
the parallel run-time:
S(n,p) = Tserial(n) / Tparallel(n,p), and the parallel efficiency is E(n,p) = S(n,p)/p.
How the Toverhead is represented BTL1
The parallel run-time is denoted byTparallel. Since it depends on both the
input size, n, and the number of processes, comm._sz= p, we‘ll frequently
39 denote it as Tparallel(n,p). The parallel program will divide the work of the C409.4
serial program among the processes, and add in some overhead time, which
we denoted Toverhead:
Tparallel(n,p) = Tserial(n)/p + Toverhead
Define Speed up BTL1
The most widely used measure of the relation between the serial and the
40 parallel run-times is the speedup. It‘s just the ratio of the serial run-time to C409.4
the parallel run-time:
S(n,p)=Tserial(n)/Tparallel(n,p)
Define Efficiency BTL1
This speedup, sometimes called linear speedup. Another widely used
41 C409.4
measure of parallel performance is parallel efficiency.
E(n,p) = S(n,p)/p = Tserial(n) / (p * Tparallel(n,p))
What is MPI derived data types BTL1
42 In MPI, a derived datatype can be used to represent any collection of data C409.4
items in memory by storing both the types of the items and their relative
locations in memory.
Define Gather BTL1
Gathers distinct messages from each task in the group to a single
destination task. This routine is the reverse operation of MPI_Scatter. The
43 C409.4
data stored in the memory referred to by send_buf_p on process 0 is stored
in the first block in recv_buf p, the data stored in the memory referred to by
send buf_p on process 1 is stored in the second block referred to by
recv_buf_p, and so on.
Define Scatter BTL1
44 Distributes distinct messages from a single source task to each task in the C409.4
group.
List out the types of collective operations BTL1


Synchronization - processes wait until all members of the group have
reached the synchronization point.
45 C409.4
• Data Movement - broadcast, scatter/gather, all to all
Collective Computation (reductions) - one member of the group collects
data from the other members and performs an operation (min, max, add,
multiply, etc.) on that data.
46 Define MPI_Allreduce C409.4 BTL1
MPI_Allreduce is identical to MPI_Reduce except that the result of the reduction is
stored on all of the processes in the communicator. Consider a situation in which all
of the processes need the result of a global sum in order to complete some larger
computation. For example, if we use a tree to compute a global sum, we might
"reverse" the branches to distribute the global sum.

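A minimal hedged sketch: each process contributes local_sum and, unlike MPI_Reduce, every process receives the global total.

#include <mpi.h>

double Global_sum(double local_sum) {
   double total;
   MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
   return total;   /* valid on every process, not just a root */
}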
List out the two possibilities when the message are assembled BTL1
47 C409.4
 the sending process can buffer the message or
 it can block
What are the members of the MPI type MPI_Status? BTL1
The MPI type MPI_Status is a struct with at least the three members MPI_SOURCE,
MPI_TAG, and MPI_ERROR.
48 Suppose our program contains the definition MPI_Status status; Then, after C409.4
a call to MPI_Recv in which &status is passed as the last argument, we can
determine the sender and tag by examining the two members
status.MPI_SOURCE
status.MPI_TAG
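A hedged sketch showing how the status members (and MPI_Get_count) are examined after a receive that uses wildcards.

#include <stdio.h>
#include <mpi.h>

#define MAX 100

void Recv_any(void) {
   double buf[MAX];
   int count;
   MPI_Status status;

   MPI_Recv(buf, MAX, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
            MPI_COMM_WORLD, &status);
   MPI_Get_count(&status, MPI_DOUBLE, &count);
   printf("Received %d doubles from process %d with tag %d\n",
          count, status.MPI_SOURCE, status.MPI_TAG);
}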
49 What is status_p argument C409.4 BTL1
 The Amount of Data In The Message
 The Sender of The Message, Or


 The Tag of The Message.
50 Mention the syntax for MPI_Send C409.4 BTL1
int MPI_Send(
void* msg_buf_p /* in */,
int msg_size /* in */,
MPI_Datatype msg_type /* in */,
int dest /* in */,
int tag /* in */,
MPI_Comm communicator /* in */);
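A small hedged sketch of a matched MPI_Send / MPI_Recv pair: process 1 sends a greeting that process 0 receives and prints.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

void Greeting(int my_rank) {
   char msg[64];

   if (my_rank == 1) {
      sprintf(msg, "Greetings from process 1");
      MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else if (my_rank == 0) {
      MPI_Recv(msg, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("%s\n", msg);
   }
}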
Write a note on Distributed memory machines (Nov/Dec 2018) BTL1


Programming on a distributed memory machine is a matter of organizing a
program as a set of independent tasks that communicate with each other via
51 messages. In addition, programmers must be aware of where data is stored, which C409.4
introduces the concept of locality in parallel algorithm design. An algorithm that
allows data to be partitioned into discrete units and then runs with minimal
communication between units will be more efficient than an algorithm that requires
random access to global structures.
How to compile an MPI Program (Nov/Dec 2018) BTL1
52 C409.4
Many systems use a command called mpicc for compilation; mpicc is a
script that's a wrapper for the C compiler.

PART B

Bloom
Q. No. Questions CO ’s
Level

What is MPI? Write a program”hello,world” that makes some use of BTL1


MPI. how to compile and execute MPI programs. .(Nov/Dec 2017).
1. C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:86-90

Explain about Trapezoidal rule in MPI in detail BTL1

2. 1.Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan- C409.4


Kauffman/Elsevier, 2011 Page No:94-96

Give in detail about Collective Communication. BTL1


3. C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:101-115
Explain the following MPI functions: BTL1


 MPI_Reduce
 MPI_Allreduce
 MPI_Scatter
4. C409.4
 MPI_Gather
 MPI_Allgather
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:88-91

Compare and contrast Collective communication Vs Point to point
communication.(16) (Nov/Dec 2017).
5. C409.4 BTL2
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:105-106

Discuss about MPI Derived data types with example programs. (10)

6. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan- C409.4 BTL6


Kauffman/Elsevier, 2011 Page No:116-118

i.Explain about Performance Evaluation of MPI Programs in BTL5


detail.(Apr/May 2017)
7. ii.What are the performance issues in multicore processor? C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:119-126

i.Explain tree structured communication BTL5


ii.What are the differences between point to point and collective
8. communication? (Apr/May 2017) C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:102-103

(i)Explain Loop Handling in detail BTL5


(ii)Describe about MPI programs execution with example (Apr/May
9 C409.4
2018)
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Explain the virtual memory in detail (Apr/May 2018) BTL5
10 C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
(i)Describe the Attribute Caching BTL2
11 (ii)Discuss about the communicators C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
(i)Describe the Distributed array datatype constructor BTL5


12 (ii)Explain the Cartesian constructor C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
(i) Generalize the group of process BTL6
13 (ii) Explain the virtual topology C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
(i)Describe the MPI program execution BTL1
14 (ii)Describe about MPI Init and Finalize C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
(i)Describe about the Datatype constructor BTL5
15 (ii)Discuss about the subarray datatype constructor C409.4
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011

UNIT V

PARALLEL PROGRAM DEVELOPMENT


Case studies - n-Body solvers – Tree Search – OpenMP and MPI implementations and
comparison.

PART A

Bloom’
Q. No. Questions CO s Level

What is an n-body problem? BTL1


In an n-body problem, it needs to find the positions and velocities of a
collection of interacting particles over a period of time. An n-body solver is a
1. program that finds the solution to an n-body problem by simulating the behavior C409.5
of the particles. The input to the problem is the mass, position, and velocity of
each particle at the start of the simulation, and the output is the position and
velocity of each particle at a sequence of user-specified times, or simply the
position and velocity of each particle at the end of a user-specified time period.
Show the pseudocode of a serial n-body solver. C409.5 BTL1


1 Get input data;
2 for each timestep {
3 if (timestep output) Print positions and velocities of particles;
2. 4 for each particle q
5 Compute total force on q;
6 for each particle q
7 Compute position and velocity of q;
8 }
9 Print positions and velocities of particles;
List the two algorithms in the n-body solver. C409.5 BTL1
3.
The n-body solver with the original force calculation, the basic algorithm,
and the solver with the number of calculations reduced the reduced algorithm
List the two algorithms in the n-body solver. C409.5 BTL1
4. The n-body solver with the original force calculation, the basic
algorithm, and the solver with the number of calculations reduced the reduced
algorithm
Differentiate between two algorithms in n-body solver
5. The basic algorithm computes the force between each pair of particles twice, once for
each particle in the pair, while the reduced algorithm uses Newton's third law to compute
each pairwise force only once, roughly halving the number of force calculations.
Define Graph. C409.5 BTL1
6.
A graph is a collection of vertices and edges or line segments joining pairs of
vertices.
List the use of Recursive depth-first search algorithm. C409.5 BTL1
The algorithm makes use of several global variables:
 n: the total number of cities in the problem
7.
 digraph: a data structure representing the input digraph
 hometown: a data structure representing vertex or city 0, the salesperson’s
hometown
 best tour: a data structure representing the best tour so far

What are the disadvantages of recursive depth-first search?
8
Since function calls are expensive, recursion can be slow. It also has the disadvantage
that at any given instant of time only the tour that is currently being explored is accessible.
This could be a problem when we try to parallelize tree search by dividing the tree among
threads or processes.
List the similarities parallelizing the solvers using pthreads. C409.5 BTL1
The more important similarities are:
 By default local variables in Pthreads are private, so all shared variables are
global in the Pthreads version.
8.  The principal data structures in the Pthreads version are identical to those in
the OpenMP version: vectors are two-dimensional arrays of doubles, and the
mass, position, and velocity of a single particle are stored in a struct. The
forces are stored in an array of vectors.
 Startup for Pthreads is basically the same as the startup for OpenMP: the
main thread gets the command-line arguments, and allocates and initializes
the principal data structures.


 The main difference between the Pthreads and the OpenMP implementations
is in the details of parallelizing the inner loops. Since Pthreads has nothing
analogous to a parallel for directive, we must explicitly determine which
values of the loop variables correspond to each thread’s calculations. To
facilitate this, a function Loop schedule, which determines
 the initial value of the loop variable,
 the final value of the loop variable, and
 the increment for the loop variable.
 The input to the function is
 the calling thread’s rank,
 the number of threads,
 the total number of iterations, and
an argument indicating whether the partitioning should be block or cyclic
Define communication structure C409.5
A communication structure is called a ring pass. In a ring pass, imagine the
processes as being interconnected in a ring. Process 0 communicates directly with
processes 1 and comm_sz-1, process 1 communicates with processes 0 and 2, and
so on. The communication in a ring pass takes place in phases, and during each
phase each process sends data to its “lower-ranked” neighbor, and receives data
from its “higher-ranked” neighbor. Thus, 0 will send to comm_sz-1 and receive
from 1. 1 will send to 0 and receive from 2, and so on. In general, process q will
9. send to process (q-1+comm_sz)%comm_sz and receive from process BTL1
(q+1)%comm_sz.

List the properties of function First_index. C409.5 BTL1


The function First index should determine a global index glb part2 with the
9
following properties:
1. The particle glb_part2 is assigned to the process with rank owner.
2. glb_part1 < glb_part2 < glb_part1 + comm_sz.
What is a word about I/O. C409.5 BTL1
10 The basic I/O was designed for use by single-process, single-threaded
programs, and when multiple processes or multiple threads attempt to access the
I/O buffers; the system makes no attempt to schedule their access.
What is race condition? .(Nov/Dec 2017) C409.5 BTL1
11 A race condition is an undesirable situation that occurs when a device or system
attempts to perform two or more operations at the same time, but because of the
nature of the device or system, the operations must be done in the proper
sequence to be done correctly.


Show the Pseudocode for the MPI implementation of the reduced n-body C409.5 BTL1
solver.
1 source = (my_rank + 1) % comm._sz;
2 dest = (my_rank - 1 + comm._sz) % comm._sz;
3 Copy loc_pos into tmp_pos;
4 loc_forces = tmp_forces = 0;
5
6 Compute forces due to interactions among local particles;
12 7 for (phase = 1; phase < comm._sz; phase++) {
8 Send current tmp_pos and tmp_forces to dest;
9 Receive new tmp_pos and tmp_forces from source;
10 /* Owner of the positions and forces we’re receiving */
11 owner = (my_rank + phase) % comm._sz;
12 Compute forces due to interactions among my particles
13 and owner’s particles;
14 }
15 Send current tmp_pos and tmp_forces to dest;
16 Receive new tmp_pos and tmp_forces from source;
What is NP-complete problem? .(Apr/May 2017) C409.5 BTL1
In computational complexity theory, a decision problem is NP-complete
13 when it is both in NP and NP-hard. The set of NP-complete problems is
often denoted by NP-C or NPC. The abbreviation NP refers to
"nondeterministic polynomial time".

What is Depth-first search? C409.5 BTL1


Depth-first search (DFS) is an algorithm .for traversing or searching tree or
graph data structures. One starts at the root (selecting some arbitrary node as
14
the root in the case of a graph) and explores as far as possible along each
branch before backtracking.

What is Recursive depth-first search? C409.5 BTL1


Depth-first search (DFS) is an algorithm that traverses a graph in search of
one or more goal nodes. The defining characteristic of this search is that,
whenever DFS visits a maze cell c, it recursively searches the sub-maze
whose origin is c. This recursive behaviour can be simulated by an iterative
15
algorithm using a stack. A cell can have three states:
Unvisited. The cell has not yet been visited by DFS.
 Visit In Progress. The cell has been discovered, but not yet finished. Ie, the
recursive search which begins at this node has not yet terminated.
 Visited. The cell has been discovered, and the submazes which start at this
node have been completely visited also.
What problem occurs when test lock condition and update lock condition C409.5 BTL1
16
combined?
The combination of “test lock condition” and “update lock condition” can
cause a problem: the lock condition (e.g. the cost of the best tour) can
change between the time of the first test and the time that the lock is
acquired. Thus, the threads also need to check the lock condition after they
acquire the lock.
Show the pseudocode for updating the best tour. C409.5 BTL1
The pseudocode for updating the best tour should look something like this:
if (new tour cost < best tour cost) {
Acquire lock protecting best tour;
17 if (new tour cost < best tour cost)
Update best tour;
Relinquish lock;
}

Show the syntax for mutex checking. C409.5 BTL1


Pthreads has a nonblocking version of pthread_mutex_lock called
pthread_mutex_trylock. This function checks to see if the mutex is
18
available. If it is, it acquires the mutex and returns the value 0. If the mutex
isn’t available, instead of waiting for it to become available, it will return a
nonzero value.
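A hedged sketch of the non-blocking pattern described above, applied to the best-tour update; the names are illustrative, not the book's code.

#include <pthread.h>

/* Returns 1 if the update was attempted, 0 if the mutex was busy. */
int Try_update(pthread_mutex_t* mutex_p, int new_cost, int* best_cost_p) {
   if (pthread_mutex_trylock(mutex_p) == 0) {   /* 0 means the lock was acquired */
      if (new_cost < *best_cost_p)
         *best_cost_p = new_cost;
      pthread_mutex_unlock(mutex_p);
      return 1;
   }
   return 0;   /* mutex not available; caller can do other work and retry */
}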

Define MPI. C409.5 BTL1


19 The message passing interface (MPI) is a standardized means of
exchanging messages between multiple computers running a parallel
program across distributed memory
Show a pseudocode for a recursive solution to TSP using depth first C409.5 BTL1
search. (Apr/May 2017)
20
void Depth_first_search(tour_t tour) {
   if (City_count(tour) == n) {
      if (Best_tour(tour))
         Update_best_tour(tour);
   } else {
      for (city = n-1; city >= 1; city--)
         if (Feasible(tour, city)) {
            Add_city(tour, city);
            Depth_first_search(tour);
            Remove_last_city(tour);
         }
   }
} /* Depth_first_search */
What are the features of distributed memory?(Nov/Dec 2017) C409.5 BTL1

21
distributed memory refers to a multiprocessor computer system in which each
processor has its own private memory. Computational tasks can only operate on
local data, and if remote data is required, the computational task must communicate
with one or more remote processors.


Show the syntax of MPI_Pack and MPI_Unpack. C409.5
int MPI_Pack(
void* data_to_be_packed /* in */,
int to_be_packed_count /* in */,
MPI_Datatype datatype /* in */,
void* contig_buf /* out */,
int contig_buf_size /* in */,
int* position_p /* in/out */,
22 MPI_Comm comm /* in */); BTL1

int MPI_Unpack(
void* contig_buf /* in */,
int contig_buf_size /* in */,
int* position_p /* in/out */,
void* unpacked_data /* out */,
int unpack_count /* in */,
MPI_Datatype datatype /* in */,
MPI_Comm comm /* in */);
List the use of MPI_IN_PLACE argument. C409.5 BTL1
23 The use of argument MPI_IN_PLACE is that the input and output buffers
are the same. This can save on memory and the implementation may be able
to avoid copying from the input buffer to the output buffer
What the use of functions MPI Scatter and MPI Gather. C409.5 BTL1
24 The functions MPI_Scatter and MPI_Gather can be use to split an array of
data among processes and collect distributed data into a single array,
respectively
What are the use of functions MPI_Scatterv and MPI_Gatherv. C409.5 BTL1
When the amount of data going to or coming from each process is the
25
same for each process. If we need to assign different amounts of data to each
process, or to collect different amounts of data from each process we can use
MPI_Scatterv and MPI_Gatherv, respectively.
Show the syntax of MPI_Scatterv. C409.5 BTL1

int MPI_Scatterv(


void* sendbuf /* in */,
int* sendcounts /* in */,
int* displacements /* in */,
26 MPI_Datatype sendtype /* in */,
void* recvbuf /* out */,
int recvcount /* in */,
MPI_Datatype recvtype /* in */,
int root /* in */,
MPI_Comm comm /* in */);

27 Show the syntax of MPI_Gatherv. C409.5 BTL1
int MPI_Gatherv(
void* sendbuf /* in */,
int sendcount /* in */,
MPI_Datatype sendtype /* in */,
void* recvbuf /* out */,
int* recvcounts /* in */,
int* displacements /* in */,
MPI_Datatype recvtype /* in */,
int root /* in */,
MPI_Comm comm /* in */);
What are the three modes provided by MPI? C409.5 BTL1
28
MPI provides three other modes for sending, in addition to the standard mode
(MPI_Send): buffered (MPI_Bsend), synchronous (MPI_Ssend), and ready (MPI_Rsend).
List the use of functions MPI_Pack and MPI_Unpack. C409.5 BTL1
MPI provides the function MPI_Pack for storing a data structure in a single,
29
contiguous buffer before sending. Similarly, the function MPI_Unpack can
be used to take data that’s been received into a single contiguous buffer and
unpack it into a local data structure
List the use of ready mode. C409.5 BTL1
30 Ready sends (MPI_Rsend) are erroneous unless the matching receive
has already been started when MPI_Rsend is called. The ordinary send
MPI_Send is called the standard mode send.
List the use of Synchronous mode. C409.5 BTL1
31 Synchronous sends won’t buffer the data; a call to the synchronous send
function MPI_Ssend won’t return until the receiver has begun receiving the
data.
How to compute n-body forces C409.5 BTL1
32
for each particle q
   forces[q] = 0;
for each particle q {
   for each particle k > q {
      x_diff = pos[q][X] - pos[k][X];
      y_diff = pos[q][Y] - pos[k][Y];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[X] = G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[Y] = G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][X] += force_qk[X];
      forces[q][Y] += force_qk[Y];
      forces[k][X] -= force_qk[X];
      forces[k][Y] -= force_qk[Y];
   }
}
List the data structures used for serial implementation C409.5 BTL1

33 The data structures are the tour, the digraph, and, in the iterative implementations, the stack. The tour
and the stack are essentially list structures. The tour is stored as an array structure with three members:
the array storing the cities, the number of cities, and the cost of the partial tour.
The digraph is represented using an adjacency matrix.
C409.5 BTL1
Difference between Parallelizing the two n-body solvers using pthread
and OpenMP.

The main difference between the Pthreads and the OpenMP implementations is in the details of
34 parallelizing the inner loops. Since Pthreads has nothing analogous to a parallel for directive, we
must explicitly determine which values of the loop variables correspond to each thread's
calculations. To facilitate this, a function Loop_schedule determines
. the initial value of the loop variable,
. the final value of the loop variable, and
. the increment for the loop variable.

C409.5 BTL1
How the performance of the reduced solver is much superior to the performance
of the basic solver?
35
The efficiency of the basic solver on 16 nodes is about 0.95, while the efficiency of the reduced solver on
16 nodes is only about 0.70. A point to stress here is that the reduced MPI solver makes much more
efficient use of memory than the basic MPI solver; the basic solver must provide storage for all n positions
on each process, while the reduced solver only needs extra storage for n/comm_sz positions and
n/comm_sz forces.
C409.5 BTL1
How can we use OpenMP to map tasks/particles to cores in the basic version o
n-body solver?

for each timestep


{
if (timestep output) Print positions and velocities of particles;
for each particle q
Compute total force on q;
36
for each particle q
Compute position and velocity of q;
}
The two inner loops are both iterating over particles. So, in principle, parallelizing
the two inner for loops will map tasks/particles to cores, and we might try something
like this:
for each timestep
{
if (timestep output) Print positions and velocities of
particles;
# pragma omp parallel for


for each particle q
Compute total force on q;
# pragma omp parallel for
for each particle q
Compute position and velocity of q;
}

C409.5 BTL1
Write the pseudocode for the MPI version of the basic n-body solver?

Get input data;


2 for each timestep {
3 if (timestep output)
4 Print positions and velocities of particles;
37
5 for each local particle loc q
6 Compute total force on loc q;
7 for each local particle loc q
8 Compute position and velocity of loc q;
9 Allgather local positions into global pos
array;
10 }
11 Print positions and velocities of particles;
C409.5 BTL1
Write the pseudocode for the MPI implementation of the reduced n-body solver?

1 source = (my_rank + 1) % comm _sz;


2 dest = (my_rank - 1 + comm_sz) % comm_sz;
3 Copy_loc_pos_into_tmp pos;
4 loc_forces = tmp_forces = 0;
5
38
6 Compute forces due to interactions among local particles;
7 for (phase = 1; phase < comm sz; phase++) {
8 Send current tmp_pos and tmp_forces to dest;
9 Receive new tmp_pos and tmp_forces from source;
10 /_ Owner of the positions and forces we’re receiving _/
11 owner = (my_rank + phase) % comm_sz;
12 Compute forces due to interactions among my particles
13 and owner’s particles;
14 }
15 Send current tmp_pos and tmp_forces to dest;


16 Receive new tmp_pos and tmp_forces from source;

C409.5 BTL1
What are the two phases for computation of forces?
39
The computation of the forces is done in two phases: in the first phase each process computes the
forces resulting from interactions among its own particles, and in the second phase (the ring pass)
it computes the forces resulting from interactions between its particles and the particles owned by
the other processes. The following choices are made with respect to the data structures:
Each process stores the entire global array of particle masses.
Each process only uses a single n-element array for the positions.
Each process uses a pointer loc_pos that refers to the start of its block of pos.
C409.5 BTL1
Write the pseudo code for an implementation of a depth-first solution to TSP without
using recursion?
40
for (city = n-1; city >= 1; city--)
   Push(stack, city);
while (!Empty(stack)) {
   city = Pop(stack);
   if (city == NO_CITY) // End of child list, back up
      Remove_last_city(curr_tour);
   else {
      Add_city(curr_tour, city);
      if (City_count(curr_tour) == n) {
         if (Best_tour(curr_tour))
            Update_best_tour(curr_tour);
         Remove_last_city(curr_tour);
      } else {
         Push(stack, NO_CITY);
         for (nbr = n-1; nbr >= 1; nbr--)
            if (Feasible(curr_tour, nbr))
               Push(stack, nbr);
      }
   } /* if Feasible */
} /* while !Empty */
41 How the function Push_copy is used in TSP C409.5 BTL1
It is necessary to create a copy of the tour before actually pushing it onto the stack, using the function
Push_copy. The extra work of allocating storage for a new tour and copying the existing tour is
time-consuming. We can reduce these costs by saving freed tours in our own data structure, and when a
freed tour is available we can reuse it in the Push_copy function instead of calling malloc.
C409.5 BTL1
What are the algorithms for identifying which subtrees we assign to the processes or
threads?
42
 depth-first search
 breadth-first search
Define the term POSIX or PThreads C409.5 BTL1

Pthreads are libraries of type definitions, functions, and macros that can be used in C programs. POSIX is
43 a standard for Unix-like operating systems—for example, Linux and Mac OS X. It specifies a variety of
facilities that should be available in such systems. In particular, it specifies an application programming
interface (API) for multithreaded programming. Pthreads is not a programming language (such as C or
Java). Rather, like MPI, Pthreads specifies a library that can be linked with C programs. Unlike MPI, the
Pthreads API is only available on POSIX systems—Linux, Mac OS X, Solaris, HPUX, and so on.
C409.5 BTL1
What are the reason for parameter threads_in_cond_wait used in Tree
search?
44
There are also two cases to consider:
o threads_in_cond_wait < thread_count: it tells us how many threads are waiting.
o threads_in_cond_wait == thread_count: all the threads are out of work, and it's time to quit.

C409.5 BTL1
What are the global variables for Recursive depth first search?
45
n: the total number of cities in the problem .
digraph: a data structure representing the input digraph .
hometown: a data structure representing vertex or city 0, the salesperson’s hometown . best
tour: a data structure representing the best tour so far.
Mention the performance of MPI solvers C409.5 BTL1
The performance of the reduced solver is much superior to the performance of the
basic solver, although the basic solver achieves higher efficiencies .
46
A point to stress here is that the reduced MPI solver makes much more efficient
use of memory than the basic MPI solver; the basic solver must provide storage for
all n positions on each process, while the reduced solver only needs extra storage
for n/comm_sz positions and n/comm_sz forces.
47 Mention the principal data structures on pthread C409.5 BTL1
The vectors are two-dimensional arrays of doubles, and the mass, position, and
velocity of a single particle are stored in a struct. The forces are stored in an array of
vectors.
Name any two OpenMP environment variables (Nov/Dec 2018) C409.5 BTL1
48
OMP_NUM_THREADS
OMP_SCHEDULE
OMP_DYNAMIC
OMP_NESTED
List any two scoping clauses in OpenMP (Nov/Dec 2018) C409.5 BTL1
49  Shared Variables
 Private Variables

What are the reason for parameter threads_in_cond_wait used in Tree C409.5 BTL1
search?

there are also two cases to consider:


50
 threads_in_cond_wait < thread_count: it tells us how many
threads are waiting
 threads_in_cond_wait == thread_count: all the threads are out of
work, and it's time to quit.

PART-B
Bloom
Q. No. Questions CO ’s
Level

1.Explain n-Body solvers in OpenMP. C409.5 BTL5


1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
1. Kauffman/Elsevier, 2011 Page No:271-297

Explain about Tree Search. C409.5 BTL5

2. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No:229-318

Explain OpenMP implementations in detail. C409.5 BTL5

3. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No:15-20
Explain MPI implementations in detail. C409.5 BTL4


1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
4.
Kauffman/Elsevier, 2011 Page No: 86-88

Compare OpenMP and MPI implementations. C409.5 BTL4


5.
Refer Notes

Explain how to parallelize the tree search in detail. C409.5 BTL5

6. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-


Kauffman/Elsevier, 2011 Page No: 306-308

i.How to parallelize the basic solver using MPI. .(Apr/May2017) C409.5 BTL5
ii.Explain Non recursive depth first search. .(Apr/May2017)
7. 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No: 316-317 & 303-304

Explain the implementation of tree search using MPI and dynamic C409.5 BTL5
partitioning. .(Apr/May2017)
8.
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No:229-318

What does the n-body problem do? Give the pseudocode for the serial n- C409.5 BTL1
body solver and for computing n-body forces. (Nov/Dec 2017).
9 1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No: 271-297

How will you parallelize the reduced solver using OpenMP? C409.5 BTL1
(Nov/Dec 2017).
10
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011 Page No: 271-297

Generalize about the two Serial program C409.5 BTL6


11
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Describe about the Parallelizing the reduced solver using OpenMP C409.5 BTL2
12
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Describe about the Recursive and non recursive DFS C409.5 BTL2
13
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Examine the Data structures for the serial implementation C409.5 BTL3
14
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011
Express detail about the static parallelizing of tree search using C409.5 BTL1
15 PThreads
1. Peter S. Pacheco, “An Introduction to Parallel Programming”, Morgan-
Kauffman/Elsevier, 2011