Advanced Computer Architecture Assignment
Vector supercomputers are specialized machines designed to execute large-scale scientific and
engineering computations efficiently. Unlike traditional scalar processors that operate on single data
items at a time, vector processors can process multiple data elements simultaneously using a single
instruction. This parallel processing capability significantly enhances computational performance,
making vector supercomputers ideal for applications like weather forecasting, climate modeling, and
computational fluid dynamics.
1. Vector Registers: These are long registers that can store multiple data elements, typically
floating-point numbers.
2. Vector Processing Units (VPUs): These units perform arithmetic and logical operations on
the data stored in vector registers. They are optimized for high-speed vector operations.
3. Memory System: A high-bandwidth memory system is essential to supply data to the VPUs
at a rate that matches their processing speed.
4. Vector Instructions: These instructions specify operations on entire vectors of data, rather
than individual elements.
5. Pipelining: To maximize throughput, VPUs employ pipelining, allowing multiple operations
to be in progress concurrently.
6. Parallelism: Exploits data-level parallelism by operating on multiple data elements
simultaneously.
1. Weather Prediction: Solving large-scale differential equations for climate modelling.
1. Cray-1 (1976)
2. Fujitsu VP Series
o Developer: Fujitsu
3. NEC SX Series
Combines scalar and vector processing capabilities for scientific and engineering applications.
Diagram:
Q i). Draw the state transition diagram for scheduling the pipeline.
Q ii). List all simple and greedy cycles.
Q iii). Determine the optimal constant latency cycle and minimal average latency.
Q v). Let the pipeline clock period be T = 10 ns. Determine the throughput of this pipeline.
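A sketch of the part (v) computation, assuming the minimal average latency (MAL) comes from part (iii) (the reservation table itself is not reproduced here, so MAL is left symbolic):

$$f = \frac{1}{T} = \frac{1}{10\ \text{ns}} = 100\ \text{MHz}, \qquad \text{Throughput} = \frac{f}{\text{MAL}}\ \text{operations per second}$$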
Q3. Explain the differences among UMA, NUMA, COMA and NORMA computers.
These are different memory architectures used in multiprocessor systems. The differences lie in
how memory is organized, accessed, and shared among processors.
1. Uniform Memory Access (UMA)
Definition: In UMA systems, all processors share the same physical memory and have equal
access time to the memory, regardless of which processor accesses it.
Advantages:
Disadvantages:
o Performance bottleneck for systems with many processors because of contention for
memory.
o Limited scalability.
Example:
o Traditional multicore systems like Intel Xeon processors in SMP configuration.
Diagram:
2. Non-Uniform Memory Access (NUMA)
Definition: In NUMA systems, each processor has its own local memory, but processors can
also access the memory of other processors. Access to local memory is faster than access to
remote memory.
Advantages:
Disadvantages:
Example:
Diagram:
3. Cache-Only Memory Architecture (COMA)
Definition: In COMA, there is no separate main memory. Each processor has a large local
cache that serves as the memory. All memory in the system behaves like a distributed cache.
Advantages:
Disadvantages:
Example:
o Used in experimental and specialized systems, like KSR-1 (Kendall Square Research
machines).
Diagram:
4. No Remote Memory Access (NORMA)
Definition: In NORMA systems, each processor has its own private memory, and there is no
shared memory or remote memory access. Processors communicate via message passing.
Advantages:
Disadvantages:
Example:
Diagram:
Key Comparison Table
| Feature | UMA | NUMA | COMA | NORMA |
| Memory Sharing | Shared memory (uniform access) | Shared memory (non-uniform access) | Distributed cache as memory | No shared memory |
| Access Latency | Uniform | Non-uniform | Low (if cached), dynamic migration | High (requires explicit communication) |
| Scalability | Limited | Moderate | Moderate to high | High |
| Communication | Implicit (hardware) | Implicit (hardware/software) | Implicit (hardware) | Explicit (software, e.g., MPI) |
| Example | SMP systems | High-performance servers | KSR-1 | Beowulf clusters |
Q4. What is data dependence and control dependence? Write programs that show these
dependencies.
Data Dependence
Data dependence occurs when an instruction depends on the result of a previous instruction. There
are three types:
1. Read After Write (RAW), or flow dependence: an instruction reads a value produced by an earlier instruction.
2. Write After Read (WAR), or anti-dependence: an instruction writes a variable that an earlier instruction still reads.
3. Write After Write (WAW), or output dependence: two instructions write to the same variable.
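A minimal sketch in Python (the variables are illustrative) showing all three dependence types in source-code form:

b, c, g, h = 1, 2, 3, 4

# RAW (flow dependence): S2 reads the value that S1 writes.
a = b + c        # S1 writes a
d = a * 2        # S2 reads a, so S2 must execute after S1

# WAR (anti-dependence): S4 writes a variable that S3 still reads.
e = d + 1        # S3 reads d
d = 10           # S4 writes d, so S4 must not overtake S3

# WAW (output dependence): S5 and S6 write the same variable.
f = 5            # S5 writes f
f = g + h        # S6 writes f; the final value depends on their order
print(a, d, e, f)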
Control Dependence
Control dependence occurs when the execution of an instruction depends on the control flow of
a previous instruction (e.g., a branch or conditional).
Example:
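A minimal Python sketch (the values are illustrative); the assignments to y are control-dependent on the branch:

x = 7
if x > 0:        # branch: its outcome decides what executes next
    y = x + 1    # control-dependent on the branch being taken
else:
    y = x - 1    # control-dependent on the branch not being taken
print(y)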
Q5. What are data and control hazards? Describe various methods to resolve these hazards.
Data Hazards
Data hazards occur when instructions that exhibit data dependencies are executed
simultaneously in a pipeline and the desired order of execution is disrupted. These
dependencies can cause incorrect program execution if not handled properly.
Types of Data Hazards
1. Read After Write (RAW) - Also known as a true dependency. It occurs when an
instruction depends on the result of a previous instruction.
Example:
LOAD R1, 0(R2) # Instruction 1 loads memory into R1
ADD R3, R1, R4 # Instruction 2 reads R1
Problem: If Instruction 2 reads R1 before Instruction 1 writes it, incorrect data will be read.
2. Write After Read (WAR) - Also known as an anti-dependency. It occurs when an
instruction writes to a register that a previous instruction still has to read.
Example:
ADD R3, R1, R4 # Instruction 1 reads R1
LOAD R1, 0(R2) # Instruction 2 writes to R1
Problem: If Instruction 2 writes to R1 before Instruction 1 reads it, Instruction 1 will use the wrong value.
3. Write After Write (WAW) - Also known as an output dependency. It occurs when two
instructions write to the same register or memory location.
Example:
ADD R1, R2, R3 # Instruction 1 writes to R1
SUB R1, R4, R5 # Instruction 2 writes to R1
Problem: The final value of R1 depends on the order of completion.
Control Hazards
Control hazards occur due to branch instructions that change the flow of program
execution. These hazards arise when the pipeline makes incorrect predictions about the flow
of control.
Examples of Control Hazards
1. Branch Instructions:
o If the branch condition is not resolved early enough, the pipeline might fetch
instructions from the wrong path.
2. Jump Instructions:
o These redirect the program counter (PC) to a new location, potentially
invalidating prefetched instructions.
| Method | Advantage | Disadvantage |
| Pipeline Stalls | Simple to implement | Reduces pipeline efficiency |
| Instruction Reordering | Improves parallelism | Compiler-dependent |
| Speculative Execution | Improves branch handling | High hardware complexity |
Advantages:
Disadvantages:
Complexity: More difficult to design and implement due to the need for careful
synchronization and data management.
Non-Determinism: The output may vary slightly between different runs due to the
non-deterministic nature of task execution.
Real-world Analogy:
Synchronized: A factory assembly line where each worker performs a specific task in
a sequential order.
Asynchronous: A group of students working on a project, each focusing on different
parts independently.
| Feature | Synchronized Parallel Algorithms | Asynchronous Parallel Algorithms |
| Task Execution | Sequential with synchronization | Independent and concurrent |
| Data Dependency | Strict data dependencies | Relaxed data dependencies |
| Performance | Limited by synchronization | Can achieve high performance |
| Scalability | Limited by synchronization | Can scale well |
| Complexity | Simpler to implement | More complex to implement |
| Determinism | Deterministic | Non-deterministic |
| Task Dependency | Tasks are interdependent | Tasks are independent |
| Coordination | Requires strict coordination and synchronization | Less coordination is needed |
| Efficiency | Can be less efficient due to sequential nature | Can be more efficient due to parallel execution |
| Ease of Debugging | Easier due to deterministic order | Harder due to non-deterministic order |
i) Bernstein's Conditions
Bernstein's conditions are a set of rules used to determine whether two statements in a
program can be executed concurrently without affecting the final result. If two statements
satisfy these conditions, they are considered independent and can be executed in parallel.
1. Read-Write Conflict: The first statement must not read any variable that the second
statement writes to.
o R(S1) ∩ W(S2) = ∅
2. Write-Read Conflict: The first statement must not write to any variable that the
second statement reads from.
o W(S1) ∩ R(S2) = ∅
3. Write-Write Conflict: The two statements must not write to the same variable.
o W(S1) ∩ W(S2) = ∅
Example:
S1: x = y + z
S2: a = b * c
Here:
R(S1) = {y, z}, W(S1) = {x}; R(S2) = {b, c}, W(S2) = {a}, so all three intersections are empty.
Since all three conditions are satisfied, statements S1 and S2 can be executed concurrently.
Here's a Venn diagram illustrating the read and write sets for the two statements S1: x = y +
z and S2: a = b * c:
Explanation:
As we can see, the two circles do not intersect. This means that there is no overlap between
the read and write sets of the two statements. Therefore, they can be executed
concurrently without affecting each other's results.
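The check is mechanical enough to express in code. A small Python sketch (the set arguments are illustrative):

def bernstein_independent(R1, W1, R2, W2):
    # All three intersections must be empty for S1 and S2 to run in parallel.
    return not (R1 & W2) and not (W1 & R2) and not (W1 & W2)

# S1: x = y + z  ->  R(S1) = {y, z}, W(S1) = {x}
# S2: a = b * c  ->  R(S2) = {b, c}, W(S2) = {a}
print(bernstein_independent({"y", "z"}, {"x"}, {"b", "c"}, {"a"}))  # True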
Degree of Parallelism
The degree of parallelism (DOP) refers to the number of tasks, processes, or threads that
can execute simultaneously in a parallel system. It quantifies the level of concurrency
achievable in a program or system.
Formula:
A common formulation: DOP(t) is the number of tasks executing concurrently at time t, and the average DOP equals the total work divided by the parallel execution time.
Characteristics:
1. A high DOP indicates more tasks can run concurrently, improving performance on
systems with many processors.
2. Limited DOP often results from:
o Data dependencies.
o Resource constraints (e.g., limited processors or memory).
Example:
Task A: 2 seconds.
Task B: 3 seconds.
Task C: 4 seconds.
Task D: 2 seconds.
Sequential Execution: Total time = 2 + 3 + 4 + 2 = 11 seconds.
Parallel Execution: With three processors, for instance, A, B, and C run concurrently (4 seconds, bounded by the longest task), followed by D (2 seconds), giving 6 seconds.
Degree of Parallelism: At most three tasks run at once, so the peak DOP is 3; the average DOP is 11 / 6 ≈ 1.83.
Diagram:
1. Sequential Execution:
2. Parallel Execution:
In this case, three processors are utilized, and the execution time is reduced compared to
sequential processing.
Amdahl's Law
Amdahl's Law is a formula used to predict the maximum possible speedup of a program
using parallel processing. It states that the overall speedup of a program is limited by the
sequential portion of the program that cannot be parallelized.
Formula:
Speedup = 1 / ((1 − P) + P / N)
Where:
P: Fraction of the program that can be parallelized.
N: Number of processors.
Key Insights:
Diagram:
1. Sequential Portion: This represents the 30% of the task that cannot be parallelized. It
acts as a bottleneck and limits the overall speedup.
2. Parallel Portion: This represents the 70% of the task that can be parallelized. It is
divided into four equal segments, each representing a portion assigned to one of the
four processors (N=4).
3. Speedup: The overall speedup is limited by the sequential portion. Even though the
parallel portion can be sped up by a factor of 4, the total speedup is only about 2.11.
4. Maximum Speedup: As the number of processors (N) approaches infinity, the
speedup approaches the theoretical maximum, which is 1/(1-P) = 1/0.3 = 3.33.
However, in practice, achieving this maximum speedup is often not feasible due to
factors like communication overhead and resource limitations.
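Plugging the diagram's numbers (P = 0.7, N = 4) into the formula reproduces both figures quoted above:

$$S = \frac{1}{(1-P) + P/N} = \frac{1}{0.3 + 0.7/4} = \frac{1}{0.475} \approx 2.11, \qquad \lim_{N \to \infty} S = \frac{1}{1-P} = \frac{1}{0.3} \approx 3.33$$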
Tomasulo's Algorithm
Key Concepts:
Instruction Issue: Instructions are fetched from the instruction fetch unit and issued to the
reservation stations.
Reservation Stations: These are hardware buffers that hold instructions waiting for their
operands.
Reorder Buffer (ROB): This buffer holds the results of instructions in the order they were
issued, ensuring correct program order.
Functional Units: These are hardware units that perform arithmetic, logical, and floating-
point operations.
Algorithm Steps:
1. Instruction Issue:
o Instruction is fetched and decoded.
o If operands are ready, the instruction is immediately sent to the functional unit.
o If operands are not ready, the instruction is placed in a reservation station.
2. Execution:
o Functional units execute instructions as soon as their operands are available.
3. Completion:
o When an instruction completes, its result is written back to the register file or to
other reservation stations.
o The instruction is removed from the reservation station and the ROB.
Example (a representative three-instruction sequence consistent with the trace below):
ADD R1, R2, R3 # Instruction 1 writes R1
MUL R4, R5, R6 # Instruction 2 writes R4
SUB R7, R1, R4 # Instruction 3 reads R1 and R4
Execution:
1. Issue:
o Instructions 1 and 2 are issued to their respective functional units.
o Instruction 3 is issued but waits in its reservation station because R1 and R4 are not ready.
2. Execution:
o The addition and multiplication operations start executing.
3. Completion:
o The addition completes and its result is written to R1; Instruction 3 captures it but still waits for R4.
o The multiplication completes and its result is written to R4.
o Instruction 3 now has both operands, executes, and completes.
Benefits of Tomasulo's Algorithm:
Drawbacks:
BOOK EXAMPLE:
v). Remote procedure call
Remote Procedure Call (RPC) is a programming paradigm that allows a program to call a
procedure (function) on another computer as if it were a local procedure. This enables
distributed computing, where different components of an application can be located on
different machines.
1. Client Stub: The client program calls a local procedure, which is a stub function.
2. Parameter Marshalling: The client stub marshals (packages) the arguments for the
remote procedure.
3. Network Transmission: The marshalled arguments are sent over the network to the
server.
4. Server Stub: The server receives the request and unmarshals the arguments.
5. Procedure Execution: The server stub calls the actual remote procedure.
6. Result Marshalling: The server marshals the return value or results.
7. Network Transmission: The marshalled results are sent back to the client.
8. Client Stub: The client stub unmarshals the results and returns them to the client
program.
Steps of RPC:
Example:
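A minimal sketch using Python's standard xmlrpc module; the add procedure, host, and port are illustrative choices rather than part of any particular RPC standard:

# server.py: registers a procedure that remote clients can call
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b  # the remote procedure body

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add, "add")  # server-side stub dispatch
server.serve_forever()

# client.py: calls the procedure as if it were local
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # client stub marshals the arguments and unmarshals the result: 5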
Diagram:
Key Points:
RPC Frameworks:
By using RPC, developers can create distributed systems that are more scalable, fault-
tolerant, and efficient.
Core Components:
1. Instruction Pool: This is where the instructions to be executed are stored. In a SIMD
architecture, a single instruction is broadcast to all the processing elements.
2. Vector Unit: This unit houses multiple Processing Elements (PEs). Each PE has its own
local memory and arithmetic logic unit (ALU).
3. Data Pool: This is where the data to be processed is stored. Each PE has access to a
portion of the data pool.
How it Works:
1. Instruction Fetch: The control unit fetches an instruction from the instruction pool.
2. Instruction Broadcast: The fetched instruction is broadcast to all the PEs
simultaneously.
3. Data Fetch: Each PE fetches the data it needs to process from the data pool.
4. Execution: All PEs execute the same instruction on their respective data elements in
parallel.
5. Result Storage: The results of the computation are stored in the local memory of
each PE.
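As a software analogy (not a literal model of the PE hardware), NumPy's vectorized operations behave the same way: one operation is broadcast across every element at once:

import numpy as np

a = np.array([1, 2, 3, 4])     # data pool slice held by the "PEs"
b = np.array([10, 20, 30, 40])
c = a + b                      # one "instruction" applied to all elements in parallel
print(c)                       # [11 22 33 44]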
Key Points:
Limitations:
In essence, the SIMD architecture leverages the power of parallelism to efficiently process
large amounts of data. By executing the same instruction on multiple data elements
simultaneously, it achieves significant performance gains in many applications.
Advantages of SIMD:
Array processors are a type of parallel computer architecture designed to efficiently process
large arrays of data. They are well-suited for tasks like matrix operations, signal processing,
and image processing.
1. Matrix Multiplication:
Divide and Conquer: The matrices are divided into smaller sub-matrices. Each
processor is assigned a sub-matrix multiplication task.
Systolic Array: A specialized hardware array where data flows through the array, and
each processing element performs a simple operation on the data it receives.
2. Fast Fourier Transform (FFT):
Parallel Radix-2 FFT: The FFT algorithm is divided into stages, and each stage can be
parallelized across multiple processors.
3. Sorting:
Parallel Merge Sort: The array is divided into smaller sub-arrays, which are sorted
independently. The sorted sub-arrays are then merged in parallel.
Parallel Quick Sort: Similar to sequential quick sort, but the partitioning and sorting
of sub-arrays can be parallelized.
Consider a 2D array processor with 4x4 processing elements. We want to multiply two 4x4
matrices, A and B, to get the result matrix C.
2D array processor performing matrix multiplication
Steps:
1. Data Distribution: Matrix A and B are distributed across the processing elements.
2. Parallel Multiplication: Each processing element multiplies the corresponding
elements of its sub-matrices of A and B.
3. Partial Summation: The partial products are summed within each row and column of
processing elements.
4. Global Summation: The final results are obtained by summing the partial sums
across the entire array.
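A small Python model of these steps for the 4x4 case, where each (i, j) index plays the role of one processing element (the input matrices are illustrative):

import numpy as np

n = 4
A = np.arange(n * n).reshape(n, n)       # illustrative 4x4 inputs
B = np.arange(n * n).reshape(n, n) % 5

C = np.zeros((n, n), dtype=A.dtype)
for i in range(n):
    for j in range(n):
        # PE (i, j) forms its partial products and sums them locally.
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))

assert (C == A @ B).all()                # agrees with the direct product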
Limitations:
Q11. Discuss the scheduling and load balancing problem for a multi-processor system.
Give a suitable example with illustrative diagrams.
In a multi-processor system, efficient scheduling and load balancing are crucial for
maximizing performance and resource utilization. Scheduling involves assigning tasks to
processors, while load balancing aims to distribute the workload evenly among processors.
Scheduling Problem
The scheduling problem in a multi-processor system involves determining the optimal order
and timing for executing tasks. The goal is to minimize task completion time and maximize
system throughput.
Types of Scheduling:
Static Scheduling: Tasks are assigned to processors before execution begins. This
approach is suitable for tasks with predictable execution times.
Dynamic Scheduling: Tasks are assigned to processors at runtime, based on their
availability and the current workload. This approach is more flexible but can
introduce overhead due to dynamic scheduling decisions.
Load balancing ensures that the workload is evenly distributed among processors. This helps
prevent idle processors and resource bottlenecks.
Static Load Balancing: Tasks are assigned to processors based on static estimates of
their workload.
Dynamic Load Balancing: Tasks are dynamically migrated between processors to
balance the load at runtime.
Consider a multi-processor system with 4 processors (P1, P2, P3, P4) and a set of tasks with
different execution times.
Static Scheduling:
Dynamic Scheduling:
As T1 finishes, P1 can take on T5. If P2 finishes T3 before P1 finishes T1, P2 can take on T5.
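A greedy dynamic load balancer can be sketched in a few lines of Python; the task times below are illustrative, since the example does not list them:

import heapq

tasks = {"T1": 8, "T2": 4, "T3": 6, "T4": 2, "T5": 5}   # illustrative times

# Min-heap keyed on accumulated load: each task goes to the currently
# least-loaded processor, mimicking runtime (dynamic) balancing.
procs = [(0, p) for p in ["P1", "P2", "P3", "P4"]]
heapq.heapify(procs)

for name, cost in sorted(tasks.items(), key=lambda t: -t[1]):
    load, proc = heapq.heappop(procs)
    print(f"{name} ({cost}s) -> {proc}")
    heapq.heappush(procs, (load + cost, proc))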
Diagram:
Challenges:
Interconnection Networks
1. Direct Networks:
Fully Connected Network: Each PE is directly connected to every other PE. While this
offers high connectivity, it becomes impractical for large systems due to the high
number of connections.
Star Network: A central node (hub) connects to all other PEs. This network is simple
but prone to bottlenecks at the hub.
Bus Network: A single bus connects all PEs. It's simple and cheap but has limited
bandwidth and scalability.
Ring Network: PEs are connected in a circular fashion. It offers better scalability than
a bus but can suffer from congestion if many PEs need to communicate
simultaneously.
2. Indirect Networks:
Disadvantages:
Diagram:
A 2D Mesh interconnection network: 9 nodes arranged in a 3x3 grid, each connected to its
immediate neighbours (up, down, left, right).
The choice of interconnection network depends on the specific requirements of the parallel
application, such as the communication pattern, the number of PEs, and the desired
performance. By carefully selecting and designing the interconnection network, it is possible
to achieve high performance and efficiency in parallel computing systems.
Instruction Pipeline
While one instruction is in the Write Back stage, the next instruction can be in the Memory
Access stage, and so on. This overlapping of instruction execution increases the overall
throughput of the processor.
Example:
Pipeline Execution:
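A typical space-time chart for three instructions in a five-stage pipeline (IF = fetch, ID = decode, EX = execute, MEM = memory access, WB = write back):

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
| I1 | IF | ID | EX | MEM | WB | | |
| I2 | | IF | ID | EX | MEM | WB | |
| I3 | | | IF | ID | EX | MEM | WB |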
As you can see, multiple instructions are being processed simultaneously in different stages
of the pipeline. This significantly improves the overall performance of the processor.
Pipeline Hazards:
o Structural Hazards: When multiple instructions require the same hardware
resource at the same time.
o Data Hazards: When an instruction depends on the result of a previous
instruction that has not yet completed.
o Control Hazards: When the instruction pipeline needs to be stalled due to
branch instructions or exceptions.
Clock Cycle Time: The pipeline stage with the longest delay determines the overall
clock cycle time.
To address these challenges, techniques like pipelining optimization, branch prediction, and
instruction scheduling are used to improve pipeline efficiency.
Cache Coherence
Maintaining consistency among those copies raises the problem of cache coherence. The
following are the three causes:
1. Sharing of Writable Data: When multiple processors modify the same data, their cached
copies can become outdated.
2. Process Migration: If a process with cached data moves to a different processor, its
cached data might not be consistent with the main memory.
3. I/O Activity: I/O operations can update memory directly making cached data invalid.
From the point of view of cache coherence, data structures can be divided into three
classes:
1. Read-Only: No coherence issues arise as there are no updates to the data.
2. Shared Writable: The main source of coherence problems, since multiple processes may
read and write the same data.
3. Private Writable: Problems only occur during process migration, as the data is only
written by a single process.
There are several techniques to maintain cache coherence for the shared writable data
structure. These methods can be divided into two classes:
1. Memory Update Policy: This category focuses on how the memory is updated when
a cache block is modified.
Write Through: Each write operation is written to both the cache and main memory
simultaneously.
DIAGRAM:
Write Back: Writes are only made to the cache. The modified block is written back to
main memory only when it is evicted from the cache.
DIAGRAM:
2. Cache Coherence Policy: This category deals with how the cache maintains
consistency among multiple caches.
o Write Invalidate: When a block is modified, all other caches containing that
block are invalidated.
DIAGRAM:
o Write Update: When a block is modified, all other caches containing that
block are updated with the new value.
DIAGRAM:
3. Interconnection Scheme: This category determines how the different components of
the system, such as processors and memory modules, are connected and how data
flows between them, which significantly impacts the overall performance and
efficiency of the cache coherence mechanism.
o Single Bus: A single shared bus connects all components.
o Multistage Directory: A hierarchical directory structure is used to track cache
block locations.
o Multiple Bus Hierarchical Cache Coherence Protocols: Multiple buses are
used, and cache coherence is maintained using hierarchical protocols.
Full-Map Directories: Each directory entry contains a full map of the cache locations
for a particular block.
Limited Directories: Each directory entry contains a limited number of pointers to
caches containing the block.
Chained Directories: Directory entries for a block are chained together, forming a
linked list.
Main Categories:
Problem:
When multiple processors have cached copies of the same memory location, and one
processor modifies the data, the other copies become stale. This inconsistency can lead to
incorrect program execution.
Example:
1. Initial State:
o P1 reads the value of X from main memory and caches it.
o P2 also reads the value of X from main memory and caches it.
o Both P1 and P2 have identical copies of X in their caches.
2. Modification:
o P1 modifies the value of X in its cache.
3. Incoherence:
P2's cached copy of X is now stale. If P2 reads X from its cache, it will get the old, incorrect
value.
Solutions:
Cache coherence protocols are used to maintain consistency among caches. Common approaches
include:
Write-Invalidate: When a processor writes to a shared location, all other cached copies of
that line are invalidated before the write proceeds.
Diagram (Write-Invalidate):
Write-Update: When a processor writes to a shared location, it updates the corresponding
cache lines in other processors' caches.
Q15. Write an algorithm for associative search.
Let's say we have an associative memory storing student records. Each record contains the
student's ID, name, and grade. We want to find the record for a student with the ID
"12345".
In this diagram:
Key Points:
Parallelism: Associative search is inherently parallel, making it very fast for certain
types of searches.
Hardware Complexity: Implementing true associative memory can be complex and
expensive.
Applications: Used in areas like network routers, database systems, and content-
based image retrieval.
Q16. Prove that a k-stage pipeline can be at most k times faster than a non-pipelined
processor.
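A standard derivation, assuming n instructions flow through k stages with a common stage delay of one clock period τ:

$$T_{\text{non-pipelined}} = n k \tau, \qquad T_{\text{pipelined}} = (k + n - 1)\,\tau$$

$$S = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} = \frac{nk}{k + n - 1} \longrightarrow k \quad \text{as } n \to \infty$$

Since nk / (k + n − 1) < k for any finite n (and any k > 1), the speedup can approach but never exceed k.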
Pipeline hazards are situations that disrupt the smooth flow of instructions through the
pipeline, causing stalls or delays. They occur when the next instruction cannot execute in its
designated clock cycle due to various dependencies.
1. Structural Hazards:
o Occur when multiple instructions require the same hardware resource
simultaneously (e.g., two instructions trying to access the same memory
unit).
o Solution: Replicate hardware resources (e.g., multiple memory units).
2. Data Hazards:
o Occur when an instruction depends on the result of a previous instruction
that is still being processed in the pipeline.
o Types of Data Hazards:
Read After Write (RAW): An instruction tries to read a data item
before a previous instruction has finished writing it.
Write After Read (WAR): An instruction tries to write to a register
before a previous instruction has read from it.
Write After Write (WAW): Two instructions try to write to the same
register simultaneously.
o Solutions:
Data Forwarding: Forward the result of the first instruction directly to
the second instruction, bypassing the register file.
Stalling: Insert "no-op" instructions to stall the pipeline until the data
dependency is resolved.
3. Control Hazards:
o Occur due to branch instructions (e.g., conditional jumps) that alter the
normal flow of instruction execution.
o The pipeline may fetch and decode incorrect instructions if the branch
outcome is not known immediately.
o Solutions:
Branch Prediction: Predict the outcome of the branch and continue
fetching instructions along the predicted path.
Branch Delay Slots: Insert instructions after the branch that will
always be executed, regardless of the branch outcome.
Q18. Explain how the occurrence of a branch affects pipeline execution.
Branch instructions, such as conditional jumps (e.g., if, else), significantly impact pipeline
execution. They disrupt the smooth flow of instructions by altering the program counter,
which determines the next instruction to be fetched.
1. Pipeline Stalls:
When a branch instruction is encountered, the pipeline must typically wait until the
branch condition is evaluated to determine the correct path to follow.
This leads to a stall, as subsequent instructions cannot be fetched until the branch
outcome is known.
The pipeline must stop fetching new instructions until the branch decision is
resolved.
This causes a delay, reducing the pipeline's efficiency.
2. Misprediction Penalties:
Assume: The branch predictor incorrectly predicts that the branch will not be taken.
Consequences:
o Instructions 4 and 5 are fetched and processed, even though the branch
condition is actually true.
o Once the branch condition is evaluated and the branch is found to be taken,
the pipeline must be flushed.
o The correct instructions following the branch target address need to be
fetched and processed, resulting in a significant delay.
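The cost can be summarized with a standard formula, assuming a base CPI of 1, a fraction b of instructions that are branches, a misprediction rate p, and a flush penalty of c cycles per misprediction:

$$\text{Effective CPI} = 1 + b \cdot p \cdot c$$

For example, with b = 0.2, p = 0.1, and c = 3, the effective CPI rises from 1 to 1.06.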
Flynn's Classification
This scheme was introduced by Michael J. Flynn and is based on the multiplicity of the instruction
stream (IS), i.e., the sequence of instructions executed by the machine, and the data stream (DS),
i.e., the sequence of data (input, temporary, or partial results) referenced by the instructions.
Following are Flynn's four machine organizations.
Diagrams:
1. SISD (Single Instruction, Single Data):
It may have more than one functional unit but under the supervision of one control
unit.
Instructions are executed sequentially, but may be overlapped in their execution
stage.
Most SISD uniprocessor systems are pipelined.
EXAMPLE: A traditional desktop computer or laptop with a single CPU.
DIAGRAM:
2. SIMD (Single Instruction, Multiple Data):
Multiple processing elements are used and they are supervised by the same control
unit.
The same instruction stream (from the control unit) is received by all PEs, which
operate on different data sets from distinct data streams.
Shared-memory subsystems may contain multiple modules.
SIMD machines can be further divided into two modes: word-slice versus bit-slice.
This class corresponds to array processors.
DIAGRAM:
3. MISD (Multiple Instruction, Single Data):
Multiple processor units receive distinct instructions from (distinct) control units but
they operate on the same data stream and its derivatives.
This structure has received much less attention and no real example of this class
exists.
EXAMPLE: A hypothetical system where multiple processors analyse the same data set
independently.
DIAGRAM:
4. MIMD (Multiple Instruction, Multiple Data):
Multiple processors execute different instruction streams on different data streams;
most multiprocessor systems belong to this class.
EXAMPLE: Modern multicore processors and clusters.
DIAGRAM:
Why is Flynn's Classification Important?
Handler's Classification is a system for describing computer architectures, focusing on the degree of
parallelism and pipelining within the system. It divides the computer into three levels:
1. Processor Control Unit (PCU): This level corresponds to the processor or CPU. It controls the
execution of instructions.
2. Arithmetic Logic Unit (ALU): This level corresponds to the functional units or processing elements.
It performs arithmetic and logical operations.
3. Bit-Level Circuit (BLC): This level corresponds to the logic circuits needed to perform one-bit
operations within the ALU.
Notation:
T(C) = <k × k', d × d', w × w'>, where k is the number of PCUs, d the number of ALUs per PCU, and
w the word length, with k', d', and w' giving the degree of pipelining at each of the three levels.
Example:
Key Points:
Parallelism: The k and d values indicate the degree of parallelism at the PCU and ALU levels,
respectively.
Pipelining: The k', d', and w' values indicate the degree of pipelining at each level.
Flexibility: Handler's classification can describe a wide range of computer architectures, from
simple to highly complex.
DIAGRAM: