Advanced Computer Architecture Assignment
Vector supercomputers are specialized machines designed to execute large-scale scientific and
engineering computations efficiently. Unlike traditional scalar processors that operate on single data
items at a time, vector processors can process multiple data elements simultaneously using a single
instruction. This parallel processing capability significantly enhances computational performance,
making vector supercomputers ideal for applications like weather forecasting, climate modeling, and
computational fluid dynamics.
1. Vector Registers: These are long registers that can store multiple data elements, typically
floating-point numbers.
2. Vector Processing Units (VPUs): These units perform arithmetic and logical operations on
the data stored in vector registers. They are optimized for high-speed vector operations.
3. Memory System: A high-bandwidth memory system is essential to supply data to the VPUs
at a rate that matches their processing speed.
4. Vector Instructions: These instructions specify operations on entire vectors of data, rather
than individual elements.
5. Pipelining: To maximize throughput, VPUs employ pipelining, allowing multiple operations
to be in progress concurrently.
6. Parallelism: Exploits data-level parallelism by operating on multiple data elements
simultaneously.
1. Weather Prediction: Solving large-scale differential equations for climate modelling.
1. Cray-1 (1976)
2. Fujitsu VP Series
o Developer: Fujitsu
3. NEC SX Series
Combines scalar and vector processing capabilities for scientific and engineering applications.
Diagram:
Q i). Draw the state transition diagram for scheduling the pipeline.
Q ii). List all simple and greedy cycles.
Q iii). Determine the optimal constant latency cycle and minimal average latency.
Q v). Let the pipeline clock period be T = 10 ns. Determine the throughput of this pipeline.
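A sketch of the part (v) computation, assuming the minimal average latency (MAL) comes from part (iii) (the reservation table itself is not reproduced here, so MAL is left symbolic):

$$f = \frac{1}{T} = \frac{1}{10\ \text{ns}} = 100\ \text{MHz}, \qquad \text{Throughput} = \frac{f}{\text{MAL}}\ \text{operations per second}$$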
Q3. Explain the differences among UMA, NUMA, COMA and NORMA computers.
These are different memory architectures used in multiprocessor systems. The differences lie in
how memory is organized, accessed, and shared among processors.
1. Uniform Memory Access (UMA)
Definition: In UMA systems, all processors share the same physical memory and have equal
access time to the memory, regardless of which processor accesses it.
Advantages:
Disadvantages:
o Performance bottleneck for systems with many processors because of contention for
memory.
o Limited scalability.
Example:
o Traditional multicore systems like Intel Xeon processors in SMP configuration.
Diagram:
2. Non-Uniform Memory Access (NUMA)
Definition: In NUMA systems, each processor has its own local memory, but processors can
also access the memory of other processors. Access to local memory is faster than access to
remote memory.
Advantages:
Disadvantages:
Example:
Diagram:
3. Cache-Only Memory Architecture (COMA)
Definition: In COMA, there is no separate main memory. Each processor has a large local
cache that serves as the memory. All memory in the system behaves like a distributed cache.
Advantages:
Disadvantages:
Example:
o Used in experimental and specialized systems, like KSR-1 (Kendall Square Research
machines).
Diagram:
4. No Remote Memory Access (NORMA)
Definition: In NORMA systems, each processor has its own private memory, and there is no
shared memory or remote memory access. Processors communicate via message passing.
Advantages:
Disadvantages:
Example:
Diagram:
Key Comparison Table
| Feature | UMA | NUMA | COMA | NORMA |
| Memory Sharing | Shared memory (uniform access) | Shared memory (non-uniform access) | Distributed cache as memory | No shared memory |
| Access Latency | Uniform | Non-uniform | Low (if cached), dynamic migration | High (requires explicit communication) |
| Scalability | Limited | Moderate | Moderate to high | High |
| Communication | Implicit (hardware) | Implicit (hardware/software) | Implicit (hardware) | Explicit (software, e.g., MPI) |
| Example | SMP systems | High-performance servers | KSR-1 | Beowulf clusters |
Q4. What is data dependence and control dependence? Write programs that show these
dependencies.
Data Dependence
Data dependence occurs when an instruction depends on the result of a previous instruction. There
are three types:
1. Read After Write (RAW), or flow dependence: an instruction reads a value produced by an earlier instruction.
2. Write After Read (WAR), or anti-dependence: an instruction writes a variable that an earlier instruction still reads.
3. Write After Write (WAW), or output dependence: two instructions write to the same variable.
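A minimal sketch in Python (the variables are illustrative) showing all three dependence types in source-code form:

b, c, g, h = 1, 2, 3, 4

# RAW (flow dependence): S2 reads the value that S1 writes.
a = b + c        # S1 writes a
d = a * 2        # S2 reads a, so S2 must execute after S1

# WAR (anti-dependence): S4 writes a variable that S3 still reads.
e = d + 1        # S3 reads d
d = 10           # S4 writes d, so S4 must not overtake S3

# WAW (output dependence): S5 and S6 write the same variable.
f = 5            # S5 writes f
f = g + h        # S6 writes f; the final value depends on their order
print(a, d, e, f)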
Control Dependence
Control dependence occurs when the execution of an instruction depends on the control flow of
a previous instruction (e.g., a branch or conditional).
Example:
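A minimal Python sketch (the values are illustrative); the assignments to y are control-dependent on the branch:

x = 7
if x > 0:        # branch: its outcome decides what executes next
    y = x + 1    # control-dependent on the branch being taken
else:
    y = x - 1    # control-dependent on the branch not being taken
print(y)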
Q5. What are data and control hazards? Describe various methods to resolve these hazards.
Data Hazards
Data hazards occur when instructions that exhibit data dependencies are executed
simultaneously in a pipeline and the desired order of execution is disrupted. These
dependencies can cause incorrect program execution if not handled properly.
Types of Data Hazards
1. Read After Write (RAW) - Also known as a true dependency. It occurs when an
instruction depends on the result of a previous instruction.
Example:
LOAD R1, 0(R2) # Instruction 1 loads memory into R1
ADD R3, R1, R4 # Instruction 2 reads R1
Problem: If Instruction 2 reads R1 before Instruction 1 writes it, incorrect data will be read.
2. Write After Read (WAR) - Also known as an anti-dependency. It occurs when an
instruction writes to a register that a previous instruction still has to read.
Example:
ADD R3, R1, R4 # Instruction 1 reads R1
LOAD R1, 0(R2) # Instruction 2 writes to R1
Problem: If Instruction 2 writes to R1 before Instruction 1 reads it, Instruction 1 will use the wrong value.
3. Write After Write (WAW) - Also known as an output dependency. It occurs when two
instructions write to the same register or memory location.
Example:
ADD R1, R2, R3 # Instruction 1 writes to R1
SUB R1, R4, R5 # Instruction 2 writes to R1
Problem: The final value of R1 depends on the order of completion.
Control Hazards
Control hazards occur due to branch instructions that change the flow of program
execution. These hazards arise when the pipeline makes incorrect predictions about the flow
of control.
Examples of Control Hazards
1. Branch Instructions:
o If the branch condition is not resolved early enough, the pipeline might fetch
instructions from the wrong path.
2. Jump Instructions:
o These redirect the program counter (PC) to a new location, potentially
invalidating prefetched instructions.
| Method | Advantage | Disadvantage |
| Pipeline Stalls | Simple to implement | Reduces pipeline efficiency |
| Instruction Reordering | Improves parallelism | Compiler-dependent |
| Speculative Execution | Improves branch handling | High hardware complexity |
Advantages:
Disadvantages:
Complexity: More difficult to design and implement due to the need for careful
synchronization and data management.
Non-Determinism: The output may vary slightly between different runs due to the
non-deterministic nature of task execution.
Real-world Analogy:
Synchronized: A factory assembly line where each worker performs a specific task in
a sequential order.
Asynchronous: A group of students working on a project, each focusing on different
parts independently.
| Feature | Synchronized Parallel Algorithms | Asynchronous Parallel Algorithms |
| Task Execution | Sequential with synchronization | Independent and concurrent |
| Data Dependency | Strict data dependencies | Relaxed data dependencies |
| Performance | Limited by synchronization | Can achieve high performance |
| Scalability | Limited by synchronization | Can scale well |
| Complexity | Simpler to implement | More complex to implement |
| Determinism | Deterministic | Non-deterministic |
| Task Dependency | Tasks are interdependent | Tasks are independent |
| Coordination | Requires strict coordination and synchronization | Less coordination is needed |
| Efficiency | Can be less efficient due to sequential nature | Can be more efficient due to parallel execution |
| Ease of Debugging | Easier due to deterministic order | Harder due to non-deterministic order |
i) Bernstein's Conditions
Bernstein's conditions are a set of rules used to determine whether two statements in a
program can be executed concurrently without affecting the final result. If two statements
satisfy these conditions, they are considered independent and can be executed in parallel.
1. Read-Write Conflict: The first statement must not read any variable that the second
statement writes to.
o R(S1) ∩ W(S2) = ∅
2. Write-Read Conflict: The first statement must not write to any variable that the
second statement reads from.
o W(S1) ∩ R(S2) = ∅
3. Write-Write Conflict: The two statements must not write to the same variable.
o W(S1) ∩ W(S2) = ∅
Example:
S1: x = y + z
S2: a = b * c
Here:
R(S1) = {y, z}, W(S1) = {x}; R(S2) = {b, c}, W(S2) = {a}, so all three intersections are empty.
Since all three conditions are satisfied, statements S1 and S2 can be executed concurrently.
Here's a Venn diagram illustrating the read and write sets for the two statements S1: x = y +
z and S2: a = b * c:
Explanation:
As we can see, the two circles do not intersect. This means that there is no overlap between
the read and write sets of the two statements. Therefore, they can be executed
concurrently without affecting each other's results.
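The check is mechanical enough to express in code. A small Python sketch (the set arguments are illustrative):

def bernstein_independent(R1, W1, R2, W2):
    # All three intersections must be empty for S1 and S2 to run in parallel.
    return not (R1 & W2) and not (W1 & R2) and not (W1 & W2)

# S1: x = y + z  ->  R(S1) = {y, z}, W(S1) = {x}
# S2: a = b * c  ->  R(S2) = {b, c}, W(S2) = {a}
print(bernstein_independent({"y", "z"}, {"x"}, {"b", "c"}, {"a"}))  # True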
Degree of Parallelism
The degree of parallelism (DOP) refers to the number of tasks, processes, or threads that
can execute simultaneously in a parallel system. It quantifies the level of concurrency
achievable in a program or system.
Formula:
A common formulation: DOP(t) is the number of tasks executing concurrently at time t, and the average DOP equals the total work divided by the parallel execution time.
Characteristics:
1. A high DOP indicates more tasks can run concurrently, improving performance on
systems with many processors.
2. Limited DOP often results from:
o Data dependencies.
o Resource constraints (e.g., limited processors or memory).
Example:
Task A: 2 seconds.
Task B: 3 seconds.
Task C: 4 seconds.
Task D: 2 seconds.
Sequential Execution: Total time = 2 + 3 + 4 + 2 = 11 seconds.
Parallel Execution: With three processors, for instance, A, B, and C run concurrently (4 seconds, bounded by the longest task), followed by D (2 seconds), giving 6 seconds.
Degree of Parallelism: At most three tasks run at once, so the peak DOP is 3; the average DOP is 11 / 6 ≈ 1.83.
Diagram:
1. Sequential Execution:
2. Parallel Execution:
In this case, three processors are utilized, and the execution time is reduced compared to
sequential processing.
Amdahl's Law
Amdahl's Law is a formula used to predict the maximum possible speedup of a program
using parallel processing. It states that the overall speedup of a program is limited by the
sequential portion of the program that cannot be parallelized.
Formula:
Speedup = 1 / ((1 − P) + P / N)
Where:
P: Fraction of the program that can be parallelized.
N: Number of processors.
Key Insights:
Diagram:
1. Sequential Portion: This represents the 30% of the task that cannot be parallelized. It
acts as a bottleneck and limits the overall speedup.
2. Parallel Portion: This represents the 70% of the task that can be parallelized. It is
divided into four equal segments, each representing a portion assigned to one of the
four processors (N=4).
3. Speedup: The overall speedup is limited by the sequential portion. Even though the
parallel portion can be sped up by a factor of 4, the total speedup is only about 2.11.
4. Maximum Speedup: As the number of processors (N) approaches infinity, the
speedup approaches the theoretical maximum, which is 1/(1-P) = 1/0.3 = 3.33.
However, in practice, achieving this maximum speedup is often not feasible due to
factors like communication overhead and resource limitations.
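Plugging the diagram's numbers (P = 0.7, N = 4) into the formula reproduces both figures quoted above:

$$S = \frac{1}{(1-P) + P/N} = \frac{1}{0.3 + 0.7/4} = \frac{1}{0.475} \approx 2.11, \qquad \lim_{N \to \infty} S = \frac{1}{1-P} = \frac{1}{0.3} \approx 3.33$$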
Tomasulo's Algorithm
Key Concepts:
Instruction Issue: Instructions are fetched from the instruction fetch unit and issued to the
reservation stations.
Reservation Stations: These are hardware buffers that hold instructions waiting for their
operands.
Reorder Buffer (ROB): This buffer holds the results of instructions in the order they were
issued, ensuring correct program order.
Functional Units: These are hardware units that perform arithmetic, logical, and floating-
point operations.
Algorithm Steps:
1. Instruction Issue:
o Instruction is fetched and decoded.
o If operands are ready, the instruction is immediately sent to the functional unit.
o If operands are not ready, the instruction is placed in a reservation station.
2. Execution:
o Functional units execute instructions as soon as their operands are available.
3. Completion:
o When an instruction completes, its result is written back to the register file or to
other reservation stations.
o The instruction is removed from the reservation station and the ROB.
Example (a representative three-instruction sequence consistent with the trace below):
ADD R1, R2, R3 # Instruction 1 writes R1
MUL R4, R5, R6 # Instruction 2 writes R4
SUB R7, R1, R4 # Instruction 3 reads R1 and R4
Execution:
1. Issue:
o Instructions 1 and 2 are issued to their respective functional units.
o Instruction 3 is issued but waits in its reservation station because R1 and R4 are not ready.
2. Execution:
o The addition and multiplication operations start executing.
3. Completion:
o The addition completes and its result is written to R1; Instruction 3 captures it but still waits for R4.
o The multiplication completes and its result is written to R4.
o Instruction 3 now has both operands, executes, and completes.
Benefits of Tomasulo's Algorithm:
Drawbacks:
BOOK EXAMPLE:
v). Remote procedure call
Remote Procedure Call (RPC) is a programming paradigm that allows a program to call a
procedure (function) on another computer as if it were a local procedure. This enables
distributed computing, where different components of an application can be located on
different machines.
1. Client Stub: The client program calls a local procedure, which is a stub function.
2. Parameter Marshalling: The client stub marshals (packages) the arguments for the
remote procedure.
3. Network Transmission: The marshalled arguments are sent over the network to the
server.
4. Server Stub: The server receives the request and unmarshals the arguments.
5. Procedure Execution: The server stub calls the actual remote procedure.
6. Result Marshalling: The server marshals the return value or results.
7. Network Transmission: The marshalled results are sent back to the client.
8. Client Stub: The client stub unmarshals the results and returns them to the client
program.
Steps of RPC:
Example:
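A minimal sketch using Python's standard xmlrpc module; the add procedure, host, and port are illustrative choices rather than part of any particular RPC standard:

# server.py: registers a procedure that remote clients can call
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b  # the remote procedure body

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add, "add")  # server-side stub dispatch
server.serve_forever()

# client.py: calls the procedure as if it were local
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # client stub marshals the arguments and unmarshals the result: 5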
Diagram:
Key Points:
RPC Frameworks:
By using RPC, developers can create distributed systems that are more scalable, fault-
tolerant, and efficient.
Core Components:
1. Instruction Pool: This is where the instructions to be executed are stored. In a SIMD
architecture, a single instruction is broadcast to all the processing elements.
2. Vector Unit: This unit houses multiple Processing Elements (PEs). Each PE has its own
local memory and arithmetic logic unit (ALU).
3. Data Pool: This is where the data to be processed is stored. Each PE has access to a
portion of the data pool.
How it Works:
1. Instruction Fetch: The control unit fetches an instruction from the instruction pool.
2. Instruction Broadcast: The fetched instruction is broadcast to all the PEs
simultaneously.
3. Data Fetch: Each PE fetches the data it needs to process from the data pool.
4. Execution: All PEs execute the same instruction on their respective data elements in
parallel.
5. Result Storage: The results of the computation are stored in the local memory of
each PE.
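As a software analogy (not a literal model of the PE hardware), NumPy's vectorized operations behave the same way: one operation is broadcast across every element at once:

import numpy as np

a = np.array([1, 2, 3, 4])     # data pool slice held by the "PEs"
b = np.array([10, 20, 30, 40])
c = a + b                      # one "instruction" applied to all elements in parallel
print(c)                       # [11 22 33 44]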
Key Points:
Limitations:
In essence, the SIMD architecture leverages the power of parallelism to efficiently process
large amounts of data. By executing the same instruction on multiple data elements
simultaneously, it achieves significant performance gains in many applications.
Advantages of SIMD:
Array processors are a type of parallel computer architecture designed to efficiently process
large arrays of data. They are well-suited for tasks like matrix operations, signal processing,
and image processing.
1. Matrix Multiplication:
Divide and Conquer: The matrices are divided into smaller sub-matrices. Each
processor is assigned a sub-matrix multiplication task.
Systolic Array: A specialized hardware array where data flows through the array, and
each processing element performs a simple operation on the data it receives.
2. Fast Fourier Transform (FFT):
Parallel Radix-2 FFT: The FFT algorithm is divided into stages, and each stage can be
parallelized across multiple processors.
3. Sorting:
Parallel Merge Sort: The array is divided into smaller sub-arrays, which are sorted
independently. The sorted sub-arrays are then merged in parallel.
Parallel Quick Sort: Similar to sequential quick sort, but the partitioning and sorting
of sub-arrays can be parallelized.
Consider a 2D array processor with 4x4 processing elements. We want to multiply two 4x4
matrices, A and B, to get the result matrix C.
2D array processor performing matrix multiplication
Steps:
1. Data Distribution: Matrix A and B are distributed across the processing elements.
2. Parallel Multiplication: Each processing element multiplies the corresponding
elements of its sub-matrices of A and B.
3. Partial Summation: The partial products are summed within each row and column of
processing elements.
4. Global Summation: The final results are obtained by summing the partial sums
across the entire array.
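A small Python model of these steps for the 4x4 case, where each (i, j) index plays the role of one processing element (the input matrices are illustrative):

import numpy as np

n = 4
A = np.arange(n * n).reshape(n, n)       # illustrative 4x4 inputs
B = np.arange(n * n).reshape(n, n) % 5

C = np.zeros((n, n), dtype=A.dtype)
for i in range(n):
    for j in range(n):
        # PE (i, j) forms its partial products and sums them locally.
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))

assert (C == A @ B).all()                # agrees with the direct product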
Limitations:
Q11. Discuss the scheduling and load balancing problem for a multi-processor system.
Give a suitable example with illustrative diagrams.
In a multi-processor system, efficient scheduling and load balancing are crucial for
maximizing performance and resource utilization. Scheduling involves assigning tasks to
processors, while load balancing aims to distribute the workload evenly among processors.
Scheduling Problem
The scheduling problem in a multi-processor system involves determining the optimal order
and timing for executing tasks. The goal is to minimize task completion time and maximize
system throughput.
Types of Scheduling:
Static Scheduling: Tasks are assigned to processors before execution begins. This
approach is suitable for tasks with predictable execution times.
Dynamic Scheduling: Tasks are assigned to processors at runtime, based on their
availability and the current workload. This approach is more flexible but can
introduce overhead due to dynamic scheduling decisions.
Load balancing ensures that the workload is evenly distributed among processors. This helps
prevent idle processors and resource bottlenecks.
Static Load Balancing: Tasks are assigned to processors based on static estimates of
their workload.
Dynamic Load Balancing: Tasks are dynamically migrated between processors to
balance the load at runtime.
Consider a multi-processor system with 4 processors (P1, P2, P3, P4) and a set of tasks with
different execution times.
Static Scheduling:
Dynamic Scheduling:
As T1 finishes, P1 can take on T5. If P2 finishes T3 before P1 finishes T1, P2 can take on T5.
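A greedy dynamic load balancer can be sketched in a few lines of Python; the task times below are illustrative, since the example does not list them:

import heapq

tasks = {"T1": 8, "T2": 4, "T3": 6, "T4": 2, "T5": 5}   # illustrative times

# Min-heap keyed on accumulated load: each task goes to the currently
# least-loaded processor, mimicking runtime (dynamic) balancing.
procs = [(0, p) for p in ["P1", "P2", "P3", "P4"]]
heapq.heapify(procs)

for name, cost in sorted(tasks.items(), key=lambda t: -t[1]):
    load, proc = heapq.heappop(procs)
    print(f"{name} ({cost}s) -> {proc}")
    heapq.heappush(procs, (load + cost, proc))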
Diagram:
Challenges:
Interconnection Networks
1. Direct Networks:
Fully Connected Network: Each PE is directly connected to every other PE. While this
offers high connectivity, it becomes impractical for large systems due to the high
number of connections.
Star Network: A central node (hub) connects to all other PEs. This network is simple
but prone to bottlenecks at the hub.
Bus Network: A single bus connects all PEs. It's simple and cheap but has limited
bandwidth and scalability.
Ring Network: PEs are connected in a circular fashion. It offers better scalability than
a bus but can suffer from congestion if many PEs need to communicate
simultaneously.
2. Indirect Networks:
Disadvantages:
Diagram:
A 2D Mesh interconnection network: 9 nodes arranged in a 3x3 grid, each connected to its
immediate neighbours (up, down, left, right).
The choice of interconnection network depends on the specific requirements of the parallel
application, such as the communication pattern, the number of PEs, and the desired
performance. By carefully selecting and designing the interconnection network, it is possible
to achieve high performance and efficiency in parallel computing systems.
Instruction Pipeline
While one instruction is in the Write Back stage, the next instruction can be in the Memory
Access stage, and so on. This overlapping of instruction execution increases the overall
throughput of the processor.
Example:
Pipeline Execution:
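A typical space-time chart for three instructions in a five-stage pipeline (IF = fetch, ID = decode, EX = execute, MEM = memory access, WB = write back):

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
| I1 | IF | ID | EX | MEM | WB | | |
| I2 | | IF | ID | EX | MEM | WB | |
| I3 | | | IF | ID | EX | MEM | WB |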
As you can see, multiple instructions are being processed simultaneously in different stages
of the pipeline. This significantly improves the overall performance of the processor.
Pipeline Hazards:
o Structural Hazards: When multiple instructions require the same hardware
resource at the same time.
o Data Hazards: When an instruction depends on the result of a previous
instruction that has not yet completed.
o Control Hazards: When the instruction pipeline needs to be stalled due to
branch instructions or exceptions.
Clock Cycle Time: The pipeline stage with the longest delay determines the overall
clock cycle time.
To address these challenges, techniques like pipelining optimization, branch prediction, and
instruction scheduling are used to improve pipeline efficiency.
Cache Coherence
Maintaining consistency among those copies raises the problem of cache coherence. The
following are the three causes:
1. Sharing of Writable Data: When multiple processors modify the same data, their cached
copies can become outdated.
2. Process Migration: If a process with cached data moves to a different processor, its
cached data might not be consistent with the main memory.
3. I/O Activity: I/O operations can update memory directly making cached data invalid.
From the point of view of cache coherence, data structures can be divided into three
classes:
1. Read-Only: No coherence issues arise as there are no updates to the data.
2. Shared Writable: The main source of coherence problems, since multiple processes may
read and write the same data.
3. Private Writable: Problems only occur during process migration, as the data is only
written by a single process.
There are several techniques to maintain cache coherence for the shared writable data
structure. These methods can be divided into two classes:
1. Memory Update Policy: This category focuses on how the memory is updated when
a cache block is modified.
Write Through: Each write operation is written to both the cache and main memory
simultaneously.
DIAGRAM:
Write Back: Writes are only made to the cache. The modified block is written back to
main memory only when it is evicted from the cache.
DIAGRAM:
2. Cache Coherence Policy: This category deals with how the cache maintains
consistency among multiple caches.
o Write Invalidate: When a block is modified, all other caches containing that
block are invalidated.
DIAGRAM:
o Write Update: When a block is modified, all other caches containing that
block are updated with the new value.
DIAGRAM:
3. Interconnection Scheme: This category determines how the different components of
the system, such as processors and memory modules, are connected and how data
flows between them, which significantly impacts the overall performance and
efficiency of the cache coherence mechanism.
o Single Bus: A single shared bus connects all components.
o Multistage Directory: A hierarchical directory structure is used to track cache
block locations.
o Multiple Bus Hierarchical Cache Coherence Protocols: Multiple buses are
used, and cache coherence is maintained using hierarchical protocols.
Full-Map Directories: Each directory entry contains a full map of the cache locations
for a particular block.
Limited Directories: Each directory entry contains a limited number of pointers to
caches containing the block.
Chained Directories: Directory entries for a block are chained together, forming a
linked list.
Main Categories:
Problem:
When multiple processors have cached copies of the same memory location, and one
processor modifies the data, the other copies become stale. This inconsistency can lead to
incorrect program execution.
Example:
1. Initial State:
o P1 reads the value of X from main memory and caches it.
o P2 also reads the value of X from main memory and caches it.
o Both P1 and P2 have identical copies of X in their caches.
2. Modification:
o P1 modifies the value of X in its cache.
3. Incoherence:
P2's cached copy of X is now stale. If P2 reads X from its cache, it will get the old, incorrect
value.
Solutions:
Cache coherence protocols are used to maintain consistency among caches. Common approaches
include:
Write-Invalidate: When a processor writes to a shared location, all other cached copies of
that line are invalidated before the write proceeds.
Diagram (Write-Invalidate):
Write-Update: When a processor writes to a shared location, it updates the corresponding
cache lines in other processors' caches.
Q15. Write an algorithm for associative search.
Let's say we have an associative memory storing student records. Each record contains the
student's ID, name, and grade. We want to find the record for a student with the ID
"12345".
In this diagram:
Key Points:
Parallelism: Associative search is inherently parallel, making it very fast for certain
types of searches.
Hardware Complexity: Implementing true associative memory can be complex and
expensive.
Applications: Used in areas like network routers, database systems, and content-
based image retrieval.
Q16. Prove that a k-stage pipeline can be at most k times faster than a non-pipelined
processor.
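A standard derivation, assuming n instructions flow through k stages with a common stage delay of one clock period τ:

$$T_{\text{non-pipelined}} = n k \tau, \qquad T_{\text{pipelined}} = (k + n - 1)\,\tau$$

$$S = \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} = \frac{nk}{k + n - 1} \longrightarrow k \quad \text{as } n \to \infty$$

Since nk / (k + n − 1) < k for any finite n (and any k > 1), the speedup can approach but never exceed k.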
Pipeline hazards are situations that disrupt the smooth flow of instructions through the
pipeline, causing stalls or delays. They occur when the next instruction cannot execute in its
designated clock cycle due to various dependencies.
1. Structural Hazards:
o Occur when multiple instructions require the same hardware resource
simultaneously (e.g., two instructions trying to access the same memory
unit).
o Solution: Replicate hardware resources (e.g., multiple memory units).
2. Data Hazards:
o Occur when an instruction depends on the result of a previous instruction
that is still being processed in the pipeline.
o Types of Data Hazards:
Read After Write (RAW): An instruction tries to read a data item
before a previous instruction has finished writing it.
Write After Read (WAR): An instruction tries to write to a register
before a previous instruction has read from it.
Write After Write (WAW): Two instructions try to write to the same
register simultaneously.
o Solutions:
Data Forwarding: Forward the result of the first instruction directly to
the second instruction, bypassing the register file.
Stalling: Insert "no-op" instructions to stall the pipeline until the data
dependency is resolved.
3. Control Hazards:
o Occur due to branch instructions (e.g., conditional jumps) that alter the
normal flow of instruction execution.
o The pipeline may fetch and decode incorrect instructions if the branch
outcome is not known immediately.
o Solutions:
Branch Prediction: Predict the outcome of the branch and continue
fetching instructions along the predicted path.
Branch Delay Slots: Insert instructions after the branch that will
always be executed, regardless of the branch outcome.
Q18. Explain how the occurrence of a branch affects pipeline execution.
Branch instructions, such as conditional jumps (e.g., if, else), significantly impact pipeline
execution. They disrupt the smooth flow of instructions by altering the program counter,
which determines the next instruction to be fetched.
1. Pipeline Stalls:
When a branch instruction is encountered, the pipeline must typically wait until the
branch condition is evaluated to determine the correct path to follow.
This leads to a stall, as subsequent instructions cannot be fetched until the branch
outcome is known.
The pipeline must stop fetching new instructions until the branch decision is
resolved.
This causes a delay, reducing the pipeline's efficiency.
2. Misprediction Penalties:
Assume: The branch predictor incorrectly predicts that the branch will not be taken.
Consequences:
o Instructions 4 and 5 are fetched and processed, even though the branch
condition is actually true.
o Once the branch condition is evaluated and the branch is found to be taken,
the pipeline must be flushed.
o The correct instructions following the branch target address need to be
fetched and processed, resulting in a significant delay.
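The cost can be summarized with a standard formula, assuming a base CPI of 1, a fraction b of instructions that are branches, a misprediction rate p, and a flush penalty of c cycles per misprediction:

$$\text{Effective CPI} = 1 + b \cdot p \cdot c$$

For example, with b = 0.2, p = 0.1, and c = 3, the effective CPI rises from 1 to 1.06.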
Flynn's Classification
This scheme was introduced by Michael J. Flynn and is based on the multiplicity of the instruction
stream (IS), i.e., the sequence of instructions executed by the machine, and the data stream (DS),
i.e., the sequence of data (input, temporary, or partial results) referenced by the instructions.
Following are Flynn's four machine organizations.
Diagrams:
1. SISD (Single Instruction, Single Data):
It may have more than one functional unit but under the supervision of one control
unit.
Instructions are executed sequentially, but may be overlapped in their execution
stage.
Most SISD uniprocessor systems are pipelined.
EXAMPLE: A traditional desktop computer or laptop with a single CPU.
DIAGRAM:
2. SIMD (Single Instruction, Multiple Data):
Multiple processing elements are used and they are supervised by the same control
unit.
The same instruction stream (from the control unit) is received by all PEs, which
operate on different data sets from distinct data streams.
Shared-memory subsystems may contain multiple modules.
SIMD machines can be further divided into two modes: word-slice versus bit-slice.
This class corresponds to array processors.
DIAGRAM:
3. MISD (Multiple Instruction, Single Data):
Multiple processor units receive distinct instructions from (distinct) control units but
they operate on the same data stream and its derivatives.
This structure has received much less attention and no real example of this class
exists.
EXAMPLE: A hypothetical system where multiple processors analyse the same data set
independently.
DIAGRAM:
4. MIMD (Multiple Instruction, Multiple Data):
Multiple processors execute different instruction streams on different data streams;
most multiprocessor systems belong to this class.
EXAMPLE: Modern multicore processors and clusters.
DIAGRAM:
Why is Flynn's Classification Important?
Handler's Classification is a system for describing computer architectures, focusing on the degree of
parallelism and pipelining within the system. It divides the computer into three levels:
1. Processor Control Unit (PCU): This level corresponds to the processor or CPU. It controls the
execution of instructions.
2. Arithmetic Logic Unit (ALU): This level corresponds to the functional units or processing elements.
It performs arithmetic and logical operations.
3. Bit-Level Circuit (BLC): This level corresponds to the logic circuits needed to perform one-bit
operations within the ALU.
Notation:
T(C) = <k × k', d × d', w × w'>, where k is the number of PCUs, d the number of ALUs per PCU, and
w the word length, with k', d', and w' giving the degree of pipelining at each of the three levels.
Example:
Key Points:
Parallelism: The k and d values indicate the degree of parallelism at the PCU and ALU levels,
respectively.
Pipelining: The k', d', and w' values indicate the degree of pipelining at each level.
Flexibility: Handler's classification can describe a wide range of computer architectures, from
simple to highly complex.
DIAGRAM: