Computer Architecture and OS
1. Describe the methods of reading data from a hard disk and their problems. Describe the
head positioning problem and the solution strategies. Describe and compare the different
RAID levels.
A magnetic disk consists of one or more aluminium platters with a magnetizable coating.
Originally these platters were as much as 50 cm in diameter, but at present they are typically
3 to 12 cm. A disk head containing an induction coil floats just over the surface, resting on a
cushion of air. When a positive or negative current passes through the head, it magnetizes the
surface just beneath the head, aligning the magnetic particles facing left or facing right,
depending on the polarity of the drive current. When the head passes over a magnetized area,
a positive or negative current is induced in the head, making it possible to read back the
previously stored bits. Thus as the platter rotates under the head, a stream of bits can be
written and later read back.
The circular sequence of bits written as the disk makes a complete rotation is called a track.
Each of the tracks is divided up into some number of fixed-length sectors typically containing
512 data bytes, preceded by a preamble that allows the head to be synchronized before
reading or writing. Following the data is an Error-Correcting Code (ECC), such as a Hamming or Reed-Solomon code, used to detect and correct read errors. Between consecutive sectors is
a small intersector gap. The formatted capacity of the disk is typically 15% less than the
unformatted capacity.
All disks have movable arms that are capable of moving in and out to different radial
distances from the spindle about which the platter rotates. At each radial distance, a different
track can be written. The tracks are thus a series of concentric circles about the spindle. The
width of a track depends on how large the head is and how accurately the head can be
positioned radially. With current technology, disks have between 5000 and 10,000 tracks per
cm, giving track widths in the 1- to 2-micron range.
A little thought and the use of that old high-school math formula for the circumference of a
circle, c = 2πr, will reveal that the outer tracks have more linear distance around them than
the inner ones do. Since all magnetic disks rotate at a constant angular velocity, no matter
where the heads are, this observation creates a problem.
In older drives, manufacturers used the maximum possible linear density on the innermost track, and successively lower linear bit densities on tracks further out. Nowadays, a
innermost track, and successively lower linear bit densities on tracks further out. Nowadays, a
different strategy is used. Cylinders are divided into zones and the number of sectors per
track is increased in each zone moving outward from the innermost track. This change makes
keeping track of information harder but increases the drive capacity, which is viewed as more
important.
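As a rough illustration of why zoned recording increases capacity, here is a small sketch in Python; all of the numbers in it (track count, zone count, sectors per track, growth per zone) are invented for the example, not taken from a real drive.

# Illustrative comparison of constant sectors-per-track vs. zoned recording.
SECTOR_BYTES = 512
TRACKS = 10_000

# Old scheme: every track holds only what the innermost track can hold.
sectors_inner = 400
capacity_constant = TRACKS * sectors_inner * SECTOR_BYTES

# Zoned scheme: 10 zones, outer zones get progressively more sectors per track.
ZONES = 10
tracks_per_zone = TRACKS // ZONES
capacity_zoned = sum(
    tracks_per_zone * (sectors_inner + zone * 40) * SECTOR_BYTES
    for zone in range(ZONES)
)

print(f"constant sectors per track: {capacity_constant / 1e9:.2f} GB")
print(f"zoned recording:            {capacity_zoned / 1e9:.2f} GB")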
Associated with each drive is a disk controller, a chip that controls the drive. The controller's tasks include accepting commands from the software, such as READ, WRITE, and FORMAT, controlling the arm motion, detecting and correcting errors, and converting the bytes read from memory into a serial bit stream and vice versa. Some controllers also handle buffering of multiple sectors, caching sectors read for potential future use, and remapping bad sectors.
In their 1988 paper, Patterson et al. suggested six specific disk organizations that could be
used to improve disk performance, reliability, or both (Patterson et al., 1988). These ideas
were quickly adopted by industry and have led to a new class of I/O device called a RAID
(Redundant Array of Independent Disks).
RAID 0: Striping - In a RAID 0 system data are split up into blocks that get written across all the drives in the array. By using multiple disks (at least 2) at the same time, this offers superior I/O performance.
RAID 1: Mirroring - Data are stored twice by writing them to both the data drive (or set of data drives) and a mirror drive (or set of drives). If a drive fails, the controller uses either the data drive or the mirror drive for data recovery and continues operation.
RAID 2: Bit striping with ECC - This uses bit-level striping, i.e. instead of striping blocks across the disks, it stripes the bits across the disks. It uses a Hamming error-correcting code (ECC) and stores this information on the redundancy disks.
RAID 3: Byte striping with a dedicated parity disk - This uses byte-level striping, i.e. instead of striping blocks across the disks, it stripes the bytes across the disks. It uses multiple data disks and a dedicated disk to store parity.
RAID 5: Block striping with distributed parity - Data blocks are striped across the drives, and on one drive a parity checksum of the block data is written. The parity data are not written to a fixed drive; they are spread across all drives.
RAID 6: Block striping with double parity - This creates two parity blocks for each stripe of data and can handle two simultaneous disk failures. This RAID configuration is complex to implement in a RAID controller, as it has to calculate two parity blocks for each data stripe.
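To make the parity idea behind RAID 5 (and, doubled, RAID 6) concrete, here is a minimal Python sketch of single-parity reconstruction: the parity block is the byte-wise XOR of the data blocks, so any one lost block can be rebuilt from the surviving blocks. The block contents and the number of drives are arbitrary.

# Parity = XOR of the data blocks; XOR-ing the survivors rebuilds a lost block.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks on three drives
parity = xor_blocks(d0, d1, d2)          # parity block on a fourth drive

# The drive holding d1 fails: rebuild it from the remaining blocks and parity.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1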
2b. Describe briefly the static and dynamic memory allocation techniques and strategies.
Describe the segmentation virtual memory allocation. Describe the paging virtual memory
allocation, strategies, methods, problems and solutions. Compare the segmentation and
paging virtual memory allocation.
What is memory allocation?
It is the process of providing memory to a program's variables, and it is a very important part of program execution. When a program is loaded into the system memory, the memory region allocated to it is divided into three broad regions: stack, heap, and code.
The stack region is used for storing the program's local variables when they are declared. Variables and arrays declared at the start of a function, including main, are allocated on the stack.
The heap region is used exclusively for dynamic memory allocation.
Dynamic memory allocation refers to managing system memory at runtime. Programs may
request memory and may also return previously dynamically allocated memory. Memory
may be returned whenever it is no longer needed. Memory can be returned in any order
without any relation to the order in which it was allocated. The heap may develop "holes"
where previously allocated memory has been returned between blocks of memory still in use.
This may work well at first, but as the number of small holes in memory increases, the memory becomes more and more fragmented. This phenomenon is known as external fragmentation.
The solution for this is compaction. The OS shifts the processes so that they are contiguous and all the free memory is gathered together as one block. The disadvantage is that compaction is time consuming and wastes processor time.
Best-fit: chooses the block that is closest in size to the request.
First-Fit scans the memory from the beginning and chooses the first available block that is
large enough for the request.
Next-Fit scans the memory from the location of the last placement and chooses the next available block that is large enough (the three strategies are sketched below).
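Below is a small, illustrative Python sketch of the three placement strategies; the hole list (start, size pairs) and the request size are made up for the example.

def first_fit(holes, size):
    # Scan from the beginning, take the first hole that is large enough.
    for start, length in holes:
        if length >= size:
            return start
    return None

def best_fit(holes, size):
    # Take the hole whose size is closest to the request.
    fitting = [(length, start) for start, length in holes if length >= size]
    return min(fitting)[1] if fitting else None

def next_fit(holes, size, last_index):
    # Scan starting from the position of the previous placement.
    n = len(holes)
    for offset in range(n):
        start, length = holes[(last_index + offset) % n]
        if length >= size:
            return start
    return None

holes = [(0, 5), (20, 12), (50, 8), (80, 30)]   # (start address, size) of free blocks
print(first_fit(holes, 8))                # 20 - first hole that is large enough
print(best_fit(holes, 8))                 # 50 - closest in size to the request
print(next_fit(holes, 8, last_index=2))   # 50 - scanning resumes at the last position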
Paging
Example: scheduling four processes P0-P3 with arrival times 0, 1, 2, 3 and execution times 5, 3, 8, 6.

FCFS - waiting time of each process is as follows:
P0: 0 - 0 = 0
P1: 5 - 1 = 4
P2: 8 - 2 = 6
P3: 16 - 3 = 13

SJF (Shortest Job First):
Process  Arrival  Execution  Service time
P0       0        5          0
P1       1        3          5
P2       2        8          14
P3       3        6          8
Waiting time of each process is as follows:
P0: 0 - 0 = 0
P1: 5 - 1 = 4
P2: 14 - 2 = 12
P3: 8 - 3 = 5

Priority scheduling:
Process  Arrival  Execution  Priority  Service time
P0       0        5          1         0
P1       1        3          2         11
P2       2        8          1         14
P3       3        6          3         5
Waiting time of each process is as follows:
P0: 0 - 0 = 0
P1: 11 - 1 = 10
P2: 14 - 2 = 12
P3: 5 - 3 = 2

Shortest remaining time (SRT) is the preemptive version of the SJN algorithm.
The processor is allocated to the job closest to completion, but it can be preempted by a newer ready job with a shorter time to completion.
It is impossible to implement in interactive systems where the required CPU time is not known.
It is often used in batch environments where short jobs need to be given preference.

Round Robin - waiting time of each process is as follows:
P0: (0 - 0) + (12 - 3) = 9
P1: (3 - 1) = 2
P3: (9 - 3) + (17 - 12) = 11

Multiple-level queues are not an independent scheduling algorithm. They make use of other existing algorithms to group and schedule jobs with common characteristics.
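As a cross-check of the FCFS numbers above, here is a short Python sketch that recomputes the service-start and waiting times from the arrival and execution times.

# FCFS: jobs run in arrival order; waiting time = service start - arrival time.
arrival = [0, 1, 2, 3]
burst   = [5, 3, 8, 6]

start, waiting, clock = [], [], 0
for a, b in zip(arrival, burst):
    clock = max(clock, a)        # the CPU may be idle until the job arrives
    start.append(clock)
    waiting.append(clock - a)
    clock += b

print(start)     # [0, 5, 8, 16]
print(waiting)   # [0, 4, 6, 13]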
Deadlocks can be avoided by preventing at least one of the following four conditions, because all four conditions must hold simultaneously for a deadlock to occur.
Mutual Exclusion – a resource cannot be used by more than one process at a time. It is very
difficult to eliminate mutual exclusion as some resources cannot be shared.
Hold and Wait - Processes must be prevented from holding one or more resources while simultaneously waiting for one or more others. One way to do this is to allocate all required resources to the process before it starts executing. This solution may lead to starvation.
No Preemption (a process requests new resources without releasing the resources it already holds) - this condition can be broken by allowing preemption: when a high-priority process requests resources held by a low-priority process, the low-priority process can be forced to release them.
Circular Wait - Circular wait can be avoided if we number all resources and require that processes request resources only in strictly increasing (or decreasing) order. Each resource is assigned a number, and a process may only request resources in increasing (or decreasing) order of that numbering, as sketched below.
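Below is a minimal Python sketch of the resource-ordering idea; the resource names and their numbering are invented for the example.

import threading

ORDER = {"disk": 1, "printer": 2, "network": 3}        # assigned resource numbers
locks = {name: threading.Lock() for name in ORDER}

def acquire_in_order(*names):
    # Sort by the assigned number so every process locks resources in the same
    # global order, which makes a circular wait impossible.
    for name in sorted(names, key=ORDER.get):
        locks[name].acquire()

def release_all(*names):
    for name in names:
        locks[name].release()

acquire_in_order("printer", "disk")   # actually locks disk first, then printer
release_all("printer", "disk")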
Handling deadlocks
1. We can take a resource from one process and give it to another. This will resolve the deadlock situation, but sometimes it causes problems.
3. The simplest solution is to kill the process and release the resources.
Ostrich algorithm - as the name suggests, stick your head in the sand and pretend there is no problem at all.
5b . Describe the different Input/Output methods of the operating systems. Describe their
evolution and different properties. Describe the different buffering techniques. Describe in
detail the DMA data transfer strategy.
The method that is used to transfer information between internal storage and external I/O devices is known as the I/O interface. Peripherals connected to a computer system communicate with the CPU using special communication links. There are special hardware components between the CPU and the peripherals, called interface units, which supervise and synchronize all input and output transfers.
Data transfer to and from the peripherals may be done in any of the three possible ways:
1. Programmed I/O.
2. Interrupt-initiated I/O.
3. Direct memory access (DMA).
Programmed I/O refers to data transfer initiated by the CPU, with the driver software accessing registers or memory on the device.
In interrupt-initiated I/O, a special command informs the interface to issue an interrupt request signal whenever data are available from the device. In the meantime the CPU can proceed with the execution of another program.
Both the methods programmed I/O and Interrupt-driven I/O require the active intervention of
the processor to transfer data between memory and the I/O module.
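The following toy Python sketch illustrates the busy-waiting character of programmed I/O; the device here is a simulated object rather than real hardware, and its behaviour is invented for the example.

import random

class FakeDevice:
    def __init__(self, data):
        self.data = list(data)
    def status_ready(self):
        # The device is only occasionally ready, as a real slow device would be.
        return bool(self.data) and random.random() < 0.3
    def read_word(self):
        return self.data.pop(0)

def programmed_io_read(dev, nwords):
    words = []
    while len(words) < nwords:
        while not dev.status_ready():
            pass                      # the CPU does nothing useful while polling
        words.append(dev.read_word())
    return words

print(programmed_io_read(FakeDevice([1, 2, 3, 4]), 4))   # [1, 2, 3, 4]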
Evolution of I/O
Direct Memory access (DMA) - when a processor wishes to read or write data it issues a
command to the DMA module by sending the following information:
DMA returns the bus after complete data transfer. A register is used as a byte count,
being decremented for each byte transfer, and upon the byte count reaching zero, the DMAC
will release the bus. When the DMAC operates in burst mode, the CPU is halted for the
duration of the data transfer.
Steps involved are:
1. Bus grant request time.
2. Transfer the entire block of data at the transfer rate of the device, because the device is usually slower than the speed at which data can be transferred to the CPU.
3. Release the control of the bus back to CPU
Cycle Stealing:
An alternative method in which the DMA controller transfers one word at a time, after which it must return control of the buses to the CPU. The CPU delays its operation for only one memory cycle to allow the direct memory I/O transfer to "steal" one memory cycle.
Before moving on to transfer the next byte of data, the device performs step 1 again so that the bus isn't tied up and the transfer doesn't depend on the transfer rate of the device.
Buffering - Input/output (I/O) buffering is a mechanism that improves the throughput of input and output operations. There are three ways to do this:
1. Single buffer
When a user process issues an I/O request, the O.S assigns a buffer in the system
portion of main memory to the operation.
In block-oriented devices, the technique can be used as follows: input transfers are made to the system buffer. When the transfer is complete, the process moves the block into user space and requests another block. This is called reading ahead; it is done in the expectation that the block will be needed sometime in the future.
This approach will generally provide a speed up compared to the lack of system
buffering. The O.S must keep track of the assignment of system buffers to user
processes.
Similar considerations apply to block oriented output. When data are being
transmitted to a device, they are first copied from user space into the system buffer,
from which they will ultimately be written. The requesting process is now free to
continue.
Suppose T is the time required to input one block and C is the computation time required per input request (the three cases are compared numerically in the sketch after the buffering list).
o Without buffering: execution time is T + C.
o With a single buffer: execution time is max[C, T] + M, where M is the time required to move the data from the system buffer to user memory.
2. Double buffering
An improvement over single buffering is to assign two system buffers to the operation.
A process transfers data to one buffer while the operating system empties the other.
For block-oriented transfer the execution time is max[C, T].
o If C <= T, i.e. the computation time is less than the time required to input one block, it is possible to keep the block-oriented device going at full speed.
o If C > T, i.e. the computation time is greater than the time required to input one block, double buffering ensures that the process will not have to wait on I/O.
3. Circular buffer
Double buffering may be inadequate if the process performs rapid bursts of I/O. In this case more than two buffers are used.
The collection of buffers is called a circular buffer, with each buffer being one unit in the circular buffer.
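The three execution-time expressions above (no buffering, single buffer, double buffer) can be compared with a trivial calculation; the values chosen below for T, C and M are arbitrary.

# T = time to input one block, C = computation time per block, M = copy time.
T, C, M = 10.0, 6.0, 1.0    # milliseconds, made-up numbers

no_buffer     = T + C              # 16.0
single_buffer = max(C, T) + M      # 11.0
double_buffer = max(C, T)          # 10.0

print(no_buffer, single_buffer, double_buffer)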
6b .Describe the different methods for communications between processes. Describe the
problems. Describe mutual exclusion and different solution techniques.
Independent process.
Co-operating process.
An independent process is not affected by the execution of other processes, while a co-operating process can be affected by other executing processes. Inter-process communication (IPC) is a mechanism which allows processes to communicate with each other and synchronize their actions.
Processes can communicate with each other using these two ways:
1. Shared Memory
2. Message passing
Communication between processes using shared memory requires the processes to share some variable, and it completely depends on how the programmer implements it.
Ex: process1 and process2 are executing simultaneously and share some resources or use information from each other. Process1 generates information about certain computations or resources being used and keeps it as a record in shared memory. When process2 needs to use the shared information, it checks the record stored in shared memory, takes note of the information generated by process1 and acts accordingly.
Processes can use shared memory both for extracting information recorded by another process and for delivering specific information to another process.
Producer-Consumer problem
There are two processes: Producer and Consumer. The Producer produces some item and the Consumer consumes that item. The two processes share a common space or memory location known as a buffer, where the item produced by the Producer is stored and from where the Consumer consumes it when needed. There are two versions of this problem: the first is known as the unbounded buffer problem, in which the Producer can keep on producing items and there is no limit on the size of the buffer; the second is known as the bounded buffer problem, in which the Producer can produce up to a certain number of items and after that it starts waiting for the Consumer to consume them.
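A minimal bounded-buffer producer/consumer sketch in Python, using a thread-safe queue as the shared buffer; the buffer size and number of items are arbitrary.

import queue
import threading

buffer = queue.Queue(maxsize=4)    # the shared bounded buffer
N_ITEMS = 10

def producer():
    for i in range(N_ITEMS):
        buffer.put(i)              # blocks while the buffer is full
        print("produced", i)

def consumer():
    for _ in range(N_ITEMS):
        item = buffer.get()        # blocks while the buffer is empty
        print("consumed", item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()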
If two processes p1 and p2 want to communicate with each other, they proceed as follows:
Establish a link
Start passing messages
A standard message can have two parts: header and body. The header part is used for storing
Message type, destination id, source id, message length and control information. And the
body contains the actual message.
Direct communication links are implemented when the processes use a specific process identifier for the communication, but it is hard to identify the sender ahead of time.
The process which wants to communicate must explicitly name the recipient or sender of the communication.
e.g. send(p1, message) means send the message to p1. similarly, receive(p2,
message) means receive the message from p2.
Indirect communication is done via a shared mailbox (port), which consists of a queue of messages. The sender places messages in the mailbox and the receiver picks them up. Each mailbox has a unique id and processes can communicate only if they share a mailbox.
Suppose two processes want to communicate through indirect message passing; the required operations are: create a mailbox, use this mailbox for sending and receiving messages, destroy the mailbox.
Suppose there are more than two processes sharing the same mailbox and suppose the process
p1 sends a message to the mailbox, which process will be the receiver? The port is owned by
the receiving process and created by OS on the request of the receiver process and can be
destroyed either on request of the same receiver process or when the receiver terminates
itself. Enforcing that only one process is allowed to execute the receive can be done using the
concept of mutual exclusion.
Problem in IPC:
- Starvation
- Deadlock
- Data inconsistency
- Shared buffer problem
On a single processor system the simplest way to achieve mutual exclusion is by disabling
the interrupts when a process is in the critical section. Although this solution is effective, it
leads to many problems because the timer interrupt is no longer serviced, so tracking time is
impossible during the critical section. Also, if a process halts during its critical section,
control will never be returned to another process.
There are other algorithms, such as Dekker's algorithm, Peterson's algorithm, and Lamport's bakery algorithm, that can be used to enforce mutual exclusion.
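As an illustration, here is a sketch of Peterson's algorithm for two threads in Python. On real hardware correctness also depends on memory-ordering guarantees, so treat this as an illustration of the idea rather than as production synchronization code.

import threading

flag = [False, False]   # flag[i]: process i wants to enter its critical section
turn = 0
counter = 0

def worker(me):
    global turn, counter
    other = 1 - me
    for _ in range(10_000):
        flag[me] = True
        turn = other                          # politely give priority to the other
        while flag[other] and turn == other:
            pass                              # busy-wait outside the critical section
        counter += 1                          # critical section
        flag[me] = False                      # leave the critical section

t0 = threading.Thread(target=worker, args=(0,))
t1 = threading.Thread(target=worker, args=(1,))
t0.start(); t1.start(); t0.join(); t1.join()
print(counter)                                # expected 20000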
7b. Describe what multi-programming is. Explain processes, their states and models.
Describe threads. Compare processes and threads.
Multiprogramming is a form of parallel processing in which several programs are run at the
same time on a uniprocessor. Since there is only one processor, there can be no true
simultaneous execution of different programs. Instead, the operating system executes part of
one program, then part of another, and so on. To the user it appears that all programs are
executing at the same time.
If the machine has the capability of causing an interrupt after a specified time interval, then
the operating system will execute each program for a given length of time, regain control, and
then execute another program for a given length of time, and so on. In the absence of this
mechanism, the operating system has no choice but to begin to execute a program with the
expectation, but not the certainty, that the program will eventually return control to the
operating system.
If the machine has the capability of protecting memory, then a bug in one program is less
likely to interfere with the execution of other programs. In a system without memory
protection, one program can change the contents of storage assigned to other programs or
even the storage assigned to the operating system. The resulting system crashes are not only
disruptive, they may be very difficult to debug since it may not be obvious which of several
programs is at fault.
In computing, a process is the instance of a computer program that is being executed by one
or many threads. It contains the program code and set of data. Depending on the operating
system (OS), a process may be made up of multiple threads of execution that execute
instructions concurrently.
All the information related to a process is stored in a process control block that is created and
managed by OS.
In addition to this, there is a dispatcher program that switches the processor from one process to another.
Thread - A thread is a single sequence stream within a process. Threads have some of the same properties as processes, so they are called lightweight processes. Threads are executed one after another but give the illusion that they are executing in parallel. There are two types of threads:
User-level - implemented in a user-level library; they are not created using system calls.
Kernel-level - the OS kernel provides system calls to create and manage threads. Instead of a thread table in each process, the kernel itself has a single master thread table that keeps track of all the threads in the system.
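A tiny Python illustration of the key difference from processes: threads share their process's address space, so all of them see and update the same data structure.

import threading

shared = []                      # lives in the single address space of the process

def worker(name):
    shared.append(name)          # every thread sees the same list

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)                    # all four names end up in one shared list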
8b) What is the cost-speed compromise of computer design? What improvements and
changes make computers faster? Why do these result in rising costs? (CPU, memory, bus
systems, storage..)
The price/performance ratio is usually decided at the microarchitecture level. The goal is always to achieve high performance for a reasonable price.
Performance
There are three main methods to increase speed:
● reduce the clock cycles needed to execute an operation
● simplify the architecture → enables shorter clock length → higher clock rate
● overlap the execution of instructions with each other
Overlapping
A CPU core executes instructions one at a time. After the current instruction has finished executing, the next one is loaded from memory. This can be a huge waste of time, because memory access is slow, and while it loads the instruction, the CPU is idle.
Loading the next instruction can be significantly faster if we overlap the loading phase with
the current instruction’s execution. By the time the CPU is done executing the instruction, we
can instantly feed it the next one, so it won’t spend any time doing nothing.
This can be implemented by separating the two circuits from each other:
- one is responsible for fetching operations
- one is responsible for executing them (MBR, PC registers, 8-bit memory port)
Overlapping these two phases can achieve a significant performance boost, and it is one of today's greatest problems in architecture design.
Size
From a price-perspective, another question is the amount of used components. Since
nowadays all components are integrated into a single die (integrated circuit), we no
longer count their amount. Instead we use the size of the die as a metric.
The larger the die, the more expensive it is to manufacture it. For a specific component in
an electric circuit, there are many possible implementations. Some might be very small,
simple, and slow, others might require more space but are also faster. The designer will
have to evaluate their necessity and choose between them, always keeping in mind the price-
performance ratio.
Clock speed
For components to perform at their best, we need to have a decent clock speed. The clock
speed indicates how many clock cycles can happen in one second. In modern CPUs it is around 2-4 GHz, meaning that the CPU performs 2 to 4 billion cycles per second.
Microarchitecture Level
There are also certain optimizations that we can do at this level.
For example, the IJVM that we used in a previous topic has a moderately complicated and only moderately fast architecture, which is not a surprise: it was only intended for educational purposes.
Most modern PCs rely on various sizes of active air cooling systems. Low-power devices like
Smartphones and Tablets use passive cooling.
2) Main Memory
(as detailed in the “types of memory” and the “swapping” question)
Memory can limit the system’s performance in many ways:
Amount
Most modern operating systems run multiple programs at the same time. As detailed in the
topic about paging, the system can run out of Main Memory, and in this case it has to rely on Secondary Memory for storing information. Secondary Memory (HDDs, and even SSDs) is significantly slower than Main Memory. When the memory is full, less frequently used
parts of programs will get swapped out to this slow storage, and they get swapped back in,
when they are needed.
This causes huge delays and can result in a decreased CPU utilization, and poor user
experience. Always having a nearly-full memory can result in lots of page-in page-out
operations, putting an unnecessary load on the Secondary Memory.
While the Main Memory does not degrade and can serve for many, many years, Secondary Memory does have a limited life expectancy, and constant read/write operations can cause it to fail earlier.
- HDDs have many moving parts, and everything that moves, can and will fail at one
point.
SSDs have a limited number of write cycles. They will not last long if they are constantly under heavy write operations.
Putting more stress on these devices will cause them to degrade faster. This can all be
avoided by increasing the amount of Main Memory in the system.
→ speeds up data access
→ reduces stress on Secondary Memory
→ increases the maximum number of processes
Frequency / speed
The speed of the memory is always a bottleneck: the incredibly fast CPU frequently needs to read instructions or operands from Main Memory. Having a slow connection to the memory increases the number of cycles that the CPU needs to wait to receive the requested
data.
DDR-SDRAM memories are much faster than the older SDRAM generations. Most modern
systems rely on DDR3 and DDR4 memories. DDR5 is under development and will be
released shortly.
Bandwidth
Having a wider bus allows the CPU to transfer more bytes at a time. A 64-bit memory interface is faster than its 32-bit counterpart.
A 64-bit address space also allows for having more memory in the system.
- A 32-bit address space can support up to 4 GB of RAM (2^32 bytes). Even for smaller systems, 4 GB of memory is often not enough.
- A 64-bit address space can support up to 18 EB (2^64 bytes) of RAM. This is by far more than any system would need.
Nowadays, 64 bit systems are used in servers, PCs, and even smartphones. 32 or 16 bit
architectures are used in smaller, embedded systems.
Timing
Timing indicates the amount of clock cycles the memory needs to wait before it can start
responding to the CPU’s requests. This is usually a value between 10 and 20. Having a lower
timing will result in fewer wasted clock cycles and can give a significantly faster overall performance.
3) Bus
As detailed in the “busses” question.
+ separate, parallel busses that only communicate with the system bus through a bridge
Having a faster BUS can grant a faster access to our storage and network devices.
4) Storage
(As detailed in the “1B question”)
Having a larger storage makes it possible to store more programs, although nowadays this is
not really an issue.
In my opinion, having faster storage is the more important aspect: it accelerates reading programs from storage and also makes swapping much faster.
Modern systems have two main options for storage, each having their pros and cons:
HDD
HDDs are slower, but can store incredible amounts of data. They are also relatively cheap.
Capacity of consumer hard drives is between 500 GB and 12 TB.
Transfer speeds are about 50 - 250 MB / second.
Because of their moving parts, HDDs have a high latency and will respond to requests
slower.
SSD
SSDs are a more expensive storage. Their capacity used to be inferior to HDDs, but
nowadays more and more SSDs appear with a higher capacity, often reaching more than 10
TBs.
Most consumer SSDs are between 256GB and 2TB in capacity.
Transfer speeds are about 400 - 3000 MB / second.
Using zero moving parts, SSDs have a lower latency and can respond to requests relatively
fast.
Bus systems are a limiting factor for SSDs:
- SATA 3.0 SSDs are limited to about 500 MB/s
- PCI-Express 3.0 SSDs are limited to about 3000 MB/s
A next generation of computers using PCI-E 4.0 will enable transfer speeds up to 6000 MB/s. This technology is already available, but still extremely expensive.
A smart combination of these two storage technologies can result in a system that is both cost-efficient and fast:
- we can store the OS, and some programs on the SSD
- we can store less frequently accessed applications and data files on the HDD
RAID can allow to use multiple storage devices in parallel, increasing their speed.
9 b. Instruction and process level parallelism. Branch prediction and why is it necessary.
Computer architects are constantly striving to improve the performance of the machines they design. Most computer architects look to parallelism (doing two or more things at once) as a way to get even more performance for a given clock speed. Parallelism comes in two general forms, namely instruction-level parallelism and processor-level parallelism. In the former, parallelism is exploited within individual instructions to get more instructions/sec out of the machine. In the latter, multiple CPUs work together on the same problem.
Pipeline divides the instruction execution into many parts each handled by a dedicated piece
of hardware all of which run in parallel.
Pipelining allows a trade-off between latency and processor bandwidth. If one pipeline is good, then surely two pipelines are better. In one possible design for a dual-pipeline CPU, a single instruction fetch unit fetches pairs of instructions together and puts each one into its own pipeline, complete with its own ALU for parallel operation. To be able to run in parallel,
the two instructions must not conflict over resource usage (e.g., registers), and neither must
depend on the result of the other. The main pipeline, called the u pipeline, could execute an
arbitrary Pentium instruction. The second pipeline, called the v pipeline, could execute only
simple integer instructions.
Instructions were always executed in order. Thus Pentium-specific compilers that produced
compatible pairs could produce faster-running programs than older compilers.
Processor level parallelism
The first parallel system with multiple full-blown CPUs is the multiprocessor, a system with
more than one CPU sharing a common memory, like a group of people in a room sharing a
common blackboard. Since each CPU can read or write any part of memory they must co-
ordinate (in software) to avoid getting in each other’s way. When two or more CPUs have the
ability to interact closely, as is the case with multiprocessors, they are said to be tightly
coupled. The simplest one is to have a single bus with multiple CPUs and one memory all
plugged into it. It does not take much imagination to realize that with a large number of fast
processors constantly trying to access memory over the same bus, conflicts will result.
Multicomputer
Although multiprocessors with a modest number of processors (≤ 256) are relatively easy to
build, large ones are surprisingly difficult to construct. The difficulty is in connecting all the
processors to the memory. To get around these problems, many designers have simply
abandoned the idea of having a shared memory and just build systems consisting of large
numbers of interconnected computers, each having its own private memory, but no common
memory. These systems are called multicomputers. Since the computers are connected by an interconnection network rather than by shared memory, messages from one computer to another often must pass through one or more intermediate computers or switches to get from the source to the destination.
Branch prediction
Modern computers are highly pipelined. Pipelining works best on linear code, so the fetch
unit can just read in consecutive words from memory and send them off to the decode unit in
advance of their being needed. The problem is that programs are not linear code; they are full of branch instructions. Branch prediction is a technique used to speed up the execution of instructions on processors that use pipelining: the CPU guesses which way a conditional branch will go and continues fetching and executing along the predicted path before the branch outcome is actually known. Branch prediction is implemented in CPU logic with a branch predictor.
Examples
Static - Static prediction is the simplest branch prediction technique because it does not rely
on information about the dynamic history of code executing. Instead, it predicts the outcome
of a branch based solely on the branch instruction.
Dynamic - Dynamic branch prediction uses information gathered at run-time to predict the outcome of a branch (a small predictor sketch follows the list of static techniques below).
Static Branch Prediction Techniques
1) Branch Always Not Taken (Predicted-Not-Taken)
2) Branch Always Taken (Predicted-Taken)
3) Backward Taken Forward Not Taken (BTFNT)
4) Profile-Driven Prediction
5) Delayed Branch
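To illustrate dynamic prediction, here is a sketch of a 2-bit saturating-counter predictor in Python; the branch trace is invented. Counter states 0-1 predict "not taken" and states 2-3 predict "taken", so a single mispredicted iteration does not immediately flip the prediction.

counters = {}                                  # branch address -> 2-bit counter

def predict(addr):
    return counters.get(addr, 1) >= 2          # True means "predict taken"

def update(addr, taken):
    c = counters.get(addr, 1)
    counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

trace = [(0x40, True), (0x40, True), (0x40, True), (0x40, False), (0x40, True)]
hits = 0
for addr, taken in trace:
    hits += predict(addr) == taken
    update(addr, taken)
print(f"{hits}/{len(trace)} predictions correct")   # 3/5 on this short trace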
10. Describe the compilation and interpretation methods, their steps, the differences. What
are language-levels? How does interpretation work inside CPU-s? What are the abstraction
layers of computer design?
We generally write a computer program using a high-level language. A high-level language is
one which is understandable by us humans. It contains words and phrases from the English
(or other) language. But a computer does not understand a high-level language. It only understands programs written in binary 0s and 1s, called machine code. A program written in a high-level language is called source code. We need to convert the source code
into machine code and this is accomplished by compilers and interpreters. Hence, a compiler
or an interpreter is a program that converts program written in high-level language into
machine code understood by the computer.
The solution is to come up with human readable languages: L1, L2, L3, …
- these are abstractions
- hide underlying complexity
First method
Each L1 command gets substituted by a series of L0 commands.
Transforming this L1 program to a L0 program is called compiling.
Compiling is made by a program called a compiler.
The code generated by the compiler is called target code.
The language we’re compiling to is called the target language.
Second method
L1 program is executed command-by-command. A program called the interpreter translates
the L1 commands to L0 commands in real time.
No target language code is generated.
❏ virtual machine
The higher-level a language is → the more special-purpose it is, and the easier it is to understand and use.
An Nth level programmer does not have to know all n-m levels.
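To make the compile/interpret distinction above concrete, here is a toy Python sketch for a tiny invented stack language: the interpreter executes the L1 commands directly one by one, while the "compiler" first translates the whole program into target code (Python source standing in for L0) and only then runs it.

program = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]   # a tiny L1 program

def interpret(prog):
    # Interpretation: each L1 command is carried out immediately.
    stack = []
    for op, arg in prog:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
    return stack[-1]

def compile_to_target(prog):
    # Compilation: the whole L1 program is translated to target code first.
    lines = ["stack = []"]
    for op, arg in prog:
        if op == "PUSH":
            lines.append(f"stack.append({arg})")
        elif op == "ADD":
            lines.append("stack.append(stack.pop() + stack.pop())")
    return "\n".join(lines)

print(interpret(program))            # 5, computed directly
target = compile_to_target(program)  # target code is generated first...
namespace = {}
exec(target, namespace)              # ...and only then executed
print(namespace["stack"][-1])        # 5 again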
11b) How can we raise the performance of computers (processor level)? What are the limits
of our possibilities? What are todays design methods for that? (cycle length decreasing,
cycle number maximization, ..)
Adding more CPU cores allows several instruction streams to execute in parallel. However, doubling the number of cores will not simply double a computer's speed. CPU
cores have to communicate with each other through channels and this uses up some of the
extra speed.
By increasing the data bus from 32-bit to 64-bit, the computer can transfer twice as much
information at one time. Therefore, increasing the size of the data bus improves the system
performance of the computer.
Cache memory
Cache is a small amount of memory which is a part of the CPU - closer to the CPU
than RAM. It is used to temporarily hold instructions and data that the CPU is likely to reuse.
The CPU control unit automatically checks cache for instructions before requesting data from
RAM. This saves fetching the instructions and data repeatedly from RAM – a relatively slow
process which might otherwise keep the CPU waiting. Transfers to and from cache take less
time than transfers to and from RAM.
The more cache there is, the more data can be stored closer to the CPU.
L1 is usually part of the CPU chip itself and is both the smallest and the fastest to access. Its size is often restricted to between 8 KB and 64 KB.
L2 and L3 caches are bigger than L1. They are extra caches built between the CPU
and the RAM. Sometimes L2 is built into the CPU with L1. L2 and L3 caches take
slightly longer to access than L1. The more L2 and L3 memory available, the faster
a computer can run.
Not a lot of physical space is allocated for cache. There is more space for RAM, which is
usually larger and less expensive.
Clock speed
The clock speed - also known as clock rate - indicates how fast the CPU can run. This is measured in megahertz (MHz) or gigahertz (GHz) and corresponds to how many instruction cycles the CPU can deal with in a second. A 2 GHz CPU performs two billion cycles a second. A faster CPU uses more energy and creates more heat.
A computer will normally have a maximum clock speed set by default, but it is possible to
change this speed in the computer BIOS. Some people increase a CPU clock speed to try to
make their computer run faster - this is called overclocking.
There are limits to how fast a CPU can run and its circuitry cannot always keep up with an
overclocked speed. If the clock tells the CPU to execute instructions too quickly, the
processing will not be completed before the next instruction is carried out. If the CPU cannot
keep up with the pace of the clock, the data is corrupted. CPUs can also overheat if they are
forced to work faster than they were designed to work.
12b. What kind of memory-types do you know? What are the parameters of these? What is
the difference between SRAMs and DRAMs? What are their working principles? What are
they typically good for? How do caches work?
Memory is the most essential element of a computing system, because without it a computer cannot perform even simple tasks. Computer memory is of two basic types - primary memory (RAM and ROM) and secondary memory (hard drive, CD, etc.). Random Access Memory (RAM) is primary volatile memory and Read Only Memory (ROM) is primary non-volatile memory.
The programs and data that the CPU requires during execution of a program are
stored in this memory.
It is a volatile memory, as the data is lost when the power is turned off.
RAM is further classified into two types- SRAM (Static Random Access
Memory) and DRAM (Dynamic Random Access Memory).
RAMs come in two varieties, static and dynamic. Static RAMs (SRAMs), are constructed
internally using circuits similar to our basic D flip-flop. These memories have the property
that their contents are retained as long as the power is kept on: seconds, minutes, hours, even
days. Static RAMs are very fast. A typical access time is a few nsec. For this reason, static RAMs are popular as level 2 cache memory.
Dynamic RAMs (DRAMs), in contrast, do not use flip-flops. Instead, a dynamic RAM is an
array of cells, each cell containing one transistor and a tiny capacitor. The capacitors can be
charged or discharged, allowing 0s and 1s to be stored. Because the electric charge tends to
leak out, each bit in a dynamic RAM must be refreshed (reloaded) every few milliseconds to
prevent the data from leaking away.
There are several types of DRAM, e.g. FPM (Fast Page Mode) DRAM, EDO (Extended Data Output) DRAM, SDRAM (Synchronous DRAM) and DDR (Double Data Rate) SDRAM.
Parameters
Memory access time is the time that separates sending a memory access request and the
reception of the requested information. The access time determines unitary speed of a
memory (the reception time of unitary data). The access time is small for fast memories.
Memory cycle time is the shortest time that has to elapse between consecutive access requests to the same memory location. The memory cycle time is another parameter that characterizes the overall speed of the memory: the memory is fast when the cycle time is small.
Memory transfer rate is the speed of reading or writing data in the given memory, measured
in bits/sec or bytes/sec.
By combining a small amount of fast memory with a large amount of slow memory, we get (almost) the speed of the fast memory and the capacity of the large memory at a moderate price. The small, fast memory is called a cache. One of the most effective techniques for improving
both bandwidth and latency comes from the use of multiple caches. A basic technique that
works very effectively is to introduce a separate cache for instructions and data. There are
several benefits from having separate caches for instructions and data, often called a split
cache.
Today, many memory systems are more complicated than this, and an additional cache,
called a level 2 cache, may reside between the instruction and data caches and main memory.
In fact, there may be three or more levels of cache as more sophisticated memory systems are
required. The CPU chip itself contains a small instruction cache and a small data cache,
typically 16 KB to 64 KB. Then there is the level 2 cache, next to the CPU chip and connected to it by a high-speed path. A typical size for the L2 cache is 512 KB to 1 MB. The third-
level cache is on the processor board and consists of a few megabytes of SRAM.
Caches depend on two kinds of address locality to achieve their goal. Spatial locality is the
observation that memory locations with addresses numerically similar to a recently accessed
memory location are likely to be accessed in the near future. Caches exploit this property by
bringing in more data than have been requested, with the expectation that future requests can
be anticipated. Temporal locality occurs when recently accessed memory locations are accessed again. On each memory reference, part of the address selects a cache entry (line). If that entry is valid, the TAG field of the memory address and the Tag field in the cache entry are compared. If they match, it is called a cache hit and the requested word can be read from the cache directly. If they don't match, it is called a cache miss, and in this case the cache fetches the line containing the requested data from memory and places it in the cache, replacing the previous contents of that entry.
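Here is a small Python sketch of a direct-mapped cache lookup following the tag/valid/hit description above; the cache geometry (8 lines of 16 bytes) and the memory contents are arbitrary.

LINE_SIZE, NUM_LINES = 16, 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_LINES)]

def access(address, memory):
    offset = address % LINE_SIZE                       # byte within the line
    index  = (address // LINE_SIZE) % NUM_LINES        # which cache entry
    tag    = address // (LINE_SIZE * NUM_LINES)        # identifies the line

    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return "hit", line["data"][offset]
    # Miss: fetch the whole line from memory and replace the cache entry.
    base = (address // LINE_SIZE) * LINE_SIZE
    line.update(valid=True, tag=tag, data=memory[base:base + LINE_SIZE])
    return "miss", line["data"][offset]

memory = bytes(range(256))
print(access(0x23, memory))   # ('miss', 35)
print(access(0x24, memory))   # ('hit', 36) - same line: spatial locality pays off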
13b) What is the task of buses (ISA, eISA, PCI, PCIe, USB)? What are synchronous and
asynchronous buses? How does the bus arbitration work?
A bus is a common pathway through which information flows from one computer component
to another. This pathway is used for communication purposes.
Functions of a bus
Data sharing - The buses transfer or send data either in the serial or parallel method of
data transfer. This allows for the exchange of 1, 2, 4 or even 8 bytes of data at a time.
Buses are classified depending on how many bits they can move at the same time,
which means that we have 8-bit, 16-bit, 32-bit or even 64-bit buses.
Addressing - This allows data to be sent to or from specific memory locations.
Powering – The bus supplies power to various peripherals connected to it.
Timing – it provides a system clock signal to synchronize the peripherals attached to
it with the rest of system.
There are three types of buses:
1. Address Bus - Address bus is unidirectional because data flow in one
direction, from microprocessor to memory or from microprocessor to
Input/output devices. It carries only the address and not the actual data. In
8085 microprocessor the length of address bus is 16 bits. The Length of the
address bus determines the amount of memory a system can address.
2. Data bus – It carries the actual information. Data bus is bidirectional because
data flow in both directions, from microprocessor to memory or Input/output
devices and from memory or Input/output devices to microprocessor. In 8085
the length of a data bus is 8 bits. The width of the data bus is directly related to
the largest number that the bus can carry.
3. Control bus – It is used to generate timing and control all the associated
peripherals.
Bus expansions:
ISA
An Industry Standard Architecture bus (ISA bus) is a computer bus that allows additional
expansion cards to be connected to a computer's motherboard. It is a standard bus architecture
for IBM compatibles. Introduced in 1981, the ISA bus was designed to support the Intel 8088
microprocessor for IBM’s first-generation PC.
The ISA bus provides direct memory access using multiple expansion cards on a memory
channel allowing separate interrupt request transactions for each card. Depending on the
version, the ISA bus can support a network card, additional serial ports, a video card etc.
eISA
Extended Industry Standard Architecture (EISA) is a bus architecture that extends the
Industry Standard Architecture (ISA) from 16 bits to 32 bits. EISA was introduced in 1988.
EISA has 32-bit direct memory access (DMA), central processing unit (CPU) and bus master
devices. EISA also has improved data transfer rates (DTR) of up to 33 MB/s, automatic configuration, a synchronous data transfer protocol (SDTP) and a structure compatible with older ISA buses with 8- or 16-bit data paths.
PCI
A Peripheral Component Interconnect Bus (PCI bus) connects the CPU and expansion boards
such as modem cards, network cards and sound cards. These expansion boards are normally
plugged into expansion slots on the motherboard. During system start up the operating system
searches for all PCI buses to attain information about the resources needed for each device.
The OS communicates with each device and assigns system resources, including memory,
interrupt requests and allotted input/output (I/O) space.
PCIe
PCI Express is a serial connection that operates more like a network than a bus. Instead of
one bus that handles data from multiple sources, PCIe has a switch that controls several
point-to-point serial connections. These connections fan out from the switch, leading directly
to the devices where the data needs to go. Every device has its own dedicated connection, so
devices no longer share bandwidth like they do on a normal bus.
USB - A Universal Serial Bus (USB) is basically a newer port that is used as a common
interface to connect several different types of devices such as keyboards, printers etc. One of
the important features of the USB is hot swapping. This feature allows a device to be
removed or replaced without the old prerequisite of rebooting and interrupting the system. Another USB feature is the use of direct current (DC). In fact, several devices use the USB line only to draw DC power and do not transfer data.
Synchronous Bus
This bus is used to interconnect devices that comprise a computer system where the timing of
transactions between devices is under the control of a synchronizing clock signal. A device
connected to a synchronous bus must guarantee to respond to a command within a period set
by the frequency of the clock signal or a transmission error will occur. Such buses are usually
employed in closely controlled processor backplane environments where device
characteristics and inter device signal delays are known.
Asynchronous Bus
A bus that interconnects devices of a computer system where information transfers between
devices are self-timed rather than controlled by a synchronizing clock signal. A connected
device indicates its readiness for a transfer by activating a request signal. A responding
device indicates the completion of this transfer by activating an acknowledge signal. The time
required to complete the transaction is determined by the response times of the devices and
the delays along the interconnecting bus and may vary for different devices.
Bus Arbitration
It refers to the process by which the current bus master accesses and then leaves control of the bus and passes it to another bus-requesting processor unit. The controller that has access to the bus at a given instant is known as the bus master. A conflict may arise if a number of DMA controllers, other controllers or processors try to access the common bus at the same time, but access can be given to only one of them. Only one processor or controller can be
Bus master at the same point of time. To resolve these conflicts, Bus Arbitration procedure is
implemented to coordinate the activities of all devices requesting memory transfers. The
selection of the bus master must take into account the needs of various devices by
establishing a priority system for gaining access to the bus. The Bus Arbiter decides who
would become current bus master.
There are two approaches :
Centralized bus arbitration - A single bus arbiter performs the required arbitration.
Distributed bus arbitration - All devices participate in the selection of the next bus master.
There are three methods of bus arbitration
(i) Daisy Chaining method – It is a centralized bus arbitration method. During any
bus cycle, the bus master may be any device – the processor or any DMA
controller unit, connected to the bus.
Advantages –
Simplicity and scalability: the user can add more devices anywhere along the chain, up to a certain maximum value.
Disadvantages –
The priority assigned to a device depends on its position along the chain; propagation delays arise in this method; and if one device fails, the entire system stops working. (A small sketch of this scheme is given after the three methods.)
(ii) Polling or Rotating Priority method - In this method, the devices are assigned unique priorities and compete to access the bus, but the priorities are dynamically changed to give every device an opportunity to access the bus.
Advantages - This method does not favour any particular device or processor; it is also quite simple; and if one device fails, the entire system does not stop working.
Disadvantages - Adding bus masters is difficult, as it increases the number of address lines of the circuit.
(iii) Fixed priority or Independent Request method –
In this method, the bus control passes from one device to another only through the centralized
bus arbiter.
Advantages - This method generates a fast response. Disadvantages - Hardware cost is high, as a large number of control lines is required.
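A toy Python model of the daisy-chain scheme mentioned above: the bus grant propagates along the chain and the first requesting device takes it, which is exactly why the priority is fixed by physical position. The request pattern is arbitrary.

def daisy_chain_grant(requests):
    # `requests` is ordered by position on the chain; position is priority.
    for position, wants_bus in enumerate(requests):
        if wants_bus:
            return position          # this device becomes the bus master
    return None

print(daisy_chain_grant([False, True, True]))   # 1 - device 1 wins over device 2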
14b. What are the tasks at microarchitecture level? What is the datapath? What controls its
work? What are microinstructions? What registers are used between the datapath and the
main memory (IJVM!)?
The design of the microarchitecture level depends on the ISA being implemented, as well as
the cost and performance goals of the computer. Many modern ISAs, particularly RISC
designs, have simple instructions that can usually be executed in a single clock cycle. More
complex ISAs, such as the Pentium 4, may require many cycles to execute a single
instruction. Executing an instruction may require locating the operands in memory, reading
them, and storing results back into memory. The sequencing of operations within a single
instruction often leads to a different approach to control than that for simple ISAs.
Microarchitecture will contain a microprogram (in ROM), whose job is to fetch, decode, and
execute IJVM instructions. The microprogram has a set of variables, called the state of the
computer, which can be accessed by all the functions. Each function changes at least some of
the variables making up the state. For example, the Program Counter (PC) is part of the state.
It indicates the memory location containing the next function (i.e., ISA instruction) to be
executed. During the execution of each instruction, the PC is advanced to point to the next
instruction to be executed.
Each instruction has a few fields, the first of every instruction is the OPCODE which
identifies the instruction telling whether to ADD or BRANCH. Many instructions need extra
fields to specify the operands. Ex: an instruction which needs to access a local variable should have an additional field to tell which variable.
This model of execution, sometimes called the fetch-execute cycle, is useful in the abstract
and may also be the basis for implementation for ISAs like IJVM that have complex
instructions. The data path is that part of the CPU containing the ALU, its inputs, and its
outputs. It contains a number of 32 bit registers which are accessible only at
microarchitecture level. The ALU needs two data inputs: a left input (A) and a right input
(B). Attached to the left input is a holding register, H. Attached to the right input is the B bus.
H can be loaded by choosing an ALU function that just passes the right input (from the B
bus) through to the ALU output. It is explicitly possible to read and write the same register on
one cycle. For example, it is allowed to put SP onto the B bus, disable the ALU’s left input,
enable the INC signal, and store the result in SP, thus incrementing SP by 1.
The start of subcycle 1 is triggered by the falling edge of the clock. The subcycles are as follows:
1. The control signals are set up (Δw).
2. The registers are loaded onto the B bus (Δx).
3. The ALU and shifter operate (Δy).
4. The results propagate along the C bus back to the registers (Δz).
At the rising edge of the next clock cycle, the results are stored in the registers. The machine
has two different ways to communicate with memory: a 32-bit, word-addressable memory
port and an 8-bit, byte-addressable memory port. The 32-bit port is controlled by two
registers, MAR (Memory Address Register), and MDR (Memory Data Register). The 8 bit is
controlled by the PC register and can only read data from memory, not write it. To control the data path we need 29 signals, which can be divided into five functional groups.
MAR (Memory Address Register): stores the address of the 32-bit word that is being written to or read from memory. This is a write-only register.
The MDR is used to hold the 32-bit word that is being written to or read from memory. When
no memory access is being performed, this register can be used as a general register.
MBR (Memory Buffer Register): a 1-byte register that holds the most recently fetched byte of the instruction stream (opcode or operand) being processed.
Hold (H) - This register holds data that is to be supplied to the left (A) input of the arithmetic
logic unit. This is the only register that can perform this function.
15b)Why can transistors be the elementary components of computers? What are gates?
What is the Boole-algebra? What are equivalent electric circuits?
Why can transistors be the elementary components of computers? What are gates?
A Digital circuit is an electronic circuit where only two logical values can be present: 0, 1. A
circuit consists of many Gates.
A Gate is a tiny electronic device that can compute the output of some function based on its inputs. A Gate consists of a few Transistors.
A Transistor is a tiny electronic switch that can compute the output of some very basic operation based on its inputs. Its output is either a 0 or a 1.
What is the Boole-algebra?
Boolean algebra is an algebra that works with only two values: 1s and 0s.
Because of this property, it is perfectly capable of describing digital circuits.
Each Gate can represent one elementary Boolean operation (like AND, OR).
In Boolean algebra, we can represent the inputs and outputs of the basic operations using truth tables.
In the digital circuit itself, it is physically the easiest to implement the following logical gates:
- NOT → requires 1 transistor
- NAND → requires 2 transistors
- NOR → requires 2 transistors
These three logical gates are the basic building blocks of any digital circuit.
But in Boolean algebra, we would often like to use AND and OR operations, which are a little bit trickier to implement:
- AND = NAND+NOT → requires 3 transistors
- OR = NOR+NOT → requires 3 transistors
So because of their lower complexity, we mostly use the NAND and NOR gates in
production. 66.6% of a usual digital circuit consists of exclusively these two gates.
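A quick Python check of why NAND alone is sufficient: NOT, AND and OR built only from NAND, verified against their truth tables.

def nand(a, b): return 1 - (a & b)

def not_(a):    return nand(a, a)               # NOT from a single NAND
def and_(a, b): return not_(nand(a, b))         # NAND followed by NOT
def or_(a, b):  return nand(not_(a), not_(b))   # De Morgan: NOT a NAND NOT b

for a in (0, 1):
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
print("AND and OR built from NAND match their truth tables")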
So we can see that reorganizing gates can lead to simpler solutions. It's exactly the same principle as reorganizing equations in math class!
Just as in math, there are several laws (such as the commutative, associative, distributive and De Morgan's laws) that can be used to make these equations simpler.