Fig: Superscalar Architecture of Pentium
The Pentium family of processors originated from the 80486 microprocessor. The term "Pentium processor" refers to a
family of microprocessors that share a common architecture and instruction set. The first Pentium processors were
introduced in 1993. They ran at a clock frequency of either 60 or 66 MHz and had 3.1 million transistors. Some of the features
of the Pentium architecture are:
Complex Instruction Set Computer (CISC) architecture with Reduced Instruction Set Computer (RISC)
performance.
64-Bit Bus
Upward code compatibility.
The Pentium processor uses a superscalar architecture and hence can issue multiple instructions per cycle.
Multiple Instruction Issue (MII) capability.
The Pentium processor executes instructions in five stages. This staging, or pipelining, allows the processor to overlap
multiple instructions so that it takes less time to execute two instructions in a row.
The Pentium processor fetches the branch target instruction before it executes the branch instruction.
The Pentium processor has two separate 8-kilobyte (KB) caches on chip, one for instructions and one for data. It
allows the Pentium processor to fetch data and instructions from the cache simultaneously.
When data is modified, only the data in the cache is changed. Memory data is changed only when the Pentium
processor replaces the modified data in the cache with a different set of data (a write-back policy).
The Pentium processor has been optimized to run critical instructions in fewer clock cycles than the 80486
processor.
The Pentium's basic integer pipeline is five stages long, with the stages broken down as follows:
1. Pre-fetch/Fetch : Instructions are fetched from the instruction cache and aligned in pre-fetch buffers for
decoding.
2. Decode1 : Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes
place at this stage.
3. Decode2 : Decoding continues, and the microcode ROM is used here if necessary. Address computations also take
place at this stage.
4. Execute : The integer hardware executes the instruction.
5. Write-back : The results of the computation are written back to the register file.
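As a rough illustration of why overlapping stages saves time, the sketch below models an ideal five-stage pipeline with no stalls and no instruction pairing. The function names are hypothetical, and the abbreviations PF, D1, D2, EX and WB are shorthand for the five stages listed above.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]  # the five integer pipeline stages


def pipeline_schedule(n_instructions):
    """Map each clock cycle to {stage: instruction index} for an ideal pipeline.

    Assumption: one instruction enters PF per cycle and nothing ever stalls.
    """
    schedule = {}
    for instr in range(n_instructions):
        for depth, stage in enumerate(STAGES):
            cycle = instr + depth + 1  # instruction i enters PF in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(n_instructions):
    """Cycles needed until the last instruction leaves WB."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1


if __name__ == "__main__":
    for cycle, stages in sorted(pipeline_schedule(3).items()):
        print(cycle, stages)
    print("cycles for 2 instructions:", total_cycles(2))
```

Under these assumptions two instructions complete in 6 cycles, whereas running them strictly one after the other would take 10.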
A31:A3 Address bus lines (outputs, except during cache snooping) determine where in the 4 GB
memory space or 64 KB I/O space the processor is accessing.
Since address lines A2:A0 do not exist on the Pentium, the CPU uses A31:A3 to identify a group
of 8 locations known as a quadword (8 bytes, also known as a chunk).
Without A2:A0, the CPU can only output every 8th address (e.g. 00H, 08H, 10H,
18H, 20H, 28H, etc.).
Table: Output on CPU address lines for addresses within a quadword, addresses 00000000 - 000000F
Table: Output on CPU address lines for addresses within a quadword, addresses 00000000 - 000001F
BE7#:BE0# (outputs): Byte enable lines, which enable each of the 8 bytes in the 64-bit data
path.
They are in effect a decode of the address lines A2:A0, which the Pentium does not generate.
Which lines go active depends on the address and on whether a byte, word, doubleword or
quadword is required.
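The relationship between a byte address, the value driven on A31:A3 and the byte enables can be sketched as follows. This is a simplified model, not a description of the actual bus unit: the function name is hypothetical, the returned mask uses a 1 bit to mean "BEn# is asserted" (the real pins are active low), and transfers that cross a quadword boundary are rejected rather than split into two bus cycles.

```python
def bus_signals(address, size):
    """Return (A31:A3 address value, byte-enable mask) for one transfer.

    Bit n of the mask is 1 when BEn# would be asserted.
    Assumption: `size` is 1, 2, 4 or 8 bytes and the transfer stays
    inside a single quadword.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4 or 8 bytes")
    offset = address & 0x7  # the A2:A0 bits that the Pentium does not output
    if offset + size > 8:
        raise ValueError("transfer crosses a quadword boundary")
    quadword_address = address & 0xFFFFFFF8  # what appears on A31:A3
    mask = ((1 << size) - 1) << offset       # one bit per enabled byte lane
    return quadword_address, mask


if __name__ == "__main__":
    for addr, size in [(0x13, 1), (0x14, 4), (0x08, 8)]:
        qw, mask = bus_signals(addr, size)
        print(f"addr={addr:02X}H size={size}: A31:A3={qw:02X}H BE mask={mask:08b}")
```

For example, a single byte at 13H is addressed as quadword 10H with only byte lane 3 enabled.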
Data bus: The data bus provides a path for data to flow.
The data can flow to/from the microprocessor during a memory or I/O operation.
D63:D0 (bi-directional): The 64-bit data path to or from the processor. The signal W/R#
distinguishes direction.
Control bus: The control bus is used by the CPU to tell the memory and I/O chips what the CPU
is doing.
BRDY# (input): This signal ends the current bus cycle; holding it inactive extends the bus cycle with wait states.
M/IO# (output): Defines whether the bus cycle is a memory access or an input/output port access.
D/C# (output): Defines whether the bus cycle is a data or code access to memory.
CACHE# (output): Processor indication of internal cacheability. CACHE# and KEN# are used
together to determine if a read will be turned into a line fill (burst cycle).
Fig: Bus cycle timing (ADS#, BRDY#): zero-wait-state cycle and DRAM read with wait states (TW)
Special Cycles Definition
If interrupts are enabled, an INTR (e.g. a timer tick) will force the microprocessor out
of a halt.
Triple Fault: The CPU detects a further exception (e.g. General Protection
Fault, Invalid Opcode, Stack Overflow) while executing the Double Fault
exception handler.
Shutdown is decoded by the system board, which generates a soft reset (INIT to the Pentium) in a PC.
Burst Cycles can transfer larger quantities of data in fewer clocks than single transfer cycles.
Burst cycles are limited to an address area that begins on a 32-byte boundary, and the system
board must calculate the other 3 burst addresses.
The system board must generate the subsequent addresses (2nd, 3rd, 4th) in the following
sequence.
This is required in order to fill the Pentium's 32-byte (20H) cache line on 32-byte boundaries.
Address Output by Pentium    2nd    3rd    4th
00H                          08H    10H    18H
08H                          00H    18H    10H
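The two sequences listed above (08H, 10H, 18H after a first address of 00H, and 00H, 18H, 10H after 08H) are both reproduced by XOR-ing the starting quadword offset with 08H, 10H and 18H. The sketch below applies that rule; the function name is hypothetical, and extending the rule to starting offsets 10H and 18H is an assumption, since those rows are not listed here.

```python
def burst_addresses(first_address):
    """Return the four quadword addresses of one burst line fill, in order.

    Assumption: the 2nd, 3rd and 4th addresses are the first address with
    its offset inside the 32-byte line XOR-ed with 08H, 10H and 18H.
    """
    line_base = first_address & ~0x1F  # start of the 32-byte cache line
    offset = first_address & 0x18      # which quadword within the line
    return [line_base | (offset ^ step) for step in (0x00, 0x08, 0x10, 0x18)]


if __name__ == "__main__":
    for start in (0x00, 0x08):
        order = ", ".join(f"{a:02X}H" for a in burst_addresses(start))
        print(f"first address {start:02X}H -> {order}")
```

A system board design would implement the same rule in hardware, since the Pentium outputs only the first address of the burst.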
Pipelining
The u-pipe can execute any instruction, while the v-pipe can execute only simple
instructions as defined in the Instruction Pairing Rules.
When instructions are paired, the instruction issued to the v-pipe is always the next
sequential instruction after the one issued to the u-pipe.
1. Prefetch (PF):
2. Decode1 (D1):
Two parallel decoders attempt to decode and issue the next two sequential instructions.
This stage checks whether the instructions can be paired.
It decodes each instruction to generate a control word.
A single control word causes direct execution of an instruction.
Complex instructions require microcoded control sequencing.
3. Decode2 (D2):
The control word is decoded.
Addresses of memory-resident operands are calculated.
4. Execute (EX):
The instruction is executed in the ALU.
The data cache is accessed at this stage.
Instructions that require both an ALU operation and a data cache access need more than one clock.
5. Writeback (WB):
The CPU stores the result and updates the flags.
1. Prefetch (PF):
2. Instruction Decode (D1):
Two parallel decoders attempt to decode and issue the next two sequential
instructions.
Data is converted to floating-point format before being loaded into the floating
point unit.
Floating-point results are rounded and the result is written to the target
floating-point register.
8. Error Reporting (ER)
Program transfer (branch) instructions change the execution sequence, invalidating all the instructions that entered the pipeline after
the program transfer instruction.
This causes bubbles in the pipeline, where no work is done as the pipeline stages are
reloaded.
If the prediction turns out to be correct, the pipeline will not be flushed and no clock
cycles will be lost.
If the prediction turns out to be incorrect, the pipeline is flushed and restarted with the
correct instruction.
A misprediction results in a 3-cycle penalty if the branch is executed in the u-pipeline and a 4-cycle
penalty if it is executed in the v-pipeline.
Branch prediction is implemented using a 4-way set-associative cache with 256 entries. This is
referred to as the Branch Target Buffer (BTB).
The directory entry (tag) for each line contains the following information:
History Bits: track how often the branch has been taken
Source memory address that the branch instruction was fetched from (address
of I3)
If its directory entry is valid, the target address of the branch (address of I50)
is stored in the corresponding data entry in the BTB.
The BTB is a look-aside cache that sits off to the side of the D1 stages of the two pipelines and
monitors for branch instructions. The first time that a branch instruction enters either
pipeline, the BTB uses its source memory address to perform a lookup in the cache.
Since the instruction has not been seen before, this results in a BTB miss. This means the
prediction logic has no history on the instruction. It then predicts that the branch will not
be taken, and program flow is not altered. Even unconditional jumps will be predicted
as not taken the first time that they are seen by the BTB. When the instruction reaches the
execution stage, the branch will be either taken or not taken. If taken, the next
instruction to be executed should be the one fetched from the branch target address. If not
taken, the next instruction is the one at the next sequential memory address. When the branch is
taken for the first time, the execution unit provides feedback to the branch prediction
logic. The branch target address is sent back and recorded in the BTB. A directory entry
is made containing the source memory address, with the history bits set to strongly taken.
Table columns: History Bits | Resulting Description | Prediction Made | If branch is taken | If branch is not taken
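The BTB behaviour described above (miss on first sight, entry created as strongly taken on the first taken branch, history adjusted afterwards) can be sketched with a small model. The class and method names are hypothetical. The two-bit saturating counter encoding (3 = strongly taken down to 0 = strongly not taken, with values 2 and 3 predicting taken) is an assumption, and the 4-way set-associative organisation and 256-entry limit are not modelled.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch's source address to [history bits, target]."""

    STRONGLY_TAKEN = 3  # assumed encoding: 3, 2 predict taken; 1, 0 predict not taken

    def __init__(self):
        self.entries = {}

    def predict(self, source):
        """Return the predicted target address, or None for 'not taken'."""
        entry = self.entries.get(source)
        if entry is None:
            return None  # BTB miss: no history, so predict not taken
        history, target = entry
        return target if history >= 2 else None

    def update(self, source, taken, target=None):
        """Feedback from the execution stage once the branch has resolved."""
        entry = self.entries.get(source)
        if entry is None:
            if taken:  # first taken branch: record it as strongly taken
                self.entries[source] = [self.STRONGLY_TAKEN, target]
            return
        if taken:
            entry[0] = min(3, entry[0] + 1)
            if target is not None:
                entry[1] = target
        else:
            entry[0] = max(0, entry[0] - 1)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.predict(0x100))        # miss on first sight
    btb.update(0x100, taken=True, target=0x500)
    print(hex(btb.predict(0x100)))   # now predicted taken
```

With this encoding a branch recorded as strongly taken must be not taken twice in a row before the prediction flips.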
To increase the speed and efficiency of real-number computations, computers or FPUs typically
represent real numbers in a binary floating-point format. In this format, a real number has three
parts: a sign, a significand, and an exponent. Figure 31-2 shows the binary floating-point format that
the Intel Architecture FPU uses. This format conforms to the IEEE standard.
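To make the three parts concrete, the sketch below splits a number into its sign, exponent and significand fields. It uses the 64-bit IEEE double format because that is what Python's struct module exposes; the FPU's internal 80-bit extended format has different field widths. The function name is hypothetical.

```python
import struct


def decompose_double(x):
    """Return (sign, biased exponent, fraction bits) of an IEEE 754 double.

    Layout: 1 sign bit, 11 exponent bits (bias 1023), 52 fraction bits.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exponent, fraction


if __name__ == "__main__":
    sign, exp, frac = decompose_double(-6.5)
    # -6.5 = -1.625 * 2**2, so the unbiased exponent is 2
    print(sign, exp - 1023, frac / 2**52)
```

For normalized values the significand is 1 plus the fraction field scaled by 2**-52, so -6.5 decomposes into sign 1, unbiased exponent 2 and significand 1.625.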
FPU Architecture
From an abstract, architectural view, the FPU is a coprocessor that operates in parallel with the
processor's integer unit (see Figure 31-4). The FPU gets its instructions from the same instruction
decoder and sequencer as the integer unit and shares the system bus with the integer unit. Other
than these connections, the integer unit and FPU operate independently and in parallel. (The actual
microarchitecture of an Intel Architecture processor varies among the various families of
processors. For example, the Pentium Pro processor has two integer units and two FPUs; whereas,
the Pentium processor has two integer units and one FPU, and the Intel486 processor has one
integer unit and one FPU.)
The instruction execution environment of the FPU (see Figure 31-5) consists of 8 data registers
(called the FPU data registers) and the following special-purpose registers:
The status register.
The control register.
The tag word register.
Instruction pointer register.
Last operand (data pointer) register.
Opcode register.
These registers are described in the following sections.
The FPU Data Registers
The FPU data registers (shown in Figure 31-5) consist of eight 80-bit registers. Values are stored in
these registers in the extended-real format shown in Figure 31-17. When real, integer, or packed
BCD integer values (in any of the formats shown in Figure 31-17) are loaded from memory into
any of the FPU data registers, the values are automatically converted into extended-real format (if they are not
already in that format). When computation results are subsequently transferred back
into memory from any of the FPU registers, the results can be left in the extended-real format or
converted back into one of the other FPU formats (real, integer, or packed BCD integers) shown in
Figure 31-17.
The FPU instructions treat the eight FPU data registers as a register stack (see Figure 31-6). All
addressing of the data registers is relative to the register on the top of the stack. The register
number of the current top-of-stack register is stored in the TOP (stack TOP) field in the FPU status
word. Load operations decrement TOP by one and load a value into the new top-of-stack register,
and store operations store the value from the current TOP register in memory and then increment
TOP by one. (For the FPU, a load operation is equivalent to a push and a store operation is
equivalent to a pop.)
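The push and pop behaviour of TOP can be sketched with a small model. The class and method names are hypothetical. The model assumes that TOP starts at 0 and wraps around modulo 8, so the first load lands in physical register 7; tag-word handling and stack overflow or underflow detection are left out.

```python
class FpuStack:
    """Toy model of the eight FPU data registers addressed relative to TOP."""

    def __init__(self):
        self.regs = [None] * 8  # physical registers 0..7
        self.top = 0            # assumed initial value of the TOP field

    def load(self, value):
        """Load (push): decrement TOP, then write the new top-of-stack."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def store(self):
        """Store (pop): read the top-of-stack, then increment TOP."""
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        """Read the register i places below the top of the stack."""
        return self.regs[(self.top + i) % 8]


if __name__ == "__main__":
    fpu = FpuStack()
    fpu.load(1.5)
    fpu.load(2.5)
    print(fpu.top, fpu.st(0), fpu.st(1))
    print(fpu.store(), fpu.top)
```

Because every access goes through `st(i)`, code written against the stack never needs to know which physical register currently holds the top of the stack.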
Operating Modes
The Pentium and Pentium Pro processors have three operating modes:
Real-address mode. This mode lets the processor address "real" memory addresses.
It can address up to 1 MByte of memory (20-bit addresses). It can also be called
"unprotected" mode since operating system (such as DOS) code runs in the same
mode as the user applications. Pentium and Pentium Pro processors have this mode
to be compatible with early Intel processors such as the 8086. The processor is set to this
mode following a power-up or a reset and can be switched to protected mode using
a single instruction.
Protected mode. This is the preferred mode for a modern operating system. It allows
applications to use virtual memory addressing and supports multiple programming
environments and protection mechanisms.
System management mode. This mode is designed for fast state snapshot and
resumption. It is useful for power management.
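The 1 MByte limit of real-address mode follows from its 20-bit addresses. A minimal sketch, assuming the 8086-style segment:offset scheme (which is not described in this section) and wrap-around at 1 MByte as on the original 8086; the function name is hypothetical.

```python
ADDRESS_BITS = 20  # real-address mode uses 20-bit physical addresses


def real_mode_physical_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address.

    Assumption: 8086 behaviour, where the sum wraps around at 1 MByte.
    """
    return ((segment << 4) + offset) & ((1 << ADDRESS_BITS) - 1)


if __name__ == "__main__":
    print(hex(real_mode_physical_address(0xB800, 0x0000)))
    print("addressable bytes:", 1 << ADDRESS_BITS)
```

Here 1 << 20 equals 1,048,576 bytes, which is the 1 MByte figure quoted for real-address mode.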