Fig: Superscalar Architecture of Pentium

Pentium Architecture
The Pentium family of processors evolved from the 80486 microprocessor. The term "Pentium processor" refers to a
family of microprocessors that share a common architecture and instruction set. The first Pentium processors were
introduced in 1993. They ran at a clock frequency of either 60 or 66 MHz and contained 3.1 million transistors. Some of the features
of the Pentium architecture are:
Complex Instruction Set Computer (CISC) architecture with Reduced Instruction Set Computer (RISC)
performance.
64-Bit Bus
Upward code compatibility.
Pentium processor uses Superscalar architecture and hence can issue multiple instructions per cycle.
Multiple Instruction Issue (MII) capability.
The Pentium processor executes instructions in five stages. This staging, or pipelining, allows the processor to overlap
multiple instructions, so a sequence of instructions completes in less time than if each one finished before the next began.
The Pentium processor fetches the branch target instruction before it executes the branch instruction.
The Pentium processor has two separate 8-kilobyte (KB) caches on chip, one for instructions and one for data. It
allows the Pentium processor to fetch data and instructions from the cache simultaneously.
The data cache is write-back: when data is modified, only the copy in the cache is changed. Memory is updated only when the Pentium
processor replaces the modified data in the cache with a different set of data.
The Pentium processor has been optimized to run critical instructions in fewer clock cycles than the 80486
processor.
Fig : Superscalar Architecture of Pentium
The Pentium processor has two primary operating modes -

1. Protected Mode - In this mode all instructions and architectural features are available, providing the highest
performance and capability. This is the recommended mode that all new applications and operating systems
should target.
2. Real-Address Mode - This mode provides the programming environment of the Intel 8086 processor, with a
few extensions. Reset initialization places the processor in real mode, from which a single instruction can
switch it to protected mode.
The Pentium's basic integer pipeline is five stages long, with the stages broken down as follows:
1. Pre-fetch/Fetch : Instructions are fetched from the instruction cache and aligned in pre-fetch buffers for
decoding.
2. Decode1 : Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes
place at this stage.
3. Decode2 : Decoding continues, and the microcode ROM is used here if necessary. Address computations also take
place at this stage.
4. Execute : The integer hardware executes the instruction.
5. Write-back : The results of the computation are written back to the register file.
Fig: Pentium pipeline stages
Floating Point Unit :

There are 8 general-purpose 80-bit floating-point registers. The floating-point unit (FPU) has an 8-stage pipeline. The first five stages are
similar to those of the integer unit. Because errors are more likely in the FPU than in the integer unit, the FPU pipeline includes an additional
error-checking stage. The floating-point unit is shown below.
Fig: Floating Point Unit

FRD - Floating Point Rounding
FDD - Floating Point Division
FADD - Floating Point Addition
FEXP - Floating Point Exponent
FAND - Floating Point And
FMUL - Floating Point Multiply
Pentium Processor with 64 Bit Wide Memory
I/O Address Space is limited to 64 Kbytes (0000H-FFFFH).
This limit is imposed by a 16 bit CPU Register.
A 16 bit register can store up to FFFFH (1111 1111 1111 1111 ).
Which CPU Register limits I/O space to 64K?

Address bus: The microprocessor provides an address to the memory & I/O chips.
The number of address lines determines the amount of memory supported by the processor.
A31:A3 Address bus lines (outputs, except during cache snooping) determine where in the 4GB
memory space or 64K I/O space the processor is accessing.
The Pentium address consists of two sets of signals:
Address Bus (A31:3)
Byte Enables (BE7#:0#)
Since address lines A2:0 do not exist on the Pentium, the CPU uses A31:3 to identify a group
of 8 locations known as a Quadword (8 bytes, also known as a chunk).
Without A2:0, the CPU is only capable of outputting every 8th address. (e.g. 00H, 08H, 10H,
18H, 20H, 28H, etc.)
A2:0 could address from 000 to 111 in binary (0-7H)
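The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_address` is made up for this example, and the code models nothing beyond the arithmetic of separating the quadword address (what A31:3 can express) from the 3-bit offset that A2:0 would have carried.

```python
def split_address(addr):
    """Split a 32-bit address into (quadword base, byte offset 0-7).

    Hypothetical helper for illustration; not part of any Pentium tool.
    """
    addr &= 0xFFFFFFFF                 # stay inside the 32-bit (4GB) address space
    quadword_base = addr & 0xFFFFFFF8  # the part expressible on A31:3
    offset = addr & 0x7                # the part A2:0 would carry (0-7H)
    return quadword_base, offset


for a in (0x00, 0x05, 0x0B, 0x1F):
    base, off = split_address(a)
    print(f"{a:08X}H -> quadword {base:08X}H, offset {off}")
```

Every address from 08H to 0FH maps to the same quadword base (08H), which is why the CPU needs a second set of signals to pick bytes inside the quadword.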
Fig: Output on CPU address lines for addresses within a quadword (addresses 00000000H - 0000000FH)
Fig: Output on CPU address lines for addresses within a quadword (addresses 00000000H - 0000001FH)
The Pentium uses Byte Enables to address locations within a QWORD.
BE7#:BE0# (outputs): Byte enable lines, which enable each of the 8 bytes in the 64-bit data
path.
In effect they are a decode of the address lines A2:0, which the Pentium does not generate.
Which lines go active depends on the address and on whether a byte, word, doubleword or
quadword is being accessed.
Relationship of Byte Enables to Locations Addressed within a QWORD

Byte Enable    Data Path Used    Location in Qword
BE0#           D07:00            FIRST
BE1#           D15:08            SECOND
BE2#           D23:16            THIRD
BE3#           D31:24            FOURTH
BE4#           D39:32            FIFTH
BE5#           D47:40            SIXTH
BE6#           D55:48            SEVENTH
BE7#           D63:56            EIGHTH
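As a rough sketch of how the byte enables follow from the address and the access size, the Python function below returns the BEn# lines that would go active. The name `byte_enables` is hypothetical, and the sketch assumes the access fits inside one quadword; accesses that cross a quadword boundary are not modeled here.

```python
def byte_enables(addr, size):
    """Return the active BEn# lines for a `size`-byte access at `addr`.

    Illustrative assumption: the access lies entirely within one quadword.
    """
    offset = addr & 0x7  # position within the quadword (the missing A2:0)
    if size not in (1, 2, 4, 8) or offset + size > 8:
        raise ValueError("access must be 1, 2, 4 or 8 bytes within one quadword")
    return [f"BE{n}#" for n in range(offset, offset + size)]


print(byte_enables(0x102, 1))  # a byte at offset 2
print(byte_enables(0x104, 4))  # a doubleword in the upper half
print(byte_enables(0x100, 8))  # a full quadword
```

A full quadword access activates all eight lines, while a single byte activates exactly one, matching the table row for its position.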
Data bus: The data bus provides a path for data to flow.
The data can flow to/from the microprocessor during a memory or I/O operation.
D63:D0 (bi-directional): The 64-bit data path to or from the processor. The signal W/R#
distinguishes direction.
Control bus: The control bus is used by the CPU to tell the memory and I/O chips what the CPU
is doing.
Typical control bus signals are these:
ADS# (output): Signals that the processor is beginning a bus cycle.
BRDY# (input): Ends the current bus cycle; holding it inactive extends the cycle with wait states.
M/IO# (output): Defines whether the bus cycle is a Memory access or an Input/Output port access.
D/C# (output): Defines whether the bus cycle is Data or Code for a Memory access.
W/R# (output): Indicates whether the bus cycle is a Write or a Read operation.
CACHE# (output): Processor indication of internal cacheability. CACHE# and KEN# are used
together to determine whether a read will be turned into a line fill (burst cycle).
Fig: Ready Logic State Machine Example (signals ADS#, BRDY# and IOCHRDY; TW = Time Wait, i.e. an added wait state; paths shown for a zero-wait-state cache read, a DRAM read, and an ISA bus read access/EPROMs)
Special Cycles Definition
Halt Cycle: CPU waits for INTR, NMI, Reset, or INIT.
Generated when Pentium executes a HLT instruction.
If interrupts are enabled, an INTR (e.g. a timer tick) will bring the microprocessor out
of the halt state.
Shutdown Cycle: CPU waits for NMI, Reset, or INIT.
Generated by the Pentium in two cases:
Triple Fault: The CPU detects a further exception (e.g. General Protection
Fault, Invalid Op Code, Stack Overflow) while executing the Double Fault
Exception handler.
Internal Parity Error detected by the CPU.
In a PC, shutdown is decoded by the system board, which generates a soft reset (INIT to the Pentium).
Microprocessor Single Bus Cycle

Basic Burst Read Cycle
Burst Cycles can transfer larger quantities of data in fewer clocks than single transfer cycles.
e.g. - Single Cycle: 8 bytes (64 bits) in 2 clocks.
4 single cycles = 32 bytes in 8 clocks.
e.g. - Burst Cycle: 32 bytes (4*64 bits) in 5 clocks.
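The comparison above is simple arithmetic, and a short Python sketch makes the throughput difference explicit. The clock counts are taken directly from the example figures in these notes; the constant and function names are invented for this illustration.

```python
SINGLE_CYCLE_CLOCKS = 2   # one 8-byte transfer (figure from the example above)
BURST_CLOCKS = 5          # four 8-byte transfers in one burst (from the example above)
BYTES_PER_TRANSFER = 8    # 64-bit data bus
LINE_BYTES = 32           # four transfers


def clocks_for_line_with_singles():
    """Clocks needed to move 32 bytes using only single transfer cycles."""
    transfers = LINE_BYTES // BYTES_PER_TRANSFER
    return transfers * SINGLE_CYCLE_CLOCKS


singles = clocks_for_line_with_singles()
print("single cycles:", singles, "clocks,", LINE_BYTES / singles, "bytes/clock")
print("burst cycle:  ", BURST_CLOCKS, "clocks,", LINE_BYTES / BURST_CLOCKS, "bytes/clock")
```

With these figures the burst moves the same 32 bytes in 5 clocks instead of 8.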
The Pentium uses burst mode for:
Cacheable read cycles
Write-back cycles when writing back a cache line.
Burst cycles are limited to an address area that begins on a 32-byte boundary. The Pentium
outputs only the first address; the system board must generate the other three burst
addresses (2nd, 3rd, 4th) in the following sequence.

1st (output by Pentium)    2nd    3rd    4th
00H                        08H    10H    18H
08H                        00H    18H    10H
10H                        18H    00H    08H
18H                        10H    08H    00H

This is required in order to fill a 32-byte (20H) Pentium cache line on a 32-byte boundary.
e.g. 32 bytes for addresses 100H - 11FH
100H, 108H, 110H, 118H; or 110H, 118H, 100H, 108H, etc
These addresses can be generated as follows:
A3 toggles for every address; A4 toggles every OTHER address
Burst order 00H 08H 10H 18H:
A7 A6 A5 A4 A3 A2 A1 A0    Address
 0  0  0  0  0  0  0  0    00H (output by Pentium)
 0  0  0  0  1  0  0  0    08H
 0  0  0  1  0  0  0  0    10H
 0  0  0  1  1  0  0  0    18H

Burst order 08H 00H 18H 10H:
A7 A6 A5 A4 A3 A2 A1 A0    Address
 0  0  0  0  1  0  0  0    08H (output by Pentium)
 0  0  0  0  0  0  0  0    00H
 0  0  0  1  1  0  0  0    18H
 0  0  0  1  0  0  0  0    10H
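One compact way to reproduce the burst sequences listed above is to XOR the first address with 00H, 08H, 10H and 18H in turn, which toggles A3 on every step and A4 on every other step. The Python sketch below does exactly that; the function name `burst_sequence` is made up for this example, and the XOR formulation is just a convenient restatement of the table, not a description of any particular system-board circuit.

```python
def burst_sequence(first_addr):
    """Return the four quadword addresses of a burst starting at `first_addr`.

    Sketch only: reproduces the order given in the table by toggling A3 and A4.
    """
    first_addr &= ~0x7  # the Pentium only outputs quadword-aligned addresses
    return [first_addr ^ toggle for toggle in (0x00, 0x08, 0x10, 0x18)]


for start in (0x00, 0x08, 0x10, 0x18, 0x110):
    print(" ".join(f"{a:03X}H" for a in burst_sequence(start)))
```

Starting at 110H this yields 110H, 118H, 100H, 108H, which agrees with the 100H - 11FH example.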
Pipelining
The pipelines are called u and v pipes.
The u-pipe can execute any instruction, while the v-pipe can execute simple
instructions as defined in the Instruction Pairing Rules.
When instructions are paired, the instruction issued to the v-pipe is always the next
sequential instruction after the one issued to the u-pipe.
The integer pipeline stages are as follows:
1. Prefetch(PF) :
Instructions are prefetched from the on-chip instruction cache
2. Decode1(D1):
Two parallel decoders attempt to decode and issue the next two sequential instructions
It checks whether the instructions can be paired
It decodes the instruction to generate a control word
A single control word causes direct execution of an instruction
Complex instructions require microcoded control sequencing
3. Decode2(D2):
Decodes the control word
Addresses of memory-resident operands are calculated
4. Execute (EX):
The instruction is executed in ALU
Data cache is accessed at this stage
Instructions that need both an ALU operation and a data cache access require more than one clock in this stage.
5. Writeback(WB):
The CPU stores the result and updates the flags
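To visualize how the five integer stages overlap, the following Python sketch prints an idealized timing chart for a single pipe. It assumes one instruction enters per clock with no stalls, no pairing and no branches, so it shows only the basic overlap; the names `STAGES` and `pipeline_chart` are invented for this illustration.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]


def pipeline_chart(n_instructions):
    """Return one row per instruction giving its stage in each clock.

    Idealized: a single pipe, one instruction issued per clock, no stalls.
    """
    total_clocks = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_clocks
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i is in stage s during clock i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(4), start=1):
    print(f"I{i}: " + " ".join(row))
```

Under these assumptions n instructions finish in n + 4 clocks rather than 5n, which is the gain that pipelining provides.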
The floating point pipeline has 8 stages as follows:
1. Prefetch(PF) :
Instructions are prefetched from the on-chip instruction cache
2. Instruction Decode(D1):
Two parallel decoders attempt to decode and issue the next two sequential
instructions
It checks whether the instructions can be paired
It decodes the instruction to generate a control word
A single control word causes direct execution of an instruction
Complex instructions require microcoded control sequencing
3. Address Generate (D2):
Decodes the control word
Addresses of memory-resident operands are calculated
4. Memory and Register Read (Execution Stage) (EX):
Register read or memory read performed as required by the instruction to
access an operand.
5. Floating Point Execution Stage 1(X1):
Information from register or memory is written into FP register.
Data is converted to floating point format before being loaded into the floating
point unit
6. Floating Point Execution Stage 2(X2):
Floating point operation performed within floating point unit.
7. Write FP Result (WF):
Floating point results are rounded and the result is written to the target
floating point register.
8. Error Reporting(ER):
If an error is detected, an error reporting stage is entered where the error is
reported and the FPU status word is updated.
Branch Prediction Logic

Flushing of pipeline problem
The performance gain from pipelining can be reduced by the presence of program
transfer instructions (such as JMP, CALL, RET and conditional jumps).
They change the instruction sequence, invalidating all the instructions that entered the pipeline after
the program transfer instruction.
Suppose instruction I3 is a conditional jump to I50 at some other address (the target
address). Then the instructions that entered after I3 are invalid, and a new sequence
beginning with I50 needs to be loaded.
This causes bubbles in the pipeline, where no work is done while the pipeline stages are
reloaded.
To avoid this problem, the Pentium uses a scheme called Dynamic Branch Prediction.
In this scheme, a prediction is made concerning the branch instruction currently in the
pipeline.
The prediction will be either taken or not taken.
If the prediction turns out to be true, the pipeline is not flushed and no clock
cycles are lost.
If the prediction turns out to be false, the pipeline is flushed and restarted with the
correct instruction.
A misprediction results in a 3-cycle penalty if the branch is executed in the u-pipeline and a 4-cycle
penalty if it is executed in the v-pipeline.
It is implemented using a 4-way set associative cache with 256 entries. This is
referred to as the Branch Target Buffer(BTB).
The directory entry (tag) for each line contains the following information:
Valid Bit: Indicates whether or not the entry is in use
History Bits: Track how often the branch has been taken
Source memory address that the branch instruction was fetched from (address
of I3)
If the directory entry is valid, the target address of the branch (address of I50)
is stored in the corresponding data entry in the BTB
The BTB is a look-aside cache that sits off to the side of the D1 stages of the two pipelines and
monitors for branch instructions. The first time that a branch instruction enters either
pipeline, the BTB uses its source memory address to perform a lookup in the cache.
Since the instruction has not been seen before, this results in a BTB miss. This means the
prediction logic has no history for the instruction. It then predicts that the branch will not
be taken, and program flow is not altered. Even unconditional jumps are predicted
as not taken the first time that they are seen by the BTB. When the instruction reaches the
execution stage, the branch will be either taken or not taken. If taken, the next
instruction to be executed should be the one fetched from the branch target address. If not
taken, the next instruction is the one at the next sequential memory address. When the branch is
taken for the first time, the execution unit provides feedback to the branch prediction
logic. The branch target address is sent back and recorded in the BTB. A directory entry
is made containing the source memory address, with the history bits set to strongly taken.
History Bits   Resulting Description   Prediction Made    If branch is taken            If branch is not taken
11             Strongly Taken          Branch Taken       Remains Strongly Taken        Downgrades to Weakly Taken
10             Weakly Taken            Branch Taken       Upgrades to Strongly Taken    Downgrades to Weakly Not Taken
01             Weakly Not Taken        Branch Not Taken   Upgrades to Weakly Taken      Downgrades to Strongly Not Taken
00             Strongly Not Taken      Branch Not Taken   Upgrades to Weakly Not Taken  Remains Strongly Not Taken
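The history-bit transitions in the table amount to a two-bit saturating counter per branch. The Python sketch below models that behavior together with the first-encounter rule described earlier (a BTB miss predicts not taken, and an entry is created as strongly taken when the branch is first taken). It is a simplified model under stated assumptions: the class name `BranchTargetBuffer` and its methods are invented here, and the 4-way set associativity and 256-entry limit are deliberately not modeled.

```python
class BranchTargetBuffer:
    """Simplified BTB model: unlimited entries, 2-bit history per branch."""

    STRONGLY_TAKEN = 0b11

    def __init__(self):
        self.entries = {}  # source address -> [history bits, target address]

    def predict(self, branch_addr):
        """Return (predicted_taken, predicted_target)."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            return False, None          # BTB miss: predict not taken
        history, target = entry
        taken = history >= 0b10         # 11 and 10 predict taken
        return taken, (target if taken else None)

    def update(self, branch_addr, taken, target=None):
        """Feed the actual outcome back from the execution stage."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if taken:                   # first time taken: allocate as strongly taken
                self.entries[branch_addr] = [self.STRONGLY_TAKEN, target]
            return
        if taken:
            entry[0] = min(entry[0] + 1, 0b11)   # upgrade, saturating at 11
            if target is not None:
                entry[1] = target
        else:
            entry[0] = max(entry[0] - 1, 0b00)   # downgrade, saturating at 00


btb = BranchTargetBuffer()
for outcome in (True, False, False, True):
    predicted, _ = btb.predict(0x1000)
    print(f"predicted taken={predicted}, actual taken={outcome}")
    btb.update(0x1000, outcome, target=0x2000)
```

Because a new entry starts at strongly taken, two consecutive not-taken outcomes are needed before the prediction flips to not taken.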
Floating-Point Unit
Floating-Point Format
To increase the speed and efficiency of real-number computations, computers or FPUs typically
represent real numbers in a binary floating-point format. In this format, a real number has three
parts: a sign, a significand, and an exponent. Figure 31-2 shows the binary floating-point format that
the Intel Architecture FPU uses. This format conforms to the IEEE standard.
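To make the three parts concrete, the Python sketch below pulls the sign, biased exponent and fraction field out of a number. It uses the 64-bit IEEE double format, because that is what Python's `float` and the standard `struct` module expose; the FPU's internal 80-bit extended format has wider fields and is not shown. The function name `decompose_double` is made up for this example.

```python
import struct


def decompose_double(x):
    """Return (sign, biased exponent, fraction field) of a 64-bit IEEE double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # reinterpret the 8 bytes as an integer
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 bits of the significand after the leading 1
    return sign, exponent, fraction


sign, exponent, fraction = decompose_double(-6.5)
print(sign, exponent - 1023, hex(fraction))
```

For -6.5, which is -1.625 x 2^2, the sign bit is 1, the unbiased exponent is 2, and the fraction field holds the .625 part.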
FPU Architecture
From an abstract, architectural view, the FPU is a coprocessor that operates in parallel with the
processor's integer unit (see Figure 31-4). The FPU gets its instructions from the same instruction
decoder and sequencer as the integer unit and shares the system bus with the integer unit. Other
than these connections, the integer unit and FPU operate independently and in parallel. (The actual
microarchitecture of an Intel Architecture processor varies among the various families of
processors. For example, the Pentium Pro processor has two integer units and two FPUs; whereas,
the Pentium processor has two integer units and one FPU, and the Intel486 processor has one
integer unit and one FPU.)
The instruction execution environment of the FPU (see Figure 31-5) consists of 8 data registers
(called the FPU data registers) and the following special-purpose registers:
The status register.
The control register.
The tag word register.
Instruction pointer register.
Last operand (data pointer) register.
Opcode register.
These registers are described in the following sections.
The FPU Data Registers
The FPU data registers (shown in Figure 31-5) consist of eight 80-bit registers. Values are stored in
these registers in the extended-real format shown in Figure 31-17. When real, integer, or packed
BCD integer values (in any of the formats shown in Figure 31-17) are loaded from memory into
any of the FPU data registers, the values are automatically converted into extended-real format (if they are not
already in that format). When computation results are subsequently transferred back
into memory from any of the FPU registers, the results can be left in the extended-real format or
converted back into one of the other FPU formats (real, integer, or packed BCD integers) shown in
Figure 31-17.
The FPU instructions treat the eight FPU data registers as a register stack (see Figure 31-6). All
addressing of the data registers is relative to the register on the top of the stack. The register
number of the current top-of-stack register is stored in the TOP (stack TOP) field in the FPU status
word. Load operations decrement TOP by one and load a value into the new top-of-stack register,
and store operations store the value from the current TOP register in memory and then increment
TOP by one. (For the FPU, a load operation is equivalent to a push and a store operation is
equivalent to a pop.)
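The stack behavior of the data registers can be sketched in Python as follows. This is a toy model: the class `FpuStack` and its method names are invented for this illustration, TOP is assumed to start at 0 so that the first load lands in physical register 7, and the tag word and stack-fault detection are not modeled.

```python
class FpuStack:
    """Toy model of the eight FPU data registers addressed relative to TOP."""

    def __init__(self):
        self.regs = [None] * 8  # physical registers R0..R7
        self.top = 0            # TOP field of the status word (assumed initial value)

    def load(self, value):
        """Load = push: decrement TOP, then write the new top-of-stack register."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def store_and_pop(self):
        """Store = pop: read the top-of-stack register, then increment TOP."""
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        """Return ST(i), the register i places below the top of the stack."""
        return self.regs[(self.top + i) % 8]


fpu = FpuStack()
fpu.load(1.5)
fpu.load(2.5)
print(fpu.top, fpu.st(0), fpu.st(1))
print(fpu.store_and_pop(), fpu.top)
```

The modulo arithmetic reflects that TOP is a 3-bit field, so the register numbering wraps around from 0 to 7.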
FPU Status Register

The 16-bit FPU status register (see Figure 31-8) indicates the current state of the FPU. The flags
in the FPU status register include the FPU busy flag, top-of-stack (TOP) pointer, condition code
flags, error summary status flag, stack fault flag, and exception flags. The FPU sets the flags in this
register to show the results of operations.
The contents of the FPU status register (referred to as the FPU status word) can be stored in
memory using the FSTSW/FNSTSW, FSTENV/FNSTENV, and FSAVE/FNSAVE instructions. It
can also be stored in the AX register of the integer unit, using the FSTSW/FNSTSW instructions.
FPU Control Word
The 16-bit FPU control word (see Figure 31-10) controls the precision of the FPU and the rounding
method used. It also contains the exception-flag mask bits. The control word is cached in the FPU
control register. The contents of this register can be loaded with the FLDCW instruction and stored
in memory with the FSTCW/FNSTCW instructions.
When the FPU is initialized with either an FINIT/FNINIT or FSAVE/FNSAVE instruction, the
FPU control word is set to 037FH, which masks all floating-point exceptions, sets rounding to
nearest, and sets the FPU precision to 64 bits.
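The meaning of 037FH can be checked by decoding its bit fields. The Python sketch below assumes the usual x87 control word layout, which is not spelled out in these notes: exception mask bits in bits 0-5, precision control in bits 8-9, and rounding control in bits 10-11. The function name `decode_control_word` and the label strings are made up for this example.

```python
# Assumed x87 control word layout (not given in the text above):
# bits 0-5 exception masks, bits 8-9 precision control, bits 10-11 rounding control.
MASK_NAMES = ["invalid operation", "denormal operand", "zero divide",
              "overflow", "underflow", "precision"]
PRECISION = {0b00: "24 bits", 0b01: "reserved", 0b10: "53 bits", 0b11: "64 bits"}
ROUNDING = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "toward zero"}


def decode_control_word(cw):
    """Decode the mask, precision and rounding fields of an FPU control word."""
    masked = [name for bit, name in enumerate(MASK_NAMES) if cw & (1 << bit)]
    return {
        "masked_exceptions": masked,
        "precision": PRECISION[(cw >> 8) & 0b11],
        "rounding": ROUNDING[(cw >> 10) & 0b11],
    }


print(decode_control_word(0x037F))
```

Under that layout, 037FH has all six mask bits set, precision control 11 and rounding control 00, which agrees with the description above.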
Operating Modes
The Pentium and Pentium Pro processors have three operating modes:
Real-address mode. This mode lets the processor address "real" (physical) memory directly.
It can address up to 1 Mbyte of memory (20 address bits). It can also be called
"unprotected" mode, since operating system code (such as DOS) runs in the same
mode as user applications. Pentium and Pentium Pro processors provide this mode
for compatibility with early Intel processors such as the 8086. The processor is placed in this
mode following a power-up or a reset and can be switched to protected mode using
a single instruction.
Protected mode. This is the preferred mode for a modern operating system. It allows
applications to use virtual memory addressing and supports multiple programming
environments and protection mechanisms.
System management mode. This mode is designed for fast state snapshot and
resumption. It is useful for power management.
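The 1 Mbyte real-address limit follows directly from the 20 address bits. As a small illustration, the Python sketch below computes that limit and also shows the 8086-style segment:offset calculation that produces a 20-bit address. The segment:offset formula is standard 8086 behavior rather than something stated in these notes, the function names are made up, and masking the result to 20 bits is a simplification that models the 8086 wrap-around.

```python
def real_mode_limit(address_bits=20):
    """Number of bytes addressable with the given number of address bits."""
    return 1 << address_bits


def real_mode_address(segment, offset):
    """8086-style physical address: segment * 16 + offset, kept to 20 bits.

    Assumption for illustration: the result wraps at 1 Mbyte as on the 8086.
    """
    return ((segment << 4) + offset) & 0xFFFFF


print(real_mode_limit())                          # bytes in the real-mode address space
print(hex(real_mode_address(0xF000, 0xFFF0)))     # the address reached by F000:FFF0
```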
