Computer Architecture Arm Subtitles
Module 1, Intro
[music] I'm Richard Grisenthwaite, Chief Architect at Arm. What does a Chief
Architect do? The Chief Architect is responsible for the evolution of the Arm
architecture. One of the things about an architecture is, you want your software to
continue running on future products. So it's a very long term commitment, and the
role of the chief architect is essentially to curate the architecture over time,
adding new features as needed by the partner. Why are microprocessors so important
and ubiquitous? I think the fundamental reason is that they are a general purpose
device. They describe a basic instruction set that does the stuff that you can
construct pretty much any program out of. Indeed, they're Turing complete, which
means, by definition, you can solve any computable problem with the processor. They're
not customized to any particular usage, but they
do two things really rather well, which are data processing and decision making. In
other words, having, for example, added two numbers and compared them, you're then
taking a branch and making a decision based off that. Now in reality, an awful lot
of general purpose problems are ones that involve: work out some value or some
criteria, compare against that, and then make a decision on it. And essentially that
allows you to solve a huge number of different problems and that's why the
microprocessor has become so ubiquitous. What is the difference between architecture
and microarchitecture? Architecture is what it has to do, and microarchitecture is
how it does it. So for example, we will define in the architecture a set of
instructions that do "add" or "load" or "branch", but it says nothing about whether
you've got a ten-stage pipeline or a three-stage pipeline. It says nothing about
branch prediction. All of those sort of features which actually determine the
performance of the device; all of those are part of the microarchitecture. The key
point is that software that is written to be compliant on the architecture will
actually run on lots of different implementations, lots of different
microarchitectures that essentially implement that architecture in different ways,
choosing a trade-off between the power, the performance and the area of the design.
What does it take to build a commercially successful processor today? If you
actually start from scratch with an established architecture and want to create
a new microarchitecture, we reckon, for a high end processor, you're talking about
three-to-four hundred person-years' worth of work in order to create a properly
competitive multi-issue out-of-order machine compared with the state-of-the-art that
you can get from Arm. In addition to that — and that's just to come up with the RTL
— in addition to that, you've then got to do the implementation. If you're going to
build that on a three-nanometer process, the leading edge to get the best possible
performance, you're talking about tens of millions of dollars for the mask sets.
There's a whole bunch of software you've got to go and build on top of that and do
all of that work. When we've looked at companies that are interested in actually
starting up, taking an Arm architecture license — say I want to go and build my own
business — we reckon that you need to be prepared to spend getting on for half a
billion dollars before you're actually going to be successful because it takes time.
Your first product is not necessarily going to be fully competitive because it would
be slightly surprising if the first thing that you built was as good as what people
have been building for many years. It takes time to build up that expertise and
skills. And so you're going to see a couple of iterations before even the best teams
end up really being competitive. And so as I say, I think if you went from nothing
and wanted to essentially create something, and that's using the Arm architecture
with all of its existing software and so on, you're talking the best part of half a
billion. What is the Arm business model? The Arm business model fundamentally is the
licensing of semiconductor IP to as wide a range of companies as possible and in as
many ways as possible, in order to maximize the uptake of the IP. When we talk about
IP, we're talking about essentially designs for processors and we license that both
as an architecture license, where you effectively gain the right to build your own
processor to the specification of the Arm architecture, or an implementation license
where we are licensing, actually, an implementation that is compliant with the Arm
architecture in the form of register transfer level code, or RTL. What makes Arm different
from its competitors? So if you go back to where we were when Arm started out in
the early 90s, there were many, many different architectures available and they were
all kind of doing the same thing but slightly differently. They would have different
instructions, different instruction sets, and so software needed to be ported. A
huge part of Arm's success actually comes from the fact that we created a business
model of licensing the IP to make it very easy for people to build processors, to
incorporate the designs into their SoCs, into their systems. And that then made it
very straightforward for people to be able to use the Arm architecture. Now what
this then meant was, people said: I will port more software to this because it's
more widely available and you get this positive feed-forward effect whereby more
availability of hardware encourages more software, encourages more hardware, and so
on. And essentially that meant that a lot of people said: there's no point in me
having a different architecture. I'm not getting a valuable difference from doing
that. All I've got is, kind of, a needless change to the software that I need to
make. So actually the whole model we went with at Arm was: let's license our
IP to a great number of different players to come up with different solutions to
meet the form factor of a camera or a PDA, or whatever it was back in the day. Those
things made it much more straightforward for people to incorporate our technology.
[music]
Module 1, Video 1
Computer architecture is the study of tools and techniques that help us to design
computers. More precisely, it helps us understand how to meet the needs of
particular markets and applications, using the technology and components that are
available. For example, we might need to produce the chip at the heart of a
smartphone using 10 billion transistors and a power of less than two Watts. How do
we achieve the performance a customer wants? The challenge is a fascinating one, and
one that requires a broad understanding. For example, what target market are we
designing for? What are the characteristics of the applications to be run? How will
programming languages and compilers interact with the microprocessor? How best to
craft the narrow interface between the hardware and software? How to organize the
components of our microprocessor? And how to design a circuit, given the
characteristics of individual transistors and wires? Like many design problems,
computer architecture requires many trade-offs to be made and evaluated. Each design
decision will impact trade-offs between size, performance, power, security,
complexity, and cost. Trade-offs must be re-evaluated regularly, due to advances in
fabrication technology, applications, and computer architecture. Computer
architecture must be grounded in quantitative techniques and experimentation, but
the endless number of possible designs means that the field depends on a high degree
of human ingenuity and art. Perhaps surprisingly, the earliest computers and today's
most advanced machines have much in common. They both execute a stored program
constructed from machine instructions. These instructions perform simple operations
such as adding two numbers. Nevertheless, greater numbers of faster transistors, and
the application of a relatively small number of computer architecture concepts, have
enabled us to construct machines that can perform billions of instructions per
second, and shrink these machines to fit in hand-held battery-powered devices. It is
this rapid progress that has supported breakthroughs in machine learning, drug
discovery, climate modeling, and supports our modern world where computation and
storage are almost free. The task of designing a microprocessor is split into
different levels of abstraction: "Architecture;" "Microarchitecture;" and
"Implementation." "Architecture" focuses on the contract between programmers and
hardware. It allows compatible families of microprocessor products to be built. The
ARMv8-A architecture is an example of this. Architecture includes the "Instruction
Set Architecture," or ISA, which defines what instructions exist. It also defines
precisely the behavior of memory and other features needed to build a complete
processor. "Microarchitecture" focuses on the organization and structure of the
major components of a microprocessor. It has to match the rules set by the
architecture. The microarchitecture still has flexibility though; and so the
implementation specifies the circuit detail precisely. This culminates in the exact
circuit design for manufacture. Each of these levels is vital, and each comes with
its own challenges and opportunities.
Module 1, Video 2
At the beginning of the 20th century, "computers" were people employed to perform
calculations. These computers used mechanical calculators to help them perform
arithmetic. They followed instructions to decide what calculation to perform next.
These instructions defined the algorithm or program they were executing. They would
consult paper records to find the inputs to their calculations, and would store
their intermediate results on paper so they could refer back to them later. A modern
electronic computer is organized in a similar way. We will look in more detail at
these components of a microprocessor in the next video, but for now let's look at
how it operates. Microprocessors also follow instructions one by one, and then
perform relevant calculations. This idea of fetching instructions from memory and
executing them is called the "Fetch-Execute Cycle." In the "Fetch" stage, the
computer reads the next instruction from the program. This instruction is encoded in
binary as ones and zeroes, so it must be decoded to understand what it means. This
is done in the "Decode" stage. Once it is clear what to do, we move to the "Execute"
phase. This can involve different tasks such as reading memory, performing a
calculation, and storing a result. Once done, the computer is then ready to begin
the cycle again, by fetching the next instruction. Instructions are normally fetched
sequentially in order, but some special instructions called "branches" can change
which instruction will be executed next. For branches, a calculation determines the
next instruction. This can mean evaluating a condition, or reading a register to
determine the next instruction's location. Branches allow computers to make
decisions, and to re-use instruction sequences for common tasks. A modern computer
program, like a web browser, contains millions of instructions, and computers
execute billions of instructions per second, but they all conceptually follow this
"Fetch-Execute Cycle."
Module 1, Video 3
Modern microprocessors are circuits built using anywhere from 1,000 to 100 billion
tiny transistors. The key to designing circuits with such huge numbers of parts is
to build them from re-usable blocks. Let's take a look at some of these. Each
transistor is an electrically-controlled switch. When there is too little voltage at
the gate, the switch is off, so the electrical signal cannot propagate from the
drain to the source. However, when there is sufficient voltage at the gate, the
switch is on, so the signal does propagate. When designing processors, we use
digital electronics. The only voltage or current values we consider represent zero
and one, enabling us to build robust circuits from imprecise components, even in the
event of electrical noise or manufacturing imperfections. This means our circuits
represent binary numbers, since they only have two states: zero and one. We use
transistors to build increasingly complex circuits. We can design circuits that can
remember binary values, or select between multiple inputs, or even do arithmetic,
like addition. We can then use those circuits as building blocks in even larger
designs. When designing a digital system, we must keep everything synchronized to
control when its behavior occurs. To do this, we use a "Clock Signal," which is a
wire in the circuit whose signal cycles between zero and one. We measure the rate in
Hertz. For example, if you hear a processor has a clock speed of two gigahertz, it
means the clock signal cycles between zero and one two billion times per second. The
maximum speed of the clock signal is
determined by the longest, and therefore slowest, path in the circuit between 2
clocked flip-flops. This is referred to as the "Critical Path." The signal must have
time to propagate all the way along the critical path before the clock cycle
completes. A microprocessor needs to do many types of arithmetic. For this, we build
an "Arithmetic Logic Unit" or "ALU." This circuit receives 2 numbers as input, as
well as an indication of what operation to perform, for example, addition or
subtraction. In addition to logic, a microprocessor needs memory. Memory is
organized as arrays of memory cells that are able to store many "Words" of data. A
specific word, commonly 32 bits in length, can be accessed by specifying its
"Address." Each address is a number that indicates the location in the memory that
should be read or written. Memory cells range from hundreds of bits to millions of
bits in size, but larger ones are slower to access, as signals on their long
internal wires take longer to propagate. For that reason, almost all microprocessors
include at least two types of memory for storing data: a big slow "Data Memory," and a small
fast memory called a "Register File." In reality, the "Data Memory" may be
implemented using many different sizes of memory, as we'll see in Module 4. As well
as storing data in memory, we also use some memory to store the instructions. We
need a way to keep track of which instruction we will fetch next, so we have a
"Program Counter." This stores the address in Instruction Memory of the next
instruction to be accessed. Since instructions are encoded in binary, we also have
"Instruction Decode Logic" that converts that binary to the various signals needed
to control the microprocessor.
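As a small worked example of the clock-rate point in this video: the maximum clock frequency is set by the delay of the critical path. The numbers below are illustrative rather than taken from any particular design.

$$ f_{\max} = \frac{1}{t_{\text{critical path}}}, \qquad \text{e.g. } t_{\text{critical path}} = 0.5\ \text{ns} \;\Rightarrow\; f_{\max} = \frac{1}{0.5\ \text{ns}} = 2\ \text{GHz}. $$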
Module 1, Video 4
We've already seen how a microprocessor is controlled by instructions. But what are
they really? An instruction is a simple command that the microprocessor hardware can
perform directly. We write them as text like this to make them easier to read. But
for the microprocessor we encode them in binary. We use a program called an
assembler to translate between the human text and the binary. In this video we'll be
looking specifically at an Arm instruction set called A64 but other instruction sets
follow similar principles. Arithmetic and logic instructions are the simplest type
of instruction. The first word tells us what operation will be performed such as
addition or multiplication. The values after this tell the processor where to put
the result and where to get the inputs. Values starting with X are addresses in the
register file. Arithmetic instructions read one or two registers and then put the
result into a third register. Branch instructions are used to make decisions and to
repeat instructions. Normally the microprocessor executes instructions in sequential
order but branches change that, and explicitly tell the microprocessor the address
of the instruction to run next. This is done by giving the address of the next
instruction in the instruction memory. Some branches are unconditional, meaning they
always occur and always affect the next instruction address. Other branches are
conditional, meaning the processor will perform a calculation to decide whether to
follow the branch or to continue executing instructions sequentially following the
branch. These are preceded by a comparison instruction that calculates the condition.
Loads and stores are the instructions for accessing the data memory. Loads copy
values from memory to the register file. Stores do the opposite. In both cases, the
instruction needs to know the address in the data memory and the location in the
register file to copy between. For data memory, loads and stores read an address
from a base register. They can also optionally add to this base address by reading
another register, or by simply specifying a number in the instruction itself. Using
sequences of instructions we can build programs. Here is an example of a small
program that implements Euclid's greatest common divisor algorithm. Let's take a
look at it working, one instruction at a time. To start with, the inputs stored in X1
and X2 are compared. If they're equal, because we have found the greatest common
divisor, a conditional branch instruction moves to instruction 7. If they're not
equal, another conditional branch instruction can be used to determine whether X1 is
smaller. Finally, we use an arithmetic instruction to subtract either X1 or X2
depending on which was larger and then unconditionally branch back to the start.
Here we can see an example of the program running. One instruction executes at a
time. As we step through the program, the values in the registers X1 and X2 are
shown. And we see these are modified each time we reach instruction 3 or 5. The
instructions repeat in a loop, as is common for many different types of program.
Each time round the loop, instructions 0 and 1 check whether or not to continue
the loop. When they detect X1 and X2 have become equal, the program finishes and the
processor moves on to the next task.
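The eight-instruction program described above can be paraphrased in Python as below. This is only a behavioral sketch of the subtraction-based greatest common divisor loop; the instruction numbers in the comments loosely mirror the walkthrough in the video, and the exact A64 listing is not reproduced here.

```python
def gcd(x1, x2):
    # Mirrors the loop in the video: the first two instructions compare X1 and X2
    # and branch out of the loop when they are equal; later instructions subtract
    # the smaller value from the larger, then branch back to the start.
    while True:
        if x1 == x2:          # instructions 0-1: compare, conditional branch to the end
            return x1         # instruction 7: X1 now holds the greatest common divisor
        if x1 < x2:           # conditional branch on which value is smaller
            x2 = x2 - x1      # instruction 5: subtract X1 from X2
        else:
            x1 = x1 - x2      # instruction 3: subtract X2 from X1
        # unconditional branch back to the start of the loop

print(gcd(12, 18))            # 6
```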
Module 1, Lab
In this exercise, we're going to be using ASim. ASim is a behavioral simulator for a
subset of the Arm AArch64 instruction set. "Behavioral simulator" means that it's not
simulating the real circuit-level details of a microprocessor; it's just simulating the
behavior of each instruction as it runs. Nevertheless, behavioral simulators are
vital tools for computer architects. We use them to check that the designs we've built
match the behavior we intended, and we can also use them to explore new ideas,
for example adding new instructions to the instruction set. We're just going to be
using it though to get familiar with the basics of Arm AArch64. So we can create a
new file which will allow us to type in some Arm AArch64 instructions and when we
press assemble that will cause ASim to begin the simulation of the instructions. If
we scroll down we can see that below we've put a little cheat sheet of Arm AArch64
instructions that you can use as a reference. Do note that these are just examples
though. So for example this first one is mov X1, X2 and the descriptive text says
that this copies the value of register X2 to register X1. But equally that would
imply that if we use mov X3, X4 instead that would copy the value of X4 to the
register X3. So we don't have to just use these exact instructions, we can tweak
them as we need to, to achieve our goals. So going back to the ASim, let's say that
we wanted to add the number two to the number three. In our cheat sheet we could see
that there is an 'add' instruction but it requires that the inputs for the addition
are stored in registers. So first of all, we're going to need to load up two
registers with the numbers that we want to add in this case two and three. So
looking at our cheat sheet again we can see that there's an instruction mov that
allows us to do that. So if I write mov X2, #3 this would cause the register X2 to
be loaded with the value 3, and similarly I could write mov X1, #2 and this would
cause the register X1 to be loaded with the value two. Last but not least, we
could then do the addition we wanted to do, such as add X3, X1, X2. This will cause X1
to be added to X2 and the result stored in X3. If I press assemble now, we will see
that the UI changes slightly. Here we can now see the simulated state of the
processor. Because it's a behavioral simulator, ASim is only going to show the state
before and after each instruction. And so right now we are
before executing the first instruction. That's the yellow highlighting there and we
can also see it in the simulated memory. If I press step we can now see that that
instruction has completed, and we are now before executing mov X1, #2. And notably we
can see that X2 has been loaded with the number 3, which is exactly what we would
expect. If I press step again, we can see that mov X1, #2 has now occurred, which has
loaded the value 2 into the register X1. And
last but not least, if we step again we see that X3 has become equal to five which
is exactly what we would expect if register X1 was added to register X2. So this
allows us to get a feel for what it's like for a machine to execute these
instructions. We can reset back to the beginning if we want to watch it go through
again. If we want to do an edit for example adding a new instruction we can just do
that. Press assemble and that will add the new instruction to our program, and we
can see its effects by simulating. Now in the exercise you will be invited to take
an existing program and add 1 new instruction to it at the indicated position. Do
note that when you assemble the exercise program, you'll be taken to a gray bit of
code, which is our testing code. But as you step, you'll see that the simulator
flicks between the testing code and the code that we wrote, the exercise one code.
You can also click between them using this. Feel free to try to work out what our
testing code does but you can just ignore it if you want to. The point of the
exercise is just to add the instruction in the indicated position. When you're
comfortable that you've got the right instruction, you can press run to get all the way to
the end. And if you really think that's correct, scroll down to the bottom of
the page and you'll see the submit button, which you can use to send in your answer. Best
of luck.
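For reference, the way a behavioral simulator such as ASim treats the three-instruction example from this lab can be sketched in a few lines of Python. This is only an illustration of "simulating the behavior of each instruction"; it is not ASim's actual implementation.

```python
# Behavioral sketch: track only the architectural state (here, the X registers)
# and apply each instruction's effect in turn. Not ASim's real code.
registers = {f"X{i}": 0 for i in range(31)}

program = [
    ("mov", "X2", 3),          # mov X2, #3
    ("mov", "X1", 2),          # mov X1, #2
    ("add", "X3", "X1", "X2"), # add X3, X1, X2
]

for op, dest, *srcs in program:
    if op == "mov":
        registers[dest] = srcs[0]
    elif op == "add":
        registers[dest] = registers[srcs[0]] + registers[srcs[1]]
    print(op, dest, srcs, "->", {r: registers[r] for r in ("X1", "X2", "X3")})
# Final state: X1 = 2, X2 = 3, X3 = 5, matching what we saw when stepping in ASim.
```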
Module 2, Intro
[music] Hello, my name is Martin Weidmann. I'm an engineer and product manager with
Arm's Architecture and Technology Group. I look after the A-profile architecture and
I maintain Arm's Interrupt Controller specifications. Computer architecture is
sometimes called a science of trade-offs. Why is everything a trade-off when it
comes to designing a processor? Let's take an example of where we've had to make
trade-offs when developing processor architecture. So the Arm architecture has a
feature called Pointer Authentication. This is often abbreviated to PAC for Pointer
Authentication Code. What this feature is trying to do is protect against a form of
attack called ROP and JOP. These are Return Oriented and Jump Oriented
Programming, and it's where an attacker tries to subvert things like the call stack
to run legitimate code, but in ways that weren't expected by the programmer or the
compiler. PAC or Pointer Authentication tries to defend against this kind of attack
by using part of an address to provide an encrypted signature. So we can check the
signature and the address match and if they don't, we can spot an attack in
progress. So why is this a trade-off? Well, because to add security, we want that
signature to be as big as possible. The bigger the signature, the more bits we use
for that, the stronger cryptographically that signature is. The trade-off is: the
more bits we use for the signature, the fewer bits we have available for other
things, such as the address. So you can have a big signature with a small address,
but if you want the address to get bigger, then you get a smaller signature, and
that's then cryptographically weaker. So the trade-off we have to make when
designing a technology like that is: What's the right amount of bits for the
signature? What's the strength of cryptography we need from that signature in order
to get the design goal, which is to defeat these attacks and give us more robust
computing? What sort of guiding principles do you use when designing a processor? So
when you're designing a processor, the key thing you have to bear in mind is: What's
it going to be used for? What you don't want to end up with is a very expensive
paperweight. So we need to understand the requirements that the processor has to
meet. We have to understand the design trade-offs we're making and how they work
into meeting those requirements. We also have to consider not just the processor
itself, but how we're going to show that that processor is correct. We'll put as
much time, if not more, into testing and validating the design as we do into
designing it. How do you design a new microprocessor? If you wanted to create a new
processor from scratch, the first thing you're going to have to do is understand the
market that that processor is going to address and to then build a team to design
that processor. There isn't such a thing as one processor for every possible market.
The requirements for something like an embedded microcontroller are going to be very
different to what you want from the processor in your mobile phone, your laptop, or
your server. So you need to understand those requirements as the first step into
building a new processor. What determines the best design for a microprocessor? So
when you're designing a processor, you need to work out what the best design for a
given market or application is going to be. There's no magic formula for this. It's
going to depend a lot on what you're trying to achieve with that processor. You need
to understand things like the power requirements, the performance requirements. Is
it going to work in a highly noisy electrical environment? There's a big difference
between the reliability requirements you need from something like a watch versus a
satellite. So you would take those requirements and you'd work out what the best set
of trade-offs is going to be, and that's an art more than it is a science. How do
the underlying technologies contribute to this best design? A lot of technologies go
into our processor. There's the design of the microarchitecture, the implementation
of the processor. There's the silicon process you're going to use, how you're going
to integrate that processor into an SoC or ASIC. Is it a single die, or is it going
to be using chiplets or multiple sockets? All of those different technologies are
going to be factors in how you design the processor, what trade-offs you make, and
what performance and power you get out of the design once you're done. In reality
there may be many different 'best' designs, so how do you pick one? So when you're
designing a processor, what you want is the best design. But often there isn't "a"
best design, there's just different trade-offs. You have to decide what the best set
of trade-offs is for the particular use case you're going for. And that's also going
to depend on: Is this a device that will be off the shelf, used for lots of
different applications—a general purpose processor? Or is this being designed for
one specific use case? Again, there isn't really a magic bullet or single answer for
this type of question. You need to understand how the processor is going to be used
and then use your experience to judge the trade-offs, and what will give you the
best mix of power, performance, area, cost, reliability for your target use case.
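To illustrate the signature-versus-address trade-off described above, here is a deliberately simplified toy sketch. Real Pointer Authentication uses a dedicated hardware cipher with keys managed by privileged software; the keyed hash, the fixed 48-bit address size and the 16-bit signature below are assumptions chosen purely to show how the signature occupies otherwise-unused upper pointer bits, and how its width trades off against the address width.

```python
import hmac, hashlib

# Toy example only: the bit split below is an assumption, not the Arm PAC design.
ADDR_BITS = 48                      # bits kept for the virtual address
SIG_BITS = 64 - ADDR_BITS           # bits left over for the signature

def sign_pointer(ptr, key):
    addr = ptr & ((1 << ADDR_BITS) - 1)
    sig = int.from_bytes(hmac.new(key, addr.to_bytes(8, "little"),
                                  hashlib.sha256).digest()[:8], "little")
    sig &= (1 << SIG_BITS) - 1      # truncate: more signature bits = stronger, fewer address bits
    return (sig << ADDR_BITS) | addr

def authenticate_pointer(signed_ptr, key):
    addr = signed_ptr & ((1 << ADDR_BITS) - 1)
    if sign_pointer(addr, key) != signed_ptr:
        raise ValueError("pointer authentication failed: possible ROP/JOP attack")
    return addr

key = b"per-process secret key"
p = sign_pointer(0x0000_7fff_1234_5678, key)
assert authenticate_pointer(p, key) == 0x0000_7fff_1234_5678
```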
Module 2, Video 1
[music] In this module, we're going to explore how to improve the simple
microprocessor design from Module 1 in order to allow it to execute programs more
efficiently. First, let's find out how long a program takes to execute. The time
taken to perform the average instruction is equal to the number of clock cycles
taken to perform an instruction multiplied by the duration of one clock cycle. The
time taken to run our program is found by multiplying the average time to perform an
instruction by the number of instructions in our program. How could we make this
faster? One thing we could try is to reduce the number of instructions in a program.
We might be able to optimize the code, removing unnecessary and repeated work, and
selecting instructions to minimize code size and maximize performance. We could give
our microprocessor the ability to perform more operations in order to help
programmers or compilers further reduce the number of instructions in their program.
For example, allowing the loading of two data values at the same time might allow
fewer instructions to be used in the program. The downside to this approach is that
adding more instructions will require extra circuitry in the processor and therefore
we likely increase the clock period. If the extra instructions are rarely used this
could even mean an overall decrease in performance. We see this theme often in
computer architecture: trade-offs that we have to carefully balance. Another approach
is to use faster transistors, perhaps constructed from a more recent fabrication
technology. This would reduce the clock period but may increase costs. The rest of
this module focuses on an optimization to reduce the clock period called pipelining.
This is the most important optimization we use when designing processors. It uses a
similar concept to an assembly line in a factory where work can start on the next
item before the previous one finishes. Let's take a closer look. Imagine that each
instruction has to go through four circuits in a processor. If we attempt to do all
of these in one clock cycle, this means our clock period is the latency of all four
circuits added together. If we were to pipeline this, we would add a pipeline
register in the middle. This divides the circuit into two sections called stages.
Notice that although each instruction takes a similar amount of time to travel down
the whole pipeline, the pipeline design can execute nearly twice as many
instructions per second. The throughput has doubled. This is because we can set the
clock period much shorter. It's now the maximum latency of the two stages. We can
pipeline into many stages and this allows for much faster execution of programs.
Unfortunately, though, pipelining a real microprocessor design is not quite as
simple because the processor has various feedback signals and loops in the circuit.
In the next video, we'll take a look at the challenges of pipelining in practice.
[music]
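Written out as a formula, using symbol names of our own choosing, the relationship described at the start of this video is:

$$ T_{\text{program}} = N_{\text{instructions}} \times \text{CPI} \times t_{\text{clock}} $$

where $N_{\text{instructions}}$ is the number of instructions executed, CPI is the average number of clock cycles per instruction, and $t_{\text{clock}}$ is the clock period. Reducing any of the three factors without increasing the others makes the program run faster.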
Module 2, Video 2
[music] In this video, we're going to look at applying the pipeline optimization to
a realistic microprocessor design. In the first module, we met the components of a
microprocessor, so let's look at how these are really connected. This diagram shows
all the connections needed for a real, unpipelined microprocessor. Each clock cycle,
the processor starts by fetching an instruction from the instruction memory. Once
the instruction reaches the decode logic, it is decoded to produce the control
signals necessary to execute it. The exact control signals vary depending on the
type of instruction. For example, arithmetic instructions access the register file
and interact with the ALU. Ultimately, no matter how the instruction was executed,
the last step of each clock cycle is to update the program counter. This is done by
the branch unit. For non-branch instructions, this just means incrementing the
program counter. However, for branch instructions, the branch unit has to do some
calculations. When we apply our pipeline optimization to this design, we face some
challenges. The design has several loops because instructions have dependencies. How
can we break these cycles? The key observation is that not every instruction is the
same. In real programs, branch instructions usually make up less than 20 percent of
the program. For non-branches, the branch unit doesn't actually need to wait for the
ALU before calculating the result. Let's look at how we can use this fact to
pipeline the processor. Once the first instruction shown in yellow reaches the
pipeline register, we're ready to begin fetching the next instruction, shown in
blue. The yellow instruction can be in the execute stage whilst the blue instruction
is being fetched. Once the yellow instruction is finished, the blue instruction is
ready to enter the execute stage and a new green instruction enters the fetch stage.
What about the branches though? Let's imagine this next yellow instruction is a
branch. The fetch stage works normally until the branch unit, but the branch unit
cannot proceed. Consequently, the pipeline stalls. The fetch stage spends a cycle
waiting whilst the execute stage executes the branch. Finally, once the ALU is done,
the branch unit can proceed and the next instruction, in this case blue, can be
fetched. Overall, this means that the processor wasted one cycle stalling due to the
branch. Since only 20 percent of instructions are branches, this means that each
instruction would require on average 1.2 cycles. The same idea of stalling the
pipeline can be used to create even longer pipeline designs. This diagram shows a
typical five-stage processor pipeline. In the next video, we'll look at how we can
manage or prevent some of the stalls in a design like this. [music]
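The 1.2 cycles-per-instruction figure quoted above comes from weighting the one-cycle branch stall by how often branches occur:

$$ \text{CPI} = 1 + f_{\text{branch}} \times \text{stall cycles} = 1 + 0.2 \times 1 = 1.2 $$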
Module 2, Video 3
[music] Instructions within a program may be dependent on each other. That is, one
instruction may produce a value that a subsequent instruction consumes. Data values
may be communicated through registers or memory. The simple program shown has a
number of so-called true data dependencies. This means we must take care to execute
these instructions in order, and make sure results are correctly communicated.
Additionally, the outcomes of branch instructions may affect the path taken through
the program, and consequently, this affects whether an instruction is actually
executed. This sort of dependency is known as a control dependency. In the previous
video, we met a realistic processor pipeline with five stages. Circumstances that
prevent an instruction making progress in our pipeline are known as pipeline
hazards. Let's take a look at how dependencies cause hazards. This program has a
true data dependency. The first instruction writes to register one, which is then
read by the second instruction. If we send this down our pipeline, we see that the
second instruction must stall, waiting for register one to be written, before it can
read and proceed. This is a data hazard. Unfortunately, dependent instructions are
common and stalling in this way would significantly increase the average cycles per
instruction. Let's take a closer look at the hazard though. The ADD instruction is
in the execute stage, meaning its result is being computed. The SUB instruction
needs that result to proceed. Rather than waiting for the ADD to reach the writeback
stage, we could add an extra path into our pipeline to carry the output of one stage
to a later instruction, making the result available straight away. We call this a
forwarding path. In this case, the ALU result is forwarded to the SUB instruction to
be used as X1. This small piece of extra circuitry allows this data hazard to be
eliminated completely. Unfortunately, even if we add forwarding paths everywhere,
it's not possible to eliminate all data hazards. For example, this program has a
data hazard due to the load instruction. There are other types of hazard too. This
program contains a control hazard. We cannot be sure which instruction to fetch
until after the branch instruction executes. Consequently, this program has two
stall cycles. We will look in more detail at how control hazards can be mitigated in
the next module. Another class of hazards, called "structural hazards," occurs when
two instructions require the same resources simultaneously. For example, if
instructions and data were stored in the same memory, and this could only be
accessed once per cycle, we would have to very frequently stall our pipeline to let
these stages access memory one-by-one. [music]
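The decision between forwarding and stalling can be sketched as a small piece of Python. The two-instruction view and the field names below are simplifying assumptions, not the control logic of any real core: we check whether the instruction currently in the execute stage writes a register that the next instruction reads, and whether that producer is a load, whose result is not available until the memory stage.

```python
# Simplified hazard check between the instruction in Execute (producer) and
# the next instruction about to enter Execute (consumer). Field names are
# invented for illustration.

def resolve_hazard(producer, consumer):
    if producer is None or producer["dest"] is None:
        return "no hazard"
    if producer["dest"] not in consumer["sources"]:
        return "no hazard"
    if producer["is_load"]:
        # The loaded value only exists after the memory stage, so even with
        # forwarding we must stall for a cycle before forwarding it.
        return "stall, then forward"
    # An ALU result can be forwarded straight from the end of Execute.
    return "forward"

add_x1 = {"dest": "X1", "sources": ["X2", "X3"], "is_load": False}
sub_x4 = {"dest": "X4", "sources": ["X1", "X5"], "is_load": False}
ldr_x6 = {"dest": "X6", "sources": ["X0"],       "is_load": True}
use_x6 = {"dest": "X7", "sources": ["X6", "X8"], "is_load": False}

print(resolve_hazard(add_x1, sub_x4))  # forward (the ADD/SUB example in this video)
print(resolve_hazard(ldr_x6, use_x6))  # stall, then forward (the load example)
```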
Module 2, Video 4
[music] In the previous videos, we explored how pipelining could improve performance
by reducing our clock period and by overlapping the execution of different
instructions. We also saw that it was sometimes necessary to stall our pipeline to
ensure that instructions were executed correctly. Ideally, our average cycles per
instruction, or CPI, will remain at 1.0. If we must stall, however, this will
increase. For example, if 20 percent of our instructions were loads and each of
these caused one stall cycle, our CPI would be 1.2. If a further 20 percent of
instructions were branches, and each of these caused two stall cycles, our CPI would
be 1.6. The longer we make our pipeline, the more stall cycles there will be, and
eventually the cost of stalls may outweigh the benefit of the faster clock period.
For example, let's imagine we added a stage to our five-stage pipeline from the
previous video. Now the number of stalls after a branch instruction increases to
three, hurting our CPI. On the other hand, our clock period would improve. So
whether or not this helps speed program execution would depend on the exact details.
It may eventually become more difficult to reduce our clock period by adding further
pipelining stages. This is because it becomes harder to perfectly balance the logic
between stages and because of the constant delays associated with clocking and our
pipeline registers. To mitigate these issues, we will need to invest in more
transistors and our design will require more area and power. The deeper our pipeline
gets, the greater the investment we need to make in terms of area and power for the
same incremental improvement. Commercial processors today have anywhere from two to
twenty pipeline stages. The faster, more expensive and power-hungry processors tend
to have longer pipelines than the smaller, cheaper processors in embedded devices. As
with many techniques in computer architecture, eventually it becomes more profitable
to invest our time and resources in an alternative way of improving performance. In
later modules, we'll explore how we can reduce the CPI, even in heavily pipelined
processors. [music]
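The CPI figures in this video follow directly from the stall counts:

$$ \text{CPI} = 1 + \underbrace{0.2 \times 1}_{\text{loads}} + \underbrace{0.2 \times 2}_{\text{branches}} = 1.6 $$

Execution time is then $T = N_{\text{instructions}} \times \text{CPI} \times t_{\text{clock}}$, which is why a longer pipeline only helps if the reduction in $t_{\text{clock}}$ outweighs the increase in CPI.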
Module 2, Lab
[music] In this exercise, we're going to be using a model of a processor pipeline to
explore the effect of the pipelining optimization. Computer architects use models
like this to make high level decisions early on about what parameters they will use
for a processor and using a model such as this saves the burden of actually building
the processor to find out its performance. The model is not simulating accurately
the performance of the processor but rather it's giving us an idea for what
performance we might expect. So what can we do with this model? Well, we can
configure the number of pipeline stages, which we can see affects the diagram. And
we can also turn on or off the forwarding optimization. As we change these numbers
notice that the design parameters change down here. So for example, the clock
frequency is improved by increasing the number of pipeline stages but the design
area will get bigger. And so this may be a consideration depending on the problem.
We can also choose which of two programs we're going to put down our pipeline. When
we press the step forward button the pipeline advances to the next clock cycle and
we can see the instructions have started to flow down our pipeline and interesting
events such as forwarding will be noted in the simulation. Sometimes the simulation
will detect that there will be a stall for example, in this case, we can see that
there is a data hazard because the instruction in the red memory stage writes to
register X13 which is read by the instruction in the yellow decode stage and
therefore a stall cycle is necessary in order to allow the result to be generated.
If we press the play button, the simulation will proceed automatically and we can
see various stall events happening as the simulation proceeds. But notice that the
program we're simulating is nearly 1,000,000 cycles long, so watching it play out
at this speed is going to take quite a while. So we can use the fast forward slider
to simulate much, much faster. Notice that the statistics down the bottom have
updated depending on the results of the simulation, and at this point we can see
that the program has finished and that the simulated program took 3.98 milliseconds
to execute. We can also see that just below, the
results of past simulations are stored in little tables so we can easily refer back
to them when we're doing later experiments. So as an experiment, let's imagine what
would happen if we disabled the forwarding optimization but change nothing else and
we'll just run this program through. What we can see immediately is that the design size
is slightly better, which is what we would expect. It's about 1% better in fact in this
case, because of the lack of the forwarding wires. But now that the program is
finished, we can see that the program execution time is a lot worse. 6.34
milliseconds is about 50% worse. So again, looking in our table, we can compare the
execution times and the area, and we can see that in most cases the forwarding
optimization would be a big improvement here because, at the cost of an increase in
area of about 1%, we've had an improvement in execution time of about 50%, which
is likely to be a good trade-off, but not always. It would depend on the exact
scenario. Perhaps that 1% area is more important than the performance of this
program. In the exercise, you'll be invited to suggest a design, using the number of
pipeline stages and whether forwarding is enabled, that will meet
certain criteria. You can play about and do as many simulations as you wish to
figure out what the best design might be. Once you've got it set up, select the
processor that you're happy with at the top and then scroll down to the submit
button and press that. Good luck. [music]
Module 3, Intro
[music] Hi, I'm Nigel Stevens. I'm Lead Instruction Set Architect for the Arm A-
Profile architecture. I've been at Arm for about 14 years and I have responsibility
for the Armv8-A instruction set, including recent developments such as the Scalable
Vector Extension and Scalable Matrix Extension. What do we mean by the instruction
set architecture? The instruction set architecture is primarily, I guess, what most
people think of as the opcodes: the encodings of instructions that are executed by an Arm-
based processor. But it also includes other aspects as well such as the exception
model, system programming features, memory management and suchlike. The architecture
for Arm is rather distinct from what other companies may call an architecture. For
Arm, architecture is a legal contract, if you will, between hardware and software.
If software uses only those instruction opcodes and features of the ISA that are
described by the Arm architecture to perform its work, and the hardware that it's
running on implements all of those op codes and features exactly as defined by the
architecture, then any architecturally compliant software will run on any
architecturally compliant hardware that implements that Arm architecture. And that
doesn't mean just processors from Arm itself, but also processors that are designed
by our partners and which we have validated are conformant with our architecture.
How do you decide which instructions to include in an ISA? When we are looking at
requests from partners or from internal research to add a new instruction, we go
through quite a long process of trying to justify that instruction, or, quite
commonly, a set of instructions rather than a single instruction. We have to show
that it gives us some real benefit in performance, the performance of your code
running on that CPU. Or maybe not performance. Maybe it's security you're trying to
achieve. But it has to give you some really concrete benefit that is worth the cost
of adding all of the software, the validation software, the implementation costs for
all of the different implementations, compiler support, so on and so forth. It has
to answer that cost-benefit analysis. What is the difference
between an ISA and a microarchitecture? The difference between an instruction set
architecture, or ISA, and the microarchitecture is that the ISA is an abstract
concept. It defines a set of instruction encodings which software can use, and which
hardware has to recognize and implement. How that is implemented is a choice for the
microarchitecture. So the instruction set architecture is fixed, it's defined by
Arm. The microarchitecture is defined by whatever team of people is designing that
CPU. And there are many different approaches to implementing the Arm architecture,
from very small, efficient cores with in-order pipelines up to very high-
performance, state-of-the-art, out-of-order execution, and everywhere in between. So
the microarchitecture is implementation-specific, the architecture is generic, and
software written for the architecture should run on any microarchitecture. Why does
Arm produce processors with different instruction sets? Arm supports multiple
instruction sets. Some of that is to do with legacy: you can't abandon your legacy
software, your legacy ecosystem. So as the architecture has advanced and we've
introduced major new instruction sets, we still have to continue to support old
software. It takes years, maybe 10 years to move the software ecosystem to a major
new ISA. So for example, AArch64, which is the 64-bit architecture that we
introduced with Armv8, also supported what we call AArch32, the old
32-bit architecture that was implemented in the Armv7 architecture and prior to
that, including the Arm and the Thumb instruction sets. And we needed to do that
because, while some software might start to migrate to the 64-bit architecture,
there's still a lot of software on the planet which is going to continue using the
32-bit architecture, and that has to survive. So that's part of the reason: it's
about legacy. You can't just obsolete the whole world when you introduce a new
architecture, a new instruction set architecture in particular. There are other
reasons as well, which is there are certain instruction sets that are different for
reasons of the ecosystem that they're working with. So if you were to compare, for
example, the A-profile architecture that's designed for application processors that
run rich operating systems with virtual memory, supporting SMP, symmetric multi
processing operating systems running large applications, whatever it may be: web
browsers on your phone or something, or a web server in a server farm somewhere. You
have your R-profile architecture, which is designed for high-performance, real-time
embedded systems. The constraints there are somewhat different. The instruction set
is actually fairly similar to the A-profile, but some of the underpinnings of the
architecture, the system side of the architecture, are simplified in order to give
more consistent and predictable real-time response to things like interrupts or
memory translation and suchlike for real-time systems. And then at the other extreme
you have the M-profile architecture which is designed to be capable of being built
in a very simple, ultra-low power implementation with low code size and again,
similar to the R profile, very predictable real-time performance. So the answer is
there are different instruction sets for reasons of the market that they're trying
to address, and then there are different instruction sets because, well, we have
history. [music]
Module 3, Video 1
[music] In the previous module, we explored how pipelining can be used to improve
performance. We also saw how it is sometimes necessary to stall our pipeline to
ensure our program is executed correctly. In a simple pipeline, it will be necessary
to stall the pipeline whenever we encounter a branch instruction. This is because we
must wait until our branch is executed before we can be sure which instruction to
fetch next. As a recap, branches are instructions that change which instruction in
the program will be executed next. There are two types of branches: conditional
branches and unconditional branches. Unconditional branches always change which
instruction executes next, whereas conditional ones may or may not, depending on the
computations in the program. In real programs, between approximately one fifth and
one quarter of all instructions are branches, and the majority of these are
conditional. Executing a branch involves calculating the new address to load into
our program counter. This is the branch's "target address." However, conditional
branches have an extra task: we must first determine whether the branch is taken. If
the branch is not taken, we can effectively ignore the branch and fetch the next
instruction as normal. Recall the processor performance equation from an earlier
video. Since we have to wait for branches to complete before fetching the next
instruction, we generate stall cycles. These increase the average number of "cycles
per instruction," which reduces our microprocessor's performance. The longer our
pipeline gets, the longer it is before each branch is resolved, and the more costly
branches become. Can you think of a way to avoid some of this stalling? One idea is
to evaluate branches earlier in the pipeline, for example in the Decode stage
instead of in the Execute stage. This can indeed help to reduce the number of
stalls, but we may still need to stall if the branch depends on other instructions
that haven't been executed yet. Another idea is to continue fetching instructions in
program order, effectively assuming that each branch is not taken. The number of
stalls in the pipeline for a not-taken branch is zero in this design. On the other
hand, if the branch is in fact taken, the subsequent instructions that we fetched
will be incorrect. So, we must remove all instructions that have been fetched on
this incorrect path from our pipeline. This is called "flushing" the pipeline.
Unfortunately, in real programs, branches are taken much more often than not taken. Could
we then simply assume instead that all branches will be taken? Sadly not, no,
because then we would also need to know the specific "target address" immediately,
in order to know which instruction to fetch next. It may at first seem impossible to
know this before the instruction is decoded. However, computer architects have found
a way to do exactly this. The next video will look at "dynamic branch prediction:"
the idea of predicting the behavior of the branch instruction before it has even
been fetched. [music]
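One common way to model the cost of branches discussed here is shown below; the symbols and the example numbers are our own illustrative assumptions, not figures from the video.

$$ \text{CPI} \approx 1 + f_{\text{branch}} \times f_{\text{taken}} \times \text{flush penalty} $$

For example, with 20 percent branches, 60 percent of them taken, and a two-cycle flush penalty under a "predict not taken" policy, CPI $\approx 1 + 0.2 \times 0.6 \times 2 = 1.24$.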
Module 3, Video 2
[music] In this video, we'll explore how to predict the behavior of a branch
instruction. This can sometimes eliminate the cost of branches altogether.
Fundamentally, a branch involves changing the value of the program counter, which is
the address of the next instruction in the program. If we could predict what this
change will be, quickly and accurately, we would have no need to stall. More precisely,
we need to predict that we are fetching a branch, predict it as taken or not taken,
and predict what its target address is. How could we ever make such predictions?
What would we base them on? Well, since instructions are often executed multiple
times, we can accurately make these predictions based on just the instruction
address. If we've previously seen that a particular address contains a branch, and
we see that address again, we can predict whether that branch will be taken, and its
target address, based on its behavior last time. Amazingly, for real programs,
simply predicting repeating behavior is typically around 90 percent accurate. This
means we could eliminate stalls caused by branch instructions 90 percent of the
time. Let's apply these insights to try to build a branch predictor. We will add two
extra blocks to the Fetch stage of the pipeline we met in Module 2. The first will
remember information about recently executed branches. This will include the program
counter values of branches and their target addresses. This memory is called the
"Branch Target Buffer" or BTB. The second block will make predictions about whether
an address containing a branch is taken or not. We simply call this the "branch
predictor." In the next video, we will look at these in more detail. Combining these
two gives us all the information we need to predict the next value of the program
counter based solely on the current value of the program counter. We don't even need
to decode the instruction to make this prediction. Here we can see how a running
branch predictor behaves for a sample program. Each program counter is checked in
the BTB to see if it's predicted to be a branch and to identify its predicted
target. We also simultaneously check the branch predictor to see if the branch is
predicted to be taken. Based on these predictions, the branch unit computes the
predicted next program counter. Many cycles after the prediction, feedback will be
given by the rest of the pipeline as to whether or not the prediction was correct.
Whenever the prediction is wrong, we have to flush the pipeline and update the BTB
and branch predictor. The pipeline will then resume fetching from the correct
program counter as computed by the pipeline. The majority of instructions are not
branches, so most of the time the branch unit just adds 1 to the program counter.
The BTB contains the instruction address and target address of some recently
executed branches. The exact number of entries in the BTB varies considerably. BTBs
in large modern processors contain many thousands of branches. In operation, the BTB
checks the supplied program counter against its memory to see whether it has a match,
and if so, it returns the target address. Otherwise, it predicts that the
instruction is not a branch. After each branch executes, the BTB is updated with the
true target address. The BTB cannot be arbitrarily large, so it may have to forget an
existing branch to remember a new one. A simple BTB design like this is typically
around 90 percent accurate at predicting target addresses in real programs. [music]
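A BTB can be sketched as a small table indexed by the program counter. The sketch below uses a Python dictionary capped at a fixed number of entries; real BTBs are set-associative hardware memories, and the size and the eviction policy here are illustrative assumptions only.

```python
class BranchTargetBuffer:
    """Toy BTB: remembers the target addresses of recently executed branches."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self.entries = {}                      # branch PC -> target address

    def predict(self, pc):
        # If we have seen a branch at this PC before, predict the same target
        # again; otherwise predict "not a branch" (fall through sequentially).
        return self.entries.get(pc)

    def update(self, pc, target):
        # After the branch really executes, remember its true target address.
        if pc not in self.entries and len(self.entries) >= self.max_entries:
            self.entries.pop(next(iter(self.entries)))   # forget an old branch
        self.entries[pc] = target


btb = BranchTargetBuffer()
print(btb.predict(0x1000))        # None: not known to be a branch yet
btb.update(0x1000, 0x2000)        # the branch at 0x1000 jumped to 0x2000
print(hex(btb.predict(0x1000)))   # 0x2000: predicted from last time
```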
Module 3, Video 3
[music] In the previous video, we met the two major components of dynamic branch
prediction, the BTB and the branch predictor. In this video, we'll take a deeper
look at the branch predictor. It predicts whether or not a branch will be taken,
based on the program counter. A simple branch predictor would try to remember what
the branch did last time and predict that the same behavior will repeat. Let's see
how such a predictor might be organized. Remembering a branch prediction for every
possible instruction address would take up far too much memory, so we reuse the same
memory for different branches via a process called "hashing." We hash the address of
the branch to a smaller number. This does unfortunately lead to a problem called
"aliasing," where two different branches can hash to the same value, but this is
rare in practice. Let's see what happens now, when we execute a simple loop. We see
a misprediction when it encounters our branches for the first time, and another when
we exit the loop. The first case will be dependent on the value in our predictor's
memory, and it may be that we are able to predict the branch correctly the first
time we see it. The second case is hard to avoid, although some more sophisticated
branch predictors will learn how many iterations a loop will make. A common
improvement to this simple scheme is to avoid instantly flipping our prediction just
because the branch does something unexpected once. This can be achieved with a
saturating counter, which instead remembers how many times the branch has been taken
recently, versus not taken. The counter increments when the branch is taken and
decrements when not taken. It predicts "taken" if the majority of recent executions
of this branch were taken. When building higher-performance processors, we often
have to discard many instructions every time we mispredict a branch, so accurate
branch prediction is very important. Therefore, branch prediction is still an active
area of research. One of the key ideas used is correlation between branches. In real
programs, a common pattern is for example a pair of branches that always have
opposite behavior. A branch predictor can take advantage of this by remembering a
history of whether or not recent branches were taken or not taken and incorporating
this in the hash. Another idea is to combine multiple different types of predictor
in a "tournament predictor." We use a "meta predictor" to predict which of two
branch predictors will do a better job. As you can see, branch prediction can get
quite complicated. [music]
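As a supplement to this video, here is a minimal Python sketch of the scheme described: a table of saturating counters indexed by a hash of the branch address. The table size is an assumption, and the history and tournament ideas mentioned above are deliberately left out.

class SaturatingCounterPredictor:
    def __init__(self, table_size=4096, max_count=3):
        self.table_size = table_size
        self.max_count = max_count                     # 3 gives the classic 2-bit counter
        self.counters = [max_count // 2] * table_size  # start weakly "not taken"

    def _index(self, pc):
        # Hashing the address means different branches can alias to the same entry.
        return pc % self.table_size

    def predict(self, pc):
        # Predict taken if the majority of this branch's recent outcomes were taken.
        return self.counters[self._index(pc)] > self.max_count // 2

    def train(self, pc, taken):
        # Increment when the branch is taken, decrement when not, saturating at the ends.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)

With max_count set to 1, this behaves like the simple "repeat what the branch did last time" predictor; with 3, it is the classic two-bit counter, so a single surprising outcome is not enough to flip the prediction.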
Module 3, Video 4
[music] In this module, we've met the concept of branch prediction. Even using
relatively small simple circuits, we can accurately predict real branch behavior
more than 95 percent of the time. Could we do better, and do we really need to? In
small, simple processors, these prediction accuracies are fine, because each
misprediction causes only a few stall cycles. Increasing the accuracy is not that
impactful. However, in complex processors with very long pipelines, the difference
between 98 percent and 99 percent prediction accuracy can be significant for
performance as a misprediction could incur dozens of stall cycles. Accurate
prediction really does matter. One of the problems we face in making accurate
predictors is that they need to be small enough and fast enough to fit in a
microprocessor. We can imagine all sorts of ways to do accurate branch prediction,
but if their circuits were slower than simply executing the branch, they would
not be useful. Modern high-performance processors will often have multiple branch
predictors, for example a small fast one and a slower complex one. The slower one
can override the prediction of the fast one if it thinks it got it wrong, which does
incur some stall cycles, but fewer than a total misprediction. Another problem we
face is that some branches are just very hard to predict. No matter what technique
is used, there are some branches in real programs that are effectively "random". For
example, when compressing or decompressing data, the combination of the underlying
algorithm and the input data may provide no clear patterns to a prediction. No
matter how hard we try, some branches will never be predicted correctly 100 percent
of the time. A final problem is that since the predictors work based on observing
the program, there will always be a period of time when the predictors train on a
program to learn its behavior. [music]
Module 3, Lab
[music] In this exercise, we're going to be using a branch predictor simulator to
explore dynamic branch prediction. This simulator will accurately simulate the
details of a branch predictor but it uses a trace of a real program executing on a
real machine to avoid the need to simulate the rest of the processor. Computer
architects use techniques such as this to quickly explore one area of processor
design, understanding that the accuracy may not be perfect given that we're not
simulating the full processor design. The interface allows us to configure details
of our branch predictor, for example the maximum value of the saturating counters
used in the branch predictor's table, or the hash function. We can also see the
impact of changes on the delay of the branch predictor and on the design size, which
are two key metrics when designing a branch predictor. Once we've configured the
design we want, we can press run to simulate a program, and the results of that
program's execution will be displayed in the rest of the stats. So, for example, we
can see here that the predictor predicted 95.24% of the branches correctly, which is
fairly good, and this resulted in an overall execution time for the program of 5.335
milliseconds. Just below, we can see a table that records previous simulation runs
as well, so we can do multiple experiments and see which produces the best results.
For example, if we were curious about the effect of using a saturating counter with
a maximum of three rather than one, we could change that and notice that the design
size has substantially increased. When we press run, we notice that the predictor
accuracy has also increased: it has gone up to 96.31%, and consequently the execution
time of the program has fallen slightly. We can then compare these two designs and
see whether or not this represents a good trade-off for our processor. Perhaps the
area cost is justified, or perhaps it's too much; it would depend on the exact
processor we're trying to design. In the problems, you'll be invited to come up with
designs that are suitable for particular constraints, for example constraints on the
runtime of the program or the design size. Once you've configured the branch
predictor that you think meets the objectives, you can scroll down to the submit
button and click it, and you'll be told whether or not your answer is correct.
Good luck. [music]
Module 4, Video 1
[music] So far, we've looked at the microprocessor's "datapath"— meaning its
execution units, registers, and control circuitry. We have given less attention to
its memory. We usually implement memory using a different type of chip than the
microprocessor, using a technology called DRAM. It is very dense, allowing us to
store lots of data in a small area. However, one issue with DRAM is its speed. Since
the 1980s, processor performance has increased very rapidly at roughly 55 percent
per year, so CPUs of today are many orders of magnitude faster than those of 40
years ago. In contrast, memory performance has grown much more modestly. Whilst
memories are also much faster than they were in previous decades, their performance
has not kept pace with processors. This leads to a processor-memory performance gap,
with it becoming more costly to access memory as time goes on. One of the issues
that makes these memories slow is their size. We can make a memory as fast as our
microprocessor, if it is very small. However, this is the opposite of what
programmers want. Programmers can use extra memory to solve more complex problems,
or to solve existing problems faster. This complicates microprocessor design because
although we want a large memory to hold all our data, large memories are slow and
this slows down our processor. How do we overcome this? What if we could get the
speed benefit of a small memory alongside the size benefit of a large memory? One
solution is to have a large, slow memory and a small, fast memory in our system. For
this to be useful, we need the small memory to hold the data that we use most often,
so that we often get the speed benefits of accessing it, and only rarely have to
access the slow memory. Whilst there are different arrangements of these two
memories, the configuration that is most often used is an "on-chip cache memory."
The small memory sits inside the processor between the pipeline and the large
memory. We keep all data in the large main memory, and put copies of often-used data
in the small memory, which we call a cache. But we are not limited to only one
cache! Our pipeline reads memory in two places: when fetching the instructions; and
when accessing the data. It makes sense to have two caches here, each optimized for
different purposes. The instruction cache is optimized for fast reading of
instructions at Fetch. The data cache is optimized for reading and writing data from
the memory stage. We will often put a larger "level 2" cache between these two
caches and the main memory. The L2 cache is a "medium-sized" memory: faster and
smaller than main memory, but slower and larger than the two L1 caches. Using a
hierarchy of caches reduces the bandwidth requirements to main memory and the energy
cost of moving data around. [music]
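As a rough illustration of why this arrangement helps, here is a toy Python model. The latencies and hit rate are invented, and it simply assumes a hit costs the small memory's latency while a miss costs the large memory's.

def average_access_time(hit_rate, fast_latency, slow_latency):
    # Weighted average: most accesses are served by the small fast memory,
    # the rest must go to the large slow memory.
    return hit_rate * fast_latency + (1.0 - hit_rate) * slow_latency

print(average_access_time(0.0, 2, 20))   # no cache: every access costs 20 cycles
print(average_access_time(0.9, 2, 20))   # 90% served by a 2-cycle cache: about 3.8 cycles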
Module 4, Video 2
[music] We previously looked at the need for a cache, which can be used to store
often-used data for fast access by the processor. But which data is used often
enough that it is worth including in the cache? Programs are mostly made of loops.
Here's a simple one that sums values in memory. It displays the two characteristics
that we can exploit to decide what data to put into a cache. Let's look briefly at
how it works. Each time round the loop, there are two loads and one store. The first
load is to load the data we're summing. The other load, and the store, are to update
the running sum. Notice two things here. First, we access the running sum over and
over again; each time round the loop. Second, when we access part of the data in one
loop iteration, we've already accessed its predecessor in the previous iteration,
and will access its successor in the next. Caches exploit two types of "locality" in
programs in order to be effective. The first is temporal locality: if a piece of
data is accessed, it is likely that it will be accessed again in the near future.
The running sum has temporal locality. The second type of locality is spatial
locality: If a piece of data is accessed, then its close neighbors are quite likely
to be accessed in the near future. By close neighbors, we mean data whose memory
addresses are not far from each other. The data accesses have spatial locality. It
turns out that most programs exhibit lots of temporal and spatial locality. We can
exploit this to determine what to put in the cache. Exploiting temporal locality is
fairly easy: we simply see which values have been accessed and place them in the
cache. Exploiting spatial locality is also quite simple: when a piece of data is
accessed, place its neighbors in the cache. We'll see in the next module how this is
actually achieved. We use locality to guess what values will be accessed next and
store them in the cache. If we are correct, we get the benefit of fast memory, but
if we are wrong, we must perform a slow access to main memory— just as we would if
the cache were not present. In real programs, we see hit rates of 90 percent or more
in microprocessor caches, resulting in vast performance improvements. However, it's
worth noting that some programs have less locality, and for those programs, caches
offer little performance benefit. [music]
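The video does not show source code for the loop, but a sketch like the following Python version makes the accesses visible; the data values are invented.

def sum_values(data, result):
    # 'result' is a one-element list standing in for the memory location
    # that holds the running sum.
    for i in range(len(data)):
        value = data[i]             # load: consecutive addresses, spatial locality
        total = result[0]           # load: the running sum, temporal locality
        result[0] = total + value   # store: the running sum again

values = [3, 1, 4, 1, 5, 9]
running_sum = [0]
sum_values(values, running_sum)
print(running_sum[0])   # 23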
Module 4, Video 3
[music] We've looked at the reasons why we build caches, but how do they actually
work? To the outside world, a cache simply takes an address as input, and either
provides the data that is stored at that location at output, or returns a signal to
say that it doesn't have it. If the data is found, this is called a "cache hit". If
the data is not in the cache, this is called a "cache miss", and it means we must
look for the data in main memory instead. After each miss, we update the contents of
the cache. The fundamental building block of a cache is called the "cache line".
It's a number of data bytes from consecutive addresses in main memory. Cache lines
are typically 32 or 64 bytes long, and a cache typically has an array of hundreds to
many thousands of lines. The line captures spatial locality, because it is larger
than the data read by a single load instruction. When a cache miss occurs, the whole
line containing that data is copied to the cache from main memory, meaning we have
nearby values for future accesses. When a request comes in, we use some bits from
that address to index the line array. Just like in the last module on branch
prediction, this leads to the problem of aliasing again, since selecting only some
of the bits to index into the array is like a hash. This means that many addresses
map to the same line in the cache, but we can only store one of their lines of data.
We need to note down which address's line is currently stored in the data array, in
what we call the "tag array". There is one tag for each cache line. When we access
the cache, we access the tag array with the same index to see if the data we need is
present. This design is called a "direct-mapped cache". Direct-mapped caches work
fairly well, but for some programs, we can be unlucky, with the program accessing two
aliasing lines frequently. We can do something about this by duplicating both the
arrays, so that each line of data can now be stored in one of two places in the
cache. When we access the cache, we look at both arrays and only get a miss if
neither of the tags match. This is called a "2-way set-associative cache", because
each line has a set of two places or "ways" it could reside. The "associativity" of
this cache is therefore two. Set-associative caches introduce a further
complication: when we want to add data, where do we put it? In a 2-way set-
associative cache, there are two choices. How we decide which cache line to evict is
called the "replacement policy". There are many different types of replacement
policy. A simple one is just to make a pseudo-random choice from all the possible
cache lines. Another option is to keep track of when each cache line was last
accessed and to evict the one last used furthest in the past. This is called a
"least recently used policy", and takes advantage of temporal locality. It does,
however, mean storing extra information in the tags to track usage. Many other
ideas are possible too. [music]
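Here is an illustrative Python sketch of the lookup just described, using a 2-way set-associative organization with a least-recently-used policy. The line size, set count, and address split are assumptions made for the example.

LINE_BYTES = 64      # bytes per cache line
NUM_SETS   = 128     # number of sets; each set has two ways
WAYS       = 2

class SetAssociativeCache:
    def __init__(self):
        # Each way holds (tag, data) or None; 'lru' remembers which way in
        # each set was used least recently.
        self.ways = [[None] * NUM_SETS for _ in range(WAYS)]
        self.lru = [0] * NUM_SETS

    def _split(self, address):
        line_address = address // LINE_BYTES   # drop the offset within the line
        index = line_address % NUM_SETS        # which set to look in
        tag = line_address // NUM_SETS         # identifies the address within the set
        return index, tag

    def lookup(self, address):
        """Return True on a hit; on a miss, fill the line (data not modelled)
        and evict the least recently used way of the set."""
        index, tag = self._split(address)
        for way in range(WAYS):
            entry = self.ways[way][index]
            if entry is not None and entry[0] == tag:
                self.lru[index] = 1 - way      # the other way is now least recent
                return True                    # cache hit
        # Cache miss: replace the least recently used way with the new line.
        victim = self.lru[index]
        self.ways[victim][index] = (tag, "line data")
        self.lru[index] = 1 - victim
        return False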
Module 4, Video 4
[music] Now that we've seen how caches work, let's see how they affect the
performance of a processor. Recall the processor performance equation, where the
processing time is proportional to the average cycles per instruction. Without a
data cache, if 20 percent of instructions are loads, and main memory takes 20 cycles
to access, our CPI figure must be at least 5. However, if we provide a cache that
holds the required data 80 percent of the time... ...and only takes 2 cycles to
access, our CPI reduces to 2.2, which is a significant improvement! We can isolate
the memory terms in this equation to get the average memory access time —abbreviated
to AMAT— which allows us to compare different cache configurations more easily.
Changing the cache configuration will impact the AMAT. There are many different
cache parameters we can change, such as the size, replacement policy, associativity,
whether we put data in the cache for stores or just for loads, and so on. For
example, reducing the size of the cache will improve the access time for a hit, but
will also increase the miss rate. Let's say that we can halve the access time to 1
with a corresponding halving of the hit rate. This alters the AMAT to 13, which in
this case is worse for performance overall. It's also useful to look at why an
address might miss in the cache. Broadly speaking, we can divide cache misses into
three different categories. Compulsory misses occur when we attempt to access an
address that we have never seen before and so never had the opportunity to cache it.
Capacity misses occur when there is more data being accessed than the cache could
hold, even if we had complete freedom in where to put each cache block. Conflict
misses occur in caches where there are more addresses hashing to the same index than
arrays to hold the data. We can alter our cache configurations to lower these
misses, but as always, there are trade-offs involved. Compulsory misses can be
reduced by increasing the cache block size, to take advantage of spatial locality.
But for a fixed cache size, this reduces the number of different addresses or cache
lines that can be stored. A technique called "pre-fetching" can also be used to
predict the addresses that will soon be accessed, and bring their data into the
cache early. But this increases energy consumption, and may make the cache perform
worse if the predictions are not highly accurate. Capacity misses can be reduced
through increasing the size of the cache. Although, as we saw before, this impacts
the number of cycles taken to determine a hit. Conflict misses can be reduced
through increasing the number of cache blocks in each set, with an increase in
energy consumption as a side effect of this. [music]
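The arithmetic quoted in this video can be checked with a short Python sketch. The convention assumed here, matching the worked example, is that every access pays the cache hit time and misses additionally pay the main-memory time.

def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time.
    return hit_time + miss_rate * miss_penalty

def cpi(base_cpi, load_fraction, memory_time):
    # Cycles per instruction when a fraction of instructions access memory.
    return base_cpi + load_fraction * memory_time

print(cpi(1.0, 0.2, 20))                 # no data cache: 5.0
print(cpi(1.0, 0.2, amat(2, 0.2, 20)))   # 2-cycle cache, 80% hit rate: about 2.2 (AMAT is 6)
print(amat(1, 0.6, 20))                  # 1-cycle cache but only 40% hit rate: 13.0, worse overall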
Module 4, Lab
[music] In this exercise, we're going to be using a cache memory simulator to
explore the effects of cache memories on processor performance. The simulator
accurately simulates the behavior of the cache memory system, but it's using a model
of the rest of the processor to quickly allow us to simulate the effects of the
cache on the processor without needing to simulate the full processor. We can
configure a number of parameters about our cache, for example, the number of levels
in our cache, whether or not the cache separates instructions and data, or is
unified, keeping them both in the same cache. We can also configure the size, the
line size, and the associativity, and changing these numbers will affect design
parameters, such as the access times in the case of a level one (L1) cache hit or a
cache miss, and also the design size. Once we're happy we've found a design we'd like
to investigate, we can press "run", at which point the simulator will run through
the program. The hit rates for instructions in the level one cache, and also for
data in the level one cache, are displayed, and we can also see the average memory
access time that results from this. Below everything, we can see a table of past
simulations so that we can quickly refer back to our previous experiments when we do
new ones. So, for example, let's say we were curious about the effects of increasing
the size of the cache. If we change the parameter and then press "run", we can
immediately see that the design size has substantially increased, which makes sense
because we've doubled the size of the cache's contents and would therefore expect
roughly double the area. We can also see that the hit rates have improved: there's
about a 1% improvement to the L1 instruction cache hit rate and a 1% improvement to
the L1 data cache hit rate, which has reduced the overall average memory access
time. So we can compare these two designs to see which of them we think is better.
It's a trade-off, of course: the larger design has better performance, but it is
larger, so depending on the context and our performance goals, we may need to pick
the smaller design or the bigger one. In the exercises, you'll be
invited to come up with a series of designs for caches that meet certain performance
goals. For example, you'll have a constraint on the area of the design or the
execution time of the program, and you need to optimize the cache to meet those
goals. Best of luck. [music]
Module 5, Video 1
[music] In this module, we'll look at how to further improve performance by
exploiting "instruction-level parallelism." In Module 2, we explored how pipelining
can improve the performance of our processor. This reduced our clock period, and
allowed execution of instructions to be overlapped, improving throughput. One way to
boost performance further would be to create a much deeper pipeline. At some point,
this would mean even the ALU in our Execute stage will be pipelined. Consider the
simple program shown in the slide. Some instructions are dependent on a result from
the previous instruction. Remember in our 5-stage pipeline that these dependent
instructions could be executed in consecutive clock cycles with the aid of data
forwarding. If execution takes place over two pipeline stages within the pipeline,
we need to stall if adjacent instructions share a dependency. This allows time for
the result to be computed. The programmer may be able to rewrite their program to
get the same result with fewer stalls, by placing an independent instruction between
our pair of dependent instructions. In this case, we can move the third and fifth
instructions earlier to optimize performance. The performance of programs that run
on our "super-pipelined" processor would, to some degree, be determined by the
availability of independent instructions that could be executed in parallel in the
pipeline. This is "instruction-level parallelism"—or ILP. Very deep pipelines are
problematic as they would require: a very high-frequency clock to be distributed
across the chip very precisely; careful balancing of logic between many, very short,
pipeline stages; the pipelining of logic that is difficult to divide further into
stages; and the division of logic at points requiring many pipelining registers to
be inserted. A different approach to exploit ILP is to make our pipeline wider
rather than deeper. In this design, the processor will fetch, decode, and
potentially execute multiple instructions each cycle. Such a design avoids the
problems of a super-pipelined processor, although as we'll see in the next video, it
does introduce some new complications. [music]
Module 5, Video 2
[music] In this video, we are going to explore "superscalar" processors, which can
process multiple instructions in each pipeline stage. In our simple 5-stage
pipeline, there is at most one instruction per pipeline stage. At best, we can
complete one instruction per cycle. We call such a design a "scalar" processor. In a
2-way superscalar version of this processor, we would extend this design so it is
able to fetch, decode, execute and write back up to two instructions at a time. In
general, superscalar processors may vary the number of instructions that can be
processed together in each stage. Let's step through the design. Our instruction
cache will need to supply two instructions per cycle. Typical superscalar processors
only ever fetch adjacent instructions on a given cycle. This can lower performance
if, for example, the first instruction fetched is a taken branch, because then the
second would not be required. Note that now every cycle lost due to control hazards
will cost us two instructions rather than one, so accurate branch prediction matters
even more in superscalar designs. The Decode stage must now decode and read the
registers for two instructions simultaneously. Fortunately, we are able to extend
the register file design to read many register values at the same time. The Decode
stage also needs to check whether the two instructions are independent. If so, and
if the functional units they both need are available, it can "issue" them for
execution in parallel on the next clock cycle. Otherwise, in this simple design, it
will only issue the first, and keep the second back. A simple design such as this—
where two instructions are fetched, decoded and issued— is called a "2-way" or
"dual-issue" processor. In other designs, the width may vary at different stages of
the pipeline. To support the execution of multiple instructions at the same time,
the Execute stage is expanded and contains two execution pipelines. It's common for
these to have slightly different capabilities to save area. For example, the top
pipeline can execute both ALU and memory instructions, while the second pipeline
only executes ALU instructions. To ensure that dependent instructions can execute on
consecutive clock cycles, we must add data forwarding paths. These data forwarding
paths must allow results stored in either execution pipeline to be forwarded to the
input of either ALU. During writeback, we need to store both results to the register
file. This means the register file must be redesigned to allow two writes per clock
cycle. Overall, these changes typically require 25 percent more logic circuitry in
our processor, compared with a scalar processor. But we'd expect an improvement in
execution time of between 25 and 30 percent for real world programs. [music]
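A sketch, in Python, of the kind of issue check this simple dual-issue Decode stage performs. The instruction representation is invented; the rule that only the first execution pipeline handles memory instructions follows the design described above.

from dataclasses import dataclass

@dataclass
class Instruction:
    dest: str        # destination register, e.g. "r3"
    sources: tuple   # source register names
    unit: str        # "alu" or "mem"

def can_dual_issue(first, second):
    # The second instruction must not read the first one's result this cycle...
    if first.dest in second.sources:
        return False
    # ...and a simple design would also stall if both write the same register.
    if first.dest == second.dest:
        return False
    # Only the first execution pipeline can handle memory instructions.
    if first.unit == "mem" and second.unit == "mem":
        return False
    return True

add  = Instruction("r3", ("r1", "r2"), "alu")
load = Instruction("r4", ("r3",), "mem")       # needs r3, produced by 'add'
print(can_dual_issue(add, load))               # False: issue 'add' alone this cycle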
Module 5, Video 3
[music] We've seen that instruction-level parallelism can be used on superscalar
processors, to run them faster than would be possible on a scalar processor. But how
much of a speedup is this in practice? Ultimately, this depends on how much
instruction-level parallelism is possible in a typical program. How might we measure
this? We can do this initially without considering any constraints that will be
imposed by the processor it will run on. Let's consider the instructions executed by
the program. Let's assume that we can predict all the branches in the program
perfectly. Then we can ignore branch instructions, as they don't need to flow down
our pipeline. Now let's imagine we can execute any instruction as soon as the data
it needs is ready. That is, we are only restricted by the presence of true data
dependencies. Note that some dependencies are carried through writes and reads to
memory. Rather than considering program order, we can now just look at the order the
dependencies impose on instructions. This is referred to as "data-flow analysis."
Assuming each instruction takes exactly one cycle to execute, the fastest possible
execution time of the whole program in cycles is given by the longest path in the
data-flow graph. The instruction-level parallelism of this program is the number of
instructions divided by this duration, as this gives the average number of
instructions we would need to be able to execute each cycle to achieve this
duration. In real programs, this can be anywhere from around five, to hundreds or
even thousands. An active area of research and innovation for computer architects is
to imagine processor designs that can expose and exploit as much of this parallelism
as possible. One insight architects have had is that superscalar processors need to
have a fast supply of instructions to be able to analyze dependencies effectively.
This often means that the front end of our processor pipeline is much wider than the
rest of the pipeline, so that it can "run ahead" and see what behavior the program
will have next. Fast and accurate branch prediction is vital, as we often have to
predict multiple branches ahead accurately, to achieve good performance. Another key
insight is that we don't have to wait to execute the instructions in program order.
If all the dependencies of an instruction are satisfied, the instruction can proceed
down the pipeline even if previous instructions are yet to execute. This can reduce
program execution time by taking advantage of more instruction-level parallelism. In
practice though, this creates extra complications, as we will see in the next
module. [music]
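The data-flow analysis described here can be sketched in a few lines of Python: given only the true data dependencies, the fastest possible execution time is the longest dependence chain, assuming one instruction per cycle and unlimited hardware. The example program below is invented.

def ilp(num_instructions, deps):
    """deps maps an instruction number to the instructions it depends on."""
    depth = {}
    def chain_length(i):
        # Length of the longest dependence chain ending at instruction i.
        if i not in depth:
            depth[i] = 1 + max((chain_length(d) for d in deps.get(i, ())),
                               default=0)
        return depth[i]
    critical_path = max(chain_length(i) for i in range(num_instructions))
    return num_instructions / critical_path

# Six instructions: 2 needs the results of 0 and 1, 3 needs 2,
# while 4 and 5 are independent of everything else.
example_deps = {2: (0, 1), 3: (2,)}
print(ilp(6, example_deps))   # 6 instructions / 3-cycle critical path = 2.0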
Module 5, Lab
[music] In this exercise, we will be using a simulator to explore superscalar
microprocessor design. The simulator has a number of parameters that we can
configure, such as the number of pipeline stages, the width of the Fetch stage, the
width of the issue, and the number of ALUs in the design. We can see a diagram of
the processor that we've created and we can also see a number of parameters about
that design, for example the clock frequency and the overall area of the design.
When we press step, the simulator will advance one clock cycle, and so, for example,
here we can see that the Fetch stage has fetched the first four instructions.
However, we immediately see one of the problems with designs such as this, which is
that three of the four instructions that have been fetched are in fact useless: the
first instruction was an unconditional taken branch, and therefore the remaining
three instructions will not be executed by the program and will immediately be
discarded on the next cycle. Pressing the run button allows us to simulate, and we
can use the fast-forward option to simulate much more quickly in order to get to the
end of the long program execution. In this case, we can see that our design achieved
an average cycles per instruction of less than one, which is to say that on average
we executed more than one instruction per cycle, which means that our superscalar
design has fundamentally worked; we can see that, for example, the overall execution
time of the program is 1.6 milliseconds. In a table below, we can also see a record
of our previous simulation runs. So let's say, for example, we were curious about
the effect of increasing the issue width by one. We can make that change and then
press run again to run our new design, and when it finishes we can scroll down to
take a look and see that the program execution time has indeed improved, down to
1.51 milliseconds, at a cost of only 1% more area. So it looks like this was a very
good improvement to our design, and it's almost surely going to be a beneficial
trade-off in practice. In the exercise, you will be invited to configure a number of
different superscalar processor designs with various targets in terms of clock
frequency, design area and execution time. Once you're happy that you've configured
the processor that you think completes the exercise, you can scroll all the
way down to the bottom, where you'll see the submit button that you can press to
have your answer checked. Good luck. [music]
Module 6, Video 1
Introduction:
So hi, my name is Peter Greenhalgh. I'm Senior Vice President of Technology and an
Arm fellow. I'm responsible for the Central Technology Group at Arm. We're about 250
people. We work on everything from machine learning to CPU, GPU, system IP, and the
solutions that we create as well. And we basically path-find future technology at
the product level that goes into all of our products and the IP that we produce. Arm
is known for the power efficiency of its microprocessors. How have you managed to
keep a focus on power when building processors with very complex and power-hungry
features? We've got some really great design teams. In fact, we churn out I think
more CPUs than pretty much anyone else does on the planet. I think we're producing
something like four or five CPUs per year. So we've got a lot of experience in
designing for power efficiency and performance, and in fact we can leverage the
understanding that we have all the way down to microcontrollers through to the
smaller A-class processors, all the way up to the high performance. There's a lot of
sharing between the teams in terms of strong knowledge and capability, and insight
into how to design for both performance and power efficiency. More specifically, I
mean, ultimately, you have a performance goal that you need to achieve, and then as
part of that you have to figure out how to get the best possible power out of the
design when you're achieving that performance goal. And to do that, there's kind of
some different ways of looking at it. There's the really detail-oriented work that
you need to do around things like clock gating and data gating, all the things to
try and stop unnecessary power use deep within the microarchitecture when the
instructions are flowing through the pipeline or data's moving through the pipeline.
And then there's essentially the structure of the design that you've created. And
that then dictates fundamentally what the power of the design is going to be. You
can't fix a design that's got bad structure with improved clock gating, data gating,
and just good low-level design. You have to marry the two together. And that high-
level work that you do is around making sure that the pipeline is well balanced,
that you aren't opening up the pipeline, going too wide too soon; you're extracting
data, you're extracting information as late as you possibly can and just when you
need it, and not just pipelining it down through the design for the sake of it; and
then, fundamentally, good microarchitecture around branch prediction, which stops
you putting things down through the pipeline that you're just ultimately going to
flush; good pre-fetching on the data side so that you make sure you get the data in
the design when you need it, and you're not sitting around waiting for it. So you
have to marry that all together, and we've got a lot of great techniques in order to
achieve that. Fundamentally, as I say, you need to achieve the performance target,
and then everything else comes together to achieve that performance target with the
best possible energy efficiency. How did Moore's Law affect computer
architectures of the past, and what will its influence be on future designs? Gordon
Moore's influence on the industry has been massive, and the tenets behind the law
still continue today, albeit in a slightly different form. I mean, I started
designing at 0.18 micron, essentially 180 nanometers, and here we are today working
on 3 nanometers. So it's a vast difference now compared to when I started 22 years
ago. And there's no way we could have got to where we are today without the process
scaling from all of the foundries out there and all the companies that provide the
foundry technology. So it's a little bit like magic, all of the work that they do. I
can't say I understand it in detail, but it's incredible technology, and that allows
us... If it hadn't happened, we'd still be stuck in all the designs which were
fairly simple. There's no way that we'd have got to the sort of designs that we have
today of massively out of order, deep, deep amount of instruction and depth, very,
very wide designs. All of that has been made possible by the steady improvement, a
predictable improvement of the foundries. And that's kind of one of the key points
which really was captured by Moore's law or is captured by Moore's law of that kind
of predictable knowledge that you will get an improvement in the process. Is it 10%,
is it 15? Is it 20% on, say, power, for example? It kind of doesn't matter in a way
because you can work with what you eventually get. You can do things like voltage
scaling to be able to make use of the power that's available to you. Is it 5%? Is it
10% on frequency? Again, it kind of doesn't matter in a way. But what matters is
when we start designing a processor today and we finish it in 18 months' time, and
then two years after that it arrives in products in the shops that consumers can
buy. We know that over that period, the process improvements have happened, which
allow us to liberate essentially more performance and more energy efficiency from the
design. And we don't mind too much if it takes another three months or six months to
get to the process. We don't mind too much if the performance or power is not
exactly where it was predicted at the start. But, ultimately, we know we'll get an
improvement, and we know there'll be an improvement in two years, and three years,
and four years, and Moore's Law may have slowed, but it's certainly not stopped.
[music] As we saw in the last module, instruction level parallelism can be used to
improve program execution time in our microprocessor designs. To enable this, the
compiler creates an optimized instruction schedule when the program is converted
into machine code. Unfortunately, the compiler cannot know precisely what will
happen at run-time, so an in-order design is constrained by the order of instructions in
the program. The compiler won't know what the program's input data will be, whether
branches will be mispredicted, or whether memory accesses hit or miss in our data
cache. In contrast, a superscalar processor with "out-of-order" execution can
produce an instruction schedule at run-time, only constrained by true data
dependencies and its hardware limits. This schedule is produced on demand and so can
even change each time the code runs. To do this, we introduce an "issue window" or
"issue queue" after the Decode stage. This holds instructions until they can be
executed, not necessarily in the order they arrived in. Within this window,
instructions can be issued whenever their dependencies are available, and when a
functional unit is available to process it. To be able to detect when an instruction
is ready to be issued, we must know whether the instruction's dependencies are ready
when it enters the issue window. We must then update this status as new results are
produced. To implement this, the names of result registers of executed instructions
are broadcast to the issue window. The instructions waiting there compare the
register names to the registers they require. However, this scheme has a problem: A
register will be written multiple times in the program, and since the instructions
are executed out-of-order, the register name alone is not sufficient to record
dependencies. It also means that instructions would have to wait until all previous
reads of a register had finished before executing. These are called "false
dependencies." These problems can be resolved by "renaming" register names at run-
time so that each "in-flight" instruction writes to a unique destination register.
We use a "physical register file" that is large enough to ensure we don't run out.
We keep a "register mapping table" to store the mapping between architectural,
compiler-assigned registers, and physical registers. Register reads to the same
architectural register are renamed consistently, so that dependencies can be tracked
correctly with physical register names. Physical registers are reused only when they
are no longer used by any instruction currently in-flight or any entry in the
register mapping table. The other big issue with out-of-order execution is memory
dependencies. Load and store instructions can have memory dependencies because they
access the same memory location. To detect this, we need to compare the computed
memory addresses that the instructions access. We thus split memory operations into
two steps: address calculation and memory access. We issue their address calculation
step as soon as the dependencies are available. Then, the memory access step is
placed in a special load-store queue to be sent to our data cache as soon as
possible. We carefully ensure that operations that access the same address are kept
properly ordered, but independent accesses can be reordered if beneficial. No access
can occur until the addresses of all previous accesses are known. Since memory
writes are irreversible, store instructions must also wait until we are certain that
they will execute. [music]
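A minimal Python sketch of the renaming step described above. The register names and the size of the physical register file are invented, and the sketch ignores freeing registers when instructions commit.

class Renamer:
    def __init__(self, num_physical=8):
        self.free = list(range(num_physical))   # free physical registers
        self.map_table = {}                     # architectural -> physical

    def rename(self, dest, sources):
        """Rename one decoded instruction: sources are read through the current
        mapping, and the destination gets a fresh physical register."""
        renamed_sources = [self.map_table[s] for s in sources]
        phys_dest = self.free.pop(0)            # assumes we never run out
        self.map_table[dest] = phys_dest
        return phys_dest, renamed_sources

r = Renamer()
# r1 = ...; r2 = ...; r1 = r1 + r2: the two writes to r1 get different physical
# registers, so later instructions see no false dependencies.
print(r.rename("r1", []))            # (0, [])
print(r.rename("r2", []))            # (1, [])
print(r.rename("r1", ["r1", "r2"]))  # (2, [0, 1])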
Module 6, Video 2
[music] In the previous video, we outlined the concepts of out-of-order execution,
and register renaming. The issue window will be filled with instructions fetched
along the path that our branch predictor believes the program will take. While we
hope our branch predictor will be correct in most cases, it will sometimes be wrong.
How do we handle such cases? A simple approach is to start by recording the original
program order of the instructions, and then to monitor their progress. We call the
structure that stores the instructions the "reorder buffer." As each instruction
executes and produces a result, we can mark it as done. When the oldest instruction
has completed, we can remove it from the end of the reorder buffer, and the
instruction is said to have "committed." This stream of committed instructions
represents how our program would be executed on a simple in-order pipeline or by an
unpipelined processor. It usefully also provides a point at which we can process
exceptions. For example, if the program divides by zero or attempts to access memory
that does not exist. We also check branch instructions as they complete in order. If
they have been mispredicted, we flush the reorder buffer, our instruction window and
any currently executing instructions and start fetching down the correct path. To
preserve correctness, we must also restore our registers and the register map table
to the values they had when we mispredicted the branch. This can be done with the
aid of a second register map table, updated only when instructions commit in program
order. This can simply be copied to the map table used by our renaming hardware to
"rewind time" for the processor. All the register values we need will be present, as
we don't recycle registers before we know they will not be needed again. In reality,
handling branches in this way is too slow. Processors instead take many copies of
the register map tables and can handle branches as soon as they are resolved, and we
discover they have been mispredicted. They can also selectively neutralize the in-
flight instructions in the datapath that are on the wrong path, rather than flushing
all of these instructions away. [music]
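A simplified Python sketch of a reorder buffer along the lines described: instructions enter in program order, complete out of order, commit in order from the head, and are all discarded on a flush. The capacity is an invented example value.

from collections import deque

class ReorderBuffer:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = deque()        # [instruction, done?] pairs in program order

    def dispatch(self, instruction):
        if len(self.entries) >= self.capacity:
            return False              # buffer full: the pipeline must stall
        self.entries.append([instruction, False])
        return True

    def mark_done(self, instruction):
        # Called when an instruction finishes executing, in any order.
        for entry in self.entries:
            if entry[0] == instruction:
                entry[1] = True

    def commit(self):
        """Remove and return instructions from the head that have completed."""
        committed = []
        while self.entries and self.entries[0][1]:
            committed.append(self.entries.popleft()[0])
        return committed

    def flush(self):
        # On a branch misprediction, discard everything still in flight.
        self.entries.clear()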
Module 6, Video 3
[music] We can now bring everything together and look at what a typical pipeline for
an out-of-order superscalar processor might look like. The Fetch stage is aided by
an accurate branch predictor as we met in Module 3. It will fetch a group of
instructions on every clock cycle. This group of instructions will be requested from
the instruction cache, and will be from consecutive memory locations. Branches may
reduce the number of useful instructions that can, in practice, be fetched on
each cycle. The Decode stage decodes multiple instructions in parallel. At this
point, modern high-performance processors may also split complex instructions into
simpler operations or "macro-ops." In some cases, there may also be opportunities to
combine simple instructions into a single operation. The next step on an
instruction's journey is renaming to receive a unique destination register. As we
saw in the last video, this increases opportunities for out-of-order execution.
Remember, there are several times more physical registers in our processor than
those available to the compiler. Instructions are placed in the reorder buffer, and
are also "dispatched" to the Issue stage. They will wait in the window as necessary,
and are ready to be issued once all their operands are available. In the most
complex of today's superscalar processors, there may be hundreds of instructions
buffered in the issue window at the same time. Instructions finally commit in
program order. At this point, any physical registers that are no longer needed can
be added back to the pool of free registers. These are then assigned during the
register renaming step. Once an instruction is issued, it reads its operands from
the physical register file. The Execute stage consists of many functional units
operating in parallel. These may each support different operations and take
different numbers of cycles to execute. A network of forwarding paths is also
provided to ensure we can execute any dependent instruction on the next clock cycle
after the generation of the result. This requires being able to quickly communicate—
or "forward"— a result from the output of any functional unit, to the input of any
other. Some instructions will need access to memory. After computing their
addresses, they are placed in the processor's load-store queues. "Stores" are sent
to memory in program order, but "loads" can often be sent out of order, and ahead of
other older stores or loads that are not yet ready to be issued to memory. The
memory system reduces the average memory access time by providing numerous levels of
cache memory. After generating results, we write them back to the register file.
This overview is representative of the fastest modern microprocessors found today in
laptops, smartphones and servers. Whilst much extra innovation goes into real
designs, they generally follow the ideas discussed in the course. [music]
Module 6, Video 4
[music] One question computer architects always ask themselves is: "how much can we
scale up our design?" Let's take a look at some further potential optimizations to
our out-of-order superscalar processor. We could try to make it wider. For example,
by doubling the number of parallel instructions, we can fetch, decode and execute
more instructions per cycle. Would this double our performance? Sadly, no, things
are not that simple! In practice, some components quickly become very complex, and
performance gains may be hard to extract. For example, today's largest machines
fetch at most ten instructions per cycle from their instruction caches. Fetching
more instructions than this offers minimal performance gain, despite a large
hardware cost. If we increase the number of registers, the size of our issue window,
the size of our load-store queues, or perhaps use a larger and more accurate branch
predictor, our processor's performance will only improve slightly despite a
significant increase in the size of these structures. After a point, the increase in
performance is no longer worth the cost of the extra transistors. It's also possible
that performance might reduce overall as we may need to lower our clock frequency as
the structures get larger. Finally, we could introduce more pipeline stages, but we
know this doesn't necessarily lead to higher performance, as mispredictions may
become more costly. The combination of these issues means that extracting
performance using instruction-level parallelism alone becomes more expensive as more
performance is sought. This graph shows how the energy cost of executing an
instruction grows quickly as we try to build higher performance processors. Let's
look at some example designs. Suppose we have a core, which requires a certain area.
If we double its area, its performance improves, although there is a small rise in
energy per instruction. If we quadruple its area instead, its performance has now
doubled compared to our original core, while energy has increased by 50 percent.
Going further, if we increase our processor's area by a factor of 10, performance is
only 2.5 times our original core, but energy per instruction is now 3 times
higher. Its performance does not improve as fast as the cost of running it! Of
course, engineers are clever and determined, and are constantly developing new
techniques to bypass many of these issues. This means the performance of processors—
even ones running a single thread or program— still improves by around 10 to 25
percent each year. Nevertheless, ultimately we often need more performance than can
be provided by instruction-level parallelism alone. A modern solution is to employ
multiple processor cores on the same chip—called a "multicore" processor. This
changes the task for programmers; they may need to redesign their programs to take
advantage of such parallelism, but if they can, it can give vast performance
benefits. As we've learned throughout the course, every decision involves trade-offs
and compromise. We are faced with a fascinating but often highly-constrained design
problem. We've seen how performance bottlenecks, that at first seem impassable, can
be overcome with innovative designs. What might the future hold for microprocessors?
Can you think of ideas? What would you design? [music]
Module 6, Lab
[music] In this exercise, we'll be using a simulator to explore an out-of-order
superscalar processor design. The simulator allows us to configure a number of
aspects of our processor, for example, the number of pipeline stages, the width of
the Fetch stage, the size of the issue window, the number of ALUs, and the size of
our re-order buffer. The changes will be reflected in the pipeline diagram, which we
can see below, and also in the statistics below that, with key design metrics, such
as the clock frequency, clock period, and design size, visible below. When we press
"step," the simulation will advance by 1 clock cycle, and so we can see, for
example, on the first clock cycle the first four instructions are fetched, although
three of them are unusable because they follow a taken branch, and therefore they
will not be executed and will be discarded on the next clock cycle. We can press "run" to
watch our design in action and in order to quickly get to the end, we can use the
"fast forward" feature to simulate the millions of instructions in this particular
program. After the simulation is complete, we can check below to see a number of
statistics about our pipeline, which are useful for understanding why the
performance is as it is, and in particular we can see the program execution time.
1.18 milliseconds gives us the overall execution time of the program that our design
achieved, and notably our average instructions per cycle gives us the number of
instructions that we were able to complete on each clock cycle, in this case well
over 1, indicating that we are taking advantage of instruction level parallelism in
this simulation. Below all that, we can see our completed simulations and we can
check back on these as we explore new designs. So let's say, for example, we were
curious about the effect of increasing the re-order buffer size. We could change
that and then press "run" to quickly run our next experiment. And then if we scroll
down to our table, we can compare the results and see that the program execution
time was substantially improved by increasing the size of the re-order buffer.
Although, admittedly, this did come at a not insignificant increase in the area of
our design, and so whether or not this represents a good trade-off in practice would
very much depend on the problem we're trying to solve. In the
exercises, you'll be given a number of scenarios for processors, which generally
revolve around certain constraints on the design size, or the clock frequency, or
the target program execution time. And once you think you've found a design that
meets the goals required, configure it using the settings and then scroll down to
the "submit" button at the bottom, and click that to have your answer checked. Best
of luck. [music]