Design Microprocessor


Design Considerations for a General-Purpose Microprocessor

Benjamin Maytal, Sorin Iacobovici,
Donald B. Alpert, Dan Biran,
Jonathan Levy, and Sidi Yom Tov
National Semiconductor

Any computer design effort must balance cost, performance, and schedule. General-purpose multiprocessors used in many different types of machines impose special challenges.

A version of this article appeared in Proc. 22nd Hawaii Int’l Conf. on System Sciences, Jan. 3-6, 1989, Kailua-Kona, Hawaii.

The challenge of designing a microprocessor lies in finding the optimal balance within the “eternal triangle” of cost, performance, and schedule. Providing a balanced solution involves identifying the best trade-offs for both the microprocessor under design and the system applications it will support.

In this context, we cannot achieve increased performance simply by packing more and more transistors onto the chip and running the processor at higher frequencies. An unbalanced trade-off that emphasizes microprocessor performance alone without considering the system environment can cut the user off from the device’s power and yield an unnecessarily costly and slow system.

Instead, we need a global view of the system for correct functional partitioning between on-chip features and those placed in general-purpose support components or application-specific circuits. Functional partitioning maximizes system performance while simplifying the interchip interfaces at a minimized cost. Most importantly, these design choices do not stop at the system architecture level, but repeat themselves in each design stage and for each section of the microprocessor.

This article examines the performance, cost, and schedule trade-offs made for the NS32532, a 32-bit general-purpose microprocessor produced by National Semiconductor (see Figure 1). Among its features are a 30-megahertz clock frequency, three on-chip caches, a four-stage pipeline, and dedicated mechanisms for multiprocessing support.

The article is divided into six sections. We first describe the design constraints set by the VLSI processing and packaging technologies and then address the issue of market requirements by examining the software and hardware considerations for the microprocessor’s target applications. After describing the functional partitioning choices, including the means for supporting a memory hierarchy and floating-point operations, we present the NS32532’s microarchitecture, including a description of the four-stage pipeline and on-chip caches. We then examine the microprocessor’s system interface, the memory reference transactions, and the instruction-flow and data-flow monitoring mechanisms. Finally, we present an overview of the methodology adopted to accomplish the design within a strict schedule while achieving full functionality and meeting cost and performance goals.

Design constraints

The NS32532’s specification was defined by National Semiconductor’s design engineers, architects, marketing personnel, semiconductor process and packaging specialists, and customers. The perspectives and constraints of each of these groups became the base on which the performance, cost, and schedule trade-offs were made.

Technology. The process team defined the technology constraints. Their goal was

66 0018-9162/89/0100-0066$01.00 © 1989 IEEE COMPUTER


to specify the semiconductor fabrication
process and the packaging characteristics.
Together, these defined the micro-
processor’s size limits and the number of
on-chip transistors. They also determined
the switching speed for logic gates, the
number of external pins, and the maximum
power dissipation.

Process. The process team decided to
fabricate the NS32532 with complemen-
tary metal-oxide semiconductor technol-
ogy, a process that was becoming the indus-
try standard. Earlier microprocessor gen-
erations had been fabricated using N-type
metal-oxide semiconductor technology,
which is somewhat faster, more compact,
and simpler to manufacture than CMOS.
Nevertheless, CMOS shows lower power
dissipation and better noise immunity than
NMOS, essential characteristics for con-
structing a microprocessor that integrates
several hundred thousand transistors.
Developments in semiconductor proc-
essing technology proceed so rapidly that
the design must allow for improvements
from initial fabrication through volume
production. At the time the design started,
National had a 1.5-micrometer process and
was developing a 1.25-micrometer proc-
ess. We expected the 1.25-micrometer
process would be in place two years later, in
time for first fabrication of the NS32532.
The process team decided to target the
NS32532 toward the 1.25-micrometer
process, but to leave open the option of
fabricating the first parts at 1.5 microme-
ters in case the more advanced process was
not available in time. Consequently, the
chip area was constrained by limitations of
the manufacturing equipment for the 1.5-
micrometer process.
Some requirements of a 1-micrometer
process planned to be available at the time
of volume production also affected the
design. We wanted to be able to use the 1-
micrometer process without redesigning
the microprocessor. As a consequence, we
fashioned the tolerance of certain feature
sizes beyond the requirements of the 1.25-
micrometer process to ensure compatibil-
ity with the 1-micrometer process.
In the 1.25-micrometer technology, the
chip size was set at 11.5 millimeters by 14
millimeters (460 mil by 550 mil). This
provided sufficient area for 370,000 tran-
sistors. Simulations of the most critical
circuitry carried out before detailed design
indicated that the NS32532 could reach a
frequency of 30 megahertz.
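
As a rough cross-check of the figures above, the stated die dimensions and transistor count imply an average density of roughly 2,300 transistors per square millimeter. The sketch below computes that and, purely as an illustration, the area that an idealized uniform shrink to the 1-micrometer process would give. The uniform-shrink model and the function names are assumptions made here for illustration; they are not figures or methods from the design team, and real processes do not scale every feature equally.

```python
# Figures quoted in the text for the 1.25-micrometer process.
DIE_WIDTH_MM = 11.5
DIE_HEIGHT_MM = 14.0
TRANSISTORS = 370_000


def die_area_mm2(width_mm, height_mm):
    """Rectangular die area in square millimeters."""
    return width_mm * height_mm


def ideal_shrink_area(area_mm2, from_um, to_um):
    """Area after an idealized uniform linear shrink (an assumption, see lead-in)."""
    scale = to_um / from_um
    return area_mm2 * scale * scale


area = die_area_mm2(DIE_WIDTH_MM, DIE_HEIGHT_MM)
density = TRANSISTORS / area
shrunk = ideal_shrink_area(area, from_um=1.25, to_um=1.0)

print(f"die area: {area:.1f} mm^2")
print(f"average density: {density:.0f} transistors/mm^2")
print(f"idealized area at 1 micrometer: {shrunk:.1f} mm^2")
```

The density is only an average: as the article notes later, devices in the cache arrays are packed much more densely than those in random logic.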

Figure 1. Photograph of the NS32532 microprocessor.

January 1989 67

Package. Advances in packaging technology meant that the microprocessor would be able to access from 150 to 200 external pins. This was a significant advance from earlier microprocessors, which were limited to fewer than 100 pins. The requirement for more signals was driven by the need for greater communication rates with the system to achieve higher performance. We allocated 129 pins for system interface, 39 for supplying power (and minimizing noise problems), and four for clocking.

The package set a power dissipation limit of 4 watts, allowing designers to speed up critical circuit paths by applying pseudo-NMOS techniques (ratioed p- and n-channel devices), which are faster but consume more current. These circuits consumed approximately 25 percent of the chip’s power.

Market requirements. The corporate marketing group responsible for microprocessors identified product requirements and target applications before we started the design. These requirements defined the goals for the technical specifications developed by computer architects in conjunction with VLSI designers at the earliest stages of the design.

One of the primary decisions concerned the instruction set. The marketing group chose to maintain compatibility with two previous generations of the 32000 series microprocessors to support the existing customer base. The 32000 instruction set¹ is characterized by a regular combination of operators, data types, and addressing modes.

Instruction-set compatibility meant that an existing body of software would enable early introduction of products based on the NS32532. Compatibility also shortened the processor’s design schedule because the experience gained and computer-aided design tools developed during previous designs could be applied to the NS32532. Nevertheless, there was concern that instruction-set compatibility might place the NS32532 at a competitive disadvantage with the performance available from newer architectures. Technical analysis showed that compatibility with the 32000 architecture would not limit the microprocessor’s performance if we used appropriate techniques in coordinated development of the microarchitecture and compilers. We present some of these considerations in the next section.

One of the challenges characteristic of designing a general-purpose microprocessor is to make it suitable for a wide variety of systems. A processor designed for a mainframe computer or minicomputer is usually targeted for a single system or a small group of related systems. In contrast, we designed the NS32532 for cost-sensitive embedded control products as well as performance-demanding multiuser systems. When used to control a laser printer, for example, the microprocessor would have to support a closed system with a fixed set of peripheral devices, narrow (8- and 16-bit) buses, and a memory built from relatively slow but inexpensive dynamic RAM. To support a multiprocessing computer for a large organization, the microprocessor would have to provide protection for multiple user tasks, support virtual memory, and operate with fast caches that present a coherent view of shared memory. The requirements of such divergent applications influenced nearly all design choices.

System-partitioning decisions

The advanced technology available for the NS32532 meant that over 350,000 transistors could be integrated to form the microprocessor. This provided sufficient resources for integrating some, but not all, of the essential elements for a microprocessor system. The main functions considered for integration were caches, the memory management unit, the floating-point unit, and the cache controller.

We considered each function for integration according to its impact on system cost and performance in conjunction with chip area and compatibility constraints. The choices were evaluated for several systems (we will consider the previously mentioned laser printer and multiuser computer here).

Before conducting the evaluation, we had to select a work load. We traced a collection of Unix system utilities, all written in C, along with several scientific applications and benchmarks, most of them written in Fortran. In addition, a collection of kernel fragments for certain embedded-control applications was coded in assembly language. The high-level language programs were compiled with optimizing compilers developed for the 32000 architecture.² One observation from this analysis showed that the frequency of instructions and addressing modes for the hand-coded control applications resembled that of the HLL programs. The only significant differences were for bit and logical operations and for multiplication, which occurred more frequently in some control applications than in any of the HLL programs.

Performance simulation showed that on-chip caches of 1.5 kilobytes can boost performance by approximately 50 percent for a multiprocessor system and 100 percent for a laser printer, which uses slower memory. We concluded that integrating such a cache on the chip would provide a valuable cost-performance contribution to systems in general. Further analysis, explained in the next section, resulted in the design of separate instruction and data caches with capacities of 512 and 1,024 bytes, respectively. Together, the caches accounted for 130,000 transistors and 25 percent of the chip area, limiting our consideration of other components to integrate.

Once we had selected the caches for integration, we examined whether they should be virtually or physically addressed. For several reasons, as explained in the following subsection, the functional requirements of many systems demanded physical addresses to access the caches. As a consequence, we also placed the memory management unit on the chip. This was not a difficult decision because the MMU used only 30,000 transistors and less than 10 percent of chip area.

The question of whether to incorporate an on-chip floating-point unit was influenced by the fact that many of the target applications had little need for floating-point arithmetic. Moreover, the size of a high-performance FPU exceeded the available area. (The FPU considered would have used approximately 150,000 transistors. Although the number of transistors is close to that of the caches, the devices are packed much more densely in the cache memory.) We decided to concentrate resources in developing an efficient pipelined interface to an external FPU.

We did not include the cache controller because its benefit did not span all applications. Further, it was difficult to make it sufficiently general-purpose to meet the cache characteristics (degree of associativity, memory-update policy, and line size) required by different systems.

Memory hierarchy support. The on-chip caches are located at the highest point of the system’s memory hierarchy. The character of the memory hierarchy can differ greatly between systems. In cost-sensitive systems, for instance, the NS32532 can be connected directly to the main memory. In performance-demanding systems, the processor can be connected to

an external cache. Because the on-chip caches would be called upon to work with various types of memory hierarchies, we needed to characterize the on-chip caches for correct and efficient operation with different systems.

One key decision involved separating the data and instruction caches to increase the memory bandwidth over that available from a single cache and to avoid conflicts between instruction and data references. Another important design decision concerned the choice of using virtual or physical addresses to access the caches. Although a virtually addressed cache would have been simpler to design, we chose physically addressed caches because they avoided problems associated with virtual aliases (two or more virtual addresses translated to a single physical address). Additionally, using physical addresses makes it unnecessary to invalidate the caches when switching tasks or at other times when the virtual-to-physical mapping is altered. Finally, our choice enabled the implementation of techniques to ensure coherence between the on-chip caches and external memory.

Ensuring the integrity of cached data is a primary concern of engineers designing systems around microprocessors with on-chip caches. Data integrity can be compromised when a direct-memory access device or another processor changes the value of a shared main memory location. Failure to update all microprocessors using that memory location will make the data in their caches inaccurate. The accuracy of cached data (cache coherence) is maintained by updating cached data as it is changed in memory. The cache coherence problem has already been faced in systems constructed with off-chip caches.

The NS32532 addresses the cache coherence issue by offering the system designer a choice between techniques based on software or hardware. The solution (or combination of solutions) can be tailored to meet the system's specific performance, compatibility, and complexity requirements. We explain the cache coherence techniques in the section entitled "System interface" (also see the sidebar entitled "Cache memories").

Cache memories

Cache memories¹,² are high-speed buffer memories used to hold copies of those portions of main memory currently in use. A processor can access information in the cache memory several times faster than the corresponding information in main memory. A cache memory attached to a processor also significantly decreases contention with other bus masters (such as processors and direct-memory access controllers) when accessing the main memory, as most of the processor's memory accesses are satisfied by the processor's private cache.

The principle of locality explains why cache memories capture a large fraction of the main memory references in most cases. Locality has two aspects: spatial and temporal. Spatial locality means that the memory references of a program in the near future are likely to be near the currently referenced memory location. Temporal locality means that the currently referenced memory location is likely to be used in the near future.

A good cache design should minimize the cache access time and the cache miss ratio (the percentage of memory references not satisfied by the cache). The cache design should also keep the cache contents coherent with the main memory and other caches to prevent the processor from using stale data.¹,³

In single-processor systems, the cache coherence problem arises when an I/O operation modifies a memory location copied into the processor's cache. Maintaining cache coherence is more complex in multiprocessing systems because several processors can read from and modify shared memory locations. Cache coherence can be maintained by software, hardware, or a combination of the two.

Software can maintain cache coherence by selectively invalidating the cache and marking memory locations as noncacheable. Maintaining cache coherence entirely by software has several drawbacks:

• The system is incompatible with programs developed without considering cache coherence. This problem can be critical for systems with an open architecture, where supplementing the basic system with additional software and hardware must be simple.
• Errors can arise because there is no clear method for identifying in advance all the circumstances under which memory locations can be modified.
• Performance can suffer because the cache is unnecessarily invalidated and restricted in its use.

Hardware schemes for maintaining cache coherence are commonly based on a single system bus. The cache for each processor observes the memory writes performed by other processors and I/O devices; any copy of the modified location is invalidated or updated. Some systems use write-through caches so all memory modifications can be observed on the bus. Multiprocessing systems often implement more-complex bus protocols that support write-back caches, which can enable the use of more processors by reducing bus traffic.

The on-chip caches of the NS32532 are a major factor in achieving high CPU and system performance. However, these caches represent an extra level in the system's memory hierarchy, which might consist of the on-chip caches, an external cache, and main memory. The NS32532 design provides mechanisms for keeping this memory hierarchy (or a simpler one) coherent.⁴

References

1. A.J. Smith, "Cache Memories," Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.
2. M.D. Hill and A.J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories," Proc. 11th Ann. Symp. Computer Architecture, CS Press, Los Alamitos, Calif., Order No. 538 (microfiche only), 1984, pp. 158-166.
3. J. Archibald and J.-L. Baer, "An Economical Solution to the Cache Coherence Problem," Proc. 11th Ann. Symp. Computer Architecture, CS Press, Los Alamitos, Calif., Order No. 538 (microfiche only), 1984, pp. 355-362.
4. S. Iacobovici and M. Baron, "Integrated MMU, Cache Raise System Level Issues," Computer Design, Vol. 26, No. 10, May 15, 1987, pp. 75-79.

Floating-point support. The main disadvantage of implementing the FPU on a separate chip from the CPU is the communication overhead. The NS32532 minimizes the communication overhead by

implementing a pipelined interface protocol to an external FPU. One advantage of placing the FPU externally to the CPU is an increased range of cost-performance options. The following are possible configurations:

• No FPU. Floating-point instructions are emulated at no cost in software, but their performance is 50 to 100 times slower than integer operations.
• A moderately priced, single-chip FPU, like the NS32381. Floating-point operations are approximately 10 times slower than integer operations.
• A high-performance, multichip FPU, like the combination of NS32580 and WTL3164 chips. Floating-point operations are executed as rapidly as integer operations.

The NS32532-NS32580-WTL3164 cluster implements the pipelined FPU protocol. Up to five sequential floating-point instructions can be processed simultaneously. The NS32532 sends instruction opcodes and operands to the NS32580 and, in most cases, continues to the next floating-point instruction without waiting for the instruction to complete. The instruction's address (the CPU's program counter) is saved within the NS32532 to enable correct instruction restartability for exception handling. The NS32580 receives the instructions and operands from the NS32532 and controls the WTL3164 floating-point data processor.

Microarchitecture

The NS32532's microarchitecture is based on the system-partitioning decisions. It includes a four-stage pipeline connected to the MMU through a virtual-address bus and to the data cache through a data bus (see Figure 2). The pipeline also connects to the instruction cache through an instruction address bus and data bus. The pipeline executes instructions at an average throughput of 3 cycles per instruction in the absence of storage delays, and 3.7 cycles per instruction with typical hit ratios for the on-chip caches and zero-wait-state external memory (see Table 1).

Figure 2. Microarchitecture and system interface of the NS32532.

Table 1. Performance factors.

Performance component    Cycles per instruction
Ideal execution          2.0
Execution delays         0.3
Data dependencies        0.1
Control dependencies     0.6
Storage delays           0.7
Total                    3.7

Instruction pipeline structure. The execution pipeline operates in four stages (see Figure 3):

(1) Fetch instruction.
(2) Decode instruction.
(3) Calculate addresses and read source operands.
(4) Calculate results and write destination operands.

Figure 3. Pipeline structure of the NS32532.

The four stages are combined with three buffers to smooth the flow of instructions through the pipeline. An 8-byte queue following the instruction-fetch stage buffers and aligns instructions of byte-variable length. A buffer following the instruction-decode stage can hold one fully decoded instruction ready for processing. Another buffer at the end of the pipeline can hold two results destined for memory. The results are written when the bus is free, thereby permitting overlapped execution of instructions that read from and write to memory.

The pipeline's regular structure allows instruction fetching, data memory references, and instruction execution to proceed in parallel. Data dependencies between instructions are automatically handled by hardware interlocks. The optimizing compiler schedules instructions to minimize delays due to dependencies.

Branch prediction. We examined several techniques for handling branches to sustain the pipeline's instruction throughput. Branches and other forms of control transfer (procedure call, return, and jump) account for 23 percent of instructions in the work load. This results in approximately one of six instructions transferring to a nonsequential location. The penalty for taking branches in pipelined computers is heavy because all prefetching and preprocessing performed on instructions following branches must be discarded. This penalty increases with the depth of the pipeline. The delay for control transfers would have been 0.9 cycles per instruction with no branch prediction mechanism.

We considered numerous branch handling schemes from the literature for implementation. Dynamic branch prediction techniques, where a table records the history of encountered branches, showed success rates up to 90 percent, but only for tables of size comparable to the caches. Such solutions were too costly. Instead, we used a relatively simple static branch-prediction mechanism. The prediction is based on the direction (forward or backward) of the branch and the type of condition tested. The prediction, which is made as the instruction is decoded, is correct for 71 percent of control transfers, resulting in a saving of 0.3 cycles per instruction. (Of the control delays, approximately 0.2 cycles per instruction arise from procedure returns. Such transfers can only occur after the return address has been read from memory, so the branch prediction mechanism is ineffective here.)

On-chip caches. The NS32532 holds three on-chip caches (see Table 2): an instruction cache, a data cache, and a translation look-aside buffer within the MMU. The on-chip caches serve to reduce the average memory-access time. Access time is determined by three major factors:

• cache hit ratio (the fraction of references located in the cache),
• access time for hits, and
• access time for misses.

The designer controls these factors by determining the cache size, organization, and replacement policy. Careful circuit design of the caches is essential to achieving fast access times.

Table 2. On-chip caches.

                         Instruction cache    Data cache
Total size               512 bytes            1,024 bytes
Line size                16 bytes             16 bytes
Associativity            Direct-map           Two-way
Replacement algorithm    N/A                  Least-recently-used
Update policy                                 Write-through

Figure 4. Data cache structure.

Simulations showed that a design with 512 bytes of instruction cache and 1,024 bytes of data cache would be balanced, delivering hit ratios of 82 percent and 84 percent, respectively. The pipeline proved to be less sensitive to the hit ratio of the instruction cache than that of the data cache because prefetching and the use of burst transfers on the bus reduced the delay for external instruction fetches.

Data cache. The data cache design goal was to provide a hit access time of one cycle and a miss delay of only two cycles.

The data cache stores 1,024 bytes using a two-way, set-associative organization. The cache occupies 10 percent more area and has a miss ratio 20 percent better than a direct-map cache. Higher degrees of associativity presented difficult layout problems and contributed less to performance than the step from direct-map to two-way associative. We chose a line length of 16 bytes primarily because it suited the instruction cache. A sub-block scheme is used where only 4 bytes need to be loaded on a miss rather than the entire line. We made this choice to reduce utilization of the external bus, especially for slower memory systems.
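
The cache organizations above imply a particular split of each physical address into tag, set index, and byte offset. The following is a minimal sketch of that arithmetic using the sizes from Table 2 and the 4-byte sub-block size from the text. The class and function names are hypothetical, and the sketch models only the address decomposition, not the NS32532's actual circuits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    total_bytes: int
    line_bytes: int
    ways: int

    @property
    def sets(self) -> int:
        # Lines are grouped into sets of `ways` lines each.
        return self.total_bytes // (self.line_bytes * self.ways)

    @property
    def offset_bits(self) -> int:
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()

    def split(self, physical_address: int):
        """Return (tag, set index, byte offset) for a physical address."""
        offset = physical_address & (self.line_bytes - 1)
        index = (physical_address >> self.offset_bits) & (self.sets - 1)
        tag = physical_address >> (self.offset_bits + self.index_bits)
        return tag, index, offset


# Sizes taken from Table 2.
INSTRUCTION_CACHE = CacheGeometry(total_bytes=512, line_bytes=16, ways=1)
DATA_CACHE = CacheGeometry(total_bytes=1024, line_bytes=16, ways=2)

SUB_BLOCK_BYTES = 4  # only this much is loaded on a data cache miss


def sub_block(offset: int) -> int:
    """Which 4-byte sub-block of a 16-byte line holds the given byte offset."""
    return offset // SUB_BLOCK_BYTES


tag, index, offset = DATA_CACHE.split(0x12345678)
print(DATA_CACHE.sets, DATA_CACHE.index_bits, DATA_CACHE.offset_bits)
print(tag, index, offset, sub_block(offset))
```

Both caches come out at 32 sets, which is consistent with the later statement that the instruction cache is identical in design to a single bank of the data cache. A 16-byte line divided into 4-byte sub-blocks needs four validity bits per line.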

quency ( a 30-megahertz CPU requires a
60-megahertz input clock). On-chip cir-
cuits divide the frequency in half to obtain
a two-phase (nonoverlap) internal clock.
Address One of these phases is sent off the chip and
pins forms the bus clock (BCLK). Most
NS325.12 timing parameters are specified
relative to the BCLK edges. The timing i j
Cancel bus optimized for the convenience of the sys-
transaction tem designer. The supplied inverse of the
Data system clock (BCLK) can minimize the
clock skew in system timing.
As explained previously. the integration
of on-chip caches provides a performance
boost across the target applications. The
on-chip instruction and data caches can be
accessed at the same time. in one clock
cycle. by processing elements in the execu-
Figure 4 shows the organization of the T o eliminate the need to translate the tion pipeline. As aresult. the peak NS32532
data cache. Note that the valid bits are dual address for most instruction fetches. the internal transfer rate (memory bandwidth)
ported. We will explain the use of these physical address for the current page is held is 240 megabytes per second at 30 mega-
ports for maintaining cache coherence in in the instruction cache. A reference to the hertz.
the next section. MMU is made only when a code access Nevertheless. the amount of cache
A write-through policy avoids cache crosses to another page. memory we could place on the chip ( 1 .S
coherence problems and simplifie5 im- kilobytes total) was relatively small. and
plementation. A least-recently-used policy T L B . The IS-cycle delay for handling we recognized that the highest-perform-
selects between the two banks of the data TLB mijses is considerably longer than the ance applications would use a large exter-
cache. delay for instruction or data cache misses. nal cache in conjunction with the micropro-
The biggest problem in the data cache Therefore. a higher hit rate is required. The cessor. For such applications we needed to
design was providing a hit access time of target was set to 90 percent, which we tune the access path for misses of the on-
one cycle using physical addresses. We achieved with a 64-entry. fully associative chip caches. We designed the bus to pro-
took advantage of the fact that the less- TLB. vide the physical addressexternally as soon
significant bits of the virtual address (in- A guaranteed translation time of 13 as the address translation is completed,
page address) are identical to the physical nanoseconds was needed to achieve the thereby allowing access to an external
address. The virtual address for a source data cache access in one cycle. We used cache to begin while the lookup in the on-
operand i s simultaneously presented to power-consuming static circuits to achieve chip cache is in progress. In the event of a
both the translation look-aside buffer and this goal. on-chip cache hit, the bus transaction can-
the data cache (see Figure 5 ) . While the cels. indicated by acontrol signal pin. In the
TLB translates the virtual page address. the System interface event of a cache miss, the bus transaction to
address tags selected by the low-order untranslated bits of the virtual address are read from the data cache. Following translation, the physical address is sent simultaneously to the cache and output pins, and a bus transaction is initiated. The two address tags from the data cache are then compared with the physical address. If there is a cache hit, the bus transaction is cancelled as indicated by a control signal pin. If there is a cache miss, the bus transaction to read the data from external memory continues with no delay caused by the presence of the on-chip cache.

Instruction cache. The instruction cache is a 512-byte, direct-mapped cache identical in design to a single bank of the data cache. The instruction cache derives much of its effectiveness from capturing loops, so the set-associative organizations show less improvement in miss ratio than they do for the data cache.

The requirement that the microprocessor support a wide variety of applications strongly influenced the definition of the system bus interface. More specifically, the target systems' cost-performance requirements related directly to the highly differing characteristics of their memory hierarchies. As a consequence, the issue of on-chip cache memory was central to many design decisions. A straightforward example was the use of burst transfers on the 32-bit data bus to enable a complete cache line to be filled in a single bus transaction. Other aspects of the decision to integrate caches on the chip influenced the use of techniques for efficient miss handling, cache coherence, and observing the internal operation of the microprocessor.

The NS32532's system interface is represented in Figure 2. The single-phase clock (CLK) input of the processor accepts a signal at twice the chip's operating frequency [...]

[...] read the data from external memory continues with no delay caused by the presence of the on-chip cache.

Cache coherence. When designing an on-chip cache for a microprocessor intended for a variety of system configurations, it is important to provide a flexible and complete set of cache coherence mechanisms. Software coherence mechanisms can be appropriate for small single-processor applications, where the cost of maintaining cache coherence with hardware is unacceptable. Hardware mechanisms are essential for shared-memory multiprocessing systems.

As shown in Table 3, the NS32532 incorporates several software and hardware mechanisms for maintaining coherence between its on-chip caches and external memory.[5] Coherence requirements of various systems can be accommodated by selecting the most appropriate mechanisms.
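
The data cache read sequence described earlier in this section (select a set with untranslated address bits, translate, then compare both tags with the physical address) can be illustrated with a small model. This is only a sketch under stated assumptions, not the NS32532 implementation: the 512-byte bank size is inferred from the statement that the instruction cache matches a single data cache bank, the 16-byte line size is the one the article gives for data cache lines, and the 4,096-byte page size, the `page_table` dictionary, and all function and class names are hypothetical.

```python
LINE_BYTES = 16     # bytes of data per cache line (stated in the article)
BANK_BYTES = 512    # assumed: one data-cache bank equals the 512-byte instruction cache
NUM_SETS = BANK_BYTES // LINE_BYTES   # 32 sets under that assumption
WAYS = 2            # two-way set-associative
PAGE_BYTES = 4096   # assumed page size; only address bits above it are translated


def split_address(addr):
    """Split a byte address into (tag, set_index, byte_offset)."""
    return addr // BANK_BYTES, (addr // LINE_BYTES) % NUM_SETS, addr % LINE_BYTES


class DataCache:
    def __init__(self):
        # tags[set][way] holds a physical tag, or None when the line is invalid.
        self.tags = [[None] * WAYS for _ in range(NUM_SETS)]

    def fill(self, phys_addr, way):
        tag, set_index, _ = split_address(phys_addr)
        self.tags[set_index][way] = tag

    def read_hits(self, virt_addr, page_table):
        # 1. Select the set using untranslated low-order bits of the virtual address.
        _, set_index, _ = split_address(virt_addr)
        candidate_tags = self.tags[set_index]
        # 2. Translate; in hardware the bus transaction would be started here.
        phys_page = page_table[virt_addr // PAGE_BYTES]
        phys_addr = phys_page * PAGE_BYTES + virt_addr % PAGE_BYTES
        # 3. Compare both tags with the physical tag. A hit would cancel the
        #    bus transaction; a miss would let it continue without delay.
        phys_tag, _, _ = split_address(phys_addr)
        return phys_tag in candidate_tags


cache = DataCache()
page_table = {0x2: 0x5}      # hypothetical mapping: virtual page 2 -> physical page 5
cache.fill(0x5010, way=0)    # line containing physical address 0x5010 is cached
print(cache.read_hits(0x2010, page_table))   # True: hit, transaction cancelled
print(cache.read_hits(0x2020, page_table))   # False: miss, transaction continues
```

The set index uses only bits inside the page offset, which is what allows the tag read to begin before translation completes.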

In software, pages can be marked noncacheable, and a cache invalidation instruction is available. When the instruction is executed, the contents of the on-chip instruction and data caches are invalidated and status information is displayed on the microprocessor's external bus. Another bus signal indicates when an off-chip access is to a noncacheable page. Thus, software can control an external cache in exactly the same manner as the on-chip caches.

The microprocessor's hardware-based coherence mechanisms include a bus of eight pins that controls total or partial invalidation of the on-chip instruction and data caches. Figure 4 shows the organization of the two-way set-associative data cache and its connection with the invalidation bus. Each of the cache lines within a set has an address tag, 16 bytes of data, and four dual-ported validity bits. Both lines of a data cache set can be invalidated using the invalidation bus. Because the validity bits are dual-ported, invalidation of the on-chip caches occurs without interfering with ongoing cache accesses or bus transactions.

Table 3. Cache coherence mechanisms.

Software                        Hardware
Mark page noncacheable          Cache-inhibit input signal
Invalidate block if in cache    Invalidate set in cache
Invalidate entire cache         Invalidate entire cache

The NS32532's hardware-based mechanism for maintaining cache coherence can be applied in a variety of ways. We present three examples here. In the first application (see Figure 6), the NS32532 operates in conjunction with an external bus-watcher circuit. The bus watcher observes the microprocessor's bus transactions to maintain a duplicate copy of the on-chip cache tags, while monitoring writes to main memory on the system bus. When the bus watcher detects that a location in the on-chip cache has been modified in main memory, it signals the microprocessor and sends the set number on the invalidation bus.

Figure 6. Cache coherence using a bus watcher.

In the second application, cache coherence is maintained without an external copy of the on-chip cache's tags. This is accomplished by connecting the invalidation bus lines that select the cache's set number to the appropriate address signals of the system bus. Whenever a memory location is modified, the set where the location can be stored in the on-chip cache is invalidated. For example, when a write to location 4,096 occurs, set number 0 is invalidated. This ensures that no copy of location 4,096 exists in the cache, although copies of other locations (for example, 512 and 1,024) in the cache may have been unnecessarily invalidated. The performance impact of eliminating the bus watcher is minor when the rate of invalidations is much lower than the rate of memory accesses by the microprocessor (this is the case for many single-processor applications).

The third application is appropriate for multiprocessing systems and others that use an external cache with the microprocessor. Coherence between the external cache and main memory is maintained using any scheme selected by the system designer, while the microprocessor's invalidation bus is used to maintain coherence between the on-chip and external caches. To ensure the latter, it is sufficient to invalidate a set from the on-chip caches whenever a corresponding line in the external cache is updated or invalidated.[6] The external cache serves as a filter to keep potential invalidations on the system bus from affecting the microprocessor.

External monitoring. Integrating cache memory improves a microprocessor's performance by locating the majority of memory references on the chip. This performance benefit, however, renders invisible much of the microprocessor's activity that would otherwise appear on its bus, making it more difficult to accomplish system debugging. Special mechanisms incorporated into the NS32532 overcome these potential limitations.

The NS32532 displays information for each memory reference that appears on its bus, enabling an external copy to be maintained for the on-chip caches and TLB tags. For example, for each cacheable data reference, a signal shows into which element of the selected cache set the data will be placed. The NS32532's on-chip caches can be disabled by software control of special bits in the processor's configuration register. In addition, break points can be established upon reference to instruction and data locations that might hit in the on-chip caches and not appear on the external bus.

System design examples. Figure 7 represents a simple single-processor system using the NS32532. Because of the microprocessor's high integration, the core-computing cluster consists of only the processor and a crystal oscillator. An optional FPU and/or interrupt control unit can be added to the computing cluster. The rest of the system simply consists of the main memory and peripheral devices (for example, disk and communication controllers). If this single-processor system does not use a direct-memory access device, the on-chip cache coherence problem does not
exist (only the CPU updates the main memory). However, if the main memory can be updated by a direct-memory access device, the on-chip cache coherence should be preserved either in software or (preferably) in hardware without the help of a bus watcher.

Figure 7. Simple single-processor system using the NS32532.

In a multiprocessor system, a large cache is usually attached to the CPU to minimize the system bus traffic. The memory hierarchy in this case consists of the NS32532's on-chip caches, the external cache, the main memory, and secondary memory (such as disks). Figure 8 represents a possible multiprocessor system configuration. As previously explained, the coherence between the caches and the main memory can be preserved using a bus watcher mechanism.

Design methodology

The goal of the design methodology developed for the NS32532 was the construction of a highly integrated, high-performance device within a strict schedule.[7] We met the goal by developing a hierarchical methodology that relied on automatic and semiautomatic tools in the hands of approximately 30 design engineers. The methodology advanced that schedule at an acceptable cost in hardware and software development as well as chip size.

The hierarchy allowed each block's design to proceed independently of every other block. The design was done in several stages, with each providing a base for the next. A major aspect of each stage was testing a block for correct construction and coherence with other stages. Tests at the highest level ensured that design flaws were detected early.

The use of automatic and semiautomatic tools for translating the basic block's functional description to a layout ensured that errors did not creep in at the design's lower level. This means the geometric patterns are placed on masks used during fabrication of the microprocessor.

High-level design. The high-level design of the NS32532 described the microprocessor in terms of its major blocks and their interconnections. This stage proceeded in two phases. First, the chip architecture specification was translated to a behavioral description. Logic tests checked the correctness of the high-level structures and protocols. Second, the behavioral description was broken down and translated to a hierarchy of functional units. The leaves of this hierarchy, known as basic blocks, were groups of random logic, programmable logic arrays, or special structures like an arithmetic logic unit or a register. Coding of this description was executed with a hardware definition language developed at National Semiconductor.

Massive logic testing was performed on the microprocessor's functional description model. More than five million machine cycles were run prior to fabricating the first parts. The tests consisted mainly of randomly generated patterns with a high frequency of external events (like interrupts) and mixes of all instructions and addressing modes.

Low-level design. The low-level design was generated by translating the basic block functional description to layout. This stage was broken down into three steps.

Step one identified the design restrictions for each basic block. The setup time and input capacitances were estimated for each block's input signals. Designers also estimated the valid time and capacitance load for each block's output signals. A special computer-aided design tool calculated the required capacitance load on each signal and flagged contradictions between the source and destination specifications. These estimated timing and capacitance values were replaced by actual values after the layout was complete.

Step two translated the functional description to a circuit description. Automated synthesis tools used the interface specifications and hardware description language for creating the programmable logic array and random logic transistor descriptions. Automation ensured that no logic or speed flaws entered the design, and minimized other potential circuit problems. All paths were automatically checked to determine whether they satisfied the speed requirements.

Automatically generated random logic was 20 percent larger than manually designed logic. This sacrifice in area was justified by the savings in design time, which was shorter by a factor of three.

Step three, which took place after the design was complete, verified the coherence between the low-level and high-level descriptions. Another computer-aided design tool verified adherence to circuit design methodology rules.
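
The step-one consistency check between source and destination specifications might look roughly like the sketch below. The article does not describe the tool's rules, so everything here is an assumption made for illustration: the `Source` and `Destination` records, the rule that the required load is the sum of destination input capacitances plus an optional wiring term, the rule that output valid time plus destination setup time must fit in one clock cycle, and all of the numbers.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Estimated output-side specification of a signal (hypothetical fields)."""
    valid_time_ns: float   # time after the clock edge at which the output is valid
    max_load_pf: float     # capacitance load the output was estimated to drive


@dataclass
class Destination:
    """Estimated input-side specification of a signal (hypothetical fields)."""
    setup_time_ns: float
    input_cap_pf: float


def check_signal(name, source, destinations, cycle_ns, wire_cap_pf=0.0):
    """Return (required_load_pf, problems) for one signal."""
    problems = []
    # Required load: assumed to be wiring plus all destination input capacitances.
    required_load = wire_cap_pf + sum(d.input_cap_pf for d in destinations)
    if required_load > source.max_load_pf:
        problems.append(f"{name}: required load {required_load:.2f} pF "
                        f"exceeds estimate {source.max_load_pf:.2f} pF")
    # Timing: assumed rule that valid time plus setup time must fit in the cycle.
    for d in destinations:
        if source.valid_time_ns + d.setup_time_ns > cycle_ns:
            problems.append(f"{name}: valid {source.valid_time_ns} ns + setup "
                            f"{d.setup_time_ns} ns exceeds cycle {cycle_ns} ns")
    return required_load, problems


src = Source(valid_time_ns=20.0, max_load_pf=1.0)
dests = [Destination(setup_time_ns=5.0, input_cap_pf=0.4),
         Destination(setup_time_ns=15.0, input_cap_pf=0.7)]
load, problems = check_signal("example_signal", src, dests, cycle_ns=33.0)
for p in problems:
    print(p)
```

In this made-up case both kinds of contradiction are flagged: the destinations present more capacitance than the source estimate allows, and the second destination's setup time does not fit in the cycle.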

Figure 8. Multiprocessor system using the NS32532.
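
The bus watcher shown in Figure 8, and the simpler scheme described earlier in which the set-number lines of the invalidation bus are tied to system address lines, can be compared with a small model. This is a sketch under assumptions, not the actual circuit: the 32-set geometry is inferred from a 512-byte bank with 16-byte lines, the replacement choice is arbitrary, and the class and function names are hypothetical. It replays the article's example of a write to location 4,096 while locations 512 and 1,024 are cached.

```python
LINE_BYTES = 16   # bytes per cache line (from the article)
NUM_SETS = 32     # assumed: 512-byte bank / 16-byte lines
WAYS = 2          # two-way set-associative data cache


def set_index(addr):
    return (addr // LINE_BYTES) % NUM_SETS


class OnChipCache:
    def __init__(self):
        # Each set is a list of the line numbers it currently holds.
        self.sets = [[] for _ in range(NUM_SETS)]

    def load(self, addr):
        line = addr // LINE_BYTES
        ways = self.sets[set_index(addr)]
        if line not in ways:
            if len(ways) == WAYS:
                ways.pop(0)   # arbitrary replacement choice for this sketch
            ways.append(line)

    def holds(self, addr):
        return addr // LINE_BYTES in self.sets[set_index(addr)]

    def invalidate_set(self, index):
        self.sets[index].clear()


class BusWatcher:
    """Keeps a duplicate of the on-chip tags and invalidates only on a match."""

    def __init__(self, cache):
        self.cache = cache
        self.shadow = OnChipCache()   # duplicate copy of the on-chip tags

    def cpu_fill(self, addr):
        self.cache.load(addr)
        self.shadow.load(addr)

    def system_write(self, addr):
        if not self.shadow.holds(addr):
            return False              # write filtered: nothing to invalidate
        self.cache.invalidate_set(set_index(addr))
        self.shadow.invalidate_set(set_index(addr))
        return True


def direct_invalidate(cache, addr):
    """No duplicate tags: every observed write invalidates the matching set."""
    cache.invalidate_set(set_index(addr))


watched = OnChipCache()
watcher = BusWatcher(watched)
for a in (512, 1024):
    watcher.cpu_fill(a)
filtered = watcher.system_write(4096)   # 4,096 is not cached, so no invalidation

unwatched = OnChipCache()
for a in (512, 1024):
    unwatched.load(a)
direct_invalidate(unwatched, 4096)      # set 0 is cleared; 512 and 1,024 are lost
```

With the watcher, locations 512 and 1,024 survive the write. Without it they are invalidated unnecessarily, which is the cost the article judges minor when invalidations are rare compared with memory accesses.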

Layout. The layout for random logic, programmable logic arrays, and on-chip memories was created automatically. Chip-level routing was performed semiautomatically. While the automatic layout of the programmable logic arrays and memories was very efficient, the random logic layout and global routing optimization cost about 10 percent in total chip size. Layout time for this activity was efficient.

The chip size was minimized by manually laying out the basic cells and the special structures.

Postsilicon debug and correction. The role of the design methodology did not end with generating the masks for the first parts. The major vehicle for postsilicon logic debugging was a specialized functional tester developed at National Semiconductor. The testing was based, as in the presilicon stage, on automatically generated random patterns with frequent external events. The frequency of events in this tester was about 1,000 times greater than in an actual system environment. Since the tests ran at full speed, their coverage was more extensive than the presilicon simulations.

The design of a general-purpose microprocessor must strike a balance among conflicting requirements of performance, cost, and schedule, while adhering to technological constraints. This balance cannot be based on the considerations of the processor alone, but must be made in the context of various system applications. Many important design decisions concern the partitioning of system functions between those integrated on the chip with the processor and those implemented externally. More specifically, the decisions that concern integrating components of the memory hierarchy have the greatest impact on system performance, cost, and functionality.

Acknowledgments

The authors acknowledge the dedication of all the National Semiconductor employees who contributed to the NS32532 design, as well as our management and customers. We single out Jay Finkelstein, who helped us prepare the draft of this article.

References

1. C. Hunter, Series 32000 Programmer's Reference Manual, Prentice Hall, 1987.
2. C. Bendelac and G. Erlich, "CTP - A Family of Optimizing Compilers for the NS32532 Microprocessor," Proc. Int'l Conf. Computer Design (ICCD 88), CS Press, Los Alamitos, Calif., Order No. FJ872, 1988, pp. 247-250.
3. S. Iacobovici, "A Pipelined Interface for High Floating-point Performance with Precise Exceptions," IEEE Micro, Vol. 8, No. 3, June 1988.
4. J.E. Smith, "A Study of Branch Prediction Strategies," Proc. Eighth Ann. Symp. Computer Architecture, CS Press, Los Alamitos, Calif., Order No. 346 (microfiche only), 1981, pp. 135-148.
5. D. Alpert, J. Levy, and B. Maytal, "Architecture of the NS32532 Microprocessor," Proc. Int'l Conf. Computer Design (ICCD 87), CS Press, Los Alamitos, Calif., Order No. FJ802, 1987, pp. 168-177.
6. S. Iacobovici and M. Baron, "Integrated MMU, Cache Raise System Level Issues," Computer Design, Vol. 26, No. 10, May 15, 1987, pp. 75-79.
7. U. Weiser et al., "Design of the NS32532 Microprocessor," Proc. Int'l Conf. Computer Design (ICCD 87), CS Press, Los Alamitos, Calif., Order No. FJ802, 1987, pp. 177-180.

Benjamin Maytal is manager of a VLSI department at National Semiconductor (Israel). His interests include computer architecture, CAE tools for VLSI design, and fault tolerance.
Maytal received the BSEE degree from the Institute of Technology (Technion), Israel, in 1979.

Sorin Iacobovici is a senior computer and system architect with National Semiconductor in Santa Clara, California. His interests include computer architecture, high-performance computer systems design, and performance analysis and modeling.
Iacobovici holds an MSEE degree from the Polytechnic Institute of Bucharest, Romania.

Donald B. Alpert is architecture group manager at National Semiconductor's VLSI design center in Herzliya, Israel. His interests include computer architecture, microprocessor design, and memory hierarchy organization.
Alpert received the BSEE from the Massachusetts Institute of Technology and the MSEE and PhD degrees from Stanford University.

Dan Biran is an engineering manager in the VLSI design department of National Semiconductor (Israel). He worked on the design and testing of the NS32532. His interests include computer architecture and VLSI design and testing.
Biran received the BSEE degree from the University of Tel Aviv, Israel.

Jonathan Levy is an engineering manager in National Semiconductor's microprocessor group, where he was responsible for the definition and design of the bus interface unit, memory management unit, and data cache for the NS32532.
Levy received the BSEE degree from the University of Tel Aviv, Israel, in 1980.

Sidi Yom Tov is a senior VLSI design manager with National Semiconductor's microprocessor group. He served as the project manager of the NS32532 chip design. His main interests are VLSI design methodologies and advanced CAE tools.
Sidi received the BSEE degree from the University of Beer Sheeva, Israel, in 1978.

Readers may contact Sorin Iacobovici at National Semiconductor, MS D3678, Dept. 02-8079, 2900 Semiconductor Drive, Santa Clara, CA 95051.