Fig. 1. NoC topologies: (a) SPIN, (b) CLICHÉ, (c) Torus, (d) Folded torus, (e) Octagon, (f) BFT.
Kumar et al. [18] have proposed a mesh-based interconnect
architecture called CLICHÉ (Chip-Level Integration
of Communicating Heterogeneous Elements). This
architecture consists of an m × n mesh of switches inter-
connecting computational resources (IPs) placed along with
the switches, as shown in Fig. 1b in the particular case of
16 functional IP blocks. Every switch, except those at the
edges, is connected to four neighboring switches and one
IP block. In this case, the number of switches is equal to the
number of IPs. The IPs and the switches are connected
through communication channels. A channel consists of
two unidirectional links between two switches or between a
switch and a resource.
Dally and Towles [19] have proposed a 2D torus as an
NoC architecture, shown in Fig. 1c. The Torus architecture
is basically the same as a regular mesh [22]; the only
difference is that the switches at the edges are connected to
the switches at the opposite edge through wrap-around
channels. Every switch has five ports, one connected to the
local resource and the others connected to the closest
neighboring switches. Again, the number of switches is
equal to the number of IPs, N. The long end-around connections
can yield excessive delays. However, these can be avoided by folding the
torus, as shown in Fig. 1d [28]. This renders the structure more
suitable for VLSI implementation and, consequently, in our
further comparative analysis, we consider the Folded Torus
of Fig. 1d.
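As an illustration (ours, not from the paper), the wrap-around channels amount to taking neighbor indices modulo the array size; torus_neighbors and mesh_neighbors are hypothetical helpers:

```python
# Sketch (ours): switch neighbors in a k x k 2D torus. Edge switches
# reach the opposite edge through wrap-around channels, so every
# switch has exactly four neighbors.
def torus_neighbors(x: int, y: int, k: int) -> list[tuple[int, int]]:
    return [((x - 1) % k, y), ((x + 1) % k, y),   # left/right with wrap
            (x, (y - 1) % k), (x, (y + 1) % k)]   # down/up with wrap

# In a plain mesh, the edge links are simply absent:
def mesh_neighbors(x: int, y: int, k: int) -> list[tuple[int, int]]:
    cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in cand if 0 <= a < k and 0 <= b < k]

print(torus_neighbors(0, 0, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
print(mesh_neighbors(0, 0, 4))   # [(1, 0), (0, 1)]
```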
Karim et al. [20] have proposed the OCTAGON MP-SoC
architecture. Fig. 1e shows a basic octagon unit consisting of
eight nodes and 12 bidirectional links. Each node is
associated with a processing element and a switch.
Communication between any pair of nodes takes at most
two hops within the basic octagonal unit. For a system
consisting of more than eight nodes, the octagon is
extended to multidimensional space. The scaling strategy
is as follows: Each octagon node is indexed by the 2-tuple
i. ,, i. , 2 0. 7. For each i 1, 1 2 0. 7, an octagon is
constructed using nodes f1. ,. , 2 0. 7g, which results in
eight individual octagon structures. These octagons are then
connected by linking the corresponding i nodes according
to the octagon configuration. Each node 1. J belongs to
two octagons: one consisting of nodes f1. ,, 2 0. 7g and
the other consisting of nodes fi. Ji 2 0. 7g. Of course, this
type of interconnection mechanism may significantly
increase the wiring complexity.
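The basic unit is small enough to verify exhaustively. The sketch below (ours) assumes the standard Octagon link pattern consistent with the description above: node i links to nodes (i±1) mod 8 and (i+4) mod 8, giving the stated 12 bidirectional links and a two-hop bound:

```python
# Sketch (ours): connectivity of the basic octagon unit, assuming a
# ring-plus-cross pattern: node i links to (i+1) % 8, (i-1) % 8, and
# (i+4) % 8. This yields 12 bidirectional links and at most two hops.
from itertools import combinations

links = {frozenset((i, j))
         for i in range(8)
         for j in ((i + 1) % 8, (i - 1) % 8, (i + 4) % 8)}
print(len(links))  # 12

def hops(a: int, b: int) -> int:
    if a == b:
        return 0
    if frozenset((a, b)) in links:
        return 1
    # otherwise route through a common neighbor
    for m in range(8):
        if frozenset((a, m)) in links and frozenset((m, b)) in links:
            return 2
    return 8  # unreachable in two hops (never happens here)

assert all(hops(a, b) <= 2 for a, b in combinations(range(8), 2))
```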
We proposed an interconnect template following a
Butterfly Fat-Tree (BFT) [21] architecture, as shown in
Fig. 1f. In our network, the IPs are placed at the leaves and
switches placed at the vertices. A pair of coordinates is used
to label each node, $(l, j)$, where $l$ denotes a node's level and
$j$ denotes its position within that level. In general, at the
lowest level, there are $N$ functional IPs with addresses
ranging from 0 to $N-1$. The pair $(0, N)$ denotes the
locations of IPs at that lowest level. Each switch, denoted by
$S(l, j)$, has four child ports and two parent ports. The IPs
are connected to $N/4$ switches at the first level. In the
$l$th level of the tree, there are $N/2^{l+1}$ switches. The number
of switches in the butterfly fat tree architecture converges to
a constant independent of the number of levels. If we
consider a 4-ary tree, as shown in Fig. 1f, with four down
links corresponding to child ports and two up links
corresponding to parent ports, then the total number of
switches in level $l = 1$ is $N/4$. At each subsequent level, the
number of required switches reduces by a factor of 2. In this
way, the total number of switches approaches $N/2$ as $N$
grows arbitrarily large [21].
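To make the switch-count argument concrete, a minimal Python sketch (ours; bft_switch_counts is a hypothetical helper) starts with N/4 level-1 switches and halves the count at each higher level, so the total stays bounded by N/2:

```python
# Minimal sketch (ours): level 1 of a 4-ary BFT has N/4 switches, and
# each higher level halves the count, so the total stays below N/2.
def bft_switch_counts(num_ips: int) -> list[int]:
    counts, switches = [], num_ips // 4   # level 1: N/4 switches
    while switches >= 1:
        counts.append(switches)
        switches //= 2                    # halve at each level up
    return counts

for n in (16, 64, 256):
    levels = bft_switch_counts(n)
    print(n, levels, sum(levels))
# 256 -> [64, 32, 16, 8, 4, 2, 1], total 127 < N/2 = 128
```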
3 SWITCHING METHODOLOGIES
Switching techniques determine when and how internal
switches connect their inputs to outputs and the time at
which message components may be transferred along these
paths. For uniformity, we apply the same approach for all
NoC architectures. There are different types of switching
techniques, namely, Circuit Switching, Packet Switching, and
Wormhole Switching [22].
In circuit switching, a physical path from source to
destination is reserved prior to the transmission of the
data. The path is held until all the data has been
transmitted. The advantage of this approach is that the
network bandwidth is reserved for the entire duration of
the data. However, valuable resources are also tied up for
the duration of the transmitted data and the set up of an
end-to-end path causes unnecessary delays.
In packet switching, data is divided into fixed-length blocks
called packets and, instead of establishing a path before
sending any data, whenever the source has a packet to be sent,
it transmits the data. The need to store entire packets in a
switch in conventional packet switching makes the
buffer requirements high. In an SoC environ-
ment, the requirement is that switches should not consume a
large fraction of silicon area compared to the IP blocks.
In wormhole switching, the packets are divided into fixed-length
flow control units (flits) and the input and output
buffers are expected to store only a few flits. As a result, the
buffer space requirement in the switches can be small
compared to that generally required for packet switching.
Thus, using a wormhole switching technique, the switches
will be small and compact. The first flit, i.e., header flit, of a
packet contains routing information. Header flit decoding
enables the switches to establish the path and subsequent
flits simply follow this path in a pipelined fashion. As a
result, each incoming data flit of a message packet is simply
forwarded along the same output channel as the preceding
data flit and no packet reordering is required at destina-
tions. If a certain flit faces a busy channel, subsequent flits
also have to wait at their current locations.
One drawback of this simple wormhole switching method
is that the transmission of distinct messages cannot be
interleaved or multiplexed over a physical channel. Messages
must cross the channel in their entirety before the channel can
be used by another message. This will decrease channel
utilization if a flit from a given packet is blocked in a buffer. By
introducing virtual channels [22] in the input and output
ports, we can increase channel utilization considerably. If a flit
belonging to a particular packet is blocked in one of the
virtual channels, then flits of alternate packets can use the
other virtual channel buffers and, ultimately, the physical
channel. The canonical architecture of a switch having
virtual channels is shown in Fig. 2.
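The effect of virtual channels on utilization can be seen in a toy model. The sketch below (ours, not the paper's flit-level simulator) multiplexes two 4-flit packets over one physical channel through two VCs; while the packet in VC0 is blocked, flits of the packet in VC1 keep the link busy:

```python
# Toy model (ours) of virtual-channel multiplexing: two 4-flit packets
# (A in VC0, B in VC1) share one physical channel. VC0 is blocked
# downstream during cycles 2-4; VC1 keeps the link busy meanwhile.
from collections import deque

vcs = [deque(f"A{i}" for i in range(4)),   # VC0 buffer
       deque(f"B{i}" for i in range(4))]   # VC1 buffer
blocked = {0: range(2, 5)}                 # cycles during which VC0 stalls

sent, cycle, ptr = [], 0, 0
while any(vcs):
    for k in range(2):                     # round-robin over the two VCs
        vc = (ptr + k) % 2
        if vcs[vc] and cycle not in blocked.get(vc, ()):
            sent.append(vcs[vc].popleft())
            ptr = (vc + 1) % 2             # rotate priority past the winner
            break
    cycle += 1

print(sent)  # ['A0', 'B0', 'B1', 'B2', 'B3', 'A1', 'A2', 'A3']
```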
4 PERFORMANCE METRICS
To compare and contrast different NoC architectures, a
standard set of performance metrics can be used [22], [27].
For example, it is desirable that an MP-SoC interconnect
architecture exhibits high throughput, low latency, energy
efficiency, and low area overhead. In today's power-constrained
environments, it is increasingly critical to be
able to identify the most energy efficient architectures and
to be able to quantify the energy-performance trade-offs [3].
Generally, the additional area overhead due to the infra-
structure IPs should be reasonably small. We now describe
these metrics in more detail.
4.1 Message Throughput
Typically, the performance of a digital communication
network is characterized by its bandwidth in bits/sec.
However, we are more concerned here with the rate that
message traffic can be sent across the network and, so,
throughput is a more appropriate metric. Throughput can be
defined in a variety of different ways depending on the
specifics of the implementation. For message passing
systems, we can define message throughput, $TP$, as follows:

$$TP = \frac{\text{Total messages completed} \times \text{Message length}}{\text{Number of IP blocks} \times \text{Total time}}, \qquad (1)$$
where Total messages completed refers to the number of whole
messages that successfully arrive at their destination IPs,
Message length is measured in flits, Number of IP blocks is the
number of functional IP blocks involved in the commu-
nication, and Total time is the time (in clock cycles) that
elapses between the occurrence of the first message
generation and the last message reception. Thus, message
throughput is measured as the fraction of the maximum
load that the network is capable of physically handling. An
overall throughput of $TP = 1$ corresponds to all end nodes
receiving one flit every cycle. Accordingly, throughput is
measured in flits/cycle/IP. Throughput signifies the max-
imum value of the accepted traffic and it is related to the
peak data rate sustainable by the system.
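As a worked illustration of (1) (our numbers, chosen only for the arithmetic), consider 16 IPs exchanging 8-flit messages over a 20,000-cycle run:

```python
# Worked example of (1) (ours): 16 IPs, 8-flit messages, 20,000 cycles.
total_messages_completed = 4000
message_length_flits = 8
num_ip_blocks = 16
total_time_cycles = 20000

tp = (total_messages_completed * message_length_flits) / (
     num_ip_blocks * total_time_cycles)
print(f"TP = {tp:.2f} flits/cycle/IP")  # 0.10, i.e., 10% of peak load
```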
4.2 Transport Latency
Transport latency is defined as the time (in clock cycles) that
elapses between the occurrence of a message header
injection into the network at the source node and the
occurrence of a tail flit reception at the destination node
[21]. We refer to this simply as latency in the remainder of
this paper. In order to reach the destination node from some
starting source node, flits must travel through a path
consisting of a set of switches and interconnect, called
stages. Depending on the source/destination pair and the
routing algorithm, each message may have a different
latency. There is also some overhead in the source and
destination that contributes to the overall latency.
Therefore, for a given message $i$, the latency $L_i$ is

$$L_i = \text{sender overhead} + \text{transport latency} + \text{receiver overhead}.$$
We use the average latency as a performance metric in
our evaluation methodology. Let $P$ be the total number of
messages reaching their destination IPs and let $L_i$ be the
latency of each message $i$, where $i$ ranges from 1 to $P$. The
average latency, $L_{avg}$, is then calculated according to the
following:

$$L_{avg} = \frac{\sum_{i=1}^{P} L_i}{P}. \qquad (2)$$
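As a worked illustration (ours, with made-up numbers), (2) is simply the arithmetic mean of the per-message latencies collected at the destinations:

```python
# Sketch of (2) (ours): per-message latencies gathered at the
# destinations; the metric is their arithmetic mean.
latencies = [34, 41, 29, 52, 38]          # cycles, hypothetical messages
l_avg = sum(latencies) / len(latencies)   # L_avg = (1/P) * sum(L_i)
print(f"L_avg = {l_avg:.1f} cycles")      # 38.8
```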
4.3 Energy
When flits travel on the interconnection network, both the
interswitch wires and the logic gates in the switches toggle
and this will result in energy dissipation. Here, we are
concerned with the dynamic energy dissipation caused by
the communication process in the network. The flits from
the source nodes need to traverse multiple hops consisting
of switches and wires to reach destinations. Consequently,
we determine the energy dissipated by the flits in each
interconnect and switch hop. The energy per flit per hop is
given by

$$E_{hop} = E_{switch} + E_{interconnect}, \qquad (3)$$

where $E_{switch}$ and $E_{interconnect}$ depend on the total capacitances
and signal activity of the switch and each section of
interconnect wire, respectively. They are determined as
follows:

$$E_{switch} = \alpha_{switch}\, C_{switch}\, V^2, \qquad (4)$$

$$E_{interconnect} = \alpha_{interconnect}\, C_{interconnect}\, V^2. \qquad (5)$$

$\alpha_{switch}$, $\alpha_{interconnect}$ and $C_{switch}$, $C_{interconnect}$ are the signal
activities and the total capacitances of the switches and
wire segments, respectively. The energy dissipated in
transporting a packet consisting of $n$ flits over $h$ hops can
be calculated as

$$E_{packet} = n \sum_{j=1}^{h} E_{hop,j}. \qquad (6)$$

Let $P$ be the total number of packets transported, and let
$E_{packet_i}$ be the energy dissipated by the $i$th packet, where $i$
ranges from 1 to $P$. The average energy per packet, $E_{packet}$,
is then calculated according to the following equation:

$$E_{packet} = \frac{\sum_{i=1}^{P} E_{packet_i}}{P} = \frac{\sum_{i=1}^{P}\left( n_i \sum_{j=1}^{h_i} E_{hop,j}\right)}{P}. \qquad (7)$$
The parameters $\alpha_{switch}$ and $\alpha_{interconnect}$ are those that capture
the fact that the signal activities in the switches and the
interconnect segments will be data-dependent, e.g., there
may be long sequences of 1s or 0s that will not cause any
transitions. Any of the different low-power coding techni-
ques [29] aimed at minimizing the number of transitions can
be applied to any of the topologies described here. For the
sake of simplicity and without loss of generality, we do not
consider any specialized coding techniques in our analysis.
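The following Python sketch (ours; the activity factors, capacitances, and voltage are made-up placeholders, not values from the paper) chains (3)-(7) together for two hypothetical packets:

```python
# Sketch (ours) chaining (3)-(7); all numeric values are hypothetical.
V = 1.2  # supply voltage (V), made up for illustration

def e_hop(a_sw: float, c_sw: float, a_int: float, c_int: float) -> float:
    # (3)-(5): E_hop = alpha_sw*C_sw*V^2 + alpha_int*C_int*V^2
    return a_sw * c_sw * V**2 + a_int * c_int * V**2

# per packet: (flit count n, per-hop (alpha_sw, C_sw, alpha_int, C_int))
packets = [
    (8, [(0.25, 1e-12, 0.30, 2e-12)] * 3),  # 8 flits over 3 hops
    (4, [(0.25, 1e-12, 0.30, 2e-12)] * 5),  # 4 flits over 5 hops
]
energies = [n * sum(e_hop(*hop) for hop in hops_)    # (6)
            for n, hops_ in packets]
e_avg = sum(energies) / len(energies)                # (7)
print(f"average energy per packet = {e_avg:.3e} J")
```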
4.4 Area Requirements
To evaluate the feasibility of these interconnect schemes, we
consider their respective silicon area requirements. As the
switches form an integral part of the active components of the
infrastructure, it is important to determine the amount of
relative silicon area they consume. The switches have two
main components: the storage buffer and logic to imple-
ment routing and flow control. The storage buffers are the
FIFOs at the inputs and outputs of the switch. Another
source of silicon area overhead arises from the interswitch
wires, which, depending on their lengths, may have to be
buffered through repeater insertion to keep the interswitch
delay within one clock cycle [9]. Consequently, this
additional buffer area should also be taken into account.
Another important factor that needs to be considered when
analyzing the area overhead is the wiring layout. One of the
main advantages of the NoC design methodology is the
division of long global wires into smaller segments,
characterized by propagation times that are compatible
with the clock cycle budget [30]. All the NoC architectures
considered here achieve this as a result of their inherent
interconnect structure. But, the segmented wire lengths will
vary from one topology to another. Consequently, for each
architecture, the layout of interswitch wire segments
presents different degrees of complexity. Architectures that
possess longer interswitch wires will generally create more
routing challenges, compared to those possessing only
shorter wire segments. Long wires can block wiring
channels, forcing the use of additional metal layers and
causing other wires to become longer. The determination of
the distribution of interswitch wire lengths can give a first-
order indication of the overall wiring complexity.
4.5 Evaluation Methodology
In order to carry out a consistent comparison, we developed
a simulator employing flit-level event-driven wormhole
routing to study the characteristics of the communication-
centric parameters of the interconnect infrastructures. In
our experiments, the traffic injected by the functional
IP blocks followed Poisson [31] and self-similar distribu-
tions [31]. In the past, a Poisson distributed injection rate
was frequently used when characterizing performance of
multiprocessor platforms [32]. However, the self-similar
distribution was found to be a better match to real-world
SoC scenarios [33]. Each simulation was initially run for
1,000 cycles to allow transient effects to stabilize and,
subsequently, it was executed for 20,000 cycles. Using a flit
counter at the destinations, we obtain the throughput as the
number of flits reaching each destination per unit time. To
calculate average latency and energy, we associate an ordered
pair, $(D_{switch}, E_{switch})$, with each switch and an ordered pair,
$(D_{interconnect}, E_{interconnect})$, with each interconnect segment,
where $D_{switch}$, $D_{interconnect}$ and $E_{switch}$, $E_{interconnect}$ denote the
delays and energies dissipated in the switch and interconnect,
respectively. The average latency and energy dissipation
are calculated according to (2) and (7).
To estimate the silicon area consumed by the switches, we
developed their VHDL models and synthesized them using a
fully static, standard-cell-based approach for a 0.13 μm CMOS
technology library. Starting from this initial estimation, by
using an ITRS (International Technology Roadmap for
Semiconductors) suggested scaling factor of 0.7, we can
project the area overhead in future technology nodes.
5 INFRASTRUCTURE IP DESIGN CONSIDERATIONS
One common characteristic of the communication-centric
architectures described in this paper is that the functional
IP blocks communicate with each other with the help of
intelligent switches. The switches provide a robust data
transport medium for the functional IP modules. To ensure
the consistency of the comparisons we later make in this
paper, we assume that similar types of switching and
routing circuits are used in all cases. These designs are now
described in more detail.
5.1 Switch Architecture
The different components of the switch port are shown in
Fig. 3. It mainly consists of input/output FIFO buffers,
input/output arbiters, one-of-four MUX and DEMUX units,
and a routing block. In order to achieve high
throughput, we use a virtual channel switch, where each
port of the switch has multiple parallel buffers [22].
Each physical input port has more than one virtual
channel, uniquely identified by its virtual channel identifier
(VCID). Flits may simultaneously arrive at more than one
virtual channel. As a result, an arbitration mechanism is
necessary to allow only one virtual channel to access a single
physical port. Let there be $n$ virtual channels corresponding
to each input port; we need an $n:1$ arbiter at the input.
Similarly, flits from more than one input port may simultaneously
try to access a particular output port. If $k$ is the
number of ports in a switch, then we need a $(k-1):1$ arbiter
at each output port. The routing logic block determines the
output port to be taken by an incoming flit.
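The text does not detail the arbiter implementation at this point; as one plausible realization (our sketch, hypothetical class name), a round-robin arbiter can serve both the n:1 input arbiters and the (k-1):1 output arbiters by varying the requester count:

```python
# Sketch (ours): a round-robin arbiter that could realize the n:1 input
# arbiters or the (k-1):1 output arbiters; the grant pointer rotates
# past the last winner so no requester starves.
from typing import Optional

class RoundRobinArbiter:
    def __init__(self, n: int) -> None:
        self.n, self.ptr = n, 0

    def grant(self, requests: list[bool]) -> Optional[int]:
        """Return the granted requester's index, or None if no requests."""
        for k in range(self.n):
            idx = (self.ptr + k) % self.n
            if requests[idx]:
                self.ptr = (idx + 1) % self.n  # rotate past the winner
                return idx
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))  # 0
print(arb.grant([True, False, True, False]))  # 2 (0 was just served)
```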
The operation of the switch consists of one or more
processes, depending on the nature of the flit. In the case of
a header flit, the processing sequence is: 1) input arbitration,
2) routing, and 3) output arbitration. In the case of body flits,
switch traversal replaces the routing process since the
routing decision based on the header information is
maintained for the subsequent body flits. The basic
functionality of the input/output arbitration blocks does
not vary from one architecture to another. The design of the
routing hardware depends on the specific topology and
routing algorithm adopted. In order to make the routing
logic simple, fast, and compact, we follow different forms of
deterministic routing [22]. In our routing schemes, we use
distributed source routing, i.e., the source node determines
only its neighboring nodes that are involved in message
delivery. For the tree-based architectures (SPIN and BFT),
the routing algorithm applied is the least common ancestor
(LCA) and, for CLICHÉ and the Folded Torus, e-cube
(dimension-order) routing.

For the BFT, the interswitch wire lengths are given by

$$w_{a,a+1} = \frac{\sqrt{\text{Area}}}{2^{\,levels-a}}, \qquad (11)$$

where $w_{a,a+1}$ is the length of the wire spanning the
distance between level $a$ and level $a+1$ switches, and
$a$ can take integer values between 0 and $levels-1$. For
CLICHÉ,

$$w = \frac{\sqrt{\text{Area}}}{\sqrt{N}-1}, \qquad (12)$$

while, for the Folded Torus, all the interswitch wire lengths
are double those for CLICHÉ [28].
Considering a die size of 20 mm × 20 mm and a system
size of 256 IP blocks, we determined the number of
interswitch links and their lengths for all the NoC
architectures under consideration. In terms of layout,
CLICHÉ and the Folded Torus are the simplest, while
BFT and Octagon are of intermediate complexity.
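A quick back-of-the-envelope check of (12) under these assumptions (our sketch; CLICHÉ here is a 16 × 16 switch array spanning a 20 mm die edge):

```python
# Sketch (ours) of (12) for a 20 mm x 20 mm die and N = 256 IPs,
# plus the Folded Torus doubling noted above [28].
from math import sqrt

die_side_mm, N = 20.0, 256
w_cliche = die_side_mm / (sqrt(N) - 1)   # (12): 20/15, about 1.33 mm
w_folded_torus = 2 * w_cliche            # about 2.67 mm
print(f"CLICHE: {w_cliche:.2f} mm, Folded Torus: {w_folded_torus:.2f} mm")
```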
TABLE 3
Distribution of Functional and Infrastructure IP Blocks
Fig. 15. Area overhead.
7 CASE STUDY
To illustrate how system designers can use the analytical
and experimental procedures outlined in this paper to
estimate the performance of an SoC application, we
simulated a multiprocessor SoC, a network processing
platform, mapped to different NoC communication fabrics
we described in earlier sections. Among all the architectures
under consideration, Octagon and SPIN have the highest
throughput, but their energy dissipation is much greater
than that of the others. In addition, the silicon area overhead
due to the infrastructure IP blocks is also higher. Taking
these facts into account, we considered the architectures
with a lower energy dissipation profile, i.e., BFT, CLICHÉ,
and Folded Torus, for further evaluation. For illustrative
purposes, we mapped the network processing platform
onto these three interconnect architectures. The functional
block diagram of the network processor is shown in Fig. 18,
based on a commercial design [26]. All the functional blocks
are divided into five clusters. Initially, we assumed the
traffic to be uniformly distributed among these five clusters.
The micro-engines (MEs) in clusters 2 and 3 are the
programmable engines specialized for network processing.
MEs do the main data plane [26] processing for each packet
and communicate in a pipelined fashion within each
ME cluster. Consequently, the traffic will be highly
localized within these two clusters (clusters 2 and 3).
As discussed earlier, we assumed localization factors of
0.3, 0.5, and 0.8 for the traffic within these two clusters,
while the rest of the traffic is assumed to be uniformly
random. We also assumed a self-similar injection process.
Under the stated traffic distributions, we simulated the
performance of the network processor SoC shown in Fig. 18.
From the throughput characteristics, we can project the
aggregate bandwidth [35] sustainable by the SoC platform
by using the following expression:
$$\text{Aggregate bandwidth} = \text{Number of IP blocks} \times \text{Flit length} \times \text{Accepted traffic} \times \text{Clock rate}.$$
Table 4 shows the projected aggregate bandwidth, average
message latency, and average energy dissipation, assuming a
clock rate of 500 MHz (typical for an SoC implemented in a
130 nm process) and a 16-bit flit length.
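To make the projection concrete, a sketch (ours) with a hypothetical accepted-traffic figure; only the 500 MHz clock and the 16-bit flit width come from the text:

```python
# Sketch (ours) of the aggregate-bandwidth projection. The IP count
# and accepted-traffic value are hypothetical placeholders.
num_ip_blocks = 16          # hypothetical
flit_length_bits = 16       # from the text
accepted_traffic = 0.4      # flits/cycle/IP, hypothetical
clock_rate_hz = 500e6       # from the text

bw = num_ip_blocks * flit_length_bits * accepted_traffic * clock_rate_hz
print(f"aggregate bandwidth = {bw / 1e9:.1f} Gbit/s")  # 51.2 Gbit/s
```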
As expected and discussed in Section 6.1, throughput
increases significantly with traffic localization, which in
turn gives rise to higher aggregate bandwidth. The value of
average latency is measured at an injection load below
saturation. The effect of traffic localization on average
latency is that it allows a higher injection load without
saturating the network. The message latency, at a lower
injection load (below saturation), remains largely unaf-
fected by traffic localization. While measuring the average
energy dissipation, to have a consistent comparison, we
kept the system throughput at the same level for all the
architectures, while varying the amount of localized traffic.
When the factor of localization is varied from 0.3 to 0.8, the
bit energy savings relative to the uniformly distributed
traffic scenario vary from 20 percent to 50 percent.
As shown in this case study, it is possible to project the
achievable performance of a typical multicore SoC imple-
mented using the NoC design paradigm.
Fig. 16. Simplified layout examples of SPIN, OCTAGON, and BFT.
Fig. 17. Interswitch wire length distribution.
Fig. 18. Functional block diagram of a typical network processor.
8 CONCLUSIONS
Networks-on-chip (NoC) are emerging as a viable inter-
connect architecture for multiprocessor SoC platforms. In
this new paradigm, infrastructure IPs are used to establish
the on-chip communication medium. NoC-based architec-
tures are characterized by various trade-offs with regard to
functional, structural, and performance specifications. Here,
we carried out detailed comparisons and contrasted
different NoC architectures in terms of throughput, latency,
energy dissipation, and silicon area overhead. We illu-
strated that some architectures can sustain very high data
rates at the expense of high energy dissipation and
considerable silicon area overhead, while others can
provide a lower data rate and lower energy dissipation
levels. Our principal contribution lies in the establishment
and illustration of a consistent comparison and evaluation
methodology based on a set of readily quantifiable
parameters for NoCs. Our methodology sets an important
basis for the optimal evaluation and selection of inter-
connect infrastructures for large and complex SoCs. Though
the parameters considered in our benchmarking are
considered by experts in the field to be among the most
critical, they do not constitute a unique set nor are they
exhaustive. Different applications or circumstances may
require this set to be altered or augmented, e.g., by
including parameters such as testability, dependability,
and reliability. However, they are an important set to
characterize the emerging NoC architectures.
ACKNOWLEDGMENTS
The authors thank Micronet, PMC-Sierra, Gennum, and
NSERC for their financial support and the CMC for providing
access to CAD tools. The authors also thank Pierre Paulin of
ST Microelectronics and Allan Nakamoto of PMC Sierra for
their feedback. In addition, they wish to thank the expert
reviewers for their extremely valuable comments, which
helped them enhance the quality of their work.
REFERENCES
[1] L. Benini and G. DeMicheli, Networks on Chips: A New SoC
Paradigm, Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
[2] P. Magarshack and P.G. Paulin, System-on-Chip beyond the
Nanometer Wall, Proc. Design Automation Conf. (DAC), pp. 419-
424, June 2003.
[3] M. Horowitz and B. Dally, How Scaling Will Change Processor
Architecture, Proc. Int'l Solid State Circuits Conf. (ISSCC), pp. 132-
133, Feb. 2004.
[4] Y. Zorian, Guest Editor's Introduction: What Is Infrastructure
IP? IEEE Design and Test of Computers, vol. 19, no. 3, pp. 3-5, May/
June 2002.
[5] M.A. Horowitz et al., The Future of Wires, Proc. IEEE, vol. 89,
no. 4, pp. 490-504, Apr. 2001.
[6] K.C. Saraswat et al., Technology and Reliability Constrained
Future Copper Interconnects, Part II: Performance Implications,
IEEE Trans. Electron Devices, vol. 49, no. 4, pp. 598-604, Apr. 2002.
[7] D. Sylvester and K. Keutzer, Impact of Small Process Geometries
on Microarchitectures in Systems on a Chip, Proc. IEEE, vol. 89,
no. 4, pp. 467-489, Apr. 2001.
[8] ITRS 2003 Documents, http://public.itrs.net/Files/2003ITRS/
Home2003.htm, 2003.
[9] C. Grecu, P.P. Pande, A. Ivanov, and R Saleh, Structured
Interconnect Architecture: A Solution for the Non-Scalability of
Bus-Based SoCs, Proc. Great Lakes Symp. VLSI, pp. 192-195, Apr.
2004.
[10] C. Hsieh and M. Pedram, Architectural Energy Optimization by
Bus Splitting, IEEE Trans. Computer-Aided Design, vol. 21, no. 4,
pp. 408-414, Apr. 2002.
[11] AMBA Bus specification, http://www.arm.com, 1999.
[12] Wishbone Service Center, http://www.silicore.net/wishbone.
htm, 2004.
[13] CoreConnect Specification, http://www3.ibm.com/chips/
products/coreconnect/, 1999.
[14] D. Wingard, MicroNetwork-Based Integration for SoCs, Proc.
Design Automation Conf. (DAC), pp. 673-677, June 2001.
[15] Open Core Protocol, www.ocpip.org, 2003.
[16] MIPS SoC-it, www.mips.com, 2002.
[17] P. Guerrier and A. Greiner, A Generic Architecture for On-Chip
Packet-Switched Interconnections, Proc. Design and Test in Europe
(DATE), pp. 250-256, Mar. 2000.
[18] S. Kumar et al., A Network on Chip Architecture and Design
Methodology, Proc. Int'l Symp. VLSI (ISVLSI), pp. 117-124, 2002.
[19] W.J. Dally and B. Towles, Route Packets, Not Wires: On-Chip
Interconnection Networks, Proc. Design Automation Conf. (DAC),
pp. 683-689, 2001.
[20] F. Karim et al., An Interconnect Architecture for Networking
Systems on Chips, IEEE Micro, vol. 22, no. 5, pp. 36-45, Sept./Oct.
2002.
[21] P.P. Pande, C. Grecu, A. Ivanov, and R. Saleh, Design of a Switch
for Network on Chip Applications, Proc. Int'l Symp. Circuits and
Systems (ISCAS), vol. 5, pp. 217-220, May 2003.
[22] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An
Engineering Approach. Morgan Kaufmann, 2002.
TABLE 4
Projected Performance of a Network Processor SoC Platform in NoC Design Paradigm
[23] H.-S. Wang, L.-S. Peh, and S. Malik, A Power Model for Routers:
Modeling Alpha 21364 and Infiniband Routers, Proc. 10th Symp.
High Performance Interconnects, pp. 21-27, 2002.
[24] T. Chelcea and S.M. Nowick, A Low-Latency FIFO for Mixed
Clock Systems, Proc. IEEE CS Workshop VLSI, pp. 119-126, Apr.
2000.
[25] P.P. Pande, C. Grecu, A. Ivanov, and R. Saleh, High-Throughput
Switch-Based Interconnect for Future SoCs, Proc. Third IEEE Int'l
Workshop System-on-Chip for Real-Time Applications, pp. 304-310,
2003.
[26] Intel IXP2400 datasheet, http://www.intel.com/design/network
/products/npfamily/ixp2400.htm, 2004.
[27] J. Hennessy and D. Patterson, Computer Architecture: A Quanti-
tative Approach. Morgan Kaufmann, 2003.
[28] W.J. Dally and C.L. Seitz, The Torus Routing Chip, Technical
Report 5208:TR: 86, Computer Science Dept., California Inst. of
Technology, pp. 1-19, 1986.
[29] V. Raghunathan, M.B. Srivastava, and R.K. Gupta, A Survey of
Techniques for Energy Efficient On-Chip Communications, Proc.
Design and Test in Europe (DATE), pp. 900-905, June 2003.
[30] L. Benini and D. Bertozzi, Xpipes: A Network-on-Chip Archi-
tecture for Gigascale Systems-on-Chip, IEEE Circuits and Systems
Magazine, vol. 4, no. 2, pp. 18-31, 2004.
[31] K. Park and W. Willinger, Self-Similar Network Traffic and
Performance Evaluation. John Wiley & Sons, 2000.
[32] D.R. Avresky, V. Shubranov, R. Horst, and P. Mehra, Perfor-
mance Evaluation of the ServerNet SAN under Self-Similar
Traffic, Proc. 13th Int'l and 10th Symp. Parallel and Distributed
Processing, pp. 143-147, Apr. 1999.
[33] G. Varatkar and R. Marculescu, Traffic Analysis for On-Chip
Networks Design of Multimedia Applications, Proc. Design
Automation Conf. (DAC), pp. 510-517, June 2002.
[34] Networks on Chip, A. Jantsch and H. Tenhunen, eds. Kluwer
Academic, 2003.
[35] B. Vermeulen et al., Bringing Communication Networks on a
Chip: Test and Verification Implications, IEEE Comm. Magazine,
pp. 74-81, Sept. 2003.
[36] W.J. Dally, Virtual-Channel Flow Control, IEEE Trans. Parallel
and Distributed Systems, vol. 3, no. 2, pp. 194-205, Mar. 1992.
Partha Pratim Pande is completing his PhD
degree studies in VLSI design at the Department
of Electrical and Computer Engineering, Uni-
versity of British Columbia, Canada. His PhD
thesis revolved around the topic of network-on-chip.
Before this, he received the MSc degree
in computer science from the National University
of Singapore in 2001 and the bachelor's degree
in electronics and communication engineering
from Calcutta University, India, in 1997. After
receiving the bachelor's degree, he worked in industry for a couple of
years as a digital system design engineer. At the end of 1999, he
returned to academia to pursue higher studies. He has received several
national scholarships from the government of India for academic
excellence. In addition to this, he received the International Graduate
student scholarship from the National University of Singapore. He is a
student member of the IEEE.
Cristian Grecu received the BS and MEng
degrees in electrical engineering from the
Technical University of Iasi, Romania, and the
MASc degree from the University of British
Columbia. He is a doctoral student in the
Department of Electrical and Computer Engi-
neering, University of British Columbia, Canada.
His research interests focus on design and test
of large SoCs, with particular emphasis on their
data communication infrastructures.
Michael Jones received the BSc degree in
computer engineering from Queen's University,
Kingston, Ontario, Canada, in 2002 and the
MASc degree from the University of British
Columbia, Vancouver, Canada, in 2005. Since
2005, he has moved on to work as a software
engineer in the high tech industry.
Andre Ivanov received the BEng (Hon.), MEng,
and PhD degrees in electrical engineering from
McGill University. He is a professor in the
Department of Electrical and Computer Engi-
neering at the University of British Columbia
(UBC). He spent a sabbatical leave at PMC-
Sierra, Vancouver, British Columbia. He has
held invited professor positions at the University
of Montpellier II, the University of Bordeaux I,
and Edith Cowan University in Perth, Australia.
His primary research interests lie in the area of integrated circuit testing,
design for testability, and built-in self-test for digital, analog, and mixed-
signal circuits and systems-on-a-chip (SoCs). He has published widely
in these areas and holds several patents in IC design and test. Besides
testing, he has interests in the design and design methodologies of large
and complex integrated circuits and SoCs. He has served and continues
to serve on numerous national and international steering, program, and/
or organization committees in various capacities. Recently, he was the
program chair of the 2002 VLSI Test Symposium (VTS '02) and the
general chair for VTS '03 and VTS '04. In 2001, he cofounded Vector 12,
a semiconductor IP company. He has published more than 100 papers
in conference and journals and holds four US patents. He serves on the
editorial board of IEEE Design and Test and Kluwer's Journal of
Electronic Testing: Theory and Applications. He is currently the chair of
the IEEE Computer Society's Test Technology Technical Council
(TTTC). He is a Golden Core Member of the IEEE Computer Society,
a senior member of the IEEE, a fellow of the British Columbia Advanced
Systems Institute, and a Professional Engineer of British Columbia.
Resve Saleh received the PhD and MS degrees
in electrical engineering from the University of
California, Berkeley, and the BS degree in
electrical engineering from Carleton University,
Ottawa, Canada. He is currently a professor and
the NSERC/PMC-Sierra Chairholder in the
Department of Electrical and Computer Engi-
neering at the University of British Columbia,
Vancouver, Canada, working in the field of
system-on-chip design, verification, and test.
He received the Presidential Young Investigator Award in 1990 from
the US National Science Foundation. He has published more than 50
journal articles and conference papers. He is a senior member of the
IEEE and served as general chair (1995), conference chair (1994), and
technical program chair (1993) for the Custom Integrated Circuits
Conference. He recently held the positions of technical program chair,
conference chair, and vice-general chair of the International Symposium
on Quality in Electronic Design (2001) and has served as an associate
editor of the IEEE Transactions on Computer-Aided Design. He recently
coauthored a book entitled Design and Analysis of Digital Integrated
Circuit Design: In Deep Submicron Technology. He was a founder and
chairman of Simplex Solutions, which developed CAD software for deep
submicron digital design verification. Prior to starting Simplex, he spent
nine years as a professor in the Department of Electrical and Computer
Engineering at the University of Illinois, Urbana, and one year teaching
at Stanford University. Before embarking on his academic career, he
worked for Mitel Corporation in Ottawa, Canada, Toshiba Corporation in
Japan, Tektronix in Beaverton, Oregon, and Nortel in Ottawa, Canada.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.