10Gb Ethernet
Abstract

This paper presents a case study of the 10-Gigabit Ethernet (10GbE) adapter from Intel®. Specifically, with appropriate optimizations to the configurations of the 10GbE adapter and TCP, we demonstrate that the 10GbE adapter can perform well in local-area, storage-area, system-area, and wide-area networks.

For local-area, storage-area, and system-area networks in support of networks of workstations, network-attached storage, and clusters, respectively, we can achieve over 7-Gb/s end-to-end throughput and 12-μs end-to-end latency between applications running on Linux-based PCs. For the wide-area network in support of grids, we broke the recently-set Internet2 Land Speed Record by 2.5 times by sustaining an end-to-end TCP/IP throughput of 2.38 Gb/s between Sunnyvale, California and Geneva, Switzerland (i.e., 10,037 kilometers) to move over a terabyte of data in less than an hour. Thus, the above results indicate that 10GbE may be a cost-effective solution across a multitude of computing environments.

This work was supported by the US DOE Office of Science through LANL contract W-7405-ENG-36, Caltech contract DE-FG03-92-ER40701, and SLAC contract DE-AC03-76SF00515. Additional support was provided by NSF through grant ANI-0230967, AFOSR through grant F49620-03-1-0119, and ARO through grant DAAD19-02-1-0283. This paper is also available as the following LANL technical report: LA-UR 03-5728, July 2003.

SC'03, November 15-21, 2003, Phoenix, Arizona, USA
Copyright 2003 ACM 1-58113-695-1/1/03/0011...$5.00

1. Introduction

Thirty years ago in a May 1973 memo, Robert Metcalfe described the technology that would evolve into today's ubiquitous Ethernet protocol. By 1974, Metcalfe and his colleague, David Boggs, built their first Ethernet; and by 1975, they demonstrated what was at the time a dazzling 2.94 Mb/s of throughput over the 10-Mb/s Ethernet medium. Since that time, Ethernet has proliferated and evolved tremendously and has done so in virtual lockstep with the ubiquitous TCP/IP (Transmission Control Protocol / Internet Protocol) protocol suite which was started at Stanford University in the summer of 1973. Today's Ethernet carries 99.99% of Internet packets and bears little resemblance to the original Ethernet [11]. About the only aspect of the original Ethernet that still remains is its packet format.

So, even though the recently ratified 10-Gigabit Ethernet (10GbE) standard differs from earlier Ethernet standards, mainly with respect to operating only over fiber and only in full-duplex mode, it still remains Ethernet, and more importantly, does not obsolete current investments in network infrastructure. Furthermore, the 10GbE standard ensures interoperability not only with respect to existing Ethernet but also other networking technologies such as SONET (i.e., Ethernet over SONET), thus paving the way for Ethernet's expanded use in metropolitan-area networks (MANs) and wide-area networks (WANs). Finally, while 10GbE is arguably intended to ease migration to higher aggregate performance levels in institutional network-backbone infrastructures, the results in this paper will demonstrate 10GbE's versatility in a myriad of computing environments.
[Figure 1. Architecture of the Intel® PRO/10GbE-LR adapter: the Intel 82597EX controller (MAC, 8B/10B PCS/SerDes, XGMII/XAUI at 4 x 3.125 Gb/s) sits between the PCI-X bus (8.5 Gb/s) and 512-KB flash on one side and the Intel 1310-nm serial optics (10.3 Gb/s in and 10.3 Gb/s out) on the other.]
The remainder of the paper is organized as follows: Section 2 briefly describes the architecture of the Intel 10GbE adapter. Section 3 presents the local-area network (LAN) and system-area network (SAN) testing environments, experiments, and results and analysis, and Section 4 does the same for the wide-area network (WAN). Finally, we summarize and conclude in Section 5.

2. Architecture of a 10GbE Adapter

The recent arrival of the Intel® PRO/10GbE LR™ server adapter paves the way for 10GbE to become an all-encompassing technology from LANs and SANs to MANs and WANs. This first-generation 10GbE adapter consists of three major components: the Intel 82597EX™ 10GbE controller, 512 KB of flash memory, and Intel 1310-nm serial optics, as shown in Figure 1.

The 10GbE controller provides an Ethernet interface that delivers high performance by providing direct access to all memory without using mapping registers, minimizing the programmed I/O (PIO) read accesses required to manage the device, minimizing the interrupts required to manage the device, and off-loading the host CPU of simple tasks such as TCP checksum calculations. It is implemented in a single chip and contains both the medium-access control (MAC) and physical (PHY) layer functions, as shown at the top of Figure 1. The PHY layer, to the right of the MAC layer in Figure 1, consists of an 8B/10B physical coding sublayer and a 10-gigabit media-independent interface (XGMII). To the left of the MAC layer is a direct-memory access (DMA) engine and the "peripheral component interconnect extended" interface (PCI-X I/F). The former handles the transmit and receive data and descriptor transfers between the host memory and on-chip memory, while the latter provides a complete glueless interface to a 33/66-MHz, 32/64-bit PCI bus or a 33/66/100/133-MHz, 32/64-bit PCI-X bus.

As is already common practice with high-performance adapters such as Myricom's Myrinet [2] and Quadrics' QsNet [17], the 10GbE adapter frees up host-CPU cycles by performing certain tasks (in silicon) on behalf of the host CPU. In contrast to the Myrinet and QsNet adapters, however, the 10GbE adapter focuses on host off-loading of certain TCP/IP tasks (specifically, TCP and IP checksums and TCP segmentation) rather than on remote direct-memory access (RDMA) and source routing. As a result, unlike Myrinet and QsNet, the 10GbE adapter provides a general-purpose, TCP/IP-based solution to applications, a solution that does not require any modification to application codes to achieve high performance, e.g., as high as 7 Gb/s between end-host applications with an end-to-end latency as low as 12 μs.

As we will see later, achieving higher throughput will require either efficient offloading of network tasks from software to hardware (e.g., IETF's RDMA-over-IP effort, known as RDDP or remote direct data placement [19]) and/or significantly faster machines with large memory bandwidth. Achieving substantially higher throughput, e.g., approaching 10 Gb/s, will not be possible until the PCI-X hardware bottleneck in a PC is addressed. Currently, the peak bandwidth of a 133-MHz, 64-bit PCI-X bus in a PC is 8.5 Gb/s (see left-hand side of Figure 1), which is less than half the 20.6-Gb/s bidirectional data rate (see right-hand side of Figure 1) that the Intel 10GbE adapter can support.
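As a rough sanity check on these figures, the short Python sketch below works through the bus arithmetic; the 133-MHz/64-bit and 10.3-Gb/s numbers are taken from the text, and the script itself is only illustrative.

```python
# Back-of-the-envelope check of the PCI-X bottleneck described above.
# Bus and line-rate figures are those quoted in the text.

PCIX_CLOCK_HZ = 133e6        # 133-MHz PCI-X bus
PCIX_WIDTH_BITS = 64         # 64-bit data path

pcix_peak_gbps = PCIX_CLOCK_HZ * PCIX_WIDTH_BITS / 1e9
print(f"PCI-X peak bandwidth:       {pcix_peak_gbps:4.1f} Gb/s")      # ~8.5 Gb/s

# The serial optics run at 10.3 Gb/s in each direction, so the adapter's
# bidirectional data rate is roughly twice that.
bidirectional_gbps = 2 * 10.3
print(f"Adapter bidirectional rate: {bidirectional_gbps:4.1f} Gb/s")  # 20.6 Gb/s

print(f"The bus covers only {pcix_peak_gbps / bidirectional_gbps:.0%} "
      f"of what the adapter can move.")                               # ~41%
```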
3. LAN/SAN Tests

[Figure: LAN/SAN test configuration (PE2650 - 10GbE - PE2650) and throughput plot (Mbit/s vs. payload size; "Stock TCP + Increased PCI-X Burst Size + Uniprocessor") omitted.]

At the present time, the P4 Xeon SMP architecture assigns [...]

[...] latency when the machines are connected back-to-back and 25-μs end-to-end latency when going through the Foundry FastIron 1500 switch. As the payload size increases from one byte to 1024 bytes, latencies increase linearly in a stepwise fashion, as shown in Figure 6. Over the entire range of payloads, the end-to-end latency increases a total of 20% such that the back-to-back latency is 23 μs and the end-to-end latency through the switch is 28 μs.

[Figure 6. End-to-End Latency in Test Configuration (latency in seconds vs. payload size in bytes, 0-1024).]

To reduce these latency numbers even further, particularly for latency-sensitive environments, we trivially shave off an additional 5 μs (i.e., down to 14-μs end-to-end latency) by simply turning off a feature called interrupt coalescing (Figure 7). In our bandwidth tests and the above latency tests in Figure 6, we had configured the 10GbE adapters with a 5-μs interrupt delay. This delay is the period that the 10GbE card [...]
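The latency figures above were gathered with standard microbenchmarking tools (the references list NetPIPE [15] and Netperf [14], among others). Purely as an illustration of the kind of ping-pong measurement involved, here is a minimal Python sketch; the peer address, port, payload size, and iteration count are placeholder assumptions, and TCP_NODELAY is set so that small payloads are not delayed by Nagle's algorithm.

```python
import socket
import sys
import time

HOST, PORT = "10.0.0.2", 5001   # placeholder peer address and port
PAYLOAD = 1024                  # bytes per ping-pong message
ITERS = 10000

def server():
    """Echo everything back so the client can time full round trips."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    while True:
        data = conn.recv(65536)
        if not data:
            break
        conn.sendall(data)

def client():
    s = socket.create_connection((HOST, PORT))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    msg = b"x" * PAYLOAD
    start = time.perf_counter()
    for _ in range(ITERS):
        s.sendall(msg)
        got = 0
        while got < PAYLOAD:                 # accumulate the echoed payload
            got += len(s.recv(PAYLOAD - got))
    elapsed = time.perf_counter() - start
    # One-way latency is estimated as half the average round-trip time.
    print(f"{PAYLOAD}-byte one-way latency: {elapsed / ITERS / 2 * 1e6:.1f} us")

if __name__ == "__main__":
    server() if "server" in sys.argv[1:] else client()
```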
[...] The congestion window normally opens at a constant rate of one segment per round-trip time; but each time a congestion signal is received (i.e., packet loss), the congestion window is halved.

However, as the bandwidth and latency increase, and hence the bandwidth-delay product increases, the effect of a single packet loss is disastrous in these long fat networks with gargantuan bandwidth-delay products. For example, in future scenarios, e.g., a 10-Gb/s connection end-to-end between Geneva and Sunnyvale, Table 1 shows how long it takes to recover from a packet loss and eventually return to the original transmission rate (prior to the packet loss), assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost.

To avoid this problem, one simply needs to reduce the packet-loss rate. But how? In our environment, packet loss is due exclusively to congestion in the network, i.e., packets are dropped when the number of unacknowledged packets exceeds the available capacity of the network. In order to reduce the packet-loss rate, we must "stop" the increase of the congestion window before it reaches a congested state. Because explicit control of the congestion-control window is not possible, we turn to the flow-control window (TCP buffer sizing) to implicitly cap the congestion-window size to the bandwidth-delay product of the wide-area network so that the network approaches congestion but avoids it altogether.

As a result, using only a single TCP/IP stream between Sunnyvale and Geneva, we achieved an end-to-end throughput of 2.38 Gb/s over a distance of 10,037 kilometers. This translates into moving a terabyte of data in less than one hour.

Why is this result so remarkable? First, it is well known that TCP end-to-end throughput is inversely proportional to round-trip time; that is, the longer the round-trip time (in this case, 180 ms, or approximately 10,000 times larger than the round-trip time in the LAN/SAN), the lower the throughput. Second, given that the bottleneck bandwidth is the transatlantic LHCnet OC-48 POS link at 2.5 Gb/s, achieving 2.38 Gb/s means that the connection operated at roughly 99% payload efficiency. Third, the end-to-end WAN throughput is actually larger than what an application user typically sees in a LAN/SAN environment. Fourth, our results smashed both the single- and multi-stream Internet2 Land Speed Records by 2.5 times.
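To make the flow-control capping described above concrete, the sketch below computes the bandwidth-delay product for the transatlantic path discussed in the text (2.5-Gb/s bottleneck, 180-ms round-trip time) and requests socket buffers of that size. It is only a sketch of the general technique, not the exact configuration used in the record run; note also that the kernel caps such requests at its configured maxima (e.g., net.core.rmem_max and net.core.wmem_max on Linux), so those limits must be raised as well.

```python
import socket

# Bandwidth-delay product of the wide-area path, using the figures quoted
# in the text: a 2.5-Gb/s transatlantic bottleneck and a 180-ms RTT.
BOTTLENECK_BPS = 2.5e9
RTT_SEC = 0.180
bdp_bytes = int(BOTTLENECK_BPS * RTT_SEC / 8)
print(f"bandwidth-delay product: {bdp_bytes / 2**20:.1f} MiB")   # ~53.6 MiB

# Requesting send/receive buffers of (roughly) the BDP bounds the amount of
# unacknowledged data in flight, and therefore implicitly caps the
# congestion window at the path's capacity.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
print("send buffer granted:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```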
Table 1. Time to recover from a single packet loss.

  Path                 Bandwidth Assumption   RTT (ms)   MSS (bytes)   Time to Recover
  LAN                  10 Gb/s                       1          1460   428 ms
  Geneva - Chicago     10 Gb/s                     120          1460   1 hr 42 min
  Geneva - Chicago     10 Gb/s                     120          8960   17 min
  Geneva - Sunnyvale   10 Gb/s                     180          1460   3 hr 51 min
  Geneva - Sunnyvale   10 Gb/s                     180          8960   38 min
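The recovery times in Table 1 follow from TCP's additive-increase behavior: after a loss the congestion window is halved and then grows by roughly one segment per round-trip time, so returning from half the bandwidth-delay product to the full window takes about (BDP / (2 x MSS)) round-trip times. The sketch below reproduces the table's figures under that simplified model, which ignores slow start and delayed ACKs.

```python
# Time for TCP to return to full rate after a single loss, assuming the
# congestion window equals the bandwidth-delay product when the loss occurs
# and then grows by one MSS per round-trip time (additive increase only;
# slow start and delayed ACKs are ignored).

def recovery_time_sec(bw_bps, rtt_sec, mss_bytes):
    bdp_segments = bw_bps * rtt_sec / 8 / mss_bytes   # window size at the loss
    return (bdp_segments / 2) * rtt_sec               # regrow half the window

paths = [
    ("LAN",                10e9, 0.001, 1460),
    ("Geneva - Chicago",   10e9, 0.120, 1460),
    ("Geneva - Chicago",   10e9, 0.120, 8960),
    ("Geneva - Sunnyvale", 10e9, 0.180, 1460),
    ("Geneva - Sunnyvale", 10e9, 0.180, 8960),
]

for name, bw, rtt, mss in paths:
    t = recovery_time_sec(bw, rtt, mss)
    print(f"{name:18s}  RTT {rtt * 1e3:5.0f} ms  MSS {mss:4d}  "
          f"recovery ~ {t:7.1f} s ({t / 60:5.1f} min)")
```

The computed values match the table to within rounding; in particular, jumbo frames (an 8960-byte MSS) cut the recovery time by roughly the ratio of the two MSS sizes.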
5. Conclusion

With the current generation of SAN interconnects such as Myrinet and QsNet being theoretically hardware-capped at 2 Gb/s and 3.2 Gb/s, respectively, achieving over 4 Gb/s of end-to-end throughput with 10GbE makes it a viable commodity interconnect for SANs in addition to LANs. However, its Achilles' heel is its 12-μs (best-case) end-to-end latency, which is 1.7 times slower than Myrinet/GM (but over two times faster than Myrinet/IP) and 2.4 times slower than QsNet/Elan3 (but over two times faster than QsNet/IP). These performance differences can be attributed mainly to the host software.

In recent tests on dual 2.66-GHz CPU, 533-MHz FSB, Intel E7505-based systems running Linux, we have achieved 4.64-Gb/s throughput "out of the box." The greatest difference between these systems and the PE2650s is the FSB, which indicates that the CPU's ability to move — but not process — data might be an important bottleneck. These tests have not yet been fully analyzed.

To continue this work, we are currently instrumenting the Linux TCP stack with MAGNET to perform per-packet profiling and tracing of the stack's control path. MAGNET allows us to profile arbitrary sections of the stack with CPU-clock accuracy, while 10GbE stresses the stack with previously impossible loads. Analysis of this data is giving us an unprecedentedly high-resolution picture of the most expensive aspects of TCP processing overhead [4].

While a better understanding of current performance bottlenecks is essential, the authors' past experience with Myrinet and Quadrics leads them to believe that an OS-bypass protocol, like RDMA over IP, implemented over 10GbE would result in throughput approaching 8 Gb/s, end-to-end latencies below 10 μs, and a CPU load approaching zero. However, because high-performance OS-bypass protocols require an on-board (programmable) network processor on the adapter, the 10GbE adapter from Intel currently cannot support an OS-bypass protocol.

The availability of 10-Gigabit Ethernet provides a remarkable opportunity for network researchers in LANs, SANs, MANs, and even WANs in support of networks of workstations, clusters, distributed clusters, and grids, respectively. The unprecedented (commodity) performance offered by the Intel PRO/10GbE server adapter also enabled us to smash the Internet2 Land Speed Record (http://lsr.internet2.edu) on February 27, 2003, by sustaining 2.38 Gb/s across 10,037 km between Sunnyvale, California and Geneva, Switzerland, i.e., 23,888,060,000,000,000 meters-bits/sec.
Acknowledgements

First and foremost, we would like to thank the Intel team — Patrick Connor, Caroline Larson, Peter Molnar, and Marc Rillema of the LAN Access Division — for their tremendous support of our research efforts, and Eric Weigle for his assistance throughout this project.

With respect to the wide-area network, none of the research (e.g., Internet2 Land Speed Record) would have been possible without the generous support and contributions of many people and institutions. From an infrastructure standpoint, we thank Linda Winkler and Tom DeFanti from Argonne National Laboratory and the University of Illinois at Chicago, respectively, for the use of a Juniper T640 TeraGrid router at Starlight in Chicago. Equipment-wise, in addition to Intel, we thank

  Paul Fernes from Level(3) Communications for the OC-192 link from Sunnyvale to Chicago,

  Doug Walsten from Cisco Systems for the 10GbE switch support in Sunnyvale, and

  Peter Kersten and John Szewc from Foundry Networks for providing Los Alamos National Laboratory with a 10GbE switch to run their LAN/SAN tests on.

If any of the above infrastructure pieces had fallen through, no attempt at the Internet2 Land Speed Record would ever have been made.

Last, but not least, we gratefully acknowledge the invaluable contributions and support of the following colleagues during our record-breaking effort on the Internet2 Land Speed Record: Julian Bunn and Suresh Singh of Caltech; Paolo Moroni and Daniel Davids of CERN; Edoardo Martelli of CERN/DataTAG; Gary Buhrmaster of SLAC; and Eric Weigle and Adam Engelhart of Los Alamos National Laboratory. In addition, we wish to thank the Internet2 consortium (http://www.internet2.edu) for creating a venue for "supernetworking" achievements.

References

[1] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion Control," RFC-2581, April 1999.
[2] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su, "Myrinet: A Gigabit-Per-Second Local Area Network," IEEE Micro, Vol. 15, No. 1, January/February 1995.
[3] D. Clark, "Window and Acknowledgment Strategy in TCP," RFC-813, July 1982.
[4] D. Clark, V. Jacobson, J. Romkey, and H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications, Vol. 27, No. 6, June 1989, pp. 23-29.
[5] "Communication Streaming Architecture: Reducing the PCI Network Bottleneck," Intel Whitepaper 252451-002. Available at: http://www.intel.com/design/network/papers/252451.htm.
[6] M. K. Gardner, W. Feng, M. Broxton, A. Engelhart, and G. Hurwitz, "MAGNET: A Tool for Debugging, Analysis and Reflection in Computing Systems," Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid'2003), May 2003.
[7] G. Hurwitz and W. Feng, "Initial End-to-End Performance Evaluation of 10-Gigabit Ethernet," Proceedings of Hot Interconnects 11 (HotI'03), August 2003.
[8] “Iperf 1.6 - The TCP/UDP Bandwidth Measurement Tool,”
http://dast.nlanr.net/Projects/ Iperf/.
[9] M. Mathis, “Pushing Up the Internet MTU,” Presentation to
Miami Joint Techs, Feburary 2003.
[10] J. McCalpin, “STREAM: Sustainable Memory Bandwidth
in High-Performance Computers,” http://www.cs.virginia.edu/
stream/.
[11] R. Metcalfe and D. Boggs, “Ethernet: Distributed Packet
Switching for Local Computer Networks,” Communications
of the ACM, Vol. 19, No. 5, July 1976.
[12] “Myrinet Ethernet Emulation (TCP/IP & UDP/IP) Perfor-
mance,” http://www.myri.com/myrinet/performance/ ip.html.
[13] “Myrinet Performance Measurements,” http://www.myri.com/
myrinet/performance/index.html.
[14] “Netperf: Public Netperf Homepage,” http://www.netperf.org/.
[15] “NetPIPE,” http://www.scl.ameslab.gov/netpipe/.
[16] “NTTCP: New TTCP program,” http://www.leo.org/˜ el-
mar/nttcp/.
[17] F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg,
“The Quadrics Network: High-Performance Clustering Tech-
nology,” IEEE Micro, Vol. 22, No. 1, January/Feburary 2002.
[18] I. Philp and Y.-L. Liong, “The Scheduled Transfer (ST) Pro-
tocol,” 3rd International Workshop on Communication, Ar-
chitecture, and Applications for Network-Based Parallel Com-
puting (CANPC’99), Lecture Notes in Computer Science, Vol.
1602, January 1999.
[19] A. Romanow, and S. Bailey, “An Overview of RDMA over IP,”
Proceedings of the First International Workshop on Protocols
for Fast Long-Distance Networks (PFLDnet 2003), Feburary
2003.
[20] J. Stone, and C. Partridge, “When the CRC and TCP Check-
sum Disagree,” Proceedings of ACM SIGCOMM 2000, August
2000.
[21] “TCPDUMP Public Repository,” http://www.tcpdump.org.
[22] B. Tierney, “TCP Tuning Guide for Distributed Application
on Wide-Area Networks,” USENIX ;login:, Vol. 26, No. 1,
February 2001.
[23] S. Ubik, and P. Chimbal, “Achieving Reliable High Perfor-
mance in LFNs,” Proceedings of Trans-European-Research
and Education Networking Association Networking Confer-
ence (TERENA 2003), May 2003.
[24] W. Washington and C. Parkinson, An Introduction to Three-
Dimensional Climate Modeling, University Science Books,
1991.