Experiences Setting Up A Rocks Based Linux Cluster: Tomas Lindén, Helsinki Institute of Physics, CMS Programme, 08.12.2003
Contents
1. Introduction
2. Linux cluster mill hardware
3. Disk performance results
4. NPACI Rocks Cluster Distribution
5. Conclusions
6. Acknowledgements
1. Introduction
Commodity off the shelf hardware (COTS) gives an excellent price /
performance ratio for High Throughput Computing (HTC). Top500 class
computers can be built using COTS. There is no public x86 Linux HTC cluster
at CSC nor in Kumpula. To meet the demand for large, affordable
CPU resources, an x86 Linux cluster project was started in Kumpula with:
• Hardware optimized for Monte Carlo simulations in Computational
Material Science (MD-simulations) and High Energy Physics (CMS
Monte Carlo production runs for CMS Data Challenges).
• The applications are mostly of the pleasantly parallel type.
• Network bandwidth and latency for message passing are therefore not
an issue, so only a 100 Mb/s network is required between the nodes.
• One subphase of the simulations is I/O bound and needs good random
read performance.
The budget for the system was ≈ 50 kEUR (without VAT).
A 32 node OpenMosix AMD Athlon XP ATX blade server at the Institute of Biotechnology at
the University of Helsinki.
1U cases are very compact, but the only possibility to add expansion cards is
through a PCI riser card, which can be problematic, so when using a 1U case
one really wants a motherboard with as many integrated
components as possible. The cooling of a 1U case needs to be designed very
carefully.
The advantage of a 2U case is that one can use half height or low-profile
PCI-cards without using any PCI-riser card (Intel Gb/s NICs are available in
half height PCI size). Heat dissipation problems are probably less likely to
occur in a 2U case because of the additional space available for airflow.
The advantage of 4U cases is that standard PCI-expansion cards can be used.
[Figure: Cluster network diagram showing the fileserver silo (1.4 TB) and the frontend with eth0 1 Gb/s (Cu) links, an eth1 1 Gb/s link to the private LAN (fiber), a console, and the compute nodes compute-0-0 (node-0) ... compute-1-17 (node-31), each on a 100 Mb/s eth0 link.]
• CPU: 2 × AMD Athlon MP, 2.133 GHz
• MB: Tyan Tiger MPX S2466N-4M
[Figure: The 32+1 node dual AMD Athlon MP Rocks 2U rack cluster mill.]
Cooling
• The cases have large air exhaust holes near the CPUs.
• Rounded IDE cables on the nodes.
• The frontend draws 166 W (idle) to 200 W (full load).
• Node CPU temperatures are ≤ 40 °C when idle.
Node console
BIOS settings can be replicated via /dev/nvram (a minimal copy sketch follows the list below), and many error conditions
show up in the Linux log files, but some kind of console is still needed to see BIOS POST messages
or console error messages [1]. Some popular alternatives are:
• serial switch
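The following is a minimal sketch of the /dev/nvram replication mentioned above; it is my own illustration, not part of the talk. It assumes the Linux nvram driver is available on identically configured nodes (same motherboard and BIOS revision), and the backup path is only an example.

#!/usr/bin/env python
# Minimal sketch (illustration only): save the CMOS image exposed by the
# Linux nvram driver on one node and write it back on an identically
# configured node.  Assumes /dev/nvram exists (nvram module loaded) and
# that both nodes share the same motherboard and BIOS revision; the
# backup path is a made-up example.
import shutil
import sys

NVRAM_DEV = "/dev/nvram"             # CMOS bytes exposed by the nvram driver
BACKUP = "/root/bios-settings.img"   # hypothetical backup location

def save():
    # Copy the CMOS contents to a backup file.
    with open(NVRAM_DEV, "rb") as dev, open(BACKUP, "wb") as out:
        shutil.copyfileobj(dev, out)

def restore():
    # Write a previously saved image back to /dev/nvram.
    with open(BACKUP, "rb") as src, open(NVRAM_DEV, "wb") as dev:
        shutil.copyfileobj(src, dev)

if __name__ == "__main__":
    restore() if len(sys.argv) > 1 and sys.argv[1] == "restore" else save()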
3. Disk performance
Equipping each node with two disks gives flexibility in the disk space
configuration. I/O performance can be maximized using software RAID 1
and storage space can be maximized using individual disks or software
RAID 0 (with or without LVM). In the following, results for single disks
and software RAID 1 are presented.
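As a concrete illustration of the software RAID 1 option, here is a minimal sketch; the tool choice (mdadm), the partition names and the mount point are my own assumptions, not details from the talk.

#!/usr/bin/env python
# Minimal sketch (assumed tooling): build a two-disk software RAID 1
# mirror with mdadm and put an ext3 filesystem on it.  The partitions,
# the md device and the mount point are example names only.
import subprocess

DISKS = ["/dev/hda3", "/dev/hdc3"]   # one partition on each of the two disks
MD_DEV = "/dev/md0"                  # resulting mirror device

def run(cmd):
    # Echo the command and abort on any failure.
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Assemble the mirror: RAID level 1 across the two partitions.
run(["mdadm", "--create", MD_DEV, "--level=1", "--raid-devices=2"] + DISKS)
# Create a filesystem on the mirror and mount it where the data will live.
run(["mkfs.ext3", MD_DEV])
run(["mount", MD_DEV, "/data"])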
The seek-test by Tony Wildish is a benchmark tool for studying sequential and
random disk reading speed as a function of the number of processes [2] (a simplified sketch of the access pattern follows the list below).
• Each filesystem was filled, so that the speed variation between different
regions of the disk was averaged over.
• The files were read sequentially (1-1) and randomly (5-10); the random test
skips a random number of blocks within an interval defined by a minimum
and maximum number of blocks.
• All of the RAM available was used in these tests and each point reads
data for 600 s.
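The sketch below illustrates, in much simplified form, the random-read part of such a test: N processes read blocks at random offsets of a large file for a fixed time and the aggregate rate is printed. It is my own illustration, not Wildish's seek-test; the file path, block size and duration are arbitrary examples.

#!/usr/bin/env python
# Simplified sketch of a seek-test style random-read benchmark: each
# worker process reads 1 MB blocks at random offsets of a large file
# until the time is up, and the aggregate throughput is reported.
import multiprocessing
import os
import random
import time

TEST_FILE = "/data/testfile"   # a file much larger than RAM (example path)
BLOCK = 1024 * 1024            # read in 1 MB chunks
DURATION = 60                  # seconds per measurement point

def worker(nbytes):
    # Read random blocks until the deadline; store the byte count.
    size = os.path.getsize(TEST_FILE)
    total = 0
    deadline = time.time() + DURATION
    with open(TEST_FILE, "rb") as f:
        while time.time() < deadline:
            f.seek(random.randrange(0, size - BLOCK))
            total += len(f.read(BLOCK))
    nbytes.value = total

if __name__ == "__main__":
    for nprocs in (1, 2, 4, 8, 16, 32):
        counters = [multiprocessing.Value("q", 0) for _ in range(nprocs)]
        procs = [multiprocessing.Process(target=worker, args=(c,)) for c in counters]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        rate = sum(c.value for c in counters) / DURATION / 1024.0
        print("%2d processes: %8.0f kB/s" % (nprocs, rate))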
[Figure: Sequential read speed (seek-test 1-1) on single disks (JBOD); kB/s versus number of processes; datasets gplot.seek-test-1-1_jbod_120gxpMB and gplot.seek-test-1-1_jbod_120gxp.]
[Figure: Sequential read speed (seek-test 1-1) on software RAID 1 (64 kB chunks, new BIOS); kB/s versus number of processes; dataset gplot.sraid1_newBIOS_crh731_64k-seek-test-1-1.]
[Figure: Random read speed (seek-test 5-10), software RAID 1 versus JBOD; kB/s versus number of processes; datasets gplot.sraid1_newBIOS_crh731_64k-seek-test-5-10 and gplot.compute-0-0_jbod_seek-test-5-10_20030812.]
Features of Rocks
• Supported architectures: IA32 and IA64, with IA32-64 support coming soon (no Alpha, no SPARC)
• Networks: Ethernet, Myrinet (no SCALI)
• Requires nodes with local disk
• Requires nodes able to boot without a keyboard (headless support)
• Relies heavily on Kickstart and Anaconda (works only with Red Hat)
• Can handle a cluster with heterogeneous hardware
• XML configuration scripts
• Non-RPM packages can be handled with XML post-configuration scripts
• eKV (Ethernet, Keyboard and Video), a telnet-based tool for remote
installation monitoring
• Supports PXE booting and installation
• Installation is very easy on well behaved hardware
Documentation
Tutorial: The Rocks CDROM contains the slides of a good tutorial talk [5].
User Guide: The Rocks manual covers the minimum to get one started.
Reference Guide: Some basic configuration changes are covered in the manual,
but more advanced issues usually need to be resolved with the help of the
very active Rocks mailing list, which also has a web archive.
Minimum hardware to try out Rocks
To try out Rocks on a minimal 1+1 cluster, only two computers, three NICs
and one crossover Ethernet cable are needed.
Future of Rocks
The next version of NPACI Rocks will be based on Red Hat Enterprise Linux
compiled from the source RPMs.
The project has financing for at least three years from now.
5. Conclusions
• The hardware of mill works well; application software installation has
started, and the NorduGrid installation has also begun.
• The Tyan MPX BIOS still has room for improvement.
• Software RAID 1 gives good sequential and random (2 processes) disk
reading performance.
• The 3ware IDE-controller driver or the Linux SCSI driver has some room
for improvement compared to the standard IDE-driver.
• The NPACI Rocks cluster distribution has proven to be a powerful tool
that enables non-experts to set up a Linux cluster, but the Rocks
documentation is not quite up to the quality of the software. This is partly
compensated for by the active Rocks user mailing list.
• The NPACI Rocks cluster distribution is an interesting option worth
considering for the Material science grid project.
6. Acknowledgements
The Institute of Physical Sciences has financed the mill cluster nodes.
The Kumpula Campus Computational Unit hosts the mill cluster in
their machine room and has provided the needed network switches.
N. Jiganova has helped with the software and hardware of the cluster.
P. Lähteenmäki has been very helpful in clarifying network issues and
setting up the network for the cluster.
Damicon Kraa, the vendor of the nodes, has given very good service.
References
[1] Remote Serial Console HOWTO, http://www.dc.turkuamk.fi/LDP/HOWTO/Remote-Serial-Console-HOWTO/index.html
[2] T. Wildish, Seek-test, http://wildish.home.cern.ch/wildish/Benchmark-results/Performance.html
[3] R. Leiva, Analysis and Evaluation of Open Source Solutions for the Installation and Management of Clusters of PCs under Linux, http://heppc11.ft.uam.es/galera/doc/ATL-SOFT-2003-001.pdf
[4] Rocks homepage, http://rocks.sdsc.edu/Rock
[5] NPACI All Hands Meeting, Rocks v2.3.2 Tutorial Session, March 2003, http://rocks.sdsc.edu/rocks-documentation/3.0.0/talks/npaci-ahm-2003.pdf