
HIP Software and Physics project

Experiences setting up a Rocks based Linux cluster


Tomas Lindén
Helsinki Institute of Physics
CMS programme
08.12.2003

Physicists network December meeting


CSC, Espoo, Finland


Contents
1. Introduction
2. Linux cluster mill hardware
3. Disk performance results
4. NPACI Rocks Cluster Distribution
5. Conclusions
6. Acknowledgements


1. Introduction
Commodity off-the-shelf (COTS) hardware gives an excellent price/performance
ratio for High Throughput Computing (HTC). Top500 class computers can be
built using COTS. There is no public x86 Linux HTC cluster at CSC nor in
Kumpula. To meet the demand for large affordable CPU resources an x86 Linux
cluster project was started in Kumpula with:
• Hardware optimized for Monte Carlo simulations in Computational
Material Science (MD simulations) and High Energy Physics (CMS
Monte Carlo production runs for the CMS Data Challenges).
• The applications are mostly of the pleasantly parallel type.
• Network bandwidth and latency for message passing are therefore not
an issue, so only a 100 Mb/s network is required between the nodes.
• One subphase of the simulations is I/O bound and needs good random
read performance.
The budget for the system was ≈50 kEUR (without VAT).

Main design goals
• Maximize CPU power
• Maximize random disk reading bandwidth
• Minimize costs

Casing comparison
• minitower
  + standard commodity solution
  + inexpensive
  - big space requirement
• 1U or 2U 19” rack case
  + more compact than minitower
  - more expensive than minitower
• home made ”ATX blade server”
  + more compact than minitower
  + least expensive if assembly costs are neglected
  - labour intensive
  - cooling problems can be an issue
• blade server
  + most compact
  - most expensive


A 32 node OpenMosix AMD Athlon XP ATX blade server at the Institute of Biotechnology at
the University of Helsinki.


Heat dissipation is obviously a major problem with any sizeable cluster.
Other factors affecting the choice of casing are space and cost issues.
The idea of building an ”ATX blade server” was very attractive to us in terms
of needed space and cost, but we were somewhat discouraged by the heat
problems with the previously shown cluster (the heat problem was
subsequently solved with more effective CPU coolers).
It was also felt that one would have more warranty problems with a
completely home built system.
The mechanical design of an ATX blade server would also take some extra
effort compared to a more standard solution.
Because of space limitations a minitower solution was not possible, so we
chose a rack based solution with 2U cases for the nodes and a 4U case for
the frontend of the Linux cluster mill.


1U cases are very compact, but the only possibility to add expansion cards is
through a PCI riser card, which can be problematic, so when using a 1U case
one really wants a motherboard with as many integrated components as
possible. The cooling of a 1U case needs to be designed very carefully.
The advantage of a 2U case is that half height or low-profile PCI cards can be
used without any PCI riser card (Intel Gb/s NICs are available in half height
PCI size). Heat dissipation problems are probably less likely to occur in a 2U
case because of the additional space available for airflow.
The advantage of 4U cases is that standard PCI expansion cards can be used.


The motherboard requirements were:


• Dual CPU support
– minimizes the number of boxes to save space and maintenance effort
• At least dual ATA 100 controllers
• Integrated fast ethernet or Gb/s ethernet
• Serial console support or integrated display adapter
• Support for bootable USB devices (CDROM, FDD)
– Worker nodes do not need a CDROM or floppy drive if the
motherboard BIOS supports booting from a corresponding USB
device and the network card and the cluster management software
support PXE booting. Worker nodes can then be booted for
maintenance or BIOS upgrades from USB devices that are plugged
into a node only when needed.
• Support for PXE booting

2. Linux cluster mill hardware


[Network diagram: a schematic view of the mill cluster network connections.
Shown are the PUBLIC LAN; the frontend mill with eth1 100 Mb/s and
eth0 1 Gb/s (Cu); the fileserver silo (1.4 TB) with eth0 1 Gb/s (Cu) and
eth1 1 Gb/s (fiber); the PRIVATE LAN with a 1000/100 Mb/s switch and a
console; and the compute nodes compute-0-0 (node-0) ... compute-1-17
(node-31), each with eth0 100 Mb/s.]



Node hardware:
• CPU: 2 * AMD 2.133 GHz Athlon MP
• MB: Tyan Tiger MPX S2466N-4M
• Memory: 1 GB ECC registered DDR
• IDE disks: 2 * 80 GB 7200 rpm Hitachi 180 GXP DeskStar
• NIC: 3Com 3C920C 100 Mb/s
• Case: Supermicro SC822I-300LP 2U
• Power: Ablecom SP302-2C 300 W
• FDD: integrated in the case
• Price: ≈1.4 kEUR/node with 0% VAT

[Photo: the 32+1 node dual AMD Athlon MP Rocks 2U rack cluster mill.]

One USB 2.0 - IDE case with a 52x IDE CDROM can be connected to the
USB 1.1 ports of any node.
The 1000/100 Mb/s network switches existed already previously.


Frontend hardware:
• CPU: 2 * AMD 1.667 GHz Athlon MP
• MB: Tyan Tiger MPX S2466N-4M
• Memory: 1 GB ECC registered DDR
• IDE disks: 2 * 60 GB 7200 rpm IBM 180 GXP DeskStar
• NIC: 3Com 3C996-TX Gb/s
• NIC: 3Com 3C920C 100 Mb/s
• Graphics: ATI Rage Pro Turbo 8 MB AGP
• Case: Compucase S466A 4U
• Power supply: HEC300LR-PT
• CDROM: LG 48X/16X/48X CD-RW
• FDD: yes

Racks:
• 2 Rittal TS609/19 42U 60x90x200 cm

Node cables:
• Power supply
• Ethernet
• Serial console CAT-5

Console monitoring:
• Digi EtherLite 32 with 32 RS232 ports
• 2 port D-Link DKVM-2 switch

Recycled hardware:
• rack shelves
• serial console control computer
• 17” display, keyboard and mouse

In total:
• CPUs: 64+2
• Memory: 30 GB
• Disk: 5 TB (2.5 TB RAID1)


Cooling
• The cases have large air exhaust holes near the CPUs.
• Rounded IDE cables are used in the nodes.
• The frontend draws 166 W (idle) to 200 W (full load).
• Node CPU temperatures are ≤ 40 °C when idle.

The interior of one of the nodes.



About the chosen hardware

• PXE booting works mostly fine.
• USB FD and USB CDROM booting work OK (USB 1.1 is a bit slow).
• The Digi EtherLite 32 has worked fine after the correct DB9-RJ-45
adapter wiring was found.
• The ATX power cable extensions have caused problems on a few nodes.
• Booting from a USB memory stick does not work well, because the nodes
sometimes hang when the flash memory is inserted. Booting several nodes
from a USB storage device is impractical, because the BIOS boot order list
has to be edited every time the memory stick is inserted, unlike the case
for CDROM and floppy drives.
• The remote serial console is a bit tricky to set up. Unfortunately the
slow BIOS memory test cannot be interrupted from a serial console.
• The Rittal 1U power outlets have a fragile plastic mounting plate.

Node console
BIOS settings can be replicated with /dev/nvram and many error conditions are found
in the Linux log files, but to see BIOS POST messages or console error messages some kind
of console is needed [1] (a small /dev/nvram sketch follows the comparison table below).
Some popular alternatives are:

• keyboard and display attached only when needed

• Keyboard Video Mouse (KVM) switch

• serial switch

• remote management card

          cabling   LAN access   all consoles at once   graphics card   speed   special keys   mouse
Serial       +          +                 +              not required     -           -           -
KVM          -          -                 -                required       +           +           +
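
As a rough illustration of the /dev/nvram replication idea mentioned above, the
following Python sketch dumps the CMOS contents of a node and prints a checksum,
so the settings of a reference node can be compared with the other nodes (for
example by running the script via cluster-fork). This is only a sketch, assuming
the nvram driver is available and the script runs as root; the output path is
hypothetical.

    # capture_nvram.py -- minimal sketch, not part of Rocks; assumes the
    # nvram driver is available and the script runs as root on each node.
    import hashlib
    import socket

    NVRAM_DEVICE = "/dev/nvram"   # CMOS RAM exposed by the nvram driver

    def dump_cmos(path=None):
        """Save the raw CMOS bytes to a file and return their MD5 checksum."""
        if path is None:
            path = "/tmp/cmos-%s.bin" % socket.gethostname()   # hypothetical location
        data = open(NVRAM_DEVICE, "rb").read()
        open(path, "wb").write(data)
        return hashlib.md5(data).hexdigest()

    if __name__ == "__main__":
        # identical checksums indicate identical CMOS settings across nodes
        print("%s %s" % (socket.gethostname(), dump_cmos()))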


Useful special Red Hat Linux kernel options:

• apm=power-off This is useful on SMP machines.
• display=IP:0 Allows kernel Oopses and panics to be viewed remotely (this
has not been tested on mill).
Compilers to install:
• C/C++ and F77/F90 compilers by Intel and the Portland Group
Grid software to install:
• NorduGrid software
• Large Hadron Collider Computing Grid (LCG) software
Application software to install:
• Molecular dynamics simulation package PARCAS.
• Compact Muon Solenoid standard and production software.

3. Disk performance
Equipping each node with two disks gives flexibility in the disk space
configuration. I/O performance can be maximized using software RAID 1,
and storage space can be maximized using individual disks or software
RAID 0 (with or without LVM). In the following, performance for single disks
and software RAID 1 is presented.
The seek-test benchmark by Tony Wildish measures sequential and
random disk reading speed as a function of the number of processes [2]
(an illustrative read-throughput sketch follows the list below).
• Each filesystem was filled, so that the speed variation between different
disk regions was averaged over.
• The files were read sequentially (1-1) and randomly (5-10) (the test can
randomly skip a random number of blocks within an interval given by a
minimum and maximum number of blocks).
• All of the available RAM was used in these tests and each point reads
data for 600 s.
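
The seek-test tool itself is referenced in [2]; the following Python sketch only
illustrates the idea of the measurement. It reads a large file either sequentially
or with random skips of 5-10 blocks between reads and reports the achieved
throughput. The file name, block size and duration are arbitrary choices, not the
parameters of the measurements shown below, and the real test runs several such
reader processes in parallel.

    # read_throughput_sketch.py -- illustrative only; this is NOT seek-test.
    import os, random, time

    BLOCK = 64 * 1024             # bytes read per request (arbitrary)
    MIN_SKIP, MAX_SKIP = 5, 10    # random skip range in blocks (cf. the 5-10 mode)

    def read_throughput(path, duration=60.0, randomize=False):
        """Return the MB/s achieved while reading `path` for `duration` seconds."""
        size = os.path.getsize(path)
        done = 0
        start = time.time()
        with open(path, "rb") as f:
            while time.time() - start < duration:
                if randomize:
                    # jump forward a random number of blocks, wrapping at EOF
                    skip = random.randint(MIN_SKIP, MAX_SKIP) * BLOCK
                    f.seek((f.tell() + skip) % size)
                data = f.read(BLOCK)
                if not data:          # reached EOF in sequential mode: start over
                    f.seek(0)
                    continue
                done += len(data)
        return done / (time.time() - start) / 1e6

    if __name__ == "__main__":
        target = "/tmp/bigfile"       # hypothetical large test file
        print("sequential %.1f MB/s" % read_throughput(target))
        print("random     %.1f MB/s" % read_throughput(target, randomize=True))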

Disk read throughput

[Plot: read throughput in kB/s (0-60000) versus number of reader processes (0-60),
three data series.]

Single disk sequential read performance: Tiger MPX MB IDE controller with a
180 GXP disk, Tyan MP MB IDE controller with a 120 GXP disk, and 3ware 7850
IDE controller with a 120 GXP disk on a Tiger MP.


Disk read throughput

[Plot: read throughput in kB/s (0-80000) versus number of reader processes (0-60),
two data series.]

Software RAID 1 sequential read performance: Tiger MPX MB IDE controller with
180 GXP disks and 3ware 7850 IDE controller with 120 GXP disks on a Tiger MP.


Disk read throughput

[Plot: read throughput in kB/s (0-20000) versus number of reader processes (0-60),
three data series.]

Software RAID 1 random read performance: Tiger MPX MB IDE controller with
180 GXP disks and 3ware 7850 IDE controller with 120 GXP disks on a Tiger MP.
Single 180 GXP disk random read performance on a Tiger MPX MB IDE controller.


4. NPACI Rocks Cluster Distribution

Cluster software requirements
• Cluster software should reduce the task of maintaining N+1 computers to
maintaining only 1+1 computers and provide good tools for cluster
installation, configuration, monitoring and maintenance.
• Manage software components, not the bits on the disk. This is the only
way to deal with heterogeneous hardware:
– System imaging relies on homogeneity (bit blasting).
– Homogeneous clusters exist only for a very short time.
• Cluster software should work with CERN Red Hat Linux.
• No cluster software with a modified Linux kernel (OpenMosix, Scyld).
NPACI Rocks was chosen as the cluster software because of a favorable review
and positive experiences within CMS [3].

Some other cluster tools are


• OSCAR (Open Source Cluster Application Resources)
– Image based, more complex to set up than Rocks, but better documentation.
– http://oscar.sourceforge.net/
• LCFG (Local ConFiGuration system)
– Powerful but difficult to use. Used and developed by EDG/EGEE (LCG).
– http://www.lcfg.org/
Some widely used automated installation tools are:
• SystemImager
– Assumes homogeneous hardware.
– http://www.systemimager.org/
• LUI (Linux Utility for cluster Installation)
– Created by IBM, no development since 2001?
– http://oss.software.ibm.com/developerworks/projects/lui/
• FAI (Fully Automatic Installation)
– Only for Debian Linux
– http://www.informatik.uni-koeln.de/fai/

The NPACI Rocks Cluster Distribution is RPM based cluster management
software for scientific computation, built on Red Hat Linux [4]. Both the
latest version 3.0.0 and the previous one, 2.3.2, are based on Red Hat 7.3.
There are at least four Rocks based clusters on the November 2003 Top500
list (# 26, # 176, # 201 and # 408). There are some 140 registered Rocks
clusters with more than 8000 CPUs.
All nodes are considered to have soft state and any upgrade, installation or
configuration change is done by node reinstallation, which takes about 6-8
min for a node and about 20 min for the whole mill cluster.
The default configuration also reinstalls a node after each power down.
Settings like this can easily be changed according to taste.
Rocks makes it possible for nonexperts to set up a Linux cluster for scientific
computation in a short amount of time.


Features of Rocks
• Supported architectures: IA32, IA64, with IA32-64 soon (no Alpha, no SPARC)
• Networks: Ethernet, Myrinet (no SCALI)
• Requires nodes with a local disk
• Requires nodes able to boot without a keyboard (headless support)
• Relies heavily on Kickstart and Anaconda (works only with Red Hat)
• Can handle a cluster with heterogeneous hardware
• XML configuration scripts
• Non-RPM packages can be handled with XML post configuration scripts
• eKV (ethernet, Keyboard and Video), a telnet based tool for remote
installation monitoring
• Supports PXE booting and installation
• Installation is very easy on well behaved hardware

• Services and libraries out of the box

– Ganglia (monitoring with a nice graphical interface)
– SNMP (text mode monitoring information)
– PBS (batch queue system)
– Maui (scheduler)
– Sun Grid Engine (alternative to PBS)
– MPICH (parallel libraries)
– DHCP (node IP addresses)
– NIS (user management); 411 SIS is beta in v. 3.0.0
– NFS (global disk space)
– MySQL (cluster internal configuration bookkeeping)
– HTTP (cluster installation)
– PVFS (distributed cluster file system kernel support)

The most important Rocks commands


• insert-ethers Insert/remove a node to/from the MySQL cluster database.
• rocks-dist mirror Build or update a Rocks mirror.
• rocks-dist dist Build the RPM distribution for compute nodes.
• shoot-node Reinstall a compute node.
• cluster-fork Run any command serially on the cluster.
cluster-fork /boot/kickstart/cluster-kickstart Reinstalls the cluster.
• cluster-ps Get a cluster wide process list.
• cluster-kill Kill processes running on the cluster.
Test whether the XML/kickstart infrastructure returns an OK kickstart file:
cd /export/home/install ; ./kickstart.cgi --client="compute-0-0"

character:    "        &       '        <      >
XML entity:   &quot;   &amp;   &apos;   &lt;   &gt;
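
The entity table above is needed when commands containing these characters are
embedded in the XML configuration scripts. As a small, hedged illustration (not a
Rocks tool), Python's standard library can produce the same escaping; the example
command string is made up.

    # xml_escape_sketch.py -- illustrates the entity table above.
    from xml.sax.saxutils import escape

    # escape() always converts & < >; the extra map adds the two quote characters.
    QUOTES = {'"': "&quot;", "'": "&apos;"}

    cmd = 'echo "don\'t break the <post> section" && ls > /tmp/out'
    print(escape(cmd, QUOTES))
    # prints: echo &quot;don&apos;t break the &lt;post&gt; section&quot; &amp;&amp; ls &gt; /tmp/out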

Documentation
Tutorial: The Rocks CDROM contains the slides of a good tutorial talk [5].
User Guide: The Rocks manual covers the minimum to get one started.
Reference Guide: Some basic configuration changes are covered in the manual,
but more advanced issues usually need to be resolved with the help of the
very active Rocks mailing list, which also has a web archive.
Minimum hardware to try out Rocks
To try out Rocks on a minimal 1+1 cluster, only two computers, three NICs
and one crossover ethernet cable are needed.
Future of Rocks
The next version of NPACI Rocks will be based on Red Hat Enterprise Linux
compiled from the source RPMs.
The project has financing for at least three years from now.

5. Conclusions
• The hardware of mill works well; application software installation has
started. The NorduGrid installation has also begun.
• The Tyan MPX BIOS still has room for improvement.
• Software RAID 1 gives good sequential and random (2 processes) disk
reading performance.
• The 3ware IDE controller driver or the Linux SCSI driver has some room
for improvement compared to the standard IDE driver.
• The NPACI Rocks cluster distribution has proven to be a powerful tool
enabling nonexperts to set up a Linux cluster, but the Rocks
documentation level is not quite up to the software quality. This is partly
compensated by the active Rocks user mailing list.
• The NPACI Rocks cluster distribution is an interesting option worth
considering for the Material science grid project.

6. Acknowledgements
The Institute of Physical Sciences has financed the mill cluster nodes.
The Kumpula Campus Computational Unit hosts the mill cluster in its
machine room and has provided the needed network switches.
N. Jiganova has helped with the software and hardware of the cluster.
P. Lähteenmäki has been very helpful in clarifying network issues and
setting up the network for the cluster.
Damicon Kraa, the vendor of the nodes, has given very good service.


References
[1] Remote Serial Console HOWTO,
http://www.dc.turkuamk.fi/LDP/HOWTO/Remote-Serial-Console-HOWTO/index.html
[2] Seek-test, T. Wildish,
http://wildish.home.cern.ch/wildish/Benchmark-results/Performance.html
[3] R. Leiva, Analysis and Evaluation of Open Source Solutions for the
Installation and Management of Clusters of PCs under Linux,
http://heppc11.ft.uam.es/galera/doc/ATL-SOFT-2003-001.pdf
[4] Rocks homepage, http://rocks.sdsc.edu/Rock
[5] NPACI All Hands Meeting, Rocks v2.3.2 Tutorial Session, March 2003,
http://rocks.sdsc.edu/rocks-documentation/3.0.0/talks/npaci-ahm-2003.pdf
