“Blue Gene-Super Computer”

INDEX

Abstract
1 Introduction
2 Overview of the Blue Gene/L Supercomputer
2.1 Overall Organization
2.2 Blue Gene/L Communications Hardware
2.3 Hardware Technologies for Petascale Computing
2.4 Special-Purpose Hardware
3 Blue Gene System Software
3.1 System Software for the I/O Nodes
3.2 System Software for the Compute Nodes
3.3 System Management Software
3.4 Compiler and Run-time Support
3.5 Communication Infrastructure
4 Blue Gene/L Simulation Environment
5 Trends in Supercomputer Performance
6 Programming Methodologies for Petascale Computing
6.1 New Algorithms for Petascale Computing
7 Validating the Architecture with Application Programs
8 Blue Gene Science Applications Development
9 Conclusion

Abstract

Blue Gene is a massively parallel computer being developed at the IBM Thomas J. Watson Research Center. Blue Gene represents a hundred-fold improvement in performance over today's fastest supercomputers. It will achieve 1 PetaFLOP/s through unprecedented levels of parallelism, in excess of 4,000,000 threads of execution. The Blue Gene project has two important goals: to advance the understanding of biologically important processes, and to advance knowledge of cellular architectures (massively parallel systems built of single-chip cells that integrate processors, memory, and communication) and of the software needed to exploit them effectively. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver a target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.

Chapter 1: INTRODUCTION

In November 2001 IBM announced a partnership with Lawrence Livermore National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node machine designed around embedded PowerPC processors. Through the use of system-on-a-chip integration coupled with a highly scalable cellular architecture, Blue Gene/L will deliver 180 or 360 teraflops of peak computing power, depending on the utilization mode. Blue Gene/L represents a new level of scalability for parallel systems. Whereas existing large-scale systems range in size from hundreds to a few thousand compute nodes, Blue Gene/L makes a jump of almost two orders of magnitude. Several techniques have been proposed for building such a powerful machine. Some of the designs call for extremely powerful (100 GFLOPS) processors based on superconducting technology. The class of designs that we focus on uses current and foreseeable CMOS technology. It is reasonably clear that such machines, in the near future at least, will require a departure from the architectures of current parallel supercomputers, which use a few thousand commodity microprocessors. With current technology, it would take around a million microprocessors to achieve petaFLOPS performance; power requirements and cost considerations alone clearly preclude this option. The class of machines of interest to us uses a "processors-in-memory" design: the basic building block is a single chip that includes multiple processors as well as memory and interconnection routing logic. On such machines, the ratio of memory to processors will be substantially lower than the prevalent one. As the technology is assumed to be the current-generation one, the number of processors will still have to be close to a million, but the number of chips will be much lower. Using such a design, petaFLOPS performance will be reached within the next 2-3 years, especially since IBM has announced the Blue Gene project aimed at building such a machine.

The system software for Blue Gene/L is a combination of standard and custom solutions. The software architecture for the machine is divided into three functional entities arranged hierarchically: a computational core, a control infrastructure, and a service infrastructure. The I/O nodes (part of the control infrastructure) execute a version of the Linux kernel and are the primary off-load engine for most system services. No user code directly executes on the I/O nodes.

Chapter 2: Overview of the Blue Gene/L Supercomputer

Blue Gene/L is a new architecture for high performance parallel computers


based on low cost embedded PowerPC technology.

2.1 Overall Organization


The basic building block of Blue Gene/L is a custom system-on-a-chip that
integrates processors, memory and communications logic in the same piece of
silicon. The BG/L chip contains two standard 32-bit embedded PowerPC 440 cores, each with private 32 KB L1 instruction and 32 KB data caches. The small L2 caches act as prefetch buffers for the L3 cache. Each core drives a custom 128-bit double FPU that can perform four double-precision floating-point operations per cycle. This custom FPU consists of two conventional FPUs joined together, each having a 64-bit register file with 32 registers. One of the conventional FPUs (the primary side) is compatible with the standard PowerPC floating-point instruction set. In most scenarios, only one of the 440 cores is dedicated to running user applications, while the second processor drives the networks. At a target speed of 700 MHz, the peak performance of a node is 2.8 GFlop/s. When both cores and FPUs in a chip are used, the peak performance per node is 5.6 GFlop/s. The two cores are not cache coherent at the L1 level; to overcome this limitation, BG/L provides a variety of synchronization devices in the chip: lockbox, shared SRAM, L3 scratchpad, and the blind device. The lockbox unit contains a limited number of memory locations for fast atomic test-and-sets and barriers. 16 KB of SRAM in the chip can be used to exchange data between the cores, and regions of the EDRAM L3 cache can be reserved as an addressable scratchpad. The blind device permits explicit cache management.

The low-power characteristics of Blue Gene/L permit a very dense packaging, as described in [1]. Two nodes share a compute card that also contains SDRAM-DDR memory. Each node supports a maximum of 2 GB of external memory, but in the current configuration each node directly addresses 256 MB at 5.5 GB/s bandwidth with a 75-cycle latency. Sixteen compute cards can be plugged into a node board. A cabinet with two midplanes contains 32 node boards, for a total of 2048 CPUs and a peak performance of 2.9/5.7 TFlops. The complete system has 64 cabinets and 16 TB of memory. In addition to the 64K compute nodes, BG/L contains a number of I/O nodes (1024 in the current design). Compute nodes and I/O nodes are physically identical, although I/O nodes are likely to contain more memory.

2.2 Blue Gene/L Communications Hardware

Inter-node communication is possible only through messages. The BG/L ASIC supports five different networks:
• Torus
• Tree
• Ethernet
• JTAG
• Global interrupts

2.2.1 TORUS:
The main communication network for point-to-point messages is a three-dimensional torus, constructed with point-to-point serial links between routers embedded within the Blue Gene/L ASICs. Each ASIC therefore has six nearest-neighbor connections, some of which may traverse relatively long cables. The target hardware bandwidth for each torus link is 175 MB/s in each direction. The 64K nodes are organized into a partitionable 64x32x32 three-dimensional torus. The torus and the tree networks are memory-mapped devices: processors send and receive packets by explicitly writing to (and reading from) special memory addresses that act as FIFOs. These reads and writes use the 128-bit SIMD registers. The I/O nodes are not part of the torus network.

Figure: Basic architecture of the torus router

2.2.2 TREES:
I/O and compute nodes share the tree network. Tree packets are the main mechanism for communication between I/O and compute nodes. The tree network is a token-based network with two VCs. Packets are non-blocking across VCs. The operation of the tree is flexibly controlled by setting programmable control registers. In its simplest form, packets going up towards the root of the tree can be either point-to-point or combining. Point-to-point packets are used, for example, when a compute node needs to communicate with its I/O node. The tree module, shown in Figure 6, is equipped with an integer ALU for combining incoming packets and forwarding the resulting packet. Packets can be combined using bit-wise operations such as XOR or integer operations such as ADD/MAX for a variety of data widths. A floating-point sum reduction on the tree potentially requires two round trips. In the first trip, each processor submits the exponents for a max reduction. In the second trip, the mantissas are appropriately shifted to correspond to the common exponent computed on the first trip and then fed into the tree for an integer sum reduction. Alternatively, double-precision floating-point operations can be performed in a single pass through the tree by converting the floating-point numbers to 2048-bit integer representations.
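The two-pass scheme can be sketched as follows in C, assuming hypothetical tree-reduction primitives tree_allreduce_max64() and tree_allreduce_add64() that combine one 64-bit integer contribution per node; the real hardware interface differs, and rounding, overflow, and special values are ignored.

#include <math.h>
#include <stdint.h>

/* Hypothetical collective primitives over the tree network: every node
 * contributes one 64-bit integer and receives the combined result.     */
extern int64_t tree_allreduce_max64(int64_t local);
extern int64_t tree_allreduce_add64(int64_t local);

/* Two-pass floating-point sum reduction using integer tree hardware.
 * Pass 1: agree on a common exponent (integer MAX).
 * Pass 2: shift each mantissa to that exponent and sum as integers.    */
double tree_fp_sum(double local)
{
    int exp;
    double mant = frexp(local, &exp);            /* local = mant * 2^exp    */

    int64_t max_exp = tree_allreduce_max64(exp); /* pass 1: common exponent */

    /* Align every contribution as a fixed-point value relative to the
     * common exponent, using 52 fractional bits.                        */
    int64_t fixed = (int64_t)ldexp(mant, 52 - (int)(max_exp - exp));

    int64_t sum = tree_allreduce_add64(fixed);   /* pass 2: integer sum     */

    return ldexp((double)sum, (int)max_exp - 52);
}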

2.2.3 ETHERNET:
Only I/O nodes are attached to the Gbit/s Ethernet network, giving 1024 x 1 Gbit/s links to external file servers. The service nodes, a Linux cluster or SP system, reside outside the Blue Gene/L core and communicate with the computational core through the IDo chips. The current packaging contains one IDo chip per node board and another one per midplane. The IDo chips are 25 MHz FPGAs that receive commands from the service nodes using raw UDP packets over a trusted private 100 Mbit/s Ethernet control network.

2.2.4 JTAG:
The JTAG protocol is used for reading and writing to any address of the 16
KB SRAMs in the BG/L chips.

Figure: Outline of the Blue Gene/L system software.

2.3 Hardware Technologies for petascale computing

Hardware technologies for achieving petaflop/s computing performance can be grouped into five main categories: conventional technologies, processing-in-memory (PIM) designs, designs based on superconducting processor technology, special-purpose hardware designs, and schemes that use the aggregate computing power of Web-distributed processors. We consider only technologies that are currently available or expected to become available in the near future; thus, we do not discuss designs based on speculative technologies, such as quantum or macromolecular computing, although they might be important in the long run.

2.4 Special Purpose hardware
Researchers on the Grape project at the University of Tokyo have designed and built a family of special-purpose attached processors for performing the gravitational force computations that form the inner loop of N-body simulation problems. The computational astrophysics community has used Grape processors extensively for N-body gravitational simulations. A Grape-4 system consisting of 1,692 processors, each with a peak speed of 640 Mflop/s, was completed in June 1995 and has a peak speed of 1.08 Tflop/s. In 1999, a Grape-5 system won a Gordon Bell Award in the performance-per-dollar category, achieving $7.3 per Mflop/s on a tree-code astrophysical simulation. A Grape-6 system is planned for completion in 2001 and is expected to have a performance of about 200 Tflop/s. A 1-Pflop/s Grape system is planned for completion by 2003.

Chapter 3: Blue Gene System Software

To address the challenges of scalability and complexity posed by BG/L, we have developed the system software architecture described in detail in this section.

3.1 System Software for the I/O Nodes


The Linux kernel that executes in the I/O nodes is based on a standard distribution for PowerPC 440GP processors. Although Blue Gene/L uses standard PPC 440 cores, the overall chip and card design required changes to the booting sequence, interrupt management, memory layout, FPU support, and device drivers of the standard Linux kernel. There is no BIOS in the Blue Gene/L nodes, so the configuration of a node after power-on and the initial program load (IPL) are initiated by the service nodes through the control network. We modified the interrupt and exception handling code to support Blue Gene/L's custom interrupt controller (BIC). The kernel MMU implementation remaps the tree and torus FIFOs to user space. We support the new EMAC4 Gigabit Ethernet controller. We also updated the kernel to save and restore the double FPU registers on each context switch. The nodes in the Blue Gene/L machine are diskless, so the initial root file system is provided by a ramdisk linked against the Linux kernel. The ramdisk contains shells, simple utilities, shared libraries, and network clients such as ftp and nfs.

Because of the non-coherent L1 caches, the current version of Linux runs on one of the 440 cores, while the second CPU is captured at boot time in an infinite loop. We are investigating two main strategies to effectively use the second CPU in the I/O nodes: SMP mode and virtual mode. We have successfully compiled an SMP version of the kernel, after implementing all the required interprocessor communication mechanisms, because the BG/L BIC is not [2] compliant. In this mode, the TLB entries for the L1 cache are disabled in kernel mode and processes have affinity to one CPU. Forking a process on a different CPU requires additional parameters to the system call. The performance and effectiveness of this solution are still an open issue. A second, more promising mode of operation runs Linux on one of the CPUs, while the second CPU serves as the core of a virtual network card. In this scenario, the tree and torus FIFOs are not visible to the Linux kernel; transfers between the two CPUs appear as virtual DMA transfers. We are also investigating support for large pages. The standard PPC 440 embedded processors handle all TLB misses in software. Although the average number of instructions required to handle these misses has significantly decreased, it has been shown that larger pages improve performance.

3.2 System Software for the Compute Nodes
The "Blue Gene/L Run Time Supervisor" (BLRTS) is a custom kernel that runs on the compute nodes of a Blue Gene/L machine. BLRTS provides a simple, flat, fixed-size 256 MB address space, with no paging, accomplishing a role similar to [2]. The kernel and application program share the same address space, with the kernel residing in protected memory at address 0 and the application program image loaded above, followed by its heap and stack. The kernel protects itself by appropriately programming the PowerPC MMU. Physical resources (torus, tree, mutexes, barriers, scratchpad) are partitioned between application and kernel. In the current implementation, the entire torus network is mapped into user space to obtain better communication efficiency, while one of the two tree channels is made available to the kernel and user applications.

BLRTS presents a familiar POSIX interface: we have ported the GNU Glibc runtime library and provided support for basic file I/O operations through system calls. Multi-processing services (such as fork and exec) are meaningless in a single-process kernel and have not been implemented. Program launch, termination, and file I/O are accomplished via messages passed between the compute node and its I/O node over the tree network, using a point-to-point packet addressing mode. This functionality is provided by a daemon called CIOD (Console I/O Daemon) running on the I/O nodes. CIOD provides job control and I/O management on behalf of all the compute nodes in its processing set. Under normal operation, all messaging between CIOD and BLRTS is synchronous: all file I/O operations are blocking on the application side. CIOD is used in two scenarios:

1. Driven by a console shell (called CIOMAN), used mostly for simulation and testing purposes. The user is provided with a restricted set of commands: run, kill, ps, and set and unset environment variables. The shell distributes the commands to all the CIODs running in the simulation, which in turn take the appropriate actions for their compute nodes.
2. Driven by a job scheduler (such as LoadLeveler) through a special interface that implements the same protocol as the one defined for CIOMAN and CIOD.
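To make the function-shipping idea concrete, here is a rough C sketch of how a blocking write() might be forwarded from a compute node to its CIOD over the tree network. The message format and the tree_send()/tree_recv() primitives are invented for illustration; the actual BLRTS/CIOD protocol is not described at this level of detail here.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical blocking point-to-point tree primitives. */
extern void tree_send(const void *buf, size_t len);  /* to this node's I/O node   */
extern void tree_recv(void *buf, size_t len);        /* from this node's I/O node */

/* Illustrative request/reply format for a shipped write() call. */
struct cio_write_req {
    uint32_t opcode;        /* e.g. CIO_WRITE                       */
    int32_t  fd;            /* file descriptor on the I/O node side */
    uint32_t len;           /* number of payload bytes that follow  */
};
struct cio_reply {
    int32_t  result;        /* bytes written, or negative errno     */
};

#define CIO_WRITE 4         /* hypothetical opcode */

/* Blocking write, function-shipped to CIOD: the compute node sends the
 * request plus data and waits synchronously for the reply, matching the
 * "all file I/O operations are blocking" behavior described above.     */
long blrts_write(int fd, const void *buf, size_t count)
{
    struct cio_write_req req = { CIO_WRITE, fd, (uint32_t)count };
    struct cio_reply rep;

    tree_send(&req, sizeof req);   /* header first ...      */
    tree_send(buf, count);         /* ... then the payload  */
    tree_recv(&rep, sizeof rep);   /* block until CIOD replies */
    return rep.result;
}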
We are investigating a range of compute modes for our custom kernel. In heater mode, one CPU executes both user and network code, while the other CPU remains idle. This will be the mode of operation of the initial prototypes, but it is unlikely to be used afterwards. In co-processor mode, the application runs in a single, non-preemptable thread of execution on the main processor (CPU 0). The coprocessor (CPU 1) is used as a torus device off-load engine that runs as part of a user-level application library, communicating with the main processor through a non-cached region of shared memory. In symmetric mode, both CPUs run applications and users are responsible for explicitly handling cache coherence. In virtual node mode, we provide support for two independent processes per node. The system then looks like a machine with 128K nodes.

3.3 System Management Software


The control infrastructure is a critical component of our design. It provides a separation between execution mechanisms in the BG/L core and policy decisions in external nodes. Local node operating systems (Linux for I/O nodes and BLRTS for compute nodes) implement services and are responsible for local decisions that do not affect the overall operation of the machine. A "global operating system" makes all global and collective decisions, interfaces with external policy modules (e.g., LoadLeveler), and performs a variety of system management services, including: (i) machine booting, (ii) system monitoring, and (iii) job launching. In our implementation, the global OS runs on external service nodes. Each BG/L midplane is controlled by one Midplane Management and Control System (MMCS) process, which provides two paths into the Blue Gene/L complex: a custom control library to access the restricted JTAG network and directly manipulate Blue Gene/L nodes, and sockets over the Gbit/s Ethernet network to manage the nodes on a booted partition. The custom control library can perform low-level hardware operations, such as turning on power supplies, monitoring temperature sensors and fans and reacting accordingly (e.g., shutting down a machine if the temperature exceeds some threshold), configuring and initializing the IDo, Link, and BG/L chips, reading and writing configuration registers and SRAM, and resetting the cores of a BG/L chip. As mentioned in [2], these operations can be performed with no code executing in the nodes, which permits machine initialization and boot, nonintrusive access to performance counters, and post-mortem debugging.
This path into the core is used for control only; for security and reliability reasons, it is not made visible to applications running on the BG/L nodes. The architected path through the functional Gbit/s Ethernet, on the other hand, is used for application I/O, checkpoints, and job launch. We chose to maintain the entire state of the global operating system using standard database technology. Databases naturally provide scalability, reliability, security, portability, logging, and robustness. The database contains both static and dynamic state; it is therefore not just a repository for read-only configuration information, but also an interface to all the visible state of a machine. External entities can manipulate this state by invoking stored procedures and database triggers, which in turn invoke functions in the MMCS processes.

Machine initialization and booting. The boot process for a node consists of the following steps: first, a small boot loader is written directly into the (compute or I/O) node memory by the service nodes using the JTAG control network. This boot loader then loads a much larger boot image into the memory of the node through a custom JTAG mailbox protocol. We use one boot image for all the compute nodes and another boot image for all the I/O nodes. The boot image for the compute nodes contains the code for the compute node kernel and is approximately 64 kB in size. The boot image for the I/O nodes contains the code for the Linux operating system (approximately 2 MB in size) and the image of a ramdisk that contains the root file system for the I/O node. After an I/O node boots, it can mount additional file systems from external file servers. Since the same boot image is used for every node, additional node-specific configuration information (such as torus coordinates, tree addresses, and MAC or IP addresses) must be loaded. We call this information the personality of a node. On the I/O nodes, the personality is exposed to user processes through an entry in the proc file system. BLRTS implements a system call to request the node's personality.
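As an illustration, a node personality might look roughly like the following C structure; the field names and widths are guesses for illustration, not the actual BG/L definition.

#include <stdint.h>

/* Hypothetical layout of a node "personality" record: the node-specific
 * configuration loaded at boot on top of the shared boot image.        */
struct bgl_personality {
    uint32_t torus_x, torus_y, torus_z;   /* coordinates in the 3D torus */
    uint32_t tree_address;                /* address on the tree network */
    uint8_t  mac_address[6];              /* I/O node Ethernet MAC       */
    uint32_t ip_address;                  /* I/O node IPv4 address       */
    uint32_t is_io_node;                  /* nonzero for I/O nodes       */
};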
System monitoring in Blue Gene/L is accomplished through a combination of I/O node and service node functionality. Each I/O node is a full Linux machine and uses Linux services to generate system logs. A complementary monitoring service for Blue Gene/L is implemented by the service node through the control network. Device information, such as fan speeds and power supply voltages, can be obtained directly by the service node through the control network. The compute and I/O nodes use a communication protocol to report events that can be logged or acted upon by the service node. This approach establishes a completely separate monitoring service that is independent of any other infrastructure in the system; therefore, it can be used to retrieve important information even in the case of many system-wide failures.

Job execution is also accomplished through a combination of I/O node and service node functionality. When submitting a job for execution in Blue Gene/L, the user specifies the desired shape and size of the partition to execute that job. The scheduler selects a set of compute nodes to form the partition. The compute (and corresponding I/O) nodes selected by the scheduler are configured into a partition by the service node using the control network. We have developed techniques for efficient allocation of nodes in a toroidal machine that are applicable to Blue Gene/L. Once a partition is created, a job can be launched through the I/O nodes in that partition using CIOD, as explained before.

3.4 Compiler and Run-time Support


Blue Gene/L presents a familiar programming model and a standard set of
tools. We have ported the GNU toolchain (binutils, gcc, glibc and gdb) to Blue
Gene/L and set it up as a cross-compilation environment. There are two cross-
targets: Linux for I/O nodes and BLRTS for compute nodes. IBM’s XL
compiler suite is also being ported to provide advanced optimization support
for languages like Fortran90 and C++.

3.5 Communication Infrastructure


The Blue Gene/L communication software architecture is organized into three layers: the packet layer is a thin software library that allows access to the network hardware; the message layer provides a low-latency, high-bandwidth point-to-point message delivery system; and MPI is the user-level communication library.

The packet layer simplifies access to the Blue Gene/L network hardware. It abstracts the FIFOs and device control registers into torus and tree devices and presents an API consisting of essentially three functions: initialization, packet send, and packet receive. The packet layer provides a mechanism to use the network hardware but does not impose any policies on its use. Hardware restrictions, such as the 256-byte limit on packet size and the 16-byte alignment requirement on packet buffers, are not abstracted by the packet layer and thus are reflected in the API. All packet layer send and receive operations are non-blocking, leaving it up to the higher layers to implement synchronous, blocking, and/or interrupt-driven communication models. In its current implementation, the packet layer is stateless.

The message layer is a simple active message system built on top of the torus packet layer, which allows the transmission of arbitrary messages among torus nodes. It is designed to enable the implementation of MPI point-to-point send/receive operations. It has the following characteristics.

No packet retransmission protocol. The Blue Gene/L network hardware is completely reliable, and thus a packet retransmission system (such as a sliding window protocol) is unnecessary. This allows for stateless virtual connections between pairs of nodes, greatly enhancing scalability.
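The three-function shape of the packet layer might look roughly like the C sketch below; the function names, the torus_packet_t type, and the return conventions are assumptions for illustration only, not the real BG/L API.

#include <stdint.h>
#include <string.h>

/* Hypothetical packet layer interface: three essential operations over
 * the memory-mapped torus FIFOs. Packets are at most 256 bytes and the
 * payload buffer must be 16-byte aligned, mirroring the hardware rules. */
typedef struct {
    uint8_t dest_x, dest_y, dest_z;                  /* torus coordinates */
    uint8_t payload[256] __attribute__((aligned(16)));
    uint32_t len;
} torus_packet_t;

extern void torus_init(void);
extern int  torus_send(const torus_packet_t *pkt);   /* 0 = queued, -1 = FIFO full */
extern int  torus_recv(torus_packet_t *pkt);         /* 0 = got packet, -1 = empty */

/* Example: push a buffer out in 256-byte chunks, spinning when the
 * injection FIFO is full (a policy chosen by this caller, not the layer). */
void send_buffer(uint8_t x, uint8_t y, uint8_t z,
                 const void *buf, uint32_t nbytes)
{
    torus_packet_t pkt = { .dest_x = x, .dest_y = y, .dest_z = z };
    const uint8_t *src = buf;

    while (nbytes > 0) {
        uint32_t chunk = nbytes < 256 ? nbytes : 256;
        memcpy(pkt.payload, src, chunk);
        pkt.len = chunk;
        while (torus_send(&pkt) != 0)
            ;                    /* non-blocking send: retry until queued */
        src += chunk;
        nbytes -= chunk;
    }
}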

3.5.1 Packetizing and alignment.


The packet layer requires data to be sent in 256 byte chunks aligned at 16 byte
boundaries. Thus the message layer deals with the packetizing and re-
alignment of message data. Re-alignment of packet data typically entails
memory-to-memory copies.

3.5.2 Packet ordering.


Packets on the torus network can arrive out of order, which makes message re-assembly at the receiver non-trivial. For packets belonging to the same message, the message layer is able to handle arrival in any order. To restore order between packets belonging to different messages, the sender assigns ascending sequence numbers to the individual messages sent to the same peer.

Cache coherence and processor use policy. The expected performance of the message layer is influenced by the way in which the two processors are used. Co-processor mode is the only one that effectively overlaps computation and communication. This mode is expected to yield better bandwidth, but slightly higher latency, than the others.

MPI: Blue Gene/L is designed primarily to run [3] workloads. We are in the process of porting [3], currently under development at Argonne National Laboratory, to the Blue Gene/L hardware. MPICH2 has a modular architecture. The Blue Gene/L port leaves the code structure of MPICH2 intact, but adds a number of plug-in modules:
Point-to-point messages. The most important addition of the Blue Gene/L port is an implementation of ADI3, the MPICH2 Abstract Device Interface. A thin layer of code transforms, for example, MPI_Request objects and MPI_Send function calls into sequences of message layer function calls and callbacks.

Process management. The MPICH2 process management primitives are documented in. Process management is split into two parts: a process management interface (PMI), called from within the MPI library, and a set of process managers (PM), which are responsible for starting up and shutting down MPI jobs and implementing the PMI functions. MPICH2 includes a number of process managers suited for clusters of general-purpose workstations. The Blue Gene/L process manager makes full use of the hierarchical system management software, including the CIOD processes running on the I/O nodes, to start up and shut down MPI jobs. The Blue Gene/L system management software is explicitly designed to deal with the scalability problem inherent in starting, synchronizing, and killing 65,536 MPI processes.

Optimized collectives. The default implementation of MPI collective operations in MPICH2 generates sequences of point-to-point messages. This implementation is oblivious of the underlying physical topology of the torus and tree networks. In Blue Gene/L, optimized collective operations can be implemented for communicators whose physical layouts conform to certain properties. The torus hardware can be used to efficiently implement broadcasts on contiguous 1-, 2- and 3-dimensional meshes, using a feature of the torus that allows depositing a packet on every node it traverses. The collectives best suited for this involve broadcast in some form.
The tree hardware can be used for almost every collective that is executed on the MPI_COMM_WORLD communicator, including reduction operations. Integer operand reductions are directly supported by hardware. The tree can also implement IEEE-compliant floating-point reductions by using separate reduction phases for the mantissa and the exponent. Non-MPI_COMM_WORLD collectives can also be implemented using the tree, but care must be taken to ensure deadlock-free operation. The tree is a locally class-routed network, with packets belonging to one of a small number of classes and tree nodes making local decisions about routing. The tree network guarantees deadlock-free simultaneous delivery of no more than two class routes. One of these routes is used for control and file I/O purposes; the other is available for use by collectives.
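For reference, the kind of collective that maps directly onto the tree hardware is an ordinary MPI reduction over MPI_COMM_WORLD; the following standard MPI program (values chosen only for illustration) shows the user-level view.

#include <mpi.h>
#include <stdio.h>

/* A global integer sum over MPI_COMM_WORLD: on BG/L this is the kind of
 * collective that can be carried out by the tree's combining ALU rather
 * than by sequences of point-to-point messages.                        */
int main(int argc, char **argv)
{
    int rank, local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank;                           /* each node's contribution */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", global);

    MPI_Finalize();
    return 0;
}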

Chapter 4: Blue Gene/L Simulation Environment

The first hardware prototypes of the Blue Gene/L ASIC are targeted to become operational in mid-2003. To support the development of system software before hardware is available, we have implemented an architecturally accurate, full-system simulator for the Blue Gene/L machine. The node simulator, called bglsim, is built using techniques described in. Each component of the BG/L ASIC is modeled separately with the desired degree of accuracy, trading accuracy for performance. In our simulation environment, we model the functionality of processor instructions: each instruction correctly changes the visible architectural state and takes one cycle to execute. We also model memory system behavior (cache, scratchpad, and main memory) and all the Blue Gene/L-specific devices: tree, torus, JTAG, device control registers, etc. A bglsim process boots the Linux kernel for the I/O nodes and BLRTS for the compute nodes. Applications run on top of these kernels, under user control.

When running on a 1.2 GHz Pentium III machine, bglsim simulates an entire BG/L chip at approximately 2 million simulated instructions per second, a slow-down of about 1000 compared to the real hardware. By comparison, a VHDL simulator with hardware acceleration has a far larger slow-down. As an example, booting Linux takes 30 seconds on bglsim, 7 hours on the hardware-accelerated VHDL simulator, and more than 20 days on the software VHDL simulator. A large Blue Gene/L system is simulated using one bglsim process per node, as shown in Figure 3. The bglsim processes run on different workstations and communicate through a custom message-passing library (CommFabric), which simulates the connectivity within the system and to the outside. Additionally, different components of the system are simulated by separate processes that also link in CommFabric. Examples are the IDo chip simulator, a functional simulator of an IDo chip that translates packets between the virtual JTAG network and Ethernet, and the Tapdaemon and EthernetGateway processes, which provide the Linux kernels in the simulated I/O nodes with connectivity to the outside network, allowing users to mount external file systems and connect using telnet, ftp, etc. We use this environment to develop our communication and control infrastructures, and we have successfully executed the MPI NAS Parallel Benchmarks.

Chapter 5: Trends in Supercomputer Performance

The scientific computing field has seen rapid changes in vendors, architectures, technologies, and system usage over the last 50 years. Despite all these changes, however, the long-term evolution of performance seems to remain steady and continuous. Moore's Law is often cited in this context. If we plot the peak performance of the leading supercomputers over the last five decades, we see that this law holds well for almost the complete lifespan of modern computing: on average, performance increases by two orders of magnitude every decade. To provide a better basis for statistics on high-performance computers, a group of researchers initiated the Top500 list, which reports the sites with the 500 most powerful computer systems installed.

The best Linpack benchmark performance achieved is used as the performance measure for ranking the computers. The Top500 list has been updated twice a year since June 1993. Although many aspects of the HPC market change over time, the evolution of performance seems to follow empirical laws, such as Moore's Law. The Top500 data provides an ideal basis for verifying an observation like this. Looking at the computing power of the individual machines in the Top500, and at the evolution of the total installed performance, we can plot the performance of the systems at positions 1, 10, 100, and 500 in the list, as well as the total accumulated performance. In Figure 2, the curve for position 500 shows an average increase by a factor of two per year.

Chapter 6: Programming methodologies for petascale computing

Many of the lessons learned in the HTMT project are likely to be incorporated in first-generation petaflop computers; a multithreaded execution model and sophisticated context management are approaches that might be necessary to provide hardware and system-level software support. These approaches could help control the impact of latency across multiple levels of hierarchical memory. Thus, although a petaflop computer might have a global name space, the application programmer will still be exposed to the memory hierarchy. The main programming methodologies likely to be used on petascale systems will be:
• A data-parallel programming language, such as HPF, for very regular computations.
• Explicit message passing (using, for example, MPI) between sequential processes, primarily to ease porting of legacy parallel applications to the petascale system. In this case, an MPI process would be entirely virtual and would correspond to a bundle of threads.
• Explicit multithreading to get acceptable performance on less regular computations.
Given the requirements of high concurrency and latency tolerance, highly tuned software libraries and advanced problem-solving environments are important in achieving high performance and scalability on petascale computers. Clearly, the observation that applications with a high degree of parallelism and data locality perform best on high-performance computers will be even truer for a petascale system characterized by a deep memory hierarchy. It is also often the case that to achieve good performance on an HPC system, a programming style must be adopted that closely matches the system's execution model. Compilers should be able to extract concurrent threads for regular computations, but a class of less regular computations must be programmed with explicit threads to get good performance. It might also be necessary for application programmers to bundle groups of threads that access the same memory resources into strands, in a hierarchical way that reflects the hierarchical memory's physical structure. Optimized library routines will probably need to be programmed in this way. IBM's proposed Blue Gene uses massive parallelism to achieve petaop/s performance. It can therefore be based on a less radical design than the HTMT machine, although it still uses on-chip memory and multithreading to support a high-bandwidth, low-latency communication path from processor to memory. Blue Gene is intended for use in computational biology applications, such as protein folding, and will contain on the order of one million processor–memory pairs. Its basic building block will be a chip that contains an array of processor–memory pairs and interchip communication logic. These chips will be connected into a 3D mesh.

6.1 New algorithms for petascale computing

Computations that possess insufficient inherent parallelism are not well suited to petascale systems. Therefore, a premium is placed on algorithms that are regular in structure, even if they have a higher operation count than more irregular algorithms with a lower operation count. Dense matrix computations, such as those provided by the Lapack software library, optimized to exploit data locality, are an example of regular computations that should perform well on a petascale system. The accuracy and stability of numerical methods for petascale computations are also important issues, because some algorithms can suffer significant losses of numerical precision. Thus, slower but more accurate and stable algorithms are favored over faster but less accurate and stable ones. To this end, hardware support for 128-bit arithmetic is desirable, and interval-based algorithms may play a more important role on future petascale systems than they do on current high-performance machines. Interval arithmetic provides a means of tracking errors in a computation, such as initial errors and uncertainties in the input data, errors in analytic approximations, and rounding error. Instead of being represented by a single floating-point number, each quantity is represented by a lower and an upper bound within which it is guaranteed to lie. The interval representation therefore provides rigorous accuracy information about a solution that is absent in the point representation.
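As a small illustration of the idea, interval addition can be implemented in C with directed rounding so that the result interval is guaranteed to enclose the exact sum; this is a minimal sketch, not a production interval library.

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

typedef struct {
    double lo;   /* lower bound */
    double hi;   /* upper bound */
} interval;

/* [a.lo, a.hi] + [b.lo, b.hi]: round the lower bound down and the upper
 * bound up, so the true sum is always contained in the result.         */
static interval interval_add(interval a, interval b)
{
    interval r;
    fesetround(FE_DOWNWARD);
    r.lo = a.lo + b.lo;
    fesetround(FE_UPWARD);
    r.hi = a.hi + b.hi;
    fesetround(FE_TONEAREST);      /* restore the default rounding mode */
    return r;
}

int main(void)
{
    interval x = { 0.1, 0.1 };     /* 0.1 itself is not exactly representable */
    interval y = { 0.2, 0.2 };
    interval z = interval_add(x, y);
    printf("the sum lies in [%.17g, %.17g]\n", z.lo, z.hi);
    return 0;
}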
The advent of petascale computing systems might promote the adoption of completely new methods to solve certain problems. For example, cellular automata (CA) are highly parallel, very regular in structure, able to handle complex geometries, and numerically stable, and so are well suited to petascale computing. CA provide an interesting and powerful alternative to classical techniques for solving partial differential equations. Their power derives from the fact that their algorithmic dynamics mimic (at an abstract level) the fine-grain dynamics of the actual physical system, permitting the emergence of macroscopic phenomenology from microscopic mechanisms. Thus, complex collective global behavior can arise from simple components obeying simple local interaction rules.
CA algorithms are well suited to applications involving nonequilibrium and dynamical processes. The use of CA on future types of "programmable matter" computers is discussed elsewhere; such computers can deliver enormous computational power and are ideal platforms for CA-based algorithms. Thus, on the five-to-ten-year timescale, CA will play an increasing role in the simulation of physical (and social) phenomena.

Another aspect of software development for a petascale system is the ability to automatically tune a library of numerical routines for optimal performance on a particular architecture. The Automatically Tuned Linear Algebra Software (Atlas) project exemplifies this concept. Atlas is an approach to the automatic generation and optimization of numerical software for computers with deep memory hierarchies and pipelined functional units. With Atlas, numerical routines are developed with a large design space spanned by many tunable parameters, such as blocking size, loop nesting permutations, loop unrolling depths, pipelining strategies, register allocations, and instruction schedules. When Atlas is first installed on a new platform, a set of runs automatically determines the optimal parameter values for that platform, which are then used in subsequent uses of the code. We could apply this idea to other application areas, in addition to numerical linear algebra, and extend it to develop numerical software that can dynamically explore its computational environment and intelligently adapt to it as resource availability changes.
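The flavor of this empirical tuning can be conveyed by a toy C example: time a blocked matrix multiply for several candidate block sizes at install time and keep the fastest, which is roughly what Atlas does over a much larger parameter space. The code below is an illustrative sketch, not Atlas itself.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

/* Blocked matrix multiply C += A * B with a given block (tile) size. */
static void matmul_blocked(const double *A, const double *B, double *C, int bs)
{
    for (int ii = 0; ii < N; ii += bs)
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs && i < N; i++)
                    for (int k = kk; k < kk + bs && k < N; k++)
                        for (int j = jj; j < jj + bs && j < N; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    int candidates[] = { 16, 32, 64, 128 };
    int best_bs = 0;
    double best_time = 1e30;

    /* Empirical search: run each candidate block size once and keep the
     * fastest. A real auto-tuner explores many more parameters.         */
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        matmul_blocked(A, B, C, candidates[i]);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[i], elapsed);
        if (elapsed < best_time) {
            best_time = elapsed;
            best_bs = candidates[i];
        }
    }
    printf("selected block size: %d\n", best_bs);
    free(A); free(B); free(C);
    return 0;
}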
In general, a multithreaded execution model implies that the application developer does not have control over the detailed order of arithmetic operations (such control would incur a large performance cost). Thus numerical reproducibility, which is already difficult to achieve with current architectures, will be lost. A corollary is that for certain problems it will be impossible to predict a priori which solution method will perform best, or converge at all. An example is the iterative solution of linear systems for which the matrix is nonsymmetric, indefinite, or both. In such cases, an "algorithmic bombardment" approach can help: several methods for solving a problem are applied concurrently, in the hope that at least one will converge to the solution. A related approach is to apply a fast but unreliable method first and then check a posteriori whether any problem occurred in the solution. If so, a slower method is used to fix the problem.

These poly-algorithmic approaches can be made available to application developers either as black boxes or with varying degrees of control over the methods used.

Chapter 7: Validating the architecture with application programs

A wide variety of scientific applications, including many from DOE's NNSA laboratories, have been used to assist in the design of Blue Gene/L's hardware and software. Blue Gene/L's unique features are especially appealing for ASCI-scale scientific applications.

The global barrier and combining trees will vastly improve the scalability and performance of widely used collective operations, such as MPI_Barrier and MPI_Allreduce. Our analysis shows that a large majority of scientific applications, such as SPPM (simplified piecewise-parabolic method), Sweep3D (discrete-ordinates neutron transport using wavefronts), SMG2000 (semicoarsening multigrid solver), SPHOT (Monte Carlo photon transport), SAMRAI (Structured Adaptive Mesh Refinement Application Infrastructure), and UMT2K (3D deterministic multigroup photon transport code for unstructured meshes), use these collective operations to calculate the size of simulation timesteps and to validate physical conservation properties of the simulated system. Most applications use MPI's nonblocking point-to-point messaging operations to allow concurrency between computation and communication; BG/L's distinct communication and computation processors will allow the computation processor to transfer messaging overhead to the communication processor. In addition, we have identified several important applications whose high flops/loads ratio and alternating compute/communicate behavior will allow effective use of the second floating-point unit in each node. We are continuing to study application performance through tracing and simulation analysis, and will analyze the actual hardware as it becomes available. The results of analysis performed with collaborators at the San Diego Supercomputer Center and Caltech's Center for Advanced Computing Research will be reported elsewhere.
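For concreteness, the overlap pattern described above corresponds to the following standard MPI fragment: post nonblocking sends and receives, compute on data that does not depend on the incoming message, then wait. The buffer sizes and the update performed are illustrative only.

#include <mpi.h>

#define N 1024

/* Overlap communication with computation using nonblocking MPI calls:
 * the communication processor can progress the message while the
 * computation processor works on independent data.                    */
void exchange_and_compute(double *halo_out, double *halo_in,
                          double *interior, int peer)
{
    MPI_Request reqs[2];

    MPI_Irecv(halo_in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Work that does not depend on the incoming halo data. */
    for (int i = 0; i < N; i++)
        interior[i] = 0.5 * (interior[i] + halo_out[i]);

    /* Block only when the halo data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}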

Chapter 8: Blue Gene science applications development

To carry out the scientific research into the mechanisms behind protein folding announced in December 1999, development of a molecular simulation application kernel targeted at massively parallel architectures is under way. Additional information about the science application portion of the Blue Gene project is reported elsewhere. This application development effort serves multiple purposes: (1) it is the application platform for the Blue Gene science programs; (2) it serves as a prototyping platform for research into application frameworks suitable for cellular architectures; and (3) it provides an application perspective in close contact with the hardware and system software development teams.

One of the motivations for the use of massive computational power in the study of protein folding and dynamics is to obtain a microscopic view of the thermodynamics and kinetics of the folding process. Being able to simulate longer and longer time scales is the key challenge. Thus the focus for application scalability is on improving the speed of execution for a fixed-size system by utilizing additional CPUs. Efficient domain decomposition and utilization of the high-performance interconnect networks on BG/L (both torus and tree) are the keys to maximizing application scalability. To provide an environment that allows exploration of algorithmic alternatives, the applications group has focused on understanding the logical limits to concurrency within the application, structuring the application architecture to support the finest-grained concurrency possible, and logically separating parallel communications from straight-line serial computation. With this separation and the identification of key communication patterns used widely in molecular simulation, it is possible for domain experts in molecular simulation to modify the detailed behavior of the application without also having to deal with the complexity of the parallel communications environment.

Key computational kernels derived from the molecular simulation application have been used to characterize and drive improvements in the floating-point code generation of the compiler being developed for the BG/L platform. As additional tools and actual hardware become available, the effects of the cache hierarchy and communications architecture can be explored in detail for the application.

Chapter 9: Conclusion

Blue Gene/L is the first of a new series of high performance machines being
developed at IBM Research. The hardware plans for the machine are complete
and the first small prototypes will be available in late 2003. We have presented
a software system that can scale up to the demands of the Blue Gene/L
hardware. We have also described the simulation environment that we are
using to develop and validate this software system. Using the simulation
environment, we are able to demonstrate a complete and functional system
software environment before hardware becomes available. Nevertheless,
evaluating scalability and performance of the complete system still requires
hardware availability. Many of the implementation details will likely change
as we gain experience with the real hardware.

Bibliography

[1] "An Overview of the Blue Gene/L System Software Organization", George Almási, Ralph Bellofatto, José Brunheroto, Călin Caşcaval, José G. Castaños, Luis Ceze, Paul Crumley, C. Christopher Erway, Joseph Gagliano, Derek Lieber, Xavier Martorell, José E. Moreira, Alda Sanomiya, and Karin Strauss.
[2] "High Performance Computing - The Quest for Petascale Computing", Jack J. Dongarra, David W. Walker.
[3] "An Overview of the Blue Gene/L Supercomputer", W. Barrett, C. Engel, B. Drehmel, B. Hilgart, D. Hill, F. Kasemkhani, D. Krolak, C. T. Li, T. Liebsch, J. Marcella, A. Muff, A. Okomo, M. Rouse, A. Schram, M. Tubbs, G. Ulsh, C. Wait, J. Wittrup.
