Blue Gene Super Computer
INDEX
Abstract
1 Introduction
2 Overview of the Blue Gene/L supercomputer system
3 Blue Gene system software
4 Blue Gene/L simulation environment
5 Trends in performance of supercomputers
6 Programming methodologies for petascale computing
7 Validating the architecture with application programs
8 Blue Gene science applications development
9 Conclusion
Bibliography
Abstract
Chapter 1: INTRODUCTION
within the next 2-3 years, especially since IBM has announced the Blue Gene
project aimed at building such a machine. The system software for Blue
Gene/L is a combination of standard and custom solutions. The software
architecture for the machine is divided into three functional entities arranged
hierarchically: a computational core, a control infrastructure, and a service
infrastructure. The I/O nodes (part of the control infrastructure) run a
version of the Linux kernel and are the primary off-load engine for most
system services. No user code executes directly on the I/O nodes.
Chapter 2: Overview of the Blue Gene/L Supercomputer System
The low-power characteristics of Blue Gene/L permit very dense
packaging, as described in [1]. Two nodes share a compute card that also
contains SDRAM-DDR memory. Each node supports a maximum of 2 GB of
external memory, but in the current configuration each node directly addresses
256 MB at 5.5 GB/s bandwidth with a 75-cycle latency. Sixteen compute cards
can be plugged into a node board. A cabinet with two midplanes contains 32
node boards, for a total of 2048 CPUs and a peak performance of 2.9/5.7
TFlops. The complete system has 64 cabinets and 16 TB of memory.

In addition to the 64K compute nodes, BG/L contains a number of I/O
nodes (1024 in the current design). Compute nodes and I/O nodes are
physically identical, although I/O nodes are likely to contain more memory.
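
The per-cabinet and full-system totals quoted above follow directly from the packaging figures. The short calculation below is only an illustrative sketch (not taken from the report) that makes the arithmetic explicit:

    #include <stdio.h>

    int main(void) {
        /* Packaging figures quoted in the text above. */
        int nodes_per_compute_card       = 2;
        int compute_cards_per_node_board = 16;
        int node_boards_per_cabinet      = 32;   /* two midplanes */
        int cabinets                     = 64;
        int cpus_per_node                = 2;
        int mb_per_node                  = 256;  /* directly addressed memory */

        int nodes_per_cabinet = nodes_per_compute_card * compute_cards_per_node_board
                                * node_boards_per_cabinet;          /* 1024 nodes */
        long total_nodes = (long)nodes_per_cabinet * cabinets;      /* 65,536 = 64K */

        printf("CPUs per cabinet: %d\n", nodes_per_cabinet * cpus_per_node); /* 2048 */
        printf("Total nodes     : %ld\n", total_nodes);
        printf("Total memory    : %ld TB\n",
               total_nodes * (long)mb_per_node / (1024 * 1024));             /* 16 TB */
        return 0;
    }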
Each Blue Gene/L node is attached to several networks: the three-dimensional
torus, the tree, the Gbit/s Ethernet, the JTAG control network, and a network of
global interrupts. The following subsections describe the main ones.
2.2.1 TORUS:
The main communication network for point-to-point messages is a three-
dimensional torus. The topology is a three-dimensional torus constructed with
point-to-point, serial links between routers embedded within the Blue Gene/L
ASICs. Each ASIC therefore has six nearest-neighbor connections, some of
which may traverse relatively long cables. The target hardware bandwidth for
each torus link is 175 MB/s in each direction. The 64K nodes are organized
into a partitionable 64x32x32 three-dimensional torus. The torus and the tree
networks are memory-mapped devices. Processors send and receive packets by
explicitly writing to (and reading from) special memory addresses that act as
FIFOs. These reads and writes use the 128-bit SIMD registers. The I/O nodes
are not part of the torus network.
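
As a rough illustration of the memory-mapped FIFO model described above, the sketch below shows how a 256-byte packet might be pushed into a torus injection FIFO. The FIFO address, and the use of four 32-bit stores in place of one 128-bit quad-word store, are assumptions made for the example; the real Blue Gene/L addresses, packet headers, and SIMD store instructions are not specified in this report.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical memory-mapped window of one torus injection FIFO.  The real
     * Blue Gene/L addresses and packet headers are not given in this report;
     * the sketch only illustrates the "write packets to a special address"
     * model described above. */
    #define TORUS_INJECTION_FIFO ((volatile uint32_t *)0xB0000000u)

    #define TORUS_PACKET_BYTES 256   /* maximum torus packet size */
    #define QUADWORD_BYTES      16   /* the FIFOs are fed 128 bits at a time */

    static void torus_send_packet(const uint8_t packet[TORUS_PACKET_BYTES])
    {
        uint32_t quad[4];
        for (int i = 0; i < TORUS_PACKET_BYTES; i += QUADWORD_BYTES) {
            memcpy(quad, packet + i, QUADWORD_BYTES);
            /* Four 32-bit stores stand in for one 128-bit quad-word store
             * performed with the PPC440 double-FPU registers on the real chip. */
            TORUS_INJECTION_FIFO[0] = quad[0];
            TORUS_INJECTION_FIFO[1] = quad[1];
            TORUS_INJECTION_FIFO[2] = quad[2];
            TORUS_INJECTION_FIFO[3] = quad[3];
        }
    }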
Figure: Basic architecture of the torus router
2.2.2 TREES:
I/O and compute nodes share the tree network. Tree packets are the main
mechanism for communication between I/O and compute nodes. The tree
network is a token-based network with two virtual channels (VCs). Packets are
non-blocking across VCs. The operation of the tree is controlled flexibly by
setting programmable control registers. In its simplest form, packets going up towards the root of
the tree can be either point-to-point or combining. Point-to-point packets are
used, for example, when a compute node needs to communicate with its I/O
node. The tree module, shown in Figure 6, is equipped with an integer ALU
for combining incoming packets and forwarding the resulting packet. Packets
can be combined using bit-wise operations such as XOR or integer operations
such as ADD/MAX for a variety of data widths. A floating-point sum
reduction on the tree potentially requires two round trips. In the first trip,
each processor submits the exponents for a max reduction. In the second trip,
the mantissas are appropriately shifted to correspond to the common exponent
computed in the first trip and then fed into the tree for an integer sum
reduction. Alternatively, double-precision floating-point operations can be
performed with only a single pass through the tree network by converting the
floating-point numbers to their 2048-bit integer representations.
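
The two-trip floating-point sum reduction just described can be sketched as follows. The tree_allreduce_max_int and tree_allreduce_add_int64 helpers are hypothetical stand-ins for the hardware MAX and ADD combine operations; the actual tree packet format and the extra guard bits needed against overflow are omitted.

    #include <math.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the tree's integer MAX and ADD combine
     * operations; these are not real Blue Gene/L calls. */
    extern int     tree_allreduce_max_int(int value);
    extern int64_t tree_allreduce_add_int64(int64_t value);

    double tree_fp_sum(double local)
    {
        /* Trip 1: find the largest exponent among all contributions. */
        int exp;
        frexp(local, &exp);
        int max_exp = tree_allreduce_max_int(exp);

        /* Trip 2: shift every mantissa to the common exponent and add the
         * results as integers (overflow guard bits omitted for brevity). */
        int64_t scaled = (int64_t)ldexp(local, 52 - max_exp);
        int64_t sum    = tree_allreduce_add_int64(scaled);

        return ldexp((double)sum, max_exp - 52);
    }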
2.2.3 ETHERNET:
Only I/O nodes are attached to the Gbit/s Ethernet network, giving
1024 x 1 Gbit/s links to external file servers. The service nodes are hosted on a
Linux cluster or SP system that resides outside the Blue Gene/L core; they
communicate with the computational core through the IDo chips. The current
packaging contains one IDo chip per node board and another one per midplane.
The IDo chips are 25 MHz FPGAs that receive commands from the service
nodes using raw UDP packets over a trusted private 100 Mbit/s Ethernet
control network.
2.2.4 JTAG:
The JTAG protocol is used for reading and writing to any address of the 16
KB SRAMs in the BG/L chips.
2.4 Special Purpose hardware
Researchers on the Grape project at the University of Tokyo have designed
and built a family of special-purpose attached processors for performing the
gravitational force computations that form the inner loop of N-body simulation
problems. The computational astrophysics community has extensively used
Grape processors for N-body gravitational simulations. A Grape-4 system
consisting of 1,692 processors, each with a peak speed of 640 Mflop/s, was
completed in June 1995 and has a peak speed of 1.08 Tflop/s. In 1999, a
Grape-5 system won a Gordon Bell Award in the performance-per-dollar
category, achieving $7.3 per Mflop/s on a tree-code astrophysical simulation. A
Grape-6 system is planned for completion in 2001 and is expected to have a
performance of about 200 Tflop/s. A 1-Pflop/s Grape system is planned for
completion by 2003.
Chapter 3: Blue Gene System Software
after implementing all the required interprocessor communications
mechanisms, because the BG/L’s BIC is not [2] compliant. In this mode, the
TLB entries for the L1 cache are disabled in kernel mode and processes have
affinity to one CPU.
Forking a process in a different CPU requires additional parameters to
the system call. The performance and effectiveness of this solution is still an
open issue. A second, more promising mode of operation runs Linux in one of
the CPUs, while the second CPU is the core of a virtual network card. In this
scenario, the tree and torus FIFOs are not visible to the Linux kernel. Transfers
between the two CPUs appear as virtual DMA transfers. We are also
investigating support for large pages. The standard PPC 440 embedded
processors handle all TLB misses in software. Although the average number of
instructions required to handle these misses has significantly decreased, it has
been shown that larger pages improve performance.
3.2 System Software for the Compute Nodes
The "Blue Gene/L Run Time Supervisor" (BLRTS) is a custom kernel that
runs on the compute nodes of a Blue Gene/L machine. BLRTS provides a
simple, flat, fixed-size 256 MB address space, with no paging, accomplishing a
role similar to [2]. The kernel and the application program share the same
address space, with the kernel residing in protected memory at address 0 and
the application program image loaded above it, followed by its heap and stack. The
kernel protects itself by appropriately programming the PowerPC MMU.
Physical resources (torus, tree, mutexes, barriers, scratchpad) are partitioned
between application and kernel. In the current implementation, the entire torus
network is mapped into user space to obtain better communication efficiency,
while one of the two tree channels is made available to the kernel and user
applications.
BLRTS presents a familiar POSIX interface: we have ported the GNU
Glibc runtime library and provided support for basic file I/O operations
through system calls. Multi-processing services (such as fork and exec) are
meaningless in a single-process kernel and have not been implemented. Program
launch, termination, and file I/O are accomplished via messages passed between
the compute node and its I/O node over the tree network, using a point-to-point
packet addressing mode.
This functionality is provided by a daemon called CIOD (Console I/O
Daemon) running in the I/O nodes. CIOD provides job control and I/O
management on behalf of all the compute nodes in the processing set. Under
normal operation, all messaging between CIOD and BLRTS is synchronous:
all file I/O operations are blocking on the application side (a sketch of this
request/reply pattern is given after the two scenarios below). We use the CIOD
in two scenarios:
1. Driven by a console shell (called CIOMAN), used mostly for simulation
and testing purposes. The user is provided with a restricted set of commands:
run, kill, ps, and set/unset environment variables. The shell distributes the
commands to all the CIODs running in the simulation, which in turn take the
appropriate actions for their compute nodes.
2. Driven by a job scheduler (such as LoadLeveler) through a special
interface that implements the same protocol as the one defined for CIOMAN
and CIOD.
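
As a hedged sketch of the synchronous BLRTS-to-CIOD interaction described above, the fragment below shows how a blocking write() might be packaged as a request, sent point-to-point over the tree, and answered by a reply from CIOD. The message layout and the tree_send/tree_recv helpers are assumptions; the real protocol is not given in this report.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    enum { CIO_WRITE = 1 /* , CIO_OPEN, CIO_READ, ... */ };

    struct cio_request {
        uint32_t opcode;        /* which system call is being forwarded        */
        uint32_t fd;            /* file descriptor on the I/O node             */
        uint32_t length;        /* number of payload bytes                     */
        uint8_t  payload[224];  /* data, bounded by the 256-byte packet size   */
    };

    struct cio_reply {
        int32_t result;         /* return value of the call on the I/O node    */
        int32_t err;            /* errno, if any                               */
    };

    /* Hypothetical point-to-point tree primitives. */
    extern void tree_send(const void *msg, size_t len);
    extern void tree_recv(void *msg, size_t len);

    /* Blocking write, as seen by an application running under BLRTS. */
    long blrts_sys_write(int fd, const void *buf, size_t count)
    {
        struct cio_request req = { .opcode = CIO_WRITE, .fd = (uint32_t)fd };
        struct cio_reply   rep;

        if (count > sizeof req.payload)
            count = sizeof req.payload;   /* a larger write would be split */
        req.length = (uint32_t)count;
        memcpy(req.payload, buf, count);

        tree_send(&req, sizeof req);      /* point-to-point packet to the I/O node   */
        tree_recv(&rep, sizeof rep);      /* block until CIOD has performed the call */
        return rep.result;
    }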
We are investigating a range of compute modes for our custom kernel.
In heater mode, one CPU executes both user and network code, while the other
CPU remains idle. This mode will be the mode of operation of the initial
prototypes, but it is unlikely to be used afterwards.
In co-processor mode, the application runs in a single, non-preemptable
thread of execution on the main processor (CPU 0). The coprocessor (CPU 1)
is used as a torus device off-load engine that runs as part of a user-level
application library, communicating with the main processor through a non-
cached region of shared memory. In symmetric mode, both CPUs run
applications and users are responsible for explicitly handling cache coherence.
In virtual node mode we provide support for two independent processes in a
node. The system then looks like a machine with 128K nodes.
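
The co-processor mode described above can be pictured with the following sketch, in which CPU 0 posts a send descriptor in a non-cached shared-memory slot and CPU 1 polls it as an off-load engine. The descriptor layout and flag protocol are purely illustrative assumptions; a real implementation would also need memory barriers and the actual FIFO injection code.

    #include <stdint.h>

    struct send_descriptor {
        volatile uint32_t ready;           /* 0 = empty, 1 = posted, 2 = done     */
        uint32_t dest_x, dest_y, dest_z;   /* torus coordinates of the target     */
        const void *buffer;                /* 16-byte-aligned payload             */
        uint32_t length;
    };

    /* One slot assumed to live in the non-cached shared region. */
    extern struct send_descriptor shared_slot;

    /* CPU 0: post a message and keep computing. */
    void post_send(const void *buf, uint32_t len,
                   uint32_t x, uint32_t y, uint32_t z)
    {
        shared_slot.dest_x = x; shared_slot.dest_y = y; shared_slot.dest_z = z;
        shared_slot.buffer = buf;
        shared_slot.length = len;
        shared_slot.ready  = 1;            /* hand the descriptor to CPU 1 */
    }

    /* CPU 1: off-load loop, run as part of the user-level library. */
    void coprocessor_loop(void)
    {
        for (;;) {
            if (shared_slot.ready == 1) {
                /* ... inject the packets for this message into the torus FIFOs ... */
                shared_slot.ready = 2;
            }
        }
    }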
The global operating system interfaces with external policy modules (e.g.,
LoadLeveler) and performs a
variety of system management services, including: (i) machine booting, (ii)
system monitoring, and (iii) job launching. In our implementation, the global
OS runs on external service nodes. Each BG/L midplane is controlled by one
Midplane Management and Control System (MMCS) process which provides
two paths into the Blue Gene/L complex: a custom control library to access the
restricted JTAG network and directly manipulate Blue Gene/L nodes; and
sockets over the Gbit/s Ethernet network to manage the nodes on a booted
partition. The custom control library can perform low-level hardware
operations such as: turning on power supplies; monitoring temperature sensors
and fans and reacting accordingly (e.g., shutting down a machine if the
temperature exceeds some threshold); configuring and initializing the IDo,
Link, and BG/L chips; and reading and writing configuration registers and
SRAM and resetting the cores of a BG/L chip. As
mentioned in [2], these operations can be performed with no code executing in
the nodes, which permits machine initialization and boot, nonintrusive access
to performance counters and post-mortem debugging.
This path into the core is used for control only; for security and
reliability reasons, it is not made visible to applications running in the BG/L
nodes. On the other hand, the architected path through the functional Gbit/s
Ethernet is used for application I/O, checkpoints, and job launch. We chose to
maintain the entire state of the global operating system using standard database
technology. Databases naturally provide scalability, reliability, security,
portability, logging, and robustness. The database contains both static and
dynamic state. Therefore, the database is not just a repository for read-only
configuration information, but also an interface for all the visible state of a
machine. External entities can manipulate this state by invoking stored
procedures and database triggers, which in turn invoke functions in the MMCS
processes. Machine Initialization and Booting. The boot process for a node
consists of the following steps: first, a small boot loader is directly written into
the (compute or I/O) node memory by the service nodes using the JTAG
control network. This boot loader loads a much larger boot image into the
memory of the node through a custom JTAG mailbox protocol. We use one
boot image for all the compute nodes and another boot image for all the I/O
nodes. The boot image for the compute nodes contains the code for the
compute node kernel, and is approximately 64 kB in size. The boot image for
the I/O nodes contains the code for the Linux operating system (approximately
2 MB in size) and the image of a ramdisk that contains the root file system for
the I/O node. After an I/O node boots, it can mount additional file systems
from external file servers. Since the same boot image is used for each node,
additional node specific configuration information (such as torus coordinates,
tree addresses, MAC or IP addresses) must be loaded. We call this information
the personality of a node. In the I/O nodes, the personality is exposed to user
processes through an entry in the proc file system. BLRTS implements a
system call to request the node’s personality. System Monitoring in Blue
Gene/L is accomplished through a combination of I/O node and service node
functionality. Each I/O node is a full Linux machine and uses Linux services to
generate system logs.
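
Returning to the node personality introduced above, the following structure is a hypothetical sketch of what such a record might contain; the field names and sizes are assumptions, since the actual layout is not given in this report.

    #include <stdint.h>

    struct bgl_personality {
        uint8_t  torus_x, torus_y, torus_z;  /* coordinates in the 3D torus        */
        uint32_t tree_address;               /* address/route on the tree network  */
        uint8_t  is_io_node;                 /* compute node or I/O node           */
        uint8_t  mac_address[6];             /* I/O nodes only: Gbit/s Ethernet    */
        uint32_t ip_address;                 /* I/O nodes only                     */
    };

On an I/O node this information would be read from the proc file system entry mentioned above; under BLRTS it would be returned by the dedicated system call.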
A complementary monitoring service for Blue Gene/L is implemented
by the service node through the control network. Device information, such as
fan speeds and power supply voltages, can be obtained directly by the service
node through the control network. The compute and I/O nodes use a
communication protocol to report events that can be logged or acted upon by
the service node. This approach establishes a completely separate monitoring
service that is independent of any other infrastructure in the system. Therefore,
it can be used even in the case of many system-wide failures to retrieve
important information. Job Execution is also accomplished through a
combination of I/O nodes and service node functionality. When submitting a
job for execution in Blue Gene/L, the user specifies the desired shape and size
of the partition to execute that job. The scheduler selects a set of compute
nodes to form the partition. The compute (and corresponding I/O) nodes
selected by the scheduler are configured into a partition by the service node
using the control network. We have developed techniques for efficient
allocation of nodes in a toroidal machine that are applicable to Blue Gene/L.
Once a partition is created, a job can be launched through the I/O nodes in that
partition using CIOD as explained before.
The packet layer provides a mechanism to use the network hardware but does not impose any
policies on its use. Hardware restrictions, such as the 256 byte limit on packet
size and the 16 byte alignment requirements on packet buffers, are not
abstracted by the packet layer and thus are reflected by the API. All packet
layer send and receive operations are non-blocking, leaving it up to the higher
layers to implement synchronous, blocking and/or interrupt driven
communication models. In its current implementation the packet layer is
stateless. The message layer is a simple active message system built on top of
the torus packet layer, which allows the transmission of arbitrary messages
among torus nodes. It is designed to enable the implementation of MPI point-
to-point send/receive operations. It has the following characteristics:
No packet retransmission protocol. The Blue Gene/L network hardware is
completely reliable, and thus a packet retransmission system (such as a sliding
window protocol) is unnecessary. This allows for stateless virtual connections
between pairs of nodes, greatly enhancing scalability.
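
The constraints listed above (256-byte packets, 16-byte-aligned buffers, non-blocking semantics) suggest an interface along the following lines. The function names are hypothetical; only the constraints themselves come from the text.

    #include <stdalign.h>
    #include <stdint.h>

    #define PACKET_MAX_BYTES 256
    #define PACKET_ALIGNMENT  16

    typedef struct {
        alignas(PACKET_ALIGNMENT) uint8_t data[PACKET_MAX_BYTES];
    } torus_packet_t;

    /* Both calls return immediately: 0 on success, -1 if the injection FIFO is
     * full (send) or no packet has arrived (receive).  Blocking or
     * interrupt-driven behaviour is left to the layers above. */
    int torus_packet_send(int dest_x, int dest_y, int dest_z,
                          const torus_packet_t *pkt, unsigned length);
    int torus_packet_recv(torus_packet_t *pkt, unsigned *length);

    /* Example of a higher layer turning the non-blocking send into a blocking one. */
    static void send_blocking(int x, int y, int z,
                              const torus_packet_t *pkt, unsigned length)
    {
        while (torus_packet_send(x, y, z, pkt, length) != 0)
            ;   /* spin until the injection FIFO has room */
    }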
The expected performance of the message layer is influenced by the
cache-coherence and processor-use policy, that is, by the way in which the two
processors are used.
Co-processor mode is the only one that effectively overlaps computation
and communication. This mode is expected to yield better bandwidth, but
slightly higher latency, than the others. MPI: Blue Gene/L is designed
primarily to run [3] workloads. We are in the process of porting [3], currently
under development at Argonne National Laboratory, to the Blue Gene/L
hardware. MPICH2 has a modular architecture. The Blue Gene/L port leaves
the code structure of MPICH2 intact, but adds a number of plug-in modules:
Point-to-point messages. The most important addition of the Blue Gene/L port
is an implementation of ADI3, the MPICH2 Abstract Device Interface. A thin
layer of code transforms, e.g., MPI Request objects and MPI Send function
calls into sequences of message layer function calls and callbacks.
Process management. The MPICH2 process management primitives are
documented elsewhere. Process management is split into two parts: a process
management interface (PMI), called from within the MPI library, and a set of
process managers (PM) which are responsible for starting up/shutting down
MPI jobs and implementing the PMI functions. MPICH2 includes a number of
process managers (PM) suited for clusters of general purpose workstations.
The Blue Gene/L process manager makes full use of its hierarchical system
management software, including the CIOD processes running on the I/O
nodes, to start up and shut down MPI jobs.
The Blue Gene/L system management software is explicitly designed to
deal with the scalability problem inherent in starting, synchronizing and killing
65,536 MPI processes. Optimized collectives. The default implementation of
MPI collective operations in MPICH2 generates sequences of point-to-point
messages. This implementation is oblivious of the underlying physical
topology of the torus and tree networks. In Blue Gene/L optimized collective
operations can be implemented for communicators whose physical layouts
conform to certain properties. The torus hardware can be used to efficiently
implement broadcasts on contiguous 1, 2 and 3 dimensional meshes, using a
feature of the torus that allows depositing a packet on every node it traverses.
The collectives best suited for this involve broadcast in some form.
The tree hardware can be used for almost every collective that is
executed on the MPI COMM WORLD communicator, including reduction
operations. Integer operand reductions are directly supported by the hardware.
IEEE-compliant floating-point reductions can also be implemented on the tree
by using separate reduction phases for the exponent and the mantissa.
Non MPI COMM WORLD collectives can also be implemented using
the tree, but care must be taken to ensure deadlock free operation. The tree is a
locally class routed network, with packets belonging to one of a small number
of classes and tree nodes making local decisions about routing. The tree
network guarantees deadlock-free simultaneous delivery of no more than two
class routes. One of these routes is used for control and file I/O purposes; the
other is available for use by collectives.
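
The code below is ordinary MPI with no Blue Gene/L-specific calls; it simply shows the kind of MPI_COMM_WORLD integer reduction that, according to the text above, the Blue Gene/L port can map directly onto the tree hardware.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long local = rank;   /* integer reductions are supported directly by the tree */
        long total = 0;
        MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %ld\n", nprocs, total);

        MPI_Finalize();
        return 0;
    }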
Chapter 4: Blue Gene/L Simulation Environment
The first hardware prototypes of the Blue Gene/L ASIC are targeted to become
operational in mid-2003. To support the development of system software
before hardware is available, we have implemented an architecturally accurate,
full system simulator for the Blue Gene/L machine. The node simulator, called
bglsim, is built using techniques described elsewhere. Each component of the BG/L
ASIC is modeled separately with the desired degree of accuracy, trading
accuracy for performance. In our simulation environment, we model the
functionality of processor instructions. That is, each instruction correctly
changes the visible architectural state, while it takes one cycle to execute. We
also model memory system behavior (cache, scratch-pad, and main memory)
and all the Blue Gene/L specific devices: tree, torus, JTAG, device control
registers, etc. A bglsim process boots the Linux kernel for the I/O nodes and
BLRTS for the compute nodes. Applications run on top of these kernels, under
user control.
When running on a 1.2 GHz Pentium III machine, bglsim simulates an
entire BG/L chip at approximately 2 million simulated instructions per second,
a slow-down of about 1000 compared to the real hardware. By comparison, a
VHDL simulator with hardware acceleration has a considerably larger slow-down.
As an example, booting Linux takes 30 seconds on bglsim, 7 hours on the
hardware-accelerated VHDL simulator, and more than 20 days on the software
VHDL simulator. A large Blue Gene/L system is simulated using one bglsim process
per node, as shown in Figure 3. The bglsim processes run on different
workstations and communicate through a custom message passing library
(CommFabric), which simulates the connectivity within the system and
outside. Additionally, different components of the system are simulated by
separate processes that also link in CommFabric. Examples are: the IDo chip
simulator, a functional simulator of an IDo chip that translates packets
between the virtual JTAG network and Ethernet; the Tapdaemon and
EthernetGateway processes to provide the Linux kernels in the simulated I/O
nodes with connectivity to the outside network, allowing users to mount
external file systems and connect using telnet, ftp, etc. We use this
environment to develop our communication and control infrastructure, and we
have successfully executed the MPI NAS Parallel Benchmarks.
Chapter 5: Trends in Performance of Supercomputers
Chapter 6: Programming methodologies for petascale computing
Many of the lessons learned in the HTMT project are likely to be incorporated
in first-generation petaflop computers: a multithreaded execution model and
sophisticated context management are approaches that might be necessary to
provide hardware and system-level software support. These approaches could
help control the impact of latency across multiple levels of hierarchical
memory. Thus, although a petaflop computer might have a global name space,
data locality will still matter to the application programmer. The main
programming methodologies likely to be used on petascale systems will be:
• A data parallel programming language, such as HPF, for very regular
computations.
• Explicit message passing (using, for example, MPI) between sequential
processes, primarily to ease porting of legacy parallel applications to the
petascale system. In this case, an MPI process would be entirely virtual and
correspond to a bundle of threads.
• Explicit multithreading to get acceptable performance on less regular
computations.
Given the requirements of high concurrency and latency tolerance,
highly tuned software libraries and advanced problem-solving environments
are important in achieving high performance and scalability on petascale
computers. Clearly, the observation that applications with a high degree of
parallelism and data locality perform best on high-performance computers will
be even truer for a petascale system characterized by a deep memory
hierarchy. It is also often the case that to achieve good performance on an HPC
system, a programming style must be adopted that closely matches the
system’s execution model. Compilers should be able to extract concurrent
threads for regular computations, but a class of less regular computations must
be programmed with explicit threads to get good performance. It might also be
necessary for application programmers to bundle groups of threads that access
the same memory resources into strands in a hierarchical way that reflects the
hierarchical memory’s physical structure. Optimized library routines will
probably need to be programmed in this way. IBM’s proposed Blue Gene uses
massive parallelism to achieve petaop/s performance. Therefore, it can be
based on a less radical design than the HTMT machine, although it still uses
on-chip memory and multithreading to support a high-bandwidth, low-latency
communication path from processor to memory. Blue Gene is intended for use
in computational biology applications, such as protein folding, and will contain
on the order of one million processor–memory pairs. Its basic building block
will be a chip that contains an array of processor–memory pairs and interchip
communication logic. These chips will be connected into a 3D mesh.
Computations that possess insufficient inherent parallelism are not well suited
for petascale systems. Therefore, a premium is placed on algorithms that are
regular in structure, even if they have a higher operation count than more
irregular algorithms with a lower operation count. Dense matrix computations,
such as those the Lapack software library provides, optimized to exploit
data locality, are an example of regular computations that should perform well
on a petascale system. The accuracy and stability of numerical methods for
petascale computations are also important issues because some algorithms can
suffer significant losses of numerical precision. Thus, slower but more
accurate and stable algorithms are favored over faster but less accurate and
stable ones. To this end, hardware support for 128-bit arithmetic is desirable,
and interval-based algorithms may play a more important role on future
petascale systems than they do on current high-performance machines. Interval
arithmetic provides a means of tracking errors in a computation, such as initial
errors and uncertainties in the input data, errors in analytic approximations,
and rounding error. Instead of being represented by a single floating-point
number, each quantity is represented by a lower and upper bound within which
it is guaranteed to lie. The interval representation, therefore, provides rigorous
accuracy information about a solution that is absent in the point representation.
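
A minimal sketch of the interval idea follows: every quantity is carried as a [lo, hi] pair and each operation rounds the two bounds outward. This is generic C using the standard rounding-mode controls, not a petascale library; a production interval package would cover far more operations and handle compiler floating-point settings (e.g., -frounding-math) carefully.

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    typedef struct { double lo, hi; } interval;

    /* Add two intervals, rounding the lower bound down and the upper bound up,
     * so the exact sum is guaranteed to lie inside the result. */
    static interval iv_add(interval a, interval b)
    {
        interval r;
        fesetround(FE_DOWNWARD); r.lo = a.lo + b.lo;
        fesetround(FE_UPWARD);   r.hi = a.hi + b.hi;
        fesetround(FE_TONEAREST);
        return r;
    }

    int main(void)
    {
        /* An interval guaranteed to contain the real number 0.1, which is not
         * exactly representable in binary floating point. */
        interval x = { nextafter(0.1, 0.0), nextafter(0.1, 1.0) };
        interval s = { 0.0, 0.0 };
        for (int i = 0; i < 10; i++)
            s = iv_add(s, x);              /* the bounds now bracket exactly 1.0 */
        printf("sum lies in [%.17g, %.17g]\n", s.lo, s.hi);
        return 0;
    }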
The advent of petascale computing systems might promote the adoption of
completely new methods to solve certain problems. For example, cellular
automata (CA) are highly parallel, are very regular in structure, can handle
complex geometries, and are numerically stable, and so are well suited to
petascale computing. CA provide an interesting and
powerful alternative to classical techniques for solving partial differential
equations. Their power derives from the fact that their algorithm dynamics
mimic (at an abstract level) the fine-grain dynamics of the actual physical
system, permitting the emergence of macroscopic phenomenology from
microscopic mechanisms. Thus, complex collective global behavior can arise
from simple components obeying simple local interaction rules.
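
A toy example of the local-rule character of CA: the one-dimensional update below replaces each cell by the average of itself and its two neighbours, so every cell can be updated independently and in parallel. The rule and lattice size are arbitrary choices for illustration.

    #include <stdio.h>

    #define N 16

    /* One step of a toy 1-D cellular automaton: each cell becomes the average
     * of itself and its two neighbours (periodic boundaries).  Every cell
     * depends only on local data, so all cells can be updated in parallel. */
    static void ca_step(double next[N], const double cur[N])
    {
        for (int i = 0; i < N; i++) {
            int left  = (i + N - 1) % N;
            int right = (i + 1) % N;
            next[i] = (cur[left] + cur[i] + cur[right]) / 3.0;
        }
    }

    int main(void)
    {
        double a[N] = { 0 }, b[N];
        a[N / 2] = 1.0;                          /* a single "hot" cell */
        for (int step = 0; step < 8; step++) {
            ca_step(b, a);
            for (int i = 0; i < N; i++) a[i] = b[i];
        }
        for (int i = 0; i < N; i++) printf("%5.3f ", a[i]);
        printf("\n");
        return 0;
    }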
CA algorithms are well suited for applications involving nonequilibrium
and dynamical processes. The use of CA on future types of "programmable
matter" computers is discussed elsewhere; such computers can deliver
enormous computational power and are ideal platforms for CA-based
algorithms. Thus, on the five-to-10-year timescale, CA will play an increasing
role in the simulation of physical (and social) phenomena. Another aspect of
software development for a petascale system is the ability to automatically
tune a library of numerical routines for optimal performance on a particular
architecture. The Automatically Tuned Linear Algebra Software (Atlas) project
exemplifies this concept. Atlas is an approach for the automatic generation
and optimization of numerical software for computers with deep memory
hierarchies and pipelined functional units. With Atlas, numerical routines are
developed with a large design space spanned by many tunable parameters,
such as blocking size, loop nesting permutations, loop unrolling depths,
pipelining strategies, register allocations, and instruction schedules. When
Atlas is first installed on a new platform, a set of runs automatically
determines the optimal parameter values for that platform, which are then used
in subsequent uses of the code. We could apply this idea to other application
areas, in addition to numerical linear algebra, and extend it to develop
numerical software that can dynamically explore its computational
environment and intelligently adapt to it as resource availability changes. In
general, a multithreaded execution model implies that the application
developer does not have control over the detailed order of arithmetic
operations (such control would incur a large performance cost). Thus,
numerical reproducibility, which is already difficult to achieve with current
architectures, will be lost. A corollary of this is that for certain problems, it
will be impossible to predict a priori which solution method will perform best
or converge at all.
An example is the iterative solution of linear systems for which the
matrix is nonsymmetric, indefinite, or both. In such cases, an "algorithmic
bombardment" approach can help. The idea here is to concurrently apply
several methods for solving a problem in the hope that at least one will
converge to the solution. A related approach is to apply a fast, but
unreliable, method first and then to check a posteriori if any problem occurred
in the solution. If so, we use a slower method to fix the problem.
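
A minimal sketch of this fast-then-check-then-fall-back idea, with fast_solve, robust_solve, and residual_ok as placeholders for an aggressive method, a stable method, and an a posteriori test:

    #include <stdbool.h>

    /* Placeholders for an aggressive solver, a stable solver, and an
     * a posteriori residual check. */
    extern bool fast_solve(const void *problem, double *x);
    extern bool robust_solve(const void *problem, double *x);
    extern bool residual_ok(const void *problem, const double *x);

    bool solve(const void *problem, double *x)
    {
        if (fast_solve(problem, x) && residual_ok(problem, x))
            return true;                  /* the cheap path was good enough   */
        return robust_solve(problem, x);  /* otherwise pay for the slow path  */
    }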
These poly-algorithmic approaches can be available for application
developers either as black boxes or with varying degrees of control over the
methods used.
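
Returning to the Atlas-style automatic tuning described earlier in this chapter, the toy routine below times a tunable kernel for a few candidate block sizes at install time and keeps the best one. run_kernel is a hypothetical placeholder; Atlas itself searches a much larger parameter space.

    #include <stdio.h>
    #include <time.h>

    extern void run_kernel(int block_size);   /* hypothetical tunable kernel */

    /* Install-time search: time the kernel once per candidate block size and
     * remember the fastest, as an Atlas-style installation would (over a much
     * larger parameter space). */
    int pick_best_block_size(const int candidates[], int n)
    {
        int best = candidates[0];
        double best_time = 1e30;
        for (int i = 0; i < n; i++) {
            clock_t t0 = clock();
            run_kernel(candidates[i]);
            double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = candidates[i];
            }
        }
        return best;   /* recorded and reused by later runs of the library */
    }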
Chapter 7: Validating the architecture with application programs
Results of this validation work, carried out with collaborators at the San Diego
Supercomputer Center and Caltech's Center for Advanced Computing Research,
will be reported elsewhere.
Chapter 8: Blue Gene science applications development
To carry out the scientific research into the mechanisms behind protein folding
announced in December 1999, development of a molecular simulation
application kernel targeted for massively parallel architectures is underway.
Additional information about the science application portion of the Blue Gene
project is reported elsewhere. This application development effort serves multiple
purposes: (1) it is the application platform for the Blue Gene Science
programs. (2) It serves as a prototyping platform for research into application
frameworks suitable for cellular architectures. (3) It provides an application
perspective in close contact with the hardware and systems software
development teams.
One of the motivations for the use of massive computational power in
the study of protein folding and dynamics is to obtain a microscopic view of
the thermodynamics and kinetics of the folding process. Being able to simulate
longer and longer time-scales is the key challenge. Thus the focus for
application scalability is on improving the speed of execution for a fixed size
system by utilizing additional CPUs. Efficient domain decomposition and
utilization of the high performance interconnect networks on BG/L (both torus
and tree) are the keys to maximizing application scalability. To provide an
environment to allow exploration of algorithmic alternatives, the applications
group has focused on understanding the logical limits to concurrency within
the application, structuring the application architecture to support the finest-grained
concurrency possible, and logically separating parallel
communications from straight-line serial computation. With this separation and
the identification of key communications patterns used widely in molecular
simulation, it is possible for domain experts in molecular simulation to modify
detailed behavior of the application without having to deal with the complexity
of the parallel communications environment as well.
Key computational kernels derived from the molecular simulation
application have been used to characterize and drive improvements in the
floating-point code generation of the compiler being developed for the BG/L
platform. As additional tools and actual hardware become available, the effects
of cache hierarchy and communications architecture can be explored in detail
for the application.
Chapter 9: Conclusion
Blue Gene/L is the first of a new series of high performance machines being
developed at IBM Research. The hardware plans for the machine are complete
and the first small prototypes will be available in late 2003. We have presented
a software system that can scale up to the demands of the Blue Gene/L
hardware. We have also described the simulation environment that we are
using to develop and validate this software system. Using the simulation
environment, we are able to demonstrate a complete and functional system
software environment before hardware becomes available. Nevertheless,
evaluating scalability and performance of the complete system still requires
hardware availability. Many of the implementation details will likely change
as we gain experience with the real hardware.
Bibliography