ThesisModeling Parallel Processes in Biosystems

Download as pdf or txt
Download as pdf or txt
You are on page 1of 80

Modeling Parallel Processes in Biosystems

A. Goncearenco June, 2004

Abstract The problem of building an AI system basing on exhausted symbolic AI paradigm forces to search for a new one in optics and neurophysiology. A pseudooptical neural network approach is considered. Prerequisites of this approach are examined. The model is an articial network of interfering neurons. Potentials of these neurons change as a result of interference of periodical signals. Modeling of holographic and other optical phenomena is possible in networks of this kind. This pseudo-optical neuronal model is a step toward the articial holographic brain and brain-distributed associative memory. The process of calculation of the neuron potentials in the network is massively parallel. The model is implemented in software simulator. Parallel computing is considered. The modern state of HPC domain is examined and the trends are observed. A educational Beowulf cluster, supporting MPI, is constructed and tested. The simulator is parallelized to run on cluster. Keywords: articial intelligence, parallel computing, cluster, pseudooptical neural networks, modeling

Contents
Introduction 1 The Shift of Articial Intelligence Paradigm 4 6

1.1 The particularities of natural informational processes . . . . . . . . . . . . 7 1.2 Brain chemistry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Articial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Holography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5 The holographic brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6 The optics of memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.7 Solving with optical methods . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 Parallel Computing 17 2.1 Types of parallel computers . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.1 2.1.2 2.1.3 2.1.4 2.1.5 2.1.6 2.1.7 Scalar computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Parallel vector processors . . . . . . . . . . . . . . . . . . . . . . . . 17 Shared-memory multiprocessors . . . . . . . . . . . . . . . . . . . . 18 Distributed memory MPPs . . . . . . . . . . . . . . . . . . . . . . . 20 Cluster computers . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Network computing . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Metacomputing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2 The latest conventions in HPC . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.1 The HPC trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.2 Linux clusters in TOP500 . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.3 2.2.4 2.2.5 PC-to-supercomputer. CS301 chip . . . . . . . . . . . . . . . . . . . 24 First optical processor in stock . . . . . . . . . . . . . . . . . . . . 24 Microsoft entering HPC arena . . . . . . . . . . . . . . . . . . . . . 25

2.3 Communication libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Message-passing libraries . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 One-sided communications . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.3 Higher level libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.4 Message-passing using MPI . . . . . . . . . . . . . . . . . . . . . . . . . . 28 1

2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6

Initialization and clean up . . . . . . . . . . . . . . . . . . . . . . . 28 Basic sends and receives . . . . . . . . . . . . . . . . . . . . . . . . 28 Asynchronous communications . . . . . . . . . . . . . . . . . . . . . 30 Performance and portability . . . . . . . . . . . . . . . . . . . . . . 31 Global functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.7 Running MPI Codes with MPICH . . . . . . . . . . . . . . . . . . . 33 2.4.8 Parallelization Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.5 Beowulf class cluster computers . . . . . . . . . . . . . . . . . . . . . . . . 35 2.6 Creating a Beowulf cluster powered by MPI . . . . . . . . . . . . . . . . . 36 2.6.1 Power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.6.2 2.6.3 2.6.4 VIA EPIA-V C3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Desktop mini-cluster . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Hardware conguration . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.6.5 Software conguration . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.7 Benchmark tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7.1 The algorithm of Bef f . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7.2 NAS parallel benchmark . . . . . . . . . . . . . . . . . . . . . . . . 42

3 Modeling Pseudooptical Neural Networks 44 3.1 Pseudooptical neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Complete rectilinear model . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.1 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.2 Hologram construction and calculation . . . . . . . . . . . . . . . . 47 3.2.3 Image reconstruction calculations . . . . . . . . . . . . . . . . . . . 49 3.3 Creating a simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.1 Consecutive implementation . . . . . . . . . . . . . . . . . . . . . . 51 3.3.2 Parallel implementation . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4 Studying model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Conclusions A Tables and diagrams 54 59

A.1 TOP500 supercomputing resource . . . . . . . . . . . . . . . . . . . . . . . 59 A.2 VIA EPIA platform technical specications . . . . . . . . . . . . . . . . . . 62 A.3 Cluster drawings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 A.4 Cluster pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.5 Performance benchmark tests . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.5.1 Without compiler optimization . . . . . . . . . . . . . . . . . . . . 72 A.5.2 With compiler optimization . . . . . . . . . . . . . . . . . . . . . . 73 2

A.5.3 Optimization gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.5.4 Processor scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.5.5 Controlling node performance . . . . . . . . . . . . . . . . . . . . . 75 A.6 Development tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 A.7 Running the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Introduction
The problem of building an articial intelligence (AI) system, comparable to human intelligence is challenging the researchers all over the world. Most of them agree in opinion that symbolic AI paradigm is exhausted and develop other methods of attacking AI problems, which require experimental survey. One of the promising methods is to develop the idea of resemblance between brain processes and optical holography. The goals of the project, described in this paper, are: to dene the problems of modern AI and the constraints of symbolic AI paradigm; to examine the pseudooptical approach to AI problems; to create a software simulator for the model of pseudooptical neural network; to build a parallel hardware computational platform for the simulator; The structure of this paper is determined by the requirements to dene suitable and theoretically valid methodological, algorithmic, hardware and software solution. It is essential to analyze the background of the problem put by. This requires a review of some signicant works and papers in theoretical articial intelligence, neurophysiology, psychology and neuroinformatics. Some most important works in adjoining areas, such as optics and holography also require to be reviewed. A correct choice of hardware and software platform for parallel high performance computations requires the author to perform systematization analyses of available solutions and to trace the trends of future development in this area. The constructed computer is to be thoroughly tested and prepared for parallel computations. The model is implemented in consecutive, scalar way and then parallelized to run on high performance platform. This also requires to be described. This paper consists of three chapters. One of the modern approaches to AI problems is described in the rst chapter. Some of the most important works in informatics and neurobiology are referred and analyzed. The concept of holographic brain and holographic associative memory is described. The second chapter is devoted to parallel computing. The modern state of parallel systems industry and parallel programming is considered. The description of construction of a sample of a Beowulf Cluster is presented. The cluster is prepared for parallel computations and this process is described in details. 4

The mathematical model of pseudooptical neuronal network (PNN) is presented in the third chapter. The notions, the most important theorems and algorithms are presented. The process of calculating a hologram is described along with the process of original image reconstruction. The third chapter contains the detailed description of the simulation software and the results of simulation in consecutive and parallel environments. The experimental results of simulation are cited. The eect of models parameters on the model is considered to nd the sets of parameters suitable for image reconstruction. The paper is enriched by a collection of bibliography references tightly related to the subject of the work. The results of cluster performance measurements, main cluster conguration les, simulation software source code, application GUI screen-shots and results of simulation are included as appendixes.

Chapter 1 The Shift of Articial Intelligence Paradigm


Since homo sapiens appeared one of the major goals of the mankind is self-cognition. The basic skills to survive, the religions, the philosophy and the science describe the rules of interaction between the human being and the world. The latest centuries, characterized by tremendous scientic achievements, unveiled a lot of natures mysteries. The human body received a profound study. People surrounded themselves with an army of machines simplifying and perplexing the human life at the same time. The computational power of computers is measured in Gigaops1 and Teraops. Although these facts coexist with another unfavourable fact: there still is lot of problems which can not be eciently solved by machines, moreover a lot of problems can not be solved by modern machines at all. There doesnt exist any operational articial intelligence system comparable to human one. At last there where no published announcements yet. There exist a large set of intellectual processes which are well known to run much quicker in human brain than in fastest computers. The examples of these processes are: data analysis, recognition, nding resemblance, making decisions based on imprecise conditions, abstract thinking. The marvellous thing is that the signal propagates millions times slower in neural networks (the speed is approximately 100m/s) than it does in electronic devices. What are the key-features of the structure and of the processes, determining high eectiveness of the brain? How can we achieve such eectiveness? The eld of articial intelligence (AI), formally founded in 1956, attempts to understand, model and design intelligent systems. Since the beginning of AI, two alternative approaches were pursued to model intelligence: on the one hand, there was the symbolic approach which was a mathematically oriented way of abstractly describing processes leading to intelligent behaviour. On the other hand, there was a rather physiologically oriented approach, which favoured the modelling of brain functions in order to reverse1

Flops stands for FLoating point Operations Per Second

engineer intelligence. A profound comparison of these paradigms can be found in A. Homanns book [1]. Between the late 1960s and the mid-1980s, virtually all research in the eld of AI and cognitive science was conducted in the symbolic paradigm. This was due to the highly inuential analysis of the capabilities and limitations of the perceptron by Minsky and Paperhthe 1969. The perceptron was a very popular neural model at that time. In the mid-1980s a renaissance of neural networks took place under the new title of connectionism, challenging the dominant symbolic paradigm of AI. The brain-oriented connectionist paradigm claims that research in the traditional symbolic paradigm cannot be successful since symbols are insucient to model crucial aspects of cognition and intelligence. Since then a debate between the advocates of both paradigms is taking place, which frequently tends to become polemic in many writings on the virtues and vices of either the symbolic or the connectionist paradigm. The main arguments put forward by the advocates on both sides fail to address the methodologically important and ultimately decisive question for or against a paradigm: How feasible is the development of an AI system or the understanding of a theory of cognition?

1.1

The particularities of natural informational processes

The main distinguishing principles of knowledge handling by natural intellect are displayed in [7, 3]: The indivisibility of information: particular and complicated things are more common for a human than abstract simple things; each human thought, every element of the thought has a very large explicit and implicit contexts which are unique for every person. A notion can not be separated from context, even if it is abstract. The context is formed out of thoughts bound to each other. The volume of knowledge is determined by the number of cross-relations. The knowledge is stored and processed in form of images. It is not only about graphical images, which are exceptionally well processed by humans. The thoughts and ideas are also images. The images are fuzzy and vague, that is the main dierence from the denite notions. It is the resemblance, not the identity operation that can be applied to images. The processes of generalization and notion formation are based on resemblance operations. The chains of fuzzy image transformations can not be deep. The whole system is tolerant to violations, conicts and paradoxes. Memory apparatus totally diers from computer memory: the information it is addressed directly, no pointers and references; the search process is bidirectional ; 7

the memory is global, has no local spaces. The information is not stored in particular neurons; it is, as will be shown below, stored distributed. The quality and speed of memory access heavily depends on the characteristics of information. Another great distinction of human memory is the characteristics of memory durability which has no analogue in modern computers; the durability and brightness of dierent reminiscences diers; the information can not be intentionally forgotten. Controllability of information processes is also a distinction. There are supraliminal processes controlled by consciousness which can be described by human. Other processes can not be eected by human, these are subconsciousness processes. Only the results of this processes indicate their existence. Supraliminal processes are consecutive and can be expressed in terms of endophasia (by Vygodsky). The lack of the above described principles in symbolic paradigm stipulates the shift of the AI paradigm required to overcome its constraints. The minimal requirements for new AI paradigm are: non-symbolic, analog information representation; existence of resemblance-nding procedures instead of identication procedures; quick processes with small depth and high level of parallelism; the parallelism has to be natural and analog and should be implemented as an internal property of memory and not as an optional addition; the distribution of information across the large memory zones, which store a large amounts of information simultaneously; parametric model, where the characteristics of model depends mostly on contiguous parameters than on the models architecture. Most likely the explanation and the answers are at the bottom of understanding totally dierent mechanisms of intellectual processes. Studying the human brain is essential.

1.2

Brain chemistry

The human brain contains more than 1011 computational elements, called neurons. This astronomical number of neurons is interconnected by 1015 of neural connectors, called synapses. This provide oxygen and nutritional support to this complex neural network a dense network of blood vessels is contained by the brain. To protect chemical-sensitive

brain from possible toxins it is separated from the main blood-stream by a so called bloodbrain (hematoencephalic) barrier. The structure of the brain if formed of glia cells. These cells actually ll the spaces between blood vessels and the neurons. The brain is very energy-eective. It consumes only 20W ! But comprising 2% of the body weight the brain consumes 20% of body oxygen. A neuron is a structural component of the neural network. It is a biological cell but several considerable distinctions allow it to perform calculating functions. As shown in the Figure 1.2.1 the neuron consists of three main parts: cell body, dendrites and axon. The axon of a cell is connected to the dendrites of the other cells. Brain chemicals are stored in sacks located at the end of a neuron branch, the site of the synapse, the space where one neuron sends its message to the next. An electrical charge frees chemicals from their holding tanks making their way across the synapse to the connecting neuron. The biochemical process is described step by step as in [4]. When a neuron res an electrical impulse (see Figure

Figure 1.2.1.

Biological

Neuron Structure.

1.2.2) it travels down a ber called the axon until it reaches the end of its line where the chemical molecules are stored. The electrical blast starts the chemical transmissions. The molecules that cross the synapse bombard the receiving neuron which has special receptors set up to bind with them. Molecules that travel through neurons are called neurotransmitters which have been said to modify and shape human behavior. Selective serotonin reuptake inhibitors (SSRI) regulate the neurochemical serotonin by inhibiting the reuptake of serotonin thus depleting the supply by blocking the receptor. Monoamine oxidase (MAO) is an enzyme that is found in many parts of the body. In the brain, monoamine oxidase destroys neurotransmitters, such as norepinephrine and serotonin. MAO inhibitors block Figure 1.2.2. The dynamics of neural the breakdown of those neurotransmitters by potential. limiting the activity of monoamine oxidase.

There are more than 30 substances, that can act as neurotransmitters and a lot of types of receptors with dierent feedback reactions. The complexity of biochemical brain processes is very high. Neurobiology researchers are trying to nd simple rules to generalize these processes. Among the results of such researches is the discovery of dierent types of electrochemical brain activity. The creation of biologically identical model is still in progress and will hardly be nished soon.

1.3

Articial neural networks

Modern articial neural networks are based on a extremely simplied model, ignoring the details of brain functionality. The task is to create more accurate model. Such a complex task requires a dierent mathematical methodology. There is no universally accepted denition of an Articial Neural Network (later NN). But perhaps most people in the eld would agree that an NN is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") which usually carry numeric (as opposed to symbolic) data, encoded by any of various means. The units operate only on their local data and on the inputs they receive via the connections. The restriction to local operations is often relaxed during training. Some NNs are models of biological neural networks and some are not, but historically, much of the inspiration for the eld of NNs came from the desire to produce articial systems capable of sophisticated, perhaps "intelligent", computations similar to those that the human brain routinely performs, and thereby possibly to enhance our understanding of the human brain. An articial neuron, as a unit of NN, is a rst approximation imitation of a biological neuron. The structure is described as shown in Figure 1.3.1. The input of a neuron is formed by a set of signals, each of those is output of another neuron. Each input is multiplied by a weight, analogous to synaptic stimulation. Then all the inputs are summed up. The acquired value determines the level of neuron activation.

Figure 1.3.1. Articial neuron scheme. The activation function is a nonlinear sigmoid function which determines the value of output signal. Most NNs have some sort of "training" rule whereby the weights of connections are adjusted on the basis of data. In other words, NNs "learn" from examples, as children learn to distinguish dogs from cats based on examples of dogs and cats. If trained carefully, 10

NNs may exhibit some capability for generalization beyond the training data, that is, to produce approximately correct results for new cases that were not used for training. NNs normally have great potential for parallelism, since the computations of the components are largely independent of each other. Some people regard massive parallelism and high connectivity to be dening characteristics of NNs, but such requirements rule out various simple models, such as simple linear regression (a minimal feed-forward net with only two units plus bias), which are usefully regarded as special cases of NNs. Here is a sampling of denitions of NN. According to the DARPA Neural Network Study : ... a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes. According to Haykin [10]: A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: Knowledge is acquired by the network through a learning process. Inter-neuron connection strengths known as synaptic weights are used to store the knowledge. According to Nigrin [11]: A neural network is a circuit composed of a very large number of simple processing elements that are neurally based. Each element operates only on local information. Furthermore each element operates asynchronously; thus there is no overall system clock. According to Zurada [12]: Articial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experimential knowledge. More information on neural science can be obtained from [19].

1.4

Holography

Denition.

11

The optical recording of the object wave formed by the resulting interference pattern of two mutually coherent component light beams. In the holographic process, a coherent beam rst is split into two component beams, one of which irradiates the object, the second of which irradiates a recording medium (see Figure 1.4.1). The diraction or scattering of the rst wave by the object forms the object wave that proceeds to and interferes with the second coherent beam, or reference wave at the medium. The resulting pattern is the threedimensional record (hologram) of the object wave. The history. Holography dates from 1947, when British (native of Hungary) scientist Dennis Gabor developed the theory of holography [18] while working to improve the resolution of an electron microscope. Gabor coined the term hologram from the Greek words holos, meaning "whole," and gramma, meaning "message". Further development in the eld was stymied during the next decade because light sources available at the time were not truly coherent2 . This barrier was overcome in 1960 by Russian scientists N. Bassov and A. Prokhorov and American scientist Charles Towns with the invention of the laser , whose pure, intense light was ideal for making holograms. In that year the pulsed-ruby laser was de-

veloped by Dr. T.H. Maimam. This laser system (unlike the continuous wave laser normally used in holography) emits a very powerful burst of light that lasts only a few nanoseconds. It eectively freezes movement and makes it possible to produce holograms of high-speed events, such as a bullet in ight, and of living subjects. The rst hologram of a person was made in 1967, paving the way for a specialized application of holography: pulsed holographic portraiture. In 1962 Emmett Leith and Juris Upatnieks [13] of the University of Michigan recognized from their work in side-reading radar that holography could be used as a 3-D visual medium. In 1962 they read Gabors paper and "simply out of curiosity" decided to duplicate Gabors technique using the laser and an "o-axis" technique borrowed from their work in the development of side-reading radar. Also in 1962 Dr. Yuri N. Denisyuk [14] from Russia produced a white-light reection hologram which, for the rst time, could be viewed in light from an ordinary incandescent light bulb.
2

Figure 1.4.1. The holography scheme.

coherent - from a single point, and of a single wavelength: monochromatic or one-color

12

Another major advance in display holography occurred in 1968 when Dr. Stephen A. Benton invented white-light transmission holography while researching holographic television at Polaroid Research Laboratories. In 1972 Lloyd Cross developed the integral hologram by combining white-light transmission holography with conventional cinematography to produce moving 3-dimensional images. More information about holography can be obtained from [16].

1.5

The holographic brain

One of the problems facing neural science is how to explain evidence that local lesions in the brain do not selectively impair one or another memory trace. In his works Karl Pribram [20,21,22] uses some experimental and philosophic evidences to support the idea of holographic nature of reality, introduced by famous physicist Bohm. He provides the proof of the idea that the brain implements holonomic transformations that distribute episodic information over regions of the brain (and later "refocuses" them into a form in which we re-member). Karl Pribrams holonomic3 theory reviews evidence that the dendritic processes function to take a "spectral" transformation of the "episodes of perception". This transformed "spectral" information is stored distributed over large numbers of neurons. When the episode is remembered, an inverse transformation occurs that is also a result of dendritic processes. It is the process of transformation that gives us conscious awareness. The idea of holography is that every part of the hologram contains information about the whole object. If you cut the holographic photographic plate up into small pieces, the whole image can still be extracted from any of them (although with some loss of clarity). How was this idea to combine holography with neuroscience born? For decades numerous studies have shown that rather than being conned to a specic location, memories are dispersed throughout the brain. In a series of landmark experiments in the 1920s, brain scientist Karl Lashley [2] found that no matter what portion of a rats brain he removed he was unable to eradicate its memory of how to perform complex tasks it had learned prior to surgery . The only problem was that no one was able to come up with a mechanism that might explain this curious "whole in every part" nature of memory storage. Then in the 1960s Pribram encountered the concept of holography and realized he had found the explanation brain scientists had been looking for. It seemed immediately plausible that the distributed memory of the brain might resemble this holographic record. In the spring of 1965, two physicists, Bela Julesz and K. S. Pennington, called attention to a similarity, in principle, between a new type of optical hologram and memory stores in
3

Pribram uses the term holonomy to refer to a dynamic (or changing) hologram.

13

living brains. Philip Westlake, a brilliant UCLA cyberneticist, has shown that equations of physical holograms match what the brain does with information. D. Gabor publishes Associative Holographical Memories in 1969 [17]. Pribrams theory also explains how the human brain can store so many memories in so little space. It has been estimated that the human brain has the capacity to memorize something on the order of 10 billion bits of information during the average human lifetime (or roughly the same amount of information contained in ve sets of the Encyclopedia Britannica). Our ability to quickly retrieve whatever information we need from the enormous store of our memories becomes more understandable if the brain functions according to holographic principles. Similarly, it is more understandable how the brain is able to translate the avalanche of frequencies it receives via the senses (light frequencies, sound frequencies, and so on) into the concrete world of our perceptions.

1.6

The optics of memory

Paul Pietsch - a Professor Emeritus at Indiana University, has been testing such ideas for several years by transplanting and reorganizing the brains of salamanders [23, 25, 24]. His point of departure was to imagine memory as a deck of cards with all the suits and numbers on each card; shuing the deck would not change the deal. In earlier experiments, he found that drastic shuings of a carnivorous salamanders brain do not scramble the beasts appetite an instinctive behavior. Since then, he extended shuebrain research to learned behavior. He took an adult axolotl (a Mexican salamander that spends its whole life in water) and taught it to look up (as in Figure 1.6.1) when someone taps the side of its bowl and rewards it with a beef liver or a live earthworm. Soon, tapping alone called forth the response, even after weeks with immediate reward. The experiment was to transplant various parts of a trained adults brain into the head of a naive larva (see Figure 1.6.2). What happened is that a previously naive host became a looker-up without any training. The transplant alone sufced to give the animal the necessary memory. Neither the part of the brain, nor the site in the hosts cranium changed Figure 1.6.1. A looking-up the results. What about the donors? They still remembered salamander. the task. In other words, the necessary information existed both in the pieces of transplanted brain and in the parts left in place. The implication is that memory is holographic; all memory can theoretically be contained in the smallest unit of the medium, whatever that may be. 14

Holographic theory would also explain the chemical transfer of memory how information from the brain of one worm, rat, mouse, or hamster might be extracted into a test tube and injected into another animal, there to mediate recall in the absence of the recipients previous experience. Such reports from a dozen laboratories over the past few years have excited the press and reading public. Yet a hologram can write itself into anything, including a molecule. At the very same time, the theory in no way at all restricts the brains programs to molecules, as such. Theres no rule against using, say, molecules, voltages on cells, or groups of neurons to carry the information. The program might even be carried at many dierent levels simultaneously. Still, the theory does not completely rule out uneven distribution of memory, particularly in the complex brains of higher animals. Indeed, it is not hard to make a case for dierent storage within the two hemispheres of the human cerebrum. Michael Gazzaniga published a book with the results of almost twenty years of "splitbrain" research. He cut the neural cord, connecting the hemispheres. The left cerebral hemisphere emerged as the dominant, verbal, arithmetic side, while the right brain held recollections of form and texture. The tendencies appear to hold whether patients were left- or right-handed. Early in 1971, music was found among the repertoire of the right hemisphere. In addition, a totally illiterate right hemisphere can learn to read and write in less than six months as though it had a tremendous head start. On top of this, Gazzanigas observations convince him that the consignment of memories to one side of the brain emerges with maturity. Children seem to employ both hemispheres. Thus it would seem that the brain can reshape its contents and make decisions about what will go where. But it is also quite possible that split-brain research identies not unequal storage but unequal access. The cerebral hemispheres, after all, do mirror rather than carbon-copy each other. These facts are very important to understand the pseudooptical neural approach to the problems of articial intelligence. Although further discussion of this subject goes beyond the limits of this paper, other works [6, 5], continuing the investigation of intellectual properties of left and right cerebral hemispheres have to be mentioned here. Figure 1.6.2. Brain transplant.

15

1.7

Solving with optical methods

The project described in this paper is aimed at solution of the basic problem of articial intelligence by optical holography methods. It was shown that this approach has a strong background in neurophysiology, neurophsyhology and physics. Objectives: A search for a model, implementing the described holographic principles to articial intelligence; A search for a computational environment, a platform, for implementing the model; Development of model-simulation software with highly demanded massive-parallelism; Experiments with the model to achieve holographic phenomena; Further research of learnability (holography), associability of response (holography). The methods may vary, for example [7] and [3].

16

Chapter 2 Parallel Computing


2.1 Types of parallel computers

There are many types of computers available today, from single processor or scalar computers to machines with vector processors to massively parallel computers with thousands of microprocessors. Each platform has its own unique characteristics. Understanding the dierences is important to understanding how best to program each and to nd a suitable
ower platform for the project in terms of PCost . However, the real trick is to try to write programs that will run reasonably well on a wide range of computers. This systematization

analysis is based on the research of David Turner [26].

2.1.1

Scalar computers

Your typical PC or workstation is a scalar, or single-processor, computer. These use many dierent processors, and run a wide variety of operating systems. Below in Table 2.1.1 is a list of some current processors and the operating systems each can run. Linux is now available for almost all processors. CPU Manufacturer/Name Intel Pentiums, AMD Athlons Alpha 21264 IBM Power3 Sun UltraSparc3 SGI R12000 Motorola G4 Frequency 1,5 - 3 GHz 667 MHz 375 MHz 500 MHz 400 MHz 500 MHz Operating Systems Linux, Windows, *BSD, Solaris Tru64, Unix, Alpha, Linux AIX, Linux Solaris, Linux IRIX MacOS, Linux

Table 2.1.1. Processors and operating systems.

2.1.2

Parallel vector processors

The great supercomputers of the past were built around custom vector processors. 17

These are the expensive, high performance masterpieces pioneered by Seymour Cray. There are currently only a few examples of computers in production that still use vector processors, and all are parallel vector processors (PVPs) that run small numbers of vector processors within the same machine. Examples of computers with parallel processors see below in Table 2.1.2. Vector processors operate on large vectors of data at the same time. The compiler automatically vectorizes the innermost loops to break the work into blocks, often of 64 elements in size, when possible. The functional units are pipelined to operate on all 64 elements within a single clock cycle, and the memory subsystem is optimized to keep the processors fed at this rate. Since the compilers do much of the optimization automatically, the user only needs to attack those problem areas where there is some impediment to the compiler understanding how to vectorize the loop. CPU Manufacturer/Name Cray SV1 Fujitsu VSX4 Computational Power peak of 4.8 GFlops/proc peak of 1.2 GFlops/proc Figure 2.1.1. Cray SV1. Number of Processors 100 processors 64 processors

Table 2.1.2. Vector systems. MPI, compiler directives, OpenMP, and pThreads packages can be used to parallelize a program to run across multiple vector processors.

2.1.3

Shared-memory multiprocessors

Shared-memory multiprocessor (SMP) systems [33] have more than 1 scalar processors that share the same memory and memory bus. This category includes everything from a dual-processor Intel PC to a 256 processor Origin3000. Each processor may have its own cache on a dedicated bus, but all processors are connected to a common memory bus and memory bank. In a well designed system, the memory bus must be fast enough to keep data owing to all the processors. Large data caches are also necessary, as they allow each processor to pull data into local memory to crunch the data while other processors use the memory bus. Most current SMP systems share these two design criteria. Early Pentium SMP systems, and the multiprocessor nodes of the Intel Paragon, did not and therefore the additional processors were relatively useless. Below are some other SMP systems, and the number of processors that each system can have. 18

CPU Manufacturer/Name SGI Origin3000 SGI Origin2000 SGI Origin200 Compaq ES40 Compaq DS20 IBM 43p IBM 44p Intel Pentium PCs

Frequency 256 MIPS R12000 processors 128 MIPS R10000 processors 2-4 MIPS R10000 processors 2-4 Alpha 21264 processors 2 Alpha 21264 processors 1-2 Power3 processors 2-4 Power3 processors 2-4 Pentium processors Table 2.1.3. SMP Systems.

Operating Systems IRIX IRIX IRIX Tru64 or Linux Tru64 or Linux AIX or Linux AIX or Linux MS Windows Linux, others

SMP systems can be programmed using several dierent methods. A multithreading approach can be used where a single program is run on the system. The program divides the work across the processors by spawning multiple light-weight threads, each executing on a dierent processor and performing part of the calculation. Since all threads share the same Figure 2.1.2. Shared memory multiprocessor. program space, there is no need for any explicit communication calls.

Compilers are sophisticated enough to using multi-threading to automatically parallelize some codes. Unfortunately, this is not the case very often. While using multithreading is undoubtedly the most ecient way to program SMP systems, it is not easy to do it manually. Plus there are many choices when it comes to choosing a multi-threading package. OpenMP may be the most popular standard at the moment, but there are many vendor specic packages and other standards such as the POSIX pThreads package. In summary, multi-threading may produce the most ecient code for SMP systems, but it may not be the easiest way parallelize a code. It also may not be portable, even across SMP systems, unless a standard like OpenMP is chosen that is supported almost everywhere. Of even more concern is that a multi-threading code will not run on distributed memory systems. In the message-passing paradigm, each processor is treated as a separate computer running a copy of the same program. Each processor operates on a dierent part of the problem, and data exchange is handled by passing messages. More details about this approach will be presented in the next section. While not as ecient as using multi-threading, the resulting code is much more portable since message-passing is the predominant method for programming on dis19

tributed memory systems. Message-passing implementations on SMP systems are very ecient since the data doesnt have to traverse the network as in distributed memory systems. It is just copied from memory to memory, which should occur at high speed and with little overhead.

2.1.4

Distributed memory MPPs


There is an even wider variety of the largest MPP (massively parallel processor) systems, the distributed memory computers. However, these systems all share a few traits in common. These systems are made of many individual nodes, each of which is essentially an independent computer in itself. In fact, in the case of workstation/PC clusters, each node is a computer in itself. Each node consists of at least one processor, its own memory, and a link to the

Figure 2.1.3. Intel Paragon (2D mesh topology).

network that connects all nodes together. Aside from these generalizations, distributed memory systems may look very dierent. Traditional MPP systems may contain hundreds or thousands of individual nodes. They have custom networks that are typically very fast with relatively low latencies. The network topologies very widely, from completely connected systems to 2D and 3D meshes and fat trees. Below is a partial listing (Table 2.1.4) of some of the more common MPP systems available today, and some of the characteristics of each. There have been many other large MPP systems over the last decade. While some systems may remain, the companies have gone out of business. These include machines such as the Thinking Machines CM-5, the Kendall Square systems, the nCube 2, and the Intel Paragon listed above. These systems soared in popularity, then crashed just as fast. The reasons for this vary, from the diculty of surviving the high cost of using custom components (especially processors) to having good technology but a poor business model. Manufacturer Cray T3E IBM SP Intel Paragon Nodes 1000 nodes 512 nodes 1836 nodes Processors Alpha 21164 Power3 i860 proc Topology 3D Torroid Colony 2D mesh Capacity 340 MB/sec 900 MB/sec 130 MB/sec Latency 2-3 s ? s 100 s

Table 2.1.4. MPP systems. MPP systems are programmed using message-passing libraries. Most have their own custom libraries, but all current systems also support the industry standard MPI message20

passing interface. This allows codes programmed in MPI to be portable across distributed memory systems, and also SMPs as described previously. Unfortunately, the MPI implementation is not always as ecient as the native communication library, so there is still some temptation to us the native library at the expense of program portability. The Cray T3E is an extreme case, where the MPI implementation only achieves 160 MB/sec while the native SHMEM library can deliver twice that rate.

2.1.5

Cluster computers

Distributed memory computers can also be built from scratch using mass produced PCs and workstations. These cluster computers are referred to by many names, from a poormans supercomputer to COW s (clusters of workstations), and NOW s (networks of workstations) [37, 38]. They are much cheaper than traditional MPP systems, and often use the same processors, but are more dicult to use since the network capabilities are currently much lower. Cluster computers are also usually much smaller, most often involving fewer than 100 computers. This is in part because the networking and software infrastructure for cluster computing is less mature, making it dicult to make use of very large systems at this time. Below is a list (Table 2.1.5) of some local clusters and their characteristics, plus some other notable systems from around the country. One look at the communication rates and message latenFigure 2.1.4. 4 node PC cluster.

cies shows that they are much worse than for traditional MPP systems. Obviously you get what you pay for, but the networks for cluster computers are quickly closing the gap. Name Octopus Alice Gecoa IBM Cluster C-PLANT Nodes 16 64 24 52 ? Processors Pentium III dual-Pentium Alpha 21164 Power3 IBM Alpha 21164 OS Linux Linux Linux AIX Tru64 Network Fast Ethernet Fast Ethernet Gigabit Ethernet Gb.Eth. + Myrinet Myrinet Capacity 8.5 MB/S 8.5 MB/S 30 MB/S 100 MB/s 100 MB/s Latency ~100 s ~100 s ~100 s ~20 s ~15 s

Table 2.1.5. Cluster systems. It is therefore more dicult to get many programs to run well on clusters computers. Many applications will not scale to as many nodes due to the slower networking, and some codes simply will not run well at all and must be limited to MPP systems. There is also a wide range of capabilities illustrated by the mixture of clusters above. This range starts with the ultra-cheap PC clusters connected by Fast Ethernet. These 21

systems can be built for tens of thousands of dollars, but the slow speed of the Fast Ethernet interconnects greatly limits the number of codes that can utilize this type of a system. The adoption of clusters, collections of workstations/PCs connected by a local network, has virtually exploded (see Figure A.1.1) since the introduction of the rst Beowulf cluster (see Section 2.5) in 1994. The attraction lies in the (potentially) low cost of both hardware and software and the control that builders/users have over their system. There is nowadays a fair choice of communication networks available in clusters. Of course 100 Mb/s Ethernet is always possible, which is attractive for economic reasons, but has the drawbacks of a very modest maximum bandwidth (about 10 MB/s) and a high latency (about 100 s). Gigabit Ethernet has a maximum bandwidth that is 10 times higher but has about the same latency. Alternatively, there are for instance networks that operate from user space, like Myrinet [27], Giganet cLAN [28], and SCI [29]. The rst two have maximum bandwidths in the order of 100 MB/s and a latency in the range of 1520 s. SCI has a higher bandwidth (400500 MB/s theoretically) and a latency under 10 s. The latter solution is more costly but is nevertheless employed in some cluster congurations. So, possibly apart from the speed of the processors and of the software that is provided by the vendors of supercomputers, the distinction between clusters and other classes of HPC machines becomes rather small and will undoubtly decrease in the coming years.

2.1.6

Network computing

Cluster computers are made from many computers, usually identical, that are located in the same room. Heterogeneous clusters use dierent computers, and are much more dicult to program because the computing rates and even the communication rates may vary. Network computing, or Internet computing, is using a heterogeneous mixture of geographically separated workstations to perform calculations. The idea of using the spare cycles on desktop PCs or workstations has been around for years. Unfortunately, it only works for a very limited number of applications due to the very low communication rate of the network, the large latencies, the diering CPU rates. and the need to allow for fault tolerance. The SETI at home project is probably the most famous application that can make use of any spare cycles on the Internet. There are also commercial companies such as Entropia that help companies to do the same.

22

2.1.7

Metacomputing

Metacomputing is a similar idea, but with loftier goals. Supercomputers that may be geographically separated can be combined to run the same program. However, the goal in metacomputing is usually to provide very high bandwidths between the supercomputers so that these connections do not produce a bottleneck for the communications. Scheduling exclusive time on many supercomputers at the same time can also pose a problem. This is still an area of active research.

2.2
2.2.1

The latest conventions in HPC


The HPC trends

TOP500 Project [30] has started nearly a decade ago. It is an ultimate resource about supercomputers, a starting point. One of its goals is to track the history of supercomputing by publishing the lists of most powerfull supercomputers (A.1.1, A.1.2). The database gathered allows analysts to plot the trends. See current types of clusters in Figure A.1.1, current architecture/performance distribution in Figure A.1.2. The trends(forecasts) can be followed in Figure A.1.3.

2.2.2

Linux clusters in TOP500

Go back more than a few years ago and Linux had zero representation among the worlds fastest supercomputers, which relied on traditional, monolithic mainframe machines running Unix or other operating systems. In the last few years, however, the open source operating system has begun dominating the list, thanks to clustering and Intel hardware in the supercomputing market. The number of cluster systems has increased strongly in the last three to four years from a few to more than half the list (280/500). See Tables A.1.1, A.1.2 in Appendix. Japans Earth Simulator Center remains the worlds fastest supercomputer, according to the Top 500 list. The United States owns the second- through fth-fastest computers in the world. Although list creators, who use the Linpack benchmark [53] to measure performance of the worlds biggest machines, do not break down systems by operating system, a look at the number of clusters is a good gauge of the number of Linux systems. IBM held three of the top ve positions with high performance cluster systems, which are used in academic, government, and other supercomputing research. As has been the case in industry, Linux is lling the slots previously occupied by Unix systems, such as the SuperDome systems, which have dropped o as Linux has risen.

23

In addition to cluster systems from IBM, HP, Dell, and other manufacturers, Linux has also been used in other systems by SGI and others using Itanium 2 or Xeon processors from Intel and Opteron processors from AMD. Claybrook, a supercomputer analyst, said, referring to the price/performance advantage of the open source operating system. "Eventually, I think youll nd Linux is going to replace everything on the Top 500 list."

2.2.3

PC-to-supercomputer. CS301 chip

A computer chip that will enable personal computers to perform some calculations as fast as some supercomputers was unveiled recently [35]. Developed by ClearSpeed Technologies, based in California, the CS301 chip is capable of 25 GFlops. Putting around 20 ClearSpeed chips into a few personal computers could potentially provide the sort of power normally only found in a supercomputer built from hundreds of parallel processors. The CS301 works as a supplementary component to a regular processor. A chipset carrying one or two of the chips can be plugged into a normal PC like a graphics card and perform intensive calculations on behalf of the machines normal processor. The chip is also very power-ecient, consuming only 3W and ClearSpeed is working on a version for laptop computers.

2.2.4

First optical processor in stock

A superfast computing processor that uses light, not electrons, to perform calculations has gone on sale recently [34]. Lenslet, the Israeli company that developed the processor, say its light speed calculations deliver the power of a supercomputer in a single device. The device is called Enlight and can perform 8000 billion arithmetic operations per second, about 1000 times faster than a standard processor. Enlight is not a general purpose processor like a Pentium. Instead, each processor will be custom-built to perform a specic set of tasks, and will not be programmable. Much research has been done to try to exploit the much faster speed at which light travels compared to electronic signals, but most commercial work in this area has focused mainly on optical interfaces. These devices allow bre optic and related systems to communicate with traditional electronic systems. Strictly speaking, Enlight is a hybrid device, housing both electronic and optical circuits, but it is the optical processing that make it so fast. What is very important, it allows you to do a massive level of operations in parallel. The processing in the Enlight device is carried out using a process called vector matrix-multiplication, which allows calculations to be performed on 256 optical inputs. The beams from 256 lasers are added or

24

multiplied together when shone on a matrix device called a spatial light modulator. The outputs are then read by an array of light detectors.

2.2.5

Microsoft entering HPC arena

It is wrong to underestimate the inuence of Microsoft Corp. on market in various domains. HPC is one of the areas where Microsoft has no presence at all. Althought the situation is changing. Microsoft announced [36] it would make its entry into the HPC arena and promised to deliver Windows Server 2003 HPC for that market sometime during the second half of 2005. The upcoming version is expected to give various versions of Linux-based clusters a tougher run for their money. The new version will be targeted at applications involving parallel computing, engineering, and life sciences, all of which are markets where Linuxbased systems have made signicant inroads. Windows Server 2003 HPC will adhere to established industry standards including MPI for HPC, company ocials said.

2.3

Communication libraries

As discussed earlier, there are many ways to program multiprocessor computers. SMP computers can be programmed using multi-threading packages, or using message-passing. Distributed memory systems commonly use message-passing approaches of some sort, often having their own custom communication library at the heart. The one common feature between all modern multiprocessor machines is that they now support the MPI message-passing interface. While MPI may not be the optimal solution on some systems, it does provide the same functionality and therefore makes MPI codes truly portable across a wide range of platforms. The application programmer must still optimize for the general case in order to make the code run well. For example, mapping a simulation space to a 2D virtual mesh may work well since the 2D virtual mesh maps well onto a wide variety of physical topologies such as a 3D mesh, hypercube, fat tree, etc.

2.3.1

Message-passing libraries

The goal of the Message-Passing Interface (MPI) [47] simply stated is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, ecient, and exible standard for message passing. A complete list of goals follows. Design an application programming interface (not necessarily for compilers or a system implementation library). 25

Allow ecient communication: Avoid memory-to-memory copying and allow overlap of computation and communication and ooad to communication co-processor, where available. Allow for implementations that can be used in a heterogeneous environment. Allow convenient C and Fortran 77 bindings for the interface. Assume a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem. Dene an interface that is not too dierent from current practice, such as PVM, NX, Express, p4, etc., and provides extensions that allow greater exibility. Dene an interface that can be implemented on many vendors platforms, with no signicant changes in the underlying communication and system software. Semantics of the interface should be language independent. The interface should be designed to allow for thread-safety. Library MPICH LAM MPI MP_Lite MPI/Pro PVM TCGMSG Vendor ANL Notre Dame Ames Lab [44] MPI/Pro [45] ORNL TCG Description Full MPI for almost every architecture Full MPI for almost every architecture High performance, light weight MPI Commercial MPI Parallel Virtual Machine The Chemistry Group Message system

Table 2.3.1. Common message-passing libraries. The MPI denes the syntax and functionality of the library calls, and is dened by a committee. It is not a library itself. There are many MPI implementations that meet this standard. Supercomputer vendors usually have their own MPI library that is optimized for their particular machines. There are also several MPI implementations that are distributed freely and work on a variety of architectures, MPICH [46] and LAM [39] being the most commonly used. Below is a listing of the more common message-passing libraries (see Table 2.3.1). Included are two older libraries that are not used much anymore due to the popularity of MPI. PVM [40] and TCGMSG [52] are predecessors of MPI, and their time has passed even though some of their codes have not.

26

2.3.2

One-sided communications

Normal message-passing communications are two-sided, require the cooperation of both the sending and receiving nodes. There are also libraries that support one-sided communications, namely gets and puts. These operations allow one node to get data from, or put data to, another node without its cooperation. This approach certainly sounds superior to using two-sided communications, where it is necessary to coordinate each data transfer on both sides. And it denitely is useful at times. In general, it is not as easy to use as it would seem since it often requires that the application programmer manually perform handshaking that is done automatically with two-sided communications. For example, a node cant get data from another node until the source node signals that the data is ready for transfer, so there are two communication calls needed anyway. The MPI standard has dened a set of one-sided calls in the newest 2.0 release. At this time, not all of the MPI implementations support these calls, so MPI programs that use them may not be portable for a while. The SHMEM library developed by Cray is a one-sided library for the Cray T3E and SGI Origin systems. It has become so popular that it is growing into a standard and is being implemented on a variety of platforms. The GPSHMEM library is a general purpose SHMEM library that provides the same one-sided interface but is implemented on top of lower level libraries.

2.3.3

Higher level libraries

There are several higher level libraries built on top of these message-passing libraries that can make life easier for the application programmer. Unfortunately, there is always a trade-o, and in this case it comes in a loss in eciency. Additional research is needed to more fully understand the full cost of these methods. The Global Array toolkit makes any distributed memory system look somewhat like an SMP at the application level. A global address space allows data to be stored across the nodes without the application programmer having to worry about the data layout. The data is then stored and retrieved when needed by any node, independent of what node it is actually on. All of the messy message-passing is hidden from the programmer. The distributed data interface (DDI) of the quantum chemistry code GAMESS is another example of this approach. This layer provides the application with a global view of memory that is abstracted from the underlying message-passing layer.

27

2.4

Message-passing using MPI

Some of the basic MPI two-sided communication examples are presented here. This code is designed for MPICH [46] library, but should work with any MPI-compatible implementation as the interfaces are dened in MPI standard. The MPI standard denes a much larger set of functions than is presented here, but most are not needed unless you are writing very advanced programs [42, 43].

2.4.1

Initialization and clean up

A typical MPI run involves many processors, each running the same program but operating on a dierent part of the data. It is possible to use a master/slave approach as well, where there is one master node that divides out the workload and collects the results. While this approach sounds tempting, it is more dicult to use in general because you need to maintain two separate programs, and also launch the master and slaves separately. This approach is not recommended. The example below shows the C syntax for the initialization and cleanup functions needed in every MPI program. The MPI_Init() function simply initializes the MPI system. The MPI_Comm_size() function returns the number of nodes that the code is running on, as specied in the mpirun command. The MPI_Comm_rank() function returns the relative node number, between 0 and the number of nodes minus 1, assigned to each of the nodes. This rank is then used in the rest of the program to determine which data each node will work with, etc. #include "mpi.h" MPI_Init( MPI_COMM_WORLD, &argc, &argv ); MPI_Comm_size( MPI_COMM_WORLD, &nprocs ); MPI_Comm_rank( MPI_COMM_WORLD, &myproc ); MPI_COMM_WORLD is an MPI communicator, signifying that the command is taking place within the full set of nodes. Simply specify MPI_COMM_WORLD where needed and forget about using communicators unless you are dealing with multiple levels of parallelism. nprocs is an integer, the number of processors or nodes. myproc is an integer, the relative node number for each node. In C, ierror, providing a way to signal the programmer if an error has occurred, is the return value of the function itself.

2.4.2

Basic sends and receives

The basic 2-sided communication involves a send function on the source node and a receive function on the destination node. There are several types of send and receive functions, 28

each useful for dierent situations. Unfortunately, the MPI standard does not exactly dene how each must be implemented, so you can get varying results depending on the MPI implementation, the architecture you are running on, or the size of the data being transferred. This is clearly a bad situation, but we must make the best of it. To write a portable code, it is therefore necessary to program for the worst case for each function. MPI_Send(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD); MPI_Recv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &status); The syntax of the basic MPI_Send() and MPI_Recv() functions are shown above. buf points to the rst element of the data to be sent, or where the data to be received is to be put. In C, this is a pointer. count is the number of elements of type MPI_Datatype to be transferred to the destination dest from source. It is common to use MPI_COMM_WORLD and ignore ierror in most cases. MPI_Datatypes for C: MPI_DOUBLE MPI_FLOAT MPI_INT MPI_BYTES MPI_CHAR MPI_DOUBLE_PRECISION MPI_DOUBLE_COMPLEX MPI_REAL MPI_INTEGER MPI_CHARACTER The MPI_Send() function sends a block of data to the destination node dest. The block starts at the address of buf and with the size calculated from count and the size of the MPI_Datatype specied, which may vary between machines. This function will block at least until the data has been sent or copied to a send buer, so that the data can not be trampled before it is sent. Depending on the implementation, and possibly the message size, the MPI_Send() may also block until the MPI_Recv() process has started on the destination node. This is an unfortunate situation, since it will force synchronization between the two nodes in some circumstances and not in others. For complete portability, you should therefore program for the worst case and assume that synchronization may occur. The standard MPI_Send() function may choose whether to use a send buer for performance reasons, or to avoid blocking. Most of the time, for small messages, this is done automatically. However, for large messages the buers may not be large enough, 29

and the implementation may choke on them. In this case, you will need to manually allocate a larger buer and use buered versions of the send and receive functions. If this need arises, look at the manpages for MPI_Bsend() and MPI_Brecv() in places like the MPICH documentations. An MPI_Recv() function will block until a message is received with a matching size, message tag, and from the specied source. A source of MPI_ANYSOURCE can be used to accept a message from any source that has the correct size and message tag. A message tag of MPI_ANYTAG is a wildcard that matches any tag. The message tag provides a secondary method for choosing between messages that come from the same source, allowing out-of-order reception if desired. While this may be convenient, out-of-order reception requires extra buering which is inecient, so this method should be used only with not time-critical tasks. Using the MPI_ANYSOURCE wildcard may also be convenient at times, but it can also lead to ineciency. In time-critical areas, it is highly recommended to always specify the the source and destination, and not to rely on message tags (set them to 0). This provides the message-passing system with all the information needed to streamline the transfer as much as possible.

2.4.3

Asynchronous communications

It is at times benecial to use asynchronous communications that initiate the send or receive but do not block on their completion. This provides greater exibility, and can produce more ecient code. As shown below, asynchronous communications have the same syntax as their blocking counterparts, except that their name includes an i which stands for immediate, and they each have an associated message id that is used in conjunction with the MPI_Wait() function to force a block on completion. MPI_Isend(&buf, count, MPI_Datatype, dest, tag, MPI_COMM_WORLD, &msg_id); MPI_Irecv(&buf, count, MPI_Datatype, source, tag, MPI_COMM_WORLD, &msg_id); MPI_Wait( &msg_id, &status); Asynchronous communications have two primary purposes, both geared toward achieving higher performance. They can be used to try to overlap communications with computations when the system has this capability. For example, a source node would initiate an MPI_Isend() as soon as the data was available, then continue performing computations while letting a message co-processor or NIC handle the data transfer hopefully without aecting the main processor greatly. The source node would block with the MPI_Wait() only when it needed to reuse the memory space being transferred, giving the communication hardware the best chance of transferring the data before the block occurs. The same 30

is true for the destination node, which can post an MPI_Irecv() as soon as possible, giving the communication hardware the best chance to handle the communications behind the scene before the destination node blocks with an MPI_Wait(). Asynchronous communications are also valuable for bypassing communication buers, and therefore increasing the eective throughput. If a receive is posted before the matching send is initiated, the data will not be buered on the source node and can ow directly into the application memory on the destination node. In order to guarantee that the receive is preposted, some sort of local synchronization is needed. This can be accomplished with a simple handshaking as shown below, where the destination node preposts the receive then sends a dummy token to the source node to signal that it is ready to receive the data. This essentially doubles the eective latency, but for large messages the latency will not be important while bypassing the send buer can improve the eective throughput enormously and also cut down the memory usage. node 0 (source node) INT token, n=1000; DOUBLE array[1000]; /* Block on the destinations go signal to avoid buffering */ MPI_Recv( &token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, NULL); MPI_Send( array, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); node 1 (destination node) INT token, n=1000, msg_id; MPI_STATUS_TYPE status; DOUBLE array[1000]; /* Prepost the rec. and send the go sig. to the source node */ MPI_Irecv( array, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &msg_id); MPI_Send( &token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); /* ... perform computations ... */ MPI_Wait( &msg_id, &status);

2.4.4

Performance and portability

While MPI has many variants of the send and receive functions, the methods demonstrated above provide both the best performance and portability. Basic MPI_Send() and MPI_Recv() functions should be used when blocking is necessary, and the messages are short. Otherwise, the basic send may behave dierently depending on the implementation and message size. For long messages, preposting receives as demonstrated above guarantees the minimum amount of buering. Using asynchronous communications also allows communications to occur while the computations continue on most systems. Both of these are very important

31

for achieving the maximum throughput for a long message. The only drawback is that the message latency is doubled, but for long messages this is irrelevant. Using just a few of the commands, along with good programming practices, provides the best performance in a portable manner. More complex functionality is described in full documentations for MPI [47].

2.4.5

Global functions

There are also many useful global communications that add to the point-to-point communications above. These functions operate across all nodes, and most result in a rough synchronization upon completion.

MPI_Bcast(&array, count, MPI_Datatype, root_node, MPI_COMM_WORLD); MPI_Allreduce( &array, result, count, MPI_Datatype, MPI_Op, MPI_COMM_WORLD); MPI_Barrier(MPI_COMM_WORLD); This table illustrates some of the most common functions. The MPI_Bcast() function broadcasts data from one node to all others. One common use for this is for global I/O functions, where node 0 would open a le, read in the data, and broadcast it to all other nodes. The MPI_Allreduce() function performs a global reduction of the data across the nodes. Typical uses are to do a summation, maximum, or minimum of a variable or each element of an array. These correspond to the MPI_SUM, MPI_MAX, and MPI_MIN MPI_Op operator types. Memory overow problems may arise when operating on very large arrays. Since these functions use binary exchanges to process the data, a node that comes in late may have log2(nprocs) copies of other arrays waiting in receive buers. Good implementations will avoid this by forcing synchronization at the beginning of the operation, or by pipelining and synchronizing the operations throughout. If you run into any memory problems like this, try putting an MPI_Barrier() function before each large global operation to force synchronization.

2.4.6

Timing

All codes should have timing functions embedded. These will help the user to understand where the time is being spent, pointing out where to concentrate optimization eorts, and ultimately lead to more ecient use of computer time. Below are the basic MPI timing routines. These return an absolute time in seconds since the program started. Only the relative time should be used, so bracket the area 32

of code of interest and take the dierence in time as illustrated below. After summing the time contributions of all compute and communication intensive sections of the code, simply dump an analysis of the timing results at the end of each run. DOUBLE PRECISION t0, t_code t_code = 0.0d0 t0 = MPI_Wtime() t_code = t_code + ( t0 - MPI_Wtime() ) double t0, t_code=0, MPI_Wtime(); t0 = MPI_Wtime(); t_code += ( t0 - MPI_Wtime() );

2.4.7

Running MPI Codes with MPICH

If you MPI implementation has been set up correctly, then compiling and running your code should be fairly easy. Unfortunately, this is not always the case. The MPI standard does not specify a default location, library name, or subdirectory structure. Vendor-specic MPI implementations will often have the MPI libraries and include les in standard places where the compilers will automatically nd them. In this case, you do not need to manually specify a path to the MPI library or include les at compilation time. Portable implementation such as MPICH may have their own compiler wrappers set up to automatically link the correct MPI libraries. Simply use the mpicc instead of the native cc (gcc). For complete documentation on the mpirun command, do a man mpirun. Its usage will vary depending on the type of parallel system you are running on. Below is a example for a Unix cluster running MPICH. Unix cluster running MPICH mpicc test.c -o test mpirun -np 4 test

2.4.8

Parallelization Strategy

Below is the listing of general steps that need to be performed to change scalar, consecutive code into a parallel program. 1. Compile and run the scalar code (a) Set up small and large test cases (b) keep a copy of the scalar version (handy for comparison) 2. Add initialization, clean up, and timing routines 33

(a) Time the full code, and the sections where most CPU time is spent (proling) (b) Optimize the scalar version 3. Track the memory utilization and major I/O (a) Clean up memory usage if possible (b) Get rid of I/O if possible 4. Determine a parallelization strategy (a) Refer to the literature available to nd similar examples (b) Consult an expert, discuss memory and CPU utilization, I/O (c) Calculate the transmission times and compare to computational times (model the proposed parallelization strategy) 5. Add in communication commands (a) Add parallel I/O, sends and receives, and global operations (b) Compile and check the single processor results (c) Try on 2 nodes, then more 6. Tune the performance (a) Analyze the performance using the timing dumps for various sized systems (b) Try asynchronous communications to bypass buers and overlap communications and computations (c) Map the application to a virtual network topology that can in turn be mapped to most real topologies (2D mesh) (d) Try other parallelization strategies if necessary (e) Test the code on other parallel computers In most cases the parallel code should run as well on one processor as the scalar code. This allows you to maintain a single code that will run on both scalar and multiprocessor systems.

34

2.5

Beowulf class cluster computers

In 1994 Thomas Sterling and Don Becker built a cluster computer consisting of 16 DX4 processors connected by channel bonded Ethernet. They called their machine Beowulf. The machine was an instant success and their idea of providing COTS1 base systems to satisfy specic computational requirements quickly spread through NASA and into the academic and research communities. The development eort for this rst machine quickly grew into a what we now call the Beowulf Project [41]. Some of the major accomplishment of the Beowulf Project will be chronicled below, but a non-technical measure of success is the observation that researcher within the High Performance Computer community are now referring to such machines as "Beowulf Class Cluster Computers". That is, Beowulf clusters are now recognized as genre within the HPC community. In the last ten years we have experienced a number of events that have brought together all the pieces for the genesis of the Beowulf project. The creation of the causal computing market (oce automation, home computing, games and entertainment) now provides system designer with a new type of cost eective components. The COTS industry now provides fully assembled subsystems (microprocessors, motherboards, disks and network interface cards). Mass market competition has driven the prices down and reliability up for these subsystems. The development of publicly available software, in particular, the Linux operating system [48], the GNU compilers and programming tools and the MPI and PVM message passing libraries (see Section 2.3.1), provide hardware independent software. A second aspect to this history of working with parallel platforms is an increased reliance on computational science and therefore an increased need for high performance computing. One could argue that the combination of the these conditions: hardware, software, experience and expectation, provided the environment that makes the development of Beowulf clusters seem like a natural evolutionary event. This often lead to dead ends with respect to software development. The cost eectiveness and Linux support for high performance networks for PC class machines has enabled the construction of balanced systems built entirely of COTS technology which has made generic architectures and programming model practical. An important characteristic of Beowulf clusters is that these sorts of changes processors type and speed, network technology, relative costs of components do not change the programming model. Therefore, users of these systems can expect to enjoy more forward compatibility then we have experienced in the past. Another key component to forward compatibility is the system software used on Beowulf. With the maturity and robustness of Linux, GNU software and the "standardization" of message passing via PVM and MPI, programmers now have a guarantee that the
1

COTS - Commodity O The Shelf

35

programs they write will run on future Beowulf clusters regardless of who makes the processors or the networks. Beyond the seasoned parallel programmer, Beowulf clusters have been built and used by programmer with little or no parallel programming experience. In fact, Beowulf clusters provide universities, often with limited resources, an excellent platform to teach parallel programming courses and provide cost eective computing to their computational scientists as well. In the taxomony of parallel computers, Beowulf clusters fall somewhere between MPP (see 2.1.4) and NOWs (see A.1.1). The Beowulf project benets from developments in both these classes of architecture. MPPs are typically larger and have a lower latency interconnect network than an Beowulf cluster. Programmers are still required to worry about locality, load balancing, granularity, and communication overheads in order to obtain the best performance. Even on shared memory machines, many programmers develop their programs in a message passing style. Programs that do not require ne-grain computation and communication can usually be ported and run eectively on Beowulf clusters. The future of the Beowulf project will be determined collectively by the individual organizations contributing to the Beowulf project and by the future of mass-market COTS. As microprocessor technology continues to evolve and higher speed networks become cost eective and as more application developers move to parallel platforms, the Beowulf project will evolve to ll its niche.

2.6

Creating a Beowulf cluster powered by MPI

As the project requires a parallel architecture to be involved in model calculation, the question of choosing the right platform arose. It was demonstrated in previous Sections (2.5, 2.2.2, 2.4) that a Beowulf class cluster running Linux OS with MPI library at comower munication layer would be a fair optimal choice for this project in terms of PCost and time required to build a sample of a parallel system. The next sections describe the process of building a educational mini-cluster using o-shelf components and OpenSource software.

2.6.1

Power consumption

During the past year there have been many developments in the supercomputing eld. Although machines continue to get faster at a rapid clip, the problem of heat dissipation is becoming acute. Already, companies like Google are using heat dissipation as a criteria for buying computers, since it is becoming increasingly dicult to cool server farms with tens of thousands of processors. As a result, supercomputer makers may soon face two choices: 36

either use high-performance, hot running chips like the Itanium 2 and install elaborate and costly cooling systems, or use low-powered processors and sacrice performance. Institutions and corporations which choose the former approach may be forced to spend as much money on the computing infrastructure as on the computers themselves. Moreover, the electricity bill for conventional supercomputing centers and compute farms/ranches is no longer negligible. For these reasons, computing installations based on low-powered CPUs, such as Intels Pentium M or VIA, may become an increasingly attractive option.

2.6.2

VIA EPIA-V C3

The VIA EPIA-V series Mini-ITX mainboard [31] (see Figure A.2.1) is a member of the Mini-ITX family. It is a small (170x170 mm2 ), highly integrated x86 platform. Both VIA Eden ESP processor core for fanless systems and VIA C3 E-Series processor are supported. The EPIA-V Mini-ITX mainboard also features embedded onboard I/O, Fast Ethernet port and graphics. The detailed description of VIA EPIA hardware platform follows in Appendix (see Figures A.2.2 and A.2.1). This platform was chosen as a base component for low-cost, low-energy cluster. The peak power consumption of 40W allows to feed the 4-node cluster from a single standard 350W power supply unit. The performance of VIA-C3 800MHz processors is very low comparing to Intel Pentium M, for instance. It is caused by low L2 cache capacity (64Kb comparing to 1024Kb in P-M), the absence of SSE and SSE-2 instruction sets and low-ecient FP (oating point) module. Nevertheless these C3 processors were found suitable for the projects needs as it is the concept not the power that is expected from this cluster. In fact there was recently announced about another one Mini-ITX cluster installation [32].

2.6.3

Desktop mini-cluster

You may nd the term desktop cluster confusing, as clusters usually occupy a room or a building. But this time the talk is about a prototype cluster, which has some properties of huge number-crunching systems. The next sections describe the process of construction of a massively parallel Mini-ITX cluster using 4 x VIA-C3 800 Mhz EPIA-V nodes. The machine runs Linux 2.6, and MPICH 1.2.5.2. After working with the machine and running some basic tests (see Sections 2.7.1, 2.7.2), it was found nearly equivalent to on scalar Intel Pentium-M computer, achieving computational power of c.a. 200 MFlops. With the exception of the metalwork, power and network wiring, and power/reset switching, everything is o the shelf. 37

The 4 nodes used were VIA EPIA V8000, 800 MHz motherboards. Troubles installing and conguring Linux and MPICH were few. In fact, there were no major issues with either Linux or MPICH.

2.6.4

Hardware conguration

The construction is simple and inexpensive. The whole cluster is mounted in a 0, 5 m lightweight rack structure and is easily portable. The rack (see the drawing in Figure A.3.1) is assembled of two identical frames (left and right). Each side is assembled of six aluminium angle stock. The total length of angle-bar used is 4, 2 m. Aluminium is a compliant material, pleasant to work with. The angle stock was cut by a hand saw and holed by a drill. The whole construction is fastened by 6 mm and 5 mm screws and screw nuts. The motherboards are stacked using bronze standos and then mounted on aluminum frame. Diagonal stieners is also fabricated from aluminum angle stock to reduce exing of the rack assembly. The handles are fabricated from aluminium pipes to facilitate portability. The power supply unit is slighly upgraded to be four-headed. Three more sets of cabling are sealed in. The peak power consumption of all 4 nodes is below 150 W . The whole cluster can be fed with DC current from a single 350 W power supply unit. Figure 2.6.1. Cluster rack with nodes. The rack is designed to hold: 1 power supply unit, a D-link DES-1008D Fast Ethernet

switch, 4 EPIA-V mainboards, 1 HDD mobile rack, 1 control plate and the cables. For drawings of assembled cluster see Figures 2.6.1 and A.3.2. The control plate holds 4 power led indicators, main power switch and 4 power/reset switches. The network patch cords are crimped four twisted pair (8 conductor) cat 5 cables. The liberal use of nylon cable ties helps to keep an order in cabling. The full-sized pictures of the mini-cluster assembled are in Figures A.4.1, A.4.2 and A.4.3.

38

The clusters master node is a Compaq Evo N620c (Pentium-M) laptop, running Linux 2.6. See the picture of the mini-cluster with the laptop together A.4.4 to imagine the actual dimensions of the assembly.

2.6.5

Software conguration

From the point of view of software engineer (systems administrator) the cluster consists of a controlling node, with a large capacity hard drive, and four computational nodes, sharing another large capacity hard drive over the network. The nodes are diskless and are booted over the net. Every EPIA node has an Intel PXE network loader which alows the node to send the requests for ip conguration and the kernel image lename. The diskless Linux load is best described in [50]. The software which performs the parallelization (MPI) is installed on the controlling node, and the computational nodes mount a shared directory on the controlling node via NFS2 . Communications between the nodes is established via ssh by MPI, and shared les are found via the mounted NFS le system. The networking is Fast Ethernet and makes use of a Fast Ethernet switch switch. Gigabit Ethernet is faster for large I/O but has the same latency. Thus 100 Mbit Ethernet is quite adequate for the tasks of this project. The version of MPI used is mpich-1.2.5.2. The Operating system for the controlling node and all the computational nodes is ALT Linux Sisyphus [49]. The cluster needs a high degree of connectivity and has rather poor security. The cluster should reside behind the controlling node, which should be separated from the Internet by a rewall (see Figure 2.6.2). Figure 2.6.2. Cluster network protecThe whole conguration routine should be tion. performed on controlling node, since other nodes are already factory congured to boot from network. First of all, a special Linux kernel [48] is build to support VIA C-3 processor, VT6102 Ethernet controller, VT8601 Appolo chipset and Trident VGA controller. The attention should be payed to IP onboot conguration options, NFS support and root lesystem by NFS support and Reiserfs support. The kernel should be built monolithic no critical parts built as modules. The DHCP3 daemon is congured the following way
2 3
Controlling Node F-Ethernet Switch Firewall WAN Router Internet

Computational Nodes

Network File System Dynamic Host Conguration Protocol

39

ddns-update-style none; use-host-decl-names on; subnet 192.168.1.0 netmask 255.255.255.0 { option routers 192.168.1.112; option time-offset option ntp-servers 2; # Eastern Standard Time 192.168.1.112;

group { next-server 192.168.1.112; default-lease-time -1; filename "pxelinux.0"; host epia1 { hardware ethernet 00:40:63:ca:6a:09; fixed-address 192.168.1.101; } ... } } This DHCP conguration allows to congure nodes and point them to a linux kernel (pxelinux.0). The kernel is served using TFTP4 daemon which is congured via xinetd superdaemon. A small syslinux kernel is loaded to initialize x86 high memory areas. Then a real prepared Linux kernel can be fetched by TFTP and loaded. The kernel autocongures IP and tryes to mount root lesystem via NFS. The following entries should present in /etc/exports:

/nfsroot/epia1 epia1(async,rw,no_wdelay,no_root_squash,no_all_squash) ... /opt /usr /home *.neksa.net(async,ro,no_root_squash,no_all_squash) *.neksa.net(async,ro,no_root_squash,no_all_squash) *.neksa.net(async,rw,no_wdelay,no_root_squash,no_all_squash

Every node has its own root lesystem with dierent /etc/fstab, /etc/HOSTNAME and /etc/sycong/network les. Edit /etc/hosts to include the hostnames and ips of the controlling node, computational nodes, and any external computers which need to access the controlling node. The SSH5 daemon should be congured to start on node bootup. The public keys of nodes should be registered in /etc/sshd_cong to allow ssh login without password prompt.
4 5

Trivial File Transfer Protocol Secure SHell

40

The sequence diagram is shown in Figure A.3.3. After checking access to the nodes, MPICH should be congured. The nodes are registered in /usr/local/mpi/share/machines.linux. Initial tests are executed /usr/locall/mpi/sbin/tstmachines -v. It is also very useful to congure console redirection to serial port. The syslog daemon shoul be congured to send all the messages to the controlling node. The minicom program can be used to debug kernel errors during bootup. Syslog is always useful, even to debug the parallel software.

2.7
2.7.1

Benchmark tests
The algorithm of Bef f

The eective bandwidth Bef f measures the accumulated bandwidth of the communication network of parallel and/or distributed computing systems. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that short and long messages are transferred with dierent bandwidth values in real applications. Denition of the eective bandwidth Bef f in Equation 2.7.1.
maxmthd (maxrep (b(ring.pat., L, mthd, rep))) 21 L maxmthd (maxrep (b(rand.pat., L, mthd, rep))) ) 21
L

Bef f

= logavg(logavgring pat logavgrand pat

(2.7.1)

with
b(pat, L, mthd, rep) =

L(tpatmesg)looplength maxproctime

where tpatmesg is total number of messages of a pattern "pat" and maxproctime is maximum time on each process for executing the communication pattern looplength times. Each measurement is repeated 3 times (rep = 1..3). The maximum bandwidth of all repetitions is used (see maxmthd in the formula above). Each pattern is programmed with three methods. The maximum bandwidth of all methods is used (maxmthd). The measurement is done for dierent sizes of a message: The message length L has the following 21 values: L = 1B, 2B, 4B, ...2kB, 4kB, 4kB a1 , 4kB a2 , ...4kBa8 with 4kBa8 = Lmax and Lmax =
mperproc 128

The looplength is reduced dynamically to achieve a execution time for each loop between 2.5 and 5msec. The looplength for the rst iteration is calculated with some PRE_MSG_LOOPS. The minimum looplength is 1. The average of the bandwidth of all messages sizes is computed
L(...) . 21

A set of ring patterns and random patterns is 41

used (see details section below). The average for all ring patterns and the average of all random patterns is computed on the logarithmic scale (logavgring patterns and logavgrandom patterns). Finally the eective bandwidth is the logarithmic average of these two values (logavg(logavgring patterns, logavgrandom patterns). The eective bandwidth is Bef f = 17.268M Byte/s on 5 processes ( =3.454 MByte/s * 5 processes) Latency: 62.753 microsec Lmax : 1 MB Bef f at Lmax : 18.250 MByte/s on 5 processes (3.650 MByte/s * 5 processes) Bef f at Lmax (ring pattern): 18.115 MByte/s on 5 processes ( : 3.623 MByte/s * 5 processes) Latency ring pattern: 12.702 microsec Ping-pong latency: 71.431 microsec Ping-pong bandwidth at Lmax : 10.760 MByte/s at Lmax = 1.0 MB M Byte/s = 1e6Byte/s, M B = 220 Byte System parameters : 5 nodes, 128 MB/node
Number of processors Bef f , MByte/s Lmax Bef f Lmax rings random MByte/s total per process 5 17 3 1MB 18 4 18 4 & at Bef f Lmax rings only MByte/s at Latency rings & random microsec 62.754 63.510 Latency rings only microsec Latency pingpong microsec 71.431 pingpong bandwidth, MByte/s 11

Table 2.7.1. Bef f result table. Bef f = 17.268M B/s = 3.454 5P Es with 128M B/P E

2.7.2

NAS parallel benchmark

The NAS Parallel Benchmark (NPB) [51] is a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational uid dynamics (CFD) applications, consist of ve kernels and three pseudoapplications. NPB 2.3 is MPI-based source-code implementations written and distributed by NAS. They are intended to run with little or no tuning, and approximate the performance a typical user can expect to obtain for a portable parallel program. The LU benchmark is based on the NX reference implementation from 1991. This code requires a powerof-two number of processors. A 2-D partitioning of the grid onto processors occurs by halving the grid repeatedly in the rst two dimensions, alternately x and then y, until all 42

powero - two processors are assigned, resulting in vertical pencil-like grid partitions on the individual processors. This ordering of point based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner on a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. This constitutes a diagonal pipelining method and is called a wavefront method. It results in relatively large number of small communications of 5 words each. A NAS benchmark that we chose to present here is LU. For the LU benchmark, the sizes were class A, B, S and W. The execution time of LU, compilied with standard optimization is shown in Table A.5.1 and Figure A.5.1. Then LU was recompiled with -O3 -mmmx -m3dnow -march=i586 optimization. Optimized benchmarks are presented in Table A.5.2 and Figure A.5.2. The optimization gain is shown in Figure A.5.3. As a measure of scalability, we selected parallel speedup, as classically calculated. The serial time was obtained by running the benchmarks on one processor. The scalability of LU benchmark is reported in Table A.5.3 and Figure A.5.4. The controlling(master) node, which has a dierent architecture showed performance, presented in Figure A.5.5.

43

Chapter 3 Modeling Pseudooptical Neural Networks


3.1 Pseudooptical neural networks

Pseudooptical neural network (PNN) is a class of neural networks, introduced by Kuznetsov in [8], with a dierent model of articial neuron interfering neuron. Potentials of these neurons change as a result of interference of periodical signals. Modeling of holographic and other optical phenomena is possible in networks of this kind. Interfering neuron notion is a basic notion of the below described method. Its description is presented in [9, 7]. An interfering neuron N has mN inputs, qN outputs and is characterized by three real parameters: threshold PN , input intensity IN and potential UN (t). The potential depends on time t and does not exceed the threshold PN . The neuron can be either in passive or in active state. The neuron receives input signals when in active state, otherwise it generates an output signal. The neurons are interconnected by one-way bres. These bres have the following characteristics: length d and signal speed v. The signal Si of i duration is a function Si (t) = Ii si (t) dened on i wave length interval, were si (t) - is a periodical functions with frequency i , Ii is constant signal intensity. The type of si function is not signicant. It can be si (t) = sin 2t for example. All the input signals have the same frequency - i . The form of signal Si is v completely dened by ternary (Ii , i , i ). The wave length i = i . The phase dierence is calculated by: ij = 2(t1j t1i ) 44 (3.1.1)

where t1i and tj - are the moments of appearance of signals Si and Sj on dierent inputs. The integral input intensity at the point t is calculated by formulae: I(t) =
im jm

Ii (t)Ij (t) cos ij

(3.1.2)

When a neuron is in passive state and receives integral input intensity I on interval [t, t ], then 1. if UN (t) + I(t t) < PN , then UN (t ) = UN (t) + I(t t); (3.1.3)

2. if UN (t) + I(t t) PN , then neuron becomes active and generates signal IN SN = , , N , qN on every qN output, where N = PN IN (3.1.4)

Then, after N time, the neuron becomes passive, with UN = 0. Interference Theorem If m signals were received on input of neuron N on interval [t, t ] and potential UN did not exceed the threshold then:

UN (t ) = UN (t) + 2

Ii Ij cos ij ij +
i,jm i=1

Ii i

(3.1.5)

where i - signal duration on input i, ij - duration of simultaneous signal existence on inputs i and j, and summation is performed with all unordered couples (i, j).

3.2
3.2.1

Complete rectilinear model


Model description

Complete rectilinear model was proposed in [9,7]. This model contains four neural layers: source layer A; 45

image layer B (with neural potentials corresponding to the brightness of the points of the source image); hologram layer C, storing potential distribution as a holographic record of image B reconstructed image D Every layer is a set of neurons with the same parameters, distributed evenly upon the surface of the layer. The neurons of the same layer are not connected with each other. Outputs of A are connected with inputs of B. Outputs of A and B together with inputs of C. Outputs of C with inputs of D. The schema of the network is presented in Figure 3.2.1.
n A -k A -1 A A
1

Ak
0

Layer A
R
AB

Layer B

B0 C
ji

Bj

BC

Layer C

Ci d

il

R D
l

CD

Layer D

Figure 3.2.1. Pseudo-optical network. Models parameters are listed below. 1. All the layers are rectilinear coplanar segments. The distances between them are: rAC = rAB + rBC , rCD . The bers, connecting the layers are rectilinear and will be called beams later. The thresholds of the neurons are: PA , PB , PC . The intensities are: IA , IB , IC . 2. The numbers of neurons on the layers are: nC = nD = n. The rst neuron has the number 0, the last (n 1). All the distances are calculated in wavelengths. The number of outputs of neurons of B and C are qB = qC = n. 3. The A layers will generate a plain wave. The connections between A and B are build on the assumption that 46

(a) every Bi point is connected with n points of B. The assemblage of connecting beams is called Bi pencil; (b) Bi pencil geometry is constant; the structure is presented in Figure 3.2.1; This means that A layer has to contain 2n neurons. The same connection exists between A and C. We assume that qA = 2n. 4. The starting potentials are 0 for each neurons of C and D. The level of brightness of image dots corresponds to the level of B potentials. The potentials of neurons Ai , Bj , Ck , Dl are UAi , UBj , UCk , UDl correspondingly. 5. The output signals are SAi , SBj , SCk , SDl . 6. The distances aij , bjk , ckl are calculated by Pythagorean theorem: for instance b0k =
2 rBC + k 2 e2 . We assume that bik = b0,|ik| , bk = b0k and bij = dij ( i, j).

3.2.2

Hologram construction and calculation

For the start, let us formulate a corollary of theorem 3.1.5: If all the input intensities are equal and the intervals are equal to then

UN (t ) = UN (t) + I demonstrated in [9].

im jm

ij cos ij , 2

(3.2.1)

where the summation is performed i, j : |ij | 2 . The corollary is

During hologram construction and calculation the following processes take place: a plane wave from A goes to C and B; the potentials neurons of B are charged up to the threshold, then generate signals SBj going from B to C; in every Cj dot interference of signals from all neurons of A and B takes place; hologram potential distribution is calculated by formula 3.1.5 which can be split into a sum of three interferences: between SAi and SAj , between SBi and SBj and between SAi and SBj ; the rst two can be calculated using formula 3.2.1; Let us introduce the table of symbols: Aijk phase dierence between SAi and SAj in Bk ; 47

Aijk phase dierence between SAi and SAj in Ck ; Bijk phase dierence between SBi and SBj in Ck ; ABijk phase dierence between SAi and SBj in Ck ; tM Bj the point of Bj full-charge; t1Ai the time of arrival of SAi to Ck ; t1Bi the time of arrival of SBi to Ck .

Taking in consideration the above mentioned assumptions we can calculate UCk potential by

UCk =

|Aijk | IA n1 n1 A cos Aijk + qA i0 j0 2 |Bijk | IB n1 n1 B cos Bijk + qB i0 j0 2 IA IB qA qB


n1 n1

(3.2.2)

ABijk cos Bijk


i0 j0 PA IA

Taking in consideration, that A = UCk = where

and B =

PB IB

we can reduce 3.2.2 to: (3.2.3)

IB IA Sum1 + Sum2 + qA qB

IA IB Sum3, qA qB

n1 n1

Sum1 =
i0 j0 n1 n1

PA |Aijk cos Aijk IA 2 PB |Aijk | cos Bijk IB 2 TABijk cos Bijk

(3.2.4)

Sum2 =
i0 j0

(3.2.5)

n1 n1

Sum3 =
i0 j0

(3.2.6)

TABijk = ABijk To calculate phase dierences we can use formulas from [9]: Aijk = 2(cjk cik ), 48 (3.2.7)

Bijk = 2((tM Bj tM Bi ) + bjk bik ),

(3.2.8)

ABijk = 2(tM Bj + bjk cik ),

(3.2.9)

To calculate tM Bl point of Bl saturation up to the threshold we use fan algorithm, described in [9]. tM Bl = PB UBl (tBk ) qA VBk IA (3.2.10)

where tBk and UBl (tBk ) are calculated recursively by formula 3.2.11 until UBl reaches PB threshold. UBl (tB,k+1 ) = UBl (tBk ) + VBk is calculated recursively by formula
k

IA VBk (tB,k1 tBk ) qA

(3.2.11)

VB,k+1 = VBk + 4 2
i=1

cos Ai,k+1 + cos A0,k+1 + 1

(3.2.12)

TABikj = ABijk is also by an algorithm, described in [9]: 1. if ABijk < 2A , then TABijk = min 2. if ABijk > 2A , then TABijk = min PB ABijk PA , . IB 2 IA (3.2.14) PA ABijk PB , . IA 2 IB (3.2.13)

The hologram, that is neural potential distribution on layer C is calculated by formula 3.2.3, where Sum1, Sum2, Sum3 are calculated by formulas 3.2.4, 3.2.5, 3.2.6; phase dierence Aijk ,Bijk ,ABijk are calculated by formulas 3.2.7, 3.2.8, 3.2.9; tM Bl using fan algorithm, which uses formulas 3.2.11 and 3.2.10; TABijk by formulas 3.2.13 and 3.2.14.

3.2.3

Image reconstruction calculations

Let us supplement the table of symbols: Cijl phase dierence between SCi and SCj in Dl ; tM Ci the point of Ci full-charge up to the threshold; 49

t1Ci the time of arrival of SCi to Dl . The process of reconstruction in layer D is described in [9]. Layer A illumiates layer C. As a result Ci charge up to the threshold at the tM Ci point, depending on potential UCi , which holds a dot of hologram, recorded on the previous step (Section 3.2.2). On activation Ci generates output signal SCi . These signals are input signals for Dl and they interfere with each other. The point tM Ci is calculate analogously to fan algorithm, desctibed in Section 3.2.2. The dierence is in parameters. The time of arrival of SA to Ci is tC0 . tC0 = c1 ck c0 , tC1 = , . . . , tCk = , v v v (3.2.15)

potential UCi (tC, k+1 ) is calculated by formula, similar to formula 3.2.11 UCi (tC, k+1 ) = UCi (0) + IA (VC0 (tCl tC0 ) + . . . + VCk (tC, k+1 tCk )), qA (3.2.16)

where UCi (0) is calculated by 3.2.2, VCi are calculated by formula 3.2.12, where Aij is replaced by Aij :
k

VB,k+1 = VBk + 4
i=1

cos Ai,k+1 + cos A0,k+1 + 1

(3.2.17)

The fan algorithm transforms into following: 1. saturating UCi , by consecutively calculating UCi (tCk ), until at some point k it reaches the threshold PC . 2. tM Ci is calculated using the k found: tM Ci = PC UC (tCk ) qA + tCk . Vk IA

The passage time il of signal from A through Ci until Dl il = tM Ci + dil , v

the phase dierence ijl = 2(jl il ) = 2(dil dil + tM Cj tM Ci ). Potential distribution of layer D can be calculated by 3.2.1 where I= IC PC , = . qC IC

50

As a result UDl =

n1 n1 i0 j0

PC |ijl | cos ijl , IC 2

where i, j : |ijl | 2 .

3.3

Creating a simulator

The simulators designation is to interactively calculate the model, described in Section 3.2. Interactively means reading models parameters from user input, calculating the model and displaying the results. The simulator is to be created rst in scalar, consecutive way. Then, after testing the validity of algorithms and their implementation, the simulator should be parallelized, using parallelization strategy, described in Section 2.4.8.

3.3.1

Consecutive implementation

The software should run in Linux, Solaris and preferably in Windows. This means that it is required to use crossplatform libraries and compilers. The library chosen for GUI implementation is QT [54]. It is a set of crossplatform C++ classes and tools of a very high quality. But the built-in classes for data visualization and control does not support high granularity double value scientic charts. QT was extended by Qwt Library [55]. KDevelop was chosen as a development environment. A screenshot of KDevelop workplace is presented in FigureA.6.1. GUI design is assisted by use of QT Designer. A screenshot of QT Designer in action is in Figure A.6.2. Model calculation is coded in C++ as a single class using algorithms described in Sections 3.2.2 and 3.2.3. This class is accessed from GUI classes to calculate the model. Model initialization, parameter check and calculation methods are exported public. The screenshot of simulator in action is in Figure A.7.1.

3.3.2

Parallel implementation

The simulator code was parallelized to use MPI (see Section 2.4.8 for parallelization strategy). The model is easy to parallelize, because it makes an extensive use of summation. The neurons on layers C and D are split into groups so that every node gets an equal number of neurons to process. It is possible because neurons of the same layer do not interact.

51

Figure 3.3.1. Parallel model. Single data block processing. Every node receives parameters, starts calculating the model and then returns its portion of calculated neurons. Then 0 node reduces the results of calculation to obtain an integral calculated layer. This is true for both hologram calculation and image reconstruction. The parallel software can log its progress to SLOG MPI facility. Then these logs can be vizualized using the tools shipped in MPICH distribution. An example of log transcript, displaying the process of parallel calculation is presented in Figure 3.3.1.

3.4

Studying model parameters

The optimal parameters must provide the highest quality of image reconstruction. Finding the parameters is a complex task which lays outside the frames of this paper. Although, some research was performed and the results of the experiments are listed below in Table 3.4.1.
Ex. 1 n 25 e 20 f req 585 RAB 1515 RBC 9394 IA 683 PA 75758 IB 661 PB 4849 IC 1 PC 10000 Comments rec. some of

the single dots 2 25 2 146 9092 17 5 1000 5 10000 5 10000 image inversion registered 3 67 100 4606 99939 546 877 8273 5 76151 1 44636 strong diraction eect 4 98 2.212123 0.1 145 896 1 10 1 30 425 45488 noise some tions with correla-

Table 3.4.1. A sample set of model experiments

52

The interactive simulator allows to notice the impact of each parameter on the model. Parameters were changed one by one with dierent granularity. The results of these experiments are presented in Table 3.4.2.
Par. PA IA PC IC f req e n Gran. 103 102 100 100 102 107 100 Eect on the hologram shifts diagram up/down shifts diagram down/up no eect no eect diagram changes slightly diagram changes slightly number of waves, sharpness Gran. 103 100 102 102 108 1014 100 Eect on the reconstructed image no eect (if no image) complete diagram rearrangement shifts diagram up/down, smoothing shifts diagram down/up diagram changes slightly diagram changes slightly complete diagram rearrangement

Table 3.4.2. Model parameters impact

53

Conclusions
In the cource of this project the research into the following areas has been carried out: 1. The problems of articial intellect, cognition, selfcognition; holographical principles of human brain; 2. The modern state of HPC industry, parallel computing, software and hardware standards and solutions; the trends; 3. Pseudooptical neural networks (PONN); Kuznetsovs model; parallel implementation possibilities; parameters and their impact on the model; The following practical work was performed: 1. A computational Beowulf class cluster was built using VIA-EPIA-C3 computational nodes, OS Linux and communication library MPI; 2. A software simulator for PONN modeling was created in Unix using C++ language and QT + Qwt libraries; 3. The simulator was parallelized to run on specic parallel hardware cluster Optical approach to the problems of articial intelligence is found promising and prospective. PONN modeling is performing eciently in parallel computational environment and has to be examined in details using computer simulation.

54

Bibliography
[1] Homann A. Paradigms of Articial Intelligence. Springer-Verlag, 1998. [2] Lashley K.S. Brain mechanisms and intelligence: a quantitative study of injuries to the brain. New York: Dover Publications, 1929. [3] A. V. Pavlov, Fourier holography as a modern paradigm of articial intelligence. Journal of Optical Technology, Volume 70, Issue 5, 340-347, 2003. [4] Biochemical basis of Behavior, http://www.maui.net/~jms/brainuse.html Personality & Perception.

[5] ., .. // / . ... - .: - . - 1995. - .8-14. [6] .., .. ? // . . - 1996. - N2. - .66-71. [7] .. // . . . 1995. N5. [8] .. // . 1992. . 324. N3. [9] .., .. - // . . . . 2000. N5. c. 168-176. [10] Haykin, S., Neural Networks: A Comprehensive Foundation, NY: Macmillan. 1994. [11] Nigrin, A., Neural Networks for Pattern Recognition, Cambridge, MA: The MIT Press. 1993. [12] Zurada, J.M., Introduction To Articial Neural Systems, Boston: PWS Publishing Company. 1992. [13] Leith E. N., Upatnieks J. Journ. Opt. Soc. Amer., v. 54, 1295, 1964. 55

[14] Denisyuk Y. N. Optika i Spektroskopiya, v. 15, 522, 1963. [15] .. // // . . .2. .: , 1982.

[16] Fyodorov B. F., Cibulkin L. M. Golographiya. Moscow, Radio I Svyaz, 1989. [17] Gabor D. Associative Holographical Memories // IBM J. of research and development. 1969. V. 13. N2. [18] Gabor, D. Theory of communication. Journal of the Institute of Electrical Engineers, 1946. 93:429-441. [19] Kandel, E. R., Schwartz, J. H., Jessell, T. M. Principles of Neural Science third edition. Elsevier, New York. 1991. [20] Pribram, K.H. The Cognitive Revolution and Mind/Brain Issues. American Psychologist. 41(5). p. 507-520. 1986. [21] Pribram, K.H. Languages of the brain: experimental paradoxes and principles in neuropychology. Prentice-Hall, Inc., Englewood Clis, New Jersey. 1971. [22] Pribram, K.H. What the fuss is all about. in The Holographic Paradigm, ed. Ken Wilbur, 1982. p.33. [23] Pietsch, P. Scrambled salamander brains: a test of holographic theories of neural program storage. Anatomical Record 172:383-384, 1972. [24] Pietsch, P. Shue Brain // Harpers Magazine (vol. 244, No. 1464). 1972. [25] Pietsch, P. The Optics of Memory // Harpers Magazine (vol. 251, No. 1507). 1975. [26] Turner, D. Introduction to Parallel Computing. SCL. Ames Laboratory. 2001. http://cmp.ameslab.gov/cmp/parallel_computing/ [27] Myricom Myrinet. http://www.myri.com/myrinet/ [28] Emulex (Giganet cLan). http://www.emulex.com/products/ [29] Dolphin SCI hardware products. http://www.dolphinics.com/products/index.html [30] TOP500 Supercomputers. TOP500 List. http://www.top500.org/ [31] VIA Technologies, Mini-ITX products. http://www.via.com.tw/en/VInternet/mini_itx.jsp

56

[32] Gardner, G. The Mini-Cluster. 2004. http://www.mini-itx.com/projects/cluster/ [33] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu & w. Zwaenepoel. Treadmarks: Shared memory computing on Network of workstations // IEEE Computers, Volume 29, February 1996. [34] New processor computes at light speed. Newscientist online. 2003. http://www.newscientist.com/news/news.jsp?id=ns99994331 [35] New chip gives PCs supercomputing muscle. Newscientist online. 2003. http://www.newscientist.com/news/news.jsp?id=ns99994274 [36] Ed Scannell; Microsoft jumps into the HPC game. http://www.infoworld.com/, June 23, 2004 [37] R. Buyya, High Performance Cluster Computing: System and Architectures, Vol. 1, Prentice Hall PTR, NJ, 1999. [38] R. Buyya, High Performance Cluster Computing: Programming and Applications, Vol. 2, Prentice Hall PTR, NJ, 1999. [39] LAM/MPI Parallel Computing. http://www.lam-mpi.org/ [40] PVM Parallel Virtual Machine. http://www.epm.ornl.gov/pvm/ [41] T. L. Sterling, J. Salmon, D. J. Backer, and D. F. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, 2nd Printing, MIT Press, Cambridge, Massachusetts, USA, 1999. [42] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall PTR, NJ, 1999. [43] M. Wolfe, High-Performance Compilers for Parallel Computing, Addison-Wesley Publishing, NY, 1996. [44] Ames Lab High performance, light weight MPI. http://cmp.ameslab.gov/MP_Lite [45] MPI/Pro Commercial MPI. http://www.mpi-softtech.com/ [46] MPICH A Portable unix.mcs.anl.gov/mpi/mpich/ Implementation of MPI. http://www-

[47] The ocial version of the MPI documents. http://www.mpi-forum.org/docs/ [48] The Linux kernel. http://www.kernel.org/ 57

[49] ALT Linux. http://www.altlinux.com/ [50] Linux Terminal Server Porject. http://www.ltsp.org/ [51] NAS Parallel Benchmarks. http://www.nas.nasa.gov/Research/Reports/Techreports/2003/nas-03-010abstract.html [52] The Chemistry Group Message system. http://www.emsl.pnl.gov:2080/docs/parsoft/tcgmsg/tcgmsg.html [53] Linpack Benchmark. http://www.top500.org/lists/linpack.php [54] Trolltech, QT Library. http://www.troll.no/ [55] Wilgenn J., Rathmann U. Qwt Widget Library. http://qwt.sf.net/

58

Appendix A Tables and diagrams


A.1 TOP500 supercomputing resource

These tables and gures are taken from [30]. Computer, Processors Manufacturer Rmax Rpeak

Rank

Site Country/Year

Earth Simulator - Center Japan/2002

Earth-Simulator / 5120/ NEC

35860 40960

Lawrence Livermore National Laboratory - United States/2004 Los Alamos National Laboratory - United States/2002 IBM - Thomas Watson Research Center - United States/2004 NCSA - United States/2003 ECMWF - United Kingdom/2004 Institute of Physical and Chemical Res. (RIKEN) Japan/2004

Thunder Intel Itanium2 Tiger4 1.4GHz - Quadrics / 4096/ California Digital Corp. ASCI Q - AlphaServer SC45, 1.25 GHz / 8192 / HP BlueGene/L DD1 Prototype (0.5GHz PowerPC 440 w/Custom) / 8192 IBM/ LLNL Tungsten PowerEdge 1750, P4 Xeon 3.06 GHz, Myrinet / 2500 Dell eServer pSeries 690 (1.9 GHz Power4+) / 2112 IBM RIKEN Super Combined Cluster / 2048 Fujitsu

19940 22938 13880 20480 11680 16384 9819 15300 8955 16051 8728 12534

3 4

5 6 7

Table A.1.1. TOP500 list for June 2004 (part 1).

59

Rank

Site Country/Year IBM - Thomas Watson Research Center - United States/2004 Pacic Northwest National Laboratory - United States/2003 Shanghai Supercomputer Center - China/2004 Los Alamos National Laboratory United - States/2003 Lawrence Livermore National Laboratory - United States/2002 Lawrence Livermore National Laboratory - United States/2000 NERSC/LBNL - United States/2002 NCSA - United States/2004 Lawrence Livermore National Laboratory - United States/2003 Lawrence Livermore National Laboratory - United States/2004 HPCx - United Kingdom/2004 Grid Technology Research Center, AIST - Japan/2004 Oak Ridge National Laboratory - United States/2004 Forschungszentrum Juelich (FZJ) Germany/2004 National Aerospace Laboratory of Japan - Japan/2002 UCSD/San-Diego Supercomputer Center - United States/2004 Kyoto University Japan/2004

Computer, Processors Manufacturer BlueGene/L DD2 Prototype (0.7 GHz PowerPC 440) / 4096 IBM/ LLNL Mpp2 Integrity rx2600 Itanium2 1.5 GHz, Quadrics / 1936 HP Dawning 4000A, Opteron 2.2 GHz, Myrinet / 2560 Dawning Lightning Opteron 2 GHz, Myrinet / 2816 Linux Networx MCR Linux Cluster Xeon 2.4 GHz - Quadrics / 2304 Linux Networx/Quadrics ASCI White, SP Power3 375 MHz / 8192 IBM Seaborg SP Power3 375 MHz 16 way / 6656 IBM TeraGrid, Itanium2 1.3/1.5 GHZ, Myrinet / 1776 IBM United States/2003 xSeries Cluster Xeon 2.4 GHz - Quadrics / 1920 IBM/Quadrics Lilac xSeries Xeon 3.06 GHz, Quadrics / 1540 IBM eServer pSeries 690 (1.7 GHz Power4+) / 1600 IBM AIST Super Cluster P-32 AIST Super Cluster P-32, Opteron 2.0 GHz, Myrinet / 2200/ IBM Cray X1 / 504 / Cray Inc. eServer pSeries 690 (1.7 GHz Power4+) / 1312/ IBM PRIMEPOWER HPC2500 (1.3 GHz) / 2304 / Fujitsu eServer pSeries 655/690 (1.5/1.7 Ghz Power4+) / 1680/ IBM PRIMEPOWER HPC2500 (1.56 GHz) / 1472 / Fujitsu

Rmax Rpeak

8655 11469

9 10 11 12

8633 11616 8061 11264 8051 11264 7634 11060

13 14 15 16

7304 12288 7304 9984 7215 10259 6586 9216

17 18 19 20 21 22 23 24

6232 9425 6188 10880 6155 8800 5895 6451 5568 8921 5406 11980 5223 10310 4552 9185

Table A.1.2. TOP500 list for June 2004 (part 2). 60

Figure A.1.1. Clusters (NOW) systems share by TOP500.

Figure A.1.2. Architecture/performance share by TOP500.

61

Figure A.1.3. Projected performance development by TOP500.

A.2

VIA EPIA platform technical specications

These tables and gures are taken from [31].

62

Figure A.2.1. VIA EPIA-V mainboard.

63

Figure A.2.2. VIA EPIA-V Architecture.

64

Processor Chipset System Memory VGA Expansion Slots IDE Onboard Floppy LAN Onboard Audio Onboard TV Out

Onboard I/O Connectors

Back Panel I/O

BIOS

System Monitoring & Management

Form Factor

VIA C3/EDEN EBGA Processor VIA PLE133 North Bridge - VT8231 South Bridge 2 PC 133 DIMM socket - Up to 1GB memory size Integrated Trident graphics 1 PCI Onboard 1X UltraDMA 100/66 Connector 1 x FDD Connector Onboard VIA VT6103 10/100 Base-T Ethernet PHY VIA VT1612A 2 channel AC97 Codec VIA VT1621 TV out 1 USB connectors for 2 USB 1.1 ports CD Audio-in connector FIR connector CIR connector Wake-on-LAN CPU/Sys FAN 1 PS2 mouse port 1 Parallel 1 RJ-45 LAN port 1 Serial port 2 USB 1.1 ports 1 VGA port 1 RCA port (SPDIF or TV out) 1 S-Video port 3 Audio jacks: line-out, line-in and mic-in Award BIOS - 2/4Mbit ash memory CPU voltage monitoring Wake-on-LAN Keyboard-Power-on Timer-Power-on System power management AC power failure recovery Mini-ITX (4 layer) - 17 cm x 17 cm

Table A.2.1. A.2.1Via EPIA-V mainboard specication.

65

A.3

Cluster drawings

Figure A.3.1. Mini-cluster aluminium frame.

66

Figure A.3.2. Mini-cluster assembled.

67

Figure A.3.3. Node-Server communication sequence.

     

68
 H 7  3 4 )  4 7 9 3 4 0 ( A %9 55R%) 5I%  0 9 E 0 H ( ( E )  7 4 0 ( A 81 5G ##D8) 5I% 9 3   1 P &   G8%) ' 9 4 E 9  ) $ &   8!8 3 A 1 &  @  ) 9  7 1 A I8') 5DRC% 9 ( 1  @  ) 9  7 1 A %8%) 5D8C 4  7  @  ) 9  7 1 A ) 5D8C 9 1 1 4  @  ) 9  7 1 A 5') 588B% H 3  4 3 F ) 9 1 1 Q ) ( 9 @ 9 ##5#% H 3  4 3 F ) 3 2 4 3  ) ( 9 @ 9 ###!D#G 9  3 7 6 3 4 ) ( 9 @ 9 58% F E ) ( & $ 8'% 9  3 7 6 3 4 ) ( & $ 585' 4 3 @ @ 1 ) ( & $ ###8%' 4 3 2 1 0 $ ) ( & $ 5! ''% "   #!  

A.4

Cluster pictures

Figure A.4.1. Mini-cluster front view.

69

Figure A.4.2. Mini-cluster back view.

70

Figure A.4.3. Mini-cluster left view.

71

Figure A.4.4. Mini-cluster with a laptop running the model.

A.5
A.5.1

Performance benchmark tests


Without compiler optimization
Nodes 1 2 3 4 Class A 35,97 70,57 94,67 139,58 Class B 29,97 69,49 93,74 137,51 Class W 36,19 71,74 95,41 142,54 Class S 21,19 40,42 52,05 77,53

Table A.5.1. NAS benchmark w/o optimization. 72

Figure A.5.1. NAS benchmark w/o optimization.

A.5.2

With compiler optimization

-O3 -mmmx -m3dnow -march=i586 Nodes 1 2 3 4 Class A 37,19 73,1 98,48 144,31 Class B 30,88 72,23 97,31 143,16 Class W 38,23 74,69 101,2 148,47 Class S 23,51 44,56 55,19 83,75

Table A.5.2. NAS benchmark with optimization.

Figure A.5.2. NAS benchmark with optimization.

73

A.5.3

Optimization gain

Figure A.5.3. NAS benchmark optimization gain.

A.5.4

Processor scalability
Nodes 1 2 3 4 Class A 35,97 35,29 31,56 34,9 Class B 29,97 34,75 31,25 34,38 Class W 36,19 35,87 31,8 35,64 Class S 21,19 20,21 17,35 19,38

Table A.5.3. NAS benchmark scaling.

Figure A.5.4. NAS benchmark scaling.

74

A.5.5

Controlling node performance


Class A 159,12 Class B 134,75 Class W 84,51 Class S 174,23

Workstation performance

Table A.5.4. Controlling node performance.

Figure A.5.5. Controlling node performance.

75

A.6

Development tools

Figure A.6.1. KDevelop with simulator source code.

76

Figure A.6.2. QT Designer with main form of the simulator.

77

A.7

Running the model

Figure A.7.1. Reconstructing a point with the simulator.

78

You might also like