Advanced Parallel Processing
Advanced Parallel Processing
Advanced Parallel Processing
Invited Paper
This paper investigates advanced parallel processing techniques and innovativehardwarelsoftware architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. We cover advanced processing techniques suitable for s u p e m p u ten, high-endmainframes,minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, loadbalancingmethods,mappingparallel algorithms, operating system functions, application library, and multidisciplineinteractions are investigated to ensure highperformance. At the end, we assess the potentials of optical andneural technologies for developing future supercomputers
techniques to yield higher performance. We examine below the requirements of parallel, vector, and scalar processing. Basic performance measures and benchmarking concepts areintroducedfirst.1nsubsequentsectionsweaddresscritical issues on languages, compilers, processor, memory, input/output,programming,and various performance enhancement methods for parallel processing on three classes of supercomputers. A. Evolution of Modern Supercomputers AS illustrated in Fig. 1 , the evolution of supercomputer architecture follows an increasing trendof more hardware and software functions built into system. a The skewed tree demonstratesthatmostoftoday's supercomputers are designed with look-ahead techniques, functional parallelism, pipelining at various levels, using explicitvectors and exploring parallel processing in SlMD (single instruction and multiple data streams) or M l M D (multiple instruction and multiple data streams) mode [45]. Most supers support concurrent scalar processing and vector processing with multiple functional units in a uniprocessor or in a multiprocessor system. We reserve the termmultiprocessors for shared-memory multiple-processor systems and multicomputers for loosely coupled multiple-processor systems with distributed local memories. Some authors call multicomputers parallel computers [28]. For vector processors, we divide them into two subclasses: the register-to-register architecture is being adapted in almost all supers [MI and minisupers [81], except in the Cyber 205 [98] which chooses a memory-to-memory architecture [65]. Parallelism refers to the simultaneous processing jobs, of job steps, programs, routines, subroutines, loops,or statements, as illustrated in Fig. 2. The lower the level, the finer the granularity of the software processes. In general, parallel processing refers to parallelism exploited at any or a combination of these levels. Sofar, vectorprocessing i s parallel processing of iterations of loops at level 2. Parallel execution of independent scalar statements at level 1has been implemented inmany machines with the look-ahead technique using multiple functional units. Most today's of computers allow multiprogramming, which provides for the sharing of processor resources among multiple, independent software processes. This is done even in a unipro-
I. INTRODUCTION
Parallel processing has emerged as a hot field research of and development by computer professionals. Various classes of parallel and vector supercomputers have appeared in the past two decades [2], [ H I [65], , [72], [87l, [log], [116]. However, the claimed performance may not always This is due to the be deliveredas promised by the vendors. factthat today's supercomputers areonegeneration behind the user's needs and yet one generationahead of the popular programming skills. In other words, we really need supersoftware to help boost the performance.This paper presents advancedhardwarelsoftware techniques that can help in the creationa parallel of programmingenvironment in which real supercomputer performance can be delivered. Usually, the effective performance of a supercomputer ranges between only'5 and 25 percent of its peak performance [35]. Such a pessimistic show of the delivered performance motivates many computer specialists to search for better algorithms, languages, hardware, and software
Manuscript received March 23,1987; revised June12,1987. This research was supported inpart bythe National Science Foundation under Grant DMC-84-21022, the AFOSR under Grant 86-0008, and by the NOSC under Contract 85-D-0203. Department of Electrical Engineering and The authori s with the Los Angeles, Computer Science, Universityof Southern California, CA 900894781, USA. I E E E Log Number 8716264.
1348
I/E
Ga
MM
RR
Fig. 1. Architectural evolution from sequential scalar processing to concurrent vector/ scalar processing.
Level 4
Level 3
Level 2
L o o p and Iterations
Level 1
Statements and
Fig. 2.
cessor system, in which concurrent processes are interleaved with their CPU and 1 1 0 activities. Multiprocessing is a mode of parallel processingthat provides interactive multiprogramming among two or more processors. Independent uniprocessing exploits parallelism at level 1 in multiple-SISD mode. Multitasking is a special case of multiprocessing defining a software process (a task) to be a job step or subprogram at levels 3 and 4 [25]. For machines with a small number of very powerful processors (such as the Cray X-MP) parallelism is mainly performed at the high levels (3,4, and 5)across the processors. However, within each processor, parallelism levels 1 and 2 are still practiced. For massively parallel machines (such as the MPP [6])parallelism is mainly pushed at the lower levels. The general trend is pushing the granularity down. Concurrent scalar processing, vector processing, and mulin a modern supercomputer, tiprocessing are often desired if one wishes to increase the application domains [32], [W].
HWANG: PARALLEL PROCESSING WITH SUPERCOMPUTERS
Supercomputer performance is often measured in terms of Mflops (millions of floating-point operations second) per or Mips (millions of instructions per second). The Mflops measurement reflects the number crunching capability of a computer often tied to 64-or 32-bit floating-point results. The measure of Mips indicates the instruction execution rateof acomputer includingthemixtureof arithmetic, logic, and program control instructions. The relationship between Mflops and Mips varies with different machines and different program mix. In a typical supercomputer containing scalar, vector, and control instructions,1 Mips impliesanywhere between 0.5 to 10 Mflops of performance. However, this is by no means universally true for all computers. We often use a range to indicate the expected performanceofasupercomputer.The real performanceforagiven application program often approaches the low end of the is not skillfully programmed with sufrange, if the problem ficient software support. The peak speed of a machine sets the absolute upper bound in performance. There is no such term as"average" performance, because it meansverylittle to average the performance of different application programs. The performance and cost range of threeclasses of commercial supercomputersare given inTable 1. The fullscale supers are the most expensive class, represented by Cray, ETA, and Fujitsu systems, etc. The nearsupers are highend mainframes such as the IBM3090NF, CDC Cyberplus, the Univac 11OO/ISP series, etc. The minisupers are the lowcost parallel/vector computers represented by the Alliant and Convex systems, etc. The superminis do not qualify as supercomputers. We list them here only as a reference point. Note that minisupers tend to have much higher Mflops/Mips ratio than that of the superminis. According to today's standard, a machine i s qualified as a supercomputer if it can perform O(l0') to @IO3)Mflops in typical scientific or engineering computations [107]. Understandingthe benchmarks isof fundamental impor1349
Performance Peak
200 2400 Cray Mflops $2M >to 500 Mips
Full-scale supercomputers
cost
Representative Systems
2, Cray X-MP, NEC SX-11SX-2, VP 200, ETA-10, IBM GF-11
500 Mflops
10
- 100 Mflops
IBM 3090NF, Loral MPP, CDC Cyberplus, Univac 1194/ISP $4M Connection Machine, BBN Butterfly
$lMK to
$1 M
$1.5M EncorelMultimax, iPSC, FPS 164 Max, T-Series, Warp Superminis (not a class of supercomputers, listed as a reference point)
e 0.5 Mflops
<5 Mips
$20K to $400K
VAX
Table 2 An Architectural Taxonomy of Supercomputers tance to measuring the performance of supercomputers. Only preliminary benchmark data from existing supercomArchitecture ReDresentative Svstems puters are becoming available now. The key problem is the Uniprocessors with Alliant FWl, IBM 3090, use of benchmarksin assuring comparability. Benchmarkmultiple functional CDC 7600, ing intends to indicate the relative performance various of units and vector FPS 164/264/364, Convex machines under the same workload. By definition, a comhardware options C-1, Cray 1, Cray X-MP11, Cyber 205, puter benchmark is a set of key programs or sections ofkey Amdahl500,1100,1200, programs that are executed for timing purposes. The sub1 4 0 0 (also Fujitsu VP-50, sets of key programs, called kernels, often represent the 100,200,400) Hitachi most time-consuming portion of the major codes. Kernels sa10 must be converted to run on target the machines. Using the NEC SX-1, SX-2, SCS-40 workload fractionsas weights, multiple runs are timed and SlMD processor arraysLoral or MPP, ICUDAP, FPS compared.The kernelsare real and tractableand thus often 164/MAX, processors attached used as standard tests. However, kernelsmay be too simple Connection Machine, IBM-GF11 to reveal the limitations of the target machines[99]. The Livermore kernels are often used in evaluating the Shared-memory Cray X-MP/2,4, Cray 2 , performance of scientific computers. Important steps in FW8 multiprocessor systems Alliant EncorelMultimax, Elxsi using kernel benchmarks have been summarized in [141]. 6400, Sequent 8O00, Besides using the kernels,Jack DongarraofArgonne Cray 3, IBM 30901400 VF, National Laboratoiyhas been comparing the performance Univax 119411SP. of about 100 different computers (ranging from supers to Distributed-memory NCUBE, iPSC, 14, Ametek micros) using the standard linear system solver, LINPACK, multicomputers BBN Butterfly; in a Fortran environment [35]. The timing information that CDC Cyberplus, Culler he releases from time to time reflects only the relative perPSC, FPS T-Series, formance on one problemarea. To judge the overall perWarp formance of a computer system, one shouldtest both kerHierarchical Cedar, or ETA-10, IBM RP3, nelsand specific application problems. Such benchmarking reconfigurable Remps systems performance may change with respect to software and hardware changes. The OSkompiler used makes a subtle difference, even running on the same machine. Furtherdiftiplefunctional units. The memory-to-memory architecture ferences can be found in the direct use of assembly landemands much wider memory bandwidth and longer guage coding or theuse of compiler directives. instructions, which only favors the processing of long vectors. For short vectors or scalars, its performance could be C. Advanced Architectural Choices very poor as experienced in Cyber 205. On the contrary, the register-to-register architecture performs much better with Since supercomputers are mainly used to perform a scalar/vector mix. The Alliant FW1uses a cache-based numerical computationsin science andengineering, most register-to-register architecture, which is a compromise are equipped with both scalar and vector hardware units that can operate i n parallel. We classify the architecture of between the two extremes. Multiple data streams can be modern supercomputers into categories five based on their executed by multiple functional pipelines simultaneously, interconnect and operational structures, as summarized in even if there is only one instruction stream in the system Table 2. The key featuresof these architectural choices are [721, 11091. characterized below. 2) SlMD Processor Arrays: These are parallel processors which operate synchronously in lockstep under the same 1) Multipipelined Uniprocessors: Most vector supercomcontrol unit.Physically, theprocessingelements (PES) form puters start as a pipelined uniprocessor containing mul1350
VOL.75,
a processor array, such as the mesh architecture in llliac IV, Loral MPP and ICUDAP [60], or the hypercube architecture in the Connection Machine [58]. Since these machines are often used in processing large-scale arrays of data, they are also called array processors. The FPS 164/Max [I81 belongs Accelerators (MAX$ to this class. Up-to-I5 Matrix Algebra are attached to an FPS 164 and operate in an SlMD mode. Most SlMD array processors are special-purpose machines, applied mainly to signal and image processing [96]. 3) Shared-Memory Multiprocessors: These are tightly coupled M l M D machines using shared memory among multiple processors. The interconnect architecture falls essentially into one of two classes: the bus connect and direct connect. Most minisupers choose the bus connect, in which multiple microprocessors, parallel memories, network interfaces, and devicecontrollers are tied to the same contention bus [9]. For example, the Elxsi 6400 uses a highspeed &bit bus with a bandwidth of 320 Mbytesls [loll, [119]. The direct connect architectures include crossbar, partially connected graphs, and multistage networks [22], [39]. Most high-cost supers and high-end mainframes use these direct interconnects [32]. 4) Distributed-Memory Multicomputers: This class corresponds to loosely coupled MlMD systems with distributed local memories attached to multiple processor nodes. The popular interconnect topologies include the hypercube, ring, butterflyswitch, hypertrees, and hypernets. Message passing is the major communication method among the computing nodes in a multicomputer system. Most multicomputers are designed to bescalable. The BBN but[22], by whichproterfly switchis a multistage network cessors can access each others local memories. The ring architecture requires less hardware. The hypernets [69] offer a compromise in hardware demand between hypertrees [52] and hypercubes [471,[54]. Communication efficiency and hardware connectivity are the major concerns in the choice of a cost-effective multicomputer architecture. 5) Hierarchical and Reconfigurable Supercomputers: These machines haveahybrid architecturecombining both shared memory and message passing for interprocessor communications. Several research multiprocessor systems belong to this category, such as the Cedar project [88] and the ETA-IO [36]system. Hierarchical memorysystem is built into the system. The approach is to apply macro dataflow at the level of processor clusters and still use control flow
within each processor. Therefore, parallelism is exploited at multiple levels, even with different computing models. The Remps [75] was proposed to have a hierarchical architecture, specially designed for solving PDE problems. Other parallel PDE machines have been reviewed in [109]. Representative supercomputer and high-end mainframes are summarized in Table 3. Table 4 summarizesvarious minisupercomputers. Key features being listed include the degreeofparallelism (number processors of in system), processor type, memory capacity, interconnect architecture, and peak performance of both classes of supercomputers. Other recent architectural studies can be found in [331, W , I[641, [MI, [811, [1071, [1091.
II. PARALLEL LANGUAGES AND COMPILING TECHNIQUES
We address below languages and system software issues for parallel processingon supercomputers. First, we review the development of concurrent programming languages and their compilers. Vectorization and multitasking techniqueswill be reviewed. Finally,wediscussadvanced methods in developing intelligent compilers for parallellvector computers. A. Concurrent Programming Languages To implement fast algorithms on supercomputers, we need a high-level programming language possessing the following features: Flexibility. The language should make it easy for the programmer t o specify variousforms ofparallelism in application programs. Efficiency.The language shouldbeefficientlyimplemented onvarious parallellvector computer systems. Three approaches have been practiced towards the solution of this language problem, as illustrated in Fig. 3. The Compiler Approach: Most existing applicationsoftware packages are coded with sequential languages such as Fortran. Intelligent compilersare needed to detect parallelism in sequential programs and to convert them into parallel machinecode, as illustrated in Fig. 3(a). Good examples include the CFTcompilerused in Cray X-MP [26] and the KAP/205 compiler designed for Cyber-205 [63]. An advantage of this approach is that software assets accumulated in conventional sequential codes can be used on
[n,
.
Serial Constructs Cyber 205 Multitasking Constructs Cray X-MP Multiprocessor
Intel iPSC Multicomputer
Extended C
Concurrent Constructs
(b)
Various
Multicompu
(C)
Fig. 3. Three approachesto concurrent scalar/vector programing. (a) Smart compiler approach. (b) Language extension approach. (c) New language approach.
HWANG: PARALLEL PROCESSING WITH SUPERCOMPUTERS
1351
Table 3 Supercomputers and High-End Mainframe Systems System Model Cray X-MPl4 Architecture Configuration' MP with SM and direct connect
MP with S M and
Processor Type3 Custom ECL Custom ECL GaAdECL Custom CMOS Custom Custom ECL Custom
Peak Performance'
Remarks and References6 shared registers, [21,25] among 4 Processors 16 KWILMlproc. [26] under development MM architecture [631,[981 under development [361 also Amdahl1200 P I ,[I331 16 pipes divided into 4 identical sets host scalar processor [99] MlMD pipelining [791, M I , P25l vector facility optional [51], [138]
ISP contains [129]
840 Mflops
direct connect
MP with SM
2 Gflops
cw
16 Gflops
400 Mflops
4 MW
a Proc. l a IOPS
1 Proc. 1 Proc.
256 mW 32 MW 32 MW
1 Proc.
32 MW 256 MW 2 GB CM M 16 TB E 16 MW
840 Mflops
160 Mflops
switch network
M P with S M and
480 Mflops
direct connect MP with SM and direct connect MC withDM and ring connect
67 Mflops
vector hardware
512 KW per
CDC Cyberplus
Custom
processor
65 Mflops and 620 Mips per processor >lo00 Mips 250 Mflops
Connection Machine
SlMD with DM hypercube embedded in a global mesh MP with S M via butterfly switch network
SIMD 128 X 128
64K
PES
32 MBytes
256 Proc. 16 K
PES
1 2 8 ~ ~ M68020, custom coprocessor CMOS/SOS a PES per chip custom floatingpoint proc. 32-bit RlSC Alliant/FX clusters 128 MB
256 Mips
470 Mflops
mesh with DM SlMD with a reconfigurable Benes network MP with SMlDM and fast network hierarchical M P with SM
IBM
576
PES
GF 11
P I
IBM
RP3
512
128 MW 256 MW
under development
[1131
Cedar
3.2 Gflops
1352
System Architecture Max No. of and Remarks Peak Memory Processor Max Model References6 Performance Configuration Capacity4 Type3 Processors Alliant FX/8
MP with S M
E CEs, 12 IPS
12 proc.
4 MW
192 MBytes
16 MW
35 Mips
P , 811
and bus connect MC withS M and vus connect MC with DM and VME buses UP with vector hardware UP with multiple pipes MC with DM and hypercube MC withDM hypercube MC withDM and hypercube MC with DM and hypercube SlMD array of
15 MAXs 30 NS32020 /32081 NS32020, M68020
ooo
48 MBytes
~321
Flexible/32
20 per cabinet
20 MW
[411
processor
16 MW 4 MW 20 Mflops 25.5 Mips 18 Mips, 44 Mflops 20 Mflops per
c-1XP
scs-40
Convex
4 proc. 5 lops
multiple I/O (Crayetts) [23], [81] vector hardware Cray compatible [81] multiple 110s and memory options [54]
[31
iPSC-VX
128
Intel
80286, 80287
4.5 MBytes
CNs
256
per processor
32 MW
processor
15 Mflops
Ametek
S-14
Intel
80286 80287
NCUBHIO
1024
custom VLSl
512 MBytes
[1101
FPS-T
4096
CMOS transputer
1 MByte per
node
1 proc., 15 MAXs
[561
FPS 164/
MAX ICL-DAP
64 x 64 mesh
8 MBytes
Culler-PSC
2 proc., 1 host 10
12 MW
5 Mflops
[811
custom
80 MW
100 Mflops
link to a host
[41 __
MP = Multiprocessor, MC = Multicomputer, S M = Shared Memory, DM = Distributed Memory (Local Memory), UP = Uniprocessor. *CE = Computational Elements, IP = Interactive Processor, IOP = I/O Processor, MAX = Matrix Algebra Accelerator, CN = Computer Nodes. TCM = Thermal Conduction Module, RlSC = Reduced Instruction Set Computer. A word has 64 bits, CM = Central Memory, E M = Extended Memory, SSD = Solid State Device. Mflops = Million Floating-point Operations per Second (&bits precision), Mips = Million Instructions per Second. RR = Register-to-Register, and MM = Memory-to-Memory.
parallel computers with only minor modifications. However, the sequential language forces the programmer to code parallel algorithms in sequential form. A compiler can enhance the performance by only a limited factor, due to the difficulty in detecting parallelism in acomplex program mix. The Language Extensions: Sequential languages can be extended with architecture-oriented constructs to support concurrent programming. Usually, only one type of parallelism is supported in each extended language, as illustrated in Fig. 3(b). For instance, Fortran has been augmented with service routines like EVWAIT, LOCKON, and TASKWITH SUPERCOMPUTERS
START in Cray X-MP to enable multitasking [25]. Concurrent C language has been extended from C for theFled32 multicomputer [41]. Other extensions include Vectran [Ill], Actus-2 [112], and Vector C [97]for supporting vector processing on pipeline or array processors. Because the extensionsare machine-oriented, they areoften efficientlyimplemented. However, machine dependence implies poor portability. Users may haveto recode the parallel algorithm in another extended language, when the targetmachines are changed. The New Languages: With this approach, new concurrent languages are developed forsupporting parallel pro1353
cessing. Quite a few concurrent languages have been proposed in recent years, including the Concurrent Pascal [55], Modula-2 [ I N Occam ] , [W, Ada [31], VAL, and SlSAL [loo]. Unlikean extended language, these new languagescontain some application-oriented parallel constructs. However, each of these languages is usually designed to support one form of parallelism, a s illustrated in Fig. 3(c). It i s probably due to this reason that none of these new languages has been universally accepted in commercial supercomputers. Comparing these three approaches, we observe an application barrier: On the one hand, parallel algorithms and supercomputers manifest a variety of computation modes. On the other hand, each of the proposed languages s u p ports only one ortwo computation modes. This barrier is the major source of inflexibility and inefficiency in using conventional sequential languages to code parallel algorithms on supercomputers. Various languages and compilers used in modern supercomputersare summarized in Table 5. We are searching foran ideal programminglanguage for parallelhector supercomputers. Such a high-level language should be easy to learn and flexible to apply in major scientificlengineering applications. The language should be architecture-independent. In other words,it would be used in various computation modessuch as SIMD, MIMD, pipelining, and dataflow, etc. Such an ideal programming language may haveto be developed with a combination of the above approaches; perhaps starting with a smart cornpiler, then adding some extensions to existing popular languages and gradually moving towards the development of a new language as a long-term effort. Recently, a Parallel Programminglanguage (PAL) wasproposed for supporting multimode concurrent parallel programming [143]. A new languageconstruct, called a molecule, is proposed there for the programmer to specify typedprocedures explicitly for various modes of parallel computation.
* * a *
//
/
* y
I * * * I
* * a
* *
50%
-... 1
I 2 3 4 5 6 7 8 9
3 0 %
1 0
that migrationwillassistthevectorizingcompiler inexploiting the vector hardware for that program. Hazards may occur in which an apparent lack of data independence may prevent vectorization in loops. The basic unit of vectorizationis the DOLoop. Dueto the size restriction of thevector registers, it is always necessary to split long vectors into segments, called sections. For example, Cray X-MP has section size 64 and IBM 3090l400 VF has 128 elements in asection (Fig. 5) [138]. Vectorizable loops must be first sectionized. Data independence in loops is a key factor in enabling the vectorization. A recurrence
P =
1 (1 -
f)
+ flr
(1 )
The performance P is very sensitive to the variation of f, which indicates thepercentage of code whichis vectorizable. The speedup ratior is primarily determined by hardware factors. The term vectorization refers to the compilers role in analyzing user programs and producing object codes to execute on the vector hardware. The portion of the compiler which carries out the vectorization is called a vectorizer. Vectormigration refers to the process of modifying and adapting an application program in order to reveal its vector content and to improve its performance. This implies
chd
CPU-3
Expanded Stomge
(Shared Memory)
CPU: Central Proceasom
: Vettor
Facility
1354
Supercomputers System Model Cray X-MP Cray 2 Cyber 205 ETA-1 0 NEC SX Fujitsu VP IBM 3090NF Univac 1194/ISP Cyberplus HEP-1 Convex C-1 SCS-40 Alliant/FX
Elxsi 6400
ass
Unix-System V/CFT2, C Virtual OS/FTN 200 Virtual OS/Fortran, Pascal, C ACOS/Fortran 77/SX, Pascal, C, Lisp, Prolog FACOM VSPlFortran 77
X A N S Fortran V2
macroassembler CAL 2 supported Fortran compiler,ANSI 77 with vector extension also V SOS, UNlX planned only Fortran in vector mode also appear in Amdahl 1200 series economic analyzer supports interactive vectorization ANSI Fortran 8X standard included cross-compiler with host production suspended accepts VAXNMS Fortran inputs COS and UNlX system V and C compiler under development
OS based on UNlX 4.2
1100 Executive/UCS Fortran77 Host NOSUANSI 77 Fortran UNlX Ill/Fortran 77, C, Pascal Unix 4.2 bsd/Fortran 77, C CTSS/Fortran 77 (CFTCivic) ConcentrixlFX Fortran, C, Pascal EMBOYFortran 77,Pascal, COBOL 74, C, MAINSAIL DYNIWC, Fortran 77,Pascal, Ada UNMAX 4.UC, Fortran, Pascal Ada, Lisp MMOSIFortran 77, C, Ada, Concurrent C ? L Fortran XenixlC, Fortran, Common Lisp Hypernet O W ,Fortran Host OS/OCCAM UltrixlFortran 77, C Host OSICM-C, CM-Lisp ICL PerqlDAP Fortran ChrysalisK, Common Lisp UNIWC, Fortran, Pascal UNlX 4.2 csd/Fortran 77, C AXISNERTEWFORTRAN 77, C
also run UNlX system V.2 similar to UNlX 4.2 bsd support multiprocessingalso UMAX V/OS Ada supports multiprocessing; also run UNlX System V
OS run on Intel 310 host
Ultrix is the DECs version of UNlX connection machine extensions of C and Lisp running under UNIX, similar to Fortran 8 X
OS similar to UNlX
OS modified from UNlX 4.2 bsd
Connection Machine ICUDAP BBN Butterfly IBM RP3 Culler PSC NCUBE U10
carries a dependence between the elements of a vector which prevents it from being vectorized [@I. lndirectaddressing refers t o addressing an array by using subscriptswhicharethemse1vessubscripted.Aslongasthe arraywhich is indirectlyaddresseddoes not appear o n both sides of the equal sign in a Fortran statement, vectorization is possible. In other words, indirectly addressed variables may be vectorized, if there are only loads, or onlystores of
the variables, but not if there are both. Vector reduction refers to avector operation which produces a scalar result, such as the dot product of two vectors or finding the maximum of an array of elements. Vector reduction arithmetic demands special pipeline hardware support [105]. Localvectorization techniquesare centered on DO loops. It should be noted that not all DO loops are appropriate for vectorization. In fact, not all loops are DO loops, such as
1355
a loop containingan IF statement. Some D O loops iterate too few times or have unknown iteration counts. These may not be worth the effort to vectorize. The stride of a vector is the addressing increment betweensuccessive elements. Stride1addressing often results in better performancethan addressing with longer strides. Therefore, stride minimization is often practiced to enhance local vectorization. Reorganizing data may improvethestridedistribution among nested D O loops. Prolonging the innermost DO loop makes it more appropriatefor vectorization. Other local vectorizationtechniques include the use of temporary variables, linearized multidimensional subscripts and auxiliary subscripts, statement reordering, loop segmentation, simplifying subscripts, reverse unrolling, loop distribution, IF conversion, improving vector density, and the use of equivalence for longer vectors [128]. One should isolate nonvectorizable constructs, such as C A L L , recurrences, 110, relationals, and other hazards, in application programs [SI]. Globalvector migration requires the global restructuring of application programs. This is often more difficult than local vectorization. One global technique is to incorporate loops across several computation modules or subroutines. Another approach is to change the solution method. At present, most vectorizers can only perform local vectorization. Only exceptionally smart vectorizers can perform a limited global vector migration. Most of the global program restructuring operationsare still done by skillful programmers, perhaps through the help of an interactive vectorizer, which allows the user to tune hisprograms as supported by the Fujitsu VP-200 compiler [133]. The IBM 3090NF system offers an economicanalyzer in itsVS Fortran compilerthat estimates the number of cycles (cost) needed to execute given sections of code, including the vectorization overhead. This analyzer helps users fine tune their programs with bothlocal vectorization and global restructuring to enhance performance. Multipipeline chainingas introduced inCray X-MP supports the fast execution ofa sequence of vector operations [26]. On the other hand, systolic arrays offer multidimensional pipelines for direct execution of certain vectorized algorithms [92]. Recently, a dynamic systolic approach was proposed forfast execution of vector compound functions directly by pipeline nets 1751. These are advanced hardware facilities for supporting large-grain vector computations. The pipeline nets are more attractive than linear pipeline chains or static systolic arrays in the area of programmability and flexibility for general-purpose application. For example, it has been verified that the execution of mostLivermore loops can be speeded up greatly, if they run on reconfigurable pipeline nets. Pipeline nets are implemented with a programmable crossbar switch with.input buffers. We shall describe howtoconvert synchronous program graphs into pipeline nets in Section IV. To explore these advanced hardware features, even more intelligent compilers are needed. In fact, Japans national supercomputer project emphasizes both advanced hardware and smart compilers [82].
C. Multiprocessing and Multitasking
tiple processors. The Denelcor HEP was designed as such an interactiveMIMD multiprocessor [79], [85], [86].The marketing of HEP was suspended due to the lack of multiprocessing software. The main lesson learned from HEP is that fancy hardware alone is not enough to stretch for better performance. Multiprocessing at the process level mustbe supported by the following capabilities: fast context switching among multiple processes resident in processors; multiple register sets to facilitate context switching; fast memory access with conflict free memory allocations; effective synchronization mechanism among multiple processors; software tools toachieve parallel processing and performance monitoring; system and application software for interactive users. In a multitasking environment, the tasks and data structure ofa job must be properly partitioned to allow parallel execution without conflict. However, the availabilityof processors, the orderof execution, and thecompletion tasks of are functions of the run-time conditions of the machine. Therefore, multitasking is nondeterministicwith respect to time. O n the other hand, tasks themselves must be deterministic withrespect to results. To ensure successful multitasking, the user must precisely define and add the necessary communication and synchronization mechanisms and provide the protection of shared data in critical sections. Critical sections, being accessed by only one task or one process at a time, may reside in shared memory, I/O files, subroutines, or other shared resources. One can use lockor unlock mechanisms to monitor the operation of critical sections. Reentrancy is a useful property for one copy a program of module tobe usedby more than one task in parallel. Nonreentrant code can be used only once during the lifetime of the program. Reentrant code, if residing in the critical section, can be used in a serial fashion, called seriallyreusable code. Reentrant code, which is called many times by different tasks, must be assigned with local variables and control indicators stored in independent locations, each timetheroutine is called. Stack mechanism has been employed in X-MP to support reentrancy. The Cray X-MP has developed software support to realize multitasking at several levels as illustrated in Fig. 6. Dataflowanalysis isoften performed reveal to parallelism contained in application programs. Themajor constraint of parallelism is the various forms of dependency as summarized in Fig. 7. The computational dependence is caused by either data dependence or control dependence. The nodes represent statements, processes, or even tasks. The arcsshowthedependencerelationshipamongthem..Either multiprocessing or multitasking will introduce overhead that increases overall execution time.To reduce overhead, parallelism should be exploited at the lowest level (fine granularity) possible. The storage dependence i s caused by memory conflicts. Each task or process must use independent or protected storage areas to guarantee the shared data integrity. For multitasking, storage dependence is often caused by a data dependence between the iterations of a
Ultimately, wewant to use a supercomputer that can support intrinsic multiprocessingat the process level. Such a tightlycoupled computer uses shared memoryamong mul-
13%
1.
Multitaskingatthejoblevel
E l
p 1 * * *
An aample pqram:
2. Multitaskingatthejob-steplevel
DO 1 0 I = l , N READ '4r)
(Sl)
@
S S
Dependency Typa:
S, i s flowdependent on
Sl
3 . Multitaskingattheprogramlevel
4.Multitaskingatthelooplevel
program. the execution time of the subroutineSUB(/) and the overhead introduced in service routines. Before one attempts toconvert a serial code into multitasked code, the expected performance shouldbe predicted to ensure a net gain. Factors affectingperformanceinclude task granularity, frequency o f calls, balanced partitioning o f work, and programming skill in the choice of the multitasking mechanisms [IA, [95]. The program in Fig. 8 is analyzed below to predict performance. This example is taken from the Cray Research multitasking guide [25]. The 100 loop has dependent iterationswhichcannotbeexecutedinparallel.The10loophas independent iterations, which is being attempted for multitasking. The execution time of the original program on one CPU consists of two parts: TimeU - CPU) = Time(Seq)
= (0.04 = 20.83
czs
DO 1 Iz1.N
OR VECTOR CODE)
Fig. 6. Four possible multitasking modes in aCrayX-MP/2,4 multiprocessor system. (Courtesyof Cray Research, Inc.)
loop. When the extent of a left-hand side array variable is less than the index range of the loop, such a storage dependence may occur. Speedupfrom multitasking mayoccuronlywhen thetime saved in executing parallel tasks outweighs the overhead penalty. The overhead is very sensitive to thetask granularity; it includes the initiation, management, and interaction of tasks. These are often accomplished by adding additional codeto the original code, as illustrated inFig. 8. This program can benefit from multitasking, depending on
+ Time(SUB)
+ 0.96) * (20.83 s)
s
Consider a sequential code run on a uniprocessor (Cray X-MP/I). PROGRAM MAIN DO1001 =1,50 DO101 =1,2 CALL SUB(J) CONTINUE 10
CONTINUE 100 STOP END The following multitasked code runs on a dual-processor system (Cray X-MP/Z): PROGRAM MAIN
COMMON/MT/ISTART,IDONE,JOB
TSKSTART(IDTASK,T) CALL SUBROUTINE T JOB = 1 DO 100 I = 1,50 CALL EVPOST(ISTARl7 CALL SUB(1) CALL EVWAIT(ID0NE) CALL EVCLEAR(ID0NE) CONTINUE 100 2 JOB = 2 END EVPOST(1START) CALL CALL TSKWAIT(1DTASK) STOP END
Fig. 8. Converting a sequential code intoa multitasked code on a CrayX-MP/2. (Courtesy of Cray Research, Inc.)
WITH SUPERCOMPUTERS
1357
we need isan intelligent compilerthat automatically detects the potential for vector and parallel processing i n standard Fortran (or any other high-level language) code and generates object code that uses the parallel and vector features of the hardware to fulladvantage. The compiler analyzes source code fordata, control, and storage dependencies at Time(2 - CPU) = Time(Seq) Time(SUB) Overhead the process, loop, and instruction levels. The goal is to generate optimized code, which can be executedin concurrent (2) scalar, vector, multitasked scalar, or rnultitasked vector processing modes. The optimization process is aimed at because the subrouting SUB was equally divided between enhanced vectorization and concurrency exploitation. 2 CPUs. The overhead is calculatedbelowwith some Generally speaking, innermost DO loops should be vecapproximation on the delays caused by workload imbaltorized and the next outer loop should be multitasked. ance and memory contention. The service routines, Nested DO loops and multidimensional array operations TSKSTART, TSKWAIT, EVPOST, EVCLEAR, EVWAIT, are used can thus run in multitasked vector mode. Other situations, in Cray X-MP to establish the multitasking structure [25]. like DO WHILE loops, can run in multitasked scalar mode. It is often very desirable to have an interactive compilerthat Overhead = TimeCTSKSTART)+ Time(TSKWAIT)+ allows a programmer to tune his code in the optimization 51 * Time(EVPOST)+ process. Such fine tuning can be conducted at a higher 50 * Time(EVCLEAR)+ programming level(global), such as subroutines andtasks 50 * Time(EVWAIT)+ where optimization needs feedback information from the (workload imbalance delay)+ programmers. Such acompiler should providefacilitiesfor (memory contention delay)+ programmers to monitor and modify the code optimiza= 1500000 CP+ tions. Messages and listings notify the user of conditions 1500 CP+ that affect optimization, and summarize the scope of opti51 * 1500 CP+ mization. Programmers then modify the optimization with 50 * 200 CP+ I ] . inserted compiler directives [ 50 * 1500 CP+ Compilers directives are often very helpful in achieving (0.02 * 50 * 0.2 s) better optimization on a global basis. The intelligence of most compilers is presently restricted = 0.216 s to local optimization. In order to increase the degree of multiprocessing or mulwhere the CP (clock period) is equal to 9.5 ns. Therefore, titasking, we use the compiler directives to achieve global Time (2 - CPU) = (0.4 * 20.83) * (0.96 * 20.83) 0.216 or subglobal optimizations. The directive allows the pro= 11.05 s. We thusprojectthefollowingspeedup ragrammertooverridethecompilerwhereoptimizationdoes tio: not enhance performance or may causeinvalid results. For example,adirectivecan beadded tosuppressvectorization Time(1 - CPU) 20.83 = -= 1.88. (3) Speedup = when the vector lengthis too short, and to suppress mulTime(2 - CPU) 11.05 titasking, if the overheadis too high. This speedup helps decide whether multitaskingis worthKnuth has stated thatless than 4 percent of a Fortran prowhile. The actual speedup of this program as measured by gram generally accounts for more than half of its running Cray programmers was 1 . 8 6 . This indicates that the above The best pertime [83]. Certainly, DO loops play this role. prediction is indeed very close. formance resultsare often obtained by vectorizing and/or Multitasking offers a speedup which is upper bounded multitasking such loops. Loops containing data dependenby the number of processors in a system. Because vector cies or recurrences may be inhibited from vectorization. processing offers a greater speedup potential over scalar Then the compiler should perform some loop transforprocessing (in Cray X-MP, vectorization offers a speedup mations to make the successive iterations multitaskable. 20), multitasking should not be in the range of 10 This has been done in Cray X-MP as well as in Alliant FXemployed at the expense of vectorization. In the case of series multiprocessors [135]. The example shown in Fig. 9 short vector length,scalar processing may outperform vecillustrates how concurrent processing of a DO loop with tor processing. In the case of small task size, vector prodata dependency can be achieved in an Alliant multiprocessing (or even scalar processing) may outperform mulcessor with 3processors (computation elements). The data titasking. Bothscalar and vector codes may be multitasked, dependencies are synchronized by a hardware concurdepending on the granularity and the overhead paid. For rency control bus across the processors, as shown in Fig. large-grain computations withreasonably low overhead (as 10. in the above example), multitasking is appropriateand Compiler technology plays a crucial role in the perforadvantageous [20]. mance of vector supercomputers. Most vectorizing compilersaredesigned for specific machines such as the FX/Fortran used in Alliant FW8. What we want to develop are D. Intelligent Compiler and Directives retargetable vectorizers which require low software conversion costs, when switched among different machines Both vectorization and multiprocessing needto be s u p be designed to exploit not only ported by an intelligent compiler. Sofar, most Fortran com- [89], [90]. The compiler must vectorization but also parallelism at higher levels. pilers developed for supercomputers have some vectoriImportant optimization capabilities that should be built zationcapability.Veryfewcompilers have beenfully developed to exploit parallelism for multiprocessing. What into an intelligent compiler include vectorization, concur-
where Time(SUB) accounts for the 96 percent of the time spent in subroutine SUB and Time (Seq) for4 percent spent on the remaining portion of the program. The total run time on 1 CPU is assumed to be 20.83 s. To execute the multitasked program on 2 CPUs requires
+4
+4
1358
Fig. 9. Parallel executionof a DO loop with data dependencyon an Alliant FX system with three processors. (Courtesy of Alliant Computer Systems Corp.)
T ;y q
Crceabu Switch
F i g . 10.
exploiting scalar or randomly structurd parallelism. In fact, rency, general, and directive optimizations. Conventional this new approach as been built into the Multiflow Trace compilers can only support the general optimization. Veccomputer series using a Very Long Instruction Word (VLIW) torizing compilers and multiprocessing compilersmay be architecture. The method can break theconditional branch developed separately or jointly. The joint approach is more barrier by predicting the most likely execution pathbased useful to general-purpose applications. However,its develon some heuristics or statistics gathered automatically by opment is much more involved. At present, the FWFortran 1024-bit instructionwords,28operCompiler [I] and the compiler developed for the Warp pro-program profiling. With ations can be executed in parallel per each cycle. The tracecessor [4] at CMU have some limited extent of these comscheduling compiler exploits large amounts of fine-grain bined capabilities. The Cray CFT compiler supports autoparallelism with the VLlWarchitecture. Initial benchmark maticvectorizationandmultitaskingbyaddingthose data released with the Multiflow computer suggest acomsoftware initiation and synchronization routines[25], [131]. parable performance with that of IBM 3090/200 or Alliant In real-world applications, many codes cannot be vecFX/8 inrunningthe same LINPACK programwithout torized or parallelized well. A trace scheduling compacting resorting to vectorization. The capability of using a tracecoupling technique has been developed byFisher [ 4 4 ] for
1359
Parallel techniquesfor multiple processor scheduling and for concurrent activity control are presented below with three advanced system features: the MlMD pipelining, as introduced inHEP-1, the activitycontrol mechanism develIll. NEW ARCHITECTURES AND CONCURRENCY CONTROL oped in the Univac ISP system, the scoreboard approach in This section deals with advanced architectures,resource Cyberplus, and the lockstep mechanism i n IBM CF-11. The HEP-1 was a tightly coupled multiprocessor consisting of management, and concurrency control techniques.These sixteen processorsand up to 128 memory modules that are issues will greatly affect the degree of parallelism, the resource utilization rate, and thus thesystem throughput. interconnected via a pipelined packet switching network Resource management includes the scheduling of multiple [125]. Parallelism is exploited at the process level within each processors and functional units, memory allocation and processor. The system allows 50 user processes to be con1 0 activity control. Concurrency refers to the access, and 1 currently created in each processor. Fifty instruction simultaneous execution of multiple software processes in streams are allowed per processor, with a maximum of a computer system. It combines vectorization and multi50 X 16 = 800 user instruction streams in the entire HEP tasking in a multiprocessing environment. Multiprocessing system. The pipelined HEP executes multiple instructionstreams techniques address three critical issues: partitioning, scheduling, and synchronization [50]. Various concurrency 1 , over multiple data streams. An exampleis shown i n Fig. 1 control techniques will pave the way to developing while an add i s in progress for one process, a multiplycan advanced software tools for balanced parallel processing. third, a,nd a branch be executing for another, a divide a for Weillustratetheseadvancedparallelprocessingtechfor a fourth.Instructionsbeingexecutedconcurrently niques along withsome innovative supercomputer archiare independentof each other. Thus fill-inparallelism tectures introduced in recent years. increases processor utilization. Arbitrarily structured par-
scheduling compiler to optimize non-parallel code compensates exactly the weakness of vectorizing or rnulti-processing compilers.
^-I
I
~~ ~
"3
Inst ction
.rids
Add Functions
A(4) B(5)
Stream 3 Instruction Instruction Fetch Stream 2 Instruction Omand Fetch Stream 1 Instruction Execution Phase Stream 8 Instruction Execution Phase
MIMD Pipelie
(b)
Fig. 11. MlMD pipelining in HEP-1 multiprocessor system.(a) Conventional pipelining on uniprocessors.(b) Pipelining of instructionsfrom multiplestreams in a multiprocessor. (c) Mechanism to achieve MlMD pipelining in HEP-1.
1360
PROCEEDINGS OF THE IEEE, VOL. 75, NO. 10, OCTOBER 1987
(C)
allelism is applicable to codes that do not vectorize well. The concepts of packet-switched data forwarding [22] and of MlMD pipelining[78], [79] make HEP very attractive for solving PDE problems described by sparse matrices. The architecture of Univac 1100/94/21SP is shown in Fig. 12 [129].There are four centralprocessors and four110 processors in the mainframe 1100/94. Two IntegratedScientific Processors (ISP) are attached to form near a supercomputer,
Instruction
The IBM G F l l is a modified SlMD computer conceived primarily for the numerical solution of problems in quantum chromodynamics [8]. The machine incorporates 576 floating-point processors, each with its own 2 Mbits of Mflops, giving the total machine memory, and capable 20 of over 1 Gbyte of memory and a peak processing speed of more than 11 Gflops. The floating-point processors are interconnected by a high-speed full Benes network, a nonblocking switch capable of realizing configurations incorporating any permutation of the processors. Using this switch, G F l l can be organized intoany of a number of different topologies, such as the rectangular mesh of any dimension and size, any torus, a hexagonalmesh, or some irregular organization matching perfectly with a special problem. The central controller broadcasts instructions to all processors and theswitch and communicateswith host a computer.
S P S U I
Interfaces-,
LSP LSP
7'1 I n k g r a d
& p
I
LBC
'
Scientific
P n
because each ISP isequipped with high-speed hardware for Modern Supercomputers both scalar and vector arithmetic computations. Sixteen Hardware million words of main storage are shared by ten processors System Supporting through multipledatapaths as shown in the schematic diaModel Structures gram. The major architecturaladvantage of thissystem lies Cray X-MP shared memory, in availability, reliability, and maintainability[l29]dueto its shared register clusters modular construction. What interests us here is the control of concurrent activities in the system. An activityis the unit communication ETA-10 of work scheduled for the ISP. Each activity is explicitly disbuffer, shared memory, patched by the1100/94 system. Three activity control mechanisms are used in an ISP: HEP-1 shared memory the mailbox, the status register, and the processor control hypercube iPSC box. The mailbox is used to convey information between interconnect 1100/94 system and the ISPs. Each ISP has a mailbox occumultiple rings Cyberplus pying eight words in the shared memory (theSPSU). When an activity is initiated, it contains a pointer to the processor control boxassigned for theactivity. Atactivitytermination, Multimax common budshared memory the mailbox is loaded with the contents of the hardware status register to report the termination status. The control concurrency control Alliant FW8 box, also residing in the shared memory, contains inforbus, shared cache/rnemory mation needed toprocess the activity. Another methodis tousethescoreboardoriginallyusedintheCDC7600series, hypercubdmesh Connection to match ready-to-run instructions with the available funcconnections Machine tional units in the processor. AI machine using Boltzmann Dynamic scheduling of concurrent processes depends virtual connections Machine on run-time conditions. Optimal solutions are impossible. butterfuly switching BBN However, there are many heuristics onecan choose from, network Butterfly such as first-in-first-out, round-robin, shortest-process-first, multistage network IBM RP3 or least-memory-demand-first, etc., [50], [65]. Static scheduling is determined at compile time, which is easier to SlMD network IBM GFll implement but potentially results in poor processor utilizations. The trace scheduling [44]of loop-free code offers multiple bus and FLEW32 such an approach. The activity control mechanisms being shared memory presented have beenpracticed in many machines. Of bus and shared Elxsi 6400 course, data-driven and demand-driven mechanisms could memory be considered for dataflow [29] and reduction machines hypercubeltransputer FPS-T [13T], respectively. At present, these new mechanisms are still in the research domain.
HWANG: PARALLEL PROCESSING WITH SUPERCOMPUTERS
Communication Mechanisms semaphore and shared variables shared variables and synchronization funaions tagged variables message passing (packet switched) message passing (packet switched) data transfer via bus synchronization vis special bus makedmessage passing value passing shared variables in distributed memory shared variables and message passing
SlMD broadcasting
message passing message passing OCCAM channel commands
1361
are achieved essentiallyby fourmethods: sharedvariables, message passing, marker passing, and value passing. These methods are implemented with shared memory, shared registers, communication buffers, concurrency control or interrupt buses, interprocessor connection rings, or networks as summarized in Table 6. Various interconnection networks for parallel processing have been summarizedin ~91,[1241. Sharedvariables are often storedin thecommon memory shared by many processors. Many tightly coupled multiprocessors use shared memory as a major means of communication among processors such as used in Cray Series and in the Encore/Multimax multiprocessor [9]. Besides using shared central memory, theETA-10 also uses a communication buffer for fast transfer of information among 1 0 processors, as illustrated in Fig. central processors and1 13 [36]. Taggedshared variables are used for synchronizing
concurrent processes in HEP [125]. The Cray X-MP, using shared memory for large data movement betweenprocessors, also uses clusters of shared registers for direct communications of semaphores or status information among processors, as illustrated i n Fig. 14. Synchronization, if not properly handled,may become a major barrier to parallel processing [5]. For an n-processor X-MP, the shared registers can be clustered inton 1nondisjointclustersof processors[21]. Each cluster has eight 24-bit shared addresses, eight W b i t shared scalars, and 32 I-bit semaphore registers. These shared registers allow direct passing of scalar data, and semaphores among 2,3, or 4 processors per cluster. There are 4 ports to each cluster for the 4 CPUs in the X-MP/4 system. The COS operating system dynamically allocates the clusters to CPUs. An allocated clustermay be accessed by theCPU in either user mode or supervisor mode. Note that any num-
U
Register
File
. .
(256 x 64 bits)
Fig. 13. The architecture of ETA-10 and its memory hierarchy. (a) System components. (b) Memory hierarchy. (Courtesy of ETA Systems.)
1362
2 3
El3 # ! E B
0
3
CLUSTER
1 2
0
3
CLUSTER
1 2
CLUSTER
0 1 2
E l 3
CLUSTER
0 1 2 3
Porta with the same numbers can be connected dynamically under system control.
Fig. 14. Clustering of shared registers for inter-CPU communication in the Cray X-MPI4.
ber of processors (2,3, or 4) can be clustered together to perform multiple tasks of a single job. Such a dynamic multitasking is the key improvement of X-MP over Cray 1. Message passing is popular amongloosely coupled, distributed-memory multicomputers,such as the Hypercubestructured iPSC [54], FPS T-Series [56], Ametek [3], NCUBE [IIO], and the ring-structured Cyberplus [43]. The passing of messages of arbitrary sizes and performing of complex operations on these messages demand quite powerful node processors. The hypercube topology has the advantage of being uniformly structured with log, N diameter. However, for a very large number of processor-memory nodes ( N ) , the contention and traffic congestion problems may introduceappreciablecommunication overhead amongthe nodes. The CDC Cyberplus chooses a less costly multi-ring structure for packet-switched message passing among the processors, as illustrated in Fig. 15.
The Cyberplus ring structure differs from the bus structure as in Multimax [9], Elxsi 6400 [IOI], and Balance 21000 carries 2n data packets simultaneously [132], in that the ring n processors. The traditional system in the ring which links bus can carry only onedata element at a time. The configuration consists of a maximum of four ring groups with 16 Cyberplus processors (accelerators) per group [43].Each group has two rings: the 16-bit system ring provides communications between the Cyberplus processors and the host Cyber processor, say a CDC Cyber 1801990, and the 16bitapplicationringprovidesdirectcommunicationsamong the Cyberplus processors themselves. Each Cyberplus processor has fourteen functional units for memory accesses and integer operations plus an optional five floating-point units. The nineteen functional units can operate in parallel under the coordination of a scoreboard which helpsset the crossbar to direct the flow of data among the functional unitsand initiate the simultaneous operationof all ofthefunctionalunits every machine cycle of 21 ns. This implies that each Cyberplus processor can potentially perform 620 Mips and up to 98 Mflops. Besides the dualrings, an additional memory ring can be added to provide direct memory-to-memory communications between local memories attached to the Cyberplus processors and the host Cyber processor. Sustained data transfer rates of 100 to 800 Mbytesk are built into the system. This multi-ring scheme offers directprocessor-to-processor and directmemory-to-memorycommunications, which is highlydesirable in parallel andarray processor systems. In Alliant FW8 (Fig. I O the ) , concurrency control bus is used for synchronizing activities in differentprocessors [I]. This idea was also previously used as an interrupt bus among the PDP-11 processors in the C.mmp project [142]. In a markerpassing system, the communication among processing cells (often with RlSC or bit-slice structures) i s done by passing single-bit markers [37. Each processing cell can handlea few marker bits with simple Boolean operations. These cells are connected bya hypercube as in the Connection Machine[58], in which all cells operate in lockstep synchronously under oneexternal control. The markers in a collection represent entities with a common property and are identified in asingle broadcast, thus synchronization can be avoided. Value passing systems pass around continuous numbers and perform simple arithmetic operations on these values. Multiple values arriving at a processor simultaneously are combined into a single value, hence contention will not happen and synchronization is unnecessary. Neuralnetworks and Boltzmann machines have been proposed to use this scheme. The marker passing and valuepassing are mainly used in Al-oriented processing [70].
in the
1363
ture in these systems represents state-of-the-art approaches to establishing efficient physical memory and extremely large virtual space for the next generation of supercomputers. The ETA-10 system is a shared-memory multiprocessor supercomputer, extended from the Cyber 205, which is a multi-pipelined uniprocessor system [98]. The architecture of ETA-IO is shown in Fig. 13(a). The system, consisting of eight central processors and up to 18 110processors under the coordinationof a service processor,is targeted to have a peak performance of 10 Gflops. All 27 processors have access to the large shared memory and the communication buffer. The memory hierarchy is shown in Fig. 13(b).Essentially, there are four levels of memory. The large register file is managed bythecompilers running on ETA-1O.Thecentral processor memory is local to each CPU. The shared memory (256 M words or2 Gbytes) sets the limit of the physical space for active files. The disk storage, which is so large that it contains trillions of bytes, is controlled by the 1 1 0 units. Virtual memory space consists of 238 M logical addresses. All user programs and a large portion of the ETA-10 system
code run in thisspace. The communication buffer, acting is not as a mechanism for interprocessor communications, adirect part of thevirtual memory support system. Instead, it provides fast locking and synchronizing functions. The Cedar system is one of the nations largest supercomputer projects under development in a Universityenvironment. Fig. 16 shows the hierarchical structure of the Cedar. Multiple levels of parallelism and dynamic adaptability to run-time conditions are the key features of this supercompter. The main objective of the project is to demonstrate that parallel processing can deliver high performance across a wide range of applications. The system is targeted to have eight clusters of processors; each cluster iscurrentlybuiltwithAlliantFX/8processorsrunningUNIX, as shown in the extended portion of Fig. 16 as well as previously detailed in Fig. 10.What we want to examine here is the memory hierarchy in Cedar. The global memory is shared by all clusters through the two-stage global switches. Thesearetwosetsof unidirectional crossbar switcheswhich are pipelined with input buffers. Within each cluster, there are3additional levels of memories: the registers in thecomSYSTEM
FX/ 8
FX/ 0
FX/ 8
CQlMATlONAL ELEMENT
Camcunmc* b n t r o l &n
Fig. 16. The Cedar research supercomputer prototype under development at the University of Illinois at Urbana [MI.
1364
tem, as a message-passingsystem with localized memories, putational element, the interleaved caches, and thecluster or as mixtures of these two paradigms. Furthermore, the memories. system can be partitioned into completely independent Communications among four levels are done through submachines by controlling thedegree of memory interthe cluster switch, the cluster memorybus, and the interleaving. cluster global switches.Thegloba1 memory is used for interFig. 17(b) shows how the global address space is distribcluster shared data and synchronization,for streaming longvector access and as a fast backup memory for cluster mem- uted across the processors. Part of each local memory is allocated to form the global memory. True local memory ory. Besidesthe standard FW8 hardware, the extra hardware is accessed via the cache without going through the interdeveloped at the University of Illinois includes the global connection network.The dynamic partitioning of memory interface within each cluster, the global switch, and the is determined at run time. Moving the locallglobal boundglobal memory. It would be interesting to watch the availary tothe far right makes RP3 a pure shared-memory ability of performance data, once the prototype enters the machine like the Ultracomputer. Moving it to the far left test run stage. An interactive supercompileris under develmakes it a pure local-memory multicomputer using mesopment, which emphasizes greatly the program restrucsage passing. Intermediate boundary positions provide a turing concept that has been pushed by Kuck and assomixed mode of computation. The architecture allows ciates for many years [W]. The target applications include shared-memory-oriented applications to allocate private the simulation of aerodynamic flows, dynamic structural data locally to improveefficiency, while message-oriented analysis, oil explorations, etc. applications use the global memory to balance the work The Research Parallel Processing Prototype (RP3) is being load. A hot spotis said to occur at a memory module if it undertaken by the IBM Watson Research Center [I131 in receives more than the average number of memory refconjunction with the NYU Ultracomputer Project [53], [121]. erences from the processors. Hot spots are expected to This experimental project aims at investigating the hardoccur, typically at shared-memory locations which contain ware and software aspects of highlyparallel computations. synchronization mechanisms, shared data, common The RP3 is an M l M D system consisting of 512 state-of-thequeues, etc., [108]. art 32-bit microprocessors with an RlSC architecture and a Recently, a new MlMD multiprocessor architecture has fast interconnection network (Fig. 17). The full configurabeen proposed independently by two research groups [73], tion will provide up to Mips or 800 Mflops. The system 1300 [123]. We call the architecture an Orthogonal Multiproceswill run ona modified version of BSD 4.2 UNlX operating sor (OMP) as shown in Fig. 18. An OMP consists of n prosystem. The RP3 can be configured as a shared-memory sys-
Interconnection
Network
local
addreas
global
memory
memory allocation between local and global memories. (a) RP-3 with 64 * 8 = 512 processors. (b) Dynamic memory allocation.
Node 0
Node 1
. . .
(b)
Fig. 18. The orthogonal multiprocessor (OMP) architecture for n = 4 processors (RB = Row Bus, CB = Column Bus)
r 7 1 1 .
Node 511
sequential/locd memory
cessors and n 2 memory modules connected with n column buses and n row buses. OMP is considered a partially shared-memory multiprocessor.Memorymodule Mii is shared byonlytwo processorsP,and P , . Each ofthediagonal modules serves as a local memory, i.e., Miiis accessible only by P,. The organization enbles parallel access of rowmem-
1365
ories or of column memories by all processors simultaneously. A bus controller coordinates the switching between the row buses (RBi) and the column buses (CBi) for the processors to access the memories orthogonally. This is particularlyattractive in implementing many parallel algorithms, which have the orthogonal memory access patterns, such as often found in matrix arithmetic,signal and image processing, linear programming, and solution of PDE problems. Detailed applications of the OMP for parallel processing can be found in [71]. D. IIO Multiprocessing and Front End Many supercomputing tasks are 1 1 0 bound, especially those in a real-time environment. High-speed CPU operations alone cannot solve such I/O-bound problems efficiently. We choose the 1 1 0 architectures in Cray 2, ETA-IO, Convex C-I, and Alliant FW8 to discuss how parallel processingcan beapplied to solve the 1 1 0 bottleneck problem. with Of course, many I/O problems can be greatlyalleviated the use of large centralmemory as described in Section IIIC. What we wish achieve to is to transform I/O-bound problems into CPU-bound problems with the practice of con1 0 operations. current 1 In Cray 2 [26], a foregroundprocessor is used to handle all OS and 1 1 0 f.unctions. Four QGbit communicationchannels are used to connect the foreground processor to four background processors, peripheral controllers, and common memory, as shown in Fig. 19. Data traffic travels directly
(256 Mworda)
Common Memory
interactive user jobs, 110, and other O S activities in parallel. Each IP is built around a Motorola 68012 microprocessor on a Multibuscard. The IP interfaces with the IP cache which provides access to global memory and to I/Odevices via the Multibus. A Multibus DMA device can transfer dataat 1.96 Mbitsls. High system I/O throughput is achieved with multiple IPS. Each IP can sustain a bandwidth of 3.92 Mbitsls to its associated cache. The 1 1 0 subsystem in Convex C-I is shown in the right side ofFig. 20. Fiveintelligent channelcontrol units (CCUs) are attached to a high-speed bus. Each CCU is an autonomous 1 1 0 processor with its own local memory and cache. OSfunctionsaredistributed between thecentral processor and the CCUs. Interrupts and device driver routines are executed bytheCCUs.TheMultibus I/Oprocessorcans u p port up to 18 controllers for a total of up to80 device controllers. The high-speed parallel interface taps the user with an 80-Mbit/s 110 bandwidth. The serviceprocessor unit (5PU)controlsdiagnostic program execution, initializesthe CPU's functional units, and handles the OS functions. The I/O architectures in the above systems indicate a clear trendindemandinghigher 1 1 0 bandwidthinmodern supercomputers. To match with the high I 1 0 data transfer rate, the I/O processors must use high-speed cache (such as in C-I and FW8) or a communication buffer (as in ETA-IO and Cray 2). These buffer memories form staged a memory system all the way from device controllers to the central memory. Concurrent I/O processing dem'ands the use of a large number of I10 processors or communication channels. The front end is often used to support OS and I/O operations and to coordinate the operations between the CPUs and the I10 subsystem.
Foreground Processor
The architecture of Cray 2 with four background processors for parallel computations under the supervision of a foreground processor which handles OS and I10 functions.
Fig. 19.
between the controllers and the common memory of256 Mwords. This I 1 0 section greatly alleviates the communication bottleneck with the outside world. Cray 2 uses an interactive OS based on the UNlXSystem V. Forty peripheral devicescan be attached to these high-speed 110 channels. In ETA-IO, 18 110 processors are used to perform concurrent I/O operations as described in Fig. 13(a). In Alliant FX18 (Fig. IO), 12 interactive processors (IPS) are used to offload the computational complex by executing
1366
0
1
I I
"
U
Iw
Fig. 20. The architecture of Convex C-1 with multiple functional units in CPU, five 1 1 0 processors, and a host processor. (Courtesy of Convex Computer Corp.).
method is initiated by the receiving processor using a drafting approach. Each receiving node sends requests forwork and exchanges load information only with neighboring processors. A process can migrate only once to void the possibility of circulating processes. The local and external loads can be in one ofthree states: H-load, N-load, and Lload. An H-load node is a candidate for migration processes. An N-load node does not participate in load balancing. An L-load node requests work from H-load nodes in its neighborhood usinga draffingprofoco/[106], such as using a draft age which is determined by the number of active processes handled by the processor. The drafting node will request process a from the node with the highest draft age among those responding.This is a very conservative policy, where migration takes placeonlyafter alengthy drafting protocol. 2 ) Sender-lnitiafed Load Balancing: This method is based on a gradientapproach, in which the roles of the idle and busy processors are reversed, so that the busy processors initiate the load balancing. The system exerts somecontrol over the behaviors of the nodes by setting the neighborhood diameter and a static system-wide threshold T. The threshold serves dual purposes: First, a node will begin exporting processes only when its local load exceeds the threshold. Secondly, the threshold determines the load information propagated by the node in a lightly sysloaded
tem. The sender-initiated method has the advantage of immediately beginning process migration as soon as a node enters a heavily loaded state. There is no necessity to wait for lightly loaded nodes to engage in a lengthy protocol before migration begins. 3) Hybrid Load Balancing: The receiver-initiated method acts in favor of heavily loaded systems, in which the network trafficcan be minimized by restricting to, at most, one migration per process. The sender-initiated method is in favor of lightly loadedsystems in which the load balancing process converges much faster. The hybrid method uses the receiver-initiated drafting when the system load becomes excessive and the sender-initiated scheme when the system load is light. Intuitively, this hybrid method will result in better balanced loads among the node processors. This is verified below by experimenting the scheme on the iPSC. The hybrid method requiresa distributed mechanism to manage the switching between sender-initiated and receiver-initiated modes of operation.Each node operates in eithersender-initiatedorreceiver-initiatedmodedepending onits local environment. All nodes are initialized to sender-initiated mode.A nodewill switch to receiver-initiated mode when more than a threshold number of its neighbors become heavily loaded. We refer to this threshold as the hybrid threshold H, to distinguish it from the
1367
migration is likely to occur more rapidly as the system threshold value T, that defines low and heavy load. If the reaches heavy load. The receiver-initiated method is then number of heavily loaded neighbors falls below the hybrid used to control the amount of migration and overhead. the threshold, the node switches back to the sender-initiated At larger grain sizes, this advantage is lost and the performode. mance of the hybrid method approaches that of the senderA parallel computer is scalable if its performance initiated method. The sender-initiated method has a perincreases linearly with a corresponding increase in the formancepeakaroundthegrain size50.At largergrainsizes of an number ofprocessors. Fig. 2?(a)shows the scalability there is less parallelism available and hence a consequent iPSC hypercube with four to sixteen processor nodes. The drop in performance. The receiver-initiated methodhas an best performance is obtained under the hybrid load balessentially flat curve. ancing. The lowperformanceofthereceiver-initiated The basic function of the thresholdlevelis to divide nodes method is attributed to the high overhead incurred during into busy and idle classes. A node with local load below the system initialization. Process migration must wait for load is busy. it The systemthreshold is idle, above the threshold information from busy processors before they can begin wide threshold level determines when load balancing will drafting. During initialization, only the root node is doing be activated.In Fig. 21(c), a widerange ofperformance valuseful work. The drafting method should be reserved for ues is displayed for the hybrid method under different the most heavily loaded stages of the computation. The threshold values. If the threshold is set too low, load values sender-initiated methodworkswellduringtheinitial stages are artificially high and the switch from sender-initiated to of computation when the load is light. This method tends receiver-initiated method takes place too soon. Process to produce excessive process migration. The ability of the propagation is cut off with consequent loss of parallelism hybrid method to switch to the less expensive receiver-iniand performance. The performance falloff at higher threshtiated method accounts for its better performance. old levels occurs because nodes appear idle when they are In Fig. 21(b), the performance curve for the hybrid method in fact supporting heavy computation. In all three methods, is monotonically decreasing. At smaller grain sizes, process
SPeedoP
16
I:::<:::::::*:::,
; I
t::::-::::,
A = Hybrid Method
= Senda-hitiakd
0 = Receiver-Initiated
15
1A 0 5
3
e
I
=
I
=
I
I I
8 10 12 14 18 16 h h o l dLml
(C)
20
Fig. 21. TheeffectofthreeloadbalancingmethodsontheperformanceoftheiPSChypercubewith16nodes[67].(a)Theeffeaofmachinesize.(b)Theeffectofprocessgranularity. (c) The effect of threshold levels used in trigger load balancing [ 6 7 ] .
1368
For multiprocessor systems the problem of data collection is exacerbated by the volume of material that may be produced during an execution. The traditional table-ofnumbers can easily swamp any persons ability to deterIV. PROGRAMMING AND APPLICATION SOFTWARE mine performance factors. One approach to solving this data using graphical techniques. problem is the display of We address next the issues related to the creation of a The second part of the data display problem is the sheer parallel programming environment for supercomputers. volume of material. It is impossible to view allthese data An ideal programming environment is outlined, in which in real time nor is itdesirableto run testcases multipletimes. the users are aided with system tools and visual aids for Thus we need a new medium thatcan store large volumes program trace, resource mapping, and analysis of data of data and have them displayed in a graphical form. structures. Then we examine various techniques needed to The goal is to design and construct a programming envienhance concurrent programming and thus system perronment that would be used by creators of parallel algoformance. These includealgorithm design,granularity rithms. The hardware would consist of a supercomputer tradeoffs, load balancing, resource sharing, deadlock coupled to a high-resolution, color display, a mouse, keyavoidance, synchronization schemes, asynchronous parboard, videodisk recorder, and player. The software conallelization, 1 1 0 behavior, memory contention, miscellatains support for color graphics monitor plus associated neous O S services, and software conversion considerawindowing software. There must be software for maniptions. Finally, we elaborate on multi-discipline interactions ulating the videodisk, both for recording and playback. All towards the development of efficient software packages in of these components would be obtained from current key application areas. industrial sources. Anidealprogrammingenvironment should be user-friendly. Listed below are the desired runA. User-Friendly Programming Environments time software supports in order tocreate such an environThe advent of multiprocessor or multicomputer systems ment. poses new problems for the software designers.Three ltwill display a map of the multiprocessor architecture problems stand out as crucial for optimizing performance that is available. Elements of the map can be examined of parallel algorithms. They are memory contention, probseparately using a zoom-in feature. lemdecomposition, andinterprocessor communication. For each functional element it will display an instanContention refers to the attempt to simultaneously access taneous picture of its activity. For each processor one memory by different processes. Problem decomposition can see its waivactive states, access to memory, and refers to the problem of allocating code to the various pro- 1 1 0 activity. cessors. lnterprocessor communication refers to the mesEach processor can indicate the code segment that is sages that must besent between processors to coordinate executing. Identical code segments are identified by their operation [5].We describe below several methods to color. It will permit the tracking of memory accesses, establish a programming environment that aids in the soluwhether there is local memory or shared memory or tion of these problems. both. A programming environment is a collection of software Collection of graphical data in the form of video images tools that can be used to develop software. Well-known are stored on disk. Videodisk segments can be refprogramming environments are UNlX & C, Interlisp, and erenced, examined, and played back. Smalltalk. With the advent of parallel architectures new Statistical routinesare available for analysis of the digissues in programming environment design arise. Thereare ital form of the data. The program can be edited, three key issues that must be adequately addressed. The recompiled, and re-executed in the environment. first is, what information is going to be collected. The secGiven such a visual programming environment, we ond is, how is the information recorded and displayed.The expect a balancedsystem where all processors appear to be third is what mechanisms should be provided to alter exeactive. Queues for memory access are nonempty and input cution. One must be able to monitor the performance of and outputprocessors are active. An unbalanced system is all functional units of the system. For many systems, a fixed immediately perceivedas one or moreprocessors are idle processor topology is used such as a ring, mesh, cube, cu be and memoryaccesses are wasted. Another information facconnected cycles, hypercube, etc. For these systems, probtor is the rate of transfer of the external media. These may lem decomposition to fit the network is essential. For other not be capable of sustaining the rate required by the prosystems, e.g., Cedar and RP3, it is possible to dynamically cessors. This should be visible in the environment. Another reconfigure the processors in different ways. In this latter common situation is when processors are divided intosevcase, problem decomposition is enhanced by the ability to eral levels. Each level may contain a cluster of processors reconfigure processor interconnectivity. that act in conjunction, but on different levels the procesTo help determine what should be displayed andaccessors behave entirely differently. Thus any attempt at balsible with respect to the executingsoftware, is to include ancing must also focus on the interaction across levels of on the screen all of the traditional views of a program. A processors. system such as PECAN presents on the screen: the program listing, data type schema, parse tree, symbol table, flow B. Techniques of Concurrent Programming graph, execution stack, and input-output dialog.The abilConcurrent programming is inherently a nondeterminity toinstantly view andaccess all of this informationsubistic process. The tradeoffs lie in vector processingversus stantially improves the programmers ability to understand parallel processing. Bucher [I51 has indicated that asynwhat is going on with his program. the drafting method shows a very poor performance and the hybrid method demonstrates its power in averaging the load among multipleprocessors [67].
1369
chronous parallelization of codes seems to be conceptually simpler than vectorization. Furthermore, synchronized parallelism i s much easier to implementthan asynchronous parallelism. A race situation may exist in parallel branches of an asynchronous algorithm. In support of parallel processing, resourcesharing must be employed. We have already addressed some resource control techniques in Section Ill; including the scheduling of parallel works, the interprocessor communication mechanisms, and load balancing techniques.In what follows, we discuss the use of counting semaphores for deadlockavoidance and process synchronization. Counting Semaphores are indivisible for lock, unlock, deadlockprevention,andsynchronizationofasynchronous processes. A simple definition of the P(S) and V ( S ) operators is given below, where S is a semaphore representing the availability of a specific resource type. P ( S ) : 1 IF S = 0 THEN COT0 1 ELSE S: = S - 1 V(S): S: = s 1
(4)
representing the state of some resource,monitor a contains also procedures that implement operations on that resource; and theassociated initialization code, which iniis called. tializes thevalues of variables before the monitor Monitors are used to implement mutual exclusion. The Concurrent Pascal [55] supports monitors for this purpose. System deadlock refers to the situation in a multiprocessor when multipleprocesses are holding resources and preventing each other from completing their executions. In general, adeadlock can be prevented,if one or more of the following necessary conditions are removed: a) Motualexclusions: Each process has exclusive control of its allocated resources. b) Nonpreemption: A process cannot release its allocated resources until completion. c) Wait for: Processes can hold resources while waiting for additional resources. d) Circular wait: Multiple processes wait for each others resources in a circular dependence situation. The example shown in Fig. 22 shows a circular wait situation among four concurrent processes. By modifying the
The value of S is typically initialized as 0 in thecreation of concurrent processes as exemplified below, where SIand S2 are two resource types being shared by two processes.
Process One Serial Work
Process TWO
P(S2) P(S1) P(S1) P(S5) P(S5)
Serial Work
P(S2)
P(S4)
P(S3)
P(S6) P(SG)
WS,)
*
RS,)
Parallel Work
P(S5) P(S3)
V(SS)
Parallel Work
P(S1)
P(S5)
P(SJ
Serial Work
V(SA
Serial Work
V(S1)
V(S3)
V(S2)
V(S5) V(S4)
V(S6)
V(S6) V(S1)
V(S1)
V(S5)
If the value of a semaphore is initialized as 1, it can be used as a lock for specifyinga critical section as follows: P(s) Critical Section
V(S3)
V(S2)
V(S5)
V(S)
Dijkstra has indicated three basic requirements for the execution of critical sections [30]. Knuth [83] has re-enforced statement 3) as stated i n 4).
1) At any given time, only one process is in the critical
section. 2) Stopping oneprocess outside the critical section has no effect on other processes. (b) (C) 3) The decision as to which process enters the critical Fig. 22. The resolution of a system deadlock among four section cannot be postponed indefinitely. processes. (Deadlock is prevented in (c) with no cycles on 4) Every process wanting to enter the critical section will theresourceallocationgraph.) (a) Four concurrentprobe eventually allowed to do so. cessessharingsix resourcetypes represented by semaphore variables S,, Sl, . . , S6. (b) Resource allocation graph corAnother form of shared variable for interprocess synrespondingto(a).Possibledeadlockwithexistanceofacycle. chronization is the use of monitors, a structured way of (c) Resource allocation graph modified from changing the request pattern in P , . No possibility of deadlock exists. implementing mutual exclusion. Beside having variables
1370
resource claim orderingin Process 4, the deadlock can be prevented, since there is no circular waitloop in the dependence graph, where SI, Sz, * , S6 are six resource semaphores being shared by the four processes. Each resource is assumed to have a single copy. Static deadlock prevention as outlined above may result in poor resource utilization. Dynamic deadlock avoidance depends on the runtime conditions, which may introduce a heavy overhead in detecting the potential existence of a deadlock. Although dynamic detection may lead to better resource utilization, the tradeoffsin detection and recoverycosts must be considered. At present, most parallel computers choose a static prevention method due to its simplicity to implement. Sophisticated dynamic avoidance or a recovery scheme for the deadlock problem requires one to minimize the incurred costs to justify for the net gains. The gain lies in better resource utilization, if those static deadlock prevention constraints are removed. To break a deadlock by aborting some noncritical processes should result in a minimum recovery cost. A meaningful analysis of the recovery costs associated with various options is very time-consuming. This is the main reason why sophisticated deadlock recovery system has not been built into current multiprocessors. The static prevention may be rather primitive, costs but very little to implement. Further research is needed tomake the dynamic recovery system more Cost-effective.
at hand. These approaches have been treated in [ a ] , [60], [65lt [1141. 1791, The mapping problem arises when the communication structure of a parallel algorithm differs from the intercon[12]. nection architecture of the target parallel computer This problem will be worsened when the number of processes created in the algorithm exceeds the number of processors available in the architecture. The implementation complexityof the algorithm depends on degree the of parallelism, granularity, communication and synchronization overheads, 1 1 0 demands, and other implementation overheads. Forregularly structured architectures, such as arrays [4], [92], [94],[1271, prisms [118], pyramids [134], hypercubes [122], hypernets [69], etc., the algorithmis more sensitive to architecturetopologyand machinesize.Agoodmatch may make a big difference in performance. Recently, several design methodologies for synthesizing parallel algorithms and VLSl architectures have been proposed in [12], [19], [93]. Program transformation is the key in establishing the perfect match. Algorithm designers must be aware that communication complexitycouldoverweigh thecomputational complexity if one adds the complexities from I10 and memory contention. The problem could be even more severe. This brings us to the concept of balanced parallel computations, in which the effective bandwidths of all subsystems are matched to yield the best possible performance.In amessage-passing multicomputer, the communication complexity may dominate the performance. The confronting C. Mapping Algorithms OntoParallel Architectures goal is to minimizeall the complexities coming from computation, communication,110, and memoryaccessin a balAlgorithm design usually involves several phases of anced manner. Simply chasing the bottlenecks will not be development: The physical problem must be first modeled sufficient if implementation efficiency becomes a major by a mathematical formulation such as differential equaconcern. tions or algebraic systems. In most cases, these equations or systems are defined over areal domain such as continWelist somecandidatealgorithmsfor parallel processing in Table 7. For algorithms which emphasize local operauous time functions. In order for the computer to solve tions, as seen in most low-level image and signal processing these systems, some form of discretization and numerical applications, the array (mesh) and pyramid are more suitapproximation must be employed. For example, in solving problems described by partialdifferentialequations (PDEs), able. For large-grain computations with global or unknown either finite-difference or finite-element methods can be dependence, the algorithm is better implemented on shared-memory multiprocessors. Mapping algorithms onto used in the discretization process. Then numerical schemes array processors has been treated by many researchers [19], are sought such as to employ iterative or direct methods in solving PDEs. Finally, one needs to partition the algo[74], [92], [93]. We illustrate in Fig. 23 a sequence of graph rithm in order to map with parallel or vector architectural transformations to convert a dataflowgraph into apipeline configurations. We concentrate below on thelast phase in net for vector processing of a compound vector function mapping of decomposed algorithms onto parallel archiconsisting of six operators. The numbers in the dataflow tectures. graph represent nodal and edge delays. The nodes corThree approacheshave been identified in designing parrespond to operations and the edges indicate the partial allel algorithms. The first converts a given sequential atgoordering relationships among the operations. Details of this rithm into a parallel version. This is often not a straightsystematic approach to designing pipeline nets can be forward process. Careful dataflow analysis must be found in [74]. employed to reveal all datdprogram dependencies. The Developing algorithms for multiprocessors or multicominhibitors for parallelization or vectorization must be puters demand a balanced partition of the algorithm. Parchecked to avoid inefficiency in the programming. Very titioned algorithms must be synchronized, because differoften the problem size does not match the machine size. ent granularities may exist in subprograms [Iq, [20], [&]. Algorithmic transformation may be needed to establish a Partitionedalgorithms can be further divided into preperfect match between the two sides of the problem. The scheduled or self-scheduled ones. A prescheduled algosecond approach is to inventa new parallel algorithm. This rithm i s allocated with resources at compile time. In order looks more difficult. However, it is often more efficient, to do this correctly, some prediction of the computation times and resource demand patterns of various subprobecause the new algorithm can be specifically tailored to grams must be known in advance. The self-scheduled algothe target machine. Most algorithm designers choose a combined approach by startingwith a parallel algorithm for rithms obtain their resources at run time. Such a dynamic allocation of resources depends on an optimizing scheda similar problem and then try modifying it for the problem
1371
Vectorization Algorithms and Category Vectodmatrix arithmetic matrix multiplication matrix decomposition conversion of matrices sparse matrix operations linear system solution eigenvalue computations least squares problems convolution and correlation digital filtering fast Fourier transforms feature extraction pattern recognition scene analysis and vision linear programming sorting and searching integer programming branch and bound algorithms combinatorial analysis constrained optimization probability distribution functions variance analysis nonparametric statistics multivariate statistics sampling and histogramming ordinary differential equations partial differential equations finite-element analysis domain decomposition numerical integration power series and functions interpolation and approximation searching techniques graph matching logic set operations transitive closures (d)
Fig. 23. Designinga pipeline netwith a sequence of graph
SignaVimage processing
Optimization processes
Statistical analysis
transformations on the synchronous data flow graph of a given compound vector computation [74]. (a) The given data flow graph. (b) After equivalent graph transformation. (c) After moving four units of edge delay into nodes. (d) The final pipeline net. problems, one can overlap computation withtransfer, data such as using double bufferingto isolate the input and output of a large volume of data movement. To reduce transfer time, extended main memory (such as the SSD in Cray2 and the communication buffer in ETA-IO) can be usedto close up the speed gaps. One can also trade serial I10 for smallgrain problems with parallelI/O for large-grain problems. The memory contention can be alleviated by skewed memory allocation schemes to avoid access conflicts or touse smart conflict resolution logic with multiported memory banks.
Partial differential
equations
uler in the OS kernel. All ready-to-run subprograms are queued to match with theavailabilityof resourcetypes, such as processors or memories. Another important class of algorithms is asynchronous algorithms, in which no forcedsynchronization is conducted amongconcurrentsubprograms[65],[91],[114].This is based on a relaxation principle, in which each processor will not wait for anotherprocessor to provide it with data. Instead, each processor works with the most recently available data wherever they may come from. Because of this relaxation,asynchronousalgorithmsaredifficuIttovalidate for correctness. Race conditions may exist to produce untraceable results. One can design a parallel algorithm which consists of synchronized and asynchronous portions. Such acombined mode can be constructed in a hierarchical manner, so that thebest match can be established between the programming levels and the underlying resource architecture. Recently, divide-and-conquer algorithms have been evaluated for parallel processing [62],
[130].
To overcome the difficulty associated with 110-bound
1372
Applications Their
~~
Special Applications linear solving systems of equations a seismic 3-D integration code for oil exploration forecasting weather short-term
a particle-incell simulation
program for plasma research experiments AIR 3D NASTRAN ELLPACK FUNPACK EISPACK MINPACK SPARSPAK
ESSL
L-l
mi
I
a Navia-Stokes code for 3-D aerodynamic simulation a PDE solver for finite-element analysis elliptic PDE solver routines for computing special functions and integrals solvers problem eigenvalue routines for unconstrained and constrained optimizations package for solving large-scale spare matrices an IBM library for vectodmatrix computations a core library supplied by Boeing Computer Services Center
a common math library jointly
BCSLIB
SIATEC
developed by several National Laboratories Math Advantage mathematical a package which is runable over 30 computer systems
putations. They are commonly known as the IMSL, PORT, NAG, and NATS [24]. IBM has developed the SSP (Scientific Subroutine Package), which was later used also in Univac and CDC machines. Today IMSL (International Mathematsupports most mainframesand ical and Statistical Libraries) superminis at various computer centers [76]. There are495 usercallable Fortran subroutines in IMSL, which perform basic mathematic functions, differential equations, matrix arithmetic,linearprogramming, etc. The PORT mathematical subroutine library emphasizing portability among heterogeneous machines was developed at Bell Labs [49]. The NAG (Numerical Algorithms Group) library was developed in both Fortran and Algol 60 versions [a]. Both PORT and NAG routines have been used in Cray 1benchmarking experiments. The NATS (National Activity toTest Software), backed by NSF and DOE, has promoted the development of several scientific software packages [14]. 2) Important ScientificSoftware Packages: The LINPACK i s a package for solving linearsystems whose matrices are general, banded, symmetric indefinite, symmetric positive definite triangular, or triangular [MI. The FUNPACK is a package of special function routines, such as Bessel functions, exponential integrals, etc. The EISPACK, a package for solving eigenvalue problems, consists of 58 subroutines. These routines are being altered to take account of
W
(b) Fig. 24. Application softwarepackagingand atypical mathematical software library. (Courtesy of Boeing Computer Services Co.) (a) Forming an application software package with modules selected from a software library. (b) Major software categories in the Boeing Software Library.
vector machine architectures and paging OS with Fortran 77 extensions. The MINIPACK is a systematized collection of software for unconstrained, linearly constrained, and large-scale optimizations. These PACKS are among many other application software packages that are currently in use [24]. 3) Software for Solving Differential Equations: Many scientifidengineering problems are characterized by ordinary differential equations (ODEs) or partial differential equations (PDEs). Mathematical software for ODEs includes the DEPAC developed at Sandia National Laboratories. Elliptic PDE problems are being attacked by theELLPACK and the FISHPAK for fluid dynamics modeling. Since many ODE or PDE problems resort to the solution of large-scale sparse matrices, the SPARSPAKandtheYale package[24] have been developed to meet this demand. In fact, efforts in devel-
WITH SUPERCOMPUTERS
1373
oping parallel PDE solvers are presently in progress at USC beminimized and software documentationshouldbe [75] and at Purdue [115] among many other efforts [109]. improved t o benefit both designers and users of parallell 4) Software Routines Supplied by Vendors: Major mainvector computers. frame andsupercomputer manufacturers have each develV. THEFUTUREOF SUPERCOMPUTERS oped some scientifidengineering routines. Cray offers SCLlLlB with functions microcoded for fast execution on We assess below recent advances in computer science thecraycomputers. Besidesusingthevectorizingcompiler and technologies in connection with the development of CFT, Cray application groupshave also developed impresthenext-generation supercomputers. In ordertoeffectively sive multitasking software tools [25]. Thesetools have been design and use ultra-powerful supercomputers, we call for applied in running some benchmark codes, such as the a multi-discipline development of a new science of parallel SPECTRAL for short-term weather forecasting. This multicomputation. The newcomputing technologies being tasked SPECTRAL code shows a speedup of 3.77 running on assessed include GaAs circuits, optical computing, and artia Cray X-MP14as compared with a Cray X-MPl1. IBM has ficial neurocomputers.Supercomputers are useful toolsto supplied the ESSL (Engineering and Scientific Subroutine improve the equality of human civilization. Due to hardLibrary) for vector facility attached to the 3090 mainframe warelsoftware limitations, theperformanceofexisting [138]. FPS has a math library of500 routines for their array supers lags behind the leading-edgedemand of computing processor families. The vendor-developedlibraries are power in science and technology. i n order to make new mostly machine-dependent, Portability is potentially a seriadvances in these leading edges, we demand a new genous problem with these packages. To help alleviate this eration of supercomputers, which are at least O(1d) to portability problem, the Math Advantage [126] was com0(104)faster than todays fastest machines. This demands mercially developed to supplyover 200 routines inFortran, a system which can execute O(104 to 0(1O1*) instructions C, and Ada versions runnableon over 30computer systems, per second [27, [136]. including some parallel and vector architectures. It has been projected that the fifth-generation super5) Software Development b y User Groups: The Boeing computers willemphasize massive parallelism and distribMathematical Software Library is being upgraded to suputed computing through a modular construction. The hardport CDC, IBM, FPS, and Craycomputers.The Boeing library warewill makefurther advances in packaging, interconnect structureisshowninFig.24(b),wheretheBCSLIBisthecore techniques, ultra large-scale integration three-dimensional library containingbasic computing modules and someutilIC design, GaAs or JJ technology, and optical interconnects ity software. The outer software includes thedense matrix and components. The recent breakthrough in superconsoftware form LINPAK, general sparse matrix routines, and ductor research may also have great impacts o n supercomother software routines mentioned earlier. This multilevel puter technologies. In the software area, concurrent lanlibrary structure is being optimized towards a vector library guages, functional programming, and symbolic processing for the CDC Cyber 205 and Cray machines through a conwill play an increasing role.These advances in parallel-proversion approach. For example, the CRAYPACK is being cessing hardware and software will make the future superdeveloped with Cray Assembly Languages (CAL)to exploit computers not onlyfaster but also smarter [@]. the Cray architecture. Another interesting effort is the SLATEC Common Math Library, which provides a means for A. Optical Computing and Neurocomputers several National Laboratories (Sandia, Los Alamos, AF Supercomputing technologies are presently advancing Weapons Lab., LLNL, NBS, etc.) to foster the exchange of in three areas: i) GaAs circuits for building the Cray 3 and software experiences in using supercomputers [16]. some 32-bit RlSC processors. ii)Optical crossbar interconMost of the above Fortran-coded softwarepackageswere nectsand electrooptical gate arrays. iii)Neural networksfor originally written for conventional scalar processors. Some a connectionist model of parallel computation. Besides software libraries have responded to the call for vectorinumerical supercomputing, increasing efforts have been zation andmultiprocessing. However, theconversion from exerted in developing Al-oriented supercomputers thatcan serial codes to parallel codes has been limited in isolated performmachineperceptionand reasoning applied to cases. These applications software libraries are presently knowledge engineering and intelligence processing. We taking very little advantage of the parallel hardware. assess below these new research directions and comment Attempts to achieve vectorization have been made more often than thosefor multitaskingor distributed computing. on their potential impacts to supercomputers and applications. This is due to the fact that vector uniprocessors appeared A technology road map is illustrated in Fig. 25. Presently, at least 5 years ahead of their multiprocessor extensions. most supers or minisupers are madein ECL or CMOS techHowever, this timelag in development is being shortened. and lower cost. However, nology duet o their high density Most minisupers extend their uniprocessor systems to multhe non-silicon devices (GaAsand HMET) are about five tiprocessor versions in less than 3 years. The extensions times faster than theECL family (50 ps versus 250 ps). GaAs from Alliant FX/1 to FX/8 [135] and from Convex C-1 to CXS is more radiation-resistant and has a wider operating temare good examples [23]. perature range (-200OC to 200C). The difficulties in GaAs Towards the parallelization and vectorization of appliare lesser density (3k gatesldie) and a lower yield in wafer cation software, most supercomputers choose the intelliproduction. The integration level of GaAs devices is presgent compiler approach or restructure the underlying algoently about 10 times lower and the cost about 100 times rithms or mathematical model used. One has to realize that higherthan in theirsiliconcounterparts. However,thetrend mathematical software i s application-driven. Many of the is becoming available in laboratoryenviisthatVLSi inGaAs above software libraries should be unified standardized or ronments [103]. The potential GaAs applications liein realtowards better portability, modelsharing, routine sharing, time signal processingand supercomputing in environand extended applicability. Proprietory restriction should
1374
1.0
I
I,
; l%l
8
.
0 . 1 H
O.SK l K
5K 1OK
SOKlmK
GATES/DIE
ment extremes [102],let alone the fact that Cray 3 uses mostly GaAs components. Optical computing would use photons in theservice of electrons. We may never build a pure optical computer. Most likely, we will build system a with electronic logic and optical interconnectionsin the near future. The use of optical crossbar network may increase the bandwidth of each linefromlOMbits/sinelectronicstolGbit/sinoptics.However, the setup time of each switch in thecrossbar may be 10 times longer in the optical network. Therefore, optics is useful whenmessages are long and there are massively parallel data transfers, such as those applied in parallel image processing. Other potential advantages of optics for supercomputing include inherent parallelism, no planar constraints, high communications bandwidth, reduced capacity loading, reduced crosstalk interference, and no pin-out limitations [ll], [120]. An electrooptical computer has been recently proposed at USC [l44]. The architecture, as depicted in Fig. 26,is based on the use of optical bistable gate arrays, programmable crossbar hologram, and the gate-level pipeline nets. Towards the eventual production of optical computers, integrated studies should be conducted among specialists on optical materials and devices and parallel architectures and algorithms. It is interestingtonotethat both parallelism and pipelining can play important roles in optical computing.
Neural circuits [61]have been suggested for supporting the connectionist model of parallel computation [371, [38] such as those needed inAI perception, vision, reasoning, and learning. The human brain has over O(lOo) neurons, each typically connectedto O(IO3) 0(104) other neurons. The connectionist model is inspired by brain theory and cognitive science. The model assumes thatknowledge (information) is stored in interconnection patterns among clusters of neurons. Massively paralleland collective operations take placein a neural network. Analog, digital, and photoelectric properties of silicon chipsare being utilized to build an artificial retina at Caltech. AT&T Bell Labs and TRW are implementing networks of electronic neurons. The Bell Labschipcontains256artificial neuronswithover64 OOO interconnections among them. Neurocomputers have beenattemptedwithartificial neural networks. In fact, there are three approaches: optical, electrooptical, and electronic as identified in [57l. The optical neurocomputers are constructed with optical PES with optical connectivity. Hologram and optical fiber interconnects were suggested. It has the potential t o support applications characterized by fixed transfer functions. The eleelectrooptical neurocomputers used electrooptical ments for the PES and optical interconnects. Both fully implemented and virtual electrooptical neurocomputers have been experimented with. The virtual connectionsare implemented witha small number of physical connections under program control. An electronic neurocomputer, called MarkIll, has been built and demonstratedat TRW [571. The Mark Ill chooses a multiprocessor architecture using 8 Motorola 68010 processors with virtual-processorinterconnectandweight stored in memory. The memory can implement 8100 virtual PES and 417 OOO interconnects among them. Any connection topology can be virtually established in the interconnect memory. Virtual electronic neurocomputers are attractive for implementing large neural networks. It has been estimated with concurrent silicon technology that a virtual electronic neurocomputercan be built with 100 million interconnects and l million PES. Neurocomputers may have a significant impact onsupercomputinginboth numerical and symbolic processing domains. A comparison of the major characteristics of electronic, optical, and is given in Table 9. It should be noted that neural computers artificial neuronsare made from electronics, optics, elecor trooptical devices. Bothopticalandneuralcomputers emphasize massive parallelism and fine granularity.
0
0
0
0
__I
0 0 0
m
0 0 0
c
0
Inputs
outputs
1375
s in
O(IO-~) s in electronic,
ism parallelism
Processor per processors granularity and system parallelism Communication bandwidth Physical integration Control synchronous clocking complexity
O(10)
- O(ld)Mbits/s - qld)transistors
chip with digital
at
O(1)
- O(ld)Gbitds -
chip silicon distributed and selforganizing technology dependent, low ranging from (using optics) to high (using electronics) inherently robust dueto cooperative operations
aid) W board per Power requirements O(lO)-MHz O(lO)-MHz rate clock clock
cations drive the development of particular parallel proin partial systems, cessing software. The barriers lie confirmation bias, and lack of performance spectrum and project database. Comparative analysis is needed to determine the effectiveness, robustness, efficiency, extensibility, economy, ease of use, reliability, and maintainability of parallel systems and to broaden their application domains. Of course, we need the cooperation between academic and industrial communities to distribute new discoveries and t o publish key concepts, principles, paradigms, and prescriptive terminologies.This could become one of the major thrusts in computer sciencelengineering education. The lack of vision in pushingthis thrust area will be the major barrier in achieving new advances in science and technology.
VI. CONCLUSIONS
We have evaluated the state of parallel processing on state-of-the-art supercomputers. Recent advances in supercomputer architectures, parallel processing software, resource and concurrency control, and concurrent programmingtechniquesarecategoricallyassessed.The intention is to bring together important issues and evaluate best answers to uphold performance. New research findings and high-risk hardwarelsoftware approaches are being transferred to the computer industry slowly. However, low-cost supercomputing is becoming a major driving force in hightech markets. This has triggered the thought of personal Supercomputers for the future. The rise of the computer industry in producing minisupers, mainframe extensions, array processors, scientific superstations, AI machines, optical computers, and artificial neurocomputers have had great impact on thedevelopment of futuresupercomputers. We conclude with the following observations: First, newtechnologies, innovative architectures, and more exotic computers willappear. Second, the cost of future supercomputers must be affordable in general-purpose applications. Third, a new generation of programmers must be trained t o achieve parallel processing. Fourth, we envision the emerging Science of Par-
1376
allel Computation, which will serve as the foundation of supercomputing and artificial intelligence [a], [70],[139]. ACKNOWLEDGMENT The author would like to thank J. Abarbanel and S. Mistry for typing the manuscript, and his Ph.D. students D. Kim, A. Louri,R. Chowkwanyun, H.C. Wang,Z.Xu,and J.Ghosh at USC for assisting in the preparation of the tables and illustrations. REFERENCES Alliant Computer Systems Corp., Alliant FX/Series: Product Summary.Acton,MA: June1985. G. S. Almasi,Overviewofparallelprocessing,Parallel Comput., vol. 2, no. 2, pp. 191-204, Nov. 1985. AmetekComputerResearchDivision,Concurrentprocessing on hypercube, Tech. Rep., Feb. 1987. M. Anaratone, E. Arnould, T. Gross, H. T. Kung, M. S. Lam, 0. Menziloioglu, K. Sarocky, andJ. A. Webb, Warp architecture and implementation, in Proc. 73th Anr;u. lnt. Symp. on Computer Architecture, pp. 346-356, June 1986. T. S. Axelrod, Effects of synchronization barriers on multiprocessor performance, Parallel Comput., vol. 3, no. 2, pp. 129-140, May 1986. K. E. Batcher, Design of a massively parallel processor, /E Trans. Comput., vol. C-29, pp. 836-840, Sept. 1980. BBN Labs., Butterfly Parallel Processor Overview, Version 1, Mar. 1986. 1. Beetam, M. Denneau,andD.Weingarten,The GFll supercomputer, in Proc.72thAnnu. lnt. Symp. on Computer Architecture, pp.108-115, June 1985. C. G. Bell, Multis: A new class of multiprocessor computer, Science, vol. 228, pp. 462-467, 1985. C. G. Bell, Parallelismadvancing computingtechnologies, Keynote Speech, ACMiComputer Society Fall Joint Computer Conf., Nov. 2-6, 1986. T. E. Bell, Advanced technology: Optical computing, /E Spectrum, vol. 23, pp. 34-57, Aug. 1986. F. Berman and L. Snyder, On mapping parallel algorithms into parallel architectures,]. ParallelDistributedComput., accepted to appear, 1987. A. J. Bernstein, Analysis of programs for parallel processing, /FEE Trans. Comput., vol. C-25, pp. 746-757, Oct. 1986. J. M. Boyle etal., NATS, A collaborative effort to certify and disseminate mathematical software, Proc. NCC, pp. 630635, 1972. I. Y. Bucher, The computation speed of supercomputers, in Proc. ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 151-165, Aug. 1983. B. Buzbee, The SLATEC common mathematical library, in W. R. Cowell, Ed., Sources and Development of MathematicalSoftware. Englewood Cliffs, NJ: Prentice-Hall, pp.302320.
1984. ical Software.EnglewoodCliffs,NJ:Prentice-Hall, [25] Cray Research Inc., Multitasking user guide, Tech. Note SN-0222, Feb. 1984. [26] Cray Research Inc., Tbe Cray 2 Computer Systems. Technical Brochure, 1985. [27] DARPA, Strategic computing: New-generation computing [28] [29] [30] [31] [32] [33] [34]
[35]
[42] [43]
technology, Tech. Rep., Defense Advanced Research Projects Agency, Oct. 1983. D. B. Davis,Parallelcomputersdiverge, High Technology, pp. 16-22, Feb. 1987. J. B . Dennis, Dataflow supercomputer,lEEEComputer, vol. 13, no. 11, pp. 48-56, Nov. 1980. E. W. Dijkstra, Solution of a problem in concurrent programming, Commun. ACM, pp. 569-570, Sept. 1965. U.S. DOD, Programming Language Ada: Reference Manual, vol. 106. NewYork,NY:Springer-Verlag, 1981. J. J. Dongarra, A survey of high performancecomputers, in Dig. Papers: /E Computer Society COMPCON, pp. 8-11, Spring 1986. J. J. Dongarraand I. S. Duff,Advancedcomputerarchitectures, Argonne National Lab., Argonne, IL, Tech. Rep., Oct. 1985. J. J. Dongarra and G. W. Stewart, LINPACK-A package for solving linear systems, in W. R. Cowell, Ed., Sources and Development ofMathematicalSoftware. Englewood Cliffs, NJ: Prentice-Hall, 1984, pp. 20-48. J. J. Dongarra, Performance of various computers using standardlinearequationssoftware in aFortranenvironment, in W. J.Karplus, Ed., Multiprocessors andArray Processors. San Diego, CA: Simulation Councils Inc., Jan. 1987, pp. 15-33. ETA Systems, ETA-IO system overview: Introduction,Tech. Note, Pub-1006, Revision A, Feb. 1986. S. E. FahlmanandG. E. Hinton, Connectionist architectures for artificial intelligence, /FEE Computer, vol. 20, pp. 100-109, Jan. 1987. J.A. Feldman and D.H. Ballard, Connectionist models and their properties, Cogn. Sci., vol. 6, pp. 205-254, 1982. T. Y. Feng,Asurvey of interconnection networks, /E Computer, vol. 14, pp. 12-27, Dec. 1981. -, Parallel processors and processing, ACM Comput. Surv., special issue, Mar. 1977. P. R. Fenner, The Fled32 for real-time multicomputer simulation, in W. J. Karplus, Ed., Multiprocessors and Array Processors. San Diego, CA: Simulation Councils, Inc., Jan. 1987, pp. 127-136. S. Fernbach, Supercomputers-Past, present, prospects, 1 . Future Generation Comput. Syst., pp. 23-30, 1984. M. W. Ferrante, Cyberplusand MapV interprocessor communications for parallel and array processor systems, in W. J. Karplus,Ed.,Multiprocessors and ArrayProcessors. 1987, pp. 45San Diego, CA: Simulation Councils, Inc., Jan.
A. L. Charlesworth and J. L. Gustafson, Introducing replicated VLSl to supercomputing: TheFPS-16441MAXscientific computer, /E Computer, vol. 19, pp. 10-23, Mar. 1986. M. Chen, A design methodology for synthesizing parallel algorithms and architectures, 1. Parallel Distributed Comput., pp. 462-491, Dec. 1986. S. C.Chen, J. Dongarra, and C. C. Hsiung, Multiprocessing for linearalgebraalgorithmsontheCrayX-MP/2: Experience with small granularity,]. ParallelDistributedComput., Aug.
1984. S. S. Chen,CrayX-MP-4series,CrayResearchpresentations, Aug. 1985.
54. [44] J.A. Fisher, Trace sceduling: A techniquefor global microcode compaction, /E Trans. Cornput., vol. C-30, no. 7, pp. 478-490, July 1981. 1451 M.J. Flynn, Some computer organizations and their effectiveness, /E Trans.Comput., vof. C-21, pp. 948-960, Sept. 1972. [46] B. Ford and 1. C. Pool, The evolving NAG library service, in W. R. Cowell, Ed., Sources and Development of MathematicalSohare. Englewood Cliffs, NJ: Prentice-Hall, 1984, pp. 375-398. [47l G. FOX, Questions and unexpected answers in concurrent P-288, Caltech.,Concurrent computation,Tech.Rep.C
[481
[49] [501 [511 [521
C. Y. Chin and K. Hwang, Packet switching networks for /E Trans. multiprocessors and dataflow computers, Comput., vol. C-33, pp. 991-1003, Nov. 1984. Convex Computer Corporation, Convex C1 series: XP processors, Richardson, TX, Tech. Notes Edition, 1987. W. R. Cowell, Ed., Sources and Development of Mathemat-
Computation Program, California Institute of Technology/ Jet PropulsionLabs., Pasadena, CA, June 1986. G.C.Fox and S. W. Otto, Algorithms for concurrent processors, Phys. Today, vol. AR-370, pp. 13-20, May 1984. P. A.Fox,A. D. Hall, and N. L. Schryer, The PORT mathMath. Software, ematical subroutine library, ACM Trans. pp. 104-126, 1978. D. Gajski and J. K. Pier, Essential issues in multiprocessor systems, /E Computer, vol. 18, no. 6, pp. 9-28, June1985. D. H. Gibson, D. W. Rain, and H. F. Walsh,Engineeringand scientific processing on the IBM 3090, ISMSyst. I., vol. 25, no. 1, pp. 36-50, Jan. 1986. I. R. Goodman and C. H. Sequin, Hypertree: A multipro-
WITH SUPERCOMPUTERS
1377
[79] cessor interconnection topology,/ E Trans. Comput., vol. C-30, no. 12, pp. 923-933, Dec. 1981. [53] A. Gottlieb, R. Grishman, R. Kruskal, C. P. McAalifte, K. P. 1801 Randolph, and M. Snir, The NYU ultracomputerdesigning an MlMD shared memory parallel computer, / Trans. Comput., vol. C-32, no. 2, pp. 175-189, Feb. 1983. [81] [54] J. Graham andJ. Rattner, Expert computation on the iPSC concurrent computer, in W. J. Karplus, Ed., Multiprocessors and Array Processors. San Diego, CA: Simulation [82] Councils, Inc., Jan. 1987, pp. 167-176. [55] P. B. Hansen, The programming language concurrent Pas[83] cal, / Trans. Software Eng., vol. SE-1, no. 2, pp 199-206, June1975. 1841 (561 S. Hawkinson, The FPS T series: A parallel vector super computer, in W. J. Karplus, Ed., Multiprocessors andArray Processors. San Diego, CA Simulation Councils, Inc., Jan. [85] 1987, pp. 147-156. [57l R. Hecht-Nielson, Performance limits of optical, electrooptical, and electronic neurocomputers, Tech. Rep., TRW Rancho Carmel AI Center,San Diego, CA, 1986. [a]W.D. Hillis, The ConnectionMachine.Cambridge,MA: [86] MIT Press, 1985. [59] C. A. R. Hoare, Toward a theory of parallel programming, in C.A. R. Hoare, Ed., Operating Systems Techniques. New 1871 York, NY: Academic Press, 1972. [60] R. W. Hockney, ParallelComputers. Bristol, Eng1and:Adam Hilger, Ltd., 1981. [88] [61] J. J.Hopfield and D. W. Tank, Computing with neural circuits: A model, Science, vol. 233, pp. 625-633, Aug. 1986. [62] E. Horowitz, Divideandconquer for parallel processing, [89] / E Trans. Comput., vol. C-30, no. 4, Apr. 1981. [63] C. Huson, T. Mache, J. Davies, M. Wolfe, and B. Leasure, The KAP-205: An advanced sourceto-source vectorizer for [90] the Cyber 205 supercomputer, in Proc. lnt. Conf. on Parallel Processing, pp. 827-832, Aug. 1986. [64] K. Hwang, Ed., Supercomputers: Design and Applications. Washington, DC: IEEE Computer Society Press, Aug. 1984. [91] [65] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. New York, NY: McGraw-Hill, 1984. [a]K. Hwang andY. H. Cheng, Partitioned matrix algorithms [92] for VLSl arithmetic systems, I Trans. Comput., vol. C31, pp. 1215-1224, Dec. 1982. [93] [67l K. Hwang andR. Chowkwanyun, Dynamic load balancing for messagepassing multiprocessors, Tech. Rep. CRI-87[94] 0 4 , USC Computer Research Institute, Jan. 1987. [68] K. Hwang and D. DeGroot, Eds., Parallel Processing for [95] SupercomputingandArtificial Intelligence. New York, NY: McCraw-Hill, 1988 (in press). [%I [69] K. Hwang andJ. Chosh, Hypernets: Acommunication-efficient architecture for constructing massively parallel com[97l puters, / E Trans. Comput. (Special Issue on Supercomputing), vol. C-36, Dec. 1987. [70] K. Hwang, J. Ghosh,and R. Chowkwanyun,Computer [98] architectures for AI processing, / E Computer(Special Issue on New AI Systems), vol.20, pp. 19-29, Jan. 1987. [71] K. Hwang, D. Kim, and P. S. Tseng, An orthogonal multiprocessor architecture for efficient parallel processing,/ [99] Trans. Comput., to be published. [72] K. Hwang, 5. P. Su, and L. M. Ni, Vector computer archiEd., tecture andprocessingtechniques, in M. Yovits, [lo01 Advances in Computers, vol. 19. New York, NY: Academic Press, 1981, pp. 115-197. [73] K. Hwang and P. S. Tseng, An efficient VLSl multiprocessor forsignallimageprocessing, in froc. / E lnt. Conf. on [loll Computer Design: VLSl in Computers (Oct.7-10, 1985), PP. 172-175. [74] K. Hwang and Z. Xu, Pipeline nets for compound vector supercomputing, I Trans. Comput., Jan. 1988. [75] K. Hwang, Z. Xu, and A. Louri, Remps: An electrcmptical
H. F. Jordan, Structuring parallel algorithmsin an MIMD, shared memoryenvironment, ParallelComput., pp. 93-110, May 1986. R. M. Karp and R. E. Miller,Propertiesof a model for parallel computations: Determinancy, terminating, and queuing, SIAM). Appl. Math., pp. 1390-1411, NOV.1986. W. J. Karplus, Ed., Multiprocessors and ArraysProcessors. San Diego, CA: The Society of Computer Simulation, Jan.
1987.
H.KashiwagiandK.Miura,Japanesesuperspeedcomputer project, Proc. Adv. Reactor Computation, Mar.1983. D. E. Knuth, An empirical study of Fortran programs, Software Pract. Exper., vol. 1, no. 2, pp. 105-133, Apr.-June, 1971. P. M. Kogge and H. S. Stone, A parallel algorithm for efficient solution ofa general class of recurrence equations, /E Trans. Cornput., vol. C-22, pp. 786-793, 1973. J. S. Kowalik, R. Lord, and S. Kumar, Design and performance of algorithms for MlMD parallel computers, in J.S. Kowalik, Ed., Proc. of the NATO Workshopon High Speed Computations. Berlin, W. Germany, Springer-Verlag, 1984, pp. 257-276. J. S. Kowalik, Ed., Parallel MlMD Computation: HP SupercomputerandltsApplications. Cambridge, MA: MIT Press,
1985. E. W. Kozdrowicki and D. J.Theis, Second generation of vector supercomputers, /E Computer, vo1.13, pp. 71-83, Nov. 1980. D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh,
supercomputer for parallel solution PDE problems, in Proc. 2nd lnt. Conf. on Supercomputing (May5-8,1987). [76] IMSL Inc., lMSL LibraryReferenceManual.Houston,TX:
IMSL, 1980. [77l lnmos Ltd., OCCAM Programming Manual. Englewood Cliffs, NJ: Prentice-Hall,1984. [78] H. F. Jordan, Performance measurement of HEP-A pipelined MlMD computer, in Proc. 7OthAnnu. Symp. on Computer Architecture, pp. 207-212, June 1983.
Parallel supercomputing today and the cedar approach, Science, vol. 231, pp. 967-974, Feb. 1986. D. J. Kuck, R. H. Kuhn,B. Leasure, and M. Wolfe, The structure of an advanced retargetable vectorizer, in Proc. COMPSAC, Oct. 1980. D. J. Kuck,A.H.Sameh, et a / . , Theeffectsofprogram restructuring, algorithm change, and architecture choice on program, in Proc. lnt, Conf. on Parallel Processing, pp. 129-138, Aug. 1984. H.T. Kung, The structureof parallel algorithms, in M. Yovits, Ed.,Advances in Computers, vol. 18. NewYork,NY: Academic Press, 1980, pp. 65-112. H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI), SlAM Sparse Matrix Proc., pp. 245-282, 1978. S. Y. Kung, VLSl ArrayProcessors.EnglewoodCliffs,NJ: Prentice-Hall, 1987. -, On supercomputing with systoWwavefrontarray processors, Proc. I , vol. 72, pp. 867-884, July 1984. J. Larson, Multitasking on the CrayX-MP/2 multiprocessor, / E Computer, vol. 12, pp. 62-69, July 1984. S. Levialdi, Ed., lntegratedTechnologyforParallellmage Processing.NewYork,NY:AcademicPress, 1985. K. C. Li and H. Schwetman, Vectorizing C: A vector processing language,]. ParallelDistributedComput., vol. 2, no. 2, pp. 132-169, May 1985. N. R. Lincoln, Technology and design tradeoffsin the cre ation of a modern supercomputer, /E Trans. Comput., vol. C-31, pp. 363-376, May 1982. 0. Lubeck, J. Moore, and R. Mendez, A benchmark comparison of three supercomputers: Fujitsu VP-200, Hitachi S810/20, and Cray X-MP/2, l Computer, vol. 18, pp. 1029, Dec. 1985. J. McCraw, S. Skedzielewski, S. Allan, D. Grit, R. Oldehoeft, J.R. W. Glauert, I.Dobes, and P. Hohensee,SISAL-Streams and iteration in a single assignment language, Lawrence Livermore National Laboratory, Tech. Rep., Mar. 1985. S. K. McGrogan, Modifying algorithms to achieve greater than linear performance improvementson the ELXSl 6400 multiprocessor, in W. J.Karplus, Ed., Multiprocessors and Arrayfrocessors. San Diego, CA. Simulation Councils, Inc., Jan. 1987, pp. 103-110. V. Milutinovic, N. Lopez-Benitez, and K. Hwang, A GaAsbased microprocessor architecture for real-time applications, /E Trans. Comput., vol. 36, pp. 714-727, June1987. V. Milutinovic, Ed., GaAs: Atechnology for environmental 19, Oct. 1986. extremes, Special Issue in /Computer, vol. L. M. Ni and K. Hwang, Optimal load balancing in a multiple processor system with many job classes, l Trans. Software E r g . , vol. SE-11, pp. 491-496, May 1985. -,Vector reduction techniques for arithmetic systems,
1378
lEEE Trans. Comput., vol. C-34, pp. 404-411, May 1985. 11061 L. M. Ni, C. Xu, and T. B. Gendreau, A distributed drafting algorithm for load balancing, /E Trans. Software Eng., vol. SE-11, no. 10, pp. 1153-1161, Oct. 1985.
[I121
tectural introduction, /E Computer, vol. 17, pp. 62-74, Mar. 1984. V. A. Norton and G. F. Pfister, A methodologyforpredictingmultiprocessorperformance, in Proc. 7985 lnt. Conf. on Parallel Processing, pp.772-781, Aug. 1985. J. M. Ortega and R. G . Voigt, Solution of PDEs on vector and parallel computers, SlAM Rev., vol. 27, pp. 149-240, June 1985. J. F. Palmer,TheNCUBE family of parallel supercomputers,W. J. Karplus,Ed.,MultiprocessorsandArrayProcessors. San Diego,CA: SimulationCouncils, Inc.,Jan.1987. G.Paul,VECTRANand the proposed vector/array extensions to ANSI FORTRANfor scientific and engineering computations, in Proc. ISM Conf. on Parallel Computers and Scientific Computations (Rome, Italy, Mar. 1982). R. H. Perrot, R. W. Lyttle, and P. S. Dhillon, Thedesign and implementation of a Pascal-based language for array processor architectures,]. Parallel Distributed Comput., Aug.
[lu)] Q. Stout, Divide and conquer algorithms for parallel image processing, 1. Parallel Distributed Comput., vol. 4, no. 2, Feb. 1987. (1311 P. J.Sydow,Optimizationguide,Tech. Rep. SN4220,Cray Research, Inc., 1983. Computer Systems, Guide to Parallel Program[I321 Sequent ming.Beaverton,OR: SCS, 1987. [I331 H. Tamura, S. Kamiya, and T. Ishigai, FACOM VP-100/200:
pp.
(1361
- [I37
1987. [I131 G . F. Pfister, W. C. Brantley, D.A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E.A. Melton,V.A. Norton,and J. Weiss,The IBM research parallel processor prototype (RP3): Introduction and architecture, in Proc. lnt. Conf. on Parallel Processing, pp.764-711, Aug. 1985. Com[I141 M. Quinn, Designing Efficient Algorithms for Parallel 1987. puters. New York: NY: McGraw-Hill, [115] J. Rice and R. F. Boisvert,Solving Elliptic ProblemsUsing 1985. ELLIPACK.NewYork,NY:Springer-Verlag, [116] J. P. Riganatiand P. B. Schneck,Supercomputing, lEEE Computer, vol. 17, pp. 97-113, Oct. 1984. [117 G.Rodrigue, E. D. Giroux,and M. Pratt,Perspectiveon large-scale scientific computation,/E Computer, vol.13, pp. 65-80, OCt. 1980. [I181 A. Rosenfeld, The Prism machine: An alternative to pyramid,]. Parallel Distributed Comput., vol. 3, no. 3, pp. 404411, Sept. 1986. [I191 J.Sanguinetti, Performance of a message-based multiprocessor, /E Computer, vol. 19, pp. 47-55, Sept. 1986. . Sawchuk,Thestate of optical computing, in Distin[120] A
ing, in Proc. 70th Annu. Symp. on Computer Architecture, pp. 372-378, 1983. J.Test, M. Myszewski,and R. C. Swift, The Alliant FWSeries: Automatic parallelismin a multiprocessor mini-supercomputer, in W. J.Karplus, Ed., Multiprocessors andArray Processors. San Diego,CA SimulationCouncils, Inc.,Jan.1987, pp. 35-44. E. A. Torrero, Ed., Next-Generation Computers. New York, NY: lEEE PRESS, 1985. P. C. Treleaven, D. R. Brownbridge, andR. P. Hopkins, Datadriven and demanddriven computer architecture, ACM Comput. Surv., pp. 93-143, Mar. 1982. S. G. Tucker, The IBM 3090 system: An overview, ISMSyst. I . vol. , 25, no. 1, pp. 4-19, Jan. 1986. L. Uhr,Parallel Multicomputer Architecturesfor Artificial Intelligence. New York, NY: Wiley-lnterscience, 1987. N. Wirth, Programmingin Modula-2. New York, NY: Springer-Verlag, 1982. J. Worlton, Understanding supercomputer benchmarks, Datamation, pp. 121-130, Sept. 1984. W.A. Wulf, R. Levin, and S. P. Harbison, Hydra/C.mmp:An Experimental Computer System. New York, NY: McGrawHill, 1981. 2. Xu and K. Hwang, Molecule: A language construct for layereddevelopmentofparallelprograms, lEEE Trans. Software Eng., to appear in 1987. Z. Xu, K. Hwang, and K. Jenkins, Opcom: An architecture for optical computing based on pipeline networking, in Proc. 20th Annu. Hawaiilnt. Conf. on Systems Sciences, pp. 147-156, Jan. 1987.
guished Lecture Series, USC Computer Research Institute, Jan. 29,1987. [I211 J.T . Schwartz, Ultracomputers, ACMTrans. Programming Languages Syst., pp. 484-521, Apr. 1980. [122] C. L. Seitz, The cosmic cube, Commun. ACM, pp. 22-33, Jan.1985. multipro[I231 I. D. Shersonand Y. Ma,Orthogonalaccess cessing: An architecture for numerical applications, Proc. 8th Symp. on Computer Arithmetic (Lake Como, Italy, May
18-21,1987). [124] H. J. Siegel, Interconnection Networks for Large-ScalePar1984. allel Processing. Boston, MA: Lexington Books, [I251 B. J. Smith, Architecture and applications of the HEP mul[I261
tiprocessor computer system, Real-Timesignal Processing lV, VOI. 298, pp. 241-248, Aug. 1981. W. R. Smith, Quantitative technology corporation math advantage: A compatible,optimized math library, in W. J. Karplus, Ed., Multiprocessors a n d Array Processors. San Diego, CA Simulation Councils, Inc., Jan.1987. L. Snyder,SupercomputersandVLSI, in Advances in Computers. New York, NY: Academic Press, pp. 1-33. D.8. Sol1,Vectorization andvector migrationtechniques, Tech. Bull., IBM Publishing Services,225 Carpenter East, Irving, TX, June1986. Sperry Corporation, Sperry Integrated Scientific Processor Box 64942, St. Paul, MN 55164, 1985. System. P. 0.
1379