
Heterogeneous parallelization of the Linkmap program

2000, Proceedings 2000. International Workshop on Parallel Processing

Sequential genetic algorithms have many successful applications in very different domains, but they have one main drawback: evaluations are very time-consuming. For example, certain likelihood calculations for a pedigree of fifty-five nodes take about seventy minutes on a DEC Alpha processor and about two hundred and seventy minutes on a 166 MHz Pentium. This time increases exponentially with the size of the pedigree. To overcome these shortcomings and to study new models of higher efficiency, parallel platforms are being used for genetic programs. LINKAGE is a software package for performing genetic likelihood calculations; FASTLINK is an improved, faster version of it. This paper provides a parallel implementation of the 'Linkmap' program (one of the four programs in LINKAGE/FASTLINK) for a heterogeneous environment, using a static and a dynamic strategy for task allocation. The speedup obtained with the dynamic strategy was close to the estimated maximum speedup.

Aaditya Rai*, Noe Lopez-Benitez*, J. D. Hargis†, S. E. Poduslo†

*Computer Science Department, Texas Tech University, Lubbock, TX 79409, n.lopez-benitez@ttu.edu
†Division of Neurology, HSC, Texas Tech University, Lubbock, TX 79430, neusep@ttuhsc.edu

0-7695-0771-9/00 $10.00 © 2000 IEEE

1. Introduction

Linkage analysis uses information from family pedigrees to map genes and to locate disease genes on particular chromosomes. The recombination fraction, denoted θ, is a measure of the frequency of crossing over between two loci. If θ < 0.5 between two loci, they are considered linked. If θ = 0.5, the two loci are considered unlinked. When θ = 0, the two loci are at the same location.

In genetics, likelihood is defined as the probability with which given observations (phenotypes) occur under a specific chosen model. Data from small pedigrees are analyzed quickly. However, the use of large family pedigrees has greatly increased the computational time needed for analysis. Better algorithms and parallel computers would make computational analysis more efficient. Several computer packages have been written for these linkage computations, and most published linkage studies use one of these programs. LINKAGE is a sophisticated analysis package that evaluates the likelihood of a given pedigree under different assumptions about the recombination fraction between two loci [1]. It contains four related programs for the probability computations: LODSCORE, ILINK, LINKMAP and MLINK [2]. Sequential algorithmic improvements have been made to this package, and these improvements are now integrated into FASTLINK [3].

While sequential genetic algorithms have many successful applications, a drawback in their use is that evaluations are still very time-consuming. A pedigree consisting of fifty-five nodes takes about seventy minutes on a DEC Alpha processor and about two hundred and seventy minutes on a 166 MHz Pentium for likelihood calculations. This time increases exponentially with the pedigree size. To overcome these shortcomings and to study new models of higher efficiency and efficacy, parallel platforms are being used for analysis.

The placement of markers is an important step in locating a disease gene. Placement generally starts with two markers (A and B) whose recombination fraction is known from another program. A third marker (C) is then placed in relation to the two known markers, followed by a fourth marker (D), and so on. Once all of the markers used for the analysis are placed in the correct order, the disease gene is placed. If the markers are not in the correct order, the disease gene will also be misplaced. Linkmap computes the likelihood for each possible position of the new marker in each interval along the map.

Parallelization can be achieved at two levels: at an algorithmic level, which improves the basic strategy of the statistical calculations, and at a code level, where the code is examined for potential parallelism [4]. TREADMARKS is a parallel programming system developed at Rice University that allows programs written in a shared-memory style to run on a network of Unix workstations [5]. TREADMARKS assumes a shared-memory, homogeneous environment. In our study we focused on formulating a parallel model of LINKMAP in the FASTLINK program, using code-level parallelization. The model concentrates on distributing the likelihood computations for different recombination fractions over different computers. The parallel model was implemented using a message-passing interface [6]. Performance prediction and evaluation of the parallel model were also carried out by simulating its execution on the TG-GUI (Task Graph - Graphic User-Interface). The TG-GUI is a user interface designed to evaluate task graphs [7] using simulation [8] or a numerical tool [9], and it is useful for evaluating the different heuristics described by the corresponding implemented algorithm.

Task graphs are a well-known tool for studying performance issues of complex jobs [8]. The edges of a task graph determine the precedence relations that govern the order of execution of the individual tasks; precedence relations can be associated with data or control dependencies. When the task graph is executed on the processing elements of a heterogeneous computing system, estimating the overall completion time becomes an optimization problem: tasks must be allocated to processors such that the completion time is minimized. The purpose of a task graph representation is two-fold: (a) to predict performance under different allocation heuristics and (b) to exploit possible parallelism in a heterogeneous computing environment.
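
To make the allocation problem concrete, the following minimal sketch computes the completion time of one given allocation of independent tasks on heterogeneous processors. It is an illustration only, not the TG-GUI: the names `completion_time` and `MAX_PROC` are hypothetical, tasks are assumed to have no precedence edges, and communication cost is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROC 16  /* illustrative upper bound on cluster size (assumption) */

/* exec[t * nproc + p] is the estimated run time of task t on processor p;
 * alloc[t] is the processor chosen for task t.
 * Returns the time at which the busiest processor finishes (the makespan).
 * Assumes independent tasks and zero communication cost. */
double completion_time(size_t ntasks, size_t nproc,
                       const double *exec, const size_t *alloc)
{
    double busy[MAX_PROC] = {0.0};
    double makespan = 0.0;

    assert(nproc > 0 && nproc <= MAX_PROC);

    /* Accumulate the work assigned to each processor. */
    for (size_t t = 0; t < ntasks; t++) {
        assert(alloc[t] < nproc);
        busy[alloc[t]] += exec[t * nproc + alloc[t]];
    }

    /* The slowest-finishing processor dictates the completion time. */
    for (size_t p = 0; p < nproc; p++)
        if (busy[p] > makespan)
            makespan = busy[p];

    return makespan;
}
```

An allocation heuristic can then be judged by the value this function returns for the allocation it produces: a smaller value means a better allocation for the same task set.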

The availability of a task graph representation of any program in the LINKAGE package has permitted its evaluation under different heuristics. Typical heuristics [10, 11] that can be used in the evaluation are the following:

Largest Task First (LTF): This heuristic assigns the task with the largest execution time to a randomly selected processor and marks that task as assigned. A task from the remaining task list is then selected for the next assignment, until all the tasks are assigned to processors.

Shortest Estimated Execution Time First (SEETF): A task is selected at random from the task set and assigned to the processor that executes the task in the minimum time.

Minimum Finish Time (MFT): This heuristic takes into account the precedence constraints on the tasks. A randomly selected task Ti is assigned to the processor Pj that minimizes the finish time of the task in a deterministic simulated execution, where the finish time of a task is given by the minimum sum of the execution time and the next instant at which processor Pj becomes free.

Efficient scheduling of application tasks is critical to achieving high performance in parallel systems. The objective of scheduling is to map tasks onto processors and order their execution so that task precedence constraints are satisfied and the minimum schedule length is obtained [12]. Thus, various schemes need to be evaluated before a model can be implemented. Profiling tools were used to identify the precedence relationships and the flow of data in the sequential code, in order to explore potential parallelism and formulate a parallel model. "Profiling" involves the analysis of source code for potential performance/time bottlenecks by identifying those sections of the code that are likely to take more time to execute [13]. The profiling tool also reports the number of times a function is called.

2. Methodology

Two activities in parallel programming need special consideration to attain high efficiency: partitioning the program into concurrent components and mapping these components onto the available processors [14]. Partitioning involves dividing large code into components that may execute concurrently, with the primary goal of achieving scalability. Determining how a partition scales gives an indication of the maximum number of computers that can be employed to solve a problem [15]. Domain decomposition is a partitioning technique that is useful when a problem is primarily concerned with a large, regular data domain. Data decomposition is used for distributing the likelihood computations for different θ vectors over different computers. Using the latter, we divided the data (θ vectors) and operated on these parts (performed likelihood calculations) concurrently.

The goal of achieving higher granularity, i.e., keeping task interaction to a minimum, led to choosing a specific loop (the outermost loop) from a nest of loops in the LINKMAP program. Profiling was required to identify such loops. Sun Workshop's Looptool takes a "<binary>.looptimes" file as a parameter to draw the loop graph chart [16]. This file was created by compiling the source file with the "-Zlp" compiler option and then executing the resulting binary; i.e.,

    Source.c --(compile with -Zlp)--> Binary --(execute)--> Source.looptimes

Sequential LINKMAP was compiled in this manner and executed on a SPARC machine. A "linkmap.looptimes" file was created and a graph was obtained from it. There were three nested loops of interest: the loops at line 488 and line 730 in the file linkmap.c, and the loop at line 1546 in the file comlike.c [17]. The first two loops took about 99% of the total runtime, whereas the third loop took about 80%.
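
As a rough illustration that is not part of the original analysis, the profiler figure can be read through Amdahl's law. Assume, as a simplification, that the outermost loop accounts for a fraction f ≈ 0.99 of the sequential runtime and that its iterations divide perfectly among P identical processors; the symbols f, P and S are introduced here only for this sketch.

```latex
S(P) = \frac{1}{(1-f) + \dfrac{f}{P}}, \qquad
S(2)\Big|_{f=0.99} = \frac{1}{0.01 + 0.495} \approx 1.98, \qquad
S(3)\Big|_{f=0.99} = \frac{1}{0.01 + 0.33} \approx 2.94
```

Under these assumptions the serial remainder costs very little at small processor counts, which is consistent with choosing the outermost loop as the unit of distribution.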

The percentage of time taken by the rest of the loops was too small to provide any gain if parallelized. Note that the percentage figures given by the profiler are cumulative, i.e., the time reported for a loop includes all the time spent in its inner loops, which is why the percentages do not add up to 100%. Nevertheless, the information given by the profiler identifies the loops that are computationally intensive. Analysis was concentrated on the first two loops, which are the two outermost loops of the nest and are of the following form:

    /* enclosing function in linkmap.c */
    {
        Loop1: do {
            ...
            iterpeds();
            ...
        } while (<condition>);
    }

    iterpeds()
    {
        Loop2: for ( ) {
            ...
            likelihood();
            ...
        }
    }

Thus, effectively there is a nest of two loops, which could be expressed as:

    For all vectors
        For all pedigrees
            Perform likelihood calculation

where the outermost loop (loop1) calls the function iterpeds(), which is in turn responsible for the actual likelihood computations. The inner loop (loop2) iterates over the pedigrees, i.e., a likelihood calculation is performed for each pedigree. For a small number of pedigrees the number of iterations was also very small. Initial estimates revealed that distributing the inner iterations would incur synchronization overhead large enough to offset any gain achieved.

Analysis of the outer loop (loop1) showed it to be a do-while loop that computes the θ vector (from the user-given data set) and then estimates the maximum likelihood for this vector over all pedigrees. These calculations are performed by the function likelihood(). The loop also calculates the loop-breaking conditions dynamically, i.e., in each pass, after the likelihood calculation, it checks whether the breaking condition has been reached. Functionally, the loop is of the form:

    do {
        Calculate θ-vector;
        Update probabilities for this θ-vector (likelihood());
        Update condition for stopping;
    } while (condition is valid);

To allocate the iterations of a loop, the total number of iterations the loop goes through for a given dataset must be determined in advance. It is also necessary to know beforehand the vectors for which the likelihood is to be calculated, so that different θ vectors can be allocated to different processors. For this purpose, the "while" loop was simulated to predetermine the number of iterations. Finally, the original "do while" loop was converted into a partitionable "for" loop of the form:

    for (theta_cnt = 0; theta_cnt < num_of_iter; theta_cnt++) {
        /* Load the vector calculated by the simulator() for this
           iteration into gtheta[] to provide the same environment
           to the rest of the loop */
        for (k = 0; k < mlocus - 1; k++)
            gtheta[k] = parArray[theta_cnt][k];
        likelihood(gtheta);
        ...
    }

Before the "for" loop could be partitioned and its iterations distributed over different computers, data dependency analysis was performed. The removal of data dependencies required the introduction of some extra variables of the same type as the data involved in the dependency. To verify that dependencies were removed, iterations were run in no particular order and the results were compared with those obtained from a sequential run. The two sets of results were exactly the same. Note that this random-iteration technique does not guarantee independence. For example, an array addition of the form

    for (i = 0; i < 6; i++)
        Var = array[i] + Var;

would yield correct results even if the iterations were executed in random order, but may give incorrect results when different iterations are run on separate computers simultaneously. However, in our case no such statements were found in the loop or in the functions it calls, and thus all iterations were independent of each other.
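
The two-phase conversion described above can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the FASTLINK code: `precompute_thetas`, `toy_likelihood`, `run_iterations` and `MAX_ITER` are hypothetical names, a single scalar stands in for each θ vector, and the stopping rule (a fixed step up to an upper bound) is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITER 64  /* illustrative capacity of the precomputed table */

/* Phase 1: replay only the loop-control part of the do-while loop so that
 * the number of iterations and every theta value are known in advance.
 * The stopping rule used here is an assumption made for the sketch. */
size_t precompute_thetas(double start, double step, double stop,
                         double out[MAX_ITER])
{
    size_t n = 0;
    double theta = start;

    assert(step > 0.0);
    do {
        out[n++] = theta;   /* the value this pass would have used */
        theta += step;      /* "update condition for stopping" */
    } while (theta <= stop && n < MAX_ITER);
    return n;
}

/* Placeholder for the real likelihood() call; any pure function works. */
static double toy_likelihood(double theta)
{
    return 1.0 - theta * theta;
}

/* Phase 2: a counted loop whose iterations can run in any order, because
 * iteration `it` reads only thetas[it] and writes only results[it]. */
void run_iterations(const double *thetas, size_t n,
                    const size_t *order, double *results)
{
    for (size_t i = 0; i < n; i++) {
        size_t it = order[i];
        assert(it < n);
        results[it] = toy_likelihood(thetas[it]);
    }
}
```

Running `run_iterations` with a forward and a reversed `order` array and comparing the two result arrays mirrors the random-order check described in the text. As the text notes, such a check can expose some dependencies but does not prove that none exist.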

To formulate a parallel model of LINKMAP, the code was represented as one task from the point where the program begins its execution, i.e., the main() function in the file commoncode.c in FASTLINK, to the point where it reaches the 'for' loop (function iterpeds() in the file linkmap.c). Then the iterations of the 'for' loop were represented as 'n' independent tasks. Finally, all these iteration-tasks join to end the program execution. Thus, the LINKMAP program can be expressed in terms of the following pseudocode:

    main()
    {
        /* Task T0 */
        PreLoop_code;
        For loop with 'n' iterations;    /* 'n' independent tasks */
    }

The corresponding task graph is shown in Fig. 1. The number of concurrent tasks T1 ... TN shown in Fig. 1 varies from data set to data set. The task T0 initiates tasks T1 ... TN and collects their results. In linkage analysis, a data set involves two files: a .ped file that contains the pedigree details and a .dat file that contains the recombination fraction vector details [1].

Figure 1. Task graph of a parallelized Linkmap

Implementation involves distributing concurrent tasks over a cluster of available computers. MPI was used to explicitly send tasks to, and execute them on, different processors.

In a first implementation of the above 'for' loop, the number N of iterations of the loop was divided by the number P of available processors, and sets of N/P iterations were formed. Each processor was then allocated one set of iterations. This allocation is complete only when the number of iterations is a multiple of the number of processors, which is rarely the case. The remaining N - (⌊N/P⌋ × P) = N mod P iterations are therefore allocated to consecutive processors, starting again from processor zero. Since N mod P < P, in this second pass at most P-1 processors get one extra iteration. The advantage of this strategy is that it is fairly simple to design and implement, and it is suited to an environment where all computers have homogeneous speeds and loads. Its drawback is that the heterogeneity in speed and load of the different machines in a network is not taken into consideration. In a heterogeneous system, allocating an equal number of iterations to all machines limits the overall efficiency of the partition, because the total time is determined by the slowest machine executing its set of iterations. Moreover, the load at each machine changes with time. Thus, this strategy may end up with the slowest processor still performing calculations while much faster processors are idle.

In a second implementation a "manager-worker" dynamic style of task allocation was used [4]. Instead of allocating sets of iterations to the processors, one processor is made a manager (it executes T0) and allocates exactly one iteration to every other processor (worker) in the cluster. The workers execute their iteration and, as soon as they finish, send the results to the manager along with a request for another iteration. The manager keeps allocating iterations in this manner until all iterations are executed. This ensures that a more efficient processor gets a larger number of iterations, and a processor with the greatest load automatically performs fewer iterations. One problem with this strategy as such is that the manager does not participate in the calculations. A variation of this strategy was therefore implemented, in which the manager calculates along with the workers and the fastest computer in the cluster is assigned as the manager. To this end, a thread was also created on this machine to perform calculations. This implementation is outlined as follows:

    Total number of iterations = N
    Total number of processors = P
    My rank in the cluster = my_rank
    Number of results received by processor zero = recvTheta
    Number of vectors allocated to workers = thetavector

    Start()
    {
        if (my_rank != 0) {
            while (1) {
                Receive thetavector or "stop-signal" from processor 0;
                if ("stop-signal" received) break;
                Calculate likelihood for this vector;
                Send results to processor 0;
            }
        }
        if (my_rank == 0) {
            thetavector = 0; recvTheta = 0;
            for (procs = 1; procs < P; procs++) {
                Send thetavector to procs;
                thetavector++;
            }
            while (1) {
                Receive results from any of the workers;
                recvTheta++;
                Store results received in the output file;
                if (thetavector < N) {    /* vectors remain to be allocated */
                    Send thetavector to the worker which sent results;
                    thetavector++;
                }
                if (recvTheta == N) break;
            }
            Send "stop-signal" to all worker processors;
        }
    }

The first 'if' block is executed by the worker processors until no more θ vectors are received from the master, which has rank zero. The second 'if' block is executed only by processor zero. It initializes the indexes, sends vectors to the different processors, and collects results from the worker processors. Once the master has received N results, it sends a stop signal to all workers. The manager and a worker thread were created on processor '0' by specifying this in an MPI configuration file. If this file had two entries for one processor, two threads were created on that computer by MPI, and both threads were treated as individual processes with separate ranks in the cluster. Therefore, the worker thread on processor '0' executes iterations under "if (my_rank != 0)" and the manager thread under "if (my_rank == 0)". This strategy adjusts itself to the heterogeneity of the machines available.

3. Results

Two types of experiments were conducted. First, the parallel model was simulated using the TG-GUI under three different allocation heuristics for all possible combinations of processors. Second, the actual implementation was carried out for the two different strategies (an ad-hoc static and a dynamic allocation), and the results obtained were compared with the sequential run. Three static allocation heuristics were simulated using the TG-GUI. The dynamic and ad-hoc static heuristics were applied under various machine combinations and different load conditions. The ad-hoc static heuristic was also simulated and compared with actual results. The near-equal execution times in this case validate the accuracy of the simulation tool.

Experiments were conducted on six computers: three DEC Alphas (machines 1, 2, and 3), one dual Pentium processor (machine 4), an Intel Pentium 200 MHz (machine 5) and a SUN SPARC (machine 6). Two data sets were used for the experiments. The first data set had one family (one pedigree), 55 family members, and 21 markers. LINKMAP was run by placing marker 1 between or beside markers 17, 14 and 9. The measured average execution times of all the tasks in the first data set on all the processors are presented in Table 1. The dual Pentium Pro multiprocessor had the fastest execution time, while the Sun SPARC was the slowest.

Table 1. Average execution time (in seconds) of tasks on all processors

The TG-GUI provided an implementation of the three static heuristics previously mentioned: Minimum Finish Time (MFT), Largest Task First (LTF) and Shortest Effective Execution Time First (SEETF). Simulation results assuming a normal distribution are shown in Table 2 for two, three, four and five processors. Simulation of the combined five computers was the most efficient.

Table 2. Predicted execution times (in minutes). Surviving entries: rows for 2 processors (machines 1, 2) and 5 processors (machines 1-5); values 10.45 and 19.68.

Next, the two implementation strategies (static and dynamic) were tested on the dataset for various processor combinations under different loads. Measurements were tabulated to compare the observed speedup with the maximum possible speedup. In a heterogeneous environment, the maximum possible speedup is estimated by adding the normalized execution times of all computers involved in the calculation. Normalization is with respect to the fastest machine. The observed speedups (Tseq / Tpar) were also computed with respect to the fastest sequential time. Five distinct cases were considered:

Case a: Two DEC Alphas (machines 2 and 3) with almost equal processing speed and negligible load, as inferred from the near-equal execution times shown in Table 1. The sequential execution of Linkmap took 32.55 minutes on machine 2 and 31.97 minutes on machine 3. The parallel execution times are listed in Table 3, along with the best sequential time and the simulated time. The simulation compares closely with the measured execution times. This shows the feasibility of using simulation to analyze several allocation schemes before any actual implementation. The speedup obtained for the dynamic heuristic is plotted against the maximum possible speedup in Fig. 2. The speedup achieved (both dynamic and static) was almost equal to the maximum possible speedup, which indicates that both processors executed equal numbers of iterations in both schemes.

Table 3. Execution times (in minutes)

    Fastest sequential time                 31.97
    Parallel (ad-hoc static scheme)         16.43
    Parallel (dynamic scheme)               16.38
    Parallel (ad-hoc static simulation)     17.3825

Figure 2. Dynamic vs Static: Case a (speedup against number of processors)

Case b: The DEC Alphas (1 and 2) had different processing speeds and negligible loads (Fig. 3). Here the dynamic allocation achieved greater speedup. The static strategy allocated equal numbers of iterations to both machines. One iteration for the first dataset took between 2.8 and 3.33 minutes on machine 1 and approximately 5 minutes on machine 2. Thus, out of the six iterations, three were calculated by machine 1 in about ten minutes, whereas machine 2 took fifteen minutes for its three. The total time was dictated by the time taken by the slowest processor to perform its set of iterations; the static strategy yielded results in 15.25 minutes. The dynamic strategy took 11.88 minutes. Machine 1, the faster of the two, managed to finish four iterations, while two iterations were performed by machine 2.

Figure 3. Dynamic vs Static: Case b (speedup against number of processors)

Case c: The three DEC Alphas (1-3) were used with negligible load. This case is shown in Fig. 4. The dynamic heuristic almost equals the maximum possible speedup, whereas the static scheme performed poorly with two computers. However, the static scheme performed almost as well as the dynamic scheme with three computers, since the slowest processor calculated three iterations in the two-computer case as compared with only two iterations in the three-computer case.

Figure 4. Dynamic vs Static: Case c (speedup against number of processors)

Case d: When one of the three processors, machine 3, was heavily loaded, a different pattern of results was obtained. This is shown in Fig. 5. The static implementation yielded poor results because even the heavily loaded processor received an equal number of iterations to calculate. The maximum possible speedup for three processors was less than that for two processors. The dynamic curve follows the pattern of the maximum attainable speedup curve.

Figure 5. Dynamic vs Static: Case d (speedup against number of processors)

Case e: Three heterogeneous machines (4, 5, and 6) were used. The static scheme failed because of the heterogeneity in the speed of the machines, whereas the dynamic scheme yielded better results. This is shown in Fig. 6. The dataset has 3 families (3 pedigrees), with 51, 60, and 55 family members, and 15 markers. LINKMAP was run by placing marker 15 between or beside markers 8, 9 and 10. The dataset was very computation-intensive: a sequential run of this data set on the fastest DEC Alpha took 93 hours to complete, while parallel execution on the three computers took only 34 hours.

Figure 6. Dynamic vs Static: Case e (speedup against number of processors)

4. Conclusions

Parallelization of the Linkmap program has been achieved at a code level for a heterogeneous environment. Two strategies for allocating the recombination fraction vectors to different processors for likelihood calculations were implemented. The dynamic scheme proposed yielded better results in all cases than the ad-hoc static implementation. Performance prediction for all possible combinations of the available computers was performed for three other static heuristics, to show the feasibility of predicting the performance of specific allocation heuristics by simulation.

The speedup attained by the dynamic scheme was close to the estimated maximum speedup. One possible reason for this is the choice of parallelizing the outermost loops in the nest of computation-intensive loops. This results in a higher granularity, i.e., a high ratio of computation to communication. This approach distinguishes our work from a previous parallel implementation of the LINKMAP program, in which the parallelization was performed using LINDA, a machine-independent parallel programming language used to execute programs on a parallel computer [18]. That implementation is essentially for parallel architectures and does not address execution of the parallel LINKMAP on a cluster of workstations.
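
The difference between the two allocation strategies compared in this work can be reproduced with a small simulation. The sketch below is an idealized model, not the MPI implementation: the function names and `MAX_PROC` are hypothetical, each machine is assumed to take a fixed time per iteration, communication cost is ignored, and the manager machine is assumed to compute as well. The "normalized execution time" of a machine is interpreted here as the fastest per-iteration time divided by that machine's own time, which is an assumption about the definition used in the Results.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROC 16  /* illustrative upper bound on cluster size */

/* Static scheme: N/P iterations each; the N mod P leftovers go to
 * processors 0, 1, ... one at a time. Returns the finishing time. */
double static_time(size_t n, size_t p, const double iter_time[])
{
    double worst = 0.0;

    assert(p > 0 && p <= MAX_PROC);
    for (size_t i = 0; i < p; i++) {
        size_t share = n / p + (i < n % p ? 1 : 0);
        double t = (double)share * iter_time[i];
        if (t > worst)
            worst = t;
    }
    return worst;
}

/* Dynamic scheme: every iteration goes to whichever processor becomes
 * free first. done[i] receives the iteration count of processor i. */
double dynamic_time(size_t n, size_t p, const double iter_time[],
                    size_t done[])
{
    double free_at[MAX_PROC] = {0.0};
    double finish = 0.0;

    assert(p > 0 && p <= MAX_PROC);
    for (size_t i = 0; i < p; i++)
        done[i] = 0;

    for (size_t k = 0; k < n; k++) {
        size_t best = 0;
        for (size_t i = 1; i < p; i++)
            if (free_at[i] < free_at[best])
                best = i;
        free_at[best] += iter_time[best];
        done[best]++;
        if (free_at[best] > finish)
            finish = free_at[best];
    }
    return finish;
}

/* Estimated maximum speedup: sum of speeds relative to the fastest. */
double max_speedup(size_t p, const double iter_time[])
{
    double fastest, sum = 0.0;

    assert(p > 0);
    fastest = iter_time[0];
    for (size_t i = 1; i < p; i++)
        if (iter_time[i] < fastest)
            fastest = iter_time[i];
    for (size_t i = 0; i < p; i++)
        sum += fastest / iter_time[i];
    return sum;
}
```

With illustrative per-iteration times of 3 and 5 minutes (chosen to resemble Case b) and six iterations, the static rule finishes in 15 minutes, while the dynamic rule finishes in 12 minutes with the faster machine taking four iterations and the slower machine two. The estimated maximum speedup for this pair is 1 + 3/5 = 1.6.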

With Linda, stress is placed on load balancing issues more than on the different strategies that can be adopted for allocation; in our work we focus on adopting different allocation strategies and predicting performance under several such heuristics. Vaughan's master's thesis concentrates on parallelizing LINKMAP for a single recombination fraction vector calculation [19]. There the speedup obtained is largely dependent on the size of the pedigree. By distributing different iterations over different computers (instead of one iteration over all of them), we ensure that some speedup is always achieved irrespective of the pedigree size, because the number of calculations to be performed is always divided among the processors. Our approach is also different from TREADMARKS, which is designed for shared-memory and homogeneous platforms [5].

The parallelization of LINKMAP presented in this work can be used as a case study that outlines an approach to converting existing, highly iterative sequential programs to a parallel form. Future studies could include "intra-likelihood" parallelization, which would parallelize the likelihood calculation for a single recombination fraction vector in addition to distributing the likelihood calculations for different vectors [20]. The steps outlined in this paper could be used in an attempt to parallelize the three other programs in FASTLINK. The dynamic allocation strategy could also be improved, especially through the concept of 'worker helping worker'. If a worker has finished its final calculation while others are still calculating, and there are no more vectors left to be distributed, it could request some of the vectors that are currently being calculated by other workers, say, the slowest worker. This would bring the execution time closer to the minimum possible.

5. References

[1] Terwilliger J.D., Ott J., Handbook of Human Genetic Linkage, The Johns Hopkins University Press, Philadelphia, 1994.
[2] Dwarkadas S., Schaffer A.A., Cottingham Jr. R.W., Cox A.L., Keleher P., Zwaenepoel W., "Parallelization of genetic linkage analysis problems", Human Heredity, Vol. 44, 1994, pp. 127-141.
[3] Cottingham Jr. R.W., Iduri R.M., Schaffer A.A., "Faster sequential genetic linkage computations", Amer J Hum Genetics, Vol. 53, 1993, pp. 252-263.
[4] Chandy K.M., Taylor S., Introduction to Parallel Programming, Jones and Bartlett, Boston, 1992.
[5] Amza C., Cox A.L., Dwarkadas S., Keleher P., Lu H., Rajamony R., Yu W., Zwaenepoel W., "Treadmarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol. 29, No. 2, February 1996, pp. 18-28.
[6] Pacheco P.S., Parallel Programming with MPI, Morgan Kaufman Publishers, Inc., San Mateo, CA, 1997.
[7] Krishna S., A graphical interface for the analysis of task graphs, MS Thesis, Computer Science, Texas Tech University, 1999.
[8] Lopez-Benitez N., Hyon J-Y., "Simulation of task graph in heterogeneous environments", IEEE Heterogeneous Computing Workshop, April 1999, pp. 112-124.
[9] McSpadden A.R., Lopez-Benitez N., "Stochastic Petri nets applied to performance evaluation of static task allocations in heterogeneous computing environments", IEEE Heterogeneous Computing Workshop, 1997, pp. 185-194.
[10] Menasce D.A., Saha D., Da Silva Porto S.C., Almeida V.A.F., Tripathi S.K., "Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures", J. of Parallel and Distributed Computing, 28, 1995, pp. 1-18.
[11] El-Rewini H., Lewis T.G., Ali H.H., Task Scheduling in Parallel and Distributed Systems, Prentice Hall, 1994.
[12] Topcuoglu H., Hariri S., Wu M-Y., "Task scheduling algorithms for heterogeneous processors", IEEE Heterogeneous Computing Workshop, 1999, pp. 3-14.
[13] Kumar V., Grama A., Gupta A., Karypis G., Introduction to Parallel Computing - Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Co., Menlo Park, CA, 1994.
[14] Culler D.E., Singh J.P., Gupta A., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, San Francisco, CA, 1999.
[15] Parallelization - Online tutorial, http://www.wi.leidenuniv.nl/~gusz/Flying_Circus/1.Reading/2.Tutorial/04/index.html
[16] Sun Workshop 4.0 Answerbook, Sun Microsystems, 1999, http://docs.sun.com.
[17] Rai A., On the Parallelization of the Linkage/Fastlink Package, M.Sc. Thesis, Computer Science Department, Texas Tech University, December 1999.
[18] Miller P.L., Nadkarni P., Gelernter J.E., Carriero N., Pakstis A.J., Kidd K.K., "Parallelizing genetic linkage analysis: A case study for applying parallel computation in molecular biology", Computers and Biomed Res, Vol. 24, 1991, pp. 234-248.
[19] Vaughan M.S., A distributed approach to human genetic linkage analysis, Master's thesis, Computer Science, Duke Univ, 1991.
[20] Cox A.L., Dwarkadas S., Schaffer A.A., Zwaenepoel W., Gupta S.K., "Integrating parallelization strategies for linkage analysis", Computers and Biomed Res, Vol. 28, 1995, pp. 116-139.