Heterogeneous Parallelization of the Linkmap Program
"Aaditya Rai, *Noe Lopez-Benitez, 'J. D. Hargis, ' S . E. Poduslo
*Computer Science Department
Texas Tech University
Lubbock, TX 79409
n.lopez-benitez@ttu.edu
Abstract
Sequential genetic algorithms have many successful
applications in very different domains, but they have a
main drawback in their utilization: evaluations are very
time-consuming. For example, a pedigree consisting of fifty-five
nodes takes about seventy minutes on a DEC-Alpha
processor and about two hundred and seventy minutes on
a 166MHz Pentium for certain likelihood calculations.
This time increases exponentially with the increase in the
size of the pedigree. In order to address these shortcomings
and to study new models of higher efficiency, parallel
platforms are being used for genetic programs. LINKAGE
is a software package for performing genetic likelihood
calculations; FASTLINK is an improved, faster version of
it. This paper provides a parallel implementation of the
'Linkmap' program (one of the four programs in
LINKAGE/FASTLINK) for a heterogeneous environment,
using a static and a dynamic strategy for task allocation.
It was found that the performance increase achieved by the
dynamic strategy was close to the estimated maximum
speedup.
1. Introduction
Linkage analysis uses information from family
pedigrees to map genes and to locate disease genes on
particular chromosomes. The recombination fraction,
denoted θ, is a measure of the frequency of crossing
over between two loci. If θ < 0.5 between two loci, then
they are considered linked. If θ = 0.5, the two loci are
considered unlinked. When θ = 0, the two loci are at the
same location. In genetics, likelihood is defined as the
probability with which given observations (phenotypes)
occur, and this probability is given for any specific model
chosen. Data from small pedigrees are analyzed quickly.
However, the use of large family pedigrees has greatly
increased the computational time needed for analysis.
Better algorithms and parallel computers would make
computational analysis more efficient. Several computer
packages have been written for these linkage computations
and most published linkage studies use one of these
programs. LINKAGE is a sophisticated analysis software
package that evaluates the likelihood of a given pedigree under
different assumptions about the recombination fraction
between two loci [1]. It contains four related programs,
LODSCORE, ILINK, LINKMAP and MLINK, for the
probability computations [2]. Sequential algorithmic
improvements have been made to this package. These
improvements are now integrated into FASTLINK [3].
While sequential genetic algorithms have many
successful applications, a drawback in their use is that
evaluations are still very time-consuming. A pedigree
consisting of fifty-five nodes takes about seventy minutes
on a DEC-Alpha processor and about two hundred and
seventy minutes on a 166MHz Pentium for likelihood
calculations. This time increases exponentially with the
pedigree size. To address these shortcomings and to study
new models of higher efficiency and efficacy, parallel
platforms are being used for the analysis. In order to locate a
disease gene, the placement of markers is an important
step. The placement of markers generally starts with two
markers (A and B) with a known recombination fraction
obtained using another program. Then a third
marker (C) is placed in relation to the two known markers
followed by a fourth marker (D) and so on. Once all of
the markers used for the analysis are placed in the correct
order, then a disease gene is placed. If the markers are not
in the correct order, the disease gene will also be
misplaced. Linkmap computes the likelihood for each
possible position of the new marker in each interval along
the map.
Parallelization can be achieved at two levels: at an
algorithmic level, which improves the basic statistical
calculation strategy, and at a code level, where the code
itself is exploited for potential parallelism [4].
TREADMARKS is a parallel programming system
developed at Rice University to allow programs written in
a shared-memory style to run on a network of Unix
workstations [5]. TREADMARKS assumes a shared-memory
memory homogeneous environment.
In our study we focused on formulating a parallel
model of LINKMAP in the FASTLINK program, using
code level parallelization. The model concentrates on
distributing different likelihood computations for various
recombination fractions on different computers. The
parallel model was implemented using a message-passing
interface [6]. Performance prediction and evaluation of the
parallel model by simulating its execution on the TG-GUI
(Task Graph - Graphic User-Interface) was also
performed. The TG-GUI is a user-interface designed to
evaluate task graphs [7] using simulation [8] or a
numerical tool [9] and is useful in evaluating different
heuristics described by the corresponding implemented
algorithm. Task graphs are a well-known tool to study
performance issues of complex jobs [8]. The edges of a
task graph determine the precedence relations that govern
the order of execution of the individual tasks; precedence
relations can be associated with data or control
dependencies. When the task graph is executed on the
processing elements of a heterogeneous computing system,
estimating overall completion time becomes an
optimization problem, involving allocation of tasks to
processors such that the completion time is minimized.
The purpose of a task graph representation is two-fold: (a)
predict performance under different allocation heuristics
and (b) exploit possible parallelism in a heterogeneous
computing environment.
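To make this representation concrete, the following minimal C sketch, which is our illustration and not part of the LINKAGE/FASTLINK or TG-GUI code, stores per-processor execution time estimates and precedence relations for a small task graph and estimates the completion time of one allocation. Communication costs are ignored and tasks are assumed to be numbered in topological order; all names here are hypothetical.

    #include <stdio.h>

    #define NTASKS 4   /* tasks, assumed numbered in topological order */
    #define NPROCS 2

    /* exec_time[t][p]: estimated execution time of task t on processor p */
    static const double exec_time[NTASKS][NPROCS] = {
        { 2.0, 3.0 }, { 5.0, 7.0 }, { 4.0, 6.0 }, { 1.0, 2.0 }
    };

    /* dep[t][u] != 0 means task u must finish before task t starts */
    static const int dep[NTASKS][NTASKS] = {
        { 0, 0, 0, 0 }, { 1, 0, 0, 0 }, { 1, 0, 0, 0 }, { 0, 1, 1, 0 }
    };

    /* Estimate the overall completion time of one allocation,
       where alloc[t] is the processor assigned to task t. */
    static double completion_time(const int alloc[NTASKS])
    {
        double finish[NTASKS];
        double proc_free[NPROCS] = { 0.0, 0.0 };

        for (int t = 0; t < NTASKS; t++) {
            double ready = 0.0;                 /* all predecessors done */
            for (int u = 0; u < t; u++)
                if (dep[t][u] && finish[u] > ready)
                    ready = finish[u];
            int p = alloc[t];
            double start = (proc_free[p] > ready) ? proc_free[p] : ready;
            finish[t] = start + exec_time[t][p];
            proc_free[p] = finish[t];
        }
        double makespan = 0.0;
        for (int t = 0; t < NTASKS; t++)
            if (finish[t] > makespan) makespan = finish[t];
        return makespan;
    }

    int main(void)
    {
        int alloc[NTASKS] = { 0, 0, 1, 0 };     /* one example allocation */
        printf("estimated completion time: %.2f\n", completion_time(alloc));
        return 0;
    }

Minimizing this completion time over all possible alloc[] assignments is exactly the optimization problem described above.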
The availability of a task graph representation of any
program in the LINKAGE package has permitted its
evaluation under different heuristics. Typical heuristics
[10, 11] that can be used in the evaluation are the
following:
Largest Task First (LTF): This heuristic assigns the task
having the largest execution time to a randomly
selected processor and marks that task as assigned. A
task from the remaining task list is selected for the
next assignment, until all the tasks are assigned to
processors.
Shortest Estimated Execution Time First (SEETF): A
task is selected at random from the task set and
assigned to a processor that executes the task in the
minimum time.
Minimum Finish Time (MFT): This heuristic takes
into account the precedence constraints on the tasks. A
randomly selected task Ti is assigned to the processor Pj
that minimizes the finish time of the task in a
deterministic simulated execution, where the finish
time of a task is given by the minimum sum of its
execution time and the next instant at which
processor Pj becomes free; a minimal sketch of this rule is given below.
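As an illustration only (the TG-GUI internals are not given in this paper, and the array names below are ours), the core of the MFT rule, picking the processor that can finish a given task earliest, can be sketched in C as:

    #define NPROCS 5

    /* Minimum Finish Time: assign task t to the processor that can
       finish it earliest.  exec_time[t][p] is the estimated run time of
       task t on processor p; proc_free[p] is when processor p next
       becomes idle; ready is when all of t's predecessors have finished.
       (Hypothetical names; this would slot into a scheduler loop such as
       the completion-time sketch shown earlier.) */
    static int mft_assign(int t, const double exec_time[][NPROCS],
                          double proc_free[NPROCS], double ready)
    {
        int best = 0;
        double best_finish = 1e30;
        for (int p = 0; p < NPROCS; p++) {
            double start  = (proc_free[p] > ready) ? proc_free[p] : ready;
            double finish = start + exec_time[t][p];
            if (finish < best_finish) { best_finish = finish; best = p; }
        }
        proc_free[best] = best_finish;   /* commit the assignment */
        return best;
    }

The LTF and SEETF rules differ only in how the next task is chosen and in whether the processor is picked at random or by smallest execution time.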
Efficient scheduling of application tasks is critical to
achieving high performance in parallel systems. The
objective of scheduling is to map tasks onto processors and
order their execution so that task precedence constraints
are satisfied and a minimum schedule length is obtained
[12]. Thus, various schemes need to be evaluated before a
model can be implemented.
Profiling tools were used to identify the precedence
relationships and the flow of data in the sequential code, to
explore potential parallelism, and to formulate a parallel
model. "Profiling" involves the analysis of source code for
potential performance/time bottlenecks by identifying
those sections of the code which are likely to take more
time to execute [13]. Information regarding the
number of times a function is called is also provided by
the profiling tool.
2. Methodology
Two important activities in parallel programming that
need to be given special consideration to attain higher
efficiency in parallel programs are partitioning the
program into concurrent components and mapping these
components onto available processors [14]. Partitioning
involves dividing large code into components that may
execute concurrently, with the primary goal of achieving
scalability. Determining how a partition scales gives
an indication of the maximum number of computers that
can be employed to solve a problem [15]. Domain
decomposition is a partitioning technique that is useful
when a problem is primarily concerned with a large,
regular data domain. Data decomposition is used here for
distributing the likelihood computations for different θ
vectors onto different computers: we divided the data (the
θ vectors) and operated on these parts (performed the
likelihood calculations) concurrently.
The notion of achieving higher granularity, i.e.,
maintaining minimum task interaction, led to choosing a
specific loop (the outermost loop) from a nest of loops in the
LINKMAP program. Profiling was required to identify
such loops. Sun Workshop's LoopTool takes a
"<binary>.looptimes" file as a parameter to draw the loop
graph chart [16]. This file was created by compiling the
source file with the "-Zlp" compiler option and then
executing the binary thus created; i.e.,
source.c  --(compile with -Zlp)-->  binary  --(execute)-->  source.looptimes
Sequential LINKMAP was compiled in this manner
and executed on a SPARC machine. A
"linkmap.looptimes" file was created and a loop graph was
obtained from it. It was found that there were three
nested loops of interest: loops starting at line 488
and at line 730 in the file linkmap.c, and at line 1546 in
the file comlike.c [17]. The first two loops took about 99%
of the total runtime, whereas the third loop took about
80%. The percentage of time taken by the rest of the loops
was too small to provide any gain if parallelized. Note
that the percentage figures given by the profiler are
cumulative, i.e., the time spent in an inner loop is also
counted toward its enclosing loops, which is why the
percentages do not add up to 100%. Nevertheless, the
information given by the profiler identifies those loops
that are computationally intensive. The analysis concentrated
on the first two loops, where the two outermost loops of the
nest are of the following form:
    Loop1:
        do {
            ...
            iterpeds();
            ...
        } while (<condition>);

    iterpeds() {
    Loop2:
        for ( ) {
            ...
            likelihood();
            ...
        }
    }

Thus, effectively there is a nest of two loops, which could be expressed as:

    For all θ vectors
        For all pedigrees
            Perform likelihood calculation

where the outermost loop (Loop1) calls the function
iterpeds(), which is in turn responsible for the actual
likelihood computations. The inner loop (Loop2)
corresponds to the number of pedigrees, i.e., likelihood
calculations are performed for each pedigree. For a small
number of pedigrees the number of inner iterations is also
very small, and initial estimates revealed that distributing
the inner iterations would incur synchronization overhead
that offsets any gain achieved. Analysis of the outer loop
(Loop1) yielded a do-while loop that computes the θ vector
(from the user-given data set) and then estimates the
maximum likelihood of this vector for all pedigrees. These
calculations are performed by the function likelihood().
The loop also calculates the loop-breaking condition
dynamically, i.e., in each pass, after the likelihood
calculation, it checks whether the breaking condition has
been reached. Functionally, the loop is of the form:

    do {
        Calculate θ-vector;
        Update probabilities for this θ-vector ( likelihood() );
        Update condition for stopping;
    } while (condition is valid);

For the allocation of the iterations of a loop it is necessary
to predetermine the total number of iterations the loop
goes through, given a certain data set. Also, it is necessary
to know beforehand the θ vectors for which the likelihood
is to be calculated. This is required so that different θ
vectors can be allocated to different processors to perform
calculations. For this purpose, the "while" loop was
simulated to predetermine the number of iterations.
Finally, the original "do while" loop was converted into a
partitionable "for" loop of the form:

    for (theta_cnt = 0; theta_cnt < num_of_iter; theta_cnt++)
    {
        /* Load the vector calculated by the simulator() for this
           iteration into gtheta[] to provide the same environment
           to the rest of the loop */
        for (k = 0; k < mlocus - 1; k++)
            gtheta[k] = parArray[theta_cnt][k];
        likelihood(gtheta);
        ...
    }

Before the "for" loop could be partitioned and its
iterations distributed on different computers, data
dependency analysis was performed. The removal of data
dependencies required the introduction of some extra
variables of the same type as the data involved in the
dependency. To verify that the dependencies were removed,
iterations were run in no particular order and the results
were compared with those obtained from a sequential run.
The two results were found to be exactly the same. Note
that this random-iteration technique does not verify
independence one hundred percent. For example, an array
accumulation of the form

    for (i = 0; i < 6; i++) Var = array[i] + Var;

would yield correct results even if the iterations were
calculated in a random order, but may give incorrect
results when different iterations are run on separate
computers simultaneously. However, in our case no such
statements were found in the loop or its called functions, and
thus all iterations were independent of each other.
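A minimal C sketch of this shuffled-order check is given below; it is our illustration only, and compute_iteration() is a hypothetical stand-in for one loop body (loading gtheta[] and calling likelihood()):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define N_ITER 6

    /* Stand-in for one loop body: in LINKMAP this would load gtheta[]
       and call likelihood(); here it just derives a value from i. */
    static double compute_iteration(int i) { return (i + 1) * 1.5; }

    int main(void)
    {
        double seq[N_ITER], shuf[N_ITER];
        int order[N_ITER];

        for (int i = 0; i < N_ITER; i++) order[i] = i;
        /* Fisher-Yates shuffle of the iteration order */
        srand(42);
        for (int i = N_ITER - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        for (int i = 0; i < N_ITER; i++) seq[i] = compute_iteration(i);
        for (int k = 0; k < N_ITER; k++) {
            int i = order[k];
            shuf[i] = compute_iteration(i);   /* same iteration, new order */
        }

        /* If any per-iteration result differs, a dependency was missed. */
        for (int i = 0; i < N_ITER; i++)
            if (fabs(seq[i] - shuf[i]) > 1e-12) {
                printf("iteration %d differs\n", i);
                return 1;
            }
        printf("all %d iterations match\n", N_ITER);
        return 0;
    }

As noted above, such a check cannot catch order-insensitive accumulations that still break under truly concurrent execution, so it complements rather than replaces the dependency analysis.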
To formulate a parallel model of LINKMAP, the code
was represented as one task from the point where the
program begins its execution, i.e., the main() function in
the file commoncode.c in FASTLINK, to the point it
reached the 'for' loop (the function iterpeds() in the file
linkmap.c). Then, the iterations of the 'for' loop were
represented as 'n' independent tasks. Finally, all these
iteration tasks joined to end the program execution. Thus,
the LINKMAP program can be expressed in terms of the
following pseudocode:

    main() {
        /* Task T0 */
        PreLoop_code;
        For loop with 'n' iterations;   /* 'n' independent tasks */
    }

The corresponding task graph is shown in Fig. 1. The
number of concurrent tasks T1 ... TN shown in Fig. 1
varies from data set to data set. The task T0 will initiate
and collect results from tasks T1 ... TN. In linkage
analysis, a data set involves two files: a .ped file that
contains the pedigree details and a .dat file that contains
the recombination fraction vector details [1].

Figure 1. Task graph of the parallelized Linkmap (task T0 followed by the concurrent iteration tasks T1 ... TN)

Implementation involves distributing the concurrent tasks
on a cluster of available computers. MPI was used to
explicitly send and execute tasks on different processors.
In a first implementation of the above 'for' loop, the
number N of iterations of the loop was divided by the
number P of available processors, and sets of iterations were
formed with N/P iterations in each set. Each processor
was then allocated one set of iterations. This allocation
works as long as the number of iterations is a multiple of
the number of processors, which is rarely the case. Thus,
the remaining N - (floor(N/P) x P) = N mod P iterations
are allocated again to consecutive processors starting from
processor zero. Since N mod P < P, in the second pass at
most P-1 processors will get one extra iteration. The
advantage of this strategy is that it is fairly simple to
design and implement, and it is suited for an environment
where all computers have homogeneous speeds and loads.
Consequently, the heterogeneity in speed and load of the
different machines in a network is not taken into
consideration. In a heterogeneous system, allocating an
equal number of iterations to all machines limits the
overall efficiency of the partition because of the time taken
by the slowest machine to execute its set of iterations.
Moreover, the load on each machine changes with time.
Thus, this strategy may end up having the slowest
processor performing calculations while much faster
processors are idle.
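A minimal C sketch of this ad-hoc static partition is shown below; it is our illustration rather than the original code, and the function and variable names are hypothetical:

    #include <stdio.h>

    /* Ad-hoc static partition: every processor gets floor(N/P)
       iterations, and the first (N mod P) processors get one extra. */
    static int iterations_for(int rank, int n_iter, int n_procs)
    {
        int base = n_iter / n_procs;
        int rem  = n_iter % n_procs;
        return base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int n_iter = 6, n_procs = 4;
        for (int rank = 0; rank < n_procs; rank++)
            printf("processor %d: %d iterations\n",
                   rank, iterations_for(rank, n_iter, n_procs));
        return 0;
    }

For N = 6 iterations on P = 4 processors this yields 2, 2, 1 and 1 iterations, i.e., the first N mod P processors receive the extra iteration.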
In a second implementation, a "manager-worker"
dynamic style of task allocation was used [4]. Instead of
allocating sets of iterations to the processors, one processor is
made a manager (it executes T0) and allocates exactly one
iteration to every other processor (worker) in the cluster.
The workers execute their iteration and, as soon as they
finish, the results are sent to the manager along with a
request for another iteration. The manager keeps
allocating iterations in this manner until all iterations are
executed. This ensures that a processor with greater
efficiency gets a larger number of iterations; likewise, a
processor with a greater load will automatically perform
fewer iterations. One problem with this strategy as such is
that the manager does not participate in the calculations. A
variation of this strategy was therefore implemented, in which the
manager was made to calculate along with the workers:
the fastest computer in the cluster was assigned as the
manager, and a thread was also created on this machine to
perform calculations. This implementation is outlined as follows:
    Total number of iterations = N
    Total number of processors = P
    My rank in the cluster = my_rank
    Number of results received by processor zero = recvTheta
    Number of the next vector allocated to workers = thetavector

    Start() {
        if (my_rank != 0) {
            while (1) {
                Receive thetavector;
                Calculate likelihood for this vector;
                Send results to processor 0;
            }
        }
        if (my_rank == 0) {
            thetavector = 0; recvTheta = 0;
            for (procs = 1; procs < P; procs++) {
                Send thetavector to procs;
                thetavector++;
            }
            while (1) {
                Receive results from any of the workers;
                recvTheta++;
                Send thetavector to the worker that sent the results;
                Store the received results in the output file;
                if (thetavector < N-1) thetavector++;
                if (recvTheta == N) break;
            }
            Send "stop signal" to all worker processors;
        }
    }
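The following is a minimal MPI sketch in C of this manager-worker scheme. It is our illustration rather than the original implementation: compute_likelihood() is a hypothetical stand-in for loading a θ vector and calling likelihood(), and the extra worker thread on the manager's machine is omitted.

    #include <stdio.h>
    #include <mpi.h>

    #define N_ITER 6            /* total number of theta vectors (N) */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Hypothetical stand-in for loading gtheta[] for iteration 'idx'
       and calling likelihood(); returns one number as the "result". */
    static double compute_likelihood(int idx) { return 100.0 + idx; }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* run with >= 2 processes */

        if (rank != 0) {                          /* worker */
            while (1) {
                int idx;
                MPI_Status st;
                MPI_Recv(&idx, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double result[2] = { (double)idx, compute_likelihood(idx) };
                MPI_Send(result, 2, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        } else {                                  /* manager (rank 0) */
            int next = 0, received = 0;
            for (int p = 1; p < nprocs && next < N_ITER; p++) {
                MPI_Send(&next, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
                next++;                           /* seed every worker once */
            }
            while (received < N_ITER) {
                double result[2];
                MPI_Status st;
                MPI_Recv(result, 2, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                received++;
                printf("theta %d -> %f\n", (int)result[0], result[1]);
                if (next < N_ITER) {              /* hand out the next vector */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                }
            }
            for (int p = 1; p < nprocs; p++) {    /* tell workers to stop */
                int dummy = -1;
                MPI_Send(&dummy, 1, MPI_INT, p, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Compiled with an MPI C compiler (e.g., mpicc) and run with at least two processes, rank 0 acts as the manager and the remaining ranks act as workers.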
The first 'if' block is executed by the worker processors
until no more θ vectors are received from the master, which is
ranked zero. The second 'if' block is executed only by
processor zero. It initializes the indexes, sends vectors to
the different processors, and collects results from the worker
processors. Once the master has received N results, it
sends an abort signal to all workers. The manager and a
worker thread were created on processor '0' by specifying
this in an MPI configuration file. If this file has two entries
for one processor, two threads are created on that
computer by MPI and both threads are treated as
individual processes with separate ranks in the cluster.
Therefore, the worker thread on processor '0' executes
iterations under "if (my_rank != 0)" and the manager
thread under "if (my_rank == 0)". This strategy adjusts
itself according to the heterogeneity of the machines
available.

3. Results

Two types of experiments were conducted. First, the
simulation of the parallel model using the TG-GUI was
performed under three different allocation heuristics for
all possible combinations of processors. Second, the actual
implementation was carried out for the two different
strategies (an ad-hoc static and a dynamic allocation) and
the results obtained were compared with the sequential
run.

Three static allocation heuristics were simulated using
the TG-GUI. The dynamic and ad-hoc static heuristics
were applied under various machine combinations and
different load conditions. The ad-hoc static heuristic was
also simulated and compared with the actual results. The
near-equal execution times in this case validate the accuracy
of the simulation tool.

Experiments were conducted on six computers: three
DEC Alphas (machines 1, 2, and 3), one dual Pentium
processor (machine 4), an Intel Pentium 200MHz
(machine 5), and a SUN SPARC (machine 6). Two data
sets were used for the experiments. The first data set had
one family (one pedigree), 55 family members, and 21
markers. LINKMAP was run by placing marker 1
between or beside markers 17, 14 and 9. The measured
average execution times of all the tasks in the first data set
on all the processors are presented in Table 1. The dual
Pentium Pro multiprocessor had the fastest execution time
while the Sun SPARC was the slowest.

Table 1. Average execution time (in seconds) of tasks on all processors

The TG-GUI provided an implementation of the three
different static heuristics previously mentioned: Minimum
Finish Time (MFT), Largest Task First (LTF) and Shortest
Estimated Execution Time First (SEETF). Simulation
results assuming a normal distribution are shown in Table
2 for two, three, four and five processors. The simulation of
the combined five computers was the most efficient. Next,
the two implementation strategies (static and dynamic)
were tested on the dataset for various processor
combinations under different loads.

Measurements were tabulated to compare the observed
speedup with the maximum possible speedup. In a
heterogeneous environment the maximum possible
speedup is estimated by adding the normalized execution
times of all computers involved in the calculation, where
normalization is with respect to the fastest machine; that is,
the maximum speedup is estimated as the sum over all
machines i of (t_fastest / t_i), with t_i the execution time on
machine i. The observed speedups (Tseq / Tpar) were also
estimated with respect to the fastest sequential time. Five
distinct cases were considered:

Case a: Two DEC-Alphas (machines 2 and 3) with
almost equal processing speed and negligible load, as
inferred from the near-equal execution times shown in Table
1. The sequential execution of Linkmap took 32.55 minutes
on machine 2 and 31.97 minutes on machine 3. The parallel
execution times are listed in Table 3 along with the best
sequential and the simulated times. The simulation compares
closely with the measured execution times, which shows the
feasibility of using simulation to analyze several allocation
schemes before any actual implementation. The speedup
obtained for the dynamic heuristic is plotted against the
maximum possible speedup (Fig. 2). The speedup achieved
(both dynamic and static) was almost equal to the maximum
possible speedup, since both processors executed equal
numbers of iterations under both schemes.

Case b: The DEC-Alphas (1 and 2) had different
processing speeds and negligible loads (Fig. 3). Here the
dynamic allocation achieved greater speedup. The static
strategy allocated equal numbers of iterations to both
machines. One iteration of the first dataset took between
2.8 and 3.33 minutes on (1), and one iteration on (2) took
approximately 5 minutes. Thus, out of the six iterations,
three iterations were calculated by (1) in about ten minutes
whereas (2) took fifteen minutes.
The total time was dictated by the time taken by the
slowest processor to perform its set of iterations; the static
strategy yielded results in 15.25 minutes, while the dynamic
strategy took 11.88 minutes. Machine (1), the faster of the
two, managed to finish four iterations while two iterations
were performed by (2).

Table 3. Execution times (in minutes)
(columns: Fastest Sequential Time, Parallel (ad-hoc static scheme), Parallel (Dynamic scheme))

Table 2. Predicted execution times (in minutes)
(columns include: Parallel (ad-hoc static simulation); rows include: 2 (machines 1, 2) and 5 (machines 1-5); recovered entries: 10.45, 19.68, 31.97, 16.43, 16.38, 17.3825)

Figure 2. Dynamic vs Static: Case a
(speedup vs. number of processors; series: actual speedup (dynamic), actual speedup (static))

Case c: The three DEC-Alphas (1-3) were used with
negligible load. This case is shown in Fig. 4. It is clear
that the dynamic heuristic almost equals the maximum
possible speedup, whereas the static scheme performed
poorly with two computers. However, it performed almost
equal to the dynamic scheme in the case of three
computers, since the slowest processor calculated three
iterations in the former case as compared with only two
iterations in the latter case.

Case d: When one of the three processors, machine 3,
was heavily loaded, a different pattern of results was
obtained. This is shown in Fig. 5. The static
implementation yielded poor results because even the
heavily loaded processor received an equal number of
iterations to calculate. The maximum possible speedup
for three processors was less than that for two processors.
The dynamic curve follows the pattern of the maximum
attainable speedup curve.

Case e: Three heterogeneous machines (4, 5, and 6)
were used. The static scheme failed because of the
heterogeneity in the speed of the machines, whereas the
dynamic scheme yielded better results. This is shown in
Fig. 6. The dataset has 3 families (3 pedigrees), with 51,
60, and 55 family members and 15 markers.
Figure 3. Dynamic vs Static: Case b (speedup vs. number of processors)
Figure 4. Dynamic vs Static: Case c (speedup vs. number of processors)
Figure 5. Dynamic vs Static: Case d (speedup vs. number of processors)
LINKMAP was run by placing marker 15 between or
beside markers 8, 9 and 10. The dataset was very
computation intensive and a sequential run of this data set
on the fastest DEC-Alpha took 93 hours to complete.
Parallel execution time on the three computers took only
34 hours.
4. Conclusions

Parallelization of the Linkmap program has been
achieved at a code level for a heterogeneous environment.
Two strategies for allocating the recombination fraction
vectors to different processors for likelihood calculations
were implemented. It was shown that the proposed dynamic
scheme yielded better results in all cases than the ad-hoc
static implementation.
Figure 6. Dynamic vs Static: Case e (speedup vs. number of processors)
Performance prediction for all possible combinations of
the available computers was performed for three other
static heuristics to show the feasibility of predicting the
performance of specific allocation heuristics by
simulation. The speedup attained by the dynamic scheme
was close to the estimated maximum speedup. One
possible reason for this is the choice of parallelizing the
outermost loops in the nest of computation-intensive
loops, which results in a higher granularity, i.e., a high
ratio of computation to communication.
This approach distinguishes our work from a previous
parallel implementation of the LINKMAP program, in
which the parallelization was performed using LINDA, a
machine-independent parallel programming language
used to execute programs on a parallel computer
[18]. Their implementation is essentially for parallel
architectures and does not address execution of the
parallel LINKMAP on a cluster of workstations. With
Linda, the stress is placed on load-balancing issues more than
on the different strategies that can be adopted for allocation;
in our work we focus on adopting different allocation
strategies and predicting performance under several such
heuristics. Vaughan's master's thesis concentrates on
parallelizing LINKMAP for a single recombination
fraction vector calculation [19]. There, the speedup
obtained is largely dependent on the size of the pedigree.
In distributing different iterations on different computers
(instead of one iteration on all of them), we ensure that
some speedup is always achieved irrespective of the
pedigree size, because the number of calculations to be
performed is always divided by the number of processors.
Our approach is also different from TREADMARKS,
which is designed for shared memory and homogeneous
platforms [5].
The parallelization of LINKMAP presented in this
work can be used as a case study that outlines an approach
to converting existing highly iterative sequential programs
to a parallel form. Future studies could include achieving
“intra likelihood” parallelization which would calculate
likelihoods for a single recombinant vector along with
distribution of likelihood calculations for different
vectors [20]. The steps outlined in this paper could be
used in an attempt to parallelize the three other programs
in Fastlink.
Improvements to the dynamic allocation strategy,
especially the concept of 'worker helping worker', could
be further exploited. If a worker has finished its final
calculation while others are still calculating, and there are
no more vectors left to be distributed, it could request
some of the vectors that are currently being calculated by other
workers, say, the slowest worker. This would ensure that
the minimum possible execution time is obtained.
5. References

[1] Terwilliger J.D., Ott J., Handbook of Human Genetic Linkage, The Johns Hopkins University Press, Philadelphia, 1994.
[2] Dwarkadas S., Schaffer A.A., Cottingham R.W. Jr., Cox A.L., Keleher P., Zwaenepoel W., "Parallelization of genetic linkage analysis problems", Human Heredity, Vol. 44, 1994, pp. 127-141.
[3] Cottingham R.W. Jr., Idury R.M., Schaffer A.A., "Faster sequential genetic linkage computations", Amer. J. Hum. Genetics, Vol. 53, 1993, pp. 252-263.
[4] Chandy K.M., Taylor S., An Introduction to Parallel Programming, Jones and Bartlett, Boston, 1992.
[5] Amza C., Cox A.L., Dwarkadas S., Keleher P., Lu H., Rajamony R., Yu W., Zwaenepoel W., "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol. 29, No. 2, February 1996, pp. 18-28.
[6] Pacheco P.S., Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1997.
[7] Krishna S.A., A Graphical Interface for the Analysis of Task Graphs, M.S. Thesis, Computer Science, Texas Tech University, 1999.
[8] Lopez-Benitez N., Hyon J.-Y., "Simulation of task graph in heterogeneous environments", IEEE Heterogeneous Computing Workshop, April 1999, pp. 112-124.
[9] McSpadden A.R., Lopez-Benitez N., "Stochastic Petri nets applied to performance evaluation of static task allocations in heterogeneous computing environments", IEEE Heterogeneous Computing Workshop, 1997, pp. 185-194.
[10] Menasce D.A., Saha D., Da Silva Porto S.C., Almeida V.A.F., Tripathi S.K., "Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures", J. of Parallel and Distributed Computing, Vol. 28, 1995, pp. 1-18.
[11] El-Rewini H., Lewis T.G., Ali H.H., Task Scheduling in Parallel and Distributed Systems, Prentice Hall, 1994.
[12] Topcuoglu H., Hariri S., Wu M.-Y., "Task scheduling algorithms for heterogeneous processors", IEEE Heterogeneous Computing Workshop, 1999, pp. 3-14.
[13] Kumar V., Grama A., Gupta A., Karypis G., Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Co., Menlo Park, CA, 1994.
[14] Culler D.E., Singh J.P., Gupta A., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, San Francisco, CA, 1999.
[15] Parallelization, online tutorial, http://www.wi.leidenuniv.nl/-guszlFlyinLCircus/l.Reading/2.Tutorial/04/index.html
[16] Sun Workshop 4.0 AnswerBook, Sun Microsystems, 1999, http://docs.sun.com.
[17] Rai A., On the Parallelization of the Linkage/Fastlink Package, M.Sc. Thesis, Computer Science Department, Texas Tech University, December 1999.
[18] Miller P.L., Nadkarni P., Gelernter J.E., Carriero N., Pakstis A.J., Kidd K.K., "Parallelizing genetic linkage analysis: A case study for applying parallel computation in molecular biology", Computers and Biomedical Research, Vol. 24, 1991, pp. 234-248.
[19] Vaughan M.S., A Distributed Approach to Human Genetic Linkage Analysis, Master's Thesis, Computer Science, Duke University, 1991.
[20] Cox A.L., Dwarkadas S., Schaffer A.A., Zwaenepoel W., Gupta S.K., "Integrating parallelization strategies for linkage analysis", Computers and Biomedical Research, Vol. 28, 1995, pp. 116-139.