N-Version Programming A Fault-Tolerance Approach To Reliability Software Operation
N-Version Programming A Fault-Tolerance Approach To Reliability Software Operation
114
occuring i n two or more versions. Together with a t h a t i n most c a s e s only qlreasor~ableness" r a t h e r than
reasonable choice o f c-vectors and c c - p i n t s t h i s is "correctness" may be checked f o r by acceptance tests.
expected to turn N-version programming into an The l a c k o f established procedures to e s t i m a t e t h e
e f f e c t i v e method t o achieve t o l e r a n c e o f software e f f e c t i v e n e s s o f acceptance tests l e a v e s it hard to
faults. The e f f e c t i v e n e s s o f t h e e n t i r e approach determine i f it is s u f f i c i e n t t o use a given acceptance
depends on t h e v a l i d i t y o f t h i s c o n j e c t u r e , t h e r e f o r e test f o r a v e r y c r i t i c a l a p p l i c a t i o n .
it is c r i t i c a l l y important t o keep t h e i n i t i a l
specification f r e e o f any flaws t h a t would b i a s t h e In view o f t h e above d i f f i c u l t i e s , t h e N-version
independent p r o g r a m e r s toward introducing t h e same programing approach o f f e r s some advantages over t h e
software d e f e c t s . recovery blocks. In t h i s approach, self-checking is
not required. Some redundant software can be
The research e f f o r t a t UCLA addresses two t h u s f a r eliminated; t h i s seems a t t r a c t i v e f r a n t h e coverage
unanswered questions: ( 1 ) Which c o n s t r a i n t s (e.g., p i n t o f view [12]. It a l s o o f f e r s t h e p o s s i b i l i t y o f
need f o r formal s p e c i f i c a t i o n s , s u i t a b l e t y p e s of imnediately masking sane software f a u l t s so t h a t t h e r e
problems, n a t u r e o f algorithms, timing c o n s t r a i n t s , is no delay i n operation.
i-tc.) have t o be s a t i s f i e d to make N-version
In c e r t a i n a p p l i c a t i o n s , N-version programing a l s o
programing f e a s i b l e a t a l l r e g a r d l e s s o f t h e c o s t ? makes b e t t e r use o f e x i s t i n g hardware fault-tolerance
( 2 ) How does t h e cost-effectiveness of t h e N-version resources. For i n s t a n c e , t h e r e a r e recent system
programing approach compare t o t h e t w o a l t e r n a t i v e s : d e s i g n s f o r aerospace a p p l i c a t i o n s t h a t use redundant
non-redmdant ( " f a u l t - i n t o l e r a n t " ) programing [l], and hardware a t t h e system l e v e l to a t t a i n fault-tolerance.
t h e "recovery block" [ 2 , 31 approach? The s c a r c i t y of The SIFT design [ 133, t h e Symnetric Multiprocessor [ 141
previous results and an absence o f formal t h e o r i e s on and t h e c e n t r a l computer complex i n t h e Space S h u t t l e
C-version programing has led us to choose an
experimental approach i n t h i s i n v e s t i g a t i o n . ?he [151 are some exanples. In t h e s e systems, copies o f
approach has been to choose m e conveniently i d e n t i c a l programs a r e executed i n t h r e e o r more
accessible programing problems, to assess the i d e n t i c a l processor-memory u n i t s , and voting of t h e
a p p l i c a b i l i t y of N-version programing, and then to results allows d e t e c t i o n and masking o f hardware
proceed to generate a set o f programs. Cnce generated, faults. Since f u l l system r e l i a b i l i t y r e q u i r e s t h e
t h e prcgrans are executed i n a simulated m u l t i p l e - r e l i a b l e operation o f both hardware and software, t h e s e
hardware system, and t h e r e s u l t i n g observations a r e d e s i g n s a r e vulnerable t o software d e f i c i e n c i e s . The
applied to r e f i n e t h e methodology and to build up adoption o f N-version p r o g r m i n g may allow such
t h e o r e t i c a l concepts o f N-version programing. A more systems t o t o l e r a t e both hardware and software f a u l t s
d e t a i l e d discussion o f t h e research approach and g w l s without d e l a y s caused by t h e acceptance t e s t i n g used i n
can be found i n 191, and a d e t a i l e d d i s c u s s i o n of t h e recovery block approach.
experimental results i n [ l o ] .
4. Implementation o f N-Version Programing
3 . A Comparison o f Approaches
In canparison to N-version programing t h e recovery For t h e reason o f convenience , t h e following
block approach has one apparent advantage. In sane discussion w i l l a s s m e t h a t N-3. An extension to N > 3
s i t u a t i o n s , a m f t w a r e system evolves by replacement o f is q u i t e straightfoward.
m e o f its modules with newly developed ones. The
replaced modules can be used as s u p p l m e n t a r y 4.1 Special Mechanisms
a l t e r n a t e s t o t h e new modules. 'Ihe production cost is
lower i n t h i s case. Implementation o f s p e c i a l mechanisms (c-vectors,
c s - i n d i c a t o r s , and synchronization mechanisms), i n a
However, t h e r e a r e a l s o c e r t a i n disadvantages 3-version progran is i l l u s t r a t e d by Figures 2, 3, and
associated with t h e recovery block approach. Fir-st, 4. The schemata shown i n t h e s e f i g u r e s have been
t h e system s t a t e before e n t r y i n t o a recovery block written w i t h t h e PL/I compiler i n mind. It should be
must be saved u n t i l sane reasonable r e s u l t s are noted t h a t a s a result o f emphasis on r e a d a b i l i t y , t h e
obtained fran the block. Considerable s t o r a g e overhead l e n g t h of some i d e n t i f i e r s or l a b e l s may n o t be allowed
may then be involved f o r nested recovery block i n m e implementations.
structures. Second, s p e c i a l precautions a r e needed to
coordinate p a r a l l e l processes within a nested recovery
block s t r u c t u r e . Ctherwise t h e interdependencies anong VERSIONi : PROCEDURE OPTIONS (TASK) ;
these processes may r e q u i r e t h a t a long chain of DCL 1 C-VECTORi
comparison v a r i a b l e s
process e f f e c t s should be undone a f t e r a process has
f a i l e d [ZI. Third, some intermediate o u t p u t frcxn a T{ status flags }.EXTERNAL'
DCL ( D i S A G R E E i , GOODBYE) EIT(1 )'EXTERNAL;
recovery block may not be r e v e r s i b l e i n a real-time
environment. Therefore, no recovery a c t i o n can be DCL (SERVICEi , C O M P L E T E i ) EVENT EXTERNAL;
performed before t h e i n c o r r e c t output causes its DCL FINIS BIT(I) I N I T ( ' O ' B ) ;
daaage. Fourth, s p e c i a l system support is necessary t o other declarations;
a l l e v i a t e t h e above weaknesses. This limits the DO WHILE (TFINIS);
generality o f a p p l i c a t i o n s o f t h e recovery block WAIT (SERVICE1 ) ;
technique. COMPLETION (SERVICEi) = 'O'B;
I F 1GOODBYE & TDISAGREEi
THEN CALL PRODUCE;
F i n a l l y , we a l s o note t h a t t h e e f f e c t i v e n e s s o f t h e ELSE FINIS = ' 1 ' B ;
acceptance test is o f t e n q u i t e d i f f i c u l t t o measure.
In many c a s e s , t h e procedure used to v e r i f y r e s u l t s COMPLETION (COMPLETEi) = ' 1 ' B ;
END;
f r a n t h e execution o f a program can be a s canplex a s PRODUCE : PROCEDURE ;
t h e program itself. For exanple, it is easy to check produce C-VECTORi ;
t h e consistency o f t h e nunber o f elements i n a set END PRODUCE;
before and a f t e r execution of a sorting routine. It. is E N D VERSIONi
more d i f f i c u l t t o v e r i f y t h a t a l l o f t h e d a t a items a r e
indeed sorted a s s p e c i f i e d . I t is even more d i f f i c u l t Figure 2 A Schema f o r the i - t h Version
t o v e r i f y t h a t t h e e l m e n t s o f t h e set before and a f t e r of Code
t h e s o r t i n g a r e t h e same. Therefore, it is obvi.ous
115
ACCEPTANCE: PROCEDURE OPTIONS (MAIN); 4.2 Inexact Votinz
DCL VERSIONi ENTRY;
DCL 1 C VECTORi EXTERNAL, For nunerical computations, twu types of d e v i a t i o n s
may appear i n t h e r e s u l t s . The first type is an
"expected" d e v i a t i o n due t o t h e inexact hardware
r e p r e s e n t a t i o n or t h e d a t a s e n s i t i v i t y of a p a r t i c u l a r
GOODBYE B I T ( 1) EXTERNAL ; algoritfnn. The second type is an Ymexpected"
DCL S E R V I C E i EVENT EXTERNAL, d e v i a t i o n due to either inadequate design or
COMPLETEi EVENT EXTERNAL; implementation of an algorithm, or a malfunction of
o t h e r declarations ; hardware. E i t h e r type of d e v i a t i o n may cause r e s u l t s
COMPLETION ( S E R V I C E i ) = ' 1 ' B ; obtained fran d i f f e r e n t nunerical algorithms to
COMPLETION (COMPLETEi) = ' O ' B ; d i s a g r e e with each o t h e r . me standard voting process
CALL VERSIONi TASK EVENT ( F I N I S i ) ; which r e q u i r e s t h a t t h e majority of correspondent
DO WHILE (need more s e r v i c e ) ; results should have e x a c t l y t h e same values t o
W A I T (COMPLF?Ei )i determine an acceptable result is not a p p l i c a b l e here.
COMPLETION (COMPLETEi) = ' O ' B ; Different voting processes need t o be devised t o handle
process C VECTORi; voting with non-identical results. These voting
I F - ~ n e e dgore s e r v i c e THEN GOODBYE = ' 1 ' B ; processes will be c a l l e d "inexact voting".
COMPLETIGN ( S r R V I C E i ) = ' 1 'B;
END; In general, a d a p t i v e and non-adaptive voting a r e t w o
WAIT ( F I N I S i ) ; a l t e r n a t i v e s which may be applied to perform inexact
END ACCEPTANCE ; voting. Assune t h a t R1, R2, and R3 a r e correspondent
r e s u l t s used to determine t h e voted r e s u l t , R. Then i n
Figure 3 A Schema for an Acceptance Program
t h e approach of adaptive voting, a s suggested i n [161,
R = Wl*Rl + W2rR2 + W3aR3,
DRIVER: PROCEDURE OPTIONS (MAIN) ;
where ( 1 ) W1, W2, W3, a r e weights of R1, R2, R3
DCL (VERSION1 , VERSIONZ, VERSION3) ENTRY;
respectively; (2) W1, W2, W3 a r e p o s i t i v e values; and
declare ( C VECTOR1 , C VECTORZ, C VECTOR3); ( 3 ) Wl+W2+W3=1. These weights may be dynamically
DCL (DISAGFEEl , DISAGEEEZ, DISAGEEE3, GOODBYE) c a l c u l a t e d based on t h e values of R1, R2, and R3. The
B I T ( 1 ) EXTERANL;
major i n t e n t is to favor acceptable r e s u l t s and t o
DCL (SERVICEI, COMPLETE1 , minimize t h e effect of a disagreeing r e s u l t . In o t h e r
SERVICEZ, COMPLETEZ, w r d s , R i s constructed to be a continuous function of
SERVICE3, COMPLETE3) EVENT EXTERNAL; R1, R2, and R3, t h a t w i l l s n w t h o u t t h e e f f e c t of a
other declarations; disagreeing r e s u l t . To c m p u t e t h e weights of
i n i t i a l i z e ( S E R V I C E i . COMPLETE11 as i n ACCEPTANCE: correspondent r e s u l t s , s e v e r a l schemes a r e a v a i l a b l e .
CALL V E R S I O i 1 TASK EVENT ( F I N I S l ) ; The performance of a scheme is influenced by its
CALL VERSIONZ TASK EVENT ( F I N I S Z ) ; "tolerance" parameter, which is a measure of t h e
CALL VERSION3 TASK EVENT ( F I N I S 3 ) ; allowable noise l e v e l and could be optimally determined
DO WHILE (need more s e r v i c e ) : f r a n t h e magnitudes of expected r e s u l t s and noisy
WAIT (COMPLETE1 ,-COMPLETE2, COMPLETE3) ; results.
process ( C VECTOR1 , C VECTORZ, C VECTOR3) ;
I F l D I S A G R r E 1 THEN CORPLETIONICOflPLETEl ) = ' O ' B : The adaptive voting approach may s u f f e r from t h e
I F .DISAGREE2 THEN COMPLETI3N (COMPLETEZ) = 0 ' B f following disadvantages: ( 1 ) The optimal t o l e r a n c e
I F l D I S A G R E E 3 THEN COMPLETION(COMPLETE3)='O'B; parameter is d i f f i c u l t t o determine unless the
I F ineed-more-service THEN GOODBYE='l ' B ; c h a r a c t e r i s t i c s of t h e expected values and noisy values
COMPLETION(SERVICE1 ) = I 1 '6; a r e knom well i n advance. ( 2 ) The remaining e f f e c t of
COMPLETION( SERVICEZ)=' 1 ' B ; noise may n o t b e acceptable i n m e cases. (3) If t h e
COMPLETION (SERVI CE 3) = ' 1 ' B ; voted r e s u l t w i l l be used a s input for n e x t c y c l e of
END ; canputation then t h e accunulation o f r e s i d u a l e f f e c t s
WAIT ( F I N I S 1 , F I N I S Z , F I N I S 3 ) ; of noise may cause a s e r i o u s problem. (4) If
END DRIVER; implemented i n software, t h e adaptive voter may be
q u i t e slow.
Figure 4 A Schema f o r a Driver
As a c o n t r a s t , t h e non-adaptive voting approach u s e s
an allowable discrepancy range and d i f f e r e n c e s of p a i r s
When Figures 2, 3, and 4 are applieo "i" m u l d be of correspondent results i n determining R. Assume t h a t
replaced by 1 , 2, or 3. VERSIONi i s t h e i - t h version 6 is t h e allowable discrepancy range, and D i j is t h e
of a 3-version software u n i t , ACCEPTANCE is the absolute value of t h e d i f f e r e n c e between R i and R j .
acceptance progran with r e s p e c t t o VERSIONi and DRIVER Then an acceptable R may be reached by adopting one of
r e p r e s e n t s a d r i v e r which e x e r c i s e s a +version t h e following t m s t r a t e g i e s : ( 1 ) i f maximun (D12, D23,
software u n i t . C-VECTORi r e p r e s e n t s a c-vector t o be D31) 6 or ( 2 ) i f minimimun (D12, D23, D31) <_ 6.
produced by t h e i-th version. The cs-indicator The first s t r a t e g y r e q u i r e s D12, D23, D3l be known
DISAGREEi shows whether or n o t C-VECTORi a g r e e s with before R can be determined. I n t h e second s t r a t e g y ,
t h e correspondent acceptable r e s u l t s . Another cs- however, i f D12 L 6 R can be determined without
i n d i c a t o r , GOODBYE, r e p r e s e n t s wbether or n o t a normal knowing D23 and C Y .
terminating condition is satisfied. The
synchronization primitive, SERVICEi, is used to s i g n a l The non-adaptive voting approach is n o t without
a request fran t h e d r i v e r for t h e s e r v i c e of t h e i - t h flaws. F i r s t , t h e value of 6 is very d i f f i c u l t t o
version. Another synchronization p r i m i t i v e , CCMPLETEi, determine dynamically for each instance o f voting.
i s used by t h e i-th version t o s i g n a l t h e d r i v e r t h a t Second, t h e s t r a t e g y which u s e s t h e p r i n c i p l e of
C-VECTORi is ready. maximun (D12, D23, D32) L 6 is too r i g i d s i n c e an
erroneous r e s u l t may e a s i l y cause a D i j which is l a r g e r
From Figures 2, 3, and 4 , it is evident t h a t t h e than S . Acceptable r e s u l t s may n o t be reached even
implementation of s p e c i a l mechanisms for N-version when two of t h e t h r e e correspondent r e s u l t s a r e
programing is r e l a t i v e l y simple. This is i l l u s t r a t e d reasonably close. Third, t h e s t r a t e g y which uses t h e
by t h e exanple i n Appendix 1. p r i n c i p l e of minimun (D12, D23, D31) 6 may encounter
116
s i t u a t i o n s where each version may have d i f f e r e n t The problem s p e c i f i c a t i o n f o r RATE (given i n [ l o ] )
effects on t h e outcane of voting. These s i t u a t i o n s a r e s t a t e s t h e requirements for a stand-alone RATE program
i l l u s t r a t e d b e t t e r with t h e following exanples. AsSUlle lhis s p e c i f i c a t i o n is w r i t t e n to make a RATE program
t h a t t h e allowable discrepancy range is 0.9 and ( i ) e a s i l y i n s t r u n e n t a b l e f o r t h e p u r p s e o f conducting
expected R I = 117.0, ( i i ) expected R2 = 116.5, and experiments on 3-version programing.
(iii) expected R3 = 115.8. In t h i s c a s e , ( i ) i f Only
Rl or A3 is erroneous, an acceptable R can still be
generated, ( i i ) but i f R2 i s erroneous then no There are four t y p e s o f output d a t a s t r u c t u r e s
acceptable R can be generated. described i n t h e problem s p e c i f i c a t i o n f o r RATE. The
' h e r e f o r e , t h e r e is no i n e x a c t voting approach which MESSAGE type of d a t a informs t h e d r i v e r about
can be applied s a t i s f a c t o r i l y t o a l l cases. The execution termination c o n d i t i o n s or the time a t which
success of an approach u s u a l l y depends o n the r e s u l t s a r e produced. The FAClDR type of data
d e s i g n e r ' s knowledge about ( 1 ) t h e d a t a s e n s i t i v i t y of r e p r e s e n t s t h e s i z e and t h e u n i t s o f t h e s p e c i f i e d
each algorithm, (2) t h e l i m i t a t i o n s o f hardware region. The GRID type of d a t a r e p r e s e n t s a point i n
representations, and (3) the allowable ranges o f t h e concerned region and its previous and Current
d i s c r e p a n c i e s f o r each instance o f voting. temperatures. ?he CUTPUT-REWEST type of d a t a
i n d i c a t e s t h e p i n t s and t h e times f o r which o u t p u t s
5. F e a s i b i l i t y S t u d i e s o f N-version Programming a r e requested. Each o f t h e s e t y p e s o f d a t a can be
t r e a t e d a s a c-vector t h a t r e p r e s e n t s r e s u l t s t o be
A t an e a r l y s t a g e of t h e i n v e s t i g a t i o n o f N-version voted upon.
prograrming, it was decided to conduct a few
experiments to gain m e i n s i g h t i n t o t h e f e a s i b i l i t y The problen s p e c i f i c a t i o n f o r RATE a l s o g i v e s an
o f t h i s technique. Three o b j e c t i v e s were set for t h e algorithm s p e c i f i c a t i o n f o r : ( 1 ) terminating c o n d i t i o n s
experiments. They were: (1) to study t h e g e n e r a l i t y for program execution; ( 2 ) cc-points a t which r e s u l t s
and t h e ease of t h e implementation o f N-version should be produced; and ( 3 ) an algorithm that
programing; (2) to gain q u a l i t a t i v e and q u a n t i t a t i v e implements a nunerical approximation for solving t h e
d a t a on e f f e c t i v e n e s s of 3-version programing; and ( 3 ) p a r t i a l d i f f e r e n t i a l equation shown above. Three
to observe and i d e n t i f y problems or d i f f i c u l t i e s i n a l g o r i t h s , (ALCI, ALC2, Affi3) were s e l e c t e d t h a t
using 3-version programing. implemented t h r e e d i f f e r e n t nunerical methods to s o l v e
t h e partial d i f f e r e n t i a l equation. While an i n d i v i d u a l
Three c r i t e r i a were used to select t a r g e t problems RATE s p e c i f i c a t i o n can employ only one of these
f o r f e a s i b i l i t y s t u d i e s . F i r s t , a t a r g e t problem needs algoriths. t h e e x i s t e n c e o f t h r e e d i f f e r e n t choices
to be r e l a t i v e l y complex 50 t h a t t h e r e is reasonably provides a higher p r o b a b i l i t y o f avoiding r e l a t e d
good p o s s i b i l i t y t h a t r e s i d u a l software d e f e c t s w i l l programing errors in t h e N-version programing
occur i n i t s implementing programs. Second, the experiment.
progran implementing a t a r g e t problem should be o f
manageable s i z e to f a c i l i t a t e t h e instrunentation The problem of RATE was given a s a programing
efforts. Third, s i n c e programing is an expensive assignnent f o r t h e greduate seminar course, E2262,
a c t i v i t y , it is very d e s i r a b l e t h a t a t a r g e t problem o f f e r e d a t UCLA i n t h e Spring q u a r t e r of 1977. The
should allow convenient generation of m u l t i p l e v e r s i o n s s t u d e n t s were asked to form teams o f t w o people t o
o f a progrzn. s o l v e t h e RATE problem. There were t h r e e s t u d e n t s who
preferred to w r k alone and t h e i r r e q u e s t s were
5.1 The 3-version Y E S Frogran Experiment granted. T o t a l l y , t h e r e were 18 teams t h u s formed.
Two of t h e s e teams f a i l e d t o turn i n t h e i r programs.
MESS Qini-Text Editing Sygtem) was a program That l e f t 16 programs a v a i l a b l e f o r t h i s i n v e s t i g a t i o n .
assignnent for t h e graduate seminar course E2262, The RATE s p e c i f i c a t i o n was d i s t r i b u t e d to a l l 18
offered a t UCLA i n t h e Spring q u a r t e r o f 1976. A programing teuns. Algorithms 1, 2, and 3 were
preliminary r e p o r t on Y S S is contained in [91. The s p e c i f i e d i n s i x cases. Each o f t h e RATE programs was
s p e c i f i c a t i o n o f M E S S and 2 d e t a i l e d d e s c r i p t i o n of t h e w r i t t e n i n P l / I ( F ) , and executed on t h e B IM 360/91.
r e s u l t s can be found i n [lo!. Each program consisted of more than 600 P l / I
From t h e experience and t h e results o f t h e HESS statements. The period allowed f o r t h e developnent of
experiment t h e following conclusions were reached: ( 1 ) t h e R4TE programs was four weeks.
The methodology used to implement N-version
programing (see Sections 2 and 4 ) is r e l a t i v e l y simple The R4TE programs were c o l l e c t e d and t e s t e d a g a i n s t
and can be generalized to o t h e r s i m i l a r a p p l i c a t i o n s . s i x text c a s e s . Based on t h e results of t h e s e tests,
(2) The r e s u l t s a t t a i n e d f r a n executing +version t h e b e s t four programs were selected f o r f u r t h e r
programs a r e encouraging. The e f f e c t i v e n e s s of experiments. Besides, t h e r e were t h r e e programs
3-version programing seems to warrant f u r t h e r developed by t h e authors. Hence, we had 7 programs
investigation. ( 3 ) The 3-version redundancy was a v a i l a b l e for subsequent s t u d i e s on t h e f e a s i b i l i t y o f
s u c c e s f u l l y applied a t subroutine (module) l e v e l , 3-version programing.
r a t h e r than a t c a n p l e t e program l e v e l . This s h o w t h a t
s e l e c t i v e a p p l i c a t i o n of N-version redundancy t o Three of t h e s e l e c t e d programs implemented ALG1.
c e r t a i n c r i t i c a l parts o f longer programs can be a Two each o f t h e remaining four programs implemented
practical alternative. ALC2 and ALC3 r e s p e c t i v e l y . The seven programs were
i n s t r u n e n t e d , grouped i n t o 12 combinations, and t e s t e d
5.2 The 3-version R4TE FV0gt-F Experiment with 32 test c a s e s . T o t a l l y , t h e r e were 384 c a s e s
t e s t e d . Among them:
RATE, standing for Region Approximation and ( 1 ) 290 c a s e s contained no bad version,
Temperature Estimation, is a program for computing ( 2 ) 71 c a s e s contained one bad version,
dynamic changes o f temperatures a t d i s c r e t e p o i n t s i n a ( 3 ) 18 c a s e s contained two bad versions, and
p a r t i c u l a r region o f a plain. The temperature changes (4) 5 c a s e s contained t h r e e bad versions.
a r e governed by t h e followi?g equation:
Naturally, t h e 290 c a s e s o f t h r e e good versions
generated acceptable results, and t h e 18 c a s e s
containing two bad versions and t h e 5 c a s e s containing
t h r e e bad versions generated unacceptable r e s u l t s . For
where t h e c o e f f i c i e n t s A , B, C, D, E, and F are some t h e 71 c a s e s containing a s i n g l e bad version, 59 c a s e s
constants,
117
generated acceptable results and 12 c a s e s generated (5) I n t h e event t h a t an allowable range o f
unacceptable results. These 12 c a s e s contained a discrepancy cannot be e a s i l y determined, t h e problem of
version which malfunctioned i n a way t o cause t h e i n e x a c t voting is d i f f i c u l t t o handle. Acceptable
system to a b o r t t h e execution o f t h e involved +version r e s u l t s a r e d i f f i c u l t to reach i n t h i s case.
unit.
Although t h e above conclusions have a s i g n i f i c a n t
Based on these r e s u l t s , we have observed tW nunber of negative p o i n t s , it should be noted t h a t only
d i f f i c u l t i e s which r e q u i r e s p e c i a l a t t e n t i o n i n N- a very small portion of t h e f i e l d o f N-version
version programing: ( 1 ) I n m e s i t u a t i o n s , one programing has been touched. Some negative results
version of code developed an e r r o r t h a t caused t h e might be due t o inexperience of programmers, inadequate
operating system of t h e B IM 360/91 canputer t o take selection of problems, or improper c o n t r o l of
over execution, An exanple is t h e improper handling o f environment for conducting these s t u d i e s . Furthermore,
conversion e r r o r s by a version. A s a r e s u l t , both t h e t h e r e a r e s e v e r a l v a r i a t i o n s of N-version programing.
involved version and its associated 3-version program This approach might be more e f f e c t i v e when it is
were aborted by t h e operating system. Even though t h e applied to t h e s p e c i f i c a t i o n or design o f a program
other t m versions were executing properly, t h e 1171. It is a l s o p o s s i b l e t h a t N-version programming
3-version program could not proceed f u r t h e r p a s t t h i s can aid progran t e s t i n g more e f f e c t i v e l y than it can
point t o generate c o r r e c t r e s u l t s . (2) The l o g i c being perform run-time software d e f e c t masking. Therefore,
implemented by one version o f t h e code may be correct, it i s believed t h a t a t t h e present s t a g e o f t h e
i n c o r r e c t , or it may be a l t o g e t h e r missing. For c a s e s investigation N-version programing remains an
i n which missing l o g i c is t h e cause o f i n c o r r e c t interesting and p o t e n t i a l l y e f f e c t i v e approach t o
software operation, error symptoms f r a n f a u l t y versions software fault-tolerance.
tend to be t h e same. There is a p o s s i b i l i t y t h a t
f a u l t y but i d e n t i c a l results (due to missing l o g i c ) may 7. Acknowledgment
outvote c o r r e c t r e s u l t s .
This research was supported by the U.S. National
6. Conclusions Science Fomdation, Grant No. MCS 72-03633 A@. The
authors wish to acknowledge the generous and
The r e s u l t s obtained f r m t h e MESS and t h e RATE
e n t h u s i a s t i c advice and cooperation received from Prof.
experiments a r e o f mixed nature. There a r e s e v e r a l Daniel M. Berry and t h e proaramscontributed by t h e
encouraging points: (1 ) The methodology for s t u d e n t s of h i s Software Engineering c l a s s e s i n 1976
implementing N-version programming is r e l a t i v e l y simple and i n 1977. Valuable advice and criticism was
and can be generalized to o t h e r s i m i l a r a p p l i c a t i o n s ; received f r a n Professors J.A. Goguen, J r . and D.F.
(2) In m e c a s e s , +version p r o g r m i n g has been Martin of UCLA, and fran tw r e f e r e e s o f t h i s paper.
e f f e c t i v e i n preventing f a i l u r e due to defects
l o c a l i z e d i n one version o f code; and ( 3 ) tJ-version
programing can be a p r a c t i c a l approach i f it is
s e l e c t i v e l y applied a t subroutine l e v e l . On t h e o t h e r 8. References
hand, t h e r e a r e sane n e g a t i v e points: ( 1 ) I n the
e n v i r o m e n t of sane o p e r a t i n g systems, c e r t a i n [ 13 A. Avizienis, "Fault-Tolerance and Fault-
implementation d e f e c t s of a version may cause i t s Intolerance: Complementary Approaches t o Reliable
associated +version progran to be aborted by t h e O.S.; Computing," Proc. 1975 I n t . Conf. Reliable Software,
and ( 2 ) I f missing program functions are the 458-454.
predminant software defects, then N-version
programing may n o t be an e f f e c t i v e approach. [21 B. Randell, "System S t r u c t u r e for Software
Fault-Tolerance," IEEE Trans. Software Fngr., Vol.
I n a d d i t i o n t o t h e experimental r e s u l t s , we have SE-1, June 1975, 220-232.
.identified Sane s i t u a t i o n s in which N-version
programing appears to be n o t e f f e c t i v e or is n o t [31 H . Hecht, "Fault-Tolerant Software f o r Real-Time
applicable. The most s i g n i f i c a n t l i m i t a t i o n s a r e Applications," ACM Computing Surveys, Vol. 8, Dec.
discussed below. 1976, 391-407.
( 1 ) I n a real-time environment a system f a i l u r e may [41 A. Avizienis, "Fault-Tolerant Computing:
be caused by performance l i m i t a t i o n s r a t h e r than Progress, Problems and Prospects Information
functional problems. TWO t y p i c a l e x m p l e s a r e timing Processing 77 (Proc. IFIP Congress 19771, 405-420.
c o n s t r a i n t v i o l a t i o n s and resource contentions. A
l i k e l y source f o r t h e s e problems is systen overload. [51 X.R. Elmendorf, "Fault-Tolerant Programming ,"
N-version programing may produce adverse e f f e c t s i n Proc. 1972 I n t . Symp. Fault-Tolerant Computing, June
these situations. 1972, 79-83.
( 2 ) In c e r t a i n o t h e r c i r c u n s t a n c e s , t h e r e e x i s t s no 161 H. Kopetz, "Software Redundancy i n Real Time
unique path t o t h e s o l u t i o n of a problem. Step-by-step Systems," Proc. IFIP Congress 1974, 182-186.
matching or voting of correspondent results cannot be
used a s a c r i t e r i o n o f c o r r e c t n e s s . Therefore, N- [71 E. Girard and J.C. Rault, "A Programming
version programing is n o t a p p l i c a b l e f o r s i t u a t i o n s i n Technnique f o r Software R e l i a b i l i t y , " Proc. 1973 IEEE
which d i s t i n c t m u l t i p l e s o l u t i o n s (or intermediate Symp. on Computer Software R e l i a b i l i t y , 44-50.
s o l u t i o n s ) exist.
[81 M.A. F i s c h l e r , e t . a l . , " D i s t i n c t Software: An
( 3 ) In still o t h e r c a s e s a long sequence o f o u t p u t s Approach t o Reliable Computing,', Proc. 2nd USA-Japan
may n o t lend itself t o be s p e c i f i e d i n a s p e c i f i c Computer Conference, Tokyo, Japan, 1975, 1-7.
order. In these c a s e s , t h e o u t p u t s from t h e component
v e r s i o n s cannot be r e a d i l y compared. 191 A. Avizienis and L. Chen, "Ckl t h e Implementation
o f N-version Programming f o r Software Fault-Tolerance
(4) In m e s i t u a t i o n s t h e sequence o f o u t p u t s from b r i n g Execution," Proceedings o f COMPSAC 77, ( F i r s t
a version is context-dependent. Any e r r o r t h a t pushes IEEE-CS I n t e r n a t i o n a l Computer Software and Application
t h e rest o f output o f f its proper p o s i t i o n makes t h e Conference), Nov. 1977, 149-155
subsequent comparison o f r e s u l t s meaningless.
118
[ l o ] L. Chen, "Improving Software R e l i a b i l i t y by N-
version UCLA Computer Science Dept.
.Technical R e m r t , University of California Los CONVERSION1 : PROCEDURE OPTIONS (TASK) ;
Angeles, 1978. DCL NUMBERl BINARY F I X E D ( 3 1 ) EXTERNAL;
DCL (SERVICE1 , COMPLETEl) EVENT EXTERNAL;
[111 B.H. Liskov and V. Berzins. "AI ADDraiSal o f DCL (DISAGREE1 , GOODBYE) B I T ( 1 ) EXTERNAL;
Progran s p e c i f i c a t ions, 1, Computation str &ut- e s Craup DCL D I G I T S ( 1 0 ) BINARY F I X E D ( 6 ) EXTERNAL;
Memo 141-1, MIT Laboratory f o r Ccinputer Science, April DCL F I N I S B I T (.1 ). I N I T ( 'O'B) ;
1977. DO WHILE ( > F I N I S ) ;
WAIT (SERVICE1 ) ;
1123 W.C. Bouricius, et. a l . , " R e l i a b i l i t y Modeling COMPLETION (SERVICE1) = ' O ' B ;
Techniques for Self-Repairing Computer Systems,1t m. NUMBERl = 0:
ACM 1969 Annual Conf.,295-309. I F lGOODBYE.& 1DISAGREE1
THEN DO I = 1 TO 10;
C131 J.H. Wensley e t . a l . , "The Design, Analysis, and NUMBERl = NUMBER1 *1 D+
Verification of t h e S I F T Fault-Tolerant System," PE. D I G I TS ( I - 60 ;
2nd I n t . Conf. Software Engineering, Oct. 1976, END;
aa-269. ELSE F I N I S = ' 1 ' B ;
COMPLETION (COMPLETE1 ) = '1 ' B ;
[141 A.L. Ho@cins, Jr., and T.B. Smith 111, "?he END;
Architectural Elements of a Svmnetric Fault-Tolerant END CONVERSION7 ;
Multiprocessor," IEEE T r a n s . C$nputers, Vol. C-24, May
19-75 , 43&505. Figure 5
[ 151 J.R. Sklaroff, "Redundancy Management Technique
IM J. of Res. and Dev.,
f o r Space S h u t t l e Canputers,lt B
Vol. 20, Jan. 1976, 20-25.
Ti61
- - - R.B. Broen. "New Voters f o r Redundant Systems,"
Trans. M E : Journal of Dynanic Systems, Measurement, CONVERSION2: PROCEDURE OPTIONS (TASK) ;
and Control, March 1975. 41-45. OCL NUMBER 2 BINARY F I X E D ( 3 1 ) EXTERNAL;
OCL (DISAGREEZ. GOODBYE) B I T ( 1 ) EXTERNAL;
[ l 7 ] A.B. Long, C.V. Rammoorthy, et. al., "A DCL (SERVICE2, .COMPLETE2) EVENT EXTERNAL ;
Methodology f o r k v e l o p n e n t and Validation of C r i t i c a l DCL ( D I G I T S ( 1 0 ) EINARY F I X E D ( 6 ) EXTERNAL;
Software f o r Nuclear Power P l a n t s , " Proc. CCMPSAC '77 OCL F I N I S BIT(1) I N I T ( 'O'B);
(IEEE-CS I n t . Conputer Software & Applications Conf.), DO WHILE ( 7 F I N I S ) ;
620-626. WAIT (SERVICE2) ;
COMPLETION (SERVICEL) = ' O ' B ;
Authors NUMBER2 = 0;
I F TGOODBYE & l D I S A G R E E 2
Liminz Chen was born i n Taiwan. He received the ILS. THEN DO I = 1 TO 10;
( ' 6 9 ) i n Psychology f r a n the National Taiwan University NUMBER2 = NUMBER2 +
and the Y.A. and X.S. degrees i n Psychology and i n ( D I G I T S (I )-60)*1 O**( 1 0 - 1 ) ;
Compter Science fran UCLA. Since 1974 he has been a
END;
Postgraduate Research Engineer associated w i t h t h e ELSE F I N I S = ' 1 ' B ;
Fault-Tolerant Computing project a t UCLA, where he is
COMPLEION (COMPLETE2) = '1 ' B ;
canpleting t h e Ph.C. d i s s e r t a t i o n i n Computer Science.
END;
Since Yay 1977, he is associated w i t h t h e Xerox Co.
END CONVERSIONZ;
working on software developnent methodology, diagnostic
programing, and huna? engineering. Figure 6
A 1 i r d a s Avizienis was born i n Kaunas, Lithuania. He
re:eived the E.S. ( ' 5 4 ) , M.S. ('55) and Ph.D. ( ' 6 0 )
degrees i n E l e c t r i c a l Engineering from t h e University
of I l l i n o i s . I n 1960 ne i n i t i a t e d research on f a u l t -
t o l e r a n t canputing and l a t e r directed the JPL-STAR CONVERSIONJ: PROCEDURE OPTIONS ( M A I N ) ;
computer project a t the J e t Propulsion Laboratory, DCL NUMBER3 BINARY F I X E D ( 3 1 ) EXTERNAL;
where he is now an Academic Hember of Technical S t a f f . DCL (DISAGREE3. GOODBYE) B I T ( 1 ) EXTERNAL;
Since 1962 ne has been a f a c u l t y member a t UCLA, where DCL C SERVICE^, COMPLETE^) EVENT EXTERNAL;
he is now a Professor i n the Computer Science DCL D I G I T S ( 1 0 ) BINARY F I X E D ( 6 ) EXTERNAL;
Department. He i s : Fellow of the IEEE and the author OCL F I N I S BIT(1) INIT('0'B);
of over 50 publications on computer a r i t h m e t i c and DO WHILE (,FINIS);
f a u l t tolerance. He was the f i r s t chairman of t h e WAIT (SERVICE3);
IEEE-CS Technical Comnittee on Faul t-Tolerant Computing COMPLETION ( S E R V I C E 3 ) :: ' O ' B ;
and the Chairman of FTCS-1 i n 1971. NUMBER3 = 0;
I F TGOODBYE & l D I S A G R E E 3
Appendix THEN DO I = 1 TO 10;
NUMBER3 = NUMBER3*1 O+
Assume t h a t a +version software u n i t has heen MOD(DIGXTS(I), 60);
identified t o implement a function which can be END ;
activated repeatedly t o convert an a r r a y of 10 ASCII ELSE F I N I S = ' 1 ' 6 ;
coded digits into a binary nunber. Possible COMPLETION(COMPLETE3) ' 1 'B;
2:
119