Fault Tolerant Systems: Chapter 1: PRELIMINARIES
OUTLINE
PRELIMINARIES
FAULT CLASSIFICATION
TYPES OF REDUNDANCY
BASIC MEASURES OF FAULT TOLERANCE
TRADITIONAL MEASURES
NETWORK MEASURES
PRELIMINARIES
Computer systems, hardware and software, are the most complex systems ever created by human beings.
Critical applications: the Space Shuttle, financial systems, medical instruments, etc.
Fault tolerance: techniques to tolerate faults while still delivering an acceptable level of service for the system's intended objectives.
FAULT CLASSIFICATION
FAULT -> ERROR -> FAILURE
Fault: hardware defect or software/programming mistake.
Error: manifestation of a fault.
Failure: the system does not achieve its intended objective.
A fault or error may spread through the system.
Containment zone: barrier that reduces the chance that a fault or error in one zone propagates to another.
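FAULT CLASSIFICATION
FAULT CHARACTERISTICS
Permanent: a defect that does not go away; the component stays faulty until it is repaired or replaced.
Transient: the component malfunctions for some time and its functionality is restored afterward.
Intermittent: the fault oscillates between quiescent and active.
OTHER CHARACTERISTICS
Benign.
Malicious: the output appears reasonable, but is incorrect.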
TYPES OF REDUNDANCY
REDUNDANCY:
Property of having more of a resource than is minimally necessary to do the job.
When a fault occurs, redundancy masks it or works around it.
FORMS OF REDUNDANCY:
Hardware redundancy (static and dynamic): incorporate extra hardware into the design to either detect or override the effects of a failed component.
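As an illustration of static hardware redundancy that masks a fault, the sketch below models triple modular redundancy (TMR) with a majority voter. TMR is not named on the slide; it is used here only as a familiar example, and the module functions `good` and `faulty` are hypothetical stand-ins for hardware units.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises RuntimeError when no strict majority exists, i.e. the fault
    cannot be masked by voting alone.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value

def tmr(modules, x):
    """Run every redundant module on the same input and vote on the results."""
    return majority_vote([m(x) for m in modules])

# Hypothetical modules: two healthy copies and one with a permanent fault.
good = lambda x: x * x
faulty = lambda x: x * x + 1

print(tmr([good, good, faulty], 7))  # 49: the single faulty output is outvoted
```

With three modules the voter masks any single faulty module; two modules that fail differently leave no majority, and the sketch reports that instead of guessing.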
TYPES OF REDUNDANCY
FORMS OF REDUNDANCY (cont.):
Information redundancy: error detection and correction.
Time redundancy: re-execution of the same program on the same hardware.
Software redundancy: multiple versions of a program.
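A minimal sketch of information redundancy, assuming a single even-parity bit as the code. A parity bit is only one simple example and is not taken from the slide; it detects any single-bit error but cannot correct it. The helper names `add_parity` and `parity_ok` are hypothetical.

```python
def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Return True when the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # data bits plus one redundant bit
corrupted = word.copy()
corrupted[2] ^= 1                 # flip one bit to model a single-bit error

print(parity_ok(word), parity_ok(corrupted))  # True False
```

Flipping two bits would leave the parity even, so the error would go undetected; stronger codes spend more redundant bits to detect or correct more errors.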
BASIC MEASURES OF FT
MEASURE
Mathematical abstraction that expresses some relevant facet of the performance of an object. It usually captures only a subset of the object's properties.
TYPES:
Traditional.
Network.
BASIC MEASURES OF FT
RELIABILITY AND AVAILABILITY: Very limited in what they can express.
Reliability R(t): probability that the system has been up (operational) continuously in the time interval [0, t].
Mean Time To Failure (MTTF): average time the system operates until a failure occurs.
Mean Time Between Failures (MTBF): average time between two consecutive failures.
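To make R(t) and MTTF concrete, the sketch below assumes a constant failure rate, so that R(t) = exp(-rate * t). That exponential model is an assumption made here for illustration and is not stated on the slide. MTTF is then approximated as the area under R(t), which for this model should come out near 1/rate. The names `reliability` and `mttf_numeric` and the rate value are hypothetical.

```python
import math

def reliability(t, failure_rate):
    """R(t) under the assumed constant-failure-rate (exponential) model."""
    return math.exp(-failure_rate * t)

def mttf_numeric(failure_rate, horizon, steps=100_000):
    """Approximate MTTF as the integral of R(t) over [0, horizon].

    Uses the trapezoid rule; the horizon must be long enough that
    R(horizon) is negligible.
    """
    dt = horizon / steps
    total = 0.5 * (reliability(0, failure_rate) + reliability(horizon, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt

rate = 0.001  # assumed failures per hour
print(reliability(100, rate))               # about 0.905
print(mttf_numeric(rate, horizon=20_000))   # close to 1 / rate = 1000 hours
```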
BASIC MEASURES OF FT
Mean Time To Repair (MTTR): average time needed to repair the system following a failure.
MTBF = MTTF + MTTR
Availability A(t): average fraction of time over the interval [0, t] that the system is up (operational).
Point availability A_P(t): probability that the system is up at the particular time instant t.
Long-term (steady-state) availability:
$A = \lim_{t \to \infty} A(t)$
BASIC MEASURES OF FT
Long-term availability may be calculated from MTTF, MTBF, and MTTR:
$A = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$
It is possible for a low-reliability system to have high availability: a system that fails every hour on average but comes back up after only one second has an MTBF of one hour (low reliability), yet its availability is high: A = 3599/3600 = 0.99972.
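The arithmetic for this example can be checked with a short sketch. It assumes times in seconds, an MTBF of 3600 s and an MTTR of 1 s, so MTTF = MTBF - MTTR = 3599 s; the function name `steady_state_availability` is hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """Long-term availability A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

mtbf = 3600.0        # one failure per hour on average, in seconds
mttr = 1.0           # the system is back up after one second
mttf = mtbf - mttr   # follows from MTBF = MTTF + MTTR

a = steady_state_availability(mttf, mttr)
print(round(a, 5))   # 0.99972
```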
BASIC MEASURES OF FT
NETWORK MEASURES:
Focus on the network that connects the processors together.
Node and line connectivity: minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected.
Connectivity can only distinguish two network states: connected and disconnected. It says nothing about how the network degrades as nodes fail before, or after, becoming disconnected.
BASIC MEASURES OF FT
NETWORK MEASURES:
Both networks, N1 and N2, have the same node connectivity of 1, but N1 is much more connected than N2: the probability of N1 being broken up is lower than for N2.
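The point that node connectivity alone hides large differences can be sketched with a brute-force computation on two small graphs. The graphs `dense` and `sparse` below are hypothetical stand-ins and are not the N1 and N2 of the slide, whose topologies are not given in the text. Both have node connectivity 1, yet they differ in how many single-node failures disconnect them.

```python
from itertools import combinations

def graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_connected(adj, removed=frozenset()):
    """Depth-first search over the nodes that have not failed."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def node_connectivity(adj):
    """Minimum number of node failures that disconnects the network.

    Brute force, suitable only for small graphs. Returns n - 1 when no
    node set disconnects the graph (a complete graph), by convention.
    """
    nodes = list(adj)
    for k in range(len(nodes) - 1):
        for cut in combinations(nodes, k):
            if not is_connected(adj, frozenset(cut)):
                return k
    return len(nodes) - 1

def single_node_cuts(adj):
    """Nodes whose individual failure disconnects the network."""
    return [n for n in adj if not is_connected(adj, frozenset([n]))]

# Hypothetical dense network: two 4-node cliques sharing node 0.
dense = graph(list(combinations([0, 1, 2, 3], 2)) + list(combinations([0, 4, 5, 6], 2)))
# Hypothetical sparse network: a 7-node chain.
sparse = graph([(i, i + 1) for i in range(6)])

print(node_connectivity(dense), node_connectivity(sparse))          # 1 1
print(len(single_node_cuts(dense)), len(single_node_cuts(sparse)))  # 1 5
```

Both graphs report connectivity 1, but only one node failure breaks up the dense graph while five different single failures break up the chain, which is the kind of difference the connectivity measure cannot express.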
OUTLINE
HARDWARE FAULT TOLERANCE
INFORMATION REDUNDANCY
FAULT TOLERANT NETWORK
SOFTWARE FAULT TOLERANCE
CHECKPOINTING
CASE STUDIES
FAULT DETECTION IN CRYPTOGRAPHIC SYSTEMS