White Paper A Hybrid Fault Tolerant Architecture: RTP Corporation
Project:
RTP 3000 System
Customer:
RTP Corporation
Pompano Beach, FL
USA
The document was prepared using best effort. The authors make no warranty of any kind and shall not be liable in
any event for incidental or consequential damages in connection with the application of the document.
© All rights reserved.
Management summary
The RTP 3000 SIS System has a hybrid architecture that uses a set of advanced design
techniques to provide SIL 3 safety integrity and high availability. Safety integrity and high
availability are achieved on a system that also provides an unusual level of architecture
flexibility and computing speed (5 msec. scan rates). This combination of safety integrity, high
availability, flexibility and performance sets new levels of expectation among safety PLC users.
Architectures available include:
Input Module: Single 1oo1, Dual 1oo2, Triple 2oo3
CPU Module: Single 1oo1, Dual 1oo2, Triple 2oo3
Output Module: Single 1oo1D, Dual 2oo2D
Each subsystem and each I/O module can have a different architecture depending on the
criticality of application functions using those modules. In this way a cost optimized system
based on application risk can be designed.
Input modules with a single (1oo1) architecture provide cost effective inputs with a safety
integrity rating of SIL 2. The dual architecture (1oo2) will provide high safety integrity to a rating
of SIL 3. The triple architecture (2oo3) is used to provide higher availability of the input
subsystem. Diagnostics are primarily provided via comparison in the Node Processor.
Node Processor modules can be configured with single, dual and triple architectures. The single
(1oo1) architecture is the base configuration. A dual architecture (1oo2) is used to achieve high
safety integrity. A triple architecture (2oo3) is used to achieve both safety integrity and high
availability. Comparison diagnostics between the Node Processors provide high effectiveness
fault detection even with transient bit errors and soft failures in small geometry integrated
circuits. The approach of using detailed comparison instead of extensive self-diagnostics also
frees computing power to ensure higher application function performance.
Output modules with a single (1oo1D) architecture will provide high safety integrity to a rating of
SIL 3 with no redundancy. The dual (2oo2D) architecture is used to provide higher availability
for each output subsystem. Single channel safety integrity is achieved through automatic
diagnostics which will initiate an output shutdown if potentially dangerous failures are detected.
The diagnostics are run locally in the output module, in the chassis (I/O) processor and in some
cases in the node processor.
A Markov model was developed to analyze the behavior of the RTP 3000 SIS system under
fault conditions for two common configurations:
1. Maximum Safety (1oo2, 1oo1D)
2. Maximum Availability and Safety (2oo3, 2oo2D)
Using the Markov models and the failure rates from the FMEDA, example average Probability of
Failure on Demand (PFDAVG) and Mean Time To Fail Spurious (MTTFS) values are calculated.
The results confirm the level of high safety integrity and high availability achieved by the design.
For example, the 1oo2 architecture used two relay contacts wired in series. If one contact
failed short circuit, the other contact could still open the circuit. The assembly failed short
circuit only if both contacts failed short circuit. The disadvantage of this circuit was that the
open circuit failure rate doubled, since the circuit would fail open circuit if either contact failed
open circuit. The only architecture that could tolerate both short circuit and open circuit
failures was the 2oo3. A full description of all architectures is available in [N3, Chapter 14].
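The relay argument above can be checked by enumeration. The following is a minimal sketch, not part of the original analysis: it assumes each contact is either healthy, stuck closed ("short"), or stuck open ("open"), and that the 2oo2 arrangement is two contacts in parallel (the text only describes the 1oo2 series wiring explicitly). All function and constant names are illustrative.

```python
# Fault states for one relay contact (names are illustrative).
OK, SHORT, OPEN = "ok", "short", "open"

def contact_closed(fault, commanded_closed):
    """Return True if the contact conducts, given its fault state."""
    if fault == SHORT:   # stuck closed regardless of the command
        return True
    if fault == OPEN:    # stuck open regardless of the command
        return False
    return commanded_closed

def series_1oo2(c):
    """Two contacts in series: either contact can open the circuit."""
    return c[0] and c[1]

def parallel_2oo2(c):
    """Two contacts in parallel (assumed wiring): both must open."""
    return c[0] or c[1]

def voted_2oo3(c):
    """Circuit conducts when any two of three contacts are closed."""
    return (c[0] and c[1]) or (c[0] and c[2]) or (c[1] and c[2])

def tolerates_single(circuit, n, fault):
    """True if the circuit follows the command with any one contact faulted."""
    for bad in range(n):
        for commanded_closed in (True, False):
            faults = [fault if i == bad else OK for i in range(n)]
            contacts = [contact_closed(f, commanded_closed) for f in faults]
            if circuit(contacts) != commanded_closed:
                return False
    return True

for name, circuit, n in [("1oo2", series_1oo2, 2),
                         ("2oo2", parallel_2oo2, 2),
                         ("2oo3", voted_2oo3, 3)]:
    print(name,
          "tolerates short:", tolerates_single(circuit, n, SHORT),
          "tolerates open:", tolerates_single(circuit, n, OPEN))
```

Under these assumptions only the 2oo3 arrangement follows the command for both single-fault types, which matches the statement in the text.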
One issue with these designs was that any failure tolerated by the design would become
hidden. Normal operation would continue even though individual relays had failed. When this
happened, the fault tolerance was lost but the failure was typically not known to the operator or
responsible maintenance personnel. A second failure would fail the system. Thus, the systems
required frequent manual inspection and testing to prove that all the relays still worked
completely. The operational cost of manual proof testing was high and this activity often was not
performed. Without the frequent manual proof testing, the advantage of the redundancy was
lost.
[Figure: two units, each with a diagnostic output feeding a selector switch; diagram not recoverable from the extracted text]
The system would switch from A to B depending on diagnostic signals from the two CPU units
(typically from the watchdog timer circuits). The switch would select whichever unit indicated it
was good. This design could indeed provide good fault tolerance but depended on the
automatic diagnostics. If the diagnostics did not detect a failure, the switch would not select the
good unit. Reliability models show that if the diagnostics do not have an effectiveness in the
90% range, the overall availability of this design will not be better than a single unit [N3, Chapter
9].
3.1 Configuration 1
Configuration 1 shows some of the basic design concepts used to achieve safety integrity in the
RTP 3000. Input modules and the Node Processor are duplicated with diagnostic capability
provided by comparison diagnostics. Comparisons are made of the input scans, intermediate
results and calculated results. This comparison will detect an estimated 99% of failures that may
be potentially dangerous. Additional self-diagnostics are performed by the Chassis Processor
on the Node Processor, on itself, and on the Output modules. Overall the combination of comparison
diagnostics and automatic self-diagnostics provides an extremely high level of diagnostic
effectiveness (99+%).
3.2 Configuration 2
Configuration 2 (Figure 3) shows how additional redundancy is added to achieve both high
safety integrity and high availability. A third input module and a third Node Processor may be
added to achieve a 2oo3 architecture. Diagnostics are again provided by comparison
diagnostics of the input scans, intermediate results and calculated results. Common cause
defense is provided by separate modules.
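As an illustration of comparison-based diagnostics in a 2oo3 arrangement, the sketch below votes three channel results and reports which channel disagrees with the majority. It is a simplified assumption of how such a comparison could work, not the RTP 3000 implementation: it uses exact equality, whereas real analog inputs would need a tolerance band, and the function name `vote_2oo3` is hypothetical.

```python
def vote_2oo3(a, b, c):
    """Return (voted_value, suspects).

    voted_value is the value shared by at least two channels, or None
    when no two channels agree. suspects lists the indices of channels
    that disagree with the voted value.
    """
    values = (a, b, c)
    for candidate in values:
        if values.count(candidate) >= 2:
            suspects = [i for i, v in enumerate(values) if v != candidate]
            return candidate, suspects
    # No majority: every channel is suspect and no value can be trusted.
    return None, [0, 1, 2]

# One channel disagrees: the majority value is used and channel 2 is flagged.
print(vote_2oo3(True, True, False))
# All channels agree: no suspects.
print(vote_2oo3(False, False, False))
```

The same comparison serves both purposes described above: the majority value keeps the function available after a single channel failure, and the suspect list is the diagnostic result.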
For the Chassis Processor and the Output modules, a 2oo2D architecture is used. This
provides maximum availability, but the architecture is again highly dependent on exceptional
diagnostic coverage. The FMEDA of this design has verified that this coverage is achieved.
4 Markov Analysis
A detailed Markov analysis has been done on the RTP 3000 system, based on the analysis done for
the 2500 system [R2], to show quantitatively the result of the architectural design decisions. A
single safety instrumented function (SIF) is modeled with three analog input signals and two
digital output signals. This I/O count is typical of a simple SIL 3 SIF using three transmitters
and two final elements.
[Markov state diagram with 12 numbered states, including states labeled OK, Detected, AU, DU, FS, FD, and FDU; transition labels are not recoverable from the extracted text]
Figure 5: 3000 System 1oo2 Markov model
A normal 1oo2 Markov model has only six states. This model is significantly more complicated
because it accounts for diagnostic annunciation failures (AD, AU). The model also shows the effect of
the assumption that the end user does not automatically shut down on detected failures, as
stated in the assumptions.
The model solution shows clearly that states 5, 6, 7 and 10 have state probabilities several
orders of magnitude lower than other states. Therefore these states could be pruned without
any noticeable impact on the result. For the remainder of the Markov models developed, such
tertiary failure states will not be developed.
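[Figure: four-state Markov model (1oo1D). States: 1 OK, 2 AU, 3 FS, 4 FDU. Transition labels recovered from the diagram: μS, SD+SU, SD+SU+DD+AD, AU, DD+DU, DU.]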
From state 1, single failures are shown as transitions to other states. The system is successful
in this state and will respond to a demand. In state 2 an assumption is made that no self
diagnostic can be assumed to work. The system is still successful in state 2 and will respond to
a demand. The Fail Safe State is state 3 and transition probabilities to this state will be
considered for spurious trip calculations. The Fail Dangerous State is state 4. The probability of
being in this state will be considered in the PFDAVG calculation.
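A four-state model of this kind can be solved numerically. The sketch below is only an illustration under stated assumptions: the transition structure is one reading of the model described above (state 1 OK, state 2 diagnostics lost, state 3 fail safe, state 4 fail dangerous undetected), the failure rates are hypothetical placeholders and not the FMEDA values, and the 24 hour restart time is assumed. It steps the state probabilities forward in one-hour increments and averages the probability of state 4 over the mission time to obtain an example PFDAVG.

```python
HOURS_PER_YEAR = 8760

# Hypothetical failure rates per hour (NOT the FMEDA values).
SD, SU = 1e-6, 1e-7   # safe detected / safe undetected
DD, DU = 1e-6, 1e-8   # dangerous detected / dangerous undetected
AD, AU = 1e-7, 1e-8   # annunciation detected / annunciation undetected
MU_S = 1 / 24         # restart rate after a trip (assumed 24 h restart)

def build_rates():
    """Transition rates q[i][j]; states 1..4 are stored at indices 0..3."""
    q = [[0.0] * 4 for _ in range(4)]
    q[0][1] = AU                  # OK -> diagnostics lost
    q[0][2] = SD + SU + DD + AD   # OK -> fail safe (trip)
    q[0][3] = DU                  # OK -> fail dangerous undetected
    q[1][2] = SD + SU             # without diagnostics only safe failures trip
    q[1][3] = DD + DU             # without diagnostics all dangerous failures are hidden
    q[2][0] = MU_S                # restart after a trip
    return q

def pfd_avg(q, years, dt=1.0):
    """Return (time-averaged probability of state 4, final state probabilities)."""
    steps = int(years * HOURS_PER_YEAR / dt)
    p = [1.0, 0.0, 0.0, 0.0]      # start in the OK state
    total = 0.0
    for _ in range(steps):
        nxt = p[:]
        for i in range(4):
            for j in range(4):
                flow = p[i] * q[i][j] * dt   # probability moving from i to j
                nxt[i] -= flow
                nxt[j] += flow
        p = nxt
        total += p[3]
    return total / steps, p

avg, final = pfd_avg(build_rates(), years=5)
print("example PFDavg over 5 years:", avg)
print("final state probabilities:", final)
```

With these placeholder rates the result is dominated by the DU rate times half the mission time, which is the usual first-order approximation for an unrepaired dangerous undetected failure.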
[Figures: two Markov state diagrams whose layout is not recoverable from the extracted text. The first has states numbered up to 13 and ends in an FDU state. The second has nine states, including OK (1), FS (7), and FDU (9).]
Traditional models for a 2oo2D system contain only six of these states. This model is more
complex because it models the impact of diagnostic subsystem failures. As with the 1oo1D model,
diagnostic subsystems were classified into two groups: those that automatically initiate a trip
and those that do not. As in the 1oo1D model, the worst-case assumption is made that no
diagnostic can be assumed to work once there has been a single annunciation failure.
The table shows that high safety integrity is achieved with all configurations.
For SIL 3 applications, the PFDAVG value needs to be ≥ 10⁻⁴ and < 10⁻³. This means that for a
SIL 3 application, the PFDAVG for a 5 year mission time of Configuration 1 is equal to 26.5% of
the range. Similarly, for Configuration 2, the PFDAVG is equal to 35.9%.
For SIL 2 applications, the PFDAVG value needs to be ≥ 10⁻³ and < 10⁻². This means that for a
SIL 2 application, the PFDAVG for a 5 year mission time of Configuration 1 is equal to 2.65% of
the range. Similarly, for Configuration 2, the PFDAVG is equal to 3.59%.
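The percentages above can be converted back to PFDAVG values. The sketch below assumes that "percent of the range" is measured against the upper limit of the SIL band, which is the only reading under which the SIL 3 and SIL 2 percentages differ by exactly a factor of ten. The band limits are the low demand mode limits of IEC 61508; the function names are illustrative.

```python
# Low demand mode PFDavg bands as (lower limit, upper limit) per SIL.
SIL_BANDS = {
    1: (1e-2, 1e-1),
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),
}

def pfd_from_percent(percent, sil):
    """PFDavg implied by a percentage of the band's upper limit (assumed basis)."""
    _, upper = SIL_BANDS[sil]
    return percent / 100.0 * upper

def sil_band_for(pfd):
    """SIL band that contains the PFDavg value, or None if outside all bands."""
    for sil, (lower, upper) in SIL_BANDS.items():
        if lower <= pfd < upper:
            return sil
    return None

for label, pct_sil3, pct_sil2 in [("Configuration 1", 26.5, 2.65),
                                  ("Configuration 2", 35.9, 3.59)]:
    pfd = pfd_from_percent(pct_sil3, 3)
    print(label, "implied PFDavg:", pfd,
          "same value from SIL 2 figure:", pfd_from_percent(pct_sil2, 2),
          "band:", sil_band_for(pfd))
```

Under this assumption both sets of percentages imply the same PFDAVG per configuration, and both values fall in the SIL 3 band. The band lookup considers PFDAVG only; it does not address the other requirements for a SIL claim.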
These results must be considered in combination with PFDAVG values of other devices of a
Safety Instrumented Function (SIF) in order to determine suitability for a specific Safety Integrity
Level (SIL).
The Mean Time To Fail Spurious (MTTFS) is also derived from the Markov model calculations.
Table 4 shows the MTTFS results.
Table 4 MTTFS results
Type A component: “Non-Complex” component (using discrete elements); for details see 7.4.3.1.3 of IEC 61508-2
Type B component: “Complex” component (using micro controllers or programmable logic); for details see 7.4.3.1.3 of IEC 61508-2
9.1 Liability
exida performed the calculations based on methods advocated in applicable International
standards. Failure rates are obtained from a detailed Failure Modes, Effects and Diagnostics
Analysis. exida accepts no liability whatsoever for the use of these numbers or for the
correctness of the standards on which the general calculation methods are based.
9.2 Releases
Version: V1
Revision: R2
Version History: V1, R2: Edited per client review
V1, R1: Released to client
V0, R1: Draft; based on 2500 report.
Authors: William Goble - John C. Grebe
Review: V1, R1: RTP
V0, R1: Chris O’Brien, John Grebe
Release status: Released to client