

LESSONS LEARNED FROM APPLICATION OF SYSTEM AND SOFTWARE LEVEL RAMS ANALYSIS TO A SPACE CONTROL SYSTEM

Nuno Silva(1), Alexandre Esper(1)
(1) Critical Software, S.A., Parque Industrial de Taveiro, Lote 48, 3045-504 Coimbra, Portugal, {nsilva, aresper}@criticalsoftware.com

ABSTRACT

The work presented in this article represents the results of applying RAMS analysis to a critical space control system, at both system and software levels. The system level RAMS analysis allowed the assignment of criticalities to the high level components, which was further refined by a tailored software level RAMS analysis. The importance of the software level RAMS analysis in the identification of new failure modes, and its impact on the system level RAMS analysis, is discussed. Changes to the software architecture have also been recommended in order to reduce the criticality of the SW components to an acceptable minimum. The dependability analysis was performed in accordance with ECSS-Q-ST-80, which had to be tailored and complemented in some aspects. This tailoring is detailed in the article, and the lessons learned from applying it are shared, stating their importance to space systems safety evaluations. The paper presents the applied techniques, the relevant results obtained, the effort required for performing the tasks and the planned strategy for ROI estimation, as well as the soft skills required and acquired during these activities.

1. INTRODUCTION

This work describes the results of the chained system/software level RAMS analysis, composed of a Criticality Category Assignment (CCA) and a Software Criticality Analysis (SCA), performed on a space mission control and data system (MCDS). The MCDS is the core software (SW) subsystem of the Mission Operations Centre (MOC), a larger set of tools and applications that oversee the full mission status. The main objectives of the performed analysis have been:
To perform a software dependability analysis of the MCDS software products, in accordance with the requirements/recommendations of ECSS-Q-ST-80 [2] and using the results of the system level dependability analyses, in order to determine the criticality of the individual software components;
To provide feedback from the software dependability analysis to the system level dependability analysis, addressing in particular the additional failure modes identified at software design level;
To propose changes in the architecture of the MCDS products in order to reduce the criticality of the software components to a minimum.

The CCA and SCA are carried out by performing the appropriate RAMS analyses. The RAMS acronym stands for 'Reliability, Availability, Maintainability and Safety'. As such, a RAMS analysis must focus on the evaluation of these characteristics of a system, often referred to together as dependability and safety. The main RAMS techniques applied in this case study are FMEA/FMECA (Failure Modes, Effects (and Criticality) Analysis) and FTA (Fault Tree Analysis), applied at both system and software levels, together with the MCDS requirements and design documentation. The generic functional failure modes considered as a baseline for the FMECA analysis are usually:
1. Function fails to perform;
2. Function performs incorrectly;
3. Function performs prematurely;
4. Function performs belatedly;
5. Function does not fail safe;
6. Function blindly propagates wrong data.

Software has become an essential part of space systems, both in terms of criticality and impact in case of failure, and in terms of the flexibility it offers for corrections, changes and additional control in order to keep the system safe and running. These analyses therefore bring great added value and allow the software to be considered dependable and safe with much higher confidence. Critical Software (CSW) has been working with software RAMS analyses since 2003, applying them to safety critical space systems and combining the hardware oriented analyses with the software properties. CSW has applied RAMS analysis for the European Space Agency (ESA) and contributed to its description and inclusion in the ECSS standards. Several standards [8], [10], [11] and [13] and research works [9] and [12] focus on the importance and completeness of RAMS analysis. However, no study is known that brings together the importance and valuable outcomes of the RAMS analysis and the effort spent to achieve those results. This is why this study was started, and its preliminary results and lessons learnt are presented in this article.

This article starts by presenting an overview of the RAMS techniques used, briefly describes the case study and its complexity, provides an overview of the main results achieved, lists the lessons learned from the activity and concludes.

2. SHORT DESCRIPTION OF RAMS TECHNIQUES

RAMS techniques are usually documented and planned in a RAMS plan. For the European space industry, the ECSS standards provide a large amount of information about these techniques and how to apply them [2], [3], [4], [5] and [6]. Thus, for each system and each project the need for RAMS analysis must be studied and defined, and the appropriate techniques can be selected.

The approach used for this work includes a system level CCA followed by a detailed SCA that follows the requirements specified in section 6.2.2 of ECSS-Q-ST-80 [2], which were discussed, tailored and agreed upon with the customer. Past experience in criticality analysis was also considered, and the SCA was performed based on the system level CCA plus a combination of SW FMECA and SW FTA. These techniques are generally recognized in industry as being the most appropriate.

Within the context of the SCA, a software product is composed of one or more software components. Each SW component provides specific functionalities and interfaces with other SW components. One of the SCA outputs is the criticality level of each SW component.

The SCA was performed simulating the typical situation where the analysis at SW component level is performed by a supplier who is in charge of the development of a SW product, and not of the entire system. This implies that:
The supplier does not have complete visibility of the system features, but only of the SW product under analysis;
The end effects of software failures which propagate outside the SW product can only be evaluated at system level.

The SW products that compose the MCDS system under study had already been analysed at system level with a Failure Mode and Effect Analysis (FMEA) and an FTA, followed by a CCA (which aggregates the results of the FMEA and FTA).

The tailored approach agreed with the customer is presented in Figure 1 and is described in the following sections.

[Figure 1 blocks: System Level RAMS analysis; Software Criticality Analysis, comprising: Propose which SW Products will undergo SCA; Perform a Functional Analysis at SW Component Level; SFMECA; SFTA; Interaction with System Level Analysis; Recommendations for SW Component Criticality Reduction]

Figure 1. Software Criticality Analysis Approach

2.1 Propose which SW Products will undergo Software Criticality Analysis

The first step of the SCA is to decide which SW products will undergo further detailed analysis, taking into account their CCA and the most critical failure modes provided by the system level analysis. SW products with criticality A to C have been selected (although no product with criticality A was found). The criticality assignment is based on [2] and consists of:

Table 1. Software Criticality Categories

Category  Definition
A         Catastrophic: Loss of life, permanent environmental damage
B         Critical: Mission loss
C         Major: Major mission degradation
D         Minor: Minor mission degradation (negligible consequences)
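
The selection rule of this step can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in the study: the product names, the `Criticality` enumeration and the `select_for_sca` helper are all hypothetical, and only the categories and the "A to C" threshold come from the text above.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Categories of Table 1; a lower value means a more severe consequence."""
    A = 1  # Catastrophic
    B = 2  # Critical
    C = 3  # Major
    D = 4  # Minor


def select_for_sca(cca_results, threshold=Criticality.C):
    """Return the SW products whose system level category is `threshold` or more severe."""
    return sorted(name for name, cat in cca_results.items() if cat <= threshold)


# Hypothetical products and categories, for illustration only.
cca_results = {
    "product_x": Criticality.B,
    "product_y": Criticality.C,
    "product_z": Criticality.D,
}

selected = select_for_sca(cca_results)
print(selected)  # ['product_x', 'product_y']
```

Making the threshold a parameter keeps the sketch usable when a project agrees a different cut-off with its customer.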

2.2 Perform a Functional Analysis at SW Component Level

The main objective of the functional analysis within the context of the SCA is to provide a good level of understanding of the SW components that compose each SW product, their respective functionality and their interfaces. Each SW product is decomposed into SW components down to the level of SW design decomposition at which it still makes sense to assign different criticality categories to different components. In general, it makes no sense to assign different criticality categories to two design components if failure propagation between them cannot be avoided. For each SW product, the outputs of the functional analysis consist of:
A table listing all SW components and their respective functionality;
For the complex SW products, a diagram containing the SW components and their respective interactions in terms of data and control flow.

2.3 SW FMECA

The SW FMECA is applied to all SW components that are part of each SW product selected for further analysis. The main objectives of the SW FMECA are to identify the potential failure modes of the SW components analysed, in particular the ones that have not been identified at system level, and to determine which SW components contribute to the criticality of the SW product and which do not. The SW FMECA is applied according to ECSS-Q-ST-30-02C [5], and is also based on [10].

The system level FMEA (CCA) information, such as the failure causes and local effects, has been used as input for the failure modes analysed in the SW FMECA. The only compensating provisions to be included in the SW FMECA are those that can be incorporated into the SW product(s) being analysed. System-level compensating provisions, such as HW protections or human intervention, should not be mentioned here. Special attention is given to the SW component failure modes related to the interfaces between SW components and SW products, in order to highlight all possible types of failure propagation.
The following are common software failure causes used in the SW FMECA:
Exhaustion of common system resources (memory or file descriptors);
Wrong algorithm;
Incorrect or corrupted configuration data;
Deadlocks and infinite loops;
Wrong memory access.
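
A SW FMECA worksheet row can be represented as a small record, as in the sketch below. The field names, the component names and the roll-up rule (a component takes the most severe category among its failure modes, after crediting SW-internal compensating provisions) are assumptions made for illustration and are not prescribed by the text or by ECSS-Q-ST-30-02C.

```python
from dataclasses import dataclass
from typing import Optional

# Categories of Table 1, ordered from most to least severe.
SEVERITY_ORDER = "ABCD"


@dataclass
class FailureModeEntry:
    component: str
    failure_mode: str
    cause: str
    severity: str                                  # end-effect severity with no provision
    compensating_provision: Optional[str] = None   # SW-internal provisions only
    mitigated_severity: Optional[str] = None       # severity once the provision is credited

    def effective_severity(self) -> str:
        """Severity used for the roll-up (assumed rule, see lead-in)."""
        if self.compensating_provision and self.mitigated_severity:
            return self.mitigated_severity
        return self.severity


def component_criticality(entries):
    """Assign each component the most severe effective severity of its failure modes."""
    worst = {}
    for entry in entries:
        sev = entry.effective_severity()
        current = worst.get(entry.component)
        if current is None or SEVERITY_ORDER.index(sev) < SEVERITY_ORDER.index(current):
            worst[entry.component] = sev
    return worst


# Hypothetical worksheet rows; components and severities are invented.
worksheet = [
    FailureModeEntry("tc_encoder", "Function performs incorrectly", "Wrong algorithm", "B"),
    FailureModeEntry("tc_encoder", "Function fails to perform",
                     "Deadlocks and infinite loops", "C"),
    FailureModeEntry("file_sender", "Function blindly propagates wrong data",
                     "Incorrect or corrupted configuration data", "B",
                     compensating_provision="checksum verified before use",
                     mitigated_severity="D"),
]

criticalities = component_criticality(worksheet)
print(criticalities)  # {'tc_encoder': 'B', 'file_sender': 'D'}
```

Keeping the unmitigated severity alongside the mitigated one preserves the justification for any downgrade, which matters when the result is fed back to the system level analysis.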

2.4 SW FTA

The SW FTA is developed according to the FTA adoption notice of IEC 61025 [11] and ECSS-Q-ST-40-12C [6]. The SW FTA is developed with the following objectives:
To rapidly identify the SW components that contribute to the most critical feared events (undesired events that contribute to a failure) identified at system level;
To provide evidence that effective barriers are in place and correctly implemented by SW components, e.g. to find gaps in system monitoring.

Since this analysis does not identify possible new failure modes of SW components, it is applied to the SW products of highest criticality, B, and selectively, according to engineering judgment, to SW products of lower criticality, C and D. The system level FMEA (CCA) provides the inputs for the feared events to be explored in the SW FTA: critical failure modes from the FMEA (CCA) are used as feared events, which are decomposed in order to identify the SW components contributing to them.

2.5 Provision of Recommendations for SW Components Criticality Reduction

Based on the results of the SW FMECA and SW FTA, the SCA recommends changes to the architecture of the MCDS SW products in order to reduce the criticality of the SW components to a minimum. The aim is for the architecture to be tolerant to failure propagation between SW products and SW components, either by creating barriers to propagation or by enabling the necessary monitoring strategy and acting upon failure detection. It is worthwhile to consider reducing the SW criticality for two reasons:
1. To prevent the failure propagation;
2. To potentially reduce the cost of development, as a higher SW criticality usually leads to higher development costs.

While the first point is an obvious motivation to reduce criticality, the second point is related to the differences between reducing the criticality from B to C and from C to D.

ECSS-E-ST-40C [7] defines the principles and requirements applicable to space software engineering. These requirements and objectives can be tailored according to several drivers, such as dependability and safety aspects, software development constraints, product quality objectives and business objectives. When comparing reductions from B to C with reductions from C to D, the reduction in activities and outputs is similar: 15 and 16 respectively. Considering that MCDS will involve several million lines of code when fully deployed, any possible SW criticality reduction will have an effort impact and should be considered.

2.6 Methods for SW Components Criticality Reduction

The industrial approaches considered to reduce the SW components criticality are:
Safety Monitoring: protects against specific failure conditions by directly monitoring a function which would contribute to the failure condition, see section 2.3 of DO-178B [8];
Partitioning: provides isolation between independent SW components to restrain and/or isolate faults, see section 2.3 of DO-178B [8].

2.7 Interaction with System Level Analysis

There is an interaction between the SCA and the system level analysis when performing the SW FMECA since, when failure modes propagate outside a SW product, the end effects can be analysed only at system level (SW engineers have limited visibility). There is also the possibility of finding new failure modes not detected at system level, which could change the final criticality of a SW product. This would imply that the system level CCA has to be updated.

3. CASE STUDY: MISSION DATA AND CONTROL SYSTEM

This section presents the context for the Software Criticality Analysis, including a short MCDS mission overview as well as the relationship between this analysis and the upper level dependability analysis.

3.1 Space Mission Overview

The space mission under study is intended to provide extremely accurate scientific data (for 3-D imaging). This requires position and velocity measurements of a substantial fraction of stars, combined with accurate photometric measurements in wide and narrow wavelength bands. Ground stations will collect about 100 Tb of science data, and 5 years of ground processing of the data will be needed.

3.2 Space Segment

The space segment consists of the satellite, which operates on a Lissajous halo orbit around the L2 co-linear libration point of the Sun/Earth-Moon system. Such orbits provide high observation efficiency, a very stable thermal environment and a low radiation environment. The drawbacks are the large communication distance and the need for regular orbit maintenance manoeuvres.

3.3 Ground System

The Cebreros station, complemented by the Kourou station, is used for spacecraft command and control during the Launch and Early Orbit Phase. The Cebreros station is also used during cruise and nominal operations on a data downlink demand basis, complemented by the New Norcia station during galactic plane scans, when the data downlink demand exceeds the Cebreros availability.

3.4 Mission Control and Data System

The control and data system (MCDS) provides the following main functionalities:
Receiving and processing telemetry from the spacecraft;
Preparation, uplink and verification of commands;
File transfer between various ground segment entities;
Data archiving and distribution;
Mission planning;
Operations automation of spacecraft control activities.

It is logically decomposed into several subsystems which together provide the functionality to monitor and control the spacecraft.

3.5 MCDS Software Products and Critical Functions

The system level Criticality Category Assignment (CCA) associates functionalities with a specific SW product and each SW product with its respective criticality. Table 2 presents the results from the system level CCA for each SW product.

Table 2. System Level Criticality Category Assignment Results

SW Product  Description                                                        Criticality
C01         Delivery and verification of telecommands                          B
C02         Maintenance of the mission's on-board software                     B
C03         Transfer of files                                                  D (B, note 1)
C04         Archive the spacecraft TM and TC data                              C
C05         Generate and maintain the run-time database                        C
C06         Interface with the ground station equipment                        C
C07         Processing and display of the spacecraft telemetry                 B (C, note 2)
C08         Controls and monitors applications and processes                   D
C09         Mission data distribution                                          D
C10         Status monitoring, SW releases distribution, configurations files  D
C11         Resources planning and schedule generation                         D
C12         Statistical analysis of TM parameters                              D
C13         Storage of science data                                            D

Note 1: SW level analysis allowed reduction from B to D (SW detection methods).
Note 2: New SW critical failure mode imposed increase from C to B.

4. RESULTS OF RAMS ANALYSIS

The system level RAMS analyses lead to the criticality assignment presented in Table 2. The system level activities lead to a high level classification of the SW products and to a preliminary criticality assignment that is useful to proceed with more detailed RAMS analysis and to act on the most critical products. The following table presents the numerical results from the RAMS analysis.

Table 3. Numerical Results of System and Software Level RAMS Analysis

                                                Severity
Level               Activity   Funcs  FEs  FMs  B    C    D    Other
System Level (CCA)  FA         106    -    -    -    -    -    -
                    FTA        -      25   -    -    -    -    -
                    FMEA       -      -    86   3    41   42   0
                    CCA        -      -    -    -    -    -    -
SW Level (SCA)      SW FTA     -      11   -    -    -    -    -
                    SW FMECA   -      -    301  19   198  81   3
                    SCA        -      -    -    -    -    -    -

Funcs: Functions; FEs: Feared Events; FMs: Failure Modes.

The effort required to perform the system and SW level activities has also been collected and can be seen in Table 4.

Table 4. Effort spent for System and Software Level RAMS Analysis

Level               Activity   Effort (hours)  %    Weighted Effort (hours)
Management          -          280             10%  280
System Level (CCA)  FA         177             7%   966
                    FTA        290             11%
                    FMEA       393             14%
                    CCA        348             13%
SW Level (SCA)      SW FTA     74              3%   1481
                    SW FMECA   936             34%
                    SCA        230             8%

The 377 system level requirements lead to the identification of 106 functions in the Functional Analysis. 25 feared events (FE) have been mapped in fault trees. During the system level RAMS analysis 11 SW products were considered; however, during the SW level RAMS analysis the system under study had to be extended to 15 SW products due to effects propagation. 76 SW components have been identified for this study. 5 HW products have also been taken into account.

Table 3 shows that the number of failure modes is considerably higher for the SW FMECA (301). This is normal, since the information and documentation available for SW artefacts is more concrete and much more detailed than the system level information, which is sometimes generic. The number of feared events considered for the FTA has been reduced for the SW level as a project decision.

Weighted Effort: assuming that much of the effort spent on the system level activities is then reused in the form of knowledge, failure modes, feared events, etc., 20% of the system level effort has been transferred to the software level. On average, each function takes 1.7 hours to analyse, the system level FMEA requires 4.6 hours per failure mode and the software level FMECA requires 3.1 hours per failure mode. This work has been performed by an independent team with no connection to the products under analysis, so there is an additional learning effort included in these numbers.

5. LESSONS LEARNED

Going through the two types of RAMS analysis with such a large system was a real challenge and provided many lessons learned.

The first lesson learned is related to starting from a system level analysis and working down to the software level, i.e., starting from the FMEA and CCA, down to the SCA. The SW level analysis allowed the identification of several new failure modes. It also allowed the refinement of the criticalities found at system level, through the identification of compensating provisions not visible at system level.

The second lesson learned is related to the impact of stopping the analysis at CCA level and not continuing to the SW level. The criticalities assigned at system level are a good approximation of the final criticality, but important safety barriers implemented at SW level are not visible. Moreover, failure modes overlooked at system level, which the software level analysis would reveal, remain unidentified.

The overall process adopted also provided several lessons learned. The first is the need to define a CCA methodology to address software criticality downgrading in the presence of failure compensating provisions, since this procedure is not specified in ECSS-Q-ST-30C [3] or ECSS-Q-ST-30-02C [5]. The methodology used was based on textual notation to justify the reduction of the criticalities where applicable. This lesson learned provided an important contribution to initiating the process of improving the ECSS standards (namely Q-30, Q-40 and Q-80) to close this gap.

Another process lesson learned refers to a major shortcoming that had to be overcome: no whole-system-level dependability analysis. This means that there was no mission-level dependability analysis addressing the consequences of failures on the whole system. If available, this analysis would allow identifying what is catastrophic, critical, major or negligible at satellite level and/or mission level, considering any possible source of failures, including the ground segment failures. It would also compensate for the limited visibility that the ground segment system engineering team might have in determining the impact on the spacecraft/mission of certain failures, and therefore in classifying feared events and failure consequence severity. As a result, the missing mission-level dependability analysis contributed to a more conservative analysis in the cases where visibility was not sufficient to assert the severity of certain failure mode effects. It is also important to note that the mission system engineering team was involved in the definition of the top level feared events, which were a major contribution to the structure of the system-level analysis and indirectly to the software-level analysis.

The third process lesson learned encompasses the system functional analysis (FA) and the SW components functional analysis. The effort required by the functional analysis and the elaboration of interface diagrams is commonly underestimated. It should always be included as a separate activity from the planning phase onwards. FA should be the responsibility of the design team.

Another group of lessons learned refers to the evaluation of the FTA/SW FTA in the context of the CCA/SCA:
Fault tree analysis provides good system knowledge early in the process, but it is a very time consuming activity;
FTA supports the identification of single-point failures and safety barriers, but it does not assign criticalities;
If applied at the beginning of the process, it requires the analysis of large amounts of documentation to build a single tree;
If applied after the FMEA/SW FMECA, the construction of fault trees becomes a much more straightforward task;
In future dependability analyses, it is recommended to apply the FTA/SW FTA after the respective FMEA/SW FMECA and selectively, to specific areas requiring more detailed/focused analysis.
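
How an FTA exposes single-point failures and safety barriers can be illustrated with a small minimal-cut-set computation. The sketch below is a generic textbook-style evaluation under stated assumptions, not the tool or trees used in this study: the tree encoding (nested tuples with "AND"/"OR" gates), the helper names and the example events are all hypothetical.

```python
from itertools import product


def minimise(sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(node):
    """Return the minimal cut sets of a fault tree node.

    A node is either a basic event (str) or a tuple (gate, children)
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set is enough to trigger the parent event.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimise(combined)


# Hypothetical feared event: the AND gate models a monitoring barrier.
tree = ("OR", [
    "encoder produces wrong command",
    ("AND", ["archive write fails", "archive monitor misses the failure"]),
])

minimal = cut_sets(tree)
single_points = sorted(next(iter(s)) for s in minimal if len(s) == 1)
print(single_points)  # ['encoder produces wrong command']
```

Cut sets of size one are the single-point failures, while a cut set of size two or more indicates that a barrier has to fail as well. As noted in the list above, the computation assigns no criticality by itself.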
Finally, an additional set of lessons learned can be highlighted:
Scope definition: where to start and where to stop with the RAMS analysis;
Customer interpretation and importance of criticality assignments;
Maturity of the tools supporting RAMS analysis;
Effort estimation of RAMS: about 4 hours per failure mode;
Learning from RAMS techniques tailoring activities and results;
RAMS engineers' team work versus individual work;
Value of SW RAMS analysis (as a complement to system level analysis).

In terms of ROI, RAMS analysis will not by itself reduce effort and cost, but it will point out areas and solutions for reductions. In this particular case it could point out reductions in the order of millions of Euros, due to the size of the code involved.
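
The effort figures quoted above can be rechecked from Tables 3 and 4 with the short calculation below. The reading of the weighted effort (20% of the system level hours moved to the software level) is an interpretation that reproduces the reported values to within one hour; the variable names are illustrative.

```python
# Effort in hours, as reported in Table 4.
management = 280
system_level = {"FA": 177, "FTA": 290, "FMEA": 393, "CCA": 348}
sw_level = {"SW FTA": 74, "SW FMECA": 936, "SCA": 230}

system_total = sum(system_level.values())
sw_total = sum(sw_level.values())
total = management + system_total + sw_total

# Assumed weighting rule: 20% of the system level effort is credited
# to the software level, which reuses its knowledge and results.
shared = 0.20 * system_total
weighted_system = system_total - shared
weighted_sw = sw_total + shared

# Counts taken from Table 3.
functions, system_fms, sw_fms = 106, 86, 301
hours_per_function = system_level["FA"] / functions
hours_per_system_fm = system_level["FMEA"] / system_fms
hours_per_sw_fm = sw_level["SW FMECA"] / sw_fms

print(f"total effort: {total} h")
print(f"weighted system level: {weighted_system:.1f} h, weighted SW level: {weighted_sw:.1f} h")
print(f"per function: {hours_per_function:.1f} h")
print(f"per system level failure mode: {hours_per_system_fm:.1f} h")
print(f"per SW level failure mode: {hours_per_sw_fm:.1f} h")
```

The same per-unit rates can serve as a first-order estimate for a new analysis once the number of functions and expected failure modes is known, bearing in mind that they include the learning effort of an independent team.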

6. CONCLUSIONS

System and SW level RAMS analyses have been shown to be both important and complementary. A large effort was spent, but results will only be visible if criticality reduction techniques and actions are put in place. The more reductions are achieved, the less costly and the more dependable the system will be. We can measure the dependability cost of a system by the absence of critical areas that can lead to severe failure impacts.

A ROI estimation is under development. It will take into account the effort required for the CCA and SCA, the number of level B and C components (these are serious candidates for reduction), and an estimation of cost per line of code. In this case we are talking about a few million lines of code, thus a reduction in criticality might help save a large amount of time and money, with a known investment.

The skills acquired and exploited during these activities are a valuable asset and should be used by both customer and RAMS experts to make the system less complex, more reusable and easily maintainable.

The reuse of the dependability analysis performed is possible for mission control systems built on top of a similar common infrastructure of services (HW/SW). Mission control systems are typically instantiated for each mission, but the majority of functionalities and their respective configuration are maintained or improved between missions. The reuse is nevertheless constrained by the consequences of mission specific end effects.

Finally, some additional effort is usually required if these activities are performed by an independent team, and one might question whether the fact that we have found more failure modes at SW level is due to the available information or to the acquired knowledge about the system.

7. ACKNOWLEDGMENT

This work has been partially supported by the project CRITICAL Software Technology for an Evolutionary Partnership (CRITICAL-STEP, http://www.criticalstep.eu), Marie Curie Industry-Academia Partnerships and Pathways (IAPP) number 230672, within the context of the EU Seventh Framework Programme (FP7).

8. REFERENCES

1. Critical Step website, http://www.critical-step.eu/
2. ECSS-Q-ST-80, Space Product Assurance - Software Product Assurance, 06/03/2009
3. ECSS-Q-ST-30C, Space product assurance - Dependability, 06/03/2009
4. ECSS-Q-ST-40C, Space product assurance - Safety, 06/03/2009
5. ECSS-Q-ST-30-02C, Space product assurance - Failure modes, effects (and criticality) analysis (FMEA/FMECA), 06/03/2009
6. ECSS-Q-ST-40-12C, Space product assurance - Fault tree analysis - Adoption notice ECSS/IEC 61025, 06/03/2009
7. ECSS-E-ST-40C, Space engineering - Software, 06/03/2009
8. Radio Technical Commission for Aeronautics, Inc., RTCA DO-178B, Software Considerations in Airborne Systems and Equipment Certification, Washington, D.C.: RTCA, 01/12/1992, www.rtca.org
9. Patricia Rodriguez-Dapena, Software Safety Certification: A Multidomain Problem, IEEE SOFTWARE, July-Aug. 1999
10. IEC 60812 FMEA, IEC 60812:2006, www.iec.ch
11. IEC 61025 fault tree analysis (FTA), IEC 61025:2008, www.iec.ch
12. TN3 / TN4 Description of methods and techniques for software development, verification and validation, v1.0, 17/02/2003, CARES project, ESA
13. CENELEC EN 50128, "Software for railway control and protection systems"
