Safety Engineering
Safety engineering is an applied science strongly related to systems engineering and its subset, system safety engineering. Safety engineering assures that a life-critical system behaves as needed even when components fail.
In practice, the term "safety engineering" refers to any act of accident prevention by a person qualified in the field. Safety engineering is often reactive, responding to adverse events (also described as "incidents") as reflected in accident statistics. This arises largely because of the complexity and difficulty of collecting and analysing data on "near misses".
Increasingly, safety reviews are recognised as an important risk management tool. Failure to identify risks to safety, and the corresponding inability to address or "control" these risks, can result in massive costs, both human and economic. The multidisciplinary nature of safety engineering means that a very broad array of professionals is actively involved in accident prevention or safety engineering.
The majority of those practicing safety engineering are employed in industry to keep workers safe on a day-to-day basis. See the American Society of Safety Engineers publication Scope and Function of the Safety Profession.
Safety engineers distinguish different extents of defective operation: A "failure" is "the
inability of a system or component to perform its required functions within specified
performance requirements", while a "fault" is "a defect in a device or component, for
example: a short circuit or a broken wire".[1] System-level failures are caused by lower-level faults, which are ultimately caused by basic component faults. (Some texts reverse
or confuse these two terms. See NUREG-0492 page V-1.) The unexpected failure of a
device that was operating within its design limits is a "primary failure", while the
expected failure of a component stressed beyond its design limits is a "secondary failure".
A device which appears to malfunction because it has responded as designed to a bad
input is suffering from a "command fault".[2] A "critical" fault endangers one or a few
people. A "catastrophic" fault endangers, harms or kills a significant number of people.
Safety engineers also identify different modes of safe operation: A "probabilistically safe"
system has no single point of failure, and enough redundant sensors, computers and
effectors so that it is very unlikely to cause harm (usually "very unlikely" means, on
average, less than one human life lost in a billion hours of operation). An inherently safe system is a clever mechanical arrangement that cannot be made to cause harm; this is obviously the best arrangement, but it is not always possible. A fail-safe system is one
that cannot cause harm when it fails. A "fault-tolerant" system can continue to operate
with faults, though its operation may be degraded in some fashion.
These terms combine to describe the safety needed by systems: for example, most biomedical equipment is only "critical", and often another identical piece of equipment is nearby, so the failure of a single unit rarely endangers the patient.
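A rough numeric illustration of "probabilistically safe" may help here: the sketch below uses an assumed per-channel failure probability and assumes channel failures are independent, and shows how redundancy pushes the combined failure probability toward the one-in-a-billion-hours range.

    # Illustrative only: assumed per-channel failure probability; failures assumed independent.
    def all_channels_fail(per_hour_failure_prob: float, channels: int) -> float:
        """Probability that every redundant channel fails in the same hour."""
        return per_hour_failure_prob ** channels

    single_channel = 1e-4   # assumed failure probability per channel per hour
    for n in (1, 2, 3):
        print(f"{n} channel(s): {all_channels_fail(single_channel, n):.0e} per hour")
    # 1 channel: 1e-04, 2 channels: 1e-08, 3 channels: 1e-12 (below one per billion hours)

In practice, common-cause failures make the real improvement smaller than this independent-failure arithmetic suggests, which is one reason redundant channels are often made dissimilar.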
The process
Ideally, safety-engineers take an early design of a system, analyze it to find what faults
can occur, and then propose safety requirements in design specifications up front and
changes to existing systems to make the system safer. In an early design stage, often a
fail-safe system can be made acceptably safe with a few sensors and some software to
read them. Probabilistic fault-tolerant systems can often be made by using more, but
smaller and less-expensive pieces of equipment.
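As a hedged sketch of "a few sensors and some software to read them", the hypothetical monitor below forces a shutdown (the fail-safe state) whenever a reading is missing or out of range; the sensor names, limits, and shutdown hook are invented for illustration.

    # Hypothetical fail-safe monitor: any bad or missing reading forces the safe state.
    SAFE_LIMITS = {"temperature_c": (0.0, 90.0), "pressure_kpa": (50.0, 400.0)}  # assumed limits

    def readings_ok(readings: dict) -> bool:
        for name, (low, high) in SAFE_LIMITS.items():
            value = readings.get(name)
            if value is None or not (low <= value <= high):
                return False          # missing or out-of-range reading -> not safe
        return True

    def control_step(readings: dict, shutdown) -> None:
        """Run one control cycle; fail safe on any doubtful input."""
        if not readings_ok(readings):
            shutdown()                # fail-safe action: de-energise / stop the process

    control_step({"temperature_c": 120.0, "pressure_kpa": 200.0}, shutdown=lambda: print("SHUTDOWN"))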
Far too often, rather than actually influencing the design, safety engineers are assigned to prove that an existing, completed design is safe. If a safety engineer then discovers significant safety problems late in the design process, correcting them can be very expensive; this type of error can waste large sums of money.
The exception to this conventional approach is the way some large government agencies
approach safety engineering from a more proactive and proven process perspective. This
is known as System Safety. The System Safety philosophy, supported by the System
Safety Society, is to be applied to complex and critical systems, such as commercial
airliners, military aircraft, munitions and complex weapon systems, spacecraft and space
systems, rail and transportation systems, air traffic control systems, and other complex and safety-critical industrial systems. Proven System Safety methods and techniques prevent, eliminate and control hazards and risks through design influence, by a collaboration of key engineering disciplines and product teams. Software safety is a fast-growing field, since the functionality of modern systems is increasingly placed under the control of software. The whole concept of system safety and software safety, as a subset of systems engineering, is to influence safety-critical system designs by conducting several types of hazard analyses to identify risks and to specify design safety features and procedures that strategically mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation can go beyond design recommendations, particularly in
the area of maintenance. There is an entire realm of safety and reliability engineering known as Reliability Centered Maintenance (RCM), a discipline that results directly from analyzing potential failures within a system and determining maintenance actions that can mitigate the risk of failure. This methodology is used extensively on
aircraft and involves understanding the failure modes of the serviceable replaceable
assemblies, in addition to the means to detect or predict an impending failure. Every automobile owner is familiar with this concept when they take their car in to have the oil changed or brakes checked. Even filling up one's car with gas is a simple example of a
failure mode (failure due to fuel starvation), a means of detection (fuel gauge), and a
maintenance action (fill 'er up!).
For large scale complex systems, hundreds if not thousands of maintenance actions can
result from the failure analysis. These maintenance actions are based on conditions (e.g.,
gauge reading or leaky valve), hard conditions (e.g., a component is known to fail after
100 hrs of operation with 95% certainty), or require inspection to determine the
maintenance action (e.g., metal fatigue). The Reliability Centered Maintenance concept
then analyzes each individual maintenance item for its risk contribution to safety,
mission, operational readiness, or cost to repair if a failure does occur. The sum total of all the maintenance actions is then bundled into maintenance intervals so that maintenance occurs not around the clock, but rather at regular intervals. This bundling process
introduces further complexity, as it might stretch some maintenance cycles, thereby
increasing risk, but reduce others, thereby potentially reducing risk, with the end result
being a comprehensive maintenance schedule, purpose-built to reduce operational risk
and ensure acceptable levels of operational readiness and availability.
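A minimal sketch of the bundling step described above (task names, derived intervals, and standard visit intervals are all assumed for illustration): each maintenance action has its own analytically derived interval, and tasks are assigned to the nearest standard visit, so some cycles stretch slightly while others shrink.

    # Illustrative bundling of maintenance actions into standard intervals (hours).
    # Task names and derived intervals are hypothetical.
    tasks = {"inspect fuel pump": 480, "replace filter": 1100, "check fatigue-prone fitting": 2600}
    standard_intervals = [500, 1000, 2000]   # assumed standard maintenance visits

    def bundle(task_interval: int) -> int:
        """Assign a task to the nearest standard visit; some cycles stretch, others shrink."""
        return min(standard_intervals, key=lambda i: abs(i - task_interval))

    for name, derived in tasks.items():
        print(f"{name}: derived {derived} h -> bundled at {bundle(derived)} h")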
Analysis techniques
The two most common fault modeling techniques are called "failure modes and effects
analysis" and "fault tree analysis". These techniques are just ways of finding problems
and of making plans to cope with failures, as in Probabilistic Risk Assessment (PRA or
PSA). One of the earliest complete studies using PRA techniques on a commercial
nuclear plant was the Reactor Safety Study (RSS), edited by Prof. Norman Rasmussen[3]
(see WASH-1400). Rather than incurring the expense of testing, computer programs can calculate failure probabilities from fault trees.
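As a rough sketch of what such a program does, the fragment below evaluates a tiny fault tree: an OR gate combines independent basic-event probabilities as 1 - (1-p1)(1-p2)..., and an AND gate multiplies them. The event names and probabilities are invented for illustration, and independence of the basic events is assumed.

    from math import prod

    # Illustrative fault-tree evaluation; independence of basic events is assumed.
    def or_gate(probs):    # top event occurs if any input event occurs
        return 1 - prod(1 - p for p in probs)

    def and_gate(probs):   # top event occurs only if all input events occur
        return prod(probs)

    # Hypothetical tree: loss of cooling = pump fails OR (both redundant valves stick)
    pump_fails   = 1e-4
    valve_sticks = 1e-3
    loss_of_cooling = or_gate([pump_fails, and_gate([valve_sticks, valve_sticks])])
    print(f"P(loss of cooling) ~ {loss_of_cooling:.2e} per demand")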
Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one life per 10⁹ hours of continuous operation is lost to failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level. The trade-off between cost and loss of life has been considered appropriate at this level (by the FAA for aircraft, under the Federal Aviation Regulations).
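To see what the threshold implies in practice, a back-of-the-envelope check (the fleet size and utilisation below are assumed figures, not FAA data) multiplies the per-hour loss rate by total fleet hours.

    # Assumed fleet figures, purely illustrative of the one-per-1e9-hours criterion.
    fleet_aircraft  = 5000
    hours_per_year  = 3000
    loss_rate       = 1e-9                          # lives lost per operating hour
    expected_losses = fleet_aircraft * hours_per_year * loss_rate
    print(f"Expected losses per year across the fleet: {expected_losses:.3f}")   # ~0.015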
Preventing failure
Probabilistic fault tolerance: adding redundancy to equipment and
systems
(Figure: a NASA graph relating the survival of a crew of astronauts to the amount of redundant equipment in their spacecraft, the "MM" or Mission Module.)
Once a failure mode is identified, it can usually be prevented entirely by adding extra
equipment to the system. For example, nuclear reactors emit dangerous radiation and contain toxic materials, and nuclear reactions can generate so much heat that no material could contain them. Therefore reactors have emergency core cooling systems to keep the temperature down, shielding to contain the radiation, and engineered barriers (usually several, nested, surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: multiple organs,
multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
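One common form of such redundancy is majority voting over replicated sensors. The sketch below (sensor values invented for illustration) takes the median of three readings so that any single faulty channel is outvoted rather than propagated.

    from statistics import median

    # Illustrative 2-out-of-3 voting: one faulty channel cannot corrupt the output.
    def voted_reading(a: float, b: float, c: float) -> float:
        return median([a, b, c])

    print(voted_reading(101.2, 100.9, 250.0))   # faulty third channel is outvoted -> 101.2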
Containing failure
It is also common practice to plan for the failure of safety systems through containment
and isolation methods. The use of isolating valves, also known as a block and bleed manifold, is very common for isolating pumps, tanks, and control valves that may fail or
need routine maintenance. In addition, nearly all tanks containing oil or other hazardous
chemicals are required to have containment barriers set up around them to contain 100%
of the volume of the tank in the event of a catastrophic tank failure. Similarly, long
pipelines have remote-closing valves periodically installed in the line so that in the event
of failure, the entire pipeline is not lost. The goal of all such containment systems is to
provide means of limiting the damage done by a failure to a small localized area.
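A small worked example of the 100%-containment rule (the tank and berm dimensions are assumed, and the tank's own footprint is ignored for simplicity): the berm wall must be tall enough that the berm holds the full tank volume.

    import math

    # Assumed dimensions, illustrating sizing a containment berm for 100% of tank volume.
    tank_diameter_m, tank_height_m = 10.0, 12.0
    berm_length_m, berm_width_m    = 20.0, 20.0

    tank_volume = math.pi * (tank_diameter_m / 2) ** 2 * tank_height_m   # ~942 m^3
    required_wall_height = tank_volume / (berm_length_m * berm_width_m)  # ~2.4 m
    print(f"Tank volume ~ {tank_volume:.0f} m^3, berm wall must be >= {required_wall_height:.1f} m")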