Safety Engineering

Safety engineering is an applied science strongly related to systems engineering and the
subset System Safety Engineering. Safety engineering assures that a life-critical system
behaves as needed even when pieces fail.
In the real world the term "safety engineering" refers to any act of accident prevention by
a person qualified in the field. Safety engineering is often reactive to adverse events,
also described as "incidents", as reflected in accident statistics. This arises largely
because of the complexity and difficulty of collecting and analysing data on "near
misses".
Increasingly, safety reviews are being recognised as an important risk
management tool. Failure to identify risks to safety, and the consequent inability to address
or "control" these risks, can result in massive costs, both human and economic. The
multidisciplinary nature of safety engineering means that a very broad array of
professionals are actively involved in accident prevention or safety engineering.
The majority of those practicing safety engineering are employed in industry to keep
workers safe on a day-to-day basis. See the American Society of Safety Engineers
publication Scope and Function of the Safety Profession.
Safety engineers distinguish different extents of defective operation: A "failure" is "the
inability of a system or component to perform its required functions within specified
performance requirements", while a "fault" is "a defect in a device or component, for
example: a short circuit or a broken wire".[1] System-level failures are caused by lower-level faults, which are ultimately caused by basic component faults. (Some texts reverse
or confuse these two terms. See NUREG-0492 page V-1.) The unexpected failure of a
device that was operating within its design limits is a "primary failure", while the
expected failure of a component stressed beyond its design limits is a "secondary failure".
A device which appears to malfunction because it has responded as designed to a bad
input is suffering from a "command fault".[2] A "critical" fault endangers one or a few
people. A "catastrophic" fault endangers, harms or kills a significant number of people.
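The three terms defined above (primary failure, secondary failure, command fault) can be restated as a small decision rule. The Python sketch below is illustrative only: the function and parameter names are hypothetical and do not come from any standard.

```python
def classify_malfunction(bad_input: bool, beyond_design_limits: bool) -> str:
    """Classify an apparent malfunction using the three terms from the text.

    bad_input: the device responded as designed to a bad input.
    beyond_design_limits: the device was stressed beyond its design limits.
    """
    if bad_input:
        # The device itself did what it was designed to do.
        return "command fault"
    if beyond_design_limits:
        # An expected failure under excessive stress.
        return "secondary failure"
    # An unexpected failure while operating within design limits.
    return "primary failure"


print(classify_malfunction(bad_input=False, beyond_design_limits=True))
```

The order of the checks is a design choice of this sketch: a bad input is tested first because, in that case, the device has not actually failed.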
Safety engineers also identify different modes of safe operation: A "probabilistically safe"
system has no single point of failure, and enough redundant sensors, computers and
effectors so that it is very unlikely to cause harm (usually "very unlikely" means, on
average, less than one human life lost in a billion hours of operation). An inherently safe
system is a clever mechanical arrangement that cannot be made to cause harm. This is
obviously the best arrangement, but it is not always possible. A fail-safe system is one
that cannot cause harm when it fails. A "fault-tolerant" system can continue to operate
with faults, though its operation may be degraded in some fashion.
These terms combine to describe the safety needed by systems: For example, most
biomedical equipment is only "critical", and often another identical piece of equipment is
nearby, so it can be merely "probabilistically fail-safe". Train signals can cause
"catastrophic" accidents (imagine chemical releases from tank-cars) and are usually
"inherently safe". Aircraft "failures" are "catastrophic" (at least for their passengers and
crew) so aircraft are usually "probabilistically fault-tolerant". Without any safety features,
nuclear reactors might have "catastrophic failures", so real nuclear reactors are required
to be at least "probabilistically fail-safe", and some such as pebble bed reactors are
"inherently fault-tolerant".

The process
Ideally, safety engineers take an early design of a system, analyze it to find what faults
can occur, and then propose safety requirements in design specifications up front and
changes to existing systems to make the system safer. In an early design stage, often a
fail-safe system can be made acceptably safe with a few sensors and some software to
read them. Probabilistic fault-tolerant systems can often be made by using more, but
smaller and less-expensive pieces of equipment.
Far too often, rather than actually influencing the design, safety engineers are assigned to
prove that an existing, completed design is safe. If a safety engineer then discovers
significant safety problems late in the design process, correcting them can be very
expensive. This type of error has the potential to waste large sums of money.
The exception to this conventional approach is the way some large government agencies
approach safety engineering from a more proactive and proven process perspective. This
is known as System Safety. The System Safety philosophy, supported by the System
Safety Society, is to be applied to complex and critical systems, such as commercial
airliners, military aircraft, munitions and complex weapon systems, spacecraft and space
systems, rail and transportation systems, air traffic control systems and other complex,
safety-critical industrial systems. System Safety methods and techniques aim
to prevent, eliminate and control hazards and risks through design influence exerted by a
collaboration of key engineering disciplines and product teams. Software safety is a
fast-growing field, since the functionality of modern systems is increasingly being put under
the control of software. The whole concept of system safety and software safety, as a subset
of systems engineering, is to influence safety-critical systems designs by conducting
several types of hazard analyses to identify risks and to specify design safety features and
procedures to strategically mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation can go beyond design recommendations, particularly in
the area of maintenance. There is an entire realm of safety and reliability engineering
known as "Reliability Centered Maintenance" (RCM), which is a discipline that is a
direct result of analyzing potential failures within a system and determining maintenance
actions that can mitigate the risk of failure. This methodology is used extensively on
aircraft and involves understanding the failure modes of the serviceable replaceable
assemblies in addition to the means to detect or predict an impending failure. Every
automobile owner is familiar with this concept when they take in their car to have the oil
changed or brakes checked. Even filling up one's car with gas is a simple example of a
failure mode (failure due to fuel starvation), a means of detection (fuel gauge), and a
maintenance action (fill 'er up!).
For large scale complex systems, hundreds if not thousands of maintenance actions can
result from the failure analysis. These maintenance actions are based on conditions (e.g.,
gauge reading or leaky valve), hard conditions (e.g., a component is known to fail after
100 hrs of operation with 95% certainty), or require inspection to determine the
maintenance action (e.g., metal fatigue). The Reliability Centered Maintenance concept
then analyzes each individual maintenance item for its risk contribution to safety,
mission, operational readiness, or cost to repair if a failure does occur. All of the
maintenance actions are then bundled into maintenance intervals so that maintenance
is not occurring around the clock, but rather, at regular intervals. This bundling process
introduces further complexity, as it might stretch some maintenance cycles, thereby
increasing risk, but reduce others, thereby potentially reducing risk, with the end result
being a comprehensive maintenance schedule, purpose built to reduce operational risk
and ensure acceptable levels of operational readiness and availability.
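The bundling step can be sketched in a few lines of Python. This is a minimal illustration under a strong assumption: each task is simply snapped to the nearest multiple of a common base interval. A real Reliability Centered Maintenance analysis would weigh the risk contribution of each task before stretching or shortening its cycle. The task names and hours are hypothetical.

```python
def bundle_tasks(ideal_intervals, base_interval):
    """Snap each task's ideal interval (hours) to a multiple of base_interval.

    Returns {task: (scheduled_interval, change)}. A positive change means the
    cycle was stretched (potentially more risk); a negative change means it
    was shortened (potentially less risk).
    """
    if base_interval <= 0:
        raise ValueError("base_interval must be positive")
    schedule = {}
    for task, ideal in ideal_intervals.items():
        # Nearest multiple of the base interval, but never zero.
        multiples = max(1, round(ideal / base_interval))
        scheduled = multiples * base_interval
        schedule[task] = (scheduled, scheduled - ideal)
    return schedule


# Hypothetical tasks and ideal intervals, in operating hours.
tasks = {"oil change": 90, "brake check": 230, "filter swap": 40}
for task, (scheduled, change) in bundle_tasks(tasks, base_interval=100).items():
    print(f"{task}: every {scheduled} h ({change:+d} h versus ideal)")
```

In this example two cycles are stretched and one is shortened, which is the trade-off the paragraph above describes.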

Analysis techniques
The two most common fault modeling techniques are called "failure modes and effects
analysis" and "fault tree analysis". These techniques are just ways of finding problems
and of making plans to cope with failures, as in Probabilistic Risk Assessment (PRA or
PSA). One of the earliest complete studies using PRA techniques on a commercial
nuclear plant was the Reactor Safety Study (RSS), edited by Prof. Norman Rasmussen[3]
(see WASH-1400).

Failure modes and effects analysis


In the technique known as "failure mode and effects analysis" (FMEA), an engineer starts
with a block diagram of a system. The safety engineer then considers what happens if
each block of the diagram fails. The engineer then draws up a table in which failures are
paired with their effects and an evaluation of the effects. The design of the system is then
corrected, and the table adjusted until the system is not known to have unacceptable
problems. It is very helpful to have several engineers review the failure modes and effects
analysis.
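The table described above can be sketched as a small data structure. This is a minimal illustration, not a standard FMEA worksheet: the block names, the effects, and the three-level severity scale are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale for this sketch; real worksheets define their own.
SEVERITY_RANK = {"minor": 0, "critical": 1, "catastrophic": 2}


@dataclass
class FmeaRow:
    block: str         # block of the system diagram
    failure_mode: str  # how the block fails
    effect: str        # what happens to the system as a result
    severity: str      # evaluation of the effect
    mitigated: bool    # True once the design has been corrected


def unacceptable(rows, threshold="critical"):
    """Return unmitigated rows whose severity is at or above the threshold."""
    limit = SEVERITY_RANK[threshold]
    return [r for r in rows
            if SEVERITY_RANK[r.severity] >= limit and not r.mitigated]


table = [
    FmeaRow("pump", "stops", "loss of coolant flow", "catastrophic", False),
    FmeaRow("level sensor", "reads high", "tank under-filled", "minor", False),
    FmeaRow("relief valve", "sticks closed", "overpressure", "critical", True),
]

for row in unacceptable(table):
    print(f"{row.block}: {row.failure_mode} -> {row.effect} ({row.severity})")
```

The design-and-adjust loop from the text corresponds to correcting the design, setting `mitigated` on the affected rows, and repeating until `unacceptable(table)` returns an empty list.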

Fault tree analysis


In the technique known as "fault tree analysis", an undesired effect is taken as the root
('top event') of a tree of logic. Then, each situation that could cause that effect is added to
the tree as a series of logic expressions. When fault trees are labelled with actual
failure probabilities (which in practice are often unavailable because of
the expense of testing), computer programs can calculate failure probabilities from fault
trees.
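As a rough illustration of how such a calculation works, the Python sketch below evaluates a tree of AND and OR gates. It assumes that all basic events are statistically independent and that no basic event appears more than once in the tree; production tools do not need these simplifications. The tuple representation and the probabilities are hypothetical.

```python
from math import prod


def basic(p):
    """A basic event with failure probability p."""
    return ("basic", p)


def and_gate(*children):
    """Output fails only if every child fails."""
    return ("and", children)


def or_gate(*children):
    """Output fails if at least one child fails."""
    return ("or", children)


def probability(node):
    """Probability of the node's event, assuming independent children."""
    kind, payload = node
    if kind == "basic":
        return payload
    child_probs = [probability(child) for child in payload]
    if kind == "and":
        return prod(child_probs)
    if kind == "or":
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown node kind: {kind}")


# Top event: both pumps fail, or the power supply fails (hypothetical numbers).
top_event = or_gate(and_gate(basic(0.01), basic(0.02)), basic(0.001))
print(f"P(top event) = {probability(top_event):.7f}")
```

Here the AND gate contributes 0.01 × 0.02 = 0.0002, and the OR gate combines that with 0.001 as 1 − (1 − 0.0002)(1 − 0.001).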

A fault tree diagram


The tree is usually written out using conventional logic gate symbols. The route through
a tree between an event and an initiator in the tree is called a cut set. The shortest
credible way through the tree from fault to initiating event is called a minimal cut set.
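A common set-based formulation treats a cut set as a set of basic events that together cause the top event, and a minimal cut set as one from which no event can be removed. The Python sketch below uses that formulation rather than the route-based wording above. The nested-tuple tree format and the event names are assumptions made for the example.

```python
from itertools import product


def cut_sets(node):
    """All cut sets of a tree whose leaves are basic-event names (strings)."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "or":
        # Any one child's cut set is enough to cause the gate's event.
        return [s for sets in child_sets for s in sets]
    if gate == "and":
        # One cut set from every child is needed at the same time.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Cut sets that have no proper subset which is also a cut set."""
    sets = set(cut_sets(node))
    minimal = [s for s in sets if not any(other < s for other in sets)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree: the top event occurs if both pumps fail, or power fails,
# or power and pump_a fail together (redundant with "power" alone).
tree = ("or", [("and", ["pump_a", "pump_b"]),
               "power",
               ("and", ["power", "pump_a"])])

for cs in minimal_cut_sets(tree):
    print(sorted(cs))
```

A minimal cut set containing a single event, such as `power` here, marks a single point of failure.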
Some industries use both Fault Trees and Event Trees (see Probabilistic Risk
Assessment). An Event Tree starts from an undesired initiator (loss of critical supply,
component failure, etc.) and follows possible further system events through to a series of
final consequences. As each new event is considered, a new node on the tree is added
with a split of probabilities of taking either branch. The probabilities of a range of 'top
events' arising from the initial event can then be seen.
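The branching described above can be sketched as follows. This Python example is a simplification: it assumes each safety system fails independently and evaluates every branch, whereas real event trees often prune branches that no longer matter after an earlier failure. The system names and probabilities are hypothetical.

```python
from itertools import product


def event_tree(systems):
    """Enumerate every path through an event tree.

    systems: ordered list of (name, failure_probability) pairs, one per node.
    Returns {tuple_of_failed_system_names: path_probability}.
    """
    paths = {}
    for states in product([False, True], repeat=len(systems)):
        path_probability = 1.0
        failed = []
        for (name, p_fail), has_failed in zip(systems, states):
            # Each node splits the probability between its two branches.
            path_probability *= p_fail if has_failed else 1.0 - p_fail
            if has_failed:
                failed.append(name)
        paths[tuple(failed)] = path_probability
    return paths


# After the initiating event, two safety systems are challenged in turn.
systems = [("backup power", 0.1), ("manual shutdown", 0.05)]
for failed, p in event_tree(systems).items():
    label = ", ".join(failed) if failed else "all systems work"
    print(f"{label}: {p:.3f}")
```

The path probabilities sum to one, so each final consequence can be read as a share of the initiating event's frequency.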
Classic programs include the Electric Power Research Institute's (EPRI) CAFTA
Software, which is used by almost all the US nuclear power plants and by a majority of
US and international aerospace manufacturers, and the Idaho National Laboratory's
SAPHIRE, which is used by the U.S. Government to evaluate the safety and reliability of
nuclear reactors, the Space Shuttle, and the International Space Station.

Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one life
per 10^9 hours of continuous operation is lost to failure. Most Western nuclear reactors,
medical equipment, and commercial aircraft are certified to this level. The cost versus
loss of lives has been considered appropriate at this level (by the FAA for aircraft under
Federal Aviation Regulations).
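To give a feel for how demanding this level is, the Python sketch below compares an average loss rate with the target of one life per billion operating hours. The fleet size and usage figures are hypothetical, and an observed rate over a finite period is only a rough estimate, not a certification method.

```python
TARGET_LIVES_PER_HOUR = 1e-9  # less than one life per billion hours, from the text


def loss_rate(lives_lost, operating_hours):
    """Average lives lost per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return lives_lost / operating_hours


def meets_target(lives_lost, operating_hours):
    """True if the average loss rate is below the target."""
    return loss_rate(lives_lost, operating_hours) < TARGET_LIVES_PER_HOUR


# Hypothetical fleet: 500 units, 4000 hours per year each, 20 years of service.
fleet_hours = 500 * 4000 * 20
print(fleet_hours, meets_target(lives_lost=1, operating_hours=fleet_hours))
```

In this example a single life lost over the whole service life of the fleet already exceeds the target, because the fleet accumulates far fewer than a billion hours.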

Preventing failure
Probabilistic fault tolerance: adding redundancy to equipment and systems

A NASA graph shows the relationship between the survival of a crew of astronauts and
the amount of redundant equipment in their spacecraft (the "MM", Mission Module).
Once a failure mode is identified, it can usually be prevented entirely by adding extra
equipment to the system. For example, nuclear reactors emit dangerous radiation and
contain toxic materials, and nuclear reactions can generate so much heat that no substance
could contain them. Therefore reactors have emergency core cooling systems to keep the
temperature down, shielding to contain the radiation, and engineered barriers (usually
several, nested, surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: multiple organs,
multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and
incorporated into a system.
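The effect of redundancy on failure probability can be sketched numerically. The Python example below assumes that the redundant units fail independently of one another. Common-cause failures, such as a shared power supply, break that assumption and make real systems less reliable than this estimate. The probabilities are hypothetical.

```python
def all_fail_probability(p_unit, n_units):
    """Probability that every one of n independent redundant units fails."""
    return p_unit ** n_units


def units_needed(p_unit, target):
    """Smallest number of independent units whose joint failure probability
    does not exceed the target."""
    if not 0.0 < p_unit < 1.0:
        raise ValueError("p_unit must be strictly between 0 and 1")
    if target <= 0.0:
        raise ValueError("target must be positive")
    n = 1
    while all_fail_probability(p_unit, n) > target:
        n += 1
    return n


# A unit that fails 1 time in 100, and a system target of 1 in a billion.
print(units_needed(p_unit=0.01, target=1e-9))
```

Each added unit multiplies the joint failure probability by `p_unit`, which is why a modest amount of redundancy can bring an unreliable component within a strict target.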

Inherent fail-safe design


When adding equipment is impractical (usually because of expense), then the least
expensive form of design is often "inherently fail-safe". The typical approach is to
arrange the system so that ordinary single failures cause the mechanism to shut down in a
safe way. (For nuclear power plants, this is termed a passively safe design, although more
than ordinary failures are covered.)
One of the most common fail-safe systems is the overflow tube in baths and kitchen
sinks. If the valve sticks open, rather than causing an overflow and damage, the tank
spills into an overflow.
Another common example is that in an elevator the cable supporting the car keeps spring-loaded brakes open. If the cable breaks, the brakes grab rails, and the car does not fall.
Inherent fail-safes are common in medical equipment, traffic and railway signals,
communications equipment, and safety equipment.
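The elevator brake illustrates the general pattern: the safe state is the default, and the system must actively hold itself out of it. A minimal Python sketch of that logic follows. The function name, the tension threshold and the numbers are hypothetical, and the real mechanism is purely mechanical rather than computed.

```python
def brake_engaged(cable_tension, release_tension):
    """Spring-loaded brake: held open only while cable tension is sufficient.

    With no tension at all (a broken cable) the springs close the brake,
    so the failure itself produces the safe state.
    """
    return cable_tension < release_tension


# Hypothetical figures in newtons.
RELEASE_TENSION = 2000.0
print(brake_engaged(cable_tension=8000.0, release_tension=RELEASE_TENSION))
print(brake_engaged(cable_tension=0.0, release_tension=RELEASE_TENSION))
```

The design choice is that no extra detection step is needed: losing the cable removes the only thing that keeps the brake open.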

Containing failure

It is also common practice to plan for the failure of safety systems through containment
and isolation methods. The use of isolating valves, also known as a block and bleed
manifold, is very common in isolating pumps, tanks, and control valves that may fail or
need routine maintenance. In addition, nearly all tanks containing oil or other hazardous
chemicals are required to have containment barriers set up around them to contain 100%
of the volume of the tank in the event of a catastrophic tank failure. Similarly, long
pipelines have remote-closing valves periodically installed in the line so that in the event
of failure, the entire pipeline is not lost. The goal of all such containment systems is to
provide means of limiting the damage done by a failure to a small localized area.
