Six Sigma (4 of 4)
Hazard and Operability Study (HAZOP)
A hazard analysis can be time-consuming, highly sophisticated, and involve detailed in-depth work. If there is the potential for significant injuries or damage, it may be essential to put an extensive, ongoing effort into the hazard analysis.
With many types of processes a Job Hazard Analysis (JHA) is sufficient to reveal potential hazards within the process. However, if the process is complex, or a problem in the process could lead to a disaster, a more powerful hazard analysis method should be used. OSHA lists several accepted methods, such as:
- What-If Analysis
- Checklist
- Hazard and Operability Study (HAZOP)
- Failure Mode and Effects Analysis (FMEA)
- Fault Tree Analysis
A Hazop Study is the most commonly used process hazard analysis method. It can be used to identify operability problems even during the early stages of project development, as well as identifying potential hazards in operating systems.
What is a Hazop Study?
A Hazard and Operability Study (HAZOP) is a systematic approach to investigating each element of a process to identify all of the ways in which parameters can deviate from the intended design conditions and create hazards or operability problems.
A Hazop Study typically involves using the piping and instrumentation diagrams (P&IDs), or a plant model, as a guide for examining every section and component of a process. A hazop team, consisting of experienced and knowledgeable people, brainstorms potentially hazardous situations that could arise in each section of pipe, each valve, and each vessel in the system.
The hazop team should be led by someone with an in-depth knowledge of the process, but they do not need to be an expert in the technology used in the process. The hazop team should include people with a variety of expertise such as operations, maintenance, instrumentation, engineering/process design, and other specialists as needed. These should not be “newbies,” but be people with experience, knowledge, and an understanding of their part of the system.
The Hazop Study Process
In a Hazop Study the hazop team works through the P&IDs examining the impact of potential changes to parameters such as flow, temperature, pressure and time. Using their experience they determine the effects of deviations from design conditions. This means that a Hazop Study is a systematic, step-by-step approach to brainstorming possible deviations; determining the likelihood of the deviation (is there a realistic cause); evaluating existing protections; and estimating the resulting impact and potential catastrophic result of the deviation.
The process system is evaluated as designed, noting the potential for deviations. All potential causes of failure are identified. Existing safeguards and protection systems are identified, and their ability to handle the deviations is evaluated. An assessment is written weighing the potential deviations, their consequences, their causes, and the protection requirements. When a hazard condition is identified, recommendations may be made for process or system modifications, or further study by a specialist may be required.
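The deviation brainstorming is commonly structured by crossing guide words with process parameters. As a rough sketch (the guide words below are the conventional HAZOP set, and the parameter list simply follows the text; neither comes from a specific study):

```python
# Sketch: generate a HAZOP deviation checklist by crossing guide words
# with process parameters. Guide words are the conventional HAZOP set;
# the parameters follow the text (flow, temperature, pressure, time).
from itertools import product

GUIDE_WORDS = ["NO", "MORE", "LESS", "REVERSE", "AS WELL AS", "PART OF", "OTHER THAN"]
PARAMETERS = ["flow", "temperature", "pressure", "time"]

def deviation_checklist(parameters=PARAMETERS, guide_words=GUIDE_WORDS):
    """Return every guide-word/parameter deviation for the team to brainstorm."""
    return [f"{gw} {param}" for param, gw in product(parameters, guide_words)]

for deviation in deviation_checklist()[:3]:
    print(deviation)  # e.g. "NO flow", "MORE flow", "LESS flow"
```

Each generated deviation ("NO flow", "MORE pressure", etc.) then becomes a prompt for the team: is there a realistic cause, what are the consequences, and what safeguards exist?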
What Do You Do With the Hazop Study Results?
A Hazop Study may be a one-time study of limited duration, or it may be ongoing, not having a specific end date. Study results should be released as action items as they are identified. Typical actions a Hazop Study might recommend include:
- A review of existing protection system designs by a specialist
- Adding or modifying alarms that warn of deviations
- Adding or modifying relief systems
- Adding or modifying ventilation systems
- Increasing sampling and testing frequency
- Implementation of additional engineering controls
The Role of Labelling in Hazop
Process system components such as mixers, vats, piping, valves, sample points, instruments, and vessels must be identified and labeled in accordance with the P&IDs. Being able to correctly and reliably locate a component identified on the P&IDs is important both for an effective Hazop Study and for operational safety. Opening the wrong valve, or starting to cut into the wrong pipe, has often been the cause of serious accidents.
The key to effective labels and signs is durability: they must be able to withstand the environment in which they are used.
Software Failure Mode and Effects Analysis (SFMEA)
The application of Failure Mode and Effects Analysis (FMEA) to software (SFMEA) was first proposed in 1979.
Since that time, SFMEA, sometimes known as Software Error Effect Analysis (SEEA), has been refined and applied successfully at functional, interface and detailed levels.
Some of the approaches taken to SFMEA, however, are flawed.
Software FMEA has also been useful in conjunction with requirements analysis.
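One common way to prioritise the failure modes an SFMEA surfaces is the FMEA Risk Priority Number (severity x occurrence x detection, each rated 1-10). The source does not prescribe this metric, so treat the following as an illustrative sketch with made-up failure modes:

```python
# Sketch: rank software failure modes by Risk Priority Number (RPN),
# a common FMEA metric (severity x occurrence x detection, each 1-10).
# The failure modes and ratings below are illustrative, not from any
# real analysis.

def rpn(severity, occurrence, detection):
    for v in (severity, occurrence, detection):
        assert 1 <= v <= 10, "FMEA ratings are conventionally 1-10"
    return severity * occurrence * detection

failure_modes = [
    ("null pointer dereference in parser", 8, 4, 3),
    ("timeout not handled on interface",   6, 5, 6),
    ("off-by-one in buffer index",         9, 2, 7),
]

# Highest RPN first: these failure modes deserve attention earliest.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)
for name, s, o, d in ranked:
    print(f"RPN={rpn(s, o, d):3d}  {name}")
```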
Extreme caution is advised as this technique, in the wrong hands, will burn investment dollars at a rapid rate and provide little "bang for the buck".
HCRQ has years of experience applying SFMEA. In addition, we teach SFMEA in our Software Safety Course and in our SFTA & SFMEA Webinar.
Occasionally, Software Failure Mode, Effects and Criticality Analysis (that's right, SFMECA, not SFMEA) is stipulated. For example, 49 CFR 238.105 states:
"The hardware and software safety program shall be based on a formal safety methodology that includes a Failure Modes, Effects, Criticality Analysis (FMECA); verification and validation testing for all hardware and software components and their interfaces; and comprehensive hardware and software integration testing to ensure that the hardware and software system functions as intended."
As another example, a client of ours received a SOW which called for SFMECA per MIL-STD-1629A.
SFMECA is not straightforward. HCRQ conceived an approach to comply with the requirement for SFMECA when it cannot be avoided.
Fault Tree Analysis (FTA)
Fault tree analysis (FTA) was first introduced by Bell Laboratories and is one of the most widely used methods in system reliability, maintainability and safety analysis. It is a deductive procedure used to determine the various combinations of hardware and software failures and human errors that could cause undesired events (referred to as top events) at the system level.
The deductive analysis begins with a general conclusion, then attempts to determine the specific causes of the conclusion by constructing a logic diagram called a fault tree. This is also known as taking a top-down approach.
The main purpose of the fault tree analysis is to help identify potential causes of system failures before the failures actually occur. It can also be used to evaluate the probability of the top event using analytical or statistical methods. These calculations involve system quantitative reliability and maintainability information, such as failure probability, failure rate and repair rate. After completing an FTA, you can focus your efforts on improving system safety and reliability.
FTA logic diagram
The basic symbols used in an FTA logic diagram are called logic gates and are similar to the symbols used by electronic circuit designers. Two kinds of gates, "and" and "or," are described in Table 1.
The partial FTA logic diagram in Figure 1 uses the "and" and "or" gate symbols to analyze hazards to the patient. Inputs to the "or" gate at the top identify the four reasons this failure can occur. One of the reasons, electrical shock, is then broken down because it results from simultaneously grounding the patient and creating a pathway to a current source (an "and" gate). The analysis continues, using the same technique, until the lowest levels, such as operator error or an open ground pin, are identified.
When you perform an FTA, you systematically determine what happens to the system when the status of a part or another factor changes. In some applications, the minimum criterion for success is that no single failure can cause injury or an undetected loss of control over the process. In others, where extreme hazards exist or when high value product is being processed, the criteria may be increased to require toleration of multiple failures.
Fault tree construction
To do a comprehensive FTA, follow these steps:
- Define the fault condition, and write down the top level failure.
- Using technical information and professional judgments, determine the possible reasons for the failure to occur. Remember, these are level two elements because they fall just below the top level failure in the tree.
- Continue to break down each element with additional gates to lower levels. Consider the relationships between the elements to help you decide whether to use an "and" or an "or" logic gate.
- Finalize and review the complete diagram. The chain can only be terminated in a basic fault: human, hardware or software.
- If possible, evaluate the probability of occurrence for each of the lowest level elements and calculate the statistical probabilities from the bottom up.
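Under the usual independence assumption for basic events, the bottom-up probability calculation in the last step reduces to two gate formulas. A minimal sketch (the event probabilities are invented for illustration):

```python
# Sketch: bottom-up evaluation of top-event probability in a fault tree,
# assuming statistically independent basic events. The gates use the
# standard formulas: P(OR) = 1 - prod(1 - p_i), P(AND) = prod(p_i).
from math import prod

def gate_or(*probs):
    """Probability that at least one input event occurs."""
    return 1 - prod(1 - p for p in probs)

def gate_and(*probs):
    """Probability that all input events occur together."""
    return prod(probs)

# Illustrative tree, loosely following Figure 1: electrical shock requires
# a grounded patient AND a current pathway; the top event occurs if shock
# OR either of two other (invented) causes occurs.
p_shock = gate_and(0.01, 0.02)          # 2e-4
p_top = gate_or(p_shock, 0.001, 0.005)  # about 0.006194
print(f"{p_top:.6f}")
```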
Markov Model
Models
A model is an abstract representation of reality.
Consider model airplanes. Some model airplanes look very much like a small version of a real airplane, but do not fly well at all. Other model airplanes (e.g., a paper airplane) do not look very much like airplanes at all, but fly very well. These two kinds of models represent different features of the airplane; the first represents its outward appearance, while the second represents its aerodynamic properties (in part). Thus what type of model is appropriate to use depends upon the intended purpose.
Mathematical models represent a system, and are used to make predictions about that system.
Parameters
Every model consists of a structure, along with parameters that must be defined for the model to be meaningful. The structure of the model defines dependencies among the various parts of the model. Parameters are values -- often, but not necessarily, numerical values -- that are required by the model. Parameters may be fixed, in which case they constitute assumptions of the model, or they may be variable. If they are variable, their values must be estimated or otherwise determined before the model can be used to make predictions.
Markov Process
A Markov process is a process that is capable of being in more than one state, can make transitions among those states, and in which the states available and transition probabilities depend only upon what state the system is currently in. In other words, there is no memory in a Markov process.
Markov Chain
- A Markov chain is a statistical model of a system that moves sequentially from one state to another.
- The probabilities of transition from one state to another depend only on the current state (not on previous states).
- It is generally modeled as a stochastic process.
- A Markov chain can be described by a transition matrix.
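The points above can be sketched with a small transition matrix; the two-state "up/down" system and its probabilities are invented for illustration:

```python
# Sketch: a two-state Markov chain described by a transition matrix.
# Row i holds the probabilities of moving from state i to each state;
# each row must sum to 1. The state distribution evolves one step at
# a time, depending only on the current distribution (no memory).

STATES = ["up", "down"]           # illustrative system states
P = [[0.9, 0.1],                  # from "up":   90% stay up, 10% fail
     [0.5, 0.5]]                  # from "down": 50% repaired, 50% stay down

def step(dist, matrix):
    """One transition: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]                 # start certainly in "up"
for _ in range(3):
    dist = step(dist, P)
print(dist)                       # distribution after three transitions
```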
Hidden Markov Models (HMMs)
A hidden Markov model models a Markov process, but assumes that there is uncertainty in what state the system is in at any given time.
A common metaphor is to think of the HMM as if the Markov model were a mechanism hidden behind a curtain. The observer doesn't have any knowledge of how the mechanism is operating, and only knows what is going on behind the curtain from periodic reports that are emitted by the mechanism.
The probabilities of these reports, given the hidden state, are known as emission probabilities.
Three basic problems for HMMs [per Jack Ferguson (IDA), ex Rabiner (1989)]
There are three fundamental types of problems to which HMMs can be applied:
Problem type 1 (event evaluation): given a model and a sequence of events, estimate the probability that the HMM would give rise to the observed events.
Problem type 2 (path optimisation): given a series of events and a model, find the "optimal" set of model states (i.e., the optimal path through that model). This often means finding the set of model states that best correspond to the observed series.
Problem type 3 (parameter estimation): given a model and empirical observations, find the model parameters that best fit the observations.
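Problem type 1 can be sketched with the standard forward algorithm; the two-state model, its observation symbols, and all probabilities below are invented for illustration:

```python
# Sketch: the forward algorithm solves problem type 1 -- it computes the
# probability that an HMM generates an observed event sequence, summed
# over all possible hidden state paths. All numbers are illustrative.

states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}                      # initial state probabilities
trans = {"A": {"A": 0.7, "B": 0.3},               # transition probabilities
         "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5},                # emission probabilities
        "B": {"x": 0.1, "y": 0.9}}

def forward(observations):
    """Return P(observations | model)."""
    # alpha[s] = P(observations so far, currently in state s)
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(round(forward(["x", "y"]), 4))  # 0.2156
```

Problem type 2 is solved by the closely related Viterbi algorithm, which replaces the sum over predecessor states with a max and records the best path.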
Probability vs. likelihood
The term probability is used to refer to the relative odds of an event occurring in a case where all possible outcomes can be accounted for. By contrast, the term likelihood is used to refer to the odds of an event occurring when it is not possible (or not practical) to account for all possible outcomes. Thus the sum of the probabilities of all possible outcomes will always be 1, while likelihood is often interpreted in a relative sense, i.e., one event may be considered to be more likely than another even when there is a third possible outcome of unknown likelihood.
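A small numeric illustration of the distinction (the coin-flip data and hypotheses are invented):

```python
# Sketch: likelihoods of two hypotheses given the same data need not sum
# to 1, whereas probabilities of all outcomes under one hypothesis do.
# Data: 7 heads in 10 coin flips. Hypotheses: fair coin vs 70%-heads coin.
from math import comb

def binom_likelihood(p_heads, heads, flips):
    """Binomial probability of `heads` successes in `flips` trials."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

L_fair = binom_likelihood(0.5, 7, 10)
L_biased = binom_likelihood(0.7, 7, 10)
print(L_biased > L_fair)   # True: the biased coin is the more likely hypothesis

# Probabilities over all outcomes of one hypothesis do sum to 1:
print(sum(binom_likelihood(0.5, k, 10) for k in range(11)))  # 1.0 (approx)
```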
HMMs and hypothesis testing
HMMs can be used to test hypotheses. To perform hypothesis testing it is essential to be able to relate the data (the empirical observations) to the hypothesis to be tested:
The model is accepted as a set of a priori assumptions. These assumptions make it possible to calculate a numeric likelihood for the hypothesis given the data. Thus the relative likelihoods of two different hypotheses can be compared given the same model and data.
A single hypothesis and model could correspond to many different specific datasets. For example, molecular phylogenies for two different genes might have identical tree topologies (hypotheses) and models (model of sequence evolution and parameters), but completely different gene sequences (data). For this reason it is frequently convenient to approach this as a problem of type 1 (above), and calculate the likelihood that a given DNA sequence (the data) would be observed given the hypothesis being tested (a tree topology) and a model (of sequence evolution). This is often written L(hypothesis) = P(data | hypothesis, model).
What makes this a powerful way of thinking about things is that it is possible to move components among these different elements. One can move part of the model into the hypothesis and test it, and if new data become available, then elements of the hypothesis or the model can be moved into the data, etc. Thus one can estimate parameters of sequence evolution if one accepts a phylogenetic tree as an a priori assumption, etc.
HMMs provide a very flexible context in which a variety of biological processes and datatypes may be modelled.
Cause-Consequence Analysis (CCA)
Cause-consequence analysis (CCA) is a method for analysing consequence chains and can be used on its own or in support of other analysis methods. The objective of the analysis is to recognise the consequence chains that develop from failures or other unwanted events, and to estimate those consequences together with their probabilities. The cause-consequence structure of the analysis is formed by combining two different types of tree structure. The consequence tree, built from left to right, includes the examined primary event and its follow-up events, leading eventually to a failure or some other unwanted event, such as a serious injury to a person.
The causes and the probabilities of the primary event and the follow-up events are defined in cause trees built from the top down. Cause trees often describe failures and are therefore called fault trees. The top level of a cause tree is at the same time a node in the consequence tree, describing an event that either occurs or does not. Together, the cause and consequence trees create a visual consequence chain that helps illustrate the relations between the causes and consequences leading to different damages. The consequence tree shows the possible consequence chains and damages of a single event, whereas the cause trees (fault trees) describe the causes and probabilities of each consequence.
Cause-consequence analysis includes the following phases:
- Recognising damage chains
- Recognising the primary event (failure or some unwanted event that triggers the damage chain)
- Recognising the follow-up events (events between primary event and final damages)
- Recognising the final consequence damages (damages arising from the different levels of follow-up events)
- Defining causes of primary and follow-up events to cause/fault trees
- Inputting realisation probabilities (failure data) for the causes of primary and follow-up events
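The probability propagation in the last phase can be sketched as follows; the events and probabilities are invented for illustration:

```python
# Sketch: propagating probabilities along a single consequence chain.
# Each node in the consequence tree either occurs (with a probability
# supplied by its cause/fault tree) or does not, splitting the chain.
# All events and numbers below are illustrative.

def chain_probability(primary_p, branch_probs):
    """P(final damage) = P(primary event) * product of follow-up branch probs."""
    p = primary_p
    for bp in branch_probs:
        p *= bp
    return p

# Primary event: pump seal failure (p = 0.01). Follow-up branches:
# the leak ignites (0.1), and the sprinkler fails to suppress it (0.05).
p_fire_damage = chain_probability(0.01, [0.1, 0.05])
print(p_fire_damage)  # roughly 5e-05
```

Summing such products over every path through the consequence tree gives the probability of each final damage, which is exactly the second result listed below.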
Cause-consequence analysis is an effective tool for confirming that operational safety features have been taken into account already in the design phase. The method is especially applicable when examining complex event chains in which a single primary event can lead to many possible consequence damages.
The results of cause-consequence analysis include among other things:
- Visual and logical description of the consequence chain evolving from the examined primary event
- Probabilities for the final consequence damages based on the cause-consequence structure
- Cause-consequence relations (causalities) between events
- Requirements for the safety features
The Management Oversight and Risk Tree (MORT)
The Management Oversight and Risk Tree (MORT) is an analytical procedure for determining causes and contributing factors. MORT arose from a project undertaken in the 1970s. The work aimed to provide the U.S. Nuclear industry with a risk management programme competent to achieve high standards of health and safety. Although the MORT chart (the logic diagram that accompanies this text) was just one aspect of the work, it proved to be popular as an evaluation tool and lent its name to the whole programme.
By virtue of public domain documentation, MORT has spawned several variants, many of them translations of the MORT User's Manual into other languages. The durability of MORT is a testament to its construction; it is a highly logical expression of the functions required for an organisation to manage risks effectively. These functions have been described generically – the emphasis is on "what" rather than "how" and this allows MORT to be applied to different industries. The longevity of MORT may also be a reflection of the far-sighted philosophy from which it emerged, a philosophy which held that the most effective way of managing safety is to make it an integral part of business management and operational control.
The MORT programme for assuring safety was written up by W.G. Johnson under the title "MORT: the Management Oversight & Risk Tree" (SAN 821-2, February 1973). Part of this was a method for investigating incidents and accidents that relied upon a logic tree diagram (the eponymous tree of the MORT acronym). The MORT diagram served as a graphical index to Johnson's text, allowing people to apply its contents in a methodical way. To help investigators, especially novices, the original text (which is in excess of 500 pages) was distilled into a forty-two-page question set: the MORT User's Manual. MORT as a method is now largely independent of MORT as a programme, certainly in Europe. In practice, the MORT text (i.e. SAN 821-2) has become disassociated from the MORT chart, leaving the MORT User's Manual as the most common source of reference.
General Approach
In MORT, accidents are defined as unplanned events that produce harm or damage, that is, losses. Losses occur when a harmful agent comes into contact with a person or asset. This contact can occur either because of a failure of prevention or as an unfortunate but acceptable outcome of a risk that has been properly assessed and acted on (a so-called "assumed risk"). MORT analysis always evaluates the "failure" route before considering the "assumed risk" hypothesis.
In MORT analysis, most of the effort is directed at identifying problems in the control of a work/process and deficiencies in the protective barriers associated with it. These problems are then analysed for their origins in planning, design, policy, etc.
To use MORT, you must first identify key episodes in the sequence of events. Each episode can be characterised as:
- a vulnerable target exposed to –
- an agent of harm in the –
- absence of adequate barriers.
MORT analysis can be applied to any one or more of the episodes identified; it is a choice for you to make in the light of the circumstances particular to your investigation. To identify these key episodes, you will need to undertake a barrier analysis (or "Energy Trace and Barrier Analysis" to give it its full title). Barrier analysis allows MORT analysis to be focussed; it is very difficult to use MORT, even in a superficial way, without it.
The MORT process is rather like a dialogue between the generic questions of MORT and the situation that you are investigating. You, the analyst, act as the interpreter between MORT and the situation. The questions in MORT are asked in a particular sequence, one that is designed to help you clarify the facts surrounding the incident. Even so, not every question posed by MORT will be relevant on all occasions. Getting acquainted with MORT is essentially about becoming familiar with the gist of questions in this manual. The chart itself then acts as a prompt list allowing you to concentrate on the issues revealed through the process. It is important for you to make notes as you go, just as it would be if you were conducting an interview. In practice, MORT analysts make brief notes on the MORT chart - enough to capture the issues that arise and their assessment of them. To make this process easier to review, it is customary to colour-code the chart as you go:
- red, where a problem is found;
- green, where a relevant issue is judged to have been satisfactory, and;
- blue, to indicate where you think an issue is relevant but you don't have enough information to properly assess it.
In addition, issues presented by MORT that you judge to be irrelevant should be crossed out to show that you have considered them. The outcomes of a MORT analysis are:
- the creation of new lines of enquiry;
- visibility of causal factors (which are grouped thematically) and;
- increased confidence in the thoroughness of the investigation.
These results are not gained without effort; one sweep through MORT for one episode is likely to take an experienced MORT analyst about one hour. As a general rule, only use MORT when you judge that it will add to your investigation – do not use it just because you can. Furthermore, you need to be familiar with the method and to have performed it at least once on a real investigation, to be in a good position to make this judgement.
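The colour-coding convention described above can be captured in a few lines; the chart element names here are hypothetical:

```python
# Sketch: tracking MORT chart colour-coding during an analysis. The
# element names are hypothetical; RED/GREEN/BLUE follow the convention
# described above (problem / satisfactory / insufficient information).
from enum import Enum

class Code(Enum):
    RED = "problem found"
    GREEN = "judged satisfactory"
    BLUE = "relevant but not enough information"

chart = {
    "SB1 harmful energy flow": Code.RED,
    "SB2 vulnerable target": Code.GREEN,
    "SB3 controls & barriers": Code.BLUE,
}

# Blue items become new lines of enquiry for the investigation.
enquiries = [elem for elem, code in chart.items() if code is Code.BLUE]
print(enquiries)
```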
MORT Structure
The top event in MORT is labelled “Losses”, beneath which are its two alternative causes: (1) Oversights and Omissions, or (2) Assumed Risks. All contributing factors in the accident sequence are treated as oversights and omissions unless they are transferred to the Assumed Risks branch. Input to the Oversights and Omissions event is through an AND logic gate. This means that problems manifest in the specific control of work activities, necessarily involve issues in the management processes that govern them.
The Specific and Management branches are the two main branches in MORT. Specific control factors are broken down into two classes: those related to the incident or accident itself (SA1) and those related to restoring control following an accident (SA2). These are under an OR gate because either can be a cause of losses.
MORT is accomplished using the MORT diagrams, which are available at several levels of detail. The most comprehensive, with about 10,000 blocks, basically fills a book. There is an intermediate diagram with about 1,500 blocks, and a basic diagram with about 300. It is, of course, possible to tailor a MORT diagram by choosing various branches of the MORT tree and using only those segments. MORT is essentially a negative tree, so the process begins by placing an undesired loss event at the top of the diagram being used. The MORT user then systematically responds to the issues posed by the diagram. All aspects of the diagram are considered and the "less than adequate" blocks are highlighted for risk control action.
Full application of MORT is reserved for the highest risks and most mission critical activities because of the time and expense required. MORT is also basically a professional tool requiring a specially trained loss control professional to assure proper application. The basic MORT diagram can be used to facilitate and check on the overall hazard ID process by those with the interest and motivation to ensure excellence.
MORT Procedures
Choose an episode from your Barrier Analysis and write it on the MORT chart above SA1 ("Incident"):
- Begin at SB1 ("Harmful energy flow…")
- State the energy flow above SB1
- Proceed through the chart top to bottom, left to right
- Code RED or GREEN only with evidence and a standard of judgement
- Code BLUE if the evidence or the required standard is uncertain
- Maintain your list of further enquiries as you go
- Note any provisional Assumed Risks in the table
When SB3 ("Controls & Barriers LTA") is complete:
- Explore the M-branch, either ad hoc or in sequence: a2-MB1, a1-MB1, MA1, MA2, MB2
If needed, select another episode from the Barrier Analysis:
- Use a fresh MORT chart
- Repeat the steps above
When all required SA1 analyses are complete:
- Note on the barrier analysis which episodes have been subject to MORT analysis
- Move to SA2 (Amelioration)
- Move to the M-branch and explore it in the light of the SA2 analysis
Review the provisional Assumed Risks:
- Explore any that are LTA using a1-MB1
- Review MB2 in the light of the analysis so far
- Review the M-branch issues, taking the overview
Comments
The management oversight and risk tree (MORT) is the ultimate hazard ID tool. MORT uses a series of MORT charts developed and perfected over several years by the Department of Energy in connection with their nuclear safety programs. Each MORT chart identifies a potential operating or management level hazard that might be present in an operation. The attention to detail characteristic of MORT is illustrated by the fact that the full MORT diagram or tree contains more than 10,000 blocks. Even the simplest MORT chart contains over 300 blocks. Obviously, full application of MORT is a very time-consuming and costly venture. The basic MORT chart with about 300 blocks can be routinely used as a check on the other hazard ID tools. By reviewing the major headings of the MORT chart, an analyst will often be reminded of a type of hazard that was overlooked in the initial analysis. The MORT diagram is also very effective in assuring attention to the underlying management root causes of hazards.
MORT is the ultimate in ORM hazard ID processes. Unfortunately, in a military context only rarely will the time, resources, expertise, and mission critical issue come together to permit full application of the process. Nevertheless, the wise risk manager will become familiar with MORT processes and will frequently use the basic MORT diagram to reinforce mainstream hazard ID tools.
The MORT diagram is essentially an elaborate negative logic diagram. The difference from a negative tree built from scratch is primarily that the MORT diagram is already filled out for the user, allowing a person to identify the various contributory cause factors for a given undesirable event. Since MORT is very detailed, as mentioned above, a person can identify basic causes for essentially any type of event.
RASCI Responsibility Matrix
The RASCI Responsibility Matrix, sometimes just called the RASCI Matrix, is one of the methods used to assign and display the responsibilities of individuals or jobs for a task (project, service or process) in an organization. RASCI (sometimes RASIC) is an acronym formed from the initial letters of these words:
- R - Responsible - who is responsible for carrying out the entrusted task?
- A - Accountable (also Approver) - who is accountable for the task as a whole and answers for what has been done?
- S - Support - who provides support during the implementation of the activity / process / service?
- C - Consulted - who can provide valuable advice or consultation for the task?
- I - Informed - who should be informed about the task progress or the decisions in the task?
How do you use the RASCI matrix in practice, and what is it for?
The RASCI matrix is used for the allocation and assignment of responsibilities to team members in projects, processes or their parts. Use the letters R, A, S, C, I in the matrix to describe each person's level of responsibility. One rule applies: overall accountability (A) belongs to exactly one person per task. The number of people involved in execution (R) should be adequate to the task. The RASCI method is a simple form of competency model; it expands the RACI matrix with the people who support the execution of the task.
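The single-A rule can be checked mechanically; the tasks and names below are illustrative:

```python
# Sketch: a RASCI matrix as a nested dict, with a check for the rule
# stated above -- exactly one person holds A (Accountable) per task,
# and at least one person is R (Responsible). Tasks and names are
# purely illustrative.

matrix = {
    "design review":  {"Ann": "A", "Bob": "R", "Cem": "C", "Dee": "I"},
    "implementation": {"Ann": "I", "Bob": "A", "Cem": "R", "Dee": "S"},
}

def validate(rasci):
    """Return a list of rule violations (empty if the matrix is valid)."""
    problems = []
    for task, roles in rasci.items():
        letters = list(roles.values())
        if letters.count("A") != 1:
            problems.append(f"{task}: must have exactly one A")
        if "R" not in letters:
            problems.append(f"{task}: needs at least one R")
    return problems

print(validate(matrix))  # [] -- both tasks satisfy the rules
```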