Normally, when we think
of Root Cause Analysis (RCA), we think of a machine that has failed.
Perhaps a drive shaft has fractured on a critical piece of equipment and
we need to find out what happened. Although
this is a common use of RCA, there are other types of events that lend themselves
to this type of analysis. These
include process issues, quality defects, customer complaints, and many others.
In
actuality, process issues that cause daily upsets are perfect candidates for
RCA. Whether you are having trouble
removing moisture from a product, shutting down a production line due to
continual conveyor trips, exceeding the limits on critical process variables or
anything in between, RCA can help. These
issues tend to be very costly to a plant because they are ongoing and can cause
significant losses in production and product quality and, ultimately, customer
dissatisfaction.
Before
we begin talking about the specifics of using RCA on process issues, it is
important to define what is meant by RCA. Although
there is not an official standard for performing RCA, there are some specific
elements that are needed to perform any RCA:
- A factual definition of the failure
- Data, Data, Data!!!
- A cross-functional team
- A systematic process for analyzing the data
- A mechanism for addressing the corrective actions
- A means of ensuring that the corrective actions were effective
Let’s begin by discussing these critical factors.
Most analysis teams fail because they are trying to solve a problem that
does not exist. This may seem silly,
but it is surprising how often the analysis team defines the problem improperly.
Instead of focusing on the consequences and factual failure modes, they
tend to focus on symptoms that often are not the real problem.
We will discuss this in more detail later in this paper, but for now just
remember that you cannot solve problems that “do not exist”.
Accurately defining the problem is the key to a successful analysis.
I often
compare RCA investigations to police investigations.
I ask my clients what the first step is in a police investigation.
The answer is always to collect the evidence.
The success of a police investigation is totally based on the evidence.
While we are not looking for “whodunit” in our investigations, we
still require the physical data to make our case.
Without data, we are simply guessing, using trial and error!
This is a very expensive method for solving problems.
In order to be successful, you must collect pertinent data about the
issues you are studying. For
example, if we are having a problem with controlling process temperature, we
will need to collect data on when the problem started, process changes, process
flow diagrams, maintenance histories, and much more.
I use a
simple acronym to help collect this data. It
is simple to remember, and works for almost any type of investigation.
It is called the 5P’s, and stands for the five categories of data that
need to be collected for any analysis. The
5P’s are:
- Parts
- People
- Position
- Paper
- Paradigms
The idea is for the analysis team to discuss what data will be necessary to
determine the root causes of the issue based on these 5 categories.
The team would need to determine what parts would have to be analyzed to
solve the problem. Examples include
failed heat exchanger tubes, broken coupling bolts, and instruments.
Positional
information would include things like the time of the events.
This is particularly important when you are performing analysis on
process related issues that happen repeatedly.
The idea is to determine if there is a pattern or correlation for when
the events are occurring. Other
positional information is related to key process variables, location of the
events, and the like. Positional
information is arguably the most important of the 5P’s,
especially when dealing with process-related issues.
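One simple way to look for the timing pattern described above is to bucket event timestamps by shift and by hour of day. The sketch below is only an illustration: the timestamps are made up, and the three-shift boundaries (06:00, 14:00, 22:00) are an assumption rather than anything taken from a real plant.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: times of conveyor trips (made-up data).
EVENTS = [
    "2024-03-04 06:10", "2024-03-04 06:45", "2024-03-04 15:20",
    "2024-03-05 06:05", "2024-03-05 07:30", "2024-03-05 22:40",
    "2024-03-06 06:25", "2024-03-06 06:50",
]


def parse(raw_events):
    """Convert 'YYYY-MM-DD HH:MM' strings into datetime objects."""
    return [datetime.strptime(e, "%Y-%m-%d %H:%M") for e in raw_events]


def shift_of(ts):
    """Map a timestamp to a shift name. The boundaries are assumed."""
    if 6 <= ts.hour < 14:
        return "day"
    if 14 <= ts.hour < 22:
        return "evening"
    return "night"


def count_by_shift(raw_events):
    """Count how many events fall in each shift."""
    return Counter(shift_of(ts) for ts in parse(raw_events))


def count_by_hour(raw_events):
    """Count how many events fall in each hour of the day."""
    return Counter(ts.hour for ts in parse(raw_events))


for shift, n in count_by_shift(EVENTS).most_common():
    print(f"{shift}: {n} events")
```

With this made-up log, most events cluster in the first hour of the day shift, which would point the team toward hypotheses about startup or shift change. The same bucketing could be applied to any other positional variable, such as location or a key process reading.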
Paper
information is fairly obvious, but there are a couple of items that should
always be collected. Drawings like
process flow diagrams (PFD) and piping and instrumentation diagrams (P&ID)
are extremely useful in any RCA. The
analysis team needs a visual reference for the problem they are studying and
drawings like these usually fit the bill. If
relevant, machine or equipment cutaway drawings can also be useful.
These can usually be acquired from the vendor or the manufacturer of the
equipment. Other common paper items
that should be considered are maintenance histories, shift logs, inspection
reports, specifications, and a host of others.
Be careful with paper data because it can quickly be overwhelming.
If paper data is secure and available, then you can always collect it
later in the analysis as you need it. Do
not collect more than you need, or it can stifle the analysis unnecessarily.
As in a
police investigation, interviews with knowledgeable people are a must.
People information usually comes in the form of a discussion or
interview, and should be specific to the problem at hand.
In some cases, it will be eyewitness accounts of the issue or a third
party like a vendor or sister facility that might be having similar issues.
Just remember that if you are trying to get eyewitness accounts of the
issue, it is best to get to the individual as quickly as possible so that their
information is not lost through short term memory loss or the opinions of
others.
Finally,
the last “P” represents paradigms. Paradigms
are the collective opinions of others. They
typically become evident after a number of interviews with plant personnel.
They usually come in the form of statements like “It is a design
problem” or “It is maintenance’s fault it continues to fail”.
Although paradigms are not facts, they are perceived as facts and can
often dictate how we deal with the issue at hand.
For example, if we have a problem with leaking tubes it might be a
paradigm that it is a metallurgical issue, and therefore, we have to solve the
problem with different metals. It
may have nothing to do with the metallurgy, but if that is the paradigm, then
the problem will persist.
We have
heard about the virtues of cross-functional teams for years.
The fact is it is hard to solve problems in a vacuum.
We need the opinions and knowledge of others.
If conventional wisdom were enough, then the problems would already be
solved. The reason they persist is
that the chronic causes fall outside of conventional wisdom.
I
suggest having representation of operations, maintenance, and technical services
involved in an analysis. These,
coupled with an unbiased facilitator, will provide the catalyst for a successful
outcome. Let me expand for a moment
on the unbiased facilitator. It is
human nature to assign a complex problem to the person we feel has the most
knowledge and experience. However,
this is precisely the wrong thing to do. The
facilitator should not be the expert in the particular problem, but rather an
expert in RCA methodology and facilitation.
This gives them the ability to ask the tough questions since they have
nothing to gain or lose from the outcome of the analysis.
Once the
data has been collected and a formal team is in place, we can now begin to
systematically determine the real root causes of the problem.
I want to highlight “causes” because every failure or problem has
multiple contributing causes. There
are many techniques for systematically determining root causes.
Some that come to mind are fault trees, fishbone diagrams, cause and
effect diagrams, and many others. I
prefer to use a logic tree for looking at process related issues.
Logic trees are similar to fault trees except that they are historical
and not probabilistic. The key to
any scientific method for solving a problem is to develop hypotheses, and then
use facts and data to prove or disprove the hypotheses.
Logic Tree Structure
The first two levels of the logic tree define the problem.
A better way to think of the Event block is the consequence of the event.
For example, if we have repeated conveyor failures, then the consequence
would be lack of feed to a downstream process.
The
modes are the factual reasons for the event (consequence).
For example, the conveyor might have repeated trips, or perhaps the
rollers continue to malfunction. The
key to both the mode and the event components of the logic tree is that they
are facts. The top of the logic tree
must be facts, or the process will not be successful.
Remember, you cannot solve problems that do not exist!
The next
step is the formulation of ideas. These
are hypotheses or “educated guesses”. These
are the ideas about how the modes have occurred in the past.
Because these hypotheses are guesses, they need to be validated to see whether
they are true. This step is
critical to the process. I always
tell my clients that if you are not planning on verifying the
hypotheses, then you should not bother performing the RCA in the first
place. Without the verification step, you are
simply guessing.
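The event, mode, and hypothesis levels described above can be sketched as a small tree in which every hypothesis carries a verification status. This is a minimal illustration, not a formal logic tree tool: the `Node` class, its field names, and the conveyor hypotheses and evidence notes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One block in a logic tree (hypothetical structure)."""
    label: str
    kind: str                        # "event", "mode", or "hypothesis"
    verified: Optional[bool] = None  # None means not yet checked against data
    evidence: str = ""
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def unverified(node: Node) -> List[str]:
    """List the hypotheses that no data has yet proved or disproved."""
    found = []
    if node.kind == "hypothesis" and node.verified is None:
        found.append(node.label)
    for child in node.children:
        found.extend(unverified(child))
    return found


def prune_disproved(node: Node) -> None:
    """Drop every branch that the data has disproved."""
    node.children = [c for c in node.children if c.verified is not False]
    for child in node.children:
        prune_disproved(child)


# The top two levels are facts, so they are marked as verified.
tree = Node("Lack of feed to downstream process", "event", True)
mode = tree.add(Node("Repeated conveyor trips", "mode", True))

# Hypothetical answers to "How can the conveyor trip repeatedly?"
mode.add(Node("Motor overload", "hypothesis", True, "motor current trend"))
mode.add(Node("Belt mistracking", "hypothesis", False, "alignment check"))
mode.add(Node("Faulty trip sensor", "hypothesis"))

still_open = unverified(tree)
prune_disproved(tree)
print("Still to verify:", still_open)
print("Remaining branches:", [c.label for c in mode.children])
```

Keeping the evidence next to each hypothesis makes it obvious which branches are still guesses. An RCA should not be considered finished while `unverified` returns anything.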
Verifying
hypotheses is an iterative process. Eventually,
the generation of hypotheses will result in the discovery of causes.
There are three types of causes:
- Physical
- Human
- Latent
Let me
explain the difference between these different types of causes.
Physical root causes are the physical mechanisms that cause the modes,
and ultimately, the consequences to occur. Examples
of physical causes are high vibration, overload, and corrosion just to name a
few. The human root causes are
related to human intervention. Things
like not doing something you were expected to do, or doing something that you
were not supposed to do. Examples of
human root causes are things like misalignment, opening the wrong valves,
running a machine beyond its design limitations, and many others.
Generally, people do not wake up in the morning and decide to make things
fail or run poorly at work. Most
workers want to keep their jobs and want to do a good job at work.
Therefore we need to dig a little deeper into why people make these
mistakes of omission or commission. These
are called latent or system roots. These
are the underlying systems that dictate how work gets done.
Examples of latent root causes are things like time pressures to perform
a repair, incorrect or nonexistent procedures, lack of knowledge or skill, and
many others. These are the real
“ROOT CAUSES” of failure. If
these underlying or latent issues are not addressed the problem will persist.
I like to make the analogy to a weed in your yard.
If you simply cut it at the stem it will grow back, but if you dig up the
underlying roots it will not reappear.
The
predictable thing about these causes is that they always come in that order
(Physical, Human and Latent). In
order to be successful, you must identify all of the causes.
It is virtually impossible to eliminate a problem if you do not
first identify the physical, human, and latent roots.
I often see analysis teams make mistakes by identifying things like poor
design or poor maintenance practices on the top of their logic tree.
These are latent issues, and you cannot accurately identify those until
you know what the physical causes are. If
it is a design issue, then you must first identify what about that design makes
the mode occur. Is it too big, too
small, too fast, or too slow?
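The ordering rule above (physical, then human, then latent) can be checked mechanically for any single branch of a logic tree. The sketch below assumes each branch is recorded as a top-down list of (level, description) pairs. That representation, the function name, and the example causes are illustrative assumptions.

```python
# Expected top-down order of cause levels in one branch.
ORDER = {"physical": 0, "human": 1, "latent": 2}


def check_chain(chain):
    """Return a list of problems found in one top-down cause chain.

    chain: list of (level, description) pairs, starting just below the mode.
    """
    problems = []
    levels = [level for level, _ in chain]

    # Every branch should reach all three levels.
    for name in ORDER:
        if name not in levels:
            problems.append(f"missing {name} cause")

    # Levels should never step back up (for example, latent before physical).
    ranks = [ORDER[level] for level in levels]
    if ranks != sorted(ranks):
        problems.append("causes out of order (expected physical, human, latent)")
    return problems


complete = [
    ("physical", "Bearing overload"),
    ("human", "Machine run beyond its design limits"),
    ("latent", "Time pressure and no operating procedure"),
]
jumped_to_latent = [
    ("latent", "Poor maintenance practices"),
    ("physical", "Corrosion"),
]

print(check_chain(complete))
print(check_chain(jumped_to_latent))
```

The second chain is the mistake described above: a latent label placed near the top of the tree before the physical mechanism is known. The check flags both the missing human cause and the reversed order.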
Once
the causes have been identified and verified, it is now time to implement
corrective actions to ensure that the problem will not continue to cause the
negative consequences. This seems
like common sense, but many analysis teams fail to convince others of the need
to make the necessary changes to alleviate the consequences of the problem.
Some things are simple and require little or no authorization from
management, while others require management support and backing.
Analysis teams need to be able to accurately communicate the business
need for the corrective actions so that management will buy in to the effort.
Many companies have formal processes for recommendations and the tracking
of those recommendations. If this is
the case at your company, then find out how the system works and make sure to
take advantage of the work process.
Last of
all, you need to monitor the success or effectiveness of the corrective actions.
For example, if you are having repetitive conveyor failures, then you
need to track the number of those events before and after the corrective actions
have been implemented to verify that they were effective at reducing the
consequences.
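The before-and-after tracking described above comes down to comparing event rates over two observation windows. The sketch below uses made-up counts and assumes both windows cover comparable production conditions. The function names are illustrative.

```python
def events_per_day(event_count, days):
    """Average number of events per day over an observation window."""
    if days <= 0:
        raise ValueError("days must be positive")
    return event_count / days


def percent_reduction(before_rate, after_rate):
    """Percentage drop in the event rate after the corrective actions."""
    if before_rate <= 0:
        raise ValueError("before_rate must be positive")
    return 100.0 * (before_rate - after_rate) / before_rate


# Hypothetical counts: 90 conveyor trips in the 30 days before the
# corrective actions, 12 trips in the 30 days after.
before = events_per_day(90, 30)
after = events_per_day(12, 30)
reduction = percent_reduction(before, after)

print(f"Before: {before:.1f}/day, after: {after:.1f}/day, "
      f"reduction: {reduction:.0f}%")
```

Using rates rather than raw counts allows windows of different lengths to be compared. A reduction near zero, or a negative one, would suggest that the verified causes were not the real ones or that the corrective actions were not fully implemented.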
Now that
we understand the basic RCA approach, let’s examine how this technique can and
should be used to address process failures.
First let me explain the difference between what I am calling process
failures and more traditional equipment failures.
Many
times RCA is only used when there is a catastrophic failure of an asset or there
is a safety or environmental incident. Although
these issues must be analyzed to ensure that they do not happen again, they are
typically sporadic in nature. Sporadic
simply meaning that they are somewhat rare events.
Rather than focus on sporadic events, I would like to focus on the more
common chronic process events. I am
not talking about the pump that might fail every 6 months due to a seal leak.
I am talking about process issues like bottling lines that jam every few
minutes in a beverage plant, web breaks in a printing press or papermaking
machine, or product quality problems due to the inability to control key process
variables. These are events that are
ongoing and are affecting the bottom line each and every day.
These
issues are somewhat easier to solve due to one very important reason.
They provide us the ability to collect lots of data about the problem.
This is unlike sporadic failures where you really only get one
opportunity to collect the failure data. If
you do not collect the data immediately, then it is impossible to recover it at
a later time. The steps for
analyzing a process failure are not really that much different than studying a
mechanical type of failure. The key
is to define the failure event and modes accurately and factually.
Let’s
examine a couple of examples to make the ideas more concrete.
I once worked with a cigarette manufacturer who was having problems with
“rod breaks”. A rod is the
actual paper and tobacco rolled into a continuous “rod” before it is
actually cut into individual cigarettes. This
event was happening many times a day on most of the cigarette making machines.
In actuality, it was happening literally hundreds of times a day.
Each event only took a few minutes to correct with only limited
interaction from the operator. When
they multiplied the frequency by the cost of the few minutes of downtime, it
ended up being a multi-million dollar problem.
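The arithmetic behind that conclusion is simple: frequency times downtime per event times the cost of downtime, summed over a year. The figures below are invented purely to show the calculation. The article does not give the actual event counts, downtime cost, or operating days.

```python
def annual_loss(events_per_day, minutes_per_event, cost_per_minute,
                operating_days=350):
    """Yearly cost of a chronic event: frequency x duration x cost rate."""
    return events_per_day * minutes_per_event * cost_per_minute * operating_days


# Hypothetical plant-wide inputs: 300 rod breaks a day, 3 minutes of
# downtime each, 10 dollars of lost production per minute, 350 days a year.
loss = annual_loss(300, 3, 10.0)
print(f"${loss:,.0f} per year")  # prints "$3,150,000 per year"
```

No single event in this example costs more than 30 dollars, which is why chronic events tend to be overlooked until someone adds them up.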
When we
began to assist in the analysis, we asked how the rod was breaking.
Since they had obviously given this issue a great deal of thought, we
figured it would be easy to get the answer to this simple question.
It turned out that they were not sure how to answer the question.
They said that it just broke. We
pressed on because we needed to know the exact “mode” of the event.
Since they were not sure how to answer, we had them do some data
preservation work to help describe the mode(s) for a broken rod.
It was determined that there were several modes, but there were two that
occurred most frequently. We called these
Jagged Edge breaks; they ran diagonally, either from NE to SW or from NW to SE.
Figure 2 – Rod Break Example
By clearly describing the failure modes in this way, we were able to more
clearly define the problem. Although
there were many ways that a rod had broken in the past, these were the most
common and the ones in most need of a solution.
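Ranking the observed modes by frequency, as was done here, is essentially a Pareto tally. The sketch below shows one way to do it. The counts, and the mode names other than the jagged edge breaks, are placeholders and not data from the actual study.

```python
from collections import Counter


def pareto(observations):
    """Rank failure modes by count, with the cumulative share of all events.

    observations: a mapping of mode -> count, or an iterable of mode names.
    Returns a list of (mode, count, cumulative_percent) rows.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    rows, running = [], 0
    for mode, n in counts.most_common():
        running += n
        rows.append((mode, n, round(100 * running / total, 1)))
    return rows


# Placeholder tallies from a hypothetical week of recorded rod breaks.
observed = {
    "jagged edge, NE to SW": 46,
    "jagged edge, NW to SE": 34,
    "clean cut": 12,
    "crushed end": 8,
}

for mode, n, cumulative in pareto(observed):
    print(f"{mode:24s} {n:4d} {cumulative:6.1f}%")
```

In this made-up tally the top two modes account for 80 percent of the events, so they would be the first modes placed on the logic tree.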
So the
logic tree definition was created to very specifically delineate the modes of a
rod break. The causes for each of
these failure modes could be similar or, as in many cases, totally different.
For this reason, it is critical to clearly define how the failure is
currently occurring.
Below
is an example of how the logic tree problem definition was created.
Figure 3 – Rod Break Logic Tree Failure Definition
Once the
top of the logic tree is defined and factual, then the process for developing
hypotheses is consistent with any other type of RCA.
Start by simply asking a series of “How Can” questions, starting on
the mode in question, and slowly work down to the physical, human and latent
roots.
Figure 4 – Sample Logic Tree
Let’s look at a few more examples of defining the problem for
process-related failures.
High Temperature Issues of Hydrocarbon in a Petrochemical Plant
A
temperature indicator identifies that the temperature on the outlet of the heat
exchanger is above the specified level. The
consequence is that the product is off-spec, and could potentially cause an
over-pressurization situation downstream.
Figure 5 – Petrochemical Product Cooling Issue
Steel Mill Example
A steel
mill is experiencing an issue where a width gauge is indicating that the width
of a roll of steel is too narrow, and does not meet customer specifications.
Figure 6 – Problem Definition for Coil Issue
Gas Plant Example
A gas
plant is cleaning gas for its downstream customers.
Operations indicate that there is a foaming situation in the amine
scrubber. The foaming is causing the
plant to have unplanned downtime, resulting in a restriction of service to its
customers.

Figure 7 – Problem Definition for Gas Plant Foaming Issues
These
are just a few examples of defining the problem for process issues.
If you can successfully define the problem, then the chances of a successful
analysis improve dramatically. I
would challenge you to go out and look for process issues (defects) at your
facility and apply these simple yet powerful techniques.
As I
mentioned earlier, in the past there has been a mindset in industry that RCA is
a tool for large catastrophic system or asset failures, or for safety and
environmental incidents. Although
these are excellent uses of RCA, that mindset misses many of the large process
opportunities that are robbing our facilities each and every day.
These
techniques, coupled with the data, knowledge, and experience of our workforce,
allow us to solve almost any problem. The
key is to make sure that you properly define the problem based on facts and not
assumptions. If you can master this
skill, then solving the problems is just a matter of data validation and
perseverance. As my son’s
soccer coach always likes to say, “Can’t is a cowardly word.”
No problem is too large to solve given the right mix of tools,
techniques, and people.