
Utilizing RCA for Process Related Issues
SMRP Presentation – October 2006  
Ken Latino
 

Normally, when we think of Root Cause Analysis (RCA), we think of a machine that has failed.  Perhaps a drive shaft has fractured on a critical piece of equipment and we need to find out what happened.  Although this is a common use of RCA, there are other types of events that lend themselves to this type of analysis.  These include process issues, quality defects, customer complaints, and many others.

In actuality, process issues that cause daily upsets are perfect candidates for RCA.  Whether you are having trouble removing moisture from a product, shutting down a production line due to continual conveyor trips, exceeding the limits on critical process variables, or anything in between, RCA can help.  These issues tend to be very costly to a plant because they are ongoing and can cause significant losses in production and product quality and, ultimately, customer dissatisfaction.

Before we begin talking about the specifics of using RCA on process issues, it is important to define what is meant by RCA.  Although there is not an official standard for performing RCA, there are some specific elements that are needed to perform any RCA:

• A factual definition of the failure

• Data, Data, Data!!!

• A cross-functional team

• A systematic process for analyzing the data

• A mechanism for addressing the corrective actions

• A means of ensuring that the corrective actions were effective

Let’s begin by discussing these critical factors.  Most analysis teams fail because they are trying to solve a problem that does not exist.  This may seem silly, but it is surprising how often the analysis team defines the problem improperly.  Instead of focusing on the consequences and factual failure modes, they tend to focus on symptoms that often are not the real problem.  We will discuss this in more detail later in this paper, but for now just remember that you cannot solve problems that “do not exist”.  Accurately defining the problem is the key to a successful analysis.

I often correlate RCA investigations with police investigations.  I ask my clients what the first step is in a police investigation.  The answer is always to collect the evidence.  The success of a police investigation is totally based on the evidence.  While we are not looking for “whodunit” in our investigations, we still require the physical data to make our case.  Without data, we are simply guessing, using trial and error!  This is a very expensive method for solving problems.  In order to be successful, you must collect pertinent data about the issues you are studying.  For example, if we are having a problem with controlling process temperature, we will need to collect data on when the problem started, process changes, process flow diagrams, maintenance histories, and much more. 

I use a simple acronym to help collect this data.  It is easy to remember, and works for almost any type of investigation.  It is called the 5P’s, for the five categories of data that need to be collected for any analysis:

• Parts

• People

• Position

• Paper

• Paradigms

The idea is for the analysis team to discuss what data will be necessary to determine the root causes of the issue based on these 5 categories.  They would need to determine what parts would have to be analyzed to solve the problem.  Examples include failed heat exchanger tubes, broken coupling bolts, and instruments.

Positional information would include things like the time of the events.  This is particularly important when you are performing analysis on process related issues that happen repeatedly.  The idea is to determine if there is a pattern or correlation for when the events are occurring.  Other positional information is related to key process variables, location of the events, and the like.  Positional information is arguably the most important of the 5P’s, especially when dealing with process related issues.
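As a minimal sketch of looking for a time pattern in repeated events, the Python below counts events by hour of day.  The function name events_by_hour and the timestamps are hypothetical, invented only for illustration; real data would come from shift logs or a process historian.

```python
from collections import Counter
from datetime import datetime


def events_by_hour(timestamps):
    """Count events per hour of day (0-23) from ISO-format timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Hypothetical conveyor-trip times, invented for illustration only.
trip_times = [
    "2006-10-02T06:05:00",
    "2006-10-02T06:40:00",
    "2006-10-02T14:10:00",
    "2006-10-03T06:15:00",
    "2006-10-03T22:30:00",
]

counts = events_by_hour(trip_times)

# List the hours with the most events first.
for hour, n in counts.most_common():
    print(f"{hour:02d}:00  {n} event(s)")
```

A cluster of events in one hour or one shift does not prove a cause, but it gives the team a factual pattern to build hypotheses around.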

Paper information is fairly obvious, but there are a couple of items that should always be collected.  Drawings like process flow diagrams (PFD) and piping and instrumentation diagrams (P&ID) are extremely useful in any RCA.  The analysis team needs a visual reference for the problem they are studying and drawings like these usually fit the bill.  If relevant, machine or equipment cutaway drawings can also be useful.  These can usually be acquired from the vendor or the manufacturer of the equipment.  Other common paper items that should be considered are maintenance histories, shift logs, inspection reports, specifications, and a host of others.  Be careful with paper data because it can quickly be overwhelming.  If paper data is secure and available, then you can always collect it later in the analysis as you need it.  Do not collect more than you need, or it can stifle the analysis unnecessarily.

As in a police investigation, interviews with knowledgeable people are a must.  People information usually comes in the form of a discussion or interview, and should be specific to the problem at hand.  In some cases, it will be eyewitness accounts of the issue or a third party like a vendor or sister facility that might be having similar issues.  Just remember that if you are trying to get eyewitness accounts of the issue, it is best to get to the individual as quickly as possible so that their information is not lost through short term memory loss or the opinions of others.

Finally, the last “P” represents paradigms.  Paradigms are the collective opinions of others.  They typically become evident after a number of interviews with plant personnel.  They usually come in the form of statements like “It is a design problem” or “It is maintenance’s fault it continues to fail”.  Although paradigms are not facts, they are perceived as facts and can often dictate how we deal with the issue at hand.  For example, if we have a problem with leaking tubes it might be a paradigm that it is a metallurgical issue, and therefore, we have to solve the problem with different metals.  It may have nothing to do with the metallurgy, but if that is the paradigm, then the problem will persist.

We have heard about the virtues of cross-functional teams for years.  The fact is it is hard to solve problems in a vacuum.  We need the opinions and knowledge of others.  If conventional wisdom were enough, then the problems would already be solved.  The reason they persist is that the chronic causes fall outside of conventional wisdom.

I suggest having representation of operations, maintenance, and technical services involved in an analysis.  These, coupled with an unbiased facilitator, will provide the catalyst for a successful outcome.  Let me expand for a moment on the unbiased facilitator.  It is human nature to assign a complex problem to the person we feel has the most knowledge and experience.  However, this is precisely the wrong thing to do.  The facilitator should not be the expert in the particular problem, but rather an expert in RCA methodology and facilitation.  This gives them the ability to ask the tough questions since they have nothing to gain or lose from the outcome of the analysis.

Once the data has been collected and a formal team is in place, we can now begin to systematically determine the real root causes of the problem.  I want to highlight “causes” because every failure or problem has multiple contributing causes.  There are many techniques for systematically determining root causes.  Some that come to mind are fault trees, fishbone diagrams, cause and effect diagrams, and many others.  I prefer to use a logic tree for looking at process related issues.  Logic trees are similar to fault trees except that they are historical and not probabilistic.  The key to any scientific method for solving a problem is to develop hypotheses, and then use facts and data to prove or disprove the hypotheses. 

 

Logic Tree Structure

The first two levels of the logic tree define the problem.  A better way to think of the Event block is the consequence of the event.  For example, if we have repeated conveyor failures, then the consequence would be lack of feed to a downstream process.

The modes are the factual reasons for the event (consequence).  For example, the conveyor might have repeated trips, or perhaps the rollers continue to malfunction.  The key to both the mode and the event components of the logic tree is that they are facts.  The top of the logic tree must be facts, or the process will not be successful.  Remember, you cannot solve problems that do not exist!

The next step is the formulation of ideas.  These are hypotheses or “educated guesses” about how the modes have occurred in the past.  Because these hypotheses are guesses, they need to be validated to see whether they are true or not.  This step is critical to the process.  I always tell my clients that if you are not planning on verifying the hypotheses, then you should not bother performing the RCA in the first place.  Without the verification step, you are simply guessing.
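The structure described above (a factual event, factual modes, then hypotheses awaiting verification) can be sketched as a small tree in Python.  This is only an illustration under stated assumptions: the Node class, the unverified helper, and the two hypothesis labels are hypothetical, while the event and mode labels reuse the conveyor example from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    kind: str  # "event", "mode", or "hypothesis"
    verified: Optional[bool] = None  # None means not yet checked against data
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def unverified(node: Node) -> List[str]:
    """Collect labels of hypotheses that are neither proven nor disproven."""
    found = []
    if node.kind == "hypothesis" and node.verified is None:
        found.append(node.label)
    for child in node.children:
        found.extend(unverified(child))
    return found


# Event and mode are facts, so they are marked verified from the start.
event = Node("Lack of feed to downstream process", "event", verified=True)
mode = event.add(Node("Repeated conveyor trips", "mode", verified=True))

# Hypothetical hypotheses, invented for illustration only.
mode.add(Node("Motor overload", "hypothesis", verified=False))  # disproven by data
mode.add(Node("Faulty trip sensor", "hypothesis"))  # still needs verification

print(unverified(event))
```

Listing the unverified hypotheses makes it obvious when a team is about to act on a guess rather than on a validated cause.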

Verifying hypotheses is an iterative process.  Eventually, the generation of hypotheses will result in the discovery of causes.  There are three types of causes:

• Physical

• Human

• Latent

Let me explain the difference between these types of causes.  Physical root causes are the physical mechanisms that cause the modes, and ultimately, the consequences to occur.  Examples of physical causes are high vibration, overload, and corrosion, just to name a few.  The human root causes are related to human intervention: not doing something you were expected to do, or doing something that you were not supposed to do.  Examples of human root causes are things like misalignment, opening the wrong valves, running a machine beyond its design limitations, and many others.  Generally, people do not wake up in the morning and decide to make things fail or run poorly at work.  Most workers want to keep their jobs and want to do a good job at work.  Therefore we need to dig a little deeper into why people make these mistakes of omission or commission.  These deeper causes are called latent or system roots.  They are the underlying systems that dictate how work gets done.  Examples of latent root causes are things like time pressures to perform a repair, incorrect or nonexistent procedures, lack of knowledge or skill, and many others.  These are the real “ROOT CAUSES” of failure.  If these underlying or latent issues are not addressed, the problem will persist.  I like to make the analogy to a weed in your yard.  If you simply cut it at the stem it will grow back, but if you dig up the underlying roots it will not reappear.

The predictable thing about these causes is that they always come in that order (Physical, Human and Latent).  In order to be successful, you must identify all of the causes.   It is virtually impossible to eliminate a problem if you do not first identify the physical, human, and latent roots.  I often see analysis teams make mistakes by identifying things like poor design or poor maintenance practices on the top of their logic tree.  These are latent issues, and you cannot accurately identify those until you know what the physical causes are.  If it is a design issue, then you must first identify what about that design makes the mode occur.  Is it too big, too small, too fast, or too slow?

Once the causes have been identified and verified, it is now time to implement corrective actions to ensure that the problem will not continue to cause the negative consequences.  This seems like common sense, but many analysis teams fail to convince others of the need to make the necessary changes to alleviate the consequences of the problem.  Some things are simple and require little or no authorization from management, while others require management support and backing.  Analysis teams need to be able to accurately communicate the business need for the corrective actions so that management will buy in to the effort.  Many companies have formal processes for recommendations and the tracking of those recommendations.  If this is the case at your company, then find out how the system works and make sure to take advantage of the work process.

Last of all, you need to monitor the success or effectiveness of the corrective actions.  For example, if you are having repetitive conveyor failures, then you need to track the number of those events before and after the corrective actions have been implemented to verify that they were effective at reducing the consequences.
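A before-and-after comparison of this kind can be sketched in a few lines of Python.  The function names and the event counts below are hypothetical, chosen only to show the arithmetic; the comparison assumes the two observation periods are otherwise comparable (same production rate, same product mix).

```python
def events_per_day(event_count, days_observed):
    """Average number of events per day over an observation period."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    return event_count / days_observed


def percent_reduction(rate_before, rate_after):
    """Percentage drop in event rate after the corrective actions."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return 100.0 * (rate_before - rate_after) / rate_before


# Hypothetical counts, invented for illustration only.
rate_before = events_per_day(event_count=120, days_observed=30)
rate_after = events_per_day(event_count=30, days_observed=30)

print(f"Before: {rate_before:.1f} events/day")
print(f"After:  {rate_after:.1f} events/day")
print(f"Reduction: {percent_reduction(rate_before, rate_after):.0f}%")
```

Normalizing to a rate, rather than comparing raw counts, keeps the comparison fair when the two observation periods differ in length.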

Now that we understand the basic RCA approach, let’s examine how this technique can and should be used to address process failures.  First let me explain the difference between what I am calling process failures and more traditional equipment failures.

Many times RCA is only used when there is a catastrophic failure of an asset or there is a safety or environmental incident.  Although these issues must be analyzed to ensure that they do not happen again, they are typically sporadic in nature, meaning that they are somewhat rare events.  Rather than focus on sporadic events, I would like to focus on the more common chronic process events.  I am not talking about the pump that might fail every 6 months due to a seal leak.  I am talking about process issues like bottling lines that jam every few minutes in a beverage plant, web breaks in a printing press or paper making machine, or product quality problems due to the inability to control key process variables.  These are events that are ongoing and are affecting the bottom line each and every day.

These issues are somewhat easier to solve due to one very important reason.  They provide us the ability to collect lots of data about the problem.  This is unlike sporadic failures where you really only get one opportunity to collect the failure data.  If you do not collect the data immediately, then it is impossible to recover it at a later time.  The steps for analyzing a process failure are not really that much different than studying a mechanical type of failure.  The key is to define the failure event and modes accurately and factually.

Let’s examine a couple of examples to make the ideas more concrete.  I once worked with a cigarette manufacturer who was having problems with “rod breaks”.  A rod is the actual paper and tobacco rolled into a continuous “rod” before it is actually cut into individual cigarettes.  This event was happening many times a day on most of the cigarette making machines.  In actuality, it was happening literally hundreds of times a day.  Each event only took a few minutes to correct with only limited interaction from the operator.  When they added up the frequency times the cost of the few minutes of downtime, it ended up being a multi-million dollar problem. 
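The frequency-times-cost arithmetic can be sketched as follows.  Every input value here is a hypothetical placeholder, not a figure from the case above, and the function name annual_downtime_cost is likewise invented for illustration.

```python
def annual_downtime_cost(events_per_day, minutes_per_event,
                         cost_per_minute, operating_days_per_year):
    """Annual cost of a chronic event: frequency x duration x cost rate."""
    minutes_lost_per_day = events_per_day * minutes_per_event
    return minutes_lost_per_day * cost_per_minute * operating_days_per_year


# All four inputs are hypothetical placeholders, not figures from the case.
cost = annual_downtime_cost(
    events_per_day=300,
    minutes_per_event=3,
    cost_per_minute=10,
    operating_days_per_year=350,
)

print(f"Estimated annual cost: ${cost:,}")
```

The point of the calculation is that a short, easily corrected event can still add up to a large annual loss once its frequency is taken into account.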

When we began to assist in the analysis, we asked how the rod was breaking.  Since they had obviously given this issue a great deal of thought, we figured it would be easy to get the answer to this simple question.  It turned out that they were not sure how to answer the question.  They said that it just broke.  We pressed on because we needed to know the exact “mode” of the event.  Since they were not sure how to answer, we had them do some data preservation work to help describe the mode(s) for a broken rod.  It was determined that there were several modes, but there were two that occurred most frequently.  We called them Jagged Edge breaks: one ran from NE to SW, and the other from NW to SE.

 Figure 2 – Rod Break Example

 By clearly describing the failure modes in this way, we were able to more clearly define the problem.  Although there were many ways that a rod had broken in the past, these were the most common and the ones in most need of a solution.

So the logic tree definition was created to very specifically delineate the modes of a rod break.  The causes for each of these failure modes could be similar or, as in many cases, totally different.  For this reason, it is critical to clearly define how the failure is currently occurring.

Below is an example of how the logic tree problem definition was created.

 Figure 3 – Rod Break Logic Tree Failure Definition

Once the top of the logic tree is defined and factual, then the process for developing hypotheses is consistent with any other type of RCA.  Start by simply asking a series of “How Can” questions, starting on the mode in question, and slowly work down to the physical, human and latent roots.

 

Figure 4 – Sample Logic Tree

Let’s look at a few more examples of creating the problem definition for process related failures.

High Temperature Issues of Hydrocarbon in a Petrochemical Plant

A temperature indicator identifies that the temperature on the outlet of the heat exchanger is above the specified level.  The consequence is that the product is off-spec, and could potentially cause an over-pressurization situation downstream.  

 

Figure 5 – Petrochemical Product Cooling Issue

Steel Mill Example

A steel mill is experiencing an issue where a width gauge is indicating that the width of a roll of steel is too narrow, and does not meet customer specifications. 

Figure 6 – Problem Definition for Coil Issue

Gas Plant Example

A gas plant is cleaning gas for its downstream customers.  Operations indicate that there is a foaming situation in the amine scrubber.  The foaming is causing the plant to have unplanned downtime, resulting in a restriction of service to its customers. 

 

Figure 7 – Problem Definition for Gas Plant Foaming Issues

These are just a few examples of creating the problem definition for process issues.  If you can successfully define the problem, then the likelihood of a successful analysis improves dramatically.  I would challenge you to go out and look for process issues (defects) at your facility and apply these simple yet powerful techniques.

As I mentioned earlier, in the past there has been a mindset in industry that RCA is a tool for large catastrophic system or asset failures, or for safety and environmental incidents.  Although these are excellent uses of RCA, this mindset misses many of the large process opportunities that are robbing our facilities each and every day.

These techniques, coupled with the data, knowledge, and experience of our workforce, allow us to solve almost any problem.  The key is to make sure that you properly define the problem based on facts and not assumptions.  If you can master this skill, then solving the problems is just a matter of data validation and perseverance.  As my son’s soccer coach always likes to say, “can’t is a cowardly word”.  No problem is too large to solve given the right mix of tools, techniques, and people.

