
QUALITY BASICS

Root Cause Analysis for Beginners
by James J. Rooney and Lee N. Vanden Heuvel

In 50 Words or Less
Root cause analysis helps identify what, how and why something happened, thus preventing recurrence.
Root causes are underlying, are reasonably identifiable, can be controlled by management and allow for generation of recommendations.
The process involves data collection, cause charting, root cause identification and recommendation generation and implementation.

Root cause analysis (RCA) is a process designed for use in investigating and categorizing the root causes of events with safety, health, environmental, quality, reliability and production impacts. The term "event" is used to generically identify occurrences that produce or have the potential to produce these types of consequences.
Simply stated, RCA is a tool designed to help
identify not only what and how an event occurred,
but also why it happened. Only when investigators are able to determine why an event or failure
occurred will they be able to specify workable
corrective measures that prevent future events of
the type observed.
Understanding why an event occurred is the
key to developing effective recommendations.
Imagine an occurrence during which an operator is instructed to close valve A; instead, the
operator closes valve B. The typical investigation would probably conclude operator error
was the cause.
This is an accurate description of what happened and how it happened. However, if the analysts stop here, they have not probed deeply
enough to understand the reasons for the mistake.
Therefore, they do not know what to do to prevent it from occurring again.
In the case of the operator who turned the wrong valve, we are likely to see recommendations such as "retrain the operator on the procedure," "remind all operators to be alert when manipulating valves" or "emphasize to all personnel that careful attention to the job should be maintained at all times." Such recommendations do little to prevent future occurrences.

QUALITY PROGRESS, JULY 2004

Generally, mistakes do not just happen but can be traced to some well-defined causes. In the case of the valve error, we might ask, "Was the procedure confusing? Were the valves clearly labeled? Was the operator familiar with this particular task?"
The answers to these and other questions will help determine why the error took place and what the organization can do to prevent recurrence. In the case of the valve error, example recommendations might include revising the procedure or performing procedure validation to ensure references to valves match the valve labels found in the field.
Identifying root causes is the key to preventing
similar recurrences. An added benefit of an effective
RCA is that, over time, the root causes identified
across the population of occurrences can be used to
target major opportunities for improvement.
If, for example, a significant number of analyses
point to procurement inadequacies, then resources
can be focused on improvement of this management
system. Trending of root causes allows development
of systematic improvements and assessment of the
impact of corrective programs.
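
As a minimal sketch of this kind of trending, the following Python tallies root cause categories across a set of completed analyses. The occurrence records and category names are made up for illustration; they are not data from this article.

```python
from collections import Counter

# Hypothetical results of past analyses: each occurrence lists the
# root cause categories its investigation identified.
occurrences = [
    {"id": "A-1", "root_causes": ["procurement control", "training"]},
    {"id": "A-2", "root_causes": ["procurement control"]},
    {"id": "A-3", "root_causes": ["procedures", "procurement control"]},
    {"id": "A-4", "root_causes": ["training"]},
]


def trend_root_causes(records):
    """Count how many occurrences each root cause category appears in."""
    counts = Counter()
    for record in records:
        # set() so a category is counted once per occurrence
        counts.update(set(record["root_causes"]))
    return counts


trend = trend_root_causes(occurrences)
for category, count in trend.most_common():
    print(f"{category}: {count} of {len(occurrences)} occurrences")
```

A category that dominates the tally, such as the hypothetical procurement entries here, would be a candidate for focused improvement of the corresponding management system.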

Definition
Although there is substantial debate on the definition of root cause, we use the following:
1. Root causes are specific underlying causes.
2. Root causes are those that can reasonably be identified.
3. Root causes are those management has control to fix.
4. Root causes are those for which effective recommendations for preventing recurrences can be generated.
Root causes are underlying causes. The investigator's goal should be to identify specific underlying causes. The more specific the investigator can be about why an event occurred, the easier it will be to arrive at recommendations that will prevent recurrence.
Root causes are those that can reasonably be
identified. Occurrence investigations must be cost
beneficial. It is not practical to keep valuable manpower occupied indefinitely searching for the root
causes of occurrences. Structured RCA helps analysts get the most out of the time they have invested in the investigation.
Root causes are those over which management
has control. Analysts should avoid using general
cause classifications such as operator error, equipment failure or external factor. Such causes are not
specific enough to allow management to make
effective changes. Management needs to know
exactly why a failure occurred before action can be
taken to prevent recurrence.
We must also identify a root cause that management can influence. Identifying severe weather
as the root cause of parts not being delivered on
time to customers is not appropriate. Severe weather is not controlled by management.
Root causes are those for which effective recommendations can be generated. Recommendations should directly address the root causes identified during the investigation. If the analysts arrive at vague recommendations such as "Improve adherence to written policies and procedures," then they probably have not found a basic and specific enough cause and need to expend more effort in the analysis process.
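
The four-part definition can be read as a checklist applied to each candidate cause. The sketch below is one hypothetical way to encode it in Python. The class name, field names and the true/false judgments assigned to the two examples are assumptions made for illustration; in practice each flag is a judgment call by the analyst.

```python
from dataclasses import dataclass


@dataclass
class CandidateCause:
    description: str
    specific: bool                 # names a specific underlying cause
    reasonably_identifiable: bool  # found at reasonable investigation cost
    management_controllable: bool  # management can fix or influence it
    actionable: bool               # an effective recommendation exists


def is_root_cause(cause):
    """Return True only if all four parts of the definition hold."""
    return (cause.specific and cause.reasonably_identifiable
            and cause.management_controllable and cause.actionable)


# Illustrative judgments, not conclusions taken from the article.
weather = CandidateCause("Severe weather delayed deliveries",
                         specific=True, reasonably_identifiable=True,
                         management_controllable=False, actionable=False)
labels = CandidateCause("Procedure references do not match valve labels",
                        specific=True, reasonably_identifiable=True,
                        management_controllable=True, actionable=True)

for cause in (weather, labels):
    print(cause.description, "->", is_root_cause(cause))
```

A candidate that fails any one test, as the severe weather example does on management control, signals that the analysis should continue toward a cause the organization can act on.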

Four Major Steps


The RCA is a four-step process involving the following:
1. Data collection.
2. Causal factor charting.
3. Root cause identification.
4. Recommendation generation and implementation.

FIGURE 1: Causal Factor Chart (parts one and two)
[Chart not reproduced. It lays out the kitchen fire example as a left-to-right sequence of events, from "Mary begins frying chicken" (5:00 pm) to "Kitchen destroyed by fire." Each event is annotated with what appear to be data sources (for example, Mary, Jane, inspection tag, FD dispatcher, observation) and with open questions such as "Does Mary know how to use a fire extinguisher?" and "How long did it take for the FD to arrive?" CF = causal factor.]

Step one: data collection. The first step in the analysis is to gather data. Without complete information and an understanding of the event, the causal factors and root causes associated with the event cannot be identified. The majority of time spent analyzing an event is spent in gathering data.
Step two: causal factor charting. Causal factor charting provides a structure for investigators to organize and analyze the information gathered during the investigation and identify gaps and deficiencies in knowledge as the investigation progresses. The causal factor chart is simply a sequence diagram with logic tests that describes the events leading up to an occurrence, plus the conditions surrounding these events (see Figure 1).
Preparation of the causal factor chart should begin as soon as investigators start to collect information about the occurrence. They begin with a skeleton chart that is modified as more relevant facts are uncovered. The causal factor chart should drive the data collection process by identifying data needs.
Data collection continues until the investigators
are satisfied with the thoroughness of the chart
(and hence are satisfied with the thoroughness of
the investigation). When the entire occurrence has
been charted out, the investigators are in a good
position to identify the major contributors to the
incident, called causal factors. Causal factors are
those contributors (human errors and component
failures) that, if eliminated, would have either prevented the occurrence or reduced its severity.
In many traditional analyses, the most visible
causal factor is given all the attention. Rarely, however, is there just one causal factor; events are usually the result of a combination of contributors.
When only one obvious causal factor is addressed,
the list of recommendations will likely not be complete. Consequently, the occurrence may repeat
itself because the organization did not learn all that
it could from the event.
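
To make the structure concrete, here is a minimal Python sketch of a causal factor chart as an ordered list of events, using a few events from Figure 1. The `Event` class and its field names are hypothetical, and the choice of which event each open question attaches to is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    description: str
    source: str                   # where the data came from
    causal_factor: bool = False   # flagged by the investigators
    open_questions: list = field(default_factory=list)


# A few events from the kitchen fire example, in chart order.
chart = [
    Event("Mary begins frying chicken", "Mary"),
    Event("Mary leaves the frying chicken unattended", "Mary",
          causal_factor=True),
    Event("Electric burner shorts out", "Burner", causal_factor=True),
    Event("Mary tries to use the fire extinguisher", "Mary",
          open_questions=["Does Mary know how to use a fire extinguisher?"]),
    Event("Kitchen destroyed by fire", "Observation"),
]

# Contributors that, if eliminated, would have prevented the
# occurrence or reduced its severity.
causal_factors = [e.description for e in chart if e.causal_factor]

# Unanswered questions are the data needs that drive further collection.
data_needs = [q for e in chart for q in e.open_questions]

print(causal_factors)
print(data_needs)
```

Keeping every flagged event in the list, rather than only the most visible one, reflects the point that events usually result from a combination of contributors.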
Step three: root cause identification. After all the causal factors have been identified, the investigators begin root cause identification. This step involves the use of a decision diagram called the Root Cause Map (see Figure 2) to identify the underlying reason or reasons for each causal factor. The map structures the reasoning process of the investigators by helping them answer questions about why particular causal factors exist or occurred. The identification of root causes helps the investigator determine the reasons the event occurred so the problems surrounding the occurrence can be addressed.
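
A decision diagram of this kind can be sketched as a tree that the investigator walks one choice at a time. The Python below encodes only a tiny fragment, built from two paths that appear in Table 1; it is not the full Root Cause Map, and the nested-dictionary representation and function names are assumptions made for illustration.

```python
# A tiny, illustrative fragment of a root cause map as nested dicts.
# Leaves (empty dicts) stand for root causes.
ROOT_CAUSE_MAP = {
    "Equipment difficulty": {
        "Equipment reliability program problem": {
            "Equipment reliability program design LTA": {
                "No program": {},
            },
        },
    },
    "Personnel difficulty": {
        "Company employee": {
            "Training": {
                "Training LTA": {
                    "Abnormal events/emergency training LTA": {},
                },
            },
        },
    },
}


def follow_path(tree, path):
    """Walk the map along `path`; return the choices left at the end.

    Raises KeyError if a step is not a valid branch at that level.
    """
    node = tree
    for step in path:
        node = node[step]
    return sorted(node)


def reaches_root_cause(tree, path):
    """True if `path` ends at a leaf, i.e. no further branching."""
    return follow_path(tree, path) == []


partial = ["Personnel difficulty", "Company employee", "Training"]
print(follow_path(ROOT_CAUSE_MAP, partial))         # remaining choices
print(reaches_root_cause(ROOT_CAUSE_MAP, partial))  # not yet at a leaf
```

Stopping at an intermediate node leaves choices open, which mirrors the map's role of prompting further "why" questions until a specific root cause is reached.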
Step four: recommendation generation and implementation. The next step is the generation of recommendations. Following identification of the root causes for a particular causal factor, achievable recommendations for preventing its recurrence are then generated.
The root cause analyst is often not responsible
for the implementation of recommendations generated by the analysis. However, if the recommendations are not implemented, the effort expended in
performing the analysis is wasted. In addition, the
events that triggered the analysis should be expected to recur. Organizations need to ensure that recommendations are tracked to completion.
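
Tracking to completion can be as simple as a list with an owner, a due date and a status for each recommendation. The Python sketch below shows one hypothetical form; the owners and dates are invented for illustration, and only the recommendation wording is taken from Table 1.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Recommendation:
    text: str
    owner: str              # hypothetical responsible group
    due: date               # hypothetical due date
    completed: bool = False


def open_items(recommendations, today):
    """Return (overdue, pending) lists of recommendations not yet done."""
    overdue = [r for r in recommendations
               if not r.completed and r.due < today]
    pending = [r for r in recommendations
               if not r.completed and r.due >= today]
    return overdue, pending


tracker = [
    Recommendation("Refill the fire extinguisher", "maintenance",
                   date(2004, 7, 15), completed=True),
    Recommendation("Replace all burners on stove", "maintenance",
                   date(2004, 7, 31)),
    Recommendation("Provide hands-on fire extinguisher training", "training",
                   date(2004, 9, 30)),
]

overdue, pending = open_items(tracker, today=date(2004, 8, 15))
print("Overdue:", [r.text for r in overdue])
print("Pending:", [r.text for r in pending])
```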

Presentation of Results
Root cause summary tables (see Table 1) can organize the information compiled during data analysis, root cause identification and recommendation generation. Each column represents a major aspect of the RCA process.
In the first column, a general description of the
causal factor is presented along with sufficient
background information for the reader to be
able to understand the need to address this
causal factor.
The second column shows the Path or Paths
through the Root Cause Map associated with
the causal factor.
The third column presents recommendations
to address each of the root causes identified.
Use of this three-column format aids the investigator in ensuring root causes and recommendations are developed for each causal factor.
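
The completeness check that the three-column format supports can be sketched in a few lines of Python. The `SummaryRow` class and `incomplete_rows` function are hypothetical names; the first row uses wording from Table 1, and the second is deliberately left unfinished to show what the check reports.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryRow:
    causal_factor: str                                   # column one
    paths: list = field(default_factory=list)            # column two
    recommendations: list = field(default_factory=list)  # column three


def incomplete_rows(rows):
    """Return causal factors still missing a path or a recommendation."""
    return [r.causal_factor for r in rows
            if not r.paths or not r.recommendations]


rows = [
    SummaryRow(
        "Mary throws water on fire.",
        paths=[["Personnel difficulty", "Company employee", "Training",
                "Training LTA", "Abnormal events/emergency training LTA"]],
        recommendations=["Provide practical (hands-on) training on the "
                         "use of fire extinguishers."],
    ),
    # A row the investigators have not finished yet.
    SummaryRow("Electric burner element fails (shorts out)."),
]

print(incomplete_rows(rows))
```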
The end result of an RCA investigation is generally an investigation report. The format of the report is usually well defined by the administrative documents governing the particular reporting system, but the completed causal factor chart and root cause summary tables provide most of the information required by most reporting systems.

Example Problem
The following example is nontechnical, allowing
the reader to focus on the analysis process and not
the technical aspects of the situation. The following
narrative is the account of the event according to
Mary:
It was 5 p.m. I was frying chicken. My friend
Jane stopped by on her way home from the doctor, and she was very upset. I invited her into
the living room so we could talk. After about 10
minutes, the smoke detector near the kitchen
came on. I ran into the kitchen and found a fire
on the stove. I reached for the fire extinguisher
and pulled the plug. Nothing happened. The
fire extinguisher was not charged. In desperation, I threw water on the fire. The fire spread
throughout the kitchen. I called the fire department, but the kitchen was destroyed. The fire
department arrived in time to save the rest of
the house.

Data gathering began as soon as possible after the event to prevent loss or alteration of the data. The RCA team toured the area as soon as the fire department declared it safe. Because data from people are the most fragile, Mary, Jane and the firefighters were interviewed immediately after the fire. Photographs were taken to record physical and position data.
The analysts then developed the causal factor chart (see Figure 1) to clearly define the sequence of events that led to the fire. The causal factor chart begins with the event "Mary begins frying chicken" at 5 p.m. As the chart develops from left to right, the sequences begin to unfold. The loss events (kitchen destroyed by fire, and other losses from smoke and water damage) are the shaded rectangles in the causal factor chart.

FIGURE 2: Root Cause Map (sections one and two)
[Map not reproduced. The instruction on the map is "Start here with each causal factor." From there it branches from a primary difficulty source (equipment difficulty, personnel difficulty or other difficulty) through problem categories and root cause categories to near root causes and root causes. LTA = less than adequate; SPACs = standards, policies or administrative controls. Note: Node numbers correspond to matching page in Appendix A of the Root Cause Analysis Handbook. © 1995, 1997, 1999, 2000 and 2001, ABSG Consulting Inc.]

TABLE 1: Root Cause Summary Table

Event description: Kitchen is destroyed by fire and damaged by smoke and water.
Event #: 2003-1

Causal factor #1
Description: Mary leaves the frying chicken unattended.
Paths Through Root Cause Map:
    Personnel difficulty.
    Administrative/management systems.
    Standards, policies or administrative controls (SPACs) less than adequate (LTA).
    No SPACs.
Recommendations:
    Implement a policy that hot oil is never left unattended on the stove.
    Determine whether policies should be developed for other types of hazards in the facility to ensure they are not left unattended.
    Modify the risk assessment process or procedure development process to address requirements for personnel attendance during process operations.

Causal factor #2
Description: Electric burner element fails (shorts out).
Paths Through Root Cause Map:
    Equipment difficulty.
    Equipment reliability program problem.
    Equipment reliability program design LTA.
    No program.
Recommendations:
    Replace all burners on stove.
    Develop a preventive maintenance strategy to periodically replace the burner elements.
    Consider alternative methods for preparing chicken that may involve fewer hazards, such as baking the chicken or purchasing the finished product from a supplier.

Causal factor #3
Description: Fire extinguisher does not operate when Mary tries to use it.
Paths Through Root Cause Map (first path):
    Equipment difficulty.
    Equipment reliability program problem.
    Equipment proactive maintenance LTA.
    Activity implementation LTA.
Recommendations:
    Refill the fire extinguisher.
    Inspect other fire extinguishers in the facility to ensure they are full.
    Have incident reports describing the use of fire protection equipment routed to maintenance to trigger refilling of the fire extinguishers.
Paths Through Root Cause Map (second path):
    Equipment difficulty.
    Equipment reliability program problem.
    Administrative/management systems.
    Problem identification and control LTA.
Recommendations:
    Add this fire extinguisher to the audit list.
    Verify that all fire extinguishers are on the quarterly fire extinguisher audit list.
    Have all maintenance work requests that involve fire protection equipment routed to the safety engineer so the quarterly checklists can be modified as required.

Causal factor #4
Description: Mary throws water on fire.
Paths Through Root Cause Map:
    Personnel difficulty.
    Company employee.
    Training.
    Training LTA.
    Abnormal events/emergency training LTA.
Recommendations:
    Provide practical (hands-on) training on the use of fire extinguishers. Classroom training may be insufficient to adequately learn this skill.
    Review other skill-based activities to ensure appropriate level of hands-on training is provided.
    Review the training development process to ensure adequate guidance is provided for determining the proper training setting (for example, classroom, lab, simulator, on-the-job training, computer-based training).

Paths Through Root Cause Map is a trademark of ABSG Consulting.

Although we read the chart from left to right, it
is developed from right to left (backwards).
Development always starts at the end because that
is always a known fact. Logic and time tests are
used to build the chart back to the beginning of
the event. Numerous questions are usually generated that identify additional necessary data.
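
The article does not spell out how a time test is carried out. As one possible reading, the Python sketch below checks that recorded times never run backwards along the chart and lists events that still lack a time, which become data needs. The function name and tuple layout are assumptions, and the 5:10 p.m. alarm time is approximated from Mary's "after about 10 minutes."

```python
from datetime import time


def time_test(events):
    """Check a chart's events against a simple time test.

    `events` is a list of (description, time or None) in chart order.
    Returns (out_of_order, missing): events timed earlier than a
    preceding event, and events with no time recorded yet.
    """
    out_of_order, missing = [], []
    latest = None
    for description, when in events:
        if when is None:
            missing.append(description)
            continue
        if latest is not None and when < latest:
            out_of_order.append(description)
        else:
            latest = when
    return out_of_order, missing


events = [
    ("Mary begins frying chicken", time(17, 0)),
    ("Smoke detector alarms", time(17, 10)),
    ("Mary calls the fire department", None),  # marked "Time?" in Figure 1
    ("Fire department arrives", None),         # marked "Time?" in Figure 1
]

out_of_order, missing = time_test(events)
print("Out of order:", out_of_order)
print("Times still needed:", missing)
```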
After the causal factor chart was complete (additional data were gathered to answer the questions
shown in Figure 1), the analysts identified the factors that influenced the course of events. There are
four causal factors for this event (see Table 1).
Elimination of these causal factors would have either prevented the occurrence or reduced its severity. Note the recommendations in Table 1 are written as if Mary's house were an industrial facility.
Notice that causal factor two may be unexpected. It wasn't overheating of the oil or splattering of the oil that ignited the fire. If the wrong causal factor is identified, the wrong corrective actions will be developed.
The application of the technique identified that the electric burner element failed by shorting out. The short melted Mary's aluminum pan, releasing the oil onto the hot burner, starting the fire.
The analyst must be willing to probe the data
first to determine what happened during the occurrence, second to describe how it happened, and
third to understand why.

BIBLIOGRAPHY

Accident/Incident Investigation Manual, second edition, DOE/SSDC 76-45/27, Department of Energy.
Events and Causal Factors Charting, DOE/SSDC 76-45/14, Department of Energy, 1985.
Ferry, Ted S., Modern Accident Investigation and Analysis, second edition, John Wiley and Sons, 1988.
Guidelines for Investigating Chemical Process Incidents, American Institute of Chemical Engineers, Center for Chemical Process Safety, 1992.
Occupational Safety and Health Administration Accident Investigation Course, Office of Training and Education, 1993.
Root Cause Analysis Handbook, WSRC-IM-91-3, Department of Energy, 1991 (and earlier versions).
Root Cause Analysis Handbook: A Guide to Effective Incident Investigation, ABSG Consulting Inc., 1999.
User's Guide for Reactor Incident Root Cause Coding Tree, revision five, DPST-87-209, E.I. duPont de Nemours, Savannah River Laboratory, 1986.

JAMES J. ROONEY is a senior risk and reliability engineer with ABSG Consulting Inc.'s Risk Consulting Division in Knoxville, TN. He earned a master's degree in nuclear engineering from the University of Tennessee. Rooney is a Fellow of ASQ and an ASQ certified quality auditor, quality auditor-hazard analysis and critical control points, quality engineer, quality improvement associate, quality manager and reliability engineer.

LEE N. VANDEN HEUVEL is a senior risk and reliability engineer with ABSG Consulting Inc.'s Risk Consulting Division in Knoxville, TN. He earned a master's degree in nuclear engineering from the University of Wisconsin. Vanden Heuvel co-authored the Root Cause Analysis Handbook: A Guide to Effective Incident Investigation, co-developed the RootCause Leader software and was a co-author of the Center for Chemical Process Safety's Guidelines for Investigating Chemical Process Incidents. He develops and teaches courses on the subject.

Please comment
If you would like to comment on this article,
please post your remarks on the Quality Progress
Discussion Board at www.asq.org, or e-mail them
to [email protected].
