1
FAILURE: How to Understand It, Learn from It and Recover from It
Failure and fault are virtually inseparable in households, organizations, and cultures. But we learn wisdom from failure much more than from success. Many a time we discover what works by finding out what does not; and “probably he who never made a mistake never made a discovery.”
Thomas Edison’s associate, Walter S. Mallory, while discussing inventions, once said to him, “Isn’t it a shame that with the tremendous amount of work you have done you haven’t been able to get any results?” Edison replied, with a smile, “Results! Why, man, I have gotten a lot of results! I know several thousand things that won’t work.”
People tend to see success as positive and failure as negative. Edison’s reply emphasizes that failure isn’t necessarily a bad thing: you can learn and evolve from your past mistakes. In organizations, however, executives often believe that failure is always bad. This widely held belief is misguided. Understanding failure’s causes and contexts helps to avoid the blame game and creates an atmosphere of learning in the organization. Failure is sometimes bad, sometimes inevitable, and sometimes even good. In most companies, the systems and procedures required to detect and analyze failures effectively are in short supply, and the need for context‐specific learning strategies is often underappreciated. In many organizations, managers genuinely want to learn from failures to improve future performance, and they and their teams devote many hours to after‐action reviews, post‐mortems, and the like. But time after time these painstaking efforts lead to no real change. The reason is that managers think about failure in the wrong way.
To be able to learn from our failures, we need to develop a methodology to decode the “teachable moments” hidden within them. We need to find out what exactly those lessons are and how they can improve our chances of future success.
Types of Failure
Although an infinite number of things can go wrong in machinery, systems, and processes, failures fall into three broad categories: preventable failures, unavoidable failures in complex systems, and intelligent failures.
Preventable Failures
Most failures in this category are considered “bad.” They could have been foreseen but weren’t. This is the worst kind of failure, and it usually occurs because an employee didn’t follow best practices, didn’t have the right skills, or didn’t pay attention to detail. Such failures usually involve deviations from specification in closely defined processes, or from routine operations and maintenance practices. In such cases, the causes can be readily identified and solutions developed.
If you’ve experienced a preventable failure, it’s time to analyze the effort’s weaknesses more deeply and stick to what works in the future. With proper training and support, employees can consistently follow the new processes learned from past mistakes.
Human error used to be a concern associated mainly with high‐risk industries such as aviation, rail, petrochemicals, and nuclear power. The high consequences of failure in these industries meant that companies had a real obligation to reduce the likelihood of all causes of failure. Human error is one such cause, and it is both high priority and preventable.
Unavoidable Failures in Complex Systems
In complex organizations such as aircraft carriers, nuclear power plants, and petrochemical plants, system failure is a perpetual risk. A large number of failures are due to the inherent uncertainty in how such systems work.
The lesson from this type of failure is to create systems that spot small failures resulting from complex factors, and to take corrective action before they snowball and destroy the whole system. This type of failure need not be considered bad; instead, it should prompt a review of how the complex system works. Most accidents in these systems result from a series of small failures that went unnoticed and unfortunately lined up in just the wrong way.
Complex systems are heavily and successfully defended against failure by multiple layers of defense. These defenses include obvious technical components (e.g. backup systems, “safety” features of equipment) and human components (e.g. training, knowledge), as well as a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
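To make the idea of layered defenses concrete, here is a minimal sketch of how several shields combine, and how a few small lapses can erode them. It assumes that the layers fail independently of one another, which real plants rarely satisfy exactly. The function name, the layer names, and every probability below are hypothetical values chosen for illustration only.

```python
from math import prod


def breach_probability(layer_failure_probs):
    """Return the chance that every layer fails on the same demand.

    Assumes the layers fail independently of one another (an assumption
    of this sketch, not a claim made in the text).
    """
    probs = list(layer_failure_probs)
    if not probs:
        raise ValueError("at least one defensive layer is required")
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(probs)


# Hypothetical per-demand failure probabilities for three layers.
healthy = {"backup system": 0.01, "operator training": 0.05, "procedures": 0.10}
# The same plant after small, unnoticed lapses weaken two of the layers.
degraded = {"backup system": 0.01, "operator training": 0.50, "procedures": 0.50}

p_healthy = breach_probability(healthy.values())
p_degraded = breach_probability(degraded.values())

print(f"healthy defenses:  {p_healthy:.1e}")
print(f"degraded defenses: {p_degraded:.1e}")
print(f"risk multiplier:   {p_degraded / p_healthy:.0f}x")
```

Under these assumed numbers, no single layer has failed outright, yet the combined protection is far weaker. That is one way to read the observation that accidents result from small failures lining up in just the wrong way.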
Intelligent Failures
Intelligent failures occur when answers are not known in advance because the exact situation hasn’t been encountered before, so experimentation is necessary. Examples include testing a prototype, designing a new type of machinery, or operating a machine under different operating conditions. In these settings, “trial and error” is the common term for the kind of experimentation needed. This type of failure can be considered “good,” because it provides valuable insight and new knowledge that help an organization learn and grow. The lesson here is clear: if something works, do more of it. If it doesn’t, go back to the drawing board.
Building a Learning Culture
Leaders can create and reinforce a culture in which people feel comfortable surfacing and learning from failures, and in which the blame game is avoided. When things go wrong, they should insist on finding out what happened rather than “who did it.” This requires consistently reporting failures, small and large; systematically analyzing them; and proactively taking steps to avoid recurrence.
Most organizations engage in all three kinds of work discussed above – routine, complex, and intelligent. Leaders must ensure that the right approach to learning from failure is applied in each of them. All organizations learn from failure through the following essential activities: detection, analysis, learning, and sharing.
Detecting Failure
Spotting big, painful, expensive failures is easy. But many failures stay hidden as long as they are unlikely to cause immediate or obvious harm. The goal should be to surface them early, before they can combine with other lapses in the system to create a disaster. High‐reliability‐organization (HRO) practices help prevent catastrophic failures in complex systems such as nuclear power plants and aircraft through early detection.
In a big petrochemical company, top management religiously tracks each plant for anything even slightly out of the ordinary, immediately investigates whatever turns up, and informs all its other plants of any anomalies. But such methods are often not widely employed, because senior executives remain reluctant to convey bad news to bosses and colleagues.
Analyzing Failure
Most people avoid analyzing failure altogether because it is often emotionally unpleasant and can chip away at our self‐esteem. Another reason is that analyzing organizational failures requires inquiry and openness, patience, and a tolerance for causal ambiguity. Hence, managers should be rewarded for thoughtful reflection. That is how the right culture can percolate through the organization.
Once a failure has been detected, it’s essential to find the root causes rather than relying on the obvious and superficial reasons. This requires the discipline to use sophisticated analysis to ensure that the right lessons are learned and the right remedies are employed. Engineers need to see that their organizations don’t just move on after a failure but stop to dig in and discover the wisdom contained in it.
A team of leading physicists, engineers, aviation experts, naval leaders, and even astronauts devoted months to an analysis of the Columbia disaster. They conclusively established not only the first‐order cause – a piece of foam had hit the shuttle’s leading edge during launch – but also second‐order causes: A rigid hierarchy and schedule‐obsessed culture at NASA made it especially difficult for engineers to speak up about anything but the most rock‐solid concerns.
Motivating people to go beyond first‐order reasons (procedures weren’t followed) to understanding the second‐ and third‐order reasons can be a major challenge. One way to do this is to use interdisciplinary teams with diverse skills and perspectives. Complex failures in particular are the result of multiple events that occurred in different departments or disciplines or at different levels of the organization. Understanding what happened and how to prevent it from happening again requires detailed, team‐based discussion and analysis.
Here are some common root causes and their corresponding corrective actions:
- Design deficiency caused failure → Revisit in‐service loads and environmental effects, modify design appropriately.
- Manufacturing defect caused failure → Revisit manufacturing processes (e.g. casting, forging, machining, heat treat, coating, assembly) to ensure design requirements are met.
- Material defect caused failure → Implement raw material quality control plan.
- Misuse or abuse caused failure → Educate user in proper installation, use, care, and maintenance.
- Useful life exceeded → Educate user in proper overhaul/replacement intervals.
There are various methods that failure analysts use – for example, Ishikawa “fishbone” diagrams, failure modes and effects analysis (FMEA), or fault tree analysis (FTA). Methods vary in approach, but all seek to determine the root cause of failure by looking at the characteristics and clues left behind.
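As an illustration of one of these methods, the sketch below ranks failure modes in the way a conventional FMEA worksheet often does: each mode is scored for severity, occurrence, and detection, and the product of the three scores (the risk priority number, RPN) sets the order of attention. The 1 to 10 scales are a common convention rather than something specified here, and the failure modes and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible effect) .. 10 (catastrophic effect)
    occurrence: int  # 1 (very rare) .. 10 (very frequent)
    detection: int   # 1 (almost always caught) .. 10 (almost never caught)

    def __post_init__(self):
        # Reject scores outside the assumed 1..10 rating scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical worksheet entries for a rotating machine.
modes = [
    FailureMode("seal leak", severity=7, occurrence=4, detection=3),
    FailureMode("bearing overheating", severity=8, occurrence=3, detection=6),
    FailureMode("coating wear", severity=4, occurrence=6, detection=2),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.name}: RPN = {mode.rpn}")
```

In this made-up example the bearing problem ranks first even though it occurs least often, because it is both severe and hard to detect. The RPN is only a screening aid; analysts usually review high‐severity modes regardless of their rank.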
Once the root cause of the failure has been determined, it is possible to develop a corrective action plan to prevent recurrence of the same failure mode. Understanding what caused one failure may allow us to improve upon our design process, manufacturing processes, material properties, or actual service conditions. This valuable insight may allow us to foresee and avoid potential problems before they occur in the future.
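A corrective action plan can start from something as simple as a lookup from the diagnosed root cause to its standard response. The sketch below encodes the root causes and corrective actions listed earlier in this section; the enum and function names are hypothetical, and a real plan would add owners, deadlines, and verification steps.

```python
from enum import Enum


class RootCause(Enum):
    DESIGN_DEFICIENCY = "design deficiency"
    MANUFACTURING_DEFECT = "manufacturing defect"
    MATERIAL_DEFECT = "material defect"
    MISUSE_OR_ABUSE = "misuse or abuse"
    USEFUL_LIFE_EXCEEDED = "useful life exceeded"


# Corrective actions taken from the list of common root causes above.
CORRECTIVE_ACTIONS = {
    RootCause.DESIGN_DEFICIENCY:
        "Revisit in-service loads and environmental effects; modify design appropriately.",
    RootCause.MANUFACTURING_DEFECT:
        "Revisit manufacturing processes to ensure design requirements are met.",
    RootCause.MATERIAL_DEFECT:
        "Implement raw material quality control plan.",
    RootCause.MISUSE_OR_ABUSE:
        "Educate user in proper installation, use, care, and maintenance.",
    RootCause.USEFUL_LIFE_EXCEEDED:
        "Educate user in proper overhaul/replacement intervals.",
}


def corrective_action(cause: RootCause) -> str:
    """Look up the standard corrective action for a diagnosed root cause."""
    return CORRECTIVE_ACTIONS[cause]


print(corrective_action(RootCause.MATERIAL_DEFECT))
```

Keeping the mapping in one place makes it easy to check that every recognized root cause has an agreed response, and to extend the table as new failure modes are understood.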
Sharing the Lessons
Failure is less painful when you extract the maximum value from it. If you learn from each mistake, large and small, share those lessons, and periodically check that these processes are helping your organization move more efficiently in the right direction, your return on failure will skyrocket. While it’s useful to reflect on individual failures, the real payoff comes when you spread the lessons across the organization. As one executive commented, “You need to build a review cycle where this is fed into a broader conversation.” When the information, ideas, and opportunities for improvement gained from an ...