Over the weekend, I went to the local theater to watch Sully, the movie about the US Airways pilot who successfully landed his A320 jet on an icy river after losing both engines shortly after takeoff. I enjoyed the movie for its main storyline, but also for some of the backstory elements that were scarcely covered in the media.
For as long as I can remember, I have been an aviation enthusiast, with a particular interest in aviation accidents and near misses. When my wife finds me watching yet another story of an airplane crash, she wonders aloud why someone who travels as much as I do (I typically do 60,000 miles a year in the air) would have such an interest in flight disasters. “Doesn’t it make you a more nervous flier to learn about those crashes?” she’ll often ask. My answer is this: It actually makes me a more confident air traveler, because each incident that occurs increases the body of knowledge on why planes crash and how to prevent future accidents. The more our experts know about the dangers of manned flight, the safer air travel becomes.
Studying Failures
For the same reasons, technical professionals should study disasters in their respective fields to learn what can happen and how to avoid catastrophes. Think about the types of events that constitute a technical disaster: an organization gets hacked, someone misplaces a hard drive with sensitive information, a database gets corrupted without a good backup, a data center’s backup power fails, or numerous other happenings that result in loss of data, revenue, or reputation. As awful as any of these events is, it would be even more tragic to miss the opportunity to learn exactly why and how it happened. To learn nothing from such a failure is to increase the risk of it occurring again.
Learning from failure shouldn’t be limited to the times when disaster strikes; near misses also provide an opportunity to forensically pick apart what could have been a news-making event. Imagine a scenario in which a DBA discovers that the system administrator account has a blank password, or finds a production database that has never been backed up. The first course of action is to fix the problem, but the analysis should not end there. Learning what systems or processes allowed such a condition to exist can help ensure that it never happens again.
Failure analysis, especially when the failure is within your own organization or group, takes some courage and a thick skin. Opening up systems and processes to scrutiny, particularly in the wake of a serious incident, can be painful. However, creating an environment where failures are used as learning experiences makes for a more robust infrastructure.
Studying failures, your own as well as those of others, is a valuable exercise for preventing failures and responding appropriately when they do occur. In any disaster or near miss, fix the problem first, and then focus on learning about why it happened.
This article was originally posted on Tim Mitchell’s Data Geek Newsletter.
Well said about disaster analysis. As you said, this is the correct way to build a robust process. Thanks for sharing your ideas beyond SQL.