How software systems fail - Part 2a - Processes
Dr Richard Cook's 18 characteristics of complex systems failure applied to software. Part 2a covers three of the six characteristics that show how quality is lost at the process level.
Key insight
Software system failures are often more complex than they may seem. We are influenced by hindsight bias and our inclination to attribute them to a single root cause, leading us to overestimate their preventability. Developing our understanding of how these biases affect our judgement helps us create processes that limit their effect on our decision-making.
Top three takeaways
1. Attempting to achieve failure-free operation of complex software systems is futile. Failure can occur at any time, can be triggered by anyone, and is an inherent characteristic of these systems.
2. Attributing a complex system failure to a single root cause is mistaken because failure results from many smaller, interacting issues.
3. Hindsight bias causes people to believe that failures were easier to prevent than they actually were. It also stops investigators from fully understanding the events leading up to, during, and after a failure, limiting their ability to learn from it.
Some background
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering, from the product we deliver to the processes we use to build it and the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. We then use this insight to build quality in at the source.
What are products, processes and people all about?
In this series of posts, I apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex systems fail, looking specifically at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people, or the 3Ps. From What is quality engineering:
Products, Process and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product's quality attributes. However, those attributes are heavily influenced by the processes the team adopts and by their inter-team interactions. If the quality of those processes or interactions is poor, that poor quality will likely transfer to the products.
I aim to use the 3Ps to help identify the patterns that cause complex software systems to fail. By doing so, I hope to use this knowledge to increase the chances of creating more favourable conditions for building quality into complex software systems.
In Part 1 of How Software Systems Fail, we looked at six of Dr Richard Cook's 18 characteristics of how complex systems fail and how they relate to the product level of software systems. If you want some context on the original characteristics, take a look at Part 1, as this post assumes you have read it.
I had originally intended How Software Systems Fail to be a three-part series, but Part 2 turned out to be much longer than I first thought. So, to make it more consumable, I've broken Part 2 into three further parts.
Parts 2a and 2b will look at another six characteristics that help us understand how quality can be lost at the process level of socio-technical software systems. Part 2c will cover how these six characteristics can help quality engineers understand the patterns that cause quality to be lost within systems.
The three characteristics this post will look at are:
6) Catastrophe is always just around the corner.
7) Post-accident attribution to a 'root cause' is fundamentally wrong.
8) Hindsight biases post-accident assessments of human performance.
Part 2b will come out in a week and will cover:
14) Change introduces new forms of failure.
15) Views of 'cause' limit the effectiveness of defences against future events.
18) Failure free operations require experience with failure.
Note: I kept Richard's original numbering scheme for the characteristics, hence the numbers jump around. I've also included the original text for each characteristic so you can see how I interpreted them. You can find the original research paper on ResearchGate.
6. Catastrophe is always just around the corner.
Complex systems possess potential for catastrophic failure. Human practitioners are nearly always in close physical and temporal proximity to these potential failures – disaster can occur at any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system's own nature.
Delete the wrong lines of code, run the wrong command, miss a hyphen or send the wrong type of message, and suddenly, you have a system failure on your hands. Each of these failures can be caused by any engineering team member who has access to the relevant systems. Failure can occur at any time and in nearly any place. The potential for catastrophe is always just around the corner. Systems failure is a characteristic of complex software systems, and some form of failure will inevitably occur. Trying to deny, ignore, blame or avoid failure is futile.
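To see how small the triggering slip can be, here is a hypothetical sketch of a routine clean-up job. The table, column and function names are invented, and the example assumes the node-postgres client; it is an illustration, not anyone's real code.

```typescript
// Hypothetical clean-up job; every name here is illustrative only.
import { Client } from "pg";

// Intended behaviour: remove only sessions that have already expired.
export async function purgeExpiredSessions(db: Client): Promise<number> {
  const result = await db.query(
    "DELETE FROM sessions WHERE expires_at < now()"
  );
  return result.rowCount ?? 0;
}

// The failure is one small edit away: drop the WHERE clause, or mistype the
// column so the condition never applies, and the same routine job deletes
// every session and logs every user out in a single statement.
```

The code is unremarkable, which is exactly the point: the distance between normal operation and an incident is often a single line.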
7. Post-accident attribution to a 'root cause' is fundamentally wrong.
Because overt failure requires multiple faults, there is no isolated 'cause' of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the 'root cause' of an accident is possible. The evaluations based on such reasoning as 'root cause' do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localised forces or events for outcomes.
The 2016 React issue, in which engineering teams could no longer build their projects due to a deleted third-party dependency (as described in Part 1, Catastrophe requires multiple failures), is a good example of how searching for a "root cause" would be misleading.
You could say the "root cause" was the maintainer of the dependency deleting their repository. This is factually true, but what caused the maintainer to delete it? It wasn't a mistake but a deliberate action. As he explained in his blog post (no longer available), he deleted all 273 of his repositories in protest after NPM sided with a large corporation over his use of the name Kik, which clashed with that company's name. He wasn't even aware that his 11-line package was within the React dependency tree. So was the root cause NPM siding with Kik.com? Or was it Kik.com demanding rightful ownership of the Kik namespace even though someone else had got there first?
The React issue's "root cause" depends on your perspective of who is at fault. Essentially, you're looking for someone to blame, which changes the whole narrative of what caused the problem and limits our understanding of the nuances that led up to it. The reality is that all of these events are linked, and all of them needed to occur for the problem to manifest within the React dependency tree.
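To make the transitive nature of that failure concrete, here is a toy sketch. All the package names are invented stand-ins (the real chain ran through several layers of npm packages); it only shows how a build can break because of code the team never chose directly.

```typescript
// Toy model of a dependency tree; every name here is an invented stand-in.
type DependencyGraph = Record<string, string[]>;

const graph: DependencyGraph = {
  "my-app": ["some-framework"],       // the only dependency the team chose
  "some-framework": ["build-helper"],
  "build-helper": ["tiny-pad"],        // stand-in for the 11-line utility, three levels down
};

// "tiny-pad" has been unpublished, so it is no longer available to install.
const available = new Set(["my-app", "some-framework", "build-helper"]);

function findMissing(pkg: string, seen = new Set<string>()): string[] {
  if (seen.has(pkg)) return [];
  seen.add(pkg);
  const missing: string[] = [];
  for (const dep of graph[pkg] ?? []) {
    if (!available.has(dep)) missing.push(`${pkg} -> ${dep}`);
    missing.push(...findMissing(dep, seen));
  }
  return missing;
}

console.log(findMissing("my-app"));
// [ 'build-helper -> tiny-pad' ]: the install fails even though "my-app"
// never mentions tiny-pad in its own dependency list.
```

The interesting part isn't the code; it's that the team's build depended on publishing decisions, naming disputes and an unpublish action that all happened far outside their view, and only their combination produced the failure.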
8. Hindsight biases post-accident assessments of human performance.
Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. This means that ex post facto accident analysis of human performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors. It seems that practitioners "should have known" that the factors would "inevitably" lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.
It's always easier to look back and see how things connect than to look forward and see all the possible outcomes. Hindsight bias occurs because our memory becomes more efficient by discarding inaccurate information and keeping only what turned out to be correct. The problem is that, having lost the erroneous information, we come to believe the outcome was obvious and that the people involved should have known the situation would play out the way it did.
Confirmation bias compounds this: once we know the outcome of events, we focus only on the evidence that supports that outcome, neglecting the other information in the situation that may have led to different conclusions at the time.
In addition, if we were involved in the situation, the availability heuristic can come into play: we recall only the information that readily comes to mind, again biasing us toward the actual outcome.
Our biases are not inherently bad. They help us process large amounts of information quickly and efficiently, and in simple cause-and-effect situations they work quite accurately. They become problematic in complex situations, where cause and effect are unclear and many hidden variables can lead us to misdiagnose what happened.
Hindsight, confirmation and availability bias are why blameless postmortems start with getting as many of the people involved together as possible and building a timeline of events that focuses on what people did, not on who did it or why, no matter how insignificant an action seems. Ideally, when an in-production failure occurs, you want a named incident responder who starts by creating an incident response document that records every event and piece of information shared as the incident unfolds. This limits the bias that can creep in and builds empathy for those involved.
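As a minimal sketch of what such a timeline might look like if kept as structured data rather than free-form notes, here is one possible shape. The field names, timestamps and events are all invented; the only point is what the record deliberately does and does not capture.

```typescript
// A minimal, hypothetical incident timeline entry. Field names are illustrative.
interface TimelineEntry {
  at: string;      // ISO-8601 timestamp of the event
  action: string;  // what was done or observed, stated neutrally
  source: string;  // where the information came from (dashboard, log, chat)
  // Deliberately no "who" or "why" fields: the aim is to reconstruct what
  // happened before judging it, limiting hindsight and confirmation bias.
}

const timeline: TimelineEntry[] = [
  { at: "2024-03-01T09:02:00Z", action: "Deploy of release 142 completed", source: "CI pipeline" },
  { at: "2024-03-01T09:07:00Z", action: "Checkout error rate alert fired", source: "Monitoring" },
  { at: "2024-03-01T09:15:00Z", action: "Release 142 rolled back", source: "Incident channel" },
];

// Sorting by time, not by perceived importance, keeps the record close to how
// events actually unfolded rather than how they look in hindsight.
timeline.sort((a, b) => a.at.localeCompare(b.at));
```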
Next time
Part 2b will cover the three remaining characteristics that help us understand how quality can be lost at the process level. These are:
14) Change introduces new forms of failure.
15) Views of 'cause' limit the effectiveness of defences against future events.
18) Failure free operations require experience with failure.
Part 2b will be out next week, and Part 2c, due the week after, will cover what these characteristics mean and how we can use them.
If you don't want to miss the next post, subscribe to get it directly in your inbox. If you've already subscribed, thank you. Subscribing, sharing and commenting show me that you appreciate my work and keep pushing me to share more. Thank you!