How software systems fail - Part 2c - Processes
Part 2c will look at how we can use 6 characteristics of how complex systems fail to improve our understanding of how quality is lost at the process level of software systems.
Key insight
Failures are bound to happen in complex software systems. Trying to avoid, ignore, or delay dealing with them, or blaming others for them, will only make the failures worse. Software engineering teams need to be familiar with process failures to identify, contain, and recover from them calmly and efficiently. Instead of viewing failures as something to be avoided at all costs, they should see them as valuable learning opportunities to gain a deeper understanding of their systems' behaviour.
Three key takeaways
1. As builders, maintainers, and operators of complex software systems, we need to become intimately familiar with failure to help diagnose when it occurs, limit its impact, and restore systems to acceptable performance standards.
2. A fixation on root causes, together with hindsight, confirmation, and availability biases, can limit investigations into failures. All of these need to be accounted for when conducting postmortems.
3. Improving future processes is not about blocking "human error" but about understanding why people did what they did and improving the system so that people can make better decisions in the future.
Some background
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering. From delivering the product to the processes we use to build it and the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. It is then, using this insight, we build quality at the source.
What are products, processes and people all about?
In this series of posts, I plan to apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex software systems fail and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people or 3Ps. From What is quality engineering:
Products, Processes and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product quality attributes. However, those attributes are highly influenced by the processes adopted by the team and their inter-team interactions. Therefore, if the quality of their processes or inter-team interactions is poor, these will likely transfer to their products.
I aim to use the 3Ps to help identify the patterns that cause complex software systems to fail. By doing so, I hope to use this knowledge to increase the chances of creating more favourable conditions for building quality into complex software systems.
Past posts in the series
How software systems fail: Part 1 - Products looked at 6 characteristics that can help quality engineers understand how quality can be lost at the product layer. This post covered the background of where the 18 characteristics originated, how complex systems are inherently unpredictable, how catastrophic failure can occur when multiple defences fail and how quality is an emergent behaviour of complex software systems.
How software systems fail - Part 2a - Processes looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. This post covered how hindsight bias and our inclination to attribute failures to a single root cause lead us to overestimate how preventable those failures were.
How software systems fail - Part 2b - Processes examined the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. This post covered how change can lead to new forms of failure and how system builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
What is this post about?
How software systems fail - Part 2c - Processes (this post) will look at how we can use the 6 characteristics from posts 2a and 2b to build quality into the process layer.
I decided to break up the process post because I felt it would be easier to read as three smaller posts, and it allowed me to include more detail than one big post would have. If combined, the process post would have been toward the top end of large (3000+ words). What would you have preferred?
I aim for medium-sized posts (this post is approx. 1400 words) as they let me include the details without taking too long to read. But my talk write-ups can be towards essay size.
What are the six characteristics that illustrate how quality can be lost at the process layer of systems?
The 6 characteristics that can help quality engineers understand how quality is lost at the process level:
6) Catastrophe is always just around the corner
7) Post-accident attribution of accidents to a 'root cause' is fundamentally wrong
8) Hindsight biases post-accident assessments of human performance
14) Change introduces new forms of failure
15) Views of 'cause' limit the effectiveness of defences against future events
18) Failure free operations require experience with failure
Note: I kept Richard's original numbering scheme for the characteristics, so the numbers jump around. If you’re looking for the original paper, you can find it at ResearchGate.
How do we lose quality at the process level?
We lose quality at the process layer of complex software systems through failures caused either by our own actions [6, 14] or by some other environmental change. Trying to avoid, deny or ignore failure, or to blame others for it, will only lead to even bigger catastrophes [6]. As builders, maintainers and operators of complex software systems, we need to become intimately familiar with failure to help diagnose when a failure occurs, limit its impact and bring systems back to acceptable performance standards [18]. However, once we have restored our systems, we need to take the time to investigate what happened and how. Investigation is not about attributing failure to a single root cause [7] but about taking the time to understand all the actions and decisions that led to the failure [8]. Then, when looking to improve future processes, it's not about blocking "human error" but about understanding why people did what they did and improving the processes so people can make better decisions in the future [15].
How do we use our understanding of how quality is lost at the process layer?
Share stories of system failures
As Quality Engineers, we need to help our engineering teams see that failure is inevitable and, in the vast majority of cases, caused unintentionally by our own actions or by some external environmental factor. One of the best ways to do this is by sharing stories of other systems' failures. For instance, how deleting a tiny package broke the internet, how NASA had to crash a probe due to a missing hyphen or how you could delete an entire company by running a command in the wrong directory. But the most compelling stories are the ones from your own team and company, as they bring home the message that any one of us can make a mistake. (See characteristics 6, 7 and 14 for more details).
Develop an understanding of how biases affect decision making
We then want to help our teams understand how some of our inbuilt decision-making heuristics can bias us towards blaming rather than understanding. For instance, we can help people see how hindsight, confirmation, and availability biases affect our judgment, again through real-life stories. (See characteristic 8 and Hindsight Bias from The Decision Lab for some examples).
Complex software systems failure is multifaceted
Another habit we have become too comfortable with is always looking for a root cause when a failure occurs. Root causes are likely to bias us towards a binary outcome, "If we had just done X, then none of this would have happened", or worse, towards blaming human error as the cause of the issue. When we do this, we're more likely to put in "end-of-the-chain" solutions that attempt to block people from doing bad things, which has the potential not only to cause more issues but also to cause quality to stagnate. (See characteristics 7 and 15 for more details).
Use game days and postmortems
Two of the best tools we have for helping organisations become more familiar with failure and better at working through it are game days and blameless postmortems. Game days help us deliberately practise and improve our skills in resolving failures, and blameless postmortems help us extract as much value as possible from the failures that occur. Postmortems help teams see where their processes had latent failures. They also help build empathy for those involved and become a source of real-life failure stories that help others learn. Quality Engineers can incorporate postmortem insights into future game days to help other teams better handle the situation and reinforce empathy for those who had to handle the problem the first time. (See characteristic 18 for more details).
Next time
Part 3 of this series will focus on the last six characteristics, which lead to quality loss at the people layer. These are:
9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.
12) Human practitioners are the adaptable element of complex systems.
13) Human expertise in complex systems is constantly changing.
17) People continuously create safety.
It will look at how people have dual roles: as producers and defenders against failure. People are the adaptable elements of complex systems, and they continuously create quality.