How software systems fail - Part 2b - Processes
Dr Richard Cook's 18 characteristics of complex systems failure applied to software. Part 2b of 3 focuses on three of the six process characteristics, showing how quality is lost at the process level.
Key insight
The introduction of change can lead to new forms of failure, and addressing these potential failures requires the humility to accept that they could be due to the actions of the system, not those of its users. System builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
The three main takeaways
1. Change introduces new forms of failure: Being open to the fact that system builders' and maintainers' actions may have (unintentionally) introduced new failures is critical to resolving them while they are still high-frequency but low-consequence. Otherwise, you risk them building up and becoming low-frequency but high-consequence.
2. "End-of-the-chain" testing creates issues: Placing testers at the end of the development process slows it down and leads to developers using testers as a safety net instead of focusing on improving their testing processes.
3. Failure-free operations require experience with failure: Similar to how firefighters, police officers, and soldiers practice in simulated events to learn and apply their skills effectively, engineering teams should use chaos engineering and simulated failures to build resilience and learn how to respond more effectively in the future.
Some background
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering, from delivering the product to the processes we use to build it and the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. It is then, using this insight, that we build quality at the source.
What are products, processes and people all about?
In this series of posts, I plan to apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex software systems fail and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people or 3Ps. From What is quality engineering:
Products, Process and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product quality attributes. However, those attributes are highly influenced by the processes adopted by the team and their inter-team interactions. Therefore, if the quality of their processes or inter-team interactions is poor, these will likely transfer to their products.
I aim to use the 3Ps to help identify the patterns that cause complex software systems to fail. By doing so, I hope to use this knowledge to increase the chances of creating more favourable conditions for building quality into complex software systems.
Past posts in the series
How software systems fail: Part 1 - Products - looked at 6 characteristics that can help quality engineers understand how quality can be lost at the product layers. This post covers the background of where the 18 characteristics originated, how complex systems are inherently unpredictable, how catastrophic failure can occur when multiple defences fail and how quality is an emergent behaviour of complex software systems.
How software systems fail - Part 2a - Processes looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. This post covered how hindsight bias influences our post-accident assessments and how our inclination to attribute failures to a single root cause leads us to overestimate their preventability.
The processes post was broken up into 3 smaller posts as it was too long to consume in one go.
What is this post about?
This post is part of a series that examines Dr. Richard Cook's 18 characteristics of complex systems failure as applied to software.
How software systems fail - Part 2b - Processes (this post) will look at the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. This post covers how change can lead to new forms of failure, how narrow views of 'cause' limit the effectiveness of future defences, and how system builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
What is the next post?
How software systems fail - Part 2c - Processes will look at how we can use the 6 characteristics from posts 2a and 2b to build quality into the process layers.
The 3 characteristics this post will look at are:
14) Change introduces new forms of failure.
15) Views of 'cause' limit the effectiveness of defences against future events.
18) Failure free operations require experience with failure.
See the post How software systems fail - Part 2a - Processes for the first 3 characteristics:
6) Catastrophe is always just around the corner.
7) Post-accident attribution accident to a 'root cause' is fundamentally wrong.
8) Hindsight biases post-accident assessments of human performance.
Note: I kept Richard's original numbering scheme for the characteristics, hence the numbers jump around. I've also included the original text for each characteristic so you can see how I interpreted them. You can find the original research paper at ResearchGate.
14. Change introduces new forms of failure.
The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures. Not uncommonly, these new, rare catastrophes have even greater impact than those eliminated by the new technology. These new forms of failure are difficult to see before the fact; attention is paid mostly to the putative beneficial characteristics of the changes. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
Before 1996, each British Post Office branch used paper receipts to manage its tills and accounts. This was slow and error-prone but workable, or, put another way, it produced low-consequence but high-frequency failures. Looking to modernise, the Post Office introduced a new software system and tills called Horizon in 1999, developed externally by Fujitsu.
The new system would connect the entire post office network, allowing post office managers (more commonly called sub-postmasters in the UK) to manage their accounts at the click of a button, saving time and money and improving accuracy.
But the system wasn't up to scratch: it contained bugs that could produce huge discrepancies in the accounts (in one instance, a shortfall of up to £24,000 in till takings).
Due to these accounting discrepancies, the Post Office took the managers to court and prosecuted up to 700* people between 1999 and 2015. Put another way, these were low-frequency but high-consequence failures.
*To be fair, 700 is a lot of people, and in my opinion, even one was too many considering the harm this caused. But with over 6,000 sub-postmasters, that is more than 10% of the workforce.
What takes this case from a software system failure to a catastrophe is that the Post Office leadership appears to have known the system was not up to scratch. Instead of trying to correct the issue, they covered it up and aggressively prosecuted anyone who suggested the system was at fault.
The senior leadership team at the Post Office would not accept that their sophisticated software system, with all its benefits, could be the source of the accounting issues, so all wrongdoing was placed on the hundreds of post office managers.
Change not only introduced software system failures but also exposed the cultural failure at the Post Office, where leadership was always right and anyone who disagreed was wrong and punished for saying so.
Sources:
A good high level summary from the BBC on the Horizon Post Office scandal
Timeline of events from the Post Office Website about the Horizon Post Office scandal
The Yahoo timeline of the Horizon Post Office scandal, which includes details of bonuses paid to investigators to prosecute sub-postmasters
A short article from The Guardian about the Post Office Horizon scandal covering what the previous system was, the advantages Horizon would give the Post Office and some of the bugs it contained
An excellent Computer Weekly article about the Post Office Horizon scandal covering all the details, including what the Post Office did, how they behaved and how they tried to cover it up. It also covers how the sub-postmasters managed to get them to stop prosecuting. This is well worth a read.
15. Views of 'cause' limit the effectiveness of defences against future events.
Post-accident remedies for "human error" are usually predicated on obstructing activities that can "cause" accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
The most common post-accident remedy for human error I've seen in software systems is "end-of-the-chain" testing, where Testers are placed at the end of the development process to catch the bugs that Developers unintentionally introduce into the software.
While this approach does tell you what issues there are, it does nothing to lower the likelihood of future problems. This "end-of-the-chain" approach not only slows down the process—now all dev work must go through Testers—but also makes Testers gatekeepers. If something gets past the Testers, then test teams are blamed for not catching the issue, so they block the release until they've checked everything.
Requiring all work to pass through Testers has the unintended side effect of Developers doing less testing themselves and starting to use Testers as a safety net to catch their issues. I've seen instances where Developers have known about issues in their code but opted not to fix them to see if the Testers caught them. This wasn't about Developers testing the Testers but about seeing whether the issue was worth fixing: if the Testers missed the problem, what is the chance that end users would see the bug?
"End-of-the-chain" testing also slows the Developer feedback loop to help them understand if their changes produce the intended results. Slowing the feedback loop can make solving issues harder due to context switching—it's much easier to understand and fix something as you're doing the work than to stop and start.
18. Failure free operations require experience with failure.
Recognising hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the "edge of the envelope". This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered. In intrinsically hazardous systems, operators are expected to encounter and appreciate hazards in ways that lead to overall performance that is desirable. Improved safety depends on providing operators with calibrated views of the hazards. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope
Firefighters, Police officers, and Soldiers can only learn to do their jobs if they have experience with the situations they are called to handle. But learning only in real-life situations is probably too late and will likely cause more harm to themselves or the people they are trying to help. So, they practice in simulated events to learn and apply the skills they need to be effective at their jobs.
Building and maintaining complex software systems should be no different, with engineering teams using chaos engineering to experiment on their systems and build confidence that they will withstand turbulent conditions. Organisations have different terms for this, with some calling them game days, war games or fire drills, but all have similar aims: to create simulated failures (sometimes in test environments, other times in production) and see how the team handles the event, with a view to learning how they could respond more effectively in the future and exploring the boundaries of their systems' performance (the edge of the envelope). Learning this way helps make engineering teams more resilient to failure. Simulated failures are also great opportunities to practice blameless postmortems.
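As a rough illustration of what a small chaos experiment might look like, the sketch below injects artificial latency into a simulated downstream call and checks that a fallback path keeps the system inside its latency budget. The service, the cached value and the thresholds are all assumptions made for the example; real game days typically use dedicated fault-injection tooling rather than in-process sleeps.

```python
import random
import time

# Hypothetical downstream call: in a real game day this would be a network
# request to a dependency; here it is simulated so the sketch is runnable.
def fetch_exchange_rate(inject_latency_s: float = 0.0) -> float:
    time.sleep(inject_latency_s)           # simulated slow dependency
    return 1.25                            # "live" rate

CACHED_RATE = 1.20                         # last known-good value (assumed)
TIMEOUT_S = 0.2                            # assumed latency budget

def get_rate_with_fallback(inject_latency_s: float = 0.0) -> tuple[float, str]:
    """Return (rate, source), falling back to a cached rate if the call is too slow.

    Note: a real implementation would enforce the timeout (e.g. via an HTTP
    client timeout) rather than checking elapsed time after the call returns.
    """
    start = time.monotonic()
    rate = fetch_exchange_rate(inject_latency_s)
    if time.monotonic() - start > TIMEOUT_S:
        return CACHED_RATE, "cache"        # degrade gracefully
    return rate, "live"

def run_chaos_experiment(runs: int = 20, failure_rate: float = 0.5) -> None:
    """Inject latency into a fraction of calls and check the fallback engages."""
    degraded = 0
    for _ in range(runs):
        latency = 0.5 if random.random() < failure_rate else 0.0
        _, source = get_rate_with_fallback(latency)
        if source == "cache":
            degraded += 1
    print(f"{degraded}/{runs} calls served from cache under injected latency")

if __name__ == "__main__":
    run_chaos_experiment()
```

The value is less in the script itself and more in the team watching how the system behaves under the injected fault and deciding whether that behaviour is acceptable.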
Exploring the boundaries of these systems also needs tooling that allows engineering teams to monitor how systems are performing and detect potential issues, as well as tools that enable observability of those systems as they function, so teams can understand how failures and remedies affect the system. Monitoring and observability are what allow engineering teams to gain "calibrated views of the hazards" and learn how their actions "move system performance towards or away from the edge of the envelope."
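For instance, a team might expose basic request, failure and latency metrics so that their monitoring can show when an experiment pushes the service towards the edge of the envelope. The sketch below assumes the prometheus_client Python library and a hypothetical order handler; the metric names and failure rate are illustrative only.

```python
import random
import time

# Assumes the prometheus_client library (pip install prometheus-client);
# the metric names and the handler below are illustrative, not a real service.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Total order requests handled")
FAILURES = Counter("orders_failures_total", "Order requests that failed")
LATENCY = Histogram("orders_request_seconds", "Order request latency in seconds")

def handle_order() -> None:
    """Hypothetical request handler instrumented with basic rate/error/latency metrics."""
    REQUESTS.inc()
    with LATENCY.time():                       # records how long the work took
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        if random.random() < 0.05:             # simulated 5% failure rate
            FAILURES.inc()
            raise RuntimeError("order processing failed")

if __name__ == "__main__":
    start_http_server(8000)                    # metrics exposed at :8000/metrics
    while True:
        try:
            handle_order()
        except RuntimeError:
            pass                               # failures stay visible via the metrics
```

Pointing a Prometheus server at port 8000 during a game day would then let the team watch how injected failures move the error rate and latency, rather than guessing after the fact.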
Next time
Part 2c will be out next week and will cover what these characteristics mean and how we can use them.
Don't want to miss the next post? Then subscribe to get the post directly to your inbox. By subscribing, you'll be the first to know about our latest content. And if you've already subscribed, I want to emphasize how much your engagement means to me. Your subscriptions, shares, and comments are not just appreciated, they are vital in shaping the future of this newsletter. Thank you!