How software systems fail: Part 1 - Products
Dr Richard Cook's 18 characteristics of complex systems failure applied to software. Part 1 of 3 focuses on six characteristics that demonstrate how quality is lost at the product level.
Key insight
Dr Richard Cook's 18 characteristics of how complex systems fail can help quality engineers identify the patterns that result in quality being lost at the product, process and people layers of organisations, allowing them to support engineering teams to build quality in. This post looks at how quality can be lost at the product layer.
Three key takeaways
Complex software systems are inherently unpredictable. They can exhibit unintended or hidden behaviours due to their socio-technical nature and constantly changing environments.
End users typically experience software systems as intended due to the numerous defences deployed by system designers, builders, maintainers, and operators. However, catastrophic failures can occur when multiple defences fail or do not exist.
Quality in complex software systems is not just the sum of its components. It is an emergent property that requires ongoing attention throughout the system's design, construction, maintenance, and operation.
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering, from delivering the product to the processes we use to build it and the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. Using this insight, we then build quality in at the source.
This post will examine how quality is lost at the product level and how engineering teams can use this knowledge to improve quality at the source.
How Complex Systems Fail by Dr Richard Cook
In 1998, Dr Richard Cook wrote a fascinating paper called How complex systems fail, which lists 18 characteristics of what can cause catastrophic failure in large complex systems. Richard was probably not thinking only about software systems when he wrote this, but about any system that involves people, especially within the medical field - he was an anaesthesiologist, a professor, and a software engineer.
In around 2010, John Allspaw, of 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr fame (which, if you've never watched it, I highly recommend you do, as it is pretty much the birth of DevOps), came across Richard's 18 characteristics and realised how relevant they were to web infrastructure. This probably explains why Richard spoke at the Velocity Conference in 2012 about how complex systems fail. As he mentions in his talk, it's really more about how complex systems don't fail. Still, it is a very watchable talk with many fascinating points. I particularly liked the ideas of systems as imagined and systems as found.
I recommend taking the time to watch and read Richard's talk and paper, but what I would like to do is apply the lens of quality engineering to his 18 characteristics and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people (3Ps).
I aim to use the 3Ps to help identify the patterns that cause complex software systems to fail, and then to use this knowledge to create favourable conditions more often.
This post is part 1 and focuses on the six characteristics that I feel most closely align with how quality is lost at the product layer:
1) Complex systems are intrinsically hazardous systems.
2) Complex systems are heavily and successfully defended against failure.
3) Catastrophe requires multiple failures – single-point failures are not enough.
4) Complex systems contain changing mixtures of failures latent within them.
5) Complex systems run in degraded mode.
16) Safety is a characteristic of systems and not of their components.
Note: I kept Richard's original numbering scheme for the characteristics, hence the jump to 16 above. I've also included the original text for each characteristic so you can see how I interpreted them. You can find the original research paper at ResearchGate.
Six characteristics that lead to the loss of quality at the product layer
1. Complex systems are intrinsically hazardous systems.
All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by the own nature. The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. It is the presence of these hazards that drives the creation of defenses against hazard that characterise these systems.
Any interesting software system tends to be complex, meaning it is made up of many parts that interact. These interactions will produce behaviours we intended, probably some we didn't, and others we never even thought were possible. Some of those unintended behaviours may be hazardous to users. For example, data loss may be due to user error (you refreshed the web page and lost what you had typed in) or complete system failure (you enter the URL and get a 404 error).
It is the presence of these hazards that drives the creation of the defences that characterise these systems. For instance, Google Docs auto-saves as you type, without needing people to save as they go. Document history allows users to view or revert to prior edits, no longer requiring people to keep multiple backups.
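To make this kind of defence concrete, here is a minimal sketch of a debounced auto-save in TypeScript. The `/api/drafts` endpoint, the two-second delay, and the function names are all hypothetical; this is an illustration of the pattern, not how Google Docs actually implements it.

```typescript
// Minimal sketch of a debounced auto-save defence against data loss.
// `saveDraft` and the 2-second delay are illustrative assumptions.

type SaveFn = (content: string) => Promise<void>;

function createAutoSaver(saveDraft: SaveFn, delayMs = 2000) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  return (content: string) => {
    // Reset the timer on every keystroke so we only save once typing pauses.
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      saveDraft(content).catch((err) => {
        // A failed save is itself a hazard, so surface it rather than swallow it.
        console.error("Auto-save failed; will retry on the next edit", err);
      });
    }, delayMs);
  };
}

// Usage: call on every edit; the draft is persisted shortly after typing stops.
const autoSave = createAutoSaver(async (content) => {
  await fetch("/api/drafts", { method: "PUT", body: content });
});

document.querySelector("textarea")?.addEventListener("input", (e) => {
  autoSave((e.target as HTMLTextAreaElement).value);
});
```

The design choice here is that the user never has to remember to save; the defence runs continuously in the background because the hazard (losing typed work) is always present.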
2. Complex systems are heavily and successfully defended against failure
The high consequences of failure lead over time to the construction of multiple layers of defenses against failure. These defenses include obvious technical components (e.g. backup systems, 'safety' features of equipment) and human components (e.g. training, knowledge) but also a variety of organisational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.
Software systems will typically have multiple layers of defences to check for errors. Think of technical approaches, such as linters and automated unit, integration, and end-to-end tests, that execute as part of build pipelines and stop the build if they fail.
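As a rough sketch of how these automated defences gate a build, the Node/TypeScript script below runs each check in order and exits with a non-zero code on the first failure, which is the signal CI systems use to halt a build. The specific tools (ESLint, Jest, Playwright) are assumptions about a typical JavaScript project, not a prescribed setup.

```typescript
// Sketch of a pipeline gate: run each defence in order and stop the build
// on the first failure by exiting with a non-zero code.
// The commands assume a typical JavaScript project; swap in your own.
import { spawnSync } from "node:child_process";

const defences = [
  { name: "lint", command: "npx", args: ["eslint", "."] },
  { name: "unit tests", command: "npx", args: ["jest"] },
  { name: "end-to-end tests", command: "npx", args: ["playwright", "test"] },
];

for (const defence of defences) {
  console.log(`Running ${defence.name}...`);
  const result = spawnSync(defence.command, defence.args, { stdio: "inherit" });
  if (result.status !== 0) {
    // A non-zero exit code is what causes the CI system to fail the build.
    console.error(`${defence.name} failed; stopping the build.`);
    process.exit(result.status ?? 1);
  }
}

console.log("All defences passed; the build can continue.");
```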
However, human defences are also deployed: pair programming catches issues (amongst its many other benefits) as they are typed, and exploratory testing uncovers unexpected end-to-end system behaviour. Software systems also have organisational and regulatory defences, such as defined ways of working within engineering teams (e.g. Scrum or Kanban) that break work into smaller chunks and minimise the impact of failure, and ISO standards that specific industries, such as finance, must follow to prevent poor security practice. Combined, these measures divert daily operations away from accidents.
3. Catastrophe requires multiple failures – single point failures are not enough
The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents. Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners.
But what constitutes a catastrophe? There are many instances of software failures caused by small code changes. For example, in 2016 engineering teams could suddenly no longer rebuild their React projects (used by Facebook, Netflix, Atlassian, etc.) because a dependency of only 11 lines was deleted by its maintainer. Or when, in 1962, NASA's Mariner 1 probe had to be destructively aborted because a missing hyphen in its guidance system meant the team could no longer control its trajectory. These issues occur when small failures become significant problems, but are they catastrophes?
The React issue was detected almost immediately as teams tried to rebuild their projects using their build pipelines. It never had any end-user impact, and npm restored the deleted package. The Mariner 1 failure did cost NASA millions of dollars, but the fault was detected and the rocket destroyed 293 seconds into the flight, before the out-of-control probe could crash-land in a residential area and cause major injury or loss of life.
In both cases, practitioners working in their respective fields detected the failures and blocked them from having further impact. Both examples also show that multiple failures would have had to occur for them to become catastrophes.
4. Complex systems contain changing mixtures of failures latent within them.
The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organisation, and efforts to eradicate failures.
With every passing year, engineering teams can do more with less by relying on other engineers' software. These packages can come from within our organisation or, in the case of open source, from external developers. This means that, at any point, our code bases are likely to include code that our teams have not written or even seen, other than the interface it exposes. Therefore, our software systems will likely have bugs we know about (from our own code) and plenty we don't (from code from other sources).
For example, the React projects mentioned above almost certainly depend on other third-party packages that could be deleted by their maintainers at any time. Engineering teams accept this as a risk, and before 2016 most assumed it would never materialise. Third-party dependencies are just one example of a known latent failure lurking in a system. Engineering teams could put workarounds in place to stop it from becoming a reality, but that costs time and effort, and they already have systems (build pipelines) that warn them if it does become a problem.
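One such workaround is to hide small third-party packages behind a thin wrapper that the team owns, so a deleted or compromised dependency only needs replacing in one place. The sketch below uses left-pad (the 11-line package at the centre of the 2016 incident) purely as an illustration; the wrapper name and module layout are assumptions, not a recommendation of that specific package.

```typescript
// padding.ts - sketch of a thin wrapper that isolates a third-party dependency.
// The rest of the codebase imports padLeft from this module, never from the
// package directly, so if the package is ever unpublished (as left-pad was
// in 2016) only this file needs to change.
import leftPad from "left-pad"; // illustrative third-party dependency

export function padLeft(value: string, length: number, fill = " "): string {
  return leftPad(value, length, fill);
}

// If the dependency disappears, the wrapper can be swapped for a native
// implementation without touching any calling code, e.g.:
// export function padLeft(value: string, length: number, fill = " "): string {
//   return value.padStart(length, fill);
// }
```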
But how many latent failures lurk in our systems that we don't know about? They often only surface when they somehow evade our existing defences or, worse, are exploited by bad actors, as with the Log4j vulnerability.
5. Complex systems run in degraded mode.
A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws. After accident reviews nearly always note that the system has a history of prior 'proto-accidents' that nearly generated catastrophe. Arguments that these degraded conditions should have been recognised before the overt accident are usually predicated on naïve notions of system performance. System operations are dynamic, with components (organisational, human, technical) failing and being replaced continuously.
If you've worked on any large software system, you will have spotted issues present in the live environment. These known bugs are often not fixed because the impact is minimal, the reproduction steps are convoluted, or users have workarounds (assuming the engineering teams know about the issue at all). Many software systems operate in some degraded mode, but end users are typically unaware of it. When catastrophes occur in the live environment, those known limitations quickly come to light.
For instance, Public Health England lost 16,000 coronavirus cases during the COVID pandemic because it used an old Excel file format. Every time they imported test results into the old Excel template and it went over the roughly 65,000-row limit of the older .xls format (65,536 rows), the excess data was silently omitted. This was not a bug but a limitation (a feature, if you will) of older Excel file formats, probably to preserve the performance of the software. Had they used a newer Excel file format, they would not have hit this issue until they exceeded just over a million rows (1,048,576).
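A defence against this class of silent truncation is to validate the row count against the target format's documented limit before exporting, and to fail loudly rather than dropping data. Below is a minimal TypeScript sketch; the function name and usage are illustrative, while the row limits are the documented maximums for the .xls and .xlsx formats.

```typescript
// Sketch of a guard against silent data loss when exporting to Excel formats.
// Documented row limits: legacy .xls holds 65,536 rows, .xlsx holds 1,048,576.
const ROW_LIMITS = {
  xls: 65_536,
  xlsx: 1_048_576,
} as const;

function assertFitsInFormat(rows: unknown[], format: keyof typeof ROW_LIMITS): void {
  const limit = ROW_LIMITS[format];
  if (rows.length > limit) {
    // Fail loudly rather than silently truncating, which is how the
    // missing test results were lost.
    throw new Error(
      `Export of ${rows.length} rows exceeds the ${format} limit of ${limit}; ` +
        `use a different format or split the data.`
    );
  }
}

// Usage: check before handing the rows to whatever library writes the file.
// assertFitsInFormat(testResults, "xls");
```

The design choice is to make the limit explicit and the failure visible, turning a silent latent failure into one the operators can see and act on.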
16. Safety is a characteristic of systems and not of their components
Safety is an emergent property of systems; it does not reside in a person, device or department of an organisation or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. This means that safety cannot be manipulated like a feedstock or raw material. The state of safety in any system is always dynamic; continuous systemic change insures that hazard and its management are constantly changing
Safety is a quality attribute and, therefore, an emergent behaviour. From What do quality engineers do?:
One aspect of quality we often miss is that quality is an emergent system behaviour. You cannot inspect the individual sub-systems and deduce the overall quality attributes of the wider system. You have to take a holistic view of the sub-system components and their interactions to understand how the system behaves and, therefore, how its quality attributes are likely to manifest.
Safety is no different from any other quality attribute. It is not a feature (like a UI element) that engineering teams can bolt on but a characteristic that emerges from the end-to-end system. Safety must be attended to throughout all the components that make up the system, as it only takes one incident to bring the safety of the whole system into question. Just ask Boeing, whose multiple safety issues with its 787 Dreamliners have called the safety of all its aircraft into question.
In summary
Most meaningful software systems are complex because they are socio-technical systems. Therefore, predicting every possible interaction between their components and the resulting behaviour is impossible. Some of these behaviours are intended, some are not, and others we didn't even know were possible.
In addition, because these systems exist in constantly changing environments, some unintended behaviours become more likely at certain times and impossible at others, which means the systems carry hidden behaviours. As a result, complex software systems usually run in some degraded mode, with end users completely unaware.
For end users, software systems typically operate as intended thanks to the numerous defences that system designers, builders, maintainers, and operators deploy to keep them functioning and to block the majority of catastrophic failures. But every so often a significant failure slips through, and for it to reach and impact users, multiple defences must have failed or never existed.
In complex software systems, quality is not just the sum of the quality of their components. Instead, it is an emergent property that requires constant attention during the system's design, construction and maintenance, as well as during day-to-day operations.
How can we use our newfound understanding of how quality is lost in the product layer?
First, we have to go back to What do quality engineers do?:
This is where the idea of creating healthier systems comes in. We can't control or know all the variables that affect a complex socio-technical system. But we can observe the system, identify its patterns of operating and run experiments to see how we can increase the chances of the conditions we want to occur.
We can use these characteristics as templates of patterns to observe within the systems that we work in. Then, we can use those as starting points for where we should run experiments within the system to see how we can mitigate the conditions that lead to failure and engineer more of the conditions that lead to healthier socio-technical systems.
Now, spotting these patterns will be challenging at first, as you will need to train yourself to see them in and around your organisation. An excellent place to deliberately practise is by looking outside your organisation, specifically at other teams, departments, and organisations that have failed. A quick search for catastrophic software system failures brings up quite a few collated lists, and there is even a list of software bugs on Wikipedia.
The skill you need to develop is looking at failure through the lenses of these six characteristics. Start by asking what made the system complex, what defences its developers deployed to protect it from failure, and how it failed. What opportunities to block the issue were missed? Who were the people involved, and what roles did they play? What latent failures remain in the system?
With time, you'll train your brain's pattern-matching ability to spot these characteristics and become quite adept at seeing where your engineering teams can deploy new defences (for inspiration, see Building quality into products, processes and people). This will help you make your socio-technical systems healthier in the long run, rather than relying on the typical knee-jerk reactions to failure.
Next time
Part 2 of this series will focus on the characteristics that lead to quality loss at the process layer. It will look at why catastrophe is always just around the corner, why root cause analysis is wrong, the problems with hindsight bias, and how mitigating failure requires hands-on experience of failure. We will take a deeper look at the React package issue mentioned earlier and at what happened at the British Post Office that resulted in hundreds of postmasters being prosecuted due to software bugs.
If you enjoyed reading this post, please forward it to others who may find it interesting. Additionally, please sign up below to become a subscriber and receive the next issue directly in your inbox. By forwarding and signing up, you are showing your support for me and my work. Thank you!