How software systems fail: Part 3a - People
Dr Richard Cook's 18 characteristics of complex systems failure applied to software. Part 3a of 3 focuses on three of the six characteristics that demonstrate how quality is lost at the people level.
Key insight
Failure resolves all ambiguities within complex software systems and highlights that the actions of the people who build them are gambles, taken while holding dual roles as producers of the system and defenders against its failure.
Three key takeaways
Actions at the sharp end in complex software systems resolve all ambiguities, and unexpected events can lead to a better understanding of how failures occur.
All practitioner actions in building software systems are essentially gambles, and it's important to encourage safe-to-fail experiments and focus on learning from approaches rather than punishing failures.
It's important to view human operators as both producers and defenders against failure in software systems and to ensure collaboration rather than rivalry between different roles.
Some background
What is this series of posts about?
In 1998, Dr. Richard Cook wrote a fascinating paper called How complex systems fail, which lists 18 characteristics of what can cause catastrophic failure in large complex systems. Quality Engineers can use these characteristics to help them identify the patterns that result in quality loss at the product, process, and people layers of organisations, allowing them to support engineering teams in building quality in.
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering: from delivering the product, to the processes we use to build it, to the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. Using this insight, we then build quality in at the source.
What are products, processes and people (3Ps) all about?
In this series of posts, I plan to apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex software systems fail and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people or 3Ps. From What is quality engineering:
Products, Process and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product quality attributes. However, those attributes are highly influenced by the processes adopted by the team and their inter-team interactions. Therefore, if the quality of their processes or inter-team interactions is poor, these will likely transfer to their products.
Past posts in the series
How software systems fail: Part 1 - Products - looked at 6 characteristics that can help quality engineers understand how quality can be lost at the product layer. That post covered the background of where the 18 characteristics originated, how complex systems are inherently unpredictable, how catastrophic failure can occur when multiple defences fail, and how quality is an emergent behaviour of complex software systems.
How software systems fail - Part 2a - Processes looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how hindsight bias and our inclination to attribute failures to a single root cause lead us to overestimate their preventability.
How software systems fail - Part 2b - Processes looked at the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how change can introduce new forms of failure and why system builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
How software systems fail - Part 2c - Processes looked at how we can use the 6 characteristics from posts 2a and 2b to build quality into the process layer, such as using stories of failure to build empathy, how biases affect decision making, and why we should use game days and postmortems to become more familiar with failure.
What is this post about?
Part 3 of this series on how complex software systems fail will look at the last six remaining characteristics from Dr Richard Cook's 18 characteristics of how complex systems fail and how they relate to the people level of software systems. If you want some context on the original characteristics, then take a look at Part 1, as this post will assume you have read it.
I had originally intended How Software Systems Fail to be a three-part series. But Parts 2 and 3 were much longer posts than I first thought. So, to make Part 3 more consumable, I've broken it up into three additional parts.
Parts 3a and 3b will look at the last six characteristics that help us understand how quality can be lost at the people level of socio-technical software systems. Part 3c will cover how these six characteristics can help quality engineers understand the patterns that cause quality loss within systems.
The 3 characteristics this post will look at are:
9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.
Part 3b will look at:
12) Human practitioners are the adaptable element of complex systems.
13) Human expertise in complex systems is constantly changing.
17) People continuously create safety.
9) Human operators have dual roles: as producers & as defenders against failure.
The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. This dynamic quality of system operation, the balancing of demands for production against the possibility of incipient failure is unavoidable. Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider's view misapprehends the operator's constant, simultaneous engagement with both roles.
The production and defence roles are two frames through which practitioners (members of engineering teams) have to view the systems they work on. However, holding both frames simultaneously is probably impossible, and people are more likely to switch between the roles as they work. Within software systems, we have tended to separate these roles, with software engineers in the production role and testers in the defence role. Dividing the responsibilities like this can reduce the constant switching between roles. But the trade-off is that each role can start to see the other as slowing them down and preventing them from doing their job efficiently. It sets up an 'us and them' mentality.
Therefore, leadership needs to be vigilant that this dynamic doesn't take hold and that practitioners see themselves as collaborators, not rivals. Some solutions I've seen employed to help with this problem include giving testers and developers shared responsibilities within their job descriptions and pairing developers and testers together when working on tickets.
10) All practitioner actions are gambles.
After accidents, the overt failure often appears to have been inevitable and the practitioner's actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.
Building software systems can be highly uncertain. We don't always know if the outcomes we are trying to achieve will give us the results we want, or if the steps we take to achieve those outcomes will even work. Failure of some sort is almost inevitable. That is one of the reasons why we want to frame as much of the work teams do as safe-to-fail experiments. Framing work this way allows teams to take calculated gambles, with extra emphasis placed on their approach rather than the outcomes they achieve, because if we only reward success, we inadvertently punish failure. By focusing on safe-to-fail experiments and the strategy applied, we help practitioners learn from their approaches rather than punishing them for failing. Leaders often worry that this could lead to an anything-goes attitude, but people rarely like to fail, and as long as practitioners have a meaningful purpose to their work, they will strive towards it.
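To make the idea of a safe-to-fail experiment a little more concrete, here is a minimal sketch of one common approach: putting a risky change behind a percentage-based rollout flag, so the gamble is limited to a small slice of users and is cheap to detect and roll back. The flag name, percentage and helper functions below are hypothetical illustrations, not anything from the original post.

```typescript
// Illustrative sketch only: a percentage-based rollout flag that turns a risky
// change into a safe-to-fail experiment. The flag, percentage and helpers are
// hypothetical examples, not a specific tool's API.

interface Flag {
  name: string;
  rolloutPercent: number; // 0-100: how much traffic sees the new behaviour
}

// Deterministically bucket a user so the same user always gets the same variant.
function isEnabled(flag: Flag, userId: string): boolean {
  let bucket = 0;
  for (const char of userId) {
    bucket = (bucket * 31 + char.charCodeAt(0)) % 100;
  }
  return bucket < flag.rolloutPercent;
}

const newCheckoutFlow: Flag = { name: "new-checkout-flow", rolloutPercent: 5 };

function checkout(userId: string): void {
  if (isEnabled(newCheckoutFlow, userId)) {
    // The gamble: new behaviour, limited to 5% of users so a failure is cheap.
    newCheckout(userId);
  } else {
    currentCheckout(userId);
  }
}

function newCheckout(userId: string): void { /* experimental path */ }
function currentCheckout(userId: string): void { /* known-good path */ }

checkout("user-42");
```

The point isn't the mechanism itself; it's that the team can take the gamble, watch what happens to the small slice, and learn from the approach without the whole system riding on the outcome.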
11) Actions at the sharp end resolve all ambiguity.
Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of practitioners at the sharp end of the system. After an accident, practitioner actions may be regarded as 'errors' or 'violations' but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.
The React issue mentioned in characteristic 3 (Catastrophe requires multiple failures) and characteristic 7 (Post-accident attribution to a 'root cause' is fundamentally wrong) is an excellent example of how actions at the sharp end resolve ambiguity in complex systems. Before the incident, no one would have thought that a simple 11-line NPM package named left-pad, which added padding to text, could block massive online estates like Facebook and Netflix from updating. But then again, no one would have predicted the chain of events that led to the package being deleted either:
Kik.com, the messaging app, would ask the maintainer of the kik templating tool to rename their package,
Or how the kik templating tool's maintainer would react to the request,
Or how NPM's maintainers would resolve the dispute between Kik.com and the kik templating tool,
Or that the kik templating tool's maintainer would delete all of their packages from NPM in protest at NPM's decision to side with Kik.com,
Or that one of those deleted packages was left-pad,
Or that left-pad was included in the dependency tree for React.
When those packages were deleted, and people started updating their React projects, all those ambiguities were resolved, and (eventually) everyone understood how that chain of events came about.
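For a sense of just how small the dependency at the centre of that chain was, here is a rough sketch of what a left-pad style utility does. This is an illustrative approximation in TypeScript, not the original package's source code.

```typescript
// Illustrative approximation of a left-pad style utility, not the original package.
// It prepends a pad character to a value until the string reaches the desired length.
function leftPad(value: string | number, length: number, padChar: string = " "): string {
  let str = String(value);
  while (str.length < length) {
    str = padChar + str;
  }
  return str;
}

console.log(leftPad(7, 3, "0")); // "007"
console.log(leftPad("abc", 5));  // "  abc"
```

That something this small sat deep in the dependency trees of major frameworks is exactly the kind of ambiguity that only the incident itself resolved.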
Next time
Part 3b will continue with the three remaining characteristics that help us understand how quality can be lost at the people layer. These are:
12) Human practitioners are the adaptable element of complex systems.
13) Human expertise in complex systems is constantly changing.
17) People continuously create safety.
Part 3b will be out next week, and Part 3c, due the week after, will cover what these characteristics mean and how we can use them.
If you don't want to miss the next post, subscribe to get it directly in your inbox. If you've already subscribed, thank you. Subscribing, sharing, and commenting help me see that you appreciate my work and keep pushing me to share more. Thank you!