How software systems fail: Part 3b - People
Dr Richard Cook's 18 characteristics of complex systems failure applied to software. Part 3b focuses on three of the six characteristics that show how quality is lost at the people level.
Key insight
Humans are experts in complex software systems, and their ability to adapt keeps systems functioning.
Three main takeaways
1. Humans are the most adaptable parts of complex systems, actively working to keep them functioning through restructuring, redirecting resources, creating pathways to retreat, and detecting changes in performance.
2. Because technology keeps changing and people move on, complex software systems need mechanisms for developing the experts who can operate them effectively.
3. Human adaptability constantly creates safety within complex software systems, either through well-rehearsed routines or through new adaptations to failure.
Some background
What is this series of posts about?
In 1998, Dr. Richard Cook wrote a fascinating paper called How complex systems fail, which lists 18 characteristics of what can cause catastrophic failure in large complex systems. Quality Engineers can use these characteristics to help them identify the patterns that result in quality loss at the product, process, and people layers of organisations, allowing them to support engineering teams in building quality in.
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering. From delivering the product to the processes we use to build it and the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. It is then, using this insight, we build quality at the source.
What are products, processes and people (3Ps) all about?
In this series of posts, I plan to apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex software systems fail and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people, or the 3Ps. From What is quality engineering:
Products, Process and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product quality attributes. However, those attributes are highly influenced by the processes adopted by the team and their inter-team interactions. Therefore, if the quality of their processes or inter-team interactions is poor, these will likely transfer to their products.
Past posts in the series
How software systems fail: Part 1 - Products - looked at 6 characteristics that can help quality engineers understand how quality can be lost at the product layer. That post covered the background of where the 18 characteristics originated, how complex systems are inherently unpredictable, how catastrophic failure can occur when multiple defences fail, and how quality is an emergent behaviour of complex software systems.
How software systems fail - Part 2a - Processes looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how hindsight bias and our inclination to attribute failures to a single root cause lead us to overestimate their preventability.
How software systems fail - Part 2b - Processes looked at the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how change can introduce new forms of failure and why systems builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
How software systems fail - Part 2c - Processes looked at how we can use the 6 characteristics from posts 2a and 2b to build quality into the process layer, such as using stories of failure to build empathy, how biases affect decision-making, and why we should use game days and postmortems to become more familiar with failure.
How software systems fail - Part 3a - People looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the people layer. That post covered how actions at the sharp end of complex software systems resolve all ambiguity, how all practitioner actions in building software systems are essentially gambles, and why it's important to view human operators as both producers of and defenders against failure.
What is this post about?
Part 3 of this series on how complex software systems fail will look at the six remaining characteristics from Dr Richard Cook's 18 characteristics of how complex systems fail and how they relate to the people level of software systems. If you want some context on the original characteristics, then take a look at Part 1, as this post assumes you have read it.
I had originally intended How Software Systems Fail to be a three-part series, but Parts 2 and 3 turned out to be much longer than I first thought. So, to make Part 3 more consumable, I've broken it up into three additional parts.
Parts 3a and 3b will look at the last six characteristics that help us understand how quality can be lost at the people level of socio-technical software systems. Part 3c will cover how these six characteristics can help quality engineers understand the patterns that cause quality loss at the people level within systems.
The last three characteristics
This post (Part 3b) will look at:
12) Human practitioners are the adaptable element of complex systems.
13) Human expertise in complex systems is constantly changing.
17) People continuously create safety.
If you’re looking for the first 3 characteristics, then see How software systems fail - Part 3a - People, which covers:
9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.
12. Human practitioners are the adaptable element of complex systems.
Practitioners and first line management actively adapt the system to maximise production and minimise accidents. These adaptations often occur on a moment-by-moment basis. Some of these adaptations include: (1) Restructuring the system in order to reduce exposure of vulnerable parts to failure. (2) Concentrating critical resources in areas of expected high demand. (3) Providing pathways for retreat or recovery from expected and unexpected faults. (4) Establishing means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.
A great example of humans being the adaptable element of a complex system comes from 2015, when Paper magazine was about to release photos of a celebrity that were likely to bring far more traffic to their site than usual. To handle the two-fold increase in traffic expected over a few hours, they adapted the system in each of the four ways Cook describes:
Restructuring the system in order to reduce exposure of vulnerable parts to failure.
Instead of running the site on their single AWS instance, which could serve about 500,000 requests a month, they restructured it into four instances behind a load balancer to direct the traffic, plus a scalable file server that they hoped would handle 30 million requests over a few days.
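The original write-up doesn't include any code, so purely as an illustration of what putting several instances behind a load balancer buys you, here is a minimal, hypothetical health-aware round-robin sketch in Python (the instance names and ports are made up). Requests rotate across the pool, and a struggling instance can be dropped without taking the whole site down, which is the essence of reducing the exposure of vulnerable parts to failure.

```python
from itertools import cycle

# Hypothetical pool of application instances sitting behind the load balancer.
INSTANCES = [
    "http://app-1.internal:8080",
    "http://app-2.internal:8080",
    "http://app-3.internal:8080",
    "http://app-4.internal:8080",
]

_rotation = cycle(INSTANCES)

def pick_instance(healthy: set[str]) -> str:
    """Round-robin over the pool, skipping instances currently marked unhealthy."""
    for _ in range(len(INSTANCES)):
        candidate = next(_rotation)
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy instances available")

# Example: instance 3 has failed its health check and is out of the rotation.
healthy_instances = {INSTANCES[0], INSTANCES[1], INSTANCES[3]}
print(pick_instance(healthy_instances))  # traffic keeps flowing to the rest
```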
Concentrating critical resources in areas of expected high demand.
The Paper team realised early on that they had a story that would drive massive traffic to their site, so they contacted their infrastructure team 5 days before going live with the article to ensure the site could handle the extra load.
Providing pathways for retreat or recovery from expected and unexpected faults.
They estimated that the scaled-up infrastructure would handle around 30 million requests, but if this proved insufficient, they planned to manually increase the number of instances.
Establishing means for early detection of changed system performance
They didn't detail how they would detect whether the site was starting to slow down or fail, but given how manual the scaling approach was, they probably watched response times and social media comments.
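As a hedged illustration of what that early detection might look like (none of this is based on Paper's actual setup, and the baseline, window size and threshold are made-up values), a few lines of Python can compare recent response times against a baseline and flag when the site starts to degrade. Since the scaling was manual, the only "action" here is telling a human:

```python
from collections import deque
from statistics import median

class LatencyWatch:
    """Flag degradation when recent response times drift well above a baseline.

    Purely illustrative; the baseline, window and factor are invented numbers.
    """

    def __init__(self, baseline_ms: float, window: int = 50, factor: float = 3.0):
        self.baseline_ms = baseline_ms
        self.factor = factor
        self.samples: deque[float] = deque(maxlen=window)

    def record(self, response_ms: float) -> bool:
        """Record one sample; return True once the recent median looks degraded."""
        self.samples.append(response_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet
        return median(self.samples) > self.baseline_ms * self.factor

# Simulated measurements where latency climbs as traffic ramps up.
watch = LatencyWatch(baseline_ms=200, window=5)
for ms in [210, 190, 205, 220, 800, 900, 950, 1000, 1100]:
    if watch.record(ms):
        print(f"Response times degraded ({ms} ms): consider adding instances")
```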
Then, once the story had run its course, they took down the scaled-up infrastructure and moved back to the old instance. Anyone with a DevOps mindset is probably wondering why they didn't upgrade their infrastructure to auto-scale with demand so they weren't paying for unused capacity. Maybe it was all held together with duct tape and gum, or maybe the infrastructure team did what was asked and nothing more. Whatever the reason, Paper ceased operations in April 2023 but is back under a new owner, so who knows what infrastructure they are running now.
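For contrast, here is a hedged sketch of the kind of auto-scaling rule that DevOps mindset has in mind, again hypothetical and not tied to any particular cloud provider: track a utilisation target and grow or shrink the fleet within fixed bounds, so the site neither falls over during a spike nor keeps paying for idle instances afterwards.

```python
def desired_instances(current: int, cpu_utilisation: float,
                      target: float = 0.6, min_n: int = 1, max_n: int = 8) -> int:
    """Simple target-tracking rule: keep average CPU utilisation near `target`.

    Illustrative only; real autoscalers add cooldowns, health checks and
    smoothing on top of this basic idea.
    """
    if cpu_utilisation <= 0:
        return min_n
    # Scale the fleet in proportion to how far utilisation sits from the target.
    wanted = round(current * cpu_utilisation / target)
    return max(min_n, min(max_n, wanted))

# Quiet day: a single instance loafing along at 10% CPU stays at one instance.
print(desired_instances(current=1, cpu_utilisation=0.10))  # -> 1

# The story lands: four instances running hot at 95% CPU grow to about six.
print(desired_instances(current=4, cpu_utilisation=0.95))  # -> 6
```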
13. Human expertise in complex systems is constantly changing
Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes, but it also changes because of the need to replace experts who leave. In every case, training and refinement of skill and expertise is one part of the function of the system itself. At any moment, therefore, a given complex system will contain practitioners and trainees with varying degrees of expertise. Critical issues related to expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or demanding production needs and (2) the need to develop expertise for future use.
Any complex software system will have people with varying levels of expertise, from less experienced junior members to the most experienced seniors and principals. Those areas of expertise are constantly in flux as technology changes, e.g., from self-managed, on-site servers to cloud services. In addition, people are continually moving around: being promoted, moving to different teams or leaving the organisation. In some cases, you can hire experts to replace the people who move, but they'll still need to learn your organisation's unique context. And to ensure your existing experts remain experts, they'll need to refine their skills and abilities constantly. So another function of a complex system is to train its people and refine their skills. Issues around expertise can also arise from balancing the use of experts on the most challenging or demanding production needs against developing their skills and using them to train more junior members.
17. People continuously create safety.
Failure free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance. These activities are, for the most part, part of normal operations and superficially straightforward. But because system operations are never trouble free, human practitioner adaptations to changing conditions actually create safety from moment to moment. These adaptations often amount to just the selection of a well-rehearsed routine from a store of available responses; sometimes, however, the adaptations are novel combinations or de novo creations of new approaches.
In May 2024, I attended the Testing Peers conference, where one of the speakers, Andrew Brown, shared the story of the Gimli Glider. From my post about the conference:
A Boeing 767 ran out of fuel midway through a flight at 41,000 feet. The crew glided the plane down to a converted airfield that, at the time, was hosting a live race with hundreds of spectators. Not only did they succeed in gliding such a large plane, but there were only minor injuries to passengers and spectators, and the plane was repaired and remained in service for another 25 years.
The Gimli Glider is an excellent example of "people continuously creating safety" by adapting to multiple failures to stop the situation from turning into a catastrophic failure. In this case:
1. The fuel quantity indication system (FQIS) was defective, so it was deactivated and the fuel load had to be measured manually.
2. The fuel calculations were made using imperial measures instead of metric, so the aircraft was carrying far less fuel than the crew believed.
3. While flying at 41,000 feet, the first and then the second engine flamed out.
4. The electrics also went out, as they were powered by the engines.
5. Because most of the instrumentation in the cockpit was electronic, the crew had no way to tell how fast they were travelling.
6. The hydraulic landing gear needed power to operate, which meant the crew would be unable to deploy the plane's wheels fully.
7. There was no way they would reach their final destination with no fuel, engines or instrumentation to guide them.
The crew had to "adapt to the changing conditions" to "create safety from moment to moment". For issue 1, the fuel load could be calculated manually, but that led to issue 2 and eventually issue 3. When issue 3 occurred, the crew's first port of call was to call in the incident and check their emergency checklist for what to do when both engines were out. This is an example of "selecting a response from a set of well-rehearsed routines". Unfortunately, no one had ever thought both engines would stop working, so neither the checklist nor the pilots' training covered it.
Where the crew began to "adapt to the situation with new approaches" was that the captain was an experienced glider pilot, and the co-pilot had served at a nearby, disused air force base (RCAF Station Gimli). To overcome issues 3, 4, 5 and 7, the crew began gliding the plane towards the disused air base. But when they saw the airstrip, they realised it had been converted into a race course and an event was taking place. While attempting to land the plane, they managed to deploy the main landing gear but not the front wheels. In addition, because the engines were not running, the aircraft made almost no noise, so the spectators had little warning to flee. Issue 6 (the landing gear not being fully deployed) turned out to be an advantage: when the plane landed, the nose of the aircraft scraped along the ground, helping the plane stop sooner than it otherwise would have. In the end, only minor injuries were sustained, mainly from people trying to exit the aircraft, and the damage to the plane was repairable.
Next time
Part 3c will be out soon and will cover what these characteristics mean and how we can use them.
Don't want to miss the next post? Then subscribe to get each post delivered directly to your inbox. By subscribing, you'll be the first to know about the latest content. And if you've already subscribed, I want to emphasise how much your engagement means to me. Your subscriptions, shares and comments are not just appreciated; they are vital in shaping the future of this newsletter. Thank you!