How software systems fail - Part 3c - People
Part 3c explores how understanding the 6 characteristics of how complex systems fail can improve our awareness of how quality is lost at the people layer of software systems.
Key insight
People continuously adapt their behaviours to complex software systems, developing their expertise to keep them functioning. Each person plays a dual role as both a producer and a defender against failure, which is essential for navigating the volatility and ambiguity inherent in such systems. As a result, people's actions continuously create quality, because quality is an emergent behaviour of complex software systems.
Three takeaways
1. Human operators serve as producers of desired outcomes and defenders against unintended outcomes. External observers often overlook this dual role and fail to recognise that operators must balance production with defence against failure.
2. Practitioners in complex systems must constantly adapt to changing conditions and technologies. Expertise is developed through hands-on experience but is also subject to change due to staff turnover and technological advancements. Continuous learning and adaptability are essential for maintaining system functionality and quality.
3. Quality in complex systems is not a static attribute but an emergent property arising from the system's interactions. Practitioners can influence and improve system quality by identifying and replicating patterns that lead to successful quality outcomes.
Some background
What is this series of posts about?
In 1998, Dr. Richard Cook wrote a fascinating paper called How complex systems fail, which lists 18 characteristics of what can cause catastrophic failure in large complex systems. Quality Engineers can use these characteristics to help them identify the patterns that result in quality loss at the product, process, and people layers of organisations, allowing them to support engineering teams in building quality in.
What is quality engineering?
If you are new to the Quality Engineering Newsletter or looking for a recap of what quality engineering is all about, then check out What is quality engineering. In short:
Quality engineering is more than testing earlier in the software life cycle. It's about looking at all the facets of software engineering: from the product we deliver, to the processes we use to build it, to the people involved. It's about taking a holistic approach to quality and understanding how quality is created, maintained and lost throughout the software life cycle. Using this insight, we then build quality in at the source.
What are products, processes and people (3Ps) all about?
In this series of posts, I plan to apply the lens of quality engineering to Richard Cook's 18 characteristics of how complex software systems fail and specifically look at how quality is lost in complex software systems. To make it more consumable, I will split the 18 characteristics into three groups: products, processes and people or 3Ps. From What is quality engineering:
Products, Process and People (3Ps)
Engineering teams have the most direct influence on the quality of the products or services they provide in these three areas.
Products are typically the software that the team produces. Processes are how they create that product, and the People are the team members who will use the processes to create the products.
When we think about quality, we usually only consider the product quality attributes. However, those attributes are highly influenced by the processes adopted by the team and their inter-team interactions. Therefore, if the quality of their processes or inter-team interactions is poor, that poor quality will likely transfer to their products.
Past posts in the series
How software systems fail: Part 1 - Products - looked at the 6 characteristics that can help quality engineers understand how quality can be lost at the product layer. That post covered the background of where the 18 characteristics originated, how complex systems are inherently unpredictable, how catastrophic failure can occur when multiple defences fail, and how quality is an emergent behaviour of complex software systems.
How software systems fail - Part 2a - Processes looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how hindsight bias and our inclination to attribute failures to a single root cause lead us to overestimate their preventability.
How software systems fail - Part 2b - Processes looked at the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the process layer. That post covered how change can introduce new forms of failure and why system builders' and maintainers' experience with failure is essential to operating complex software systems and lowering the chances of catastrophic failure.
How software systems fail - Part 2c - Processes looked at how we can use the 6 characteristics from posts 2a and 2b to build quality into the process layer, such as using stories of failure to build empathy, how biases affect decision-making, and why we should use game days and postmortems to become more familiar with failure.
How software systems fail - Part 3a - People looked at the first 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the people layer. That post covered how actions at the sharp end of complex software systems resolve all ambiguity, how all practitioner actions in building software systems are essentially gambles, and why it's important to view human operators as both producers and defenders against failure.
How software systems fail: Part 3b - People looked at the last 3 (of 6) characteristics that can help quality engineers understand how quality can be lost at the people layer. That post covered how humans are the most adaptable parts of complex systems, how we need to create new experts to keep systems functioning, and how people's adaptations to complexity are what keep systems safe.
Why are these posts broken up into smaller posts?
I had originally intended How Software Systems Fail to be a three-part series. But Parts 2 and 3 turned out to be much longer than I first thought, so to make them more consumable, I've broken them up into smaller parts. Each one focuses on three characteristics (parts “a” and “b”) and a summary of how we can use the characteristics to build quality (part “c”). Note that Part 1 was kept as one long post.
What is this post about?
Part 3c of this series will cover how these six characteristics can help quality engineers understand the patterns that cause quality loss at the people layer within systems.
Six characteristics of how quality is lost at the people layer
9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.
12) Human practitioners are the adaptable element of complex systems.
13) Human expertise in complex systems is constantly changing.
17) People continuously create safety.
What can we learn about quality loss at the people layer of complex software systems?
Complex systems are often both volatile and ambiguous: ambiguous in that there can be multiple ways to solve the same problem, and volatile in that it's often unclear what results your actions will have. That volatility and ambiguity lead to two issues.
The first is that volatility means most people's actions within the system are essentially gambles [10]. For example, if we think of each of our actions and decisions as a dot in time, then when we look back, it's easy to see how all the dots connect and how we got from where we were to where we are. But when we look forward, all we can see is the many dots we could move towards, each with varying levels of uncertainty and ambiguity about whether it will lead to the desired results.
This leads to the second issue: failure resolves all ambiguity within complex software systems [11]. At the moment of failure, it becomes clear to everyone involved how the dots are connected, and the mistaken belief arises that people should have known their actions would lead to failure. However, external observers need to appreciate that before and during the crisis, the people involved probably worked with considerable uncertainty and ambiguity about what was happening, and that successful outcomes were also the result of gambles that paid off.
The gambles of the people within the system highlight the duality of their roles: as producers of the system's desired outcomes and as defenders against unintended outcomes [9]. External observers tend to miss this dual role. When things are going well, they expect people to fulfil the producer role, but the moment failure occurs, they question why the defence role wasn't prioritised.
These dual roles show how humans are the most adaptable parts of complex software systems [12]. They actively work to maximise production and minimise failures, often adjusting their actions on a moment-to-moment basis. To perform, they need knowledge of how the system fits together and works. This knowledge is often developed through hands-on experience working within complex systems. Over time, this leads to expertise in how the systems operate, allowing people to further enhance the production role and minimise failures.
However, the problem with expertise is that it is constantly changing in complex software systems [13] due to evolving technology and staff turnover, leading to a dilemma. Do you focus your experts on further increasing and maintaining production from the system, or use them to train the next batch of experts? If countermeasures to this dilemma have not been factored into the experts' roles, where will your next set of experts come from? External training, hiring external experts, and letting people learn on the job can help, but none of these solutions are perfect. We need all three to create future system experts who can keep the systems functioning and continuously create quality [17].
Complex software systems work more often than not because of the actions of the people within them, who keep the systems operating. However, complex systems are constantly changing, so people must adapt to those changing conditions to continuously create quality. This continually evolving environment, with its visible and invisible variables, leads to quality being an emergent behaviour of complex software systems.
This means quality is more than just the sum of its parts (its quality attributes). We can't create quality directly; it must emerge as an outcome of how the system operates. However, we can influence quality by identifying the patterns that lead to better system outcomes and working to recreate those patterns. Identifying and applying those patterns to different scenarios is how experts continuously create quality, and why they are best positioned to train the next set of experts who will take over once they move on.
How do we use our understanding of how quality is lost at the people layer?
Learning from failure
If a lot of our actions within complex systems are gambles, then failure is highly likely. But as people, we don't like to fail, and we will often do whatever we can to deny, ignore, avoid, or blame others for it. We need to foster a culture that allows people to fail, and to fail safely, both for themselves and for our systems. Only then can we begin to learn from failure and reframe it as something valuable rather than something that should be punished.
Learning from failure is a vast subject, so I will leave it for a future post. However, some good practices focus on reframing failure as a benefit by looking at its positive sides, such as what you now know that you didn't before. We should adopt a growth mindset, seeing ourselves as people who can improve their skills and abilities through feedback. Then, expand on that mindset with humility (you don't have all the answers), empathy (others will get things wrong) and curiosity (there is always more to learn).
Developing team members' understanding of how we make decisions
Human decision-making in complex systems isn't the best. Just take a look at characteristics 7 (post-accident attribution to a 'root cause' is fundamentally wrong) and 8 (hindsight biases post-accident assessments of human performance). But then there's also naïve realism (or the illusion of personal objectivity), where we believe that we see the world objectively and that others who don't see it the same way are uninformed, irrational, or biased in some way. Naïve realism leads to biases such as the fundamental attribution error, where we overemphasise people's personalities and underemphasise situational factors when explaining the outcomes of events.
These biases lead us to blame people for failures rather than looking at the socio-technical system in which those people had to make decisions. The problem with these biases is that they are almost built into us. Some argue that they are not biases at all but simply how we function as humans, because they affect so many people (me included). Simply learning about them is not enough. We need to set up systems and structures that nudge us to question our reasoning when examining and learning from failure, to minimise and mitigate the effects of biases on our conclusions.
Culture of learning
Developing our understanding of failure and of how biases affect our decision-making are both steps towards a culture of learning within our teams and organisations. Because the systems we work in constantly adapt to their ever-changing environments, what you know now may not be accurate in the future. So, the best way to keep up is to keep learning and updating your mental models of the systems you work in.
Our organisations need to develop learning cultures that encourage their people to learn and share what they are learning, by building in slack time so people can do this on the job. For example, encouraging your teams to create communities of practice*, guilds, or book clubs to learn socially can help people connect with others outside their immediate work circles and develop their knowledge of new ideas and concepts in safe spaces.
*If you need help selling CoPs within your organisations, check out Drew Pontikis's The Business Case for Communities of Practice.
Next and final post
I'm planning one final post in this series on how software systems fail. It will pull together all the threads across the seven posts and hopefully act as an entry point into the 18 characteristics. It will also help quality engineers leverage the patterns within the characteristics and the practices we should look to embed in our organisations.