Quality Engineering Newsletter
Quality Engineering Newsletter Podcast

Linky #34: The system around the output

This week’s links on AI agents, evaluation, developer thriving, hype cycles, belonging, and the feedback loops around the work.

This week’s links are mostly about the system around the output.

That might be the system around AI-generated code, the system around a proactive agent, the system around a team, or the system around how people learn and make decisions.

A lot of the AI conversation still focuses on what these tools can produce. Can they write the code? Can they fix the bug? Can they generate the test? Can they speed up delivery?

Those are useful questions, but the more interesting quality engineering question is:

What gives us confidence in the output, and what system helped create it?

That includes how the work was prompted, checked, evaluated, constrained, observed and improved. It also includes the humans in the system: their agency, their learning, their confidence, their sense of belonging, and whether they feel safe enough to challenge what is happening.

Because quality does not magically appear at the end of the process. It is influenced by the conditions around the work.


Latest post from the QE Newsletter

This week I shared my takeaways from LDX3 (LeadDev), and a theme I keep seeing with the shift to AI-assisted engineering:

Increased output does not mean improved engineering judgement.

Building quality into that output is going to become even more critical. That means helping teams turn principles into behaviours, making culture visible, understanding incidents from a sociotechnical perspective, and helping teams make better decisions and learn from them.

AI might help teams build faster, but building better still depends on the human system around the work.

There is also a follow-up piece with my full notes from the talks, if you want to understand how I came to the conclusions I did, plus other insights I could not cover in the main post.


Claude’s Fable model will go to any lengths to achieve its goal

This is a fascinating story from Simon Willison about Claude’s latest Fable model debugging what looked like a simple horizontal scrollbar bug.

Simon told it to figure out why there were scrollbars on a dialog and then left it to work. Fable went through multiple steps. It fired up local dev servers, used Playwright sessions, tried to work out which browser was problematic, checked Firefox and Chrome, and eventually figured out it was Safari.

It then worked out how to take screenshots and how to get into the dialog window, came up with a fix, and tested it.

The final change was a two-line CSS fix, and the whole thing cost about $12.

As Simon says in the post, Fable will go to any lengths to achieve its goal.

That is impressive, but it is also the point where quality engineering needs to pay attention.

A proactive agent is not just a code generator. It is part of the delivery system. It can run tools, make decisions, explore paths, consume resources, and interact with the environment around it.

That means we need to think carefully about where these agents run, what they can access, what they are allowed to change, and how their behaviour is observed.
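Those questions can be made concrete as an explicit policy that every agent action is checked against before it runs. The following is a minimal Python sketch under stated assumptions: `AgentPolicy`, `AuditLog`, `is_allowed` and the tool names are all hypothetical and do not come from any real agent framework.

```python
import posixpath
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical allowlist: which tools an agent may call and where."""
    allowed_tools: frozenset
    workspace_root: PurePosixPath  # the only directory tree the agent may touch


@dataclass
class AuditLog:
    """Records every decision so the agent's behaviour can be reviewed later."""
    entries: list = field(default_factory=list)


def is_allowed(policy: AgentPolicy, log: AuditLog, tool: str, target: str) -> bool:
    """Return True only if the tool is allowlisted and the target is inside the root."""
    # Collapse ".." segments first so "/workspace/../etc" cannot slip through.
    # This is purely lexical: it does not resolve symlinks on a real filesystem.
    path = PurePosixPath(posixpath.normpath(target))
    inside_root = path == policy.workspace_root or policy.workspace_root in path.parents
    allowed = tool in policy.allowed_tools and inside_root
    log.entries.append((tool, target, allowed))  # denied attempts are logged too
    return allowed


policy = AgentPolicy(
    allowed_tools=frozenset({"read_file", "write_file"}),
    workspace_root=PurePosixPath("/workspace"),
)
log = AuditLog()
print(is_allowed(policy, log, "write_file", "/workspace/src/app.css"))
print(is_allowed(policy, log, "write_file", "/workspace/../etc/passwd"))
print(is_allowed(policy, log, "shell", "/workspace/src"))
```

The design choice worth noticing is that denied attempts are recorded as well as allowed ones, because what the agent tried to do is part of the behaviour we want to observe.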

It is best to keep these agents in sandboxes, because once an agent can act on its own, the risk is not only whether the final code is right. The risk is also what it did to get there. Via Claude Fable is relentlessly proactive


Trajectory evaluation checks for AI-generated output

Testing AI-generated code requires evaluating not just what the agent produced, but how it got there. Output evaluation checks the final artifact: does the code compile, do the tests pass? Trajectory evaluation checks the full sequence of tool calls and intermediate reasoning. Both are necessary because a fluent output that skipped its verification steps is a more dangerous failure than one with a visible error.

We have been quite fixed on checking the output of AI, but not always how it got there. This is what trajectory evaluation checks are for.

The useful bit for me is that this shifts evaluation from “does the answer look right?” to “was the process that created the answer trustworthy?”

That feels much closer to how quality works in real systems.

A piece of AI-generated code might compile, pass the visible tests, and look convincing. But if the agent skipped its verification steps, ignored failing signals, used the wrong tool, misunderstood the context, or took a risky path through the system, then the output alone does not tell us enough.
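To illustrate the difference, here is a rough Python sketch of one trajectory check: did the agent verify its work after its last change, and did it respect the result? The `Step` record, the tool names `edit_file` and `run_tests`, and the `evaluate_trajectory` function are assumptions made for this example, not part of the linked article or of any specific evaluation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded tool call in a hypothetical agent trace."""
    tool: str
    ok: bool  # whether the call succeeded (for tests: whether they passed)


def evaluate_trajectory(steps, edit_tool="edit_file", verify_tool="run_tests"):
    """Return a list of findings about the path taken, not the final artifact."""
    findings = []
    edit_positions = [i for i, step in enumerate(steps) if step.tool == edit_tool]
    if not edit_positions:
        return findings  # nothing was changed, so there is nothing to verify
    after_last_edit = steps[edit_positions[-1] + 1:]
    verifications = [step for step in after_last_edit if step.tool == verify_tool]
    if not verifications:
        findings.append("no verification after the last edit")
    elif not verifications[-1].ok:
        findings.append("final verification failed but the agent still finished")
    return findings


# The agent ran the tests, then edited again and never re-ran them.
skipped = [Step("edit_file", True), Step("run_tests", True), Step("edit_file", True)]
print(evaluate_trajectory(skipped))
```

An output-only check would treat this trace the same as one where the tests were re-run, which is exactly the gap trajectory evaluation is meant to close.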

This links back to the Fable example above.

Once agents can plan, explore, call tools and make changes, we need feedback loops around the path as well as the result.

From a QE perspective, that means building quality into the AI-assisted workflow itself. Not just reviewing the final code, but understanding the behaviours, checks, constraints and signals that led to it. Via The new SDLC with Vibe coding: From ad-hoc prompting to Agentic Engineering


I’m speaking at Agile Cambridge in September. You can use the discount code 10Jitesh to get 10% off tickets.


Beginner-friendly guide to how LLMs work

Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.

This is a great explainer of how LLMs work, written without the heavy maths, so it is easy to get your head around how these systems actually function and predict the next token.

It helped me better understand why earlier LLMs struggled with tasks like maths or counting letters.

The key thing is that models do not “see” words the way we do. They work with token IDs. So rather than actually counting letters, they are predicting what the answer should look like based on patterns they have learned.

To be fair, they still are not counting in an algorithmic sense. They have just become much better at appearing to count, either through training data or techniques layered on top. Some systems use reasoning steps or generate code, such as Python, to do the counting. But in those cases, it is the code doing the work, not the model itself.
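The gap between token IDs and letters can be shown with a toy example. This Python sketch assumes an invented two-entry vocabulary and a simple greedy longest-match splitter. Real tokenizers use learned vocabularies and different algorithms, so treat `TOY_VOCAB` and `toy_tokenize` as hypothetical illustrations only.

```python
# Invented vocabulary: the pieces and IDs are made up for illustration.
TOY_VOCAB = {"straw": 101, "berry": 202}


def toy_tokenize(text):
    """Split text into toy token IDs using greedy longest-match."""
    ids = []
    position = 0
    while position < len(text):
        # Try the longest remaining piece first, then shrink.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"no toy token covers {text[position]!r}")
    return ids


def count_letter(text, letter):
    """Deterministic counting, the kind of work generated code does for a model."""
    return text.count(letter)


print(toy_tokenize("strawberry"))       # [101, 202]
print(count_letter("strawberry", "r"))  # 3
```

Nothing in `[101, 202]` says how many times "r" appears. The model has to have learned that association from patterns, whereas the `count_letter` call actually counts.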

At the end of the day, LLMs are probabilistic systems that learn patterns and structure from data. More simply, they are pattern-matching systems. They have just become far more sophisticated at it and can use different strategies to improve their outputs.

From a QE perspective, we do not need to become LLM experts.

But a high-level understanding of how these systems work does change how we think about quality.

They are not “thinking” in the way people think. They do make mistakes, but not always in the ways you would expect. They can be fluent and wrong. They can be useful and unreliable. They can produce something that looks finished while still needing careful evaluation.

That means we need to build feedback loops around them and find better ways to evaluate both their outputs and the process that created those outputs.

Great read, well worth your time. Via Beginner-Friendly Guide to How LLMs Work


The normalisation of deviance in AI

Normalization of Deviance comes from the American sociologist Diane Vaughan, who describes it as the process in which deviance from correct or proper behavior or rule becomes culturally normalized.

The author makes a good point that all these disclaimers around LLMs, such as “AI can make mistakes” and “Double check responses”, can start to normalise the deviance around LLMs.

The risk is that the warning becomes background noise.

We know the system can hallucinate. We know it can be wrong. We know it can produce convincing nonsense. But because that behaviour is expected, we start to accept it as normal rather than treating it as a design constraint. To me, that feels risky.

I think the answer is not to pretend AI systems are more reliable than they are. It is to be realistic about their capabilities and build guardrails, control mechanisms, monitoring and observability into these systems as part of their design.

This links back to the LLM explainer above.

If we understand these systems as fallible, probabilistic and context-sensitive, then we are more likely to build the feedback loops needed to use them responsibly.

From a QE perspective, that means not treating AI failure as an unfortunate surprise. It should be something we design for. Via The Normalization of Deviance in AI


How long before the AI hype settles?

Simon Wardley thinks 3 to 4 years, or 2029 onwards:

Back around 2018/19, I had the date pegged for 2029 onwards. See the column on conversational programming or what you now call [prompt engineering | vibe coding | agentic engineering | harness engineering | spec driven development | ...].

The practices are still emerging, hence we are still coming up with names for it and have no consistent flag. I’ve seen no reason to change my predictions. We are still in the “myths” phase.

The “myths” phase is where people still believe things like we will need fewer engineers, IT budgets will reduce, we have a choice, and a few others.

I think he is right.

The naming is one clue. Prompt engineering, vibe coding, agentic engineering, harness engineering, spec-driven development. The fact we are still naming and renaming the work tells us the practices are still emerging.

In the paper linked above, both agentic engineering and harness engineering make an appearance. While the current version of agents has only just started to work, I think we have some time to go before things settle and established businesses adapt to these new ways of working.

From a QE perspective, this is where it is worth being careful.

When a practice is still emerging, people often overfit to the tool they are using today. They create strong opinions before the patterns have settled. They mistake short-term capability jumps for stable operating models.

The next few years are going to be interesting, but I think the useful work is less about predicting the exact future and more about building organisations that can keep learning as the practices change. Via Why are you so negative about AI?


Four factors of thriving developers

Cat Hicks is back. While she is promoting her book, The Psychology of Software Teams, she has been sharing some great insights from it.

In my research with 1,282 developers, four factors emerged as core to what I call Developer Thriving. While many factors matter across our psychology, I’ve found that diving into these four areas gives us illuminating beacons, pointing us toward noticing the difference between developers who are genuinely flourishing at work and those who are just grinding through it in cycles of “brittle productivity,” easily broken. [...] the four factors we chose [...] and tested [...] are:

  • Learning culture,
  • Agency,
  • Belonging, and
  • Self-efficacy.

Learning culture is whether developers think the team actually invests in skill growth, or whether learning is something you are expected to do in your own time.

Agency is whether developers think their judgement influences outcomes, or whether they feel they do not have a voice.

Belonging is whether people feel their value counts here, and whether someone like them will be a valued member of the group.

Self-efficacy is whether people believe they can tolerate and overcome challenges, or whether struggling is seen as a sign of failure.

These four factors feel deeply interconnected.

If your belonging is low, because you do not feel valued, it is easy to see how that has a knock-on effect on agency. You are less likely to share your perspective. That can then affect self-efficacy too, because if you do not feel valued, failure can start to feel unsafe rather than useful.

Some organisations might look at this and think the quick fix is more personal development time. That helps with learning culture and might have positive knock-on effects in the other areas too.

Personal development does matter. It shows the organisation is willing to invest in people.

But development time on its own is not enough.

Without psychological safety, people may learn new things but still not feel able to use them, question existing approaches, admit uncertainty, or take the interpersonal risks needed to improve the work.

That is the QE connection for me.

Quality does not only depend on skills. It depends on whether people are in an environment where those skills can be used. Via Four factors of thriving engineers


Why do we like personality tests?

Despite the joy of being compared to celebrities, psychologists have repeatedly argued that the Myers-Briggs has dubious predictive ability and is grounded in debunked theory. To make matters worse, it’s unreliable. Which means that if you take the test more than once to learn more about your “true self”, it’s quite likely to give you different answers each time.

Things like Myers-Briggs and Gallup’s CliftonStrengths are popular in many organisations.

I can understand why. They give people a simple language to describe themselves. They put people into a category that is easy to share. They make it easier to find others who seem similar to us. There is something comforting about that.

But the problem is that these categories can carry more weight than the evidence supports.

They can also create in-group and out-group dynamics based on unreliable labels. People start to explain behaviour through a type, colour, profile or category, rather than looking at the actual system someone is working in.

That is where this links back to Cat Hicks’ post.

If what we are really trying to create is belonging, then we should be careful about using weak categories to do that work. Belonging built around personality labels can be brittle. It can make people feel seen, but it can also box them in.

From a QE perspective, the wider point is about measurement and sense-making.

Bad models can still be useful socially, which is part of why they spread. But if we use them to make decisions about people, teams or work, then we need to be much more careful.

If you do want to measure personality to help people better understand themselves, then the Big Five Personality Traits, or OCEAN model, has a stronger evidence base: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism.

You can take the free Big Five test here: IPIP-NEO personality test

Via Why are we suckers for Astrology, the Myers-Briggs, and other pseudoscientific personality tests?


Closing thought

The thread running through these links is that quality depends on the system around the work.

For AI, that means understanding how the model works, what the agent is allowed to do, how we evaluate the path it took, and what guardrails exist around it.

For teams, it means understanding the learning culture, agency, belonging and confidence that shape how people contribute.

For organisations, it means being careful with hype, measurement, categories and incentives.

The tools are changing quickly.

But the quality question is still the same as it’s been for decades:

What feedback do we have, what assumptions are we making, and what conditions are shaping the outcome?

Past Linkys

Linky #33: The stewardship problem (Jun 14)

This week’s links made me think less about whether AI can help us create more software, and more about what happens after that software exists.

Linky #32 - The practice is not the principle (May 25)

This week’s links had me thinking about something that is foundational to a lot of quality engineering work:

Linky #31 - Be careful what you optimise for (May 3)

This week’s links remind me of a topic I think about often:
