Quality Engineering Newsletter

AI-Based Automated Testing: Is It Really the Future of Testing?
AI-based automated testing looks like the future of testing, and could replace manual and automated UI testing. So, is it all it's cracked up to be?

Jit Gosai
Mar 02, 2025

Imagine a future where hours of manual regression testing are replaced by automated tests that can intelligently adapt to your application's changing features. Not only that, you would not even need to maintain these tests. All you would need to do is tell the tool what you want, and it would figure out the rest for itself. It sounds almost too good to be true.

In this post, I explore AI-based automated testing and examine how quality engineers should think about it, what the future holds, and how to help our teams navigate these new tools. So, is it all it's cracked up to be?

Duolingo's AI-Based Testing

In last week's Linky #3, I shared a post from Duolingo's engineering team about how they leveraged a GPT-based AI tool to automate 70% of their regression testing. In the post, they shared that the Duolingo mobile app (which helps users learn languages, music and maths) is highly dynamic, which meant it was often hard to know which screens would be presented to the user within a given scenario. This made creating automated UI tests quite cumbersome, as the tests had to handle all the different paths the app could present to the user. Consequently, the test team resorted to manual regression testing to check whether the app performed as expected whenever the engineering team made changes to it.

However, they still wanted to automate more of their regression testing so their testers could focus on retesting bug fixes and new features. As a result, they partnered with Mobile Boost to leverage their GPT Driver tool.

How Did AI-Based Testing Help?

One of the big benefits was that anyone on the team could now write tests, regardless of coding skill, because tests are written in a natural-language syntax. But the downside was that if the tests were too prescriptive (click this, do that, wait for this), they ran into the same problem as writing automated UI tests: large, cumbersome tests that were likely flaky and needed lots of maintenance. So, to get the most from the GPT-based AI tool, they had to rethink how they wrote automated tests. From the Duolingo blog post:

If we instead wrote tests to achieve a broader goal like "Progress through the screens until you see XYZ," GPT Driver would interpret each screen as presented with its end goal in mind and continue to progress until it could no longer interpret what to do with a screen or had achieved the aforementioned success criteria.

Using a goal-based approach for tests reduced flakiness and handled the app's dynamic nature. However, the drawback was that the automation could miss issues that didn't stop the test from achieving its outcome.
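To make the contrast concrete, here is a minimal sketch of the two styles side by side. The syntax is invented for illustration and is not GPT Driver's actual API; the screen names and instructions are hypothetical.

```python
# Hypothetical sketch: prescriptive vs goal-based test definitions.

# Prescriptive style: every step is hard-coded, so any new, reordered,
# or dynamic screen breaks the test and forces maintenance.
prescriptive_test = [
    "tap 'Get started'",
    "tap 'Learn Spanish'",
    "wait for 'Choose a goal' screen",
    "tap 'Casual'",
    "assert 'Lesson 1' is visible",
]

# Goal-based style: one broad outcome plus guardrails; the agent decides
# each step from whatever screen it happens to be shown.
goal_based_test = {
    "goal": "Progress through the screens until you see 'Lesson 1'",
    "guardrails": "Stay within onboarding; never change account settings",
    "max_steps": 25,  # a bound so the agent can't wander indefinitely
}
```

The goal-based version survives the app's dynamic screen ordering precisely because it never commits to a fixed sequence of steps.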

A new type of testing?

From a quality engineering perspective, this isn't a replacement for regression testing but a new type of testing altogether—a new layer in testing. I'd call this bounded-context outcome-focused automated testing (catchy, right?).

Bounded-context outcome-focused automated testing

It's bounded in that you instruct the AI tool to perform the actions within a context so it doesn't go too far off the track—essentially, you give it some guardrails to work within.

Outcome-focused is telling the AI tool what the end goal you're trying to achieve is and then letting it do its thing.
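Put together, a bounded-context outcome-focused test is essentially a loop: observe the screen, check the goal and the guardrails, let the model pick the next action, repeat. The sketch below assumes hypothetical `observe_screen` and `suggest_action` callables standing in for a vision model and an LLM; none of this is a real tool's API.

```python
# Minimal sketch of a bounded-context, outcome-focused test loop.
def run_outcome_test(goal, allowed_screens, max_steps,
                     observe_screen, suggest_action):
    """Drive the app toward `goal`, failing if the agent leaves the
    bounded context, gets stuck, or exhausts its step budget."""
    for _ in range(max_steps):
        screen = observe_screen()
        if goal in screen:                  # success criterion reached
            return "pass"
        if screen not in allowed_screens:   # guardrail: out of bounds
            return "fail: left bounded context"
        action = suggest_action(screen, goal)
        if action is None:                  # agent can't interpret the screen
            return "fail: stuck"
        action()                            # perform the suggested tap/swipe
    return "fail: step budget exhausted"
```

Note that every failure mode reports *that* the outcome wasn't reached, not *why* the app misbehaved; a person still has to investigate.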

It's not regression testing

For me, regression testing should be deterministic, and this type of testing is not. Deterministic in computer science means:

"...given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states."

Given a particular input, these AI-based tests will not always produce the same output, and the underlying machine will not always pass through the same sequence of states. Why? Because they are probabilistic systems, unlike most software systems, which are deterministic.

Regression tests need to be deterministic to ensure that changes made to the codebase have not adversely affected existing behaviour or introduced new unintended behaviour.
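A tiny illustration of the distinction, using a seeded random choice as a stand-in for an LLM sampling its next UI action (the action strings are invented):

```python
import random

# Deterministic: the same input always yields the same output, so a
# strict, repeatable assertion is meaningful.
def add_to_basket(basket, item):
    return basket + [item]

assert add_to_basket(["socks"], "hat") == ["socks", "hat"]  # holds on every run

# Probabilistic: a stand-in for an LLM choosing the next UI action by
# sampling. Two runs with the same "prompt" can take different paths,
# so the machine need not pass through the same sequence of states.
def next_action(rng):
    return rng.choice(["tap 'Next'", "scroll down", "tap 'Skip'"])

run_a = [next_action(random.Random(1)) for _ in range(3)]
run_b = [next_action(random.Random(2)) for _ in range(3)]
# run_a and run_b may differ even though the inputs were identical
```

Asserting `run_a == run_b` would be the wrong kind of check for a probabilistic system, which is exactly why these tests sit uneasily under the "regression testing" label.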

These bounded-context outcome-focused automated tests can only tell you that the outcome was unachievable. They need someone to investigate whether the steps to achieve the outcome have regressed or the automation simply got stuck somewhere. In their current form, they also can't tell you whether new unintended behaviour has been introduced. Hence, the testers at Duolingo now spend their time looking through recordings of test runs to spot any issues, which is quicker than running the test suites manually, if a little tedious.

Side note: You can quite easily argue that people probably don't run regression tests in a deterministic way either and may not always follow the regression testing steps strictly. However, this is almost a feature of people running regression suites, as this deviation from the steps is most likely where they'll discover other unintended behaviours.


I’m speaking at PeersCon in Nottingham, UK, on the 13th of March on Speed Vs Quality. Tickets are only £25, and the lineup looks great. If you can make it, come and say hello.


AI-based tests can't autonomously learn

Another downside of AI-based automated tests is that they can't learn from their output the way people can. GPT stands for Generative Pre-trained Transformer, and the pre-trained part is key: training happens before a GPT generates any output. Generating output is a separate process called inference, and GPTs can't learn while they infer.

Conversely, when a person performs regression testing, they continuously learn from the output, refining their understanding of the application under test. You could say we are updating our internal models of the system.

For GPT-based systems to improve, their outputs would need to be used to fine-tune or retrain the neural network. Alternatively, they could be updated by refining their prompts, leveraging external memory (if available), or using retrieval-augmented generation (RAG), where they access external data sources before generating responses.
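One of those workarounds can be sketched briefly: lessons learned by humans are stored as notes, and notes relevant to the current screen are retrieved and prepended to the prompt on the next run. The naive keyword match below stands in for a real retrieval pipeline, and the note text and screen names are invented for illustration.

```python
# Sketch of the external-memory / retrieval workaround.
def retrieve_notes(notes, screen_name):
    """Return the stored notes that mention the current screen."""
    return [n for n in notes if screen_name.lower() in n.lower()]

def build_prompt(goal, screen_name, notes):
    """Assemble the prompt the agent would see for this step,
    with any relevant past lessons prepended as context."""
    context = "\n".join(f"Note: {n}"
                        for n in retrieve_notes(notes, screen_name))
    return f"{context}\nGoal: {goal}\nCurrent screen: {screen_name}".lstrip()

notes = ["On the Paywall screen, tap 'No thanks' to continue."]
prompt = build_prompt("Reach Lesson 1", "Paywall", notes)
```

The model itself hasn't learned anything; the improvement lives entirely in the data fed to it, which is the crucial difference from a human tester updating their internal model of the system.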

While human testers improve with every test run, GPT-based tools do not, at least not autonomously. Unlike traditional automated UI tests, though, GPT-based automation can dynamically adapt by incorporating external data, self-healing mechanisms, or updated prompts. It still lacks both the continuous, independent learning that human testers naturally perform and the deterministic nature of automated UI tests.


A new layer in testing?

As a result, I see these AI-based tests as a new layer in testing that helps us lower our uncertainty on whether an outcome within our system is achievable within a bounded context.

Now, that is useful and can be a new tool for inspecting quality. But I wouldn't want to rely only on this type of testing. I'd want other layers of testing, too. I'd still expect some forms of human-driven regression testing, which may be more focused on where we know changes have been introduced, and even occasional exploratory testing in areas this type of AI-based automation has covered.

I also see this type of AI testing as most beneficial for developers and testers, particularly those pairing and wanting another layer of testing before getting into more exploratory or deterministic automated end-to-end testing. Accordingly, it could be good to deploy at layer 2 (Automated code tests) and layer 4 (Automated end-to-end testing), as described in The Six Layers of Testing.

AI-based tests start looking like exploratory tests

For traditional hand-crafted automation to work, it had to be deterministic: all steps and how to perform those steps had to be predefined, with specific assessment criteria to judge whether a test passed or failed.

However, these AI-based automated tests are different. We no longer need to define the steps and how to execute them; we might not even need to explicitly state the outcomes we are looking for beyond a general goal. Consequently, these AI-based tests look more like exploratory testing. Most testers probably don't know beforehand how they will test or what paths they are likely to take, other than a notion of what they are trying to achieve, which is highly nondeterministic.

But that's where the similarities with the current version of bounded-context outcome-focused automated tests end. Unlike exploratory testing, they can only report on what they've been explicitly told to report. These AI-based tests currently don't understand the platforms the software under test runs on, the context in which the software will be used, or the quality attributes the key stakeholders value. This is all information a good tester brings to exploratory testing, and it helps them uncover the unknown-unknowns.

But that's not to say it couldn't understand this context. So what does the future of AI-based automation hold, and what does it mean for quality engineers? Read on to dive deeper.
