LLMs for Test Cases – Our Observations
Most of us are excited about AI's impact on tech and business. Some of us are afraid, which is reasonable. Above all, many of us likely feel overwhelmed – like generative AI is changing too fast for us to make properly informed decisions about it. Tech titans are constantly pelting us with their latest models and docs full of confusing AI jargon, and we repeatedly feel we're back at square one when it comes to simply picking the right tools for our teams.
In today's post, I thought I'd try a two-pronged approach: 1.) further discuss our upcoming AI Test Case & Script Generator, and 2.) use my research toward that project so far to share our observations of the major generative AI models you're hearing about daily. I'll touch on both the base models and some custom or fine-tuned versions I've set up myself. Please note that there are likely others who have conducted more thorough testing of these models – but I'm nonetheless sharing my impressions so far to give you a window into our ongoing preparation for building an exceptional testing tool for you all.
I'll start with the first major LLM to shock the world in chatbot form. ChatGPT is only wrapping up its second year in the field, having achieved undeniable ubiquity and impact. We asked the base GPT-4o – one of OpenAI's most recent multimodal transformers, with a 128,000-token context window – to study sample pairs of software requirements and test cases, and then generate test cases for new requirements. All "multimodal" means is the ability to take in multiple kinds of input – text, images, and audio, in the case of 4o. Unsurprisingly, our results were inspiring. Between two unique sets of few-shot examples and sample requirements, the base GPT-4o model averaged a 97.88% success rate in outputting appropriate, relevant test cases. Using a Custom GPT – one instructed to focus on sample data matching our requirements-to-tests use case – yielded a 95.5% average across the same samples. GPT-4o quickly produced diverse tests of appropriate complexity, covered some edge cases that I or others had not considered, and stayed engaged with me through long, constructive follow-up discussions and edits. If the discussion veered into system design, GPT was also adept at structuring both my ideas and its own into a ready-to-go plan.

Low points of the experience included the model getting a bit "worn out" near the middle of a long discussion – sometimes starting to blur a few ideas together or suddenly adding a step it had previously declared unnecessary. It wasn't always clear whether these were memory lapses or simply faulty explanations. Either way, I wouldn't anticipate them impeding a smooth experience for our customers if they were prompting test case conversions from GPT-4o through our app.
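If you're curious what that few-shot setup looks like in practice, here's a minimal sketch using OpenAI's Python SDK. The requirement and test-case text below are illustrative placeholders rather than our actual sample data, and the prompt wording is just one reasonable way to frame the task.

```python
# Minimal sketch: few-shot prompting GPT-4o to turn a requirement into test cases.
# The requirement/test-case pairs below are illustrative placeholders, not the
# actual samples from our trials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_examples = [
    {
        "requirement": "Users must be able to reset their password via an emailed link.",
        "test_case": (
            "1. Request a password reset for a registered email.\n"
            "2. Open the emailed link and set a new password.\n"
            "3. Verify login succeeds with the new password and fails with the old one."
        ),
    },
    # ...additional requirement/test-case pairs...
]

def build_messages(new_requirement: str) -> list[dict]:
    """Assemble a chat history: instructions, worked examples, then the new requirement."""
    messages = [{
        "role": "system",
        "content": "You are a QA engineer. Given a software requirement, "
                   "write clear, numbered test cases covering normal and edge behavior.",
    }]
    for ex in few_shot_examples:
        messages.append({"role": "user", "content": f"Requirement: {ex['requirement']}"})
        messages.append({"role": "assistant", "content": ex["test_case"]})
    messages.append({"role": "user", "content": f"Requirement: {new_requirement}"})
    return messages

response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages("The app must reject video uploads larger than 500 MB."),
)
print(response.choices[0].message.content)
```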
Since I mentioned Gemini in the last post, I'll follow up with its models. For the purposes of our generator, we've tested Gemini Pro and Flash, so I can speak the most to their performance. Gemini 1.5 Pro boasts a 2-million-token context window, doubling that of its Gemini relatives, which already outstripped their competitors in context length. The model was tested on the same sample inputs as GPT-4o, but only through prompt engineering, as fine-tuning for it was not yet available in AI Studio. Right off the bat, 1.5 Pro responded to the requirements with creative testing ideas that took much longer discussion to draw out of Mistral and other models. I had a rather intricate product idea that I'd been tossing at models later in the discussion as a challenge – so imagine my surprise when 1.5 Pro mentioned it to me before I brought it up! The model generated strong test cases with novel ideas, reasonable test length and complexity, and sharp coverage, earning an average score of 91.88% across the sets.
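For context, prompting 1.5 Pro programmatically looks roughly like this with the google-generativeai Python SDK (the library behind the AI Studio API). The prompt text is a simplified placeholder, and the instruction wording is my own assumption rather than our exact phrasing.

```python
# Minimal sketch: prompting Gemini 1.5 Pro with a requirement-to-test-case example,
# using the google-generativeai Python SDK. Prompt text is an illustrative placeholder.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    system_instruction="You are a QA engineer. Given a software requirement, "
                       "write clear, numbered test cases covering normal and edge behavior.",
)

prompt = (
    "Requirement: Users must be able to reset their password via an emailed link.\n"
    "Test cases: 1. Request a reset... 2. Open the link... 3. Verify login...\n\n"
    "Requirement: The app must reject video uploads larger than 500 MB.\n"
    "Test cases:"
)

response = model.generate_content(prompt)
print(response.text)
```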
Clocking in at a 1-million-token context window, Gemini 1.5 Flash is promoted by Google for its general versatility and cost-performance balance. The base model netted a nice 87% quality grade from us, while the fine-tuned model fell short at 71.5% (the reason for this is unclear – it was tuned on the same samples that had fed the Custom GPT). Flash produced a healthy balance of quantity and quality of tests – starting with about 5-6 for a given requirement, but always briefly suggesting other possible test ideas. It also took a positive page out of ChatGPT's book by incorporating specific example measurements in test steps. For example, it drew distinctions such as testing videos both longer and shorter than 1 minute when the user's goal was to test an app's handling of video files. Finally, it thought thoroughly – writing tests to account for odd edge cases, like a random refresh or logout, or the potential canceling of a pending transaction.
Unfortunately, there were hiccups in the process. Flash neglected to include some rather no-brainer tests (e.g., testing a very long or very short input) and an obvious prerequisite in our financial test (namely, that an account should have sufficient funds before testing a withdrawal). We also noticed a general lack of clarity in multiple parts of the trials. Flash provided some complex or ambiguous assertions – such as ensuring that a currently pending transaction had not affected the current balance – without specific steps explaining how that should be verified. When asked to adjust this, it seemed unable to elaborate without contradicting itself (and when that was pointed out, it acknowledged the inaccuracy, but then simply reverted to the original unclear response). At one point, the fine-tuned Flash model output seven tests for a requirement, all of which came with one repeated prerequisite and identical starting steps. Four of them merely tested that one of the four corners of the application interface displayed a different datum, which was never described in the requirement. We saw disproportionate coverage, odd self-repetition, and hallucinations, and while discussing these issues, the model abruptly froze and stopped responding more than once.
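For reference, here is roughly how the base and fine-tuned Flash models can be run side by side with the same SDK. The tuned model name below is a placeholder for whatever ID AI Studio assigns your tuning job, and the requirement text is again illustrative.

```python
# Minimal sketch: running the same requirement through base Gemini 1.5 Flash and a
# fine-tuned variant for comparison. The tuned model ID is a placeholder for the name
# AI Studio assigns your tuning job; the requirement text is illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

prompt = (
    "Write numbered test cases for this requirement:\n"
    "Users must be able to cancel a pending withdrawal before it is processed."
)

base_model = genai.GenerativeModel("gemini-1.5-flash")
tuned_model = genai.GenerativeModel("tunedModels/your-tuned-flash-id")  # placeholder ID

for label, model in [("base", base_model), ("fine-tuned", tuned_model)]:
    response = model.generate_content(prompt)
    print(f"--- {label} ---\n{response.text}\n")
```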
While we've covered the two model families you're likely hearing the most about today, I'd be remiss not to discuss a strong contender in our pool. Mistral AI, a small French company focused on open-source LLM development, offers models of diverse sizes and specialties for direct prompting and fine-tuning. For our use case, we tested direct prompts with Mistral Large 2, which matches GPT-4o at a 128,000-token context window. Large 2 averaged a stellar 97% success rate between our trial prompts. The model generated unique, robust test cases in reasonable quantity with little to no self-repetition. It kept up with my rigorous questioning and challenging discussion prompts, made clear additions and edits to its cases as asked, and reasoned critically, justifying its responses when prompted to do so. It even generated clear, careful solutions to complex problem descriptions added to an already long and detailed mess of user stories. My "cons" list when evaluating Large 2 was essentially empty, and I would seriously consider it over GPT-4o. Mistral is a rapidly rising star that we should all be watching!
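Direct prompting here follows much the same pattern as with the other providers; a minimal sketch with the mistralai Python SDK might look like the following, again with placeholder prompt text rather than our real samples.

```python
# Minimal sketch: direct-prompting Mistral Large 2 for test cases via the mistralai
# Python SDK (v1.x). The prompt content is an illustrative placeholder.
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")  # placeholder key

response = client.chat.complete(
    model="mistral-large-latest",  # alias for the latest Large model (Large 2 at the time of writing)
    messages=[
        {"role": "system", "content": "You are a QA engineer. Given a software requirement, "
                                      "write clear, numbered test cases covering normal and edge behavior."},
        {"role": "user", "content": "Requirement: The app must reject video uploads larger than 500 MB."},
    ],
)
print(response.choices[0].message.content)
```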
Finally, we come to Meta's Llama. If you don't know, Llama stands out from other tech titans' LLM families due to its open-source emphasis. Llama 3.2 quite literally just dropped, as well – introducing lightweight mobile and multimodal image reasoning models. My access to Llama so far has been the briefest – I won't yet be giving a score – but I did get a quick chance to experiment with Llama 3. Tested on the same requirements as the others, the model performed well – brainstorming some novel ideas, though over-complicating some processes and repeating itself here and there. I felt the results were about on par with some of the "high, but not highest"-performing models discussed here – perhaps direct-prompted Gemini 1.5 Flash.
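If you'd like to try Llama 3 yourself, one option (not necessarily the setup we used) is to run the openly available instruct weights through Hugging Face's transformers pipeline, assuming you've accepted Meta's license for the gated model repository.

```python
# Minimal sketch: running a requirement prompt through Llama 3 8B Instruct locally via
# Hugging Face transformers. Assumes access to the gated meta-llama repository and
# enough GPU memory; the prompt is an illustrative placeholder.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a QA engineer. Given a software requirement, "
                                  "write clear, numbered test cases covering normal and edge behavior."},
    {"role": "user", "content": "Requirement: The app must reject video uploads larger than 500 MB."},
]

output = generator(messages, max_new_tokens=512)
# With chat-style input, generated_text holds the full conversation; the last entry is the reply.
print(output[0]["generated_text"][-1]["content"])
```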
There are a million distinguishing aspects of the various LLMs out there today, and a million ways to creatively leverage each one. This blog post merely scratches the surface of the differences among them, but I hope it helped you understand what we look for in a model to power our upcoming AI Test Case & Script Generator, what we observed in these different candidates, and how you just might be able to harness their powers yourself!
In our next post, I’ll likely dive deeper into specific model versions – perhaps even get to reveal our final choice. In the meantime, check out the previous introductory blog post here!