Writing unit tests was already a headache for developers, and AI is making it worse. A recent study has unveiled a critical weakness in LLMs: their inability to create accurate unit tests.
While ChatGPT and Copilot demonstrated impressive capabilities in generating correct code for simple algorithms (success rates ranging from 63% to 89%), their performance dropped significantly when tasked with producing the unit tests used to evaluate production code.
ChatGPT’s test correctness fell to a mere 38% for Java and 29% for Python, with Copilot showing only slightly better results at 50% and 39%, respectively.
According to a study published by GitLab in 2023, automated test generation is one of the top use cases for AI in software development, with 41% of respondents currently using it. However, this recent study is now questioning the quality of those tests.
A full-stack developer named Randy on the Daily.dev forum said he had tried AI for both writing code and writing unit tests, and that it failed miserably because it does not understand testing frameworks such as Spock, which is built on Groovy.
If you practice TDD, LLMs shouldn’t be writing your tests (as good as it seems). You think through the use cases, break down the logic, and write the tests. Let the LLM fill in the implementation.
— Andrew Nguonly (@andrewnguonly) June 2, 2024
I wonder how far someone could get with this approach 🤔
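As a rough sketch of that test-first split (the function, module, and cases below are hypothetical, not from the tweet), the developer writes the test by hand and leaves only the implementation for the model:

```python
# Hand-written pytest test: the developer encodes the use cases first.
# slugify() and mymodule are hypothetical; an LLM would be asked to
# implement slugify() so that these cases pass.
import pytest

from mymodule import slugify


@pytest.mark.parametrize("raw, expected", [
    ("Hello World", "hello-world"),       # spaces become hyphens
    ("  Trim Me  ", "trim-me"),           # surrounding whitespace is dropped
    ("Already-Slugged", "already-slugged"),
    ("Symbols!@# Here", "symbols-here"),  # punctuation is stripped
])
def test_slugify(raw, expected):
    assert slugify(raw) == expected
```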
Why AI is Poor at Software Testing
AI-generated tests often lack the necessary context and understanding of a codebase’s specific requirements and nuances. As a result, AI can fuel an increase in “tautological testing” – tests that prove the code does what the code does, rather than proving it does what it is supposed to do.
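A minimal, hypothetical example of what that looks like in Python: the first test recomputes the expected value with the same expression as the implementation, so it can never catch a wrong formula, while the second pins the value to the actual requirement.

```python
import pytest


# Production code (hypothetical): applies a 10% discount to an order total.
def discounted_total(price: float, quantity: int) -> float:
    return price * quantity * 0.9


# Tautological test: it mirrors the implementation's own expression,
# so it still passes if the discount logic is wrong in both places.
def test_discounted_total_tautological():
    price, quantity = 20.0, 3
    assert discounted_total(price, quantity) == price * quantity * 0.9


# Requirement-driven test: three $20 items with a 10% discount should cost $54.
def test_discounted_total_from_requirement():
    assert discounted_total(20.0, 3) == pytest.approx(54.0)
```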
“It’s already a danger when you write the implementation first; AI is only going to make it worse,” a user explained in the Reddit discussion.
Moreover, relying on AI for test writing can lead to a false sense of security, as generated tests may not cover all critical scenarios, potentially compromising the software quality and reliability.
When an AI is asked to write unit tests for code that contains a bug, it typically doesn’t have the ability to identify that bug. Instead, it treats the existing code as the “correct” implementation and writes tests that validate the current behavior – including the bugs, if any.
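A hypothetical sketch of that failure mode: the helper below has an off-by-one bug, and a test derived from the code’s current behaviour simply enshrines it.

```python
# Buggy production code (hypothetical): meant to return the last n items,
# but the off-by-one in the slice drops one element.
def last_n(items: list, n: int) -> list:
    return items[-(n - 1):]  # bug: should be items[-n:]


# The kind of test an LLM tends to produce when shown only this code:
# it asserts the current (buggy) output, so the bug is now locked in.
def test_last_n_locks_in_bug():
    assert last_n([1, 2, 3, 4], 2) == [4]       # passes against the buggy code


# A test written from the requirement would expose the bug instead.
def test_last_n_from_requirement():
    assert last_n([1, 2, 3, 4], 2) == [3, 4]    # fails until the slice is fixed
```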
The developer suggests a better use for AI would be to ask it, “What are all the ways that this code can fail?” Rather than having it write tests, have it identify the things you might have missed.
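In practice, that amounts to swapping the prompt rather than the tooling; a hypothetical sketch:

```python
# Hypothetical prompt helper: ask the model for failure modes instead of tests.
FAILURE_MODE_PROMPT = """Here is a function from my codebase:

{code}

Do not write tests. List every way this function can fail: edge cases,
invalid inputs, boundary conditions, and error paths I might have missed."""


def failure_mode_prompt(code: str) -> str:
    """Build a code-review prompt for a given snippet of production code."""
    return FAILURE_MODE_PROMPT.format(code=code)
```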
Another report, by researchers from the University of Houston, found similar numbers for ChatGPT-3.5: only 22.3% of generated tests were fully correct, and 62.3% were somewhat correct.
The report also noted that LLMs struggle to understand and write OpenMP and MPI unit tests due to the inherent complexity and domain-specific nature of parallel programming. And when provided with “too much” context, LLMs tended to hallucinate, generating code with nonexistent types, methods, and other constructs.
“Like other LLM-based tools, the generated tests are a ‘best guess’ and developers shouldn’t blindly trust them. In many cases, additional debugging and editing are required,” said Ruiguo Yang, the founder of TestScribe.
AI also struggles to devise new test cases on its own. Human testers, with their creative problem-solving skills, are still needed to build thorough test plans and define the overall testing scope.
But What is the Solution?
To address this, the University of Houston researchers used LangChain’s memory mechanism: they passed along smaller pieces of the code as a guide, allowing the system to fill in the rest, much like autocomplete works as you type.
This suggests that one of the most effective ways to tackle the problem is to give the model more context, such as the full code or associated libraries, which significantly improves the compilation success rate. For instance, with ChatGPT the rate rose from 23.1% to 61.3%, and for Davinci it reached almost 80%.
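A rough sketch of that chunk-by-chunk idea (not the researchers’ actual pipeline; call_llm is a hypothetical stand-in for whatever chat-completion client is used):

```python
# Sketch of incrementally feeding code context to an LLM before asking for tests.
# The chunking and prompts are illustrative only.
from typing import Dict, List


def chunk_source(source: str, max_lines: int = 40) -> List[str]:
    """Split a source file into line-based chunks small enough for the model."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def call_llm(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in: plug in your chat-completion client here."""
    raise NotImplementedError


def generate_tests(source: str) -> str:
    # The growing message history acts as the "memory": each chunk is appended
    # so the model has seen the whole file before it is asked to write tests.
    messages = [{"role": "system",
                 "content": "You will receive a source file in chunks. "
                            "Acknowledge each chunk; write nothing else yet."}]
    for i, chunk in enumerate(chunk_source(source), start=1):
        messages.append({"role": "user", "content": f"Chunk {i}:\n{chunk}"})
        messages.append({"role": "assistant", "content": call_llm(messages)})

    messages.append({"role": "user",
                     "content": "Now write pytest unit tests covering the code above."})
    return call_llm(messages)
```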
Tools like Cursor are already helping developers write code with far less friction, and in the future we may see them generate better unit tests alongside production code.
For now, though, while AI can generate tests quickly, an experienced engineer remains crucial for assessing the quality and usability of AI-generated code and tests.