
Why AI Can’t Get Software Testing Right

It’s already a danger when you write the implementation first; AI is only going to make it worse.


Writing unit tests was already a headache for developers, and AI is making it worse. A recent study has unveiled a critical weakness in LLMs: their inability to create accurate unit tests. 

While ChatGPT and Copilot demonstrated impressive capabilities in generating correct code for simple algorithms (success rates ranging from 63% to 89%), their performance dropped significantly when tasked with producing unit tests, which are used to evaluate production code.

ChatGPT’s test correctness fell to a mere 38% for Java and 29% for Python, with Copilot showing only slightly better results at 50% and 39%, respectively.

According to a study published by GitLab in 2023, automated test generation is one of the top use cases for AI in software development, with 41% of respondents currently using it. However, this recent study is now questioning the quality of those tests. 

A full-stack developer named Randy on the Daily.dev forum mentioned that he had tried AI for both writing code and writing unit tests, and that it failed miserably because it does not understand testing stacks like Spock, which is built on Groovy.

Why AI is Poor at Software Testing

AI-generated tests often lack the necessary context and an understanding of the specific requirements and nuances of a given codebase. As a result, AI can increase "tautological testing": tests that prove the code does what the code does, rather than proving it does what it is supposed to do.
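To make the point concrete, here is a hypothetical Python example (invented for this article, not taken from the study): the first test re-derives its expected value from the same formula the implementation uses, so it passes no matter what the formula is, while the second pins the value to the actual requirement.

def discounted_price(price: float, rate: float) -> float:
    # Implementation under test: apply a percentage discount.
    return price - price * rate

# Tautological test: the expected value is computed with the same formula
# as the implementation, so it can never catch a wrong formula.
def test_discounted_price_tautological():
    price, rate = 100.0, 0.2
    assert discounted_price(price, rate) == price - price * rate

# Specification-driven test: the expected value (80) comes from the
# requirement itself, written down independently of the code.
def test_discounted_price_against_spec():
    assert discounted_price(100.0, 0.2) == 80.0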

“It’s already a danger when you write the implementation first; AI is only going to make it worse,” a user explained in the Reddit discussion.

Moreover, relying on AI for test writing can lead to a false sense of security, as generated tests may not cover all critical scenarios, potentially compromising software quality and reliability.

When an AI is asked to write unit tests for code that contains a bug, it typically cannot identify that bug. Instead, it treats the existing code as the "correct" implementation and writes tests that validate the current behavior, bugs included.
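A small, hypothetical illustration of that failure mode (the function and values are invented here): the implementation has an off-by-one bug, and a test derived from its current output simply asserts the wrong answer, making the bug look intentional.

def sum_first_n(n: int) -> int:
    # Intended behaviour: 1 + 2 + ... + n. Bug: range(1, n) drops the last term.
    return sum(range(1, n))

# A test generated from the code's current behaviour locks the bug in:
# sum_first_n(5) returns 10 today, so 10 becomes the "expected" value.
def test_sum_first_n_mirrors_current_behaviour():
    assert sum_first_n(5) == 10  # passes, even though the requirement says 15

# A test written from the requirement would expose the off-by-one error.
def test_sum_first_n_from_requirement():
    assert sum_first_n(5) == 15  # fails until the bug is fixed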

A better use for AI, the same commenter suggests, would be to ask it, "What are all the ways that this code can fail?" Rather than having it write tests, have it identify scenarios you might have missed.

Another report, by researchers from the University of Houston, found similar numbers for ChatGPT-3.5: only 22.3% of generated tests were fully correct, and 62.3% were somewhat correct.

The report also noted that LLMs struggle to understand and write OpenMP and MPI unit tests because of the inherent complexity and domain-specific nature of parallel programming. And when given too much context, the models tended to hallucinate, generating code with nonexistent types, methods, and other constructs.

"Like other LLM-based tools, the generated tests are a 'best guess' and developers shouldn't blindly trust them. In many cases, additional debugging and editing are required," said Ruiguo Yang, the founder of TestScribe.

AI also struggles to come up with genuinely new test cases. Human testers, with their creative problem-solving skills, are still needed to build thorough test plans and define the overall testing scope.

But What is the Solution?

To mitigate this problem, researchers from the University of Houston used LangChain's memory mechanism. They passed smaller pieces of the code along as a guide, letting the system fill in the rest, much the way autocomplete completes what you are typing.
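The rough shape of that approach looks something like the sketch below, a simplification rather than the researchers' actual pipeline; the ask_llm helper and the 40-line chunk size are placeholders. The source is split into pieces, and each request carries the previously seen pieces as accumulated context, the way autocomplete carries what you have already typed.

def split_into_chunks(source: str, lines_per_chunk: int = 40) -> list[str]:
    # Break the file into manageable pieces for the model.
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def generate_tests_with_memory(source: str, ask_llm) -> str:
    # 'context' plays the role of the memory: everything shown so far.
    context: list[str] = []
    tests: list[str] = []
    for chunk in split_into_chunks(source):
        prompt = ("Code seen so far:\n" + "\n".join(context)
                  + "\n\nNew code:\n" + chunk
                  + "\n\nWrite unit tests for the new code.")
        tests.append(ask_llm(prompt))  # ask_llm is whatever LLM call you use
        context.append(chunk)          # remember this piece for later requests
    return "\n\n".join(tests)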

This suggests that one of the most effective ways to tackle the problem is to give the model more context, such as the full code or the associated libraries, which significantly improves the compilation success rate. With ChatGPT, for instance, the rate rose from 23.1% to 61.3%, and with Davinci it reached almost 80%.

In recent times, tools like Cursor have been helping developers build code without much hassle, and in the future we might see these tools producing better unit tests along with production code.

But for now, while AI can generate tests quickly, an experienced engineer remains crucial for assessing the quality and usability of AI-generated code and tests.



Sagar Sharma

A software engineer who loves to experiment with new-gen AI. He also happens to love testing hardware, which sometimes crashes. While reviving his crashed systems, you can find him reading literature or manga, or watering plants.