In a recent interview, former FDA Commissioner Dr. Scott Gottlieb discussed a study comparing the performance of AI language models on medical licensing exam questions.
The test showed that ChatGPT was the top performer, scoring 98% on 50 questions from the Step 3 medical licensing exam. All five major AI models tested, including Gemini, Grok, and HuggingChat/LLAMA, passed the exam, with an average score of 75% across the group.
The study aimed to compare different AI models’ abilities to answer medical questions, as both consumers and physicians are increasingly using these tools.
Notably, ChatGPT not only provided correct answers but also contextualised its responses, explaining its choices and offering additional information.
This is not the only study or report that highlights these capabilities. A few months back, another study evaluated ChatGPT-4’s proficiency across medical specialties and its potential as a study tool for USMLE Step 2 and clinical subject exams.
ChatGPT-4 answered board-level questions with 89% accuracy. However, it also showed significant discrepancies in performance across specialties.
As AI evolves and expands into various sectors, healthcare stands out as one of the most prominent. A Reddit user noted that, during the transition period before AI can autonomously perform all the tasks of a given role, the hope is that it becomes proficient enough, and equipped with long-term memory, to serve as a proper personal assistant.
Given these expectations of how AI will evolve, people are optimistic about its use in healthcare.
For instance, a Reddit user asserted that AI can already diagnose more accurately and consistently than humans. Diagnoses can be made much faster, and with a smaller margin of error, by a computer programme with access to the entirety of humanity’s knowledge than by a person.
On the other hand, questioning the significance of GPT-4o’s score, a user pointed out, “ChatGPT relies on search engine data to function and is faster… Give even a less than mediocre doctor google (sic) and extra time, I guarantee they will do better than 98%.”
What About Other Models?
Even with the emergence of new AI models from Google and Anthropic, numerous studies and examples show GPT-4o leading the pack, while other models also demonstrate strong capabilities.
In a study of how these models fare on the American Academy of Periodontology in-service exam, ChatGPT-4o achieved the highest accuracy, ranging from 85.7% to 100%, surpassing Claude 3 Opus and Gemini Advanced.
Notably, all three outperformed second-year periodontics residents, with ChatGPT-4o in the 99.95th percentile, Claude 3 Opus in the 98th percentile, and Gemini Advanced in the 95th percentile.
Another example is a cross-sectional study on the Italian entrance test for healthcare sciences degrees, which found that GPT-4 outperformed Microsoft Copilot (which uses GPT-4) and Google Gemini. However, this study did not include Claude 3.5 in its comparison.
Meanwhile, other models are not lagging behind. A few months ago, researchers from Google and DeepMind developed Med-Gemini, a new family of multimodal AI models specialised for medicine. Built on the Gemini 1.0 and 1.5 models, it excelled in language, multimodal understanding, and long-context reasoning.
Med-Gemini offers new possibilities for AI in medicine, including complex diagnostics, multimodal medical dialogue, and processing lengthy electronic health records.
To assess Med-Gemini’s performance, researchers tested it on 25 tasks across 14 medical benchmarks. Med-Gemini set new state-of-the-art records on ten benchmarks and achieved 91.1% accuracy on the MedQA benchmark, surpassing the previous best by 4.6%. It also outperformed GPT-4 by an average relative margin of 44.5% on multimodal tasks.
Peter H. Diamandis, executive chairman of the X Prize Foundation, said on X, “Healthcare is just the beginning,” referring to the new AI model Med-Gemini, especially since it managed to outperform even GPT-4.
Recently, during an exclusive interview with AIM, Rowland Illing, chief medical officer and director at AWS, discussed how Claude can be accessed through AWS Bedrock. He gave an example of using Claude in healthcare, pointing out that it was employed to conduct a global literature search to identify associations between genes and learning disabilities. This led to the discovery of 20 new gene associations, demonstrating Claude’s capability to process and analyse vast amounts of data efficiently.
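For context, Claude models on Bedrock are typically called through the Bedrock runtime API. The minimal Python sketch below shows roughly what such an invocation looks like; the model ID, region, and prompt are illustrative assumptions for this article, not details from the interview or the literature-search project Illing described.

```python
import json
import boto3

# Illustrative sketch: model ID, region, and prompt are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {
                "role": "user",
                "content": "Summarise published associations between a given gene and learning disabilities.",
            }
        ],
    }),
)

# The response body is a stream of JSON; the generated text sits under "content".
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```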
Similarly, in a comparative study evaluating AI models in complex medical decision-making scenarios, Claude demonstrated superior performance compared to ChatGPT, Google Bard, and Perplexity, achieving higher average scores for relevance (3.64) and completeness (3.43) than the other AI tools.
Saving Lives
A Reddit discussion highlighted AI’s role in healthcare, noting its potential for personalised treatment, clinical decision support, drug discovery, and genomics. Another user mentioned that AI can save physicians time on documentation and improve diagnoses, medicine development, and personalised treatments by analysing large datasets and tailoring care to individual needs.
Another Reddit user sarcastically remarked, “AI would listen better than any doctor does today! Doctors listen about 11 seconds to their patients then dole out a diagnosis.”
Case in point: a few months ago, Alex, a four-year-old with puzzling symptoms, was diagnosed with tethered cord syndrome by ChatGPT after 17 doctors couldn’t find the cause. His mother, Courtney, entered his symptoms and medical history into ChatGPT and was able to get the right treatment for her son.
In another instance, a user on X named Cooper described how he saved his dog’s life with the help of GPT-4. After veterinarians failed to diagnose the dog’s worsening symptoms from a tick-borne disease, Cooper described the situation to ChatGPT, which suggested the correct diagnosis of immune-mediated hemolytic anemia, enabling the veterinarian to save the dog.
AI is increasingly being embraced in healthcare, with various models being implemented and companies investing in the technology as its importance in the field grows.
As mentioned by a Reddit user, AI is definitely the future of healthcare—not to replace doctors, but to help streamline the process.