ChatGPT Improving, but Still Lacks Reliability as a Clinical Support Tool

Despite mostly being accurate and showing clear improvement over time and across versions, caution is still warranted when using ChatGPT as a clinical decision support tool, according to authors of a cross-sectional study.

In an analysis of nearly 300 medical questions that were developed by 33 physicians across 17 specialties and posed to ChatGPT, the median accuracy score was 5.5 (IQR 4.0-6.0) on a 6-point Likert scale, which was considered to be between almost completely and completely correct, reported Douglas B. Johnson, MD, MSCI, of Vanderbilt University Medical Center in Nashville, Tennessee, and co-authors.

However, the chatbot’s mean accuracy score of 4.8 was much lower, which reflected the multiple instances in which ChatGPT was “spectacularly and surprisingly wrong,” they noted in JAMA Network Open.

The median completeness score was 3.0 (IQR 2.0-3.0) on a 3-point Likert scale, indicating that ChatGPT was “complete and comprehensive,” and the mean completeness score was 2.5.

The trend in differences between median and mean scores was consistent across the analysis, Johnson and team said, and the inaccuracies and hallucinations in their analysis suggested that neither version of ChatGPT (3.5 and 4) should be exclusively relied upon for medical knowledge dissemination.
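The gap between median and mean can be illustrated with a short sketch. The scores below are hypothetical values on a 6-point scale, chosen only for illustration and not taken from the study. They show how a handful of very low scores can pull the mean well below the median while the median stays near the top of the scale.

```python
from statistics import mean, median

# Hypothetical 6-point accuracy scores for illustration only.
# These are NOT the study's data: most answers score well,
# but a few are badly wrong.
hypothetical_scores = [6, 6, 6, 6, 6, 6, 5, 5, 4, 1, 1, 1]


def summarize(scores):
    """Return (median, mean) for a list of Likert scores."""
    return median(scores), mean(scores)


med, avg = summarize(hypothetical_scores)
print(f"median = {med:.1f}, mean = {avg:.2f}")
```

In this made-up sample the median is 5.5 and the mean is about 4.4, because the median ignores how far the three lowest scores fall below the rest while the mean does not.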

“Our main takeaway is that overall they were not perfect, and so they shouldn’t be relied on as a sole source by any means,” Johnson told MedPage Today. “But they did provide overall relatively complete and accurate information, and interestingly enough, that improved over time.”

“The fact that it was able to improve pretty quickly is certainly encouraging that it may potentially, at some point in the future, become more absolutely reliable,” he added.

The authors also noted that the chatbot’s accuracy on questions of varying difficulty (easy, medium, and hard) was similar as measured by median accuracy scores (P=0.05):

6.0 (IQR 5.0-6.0) on easy questions
5.5 (IQR 5.0-6.0) on medium questions
5.0 (IQR 4.0-6.0) on hard questions

Similarly, the chatbot performed well on both multiple choice (median score 6.0, IQR 4.0-6.0) and descriptive questions (median score 5.0, IQR 3.4-6.0).

Eight to 17 days after the initial test, the authors retested the chatbot on 36 questions whose initial scores had indicated inaccuracies, and found substantial improvement (median score 2.0 vs 4.0; P<0.01) on 34 of those questions.

They also used version 4 to retest a subset of questions, regardless of how version 3.5 had initially scored on them, and again observed improvement (mean accuracy score 5.2 vs 5.7; median score 6.0 [IQR 5.0-6.0] for original and 6.0 [IQR 6.0-6.0] for re-scored; P=0.002).

However, Johnson said ChatGPT would need to completely stop “egregiously” hallucinating before it can be incorporated as a trusted clinical tool.

“If something’s slightly incorrect one out of 1,000 times, that may be a good enough threshold that the benefits may outweigh the risk,” he said. “If it’s egregiously hallucinating even one out of 100 times, you would want to be very, very careful about acting on any advice.”
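To make the two thresholds in that comparison concrete, the sketch below estimates how often a user would see at least one error over repeated use. It rests on assumptions that are not from the study or from Johnson: each query is treated as independent with a fixed error rate, and the figure of 100 queries is arbitrary. The function name is hypothetical.

```python
def prob_at_least_one_error(error_rate, n_queries):
    """Chance of at least one error across n_queries.

    Assumes (for illustration only) that queries are independent
    and share the same per-query error rate.
    """
    return 1.0 - (1.0 - error_rate) ** n_queries


# 100 queries is an arbitrary, assumed usage level.
for rate in (1 / 1000, 1 / 100):
    p = prob_at_least_one_error(rate, 100)
    print(f"error rate {rate:.3f}: P(at least one error in 100 queries) = {p:.3f}")
```

Under these assumptions, a 1-in-1,000 error rate gives roughly a 10% chance of at least one error in 100 queries, while a 1-in-100 rate gives roughly a 63% chance.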

Johnson noted that ChatGPT would best be used as a source of information or a creative tool for brainstorming around a difficult treatment decision, similar to a Google search.

“At this point, they’re potentially useful as an adjunct to more trusted sources,” he added.

For this analysis, the researchers recruited 33 physicians across 17 medical, surgical, and pediatric specialties. In total, 31 respondents were faculty and two were residents at Vanderbilt University Medical Center. Physicians were asked to generate six specialty-specific medical questions with clear and uncontroversial answers from available medical guidelines dated no later than the beginning of 2021 — the cutoff date for version 3.5 of ChatGPT at the time of analysis.

Johnson and team also created 60 medical questions for 10 common medical conditions. In total, they tested 284 questions on ChatGPT version 3.5 initially, and retested 44 questions on ChatGPT version 4.

“Despite promising results, the scope of our conclusions is limited due to the modest sample size, single-center analysis, and the data set of 284 questions generated by 33 physicians, which may not be representative of all medical specialties and the many questions posed within them,” they wrote.

Michael DePeau-Wilson is a reporter on MedPage Today’s enterprise & investigative team. He covers psychiatry, long covid, and infectious diseases, among other relevant U.S. clinical news.

Disclosures

This study was supported by numerous sources, including the NIH, the National Institute of Diabetes and Digestive and Kidney Diseases, the U.S. Department of Veterans Affairs Clinical Sciences R&D Service, and the National Cancer Institute.

Johnson reported receiving grants from Bristol Myers Squibb and Incyte, and being on the advisory boards of Bristol Myers Squibb, Catalyst, Merck, Iovance, Novartis, and Pfizer.

Co-authors reported multiple relationships with government entities, foundations, and industry.

Primary Source

JAMA Network Open

Source Reference: Goodman RS, et al “Accuracy and reliability of chatbot responses to physician questions” JAMA Netw Open 2023; DOI: 10.1001/jamanetworkopen.2023.36483.
