AI-generated medical responses need monitoring, study finds

A Mass General Brigham study has found that AI large language models (LLMs) used to generate medical responses for patients must have systems to monitor their quality.


To tackle the rising administrative and documentation responsibilities for healthcare professionals, electronic health record (EHR) vendors have adopted generative AI algorithms to aid in drafting responses to patients.

However, Mass General Brigham researchers said that the efficiency, safety and clinical impact of these algorithms had been unknown prior to this adoption.

In a new study, the researchers found that while LLMs may help reduce physician workload and improve patient education, limitations in the LLM-generated responses could affect patient safety, suggesting that ‘vigilant oversight’ is essential for safe use.

The research team used OpenAI’s GPT-4 to generate 100 scenarios about patients with cancer, each with an accompanying patient question. No questions from actual patients were used in the study.

GPT-4 and six radiation oncologists each drafted responses to the generated questions. The same radiation oncologists were then given the LLM-generated responses to review and edit.
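To illustrate the kind of workflow the study examined, the minimal sketch below shows an LLM drafting a reply to a patient question, with a clinician reviewing and editing it before anything is sent. This is not the study’s actual pipeline: it assumes the OpenAI Python client, and the model name, prompts and review step are placeholders for illustration only.

```python
# Hypothetical draft-then-review sketch, not the study's code.
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY
# set in the environment; prompts and model name are placeholders.
from openai import OpenAI

client = OpenAI()

def draft_patient_reply(patient_question: str) -> str:
    """Ask the model to draft a reply to a patient portal message."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Draft a clear, cautious reply to a patient's portal "
                        "message. Advise seeking urgent care when symptoms "
                        "could be serious."},
            {"role": "user", "content": patient_question},
        ],
    )
    return response.choices[0].message.content

def clinician_review(draft: str) -> str:
    """Human-in-the-loop step: a clinician edits (or accepts) the draft
    before it reaches the patient."""
    print("--- LLM draft for review ---")
    print(draft)
    edited = input("Edited reply (press Enter to accept as-is): ")
    return edited or draft

if __name__ == "__main__":
    question = ("I finished radiation last week and now have a fever of 39C. "
                "What should I do?")
    final_reply = clinician_review(draft_patient_reply(question))
    print("Reply to send:", final_reply)
```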

During this review, the radiation oncologists were not told whether GPT-4 or a human had written each response, and in 31 per cent of cases they believed that an LLM-generated response had been written by a human.

On average, physician-drafted responses were shorter than the LLM-generated responses. GPT-4 tended to include more educational background for patients but was ‘less directive’ in its instructions.

The physicians reported that LLM assistance improved their perceived efficiency. They deemed the LLM-generated responses ‘safe’ in 82.1 per cent of cases and acceptable to send to a patient without further editing in 58.3 per cent of cases.

However, the researchers also found limitations in the AI-generated responses. If left unedited, 7.1 per cent of LLM-generated responses could pose a risk to the patient and 0.6 per cent could pose a risk of death, most often because GPT-4’s response failed to instruct the patient to seek immediate medical care.

In many cases, the physicians retained LLM-generated educational content, suggesting that they did perceive it to be valuable. While this may promote patient education, the researchers emphasise that overreliance on LLMs may pose risks, given their demonstrated shortcomings.

The study concluded that the emergence of AI tools in healthcare has the potential to positively reshape the continuum of care, but that it is imperative to balance their innovative potential with a commitment to safety and quality.

In a statement, corresponding author Danielle Bitterman, MD, faculty member in the Artificial Intelligence in Medicine (AIM) Programme at Mass General Brigham and a physician in the Department of Radiation Oncology at Brigham and Women’s Hospital, Boston, said: “Keeping a human in the loop is an essential safety step when it comes to using AI in medicine, but it isn’t a single solution.

“As providers rely more on LLMs, we could miss errors that could lead to patient harm. This study demonstrates the need for systems to monitor the quality of LLMs, training for clinicians to appropriately supervise LLM output, more AI literacy for both patients and clinicians, and on a fundamental level, a better understanding of how to address the errors that LLMs make.”

Mass General Brigham said it is currently leading a pilot integrating generative AI into the electronic health record to draft replies to patient portal messages, testing the technology in a set of ambulatory practices across the health system.

The study’s authors are also investigating how patients perceive LLM-based communications and how patients’ racial and demographic characteristics influence LLM-generated responses.

The study is published in full in The Lancet Digital Health.