Categories
Healthcare Machine Learning Medicine

Standing on the shoulders of clinicians.

The recent publication “Health system-scale language models as all-purpose prediction engines” by Jiang et al. in Nature (June 7th, 2023) piqued my interest. The authors executed an impressive feat by developing a Large Language Model (LLM) that was fine-tuned using data from multiple hospitals within their healthcare system. The LLM’s predictive accuracy was noteworthy, yet it also highlighted the critical limitations of machine learning approaches for prediction tasks using electronic health records (EHRs).

Take a look at the above diagram from our 2021 publication Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians?. It makes the point that the EHR is not merely a repository of objective measurements, but it also includes a record (whether explicit or not) of physician beliefs about the patient’s physiological state and prognosis for every clinical decision recorded. To draw a comparison, using clinicians’ decisions to diagnose and predict outcomes resembles a diligent, well-read medical student who’s yet to master reliable diagnosis. Just as such a student would glean insight from the actions of their supervising physician (ordering a CT scan or ECG, for instance), these models also learn from clinicians’ decisions. Nonetheless, if they were to be left to their own devices, they would be at sea without the cue of the expert decision-maker. In our study we showed that relying solely on physician decisions—as represented by charge details—to construct a predictive model resulted in performances remarkably similar to those models using comprehensive EHR data..

The LLMs from Jiang et al.’s study resemble the aforementioned diligent but inexperienced medical student. For instance, they used the discharge summary to predict readmission within 30 days in a prospective study. These summaries outline the patients’ clinical course, treatments undertaken, and occasionally, risk assessments from the discharging physician. The high accuracy of the LLMs—particularly when contrasted with baselines like APACHE2, which primarily rely on physiological measurements—reveals that the effective use of the clinicians’ medical judgments is the key to their performance.

This finding raises the question: what are the implications for EHR-tuned LLMs beyond the proposed study? It suggests that quality assessment and improvement teams, as well as administrators, should consider employing LLMs as a tool for gauging their healthcare systems’ performance. However, if new clinicians—whose documented decisions might not be as high-quality—are introduced, or if the LLM is transferred to a different healthcare system with other clinicians, the predictive accuracy may suffer. That is because clinician performance is highly variable over time and location. This variability (aka data set shift) might explain the fluctuations in predictive accuracy the authors observed during different months of the year.

Jiang et al.’s study illustrates that LLMs can leverage clinician behavior and patient findings—as documented in EHRs—to predict a defined set of near-term future patient trajectories. This observation paradoxically implies that in the near future, one of the most critical factors for improving AI in clinical settings is ensuring our clinicians are well-trained and thoroughly understand the patients under their care. Additionally, they must be consistent in communicating their decisions and insights. Only under these conditions will LLMs obtain the per-patient clinical context necessary to replicate the promising results of this study more broadly.

By Zak

PI of Zaklab