Policy – Zaklab Blog

What should society do about safe and effective application of AI to healthcare?

In a world awash with the rapid tide of generative AI technologies, governments are waking up to the need for a guiding hand. President Biden’s Executive Order is an exemplar of the call to action, not just within the halls of government but also for the sprawling campuses of tech enterprises. It’s a call to gather the thinkers and doers and set a course that navigates through the potential perils and benefits these technologies wield. This is more than just a precaution; it’s a preemptive measure. Yet these legislative forays are more like sketches than blueprints, in a landscape that’s shifting, and the reticence of legislators is understandable and considered. After all, they’re charting a world where the very essence of our existence — our life, our freedom, our joy — could be reshaped by the tools we create.

On a brisk autumn day, the quiet serenity of Maine became the backdrop for a gathering: The RAISE Symposium, held on October 30th, which drew some 60 souls from across five continents. Their mission? To venture beyond the national conversations and the burgeoning frameworks of regulation that are just beginning to take shape. We convened to ponder the questions of generative AI — not in the abstract, but as they apply to the intimate dance between patient and physician. The participants aimed to cast a light on the issues that need to be part of the global dialogue, the ones that matter when care is given and received. We did not an attempt to map the entirety of this complex terrain, but to mark the trails that seemed most urgent.

The RAISE Symposium’s attendees raised (sorry) a handful of issues and some potential next steps that appeared today in the pages of NEJM AI and Nature Medicine. Here I’ll focus on a singular quandary that seems to hover in the consultation rooms of the future: For whom does the AI’s medical counsel truly toll? We walk into a doctor’s office with a trust, almost sacred, that the guidance we receive is crafted for our benefit — the patient, not the myriad of other players in the healthcare drama. It’s a trust born from a deeply-rooted social contract on healthcare’s purpose. Yet, when this trust is breached, disillusionment follows. Now, as we stand on the precipice of an era where language models offer health advice, we must ask: Who stands to gain from the advice? Is it the patient, or is it the orchestra of interests behind the AI — the marketers, the designers, the stakeholders whose fingers might so subtly weigh on the scale? The symposium buzzed with talk of aligning AI, but the compass point of its benefit — who does it truly point to? How do we ensure that the needle stays true to the north of patient welfare? Read the article for some suggestions from RAISE participants.

As the RAISE Symposium’s discussions wove through the thicket of medical ethics in the age of AI, other questions were explored. What is the role of AI agents in the patient-clinician relationship—do they join the privileged circle of doctor and patient as new, independent arbiters? Who oversees the guardianship of patient data, the lifeblood of these models: Who decides which fragments of a patient’s narrative feed the data-hungry algorithms?

The debate ventured into the autonomy of patients wielding AI tools, probing whether these digital oracles could be entrusted to patients without the watchful eye of a human professional. And finally, we contemplated the economics of AI in healthcare: Who writes the checks that sustain the beating heart of these models, and how might the flow of capital sculpt the very anatomy of care? The paths chosen now may well define the contours of healthcare’s landscape for generations to come.

After you have read the jointly written article, I and the other RAISE attendees hope that it will spark discourse between you and your colleagues. There is an urgency in this call to dialogue. If we linger in complacency, if we cede the floor to those with the most to gain at the expense of the patient, we risk finding ourselves in a future where the rules are set, the die is cast, and the patient’s voice is but an echo in a chamber already sealed. It is a future we can—and must—shape with our voices now, before the silence falls.

I could have kicked off this blog post with a pivotal query: Should we open the doors to AI in the realm of healthcare decisions, both for practitioners and the people they serve? However considering “no” as an answer seemed disingenuous. Why should we not then question the very foundations of our digital queries—why, after all, do we permit the likes of Google and Bing to guide us through the medical maze? Today’s search engines, with their less sophisticated algorithms, sit squarely under the sway of ad revenues, often blind to the user’s literacy. Yet, they remain unchallenged gateways to medical insights that sway critical health choices. Given that outright denial of search engines’ role in health decision-making seems off the table and acknowledging that generative AI is already a tool in the medical kit for both doctors and their patients, the original question shifts from a hypothetical to a pragmatic sphere. The RAISE Symposium stands not alone but as one voice among many, calling for open discussions on how generative AI can be safely and effectively incorporated into healthcare.

February 22nd, 2024

Question	Elaboration and downstream experiments
Does RLHF degrade LLM performance?	What kind of RLHF under what conditions? When does it improve performance?
How does the size and quality of the pre-trained model affect the impact of RLHF?	Zou and his colleagues note that for some tasks GPT-3.5 improved whereas GPT-4 deteriorated.
How do we systematically monitor all these models for longitudinal drift?	What kinds of tasks should be monitored? Is there an information theoretic basis for picking a robust subset of tasks to monitor?
Can the RLHF impact on LLM performance be predicted by computational inspection of the reward model?	Can that inspection be performed without understanding the details of the pre-trained model?
Will we require artificial neurodevelopmental psychologists to avoid crippling the LLMs?	Can Susan Calvin (of Asimov robot story fame) determine the impact of RLHF through linguistic interactions?
Can prompting the developers of RLHF prompts mitigate performance hits?	Is there an engineered path to developing prompts to make RLHF effective without loss of performance?
Should RLHF go through a separate regulatory process than the pre-trained model?	Can RLHF pipelines and content be vetted to be applied to different pre-trained models?

Question

Elaboration and downstream experiments

Does RLHF degrade LLM performance?

What kind of RLHF under what conditions? When does it improve performance?

How does the size and quality of the pre-trained model affect the impact of RLHF?

Zou and his colleagues note that for some tasks GPT-3.5 improved whereas GPT-4 deteriorated.

How do we systematically monitor all these models for longitudinal drift?

What kinds of tasks should be monitored? Is there an information theoretic basis for picking a robust subset of tasks to monitor?

Can the RLHF impact on LLM performance be predicted by computational inspection of the reward model?

Can that inspection be performed without understanding the details of the pre-trained model?

Will we require artificial neurodevelopmental psychologists to avoid crippling the LLMs?

Can Susan Calvin (of Asimov robot story fame) determine the impact of RLHF through linguistic interactions?

Can prompting the developers of RLHF prompts mitigate performance hits?

Is there an engineered path to developing prompts to make RLHF effective without loss of performance?

Should RLHF go through a separate regulatory process than the pre-trained model?

Can RLHF pipelines and content be vetted to be applied to different pre-trained models?