In this Q&A, JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, interviews Nigam Shah, MBBS, PhD, professor of medicine at Stanford University and chief data scientist at Stanford Health Care, to discuss how large language models are reshaping medicine and the potential pitfalls of automation.
[This transcript is auto-generated and unedited.]
- We all have concerns about our own personal data. When we think about the AI data diet, how do we distinguish between privacy concerns and security concerns? What constitutes the anatomy of AI augmentation, the traps of automation, and the key elements of AI healthcare evaluation? I'm Dr. Kirsten Bibbins-Domingo, and I'm the editor-in-chief of JAMA and the JAMA Network. This conversation is part of a series of videos and podcasts hosted by JAMA in which we explore the issues surrounding the rapidly evolving intersection of AI and medicine. Today, I'm joined by Dr. Nigam Shah. As professor of medicine at Stanford University and chief data scientist at Stanford Health Care, Dr. Shah has written about how AI can advance our understanding of disease, as well as improve clinical practice and the delivery of healthcare. Welcome, Dr. Shah.
- Thank you. It's a pleasure to be here.
- Great. I hope we can do this interview on a first name basis, if that's okay with you.
- So Nigam, you wrote a recent special communication for us at JAMA about large language models, and people have described it as a wonderful primer on large language models in healthcare. I think one of the things that was really striking about the piece is that you issued a call to action to the clinical community: that we shouldn't sit on the sidelines but rather get involved in this new technology. So why was this a message that you wanted to send out to our readers and listeners at JAMA?
- Absolutely. So, as you know, ChatGPT, the web application powered by these large language models, has taken the world by storm. It's not a question of if it'll affect our lives, it's a question of how. And I think doctors, in general, tend to be a little bit conservative when it comes to technology, because we're dealing with caring for human lives. But in this particular situation, I think we can't afford that conservatism. We have to be a little bit more proactive in shaping how these things enter the world of medicine and healthcare. And so that was a primary motivation to write about it.
- And one of the analogies that you draw, which resonated for me, is that we stood more on the sidelines for the development of the EHR. Certainly, in the US, the EHR is now pervasive in clinical practice, but it's a technology designed not for the things we need it to do in the care of patients; it was designed for other purposes, which it does much better. And you can see the challenges; many people have written and talked about how much EHR work contributes to physician burnout and clinician dissatisfaction. And so one of the things that you point out in your article is that rather than just taking off-the-shelf large language models and seeing how well they do what we need them to do in healthcare, we should think about how to articulate what we need in healthcare and train our large language models to do the tasks that we need. Tell me a little bit more about what the difference is.
- EHRs, for example, started out predominantly as a billing solution. Then part of the EHR became about managing complex devices like MRI machines and bedside monitors and so on. Then came the bookkeeping task of did we provide the right care, and it all got merged together into one beast that we refer to as the EHR. And it's one of some 1,300 IT systems that any large healthcare system has to manage and deal with. So in some sense, it grew by what a computer scientist would call mud ball programming: things got layered on top of each other. I don't think we can afford that with something that's as new and moving as fast as large language models are. So that was one motivator. The second thing, and the core message, is that computer scientists, engineers, and tech companies are training these things using content from the internet, so to speak. If we really want these things to work, we'd better train them on Medline, on our textbooks, on UpToDate, ClinicalKey, whatever trusted sources we have, maybe guidelines from professional societies, so that the output coming out of these things can be trusted.
- You make two big points, I think, in your piece: both how we train these models and how we evaluate them, and that both should be compatible with what our goals actually are for healthcare. Let me first ask you a little bit more about what it means to train these models on data that makes sense for the types of tasks we have. If we use the whole universe of the internet, it makes sense that we'd wanna train things instead on medical content that we trust. But when it comes to patient data, we've also heard people talk about how training models on what we've done in the past in healthcare, not all of which has been great, can build in the types of biases that we see and sort of codify the bias that we've already had and may wanna change in the way we practice medicine. How do we avoid those types of things?
- I mean, these things are gonna learn what we feed them. And sometimes the patterns in the data are not the ones that we would like to believe are true or that we desire. So I like to break it down into two parts. One is the creation of the model itself. To the extent possible, we feed it content, imagine a diet for the model, that keeps it as unbiased as possible. And it's not perfect; there are patterns of care that are just hardwired in. The second is the policies that govern what happens when a model produces a certain output. We can be intentional about those policies, and for areas where we know that our care practices are not ideal, we say we will not trust the model output. And we intentionally create the diet that we want to feed to these models. Now, language models are a special case of the general models we've been talking about. So take a simple example of producing a patient instruction after discharge or after an office visit. Historically, it's five pages in English that a lot of people, including doctors, might have trouble following: what exactly was I told? Now, if we train the model to produce exactly that, we help nobody. But we can be intentional and say, produce a one-page version, produce it at the 10th grade reading level, produce it in the language of my choice. And then we put it in an evaluation loop to ask, when that was produced, did it help? Did the patient follow the instruction better? Did they read it completely more often than not? And then we use these feedback loops to steer us in directions that work.
- What do we do for more complex situations of bias, where, as Dr. Obermeyer has written about, there are groups of patients who have not been routinely referred for certain types of care, not because their clinical context didn't warrant the care, but because clinicians were just not referring them? How can we learn to understand and avoid those types of biases?
- Those, unfortunately, I don't think can be avoided if we're just learning from the data. I would flip it around and say that kind of learning makes those biases obvious and in our face, which makes us uncomfortable, and I would use that as a motivation to do something about them. So I'm a huge fan of the work that routinely shows us a mirror where we don't like what we see, which, hopefully, will create the momentum to do something about those problems. But at the same time, I would like to believe that those kinds of issues are at least in the minority, 10%, 20%, not 80%. And so we should not throw out the 80% out of fear of getting some things wrong. We absolutely need guardrails to protect against that, but I don't think fear should be the driving factor in evaluating other uses of the technology.
- One of the other things that I've learned in the course of having these conversations, especially focused on using patient data to train these models, is the issues around the privacy of those data. Because you work within a large health system, how do you talk about the goal that a system like Stanford, or other health systems, has to protect patient data when it is patient data that is used to help these models learn? And as another person I've interviewed said, these models can sometimes remember from one conversation to the other. So how do we actually put up the guardrails to protect patient privacy?
- So this is a fun topic. I might go a little bit longer on this one. I think we often mix up privacy and security. Patient data definitely need to be kept secure in the sense that I don't want my medical record out on the internet. Privacy is a slightly more nuanced notion. Yes, I do wanna make sure that people who are not authorized don't find out about the medical conditions or the care I'm seeking. But at the same time, if my doctor needs to talk about my situation with three other clinicians, I don't want that prohibited. And if there are 500 other people like me, I would want my doctor to learn from their experience. I mean, that is why we go to a doctor: because implicitly, we are relying on the neural net between their two ears to have remembered the patterns of care that were delivered to previous people like me and what happened to them. So in that case, we're explicitly asking for my data, my record, to be used by the clinician to care for another human being, and I want the same in return. And so if we use that as the guiding principle, then aggregate use of de-identified information should be our moral duty. Now, this is a very unusual position, but I would argue that we all want the learning health system, we all want personalization in our care, and the core tenet of that is to learn from the prior experience of similar patients. But if everybody in the country keeps their data private and doesn't share it with anybody, how do we get there?
- It's a moral duty for society. But you would argue, even for the individual patient, that their data contributing to the training of these types of technologies that are gonna help us in the future helps them, or helps other patients like them. How would you extend that?
- I would obviously want it to help my care, but enough data and enough learning have to accumulate before that happens. So there's a little bit of sharing on faith that has to happen. And I'll use a simple analogy. If we want a spell checker to work perfectly, a lot of us have to share our documents and how we edited them into the right phrasing, the edit pattern, in order for that to work. Autocomplete in Google documents and web searches works because millions of other people's completions have been analyzed. No human reads them. So when we come to privacy, I think what we want is that confidential information is not disclosed to people we do not want it disclosed to. But I don't view privacy as an argument not to learn from aggregate data, because otherwise, how do we get to learning health systems?
- Wonderful. So you've thought a lot about this in the context of Stanford Health Care. What types of things are you excited about on the horizon? Are there areas in medicine where you see AI technologies transforming care more quickly? What are you most excited by?
- Right now, the thing I'm most excited about is that a lot of people care. I mean, really, has there ever been a time when, for one single piece of technology, everybody, from the CEO to the chief medical officer to the chief nursing officer to the pharmacist, cared? Never before has that happened. So, super excited. But at the same time, I think the hype is a little bit out of hand, and sometimes it's like we're riding the hype curve so fast, we might reach escape velocity and never come back. And so there, which is one of the points we tried to make, we have to be intentional about what the goal is. We have to crisply articulate the desired outcome, and then verify whether we're getting it. If we want these language models to answer patient queries instead of a human answering them, well, what's the goal? Is it to reduce the physician's burden, or is it to make sure the patient gets the right answer? We can accomplish a reduction in physician burden at the expense of giving random answers to patients. And so the goal has to be articulated clearly upfront, and a lot of the pilots that I see happening these days, broadly speaking, are let's try something and see what happens, as opposed to an intentional experiment.
- Well, you write about that in your piece, and you talk about evaluation, the ways to evaluate whether this technology is actually useful to us. JAMA recently put out a call for papers on AI in medicine. What types of studies would you like to see, or do you think would be most useful, for giving us assurance that something is working toward a goal that we have?
- Absolutely. So imagine a triangle where one of the vertices is building the model, and there are many things we can do to build a model right in terms of the diet of the data that is fed to it, whether we have instruction tuned it, and so on. Another vertex is what is our presumed benefit, and how are we going to verify it? And the third is deploying it at scale in the healthcare system. That's typically overlooked, but scale and deployment matter because we also wanna make sure the use of these technologies doesn't drive up the cost of care. So the experiment that I would like to see is something that spans this whole triangle, where we create a model, where we have pre-declared the benefit that we desire and have a verification mechanism, and we verify that via a broad deployment in the healthcare system so we know we can sustain it. And then we complete the loop by swapping out a different model. Maybe we replace GPT-4 with something that Google provides or Amazon provides or something we build ourselves. I love the example of Civica Rx, which is a company that health systems came together to build so that they would have more control over their generic drug supply. Why can we not do that for language models or other kinds of technology that are pre-competitive? We don't compete on who uses the better language model; we compete on who provides better care.
- I love what you're saying. I like the triangle analogy, 'cause it also reminds me that a lot of the hype around these new technologies makes you believe that we would throw out everything we've done before, but what you're describing is our standard way of evaluating whether something new works. We wanna see, does it work on its own? Does it address a pre-specified need? And can you assess that need in the context of its actual deployment in the real world, in what you want it to do?
And so that seems to me the same approach that we have for clinical research writ large, across drugs or devices or things like that. It's such a great analogy, and I think we're going to need to see more of that. Hopefully, people will pay attention to your call for more of us to get involved. What's your parting word for our readers, our viewers, our listeners? If you're a physician, a clinician in a healthcare setting, you've heard this call to action, but we are not computer scientists. What should we be doing to stay abreast of things, to think about these things? How could we get involved?
- So, two points. One is there's often talk about AI augmenting humans. And it makes for a nice story that we're not gonna replace the human, we're gonna augment the human. But we have to be careful in defining the anatomy of the augmentation, because if I'm getting assistance from something and I have to check its output, that's a cognitive burden on me. And if we don't set up that loop properly, we might increase the burden of whoever it is that we're trying to support: the physician, the nurse, the pharmacist, the physiotherapist. So augmenting humans is a great soundbite, but how you augment them is important. And the analogy I would use there is our phones try to augment us, and they try to augment us by notifying us. Every app on our phone has this god-given right to ding when it pleases. That's not augmentation, that's distraction.
- And so when we get to augmentation, we've gotta define exactly how we're doing it, how we divide the work. And then we have to pay attention to whether whatever we're doing is leading to an efficiency gain or is actually gonna lead to a productivity gain. That is crucial, 'cause often we say, oh, we'll make the doctor's life easier. Things that used to take 40 minutes are now gonna take 10. Or you'll be able to read this slide that took a pathologist 20 minutes, and now you'll be done in 14, and it's great, you saved six minutes. How are you going to turn those saved six minutes into seeing more slides or serving more patients writ large? We often confuse this with a gain in efficiency, which is necessary 'cause we've created quite an inefficient system, asking doctors to chart at 10:00 PM. Let's say there is somebody who's working between 8:00 and 10:00, and now there's this magical AI, and the 8:00 to 10:00 work is gone. Granted, that's a great thing for the physician and for the health system, but there's no benefit to the patient; we're going to see the exact same number of patients. It's an efficiency gain and not a productivity gain. And so we have to be really careful in how we prioritize, design, and evaluate AI augmentation of whichever human is in the workflow.
- I think that's so nicely stated: thinking about the ultimate implication of using these technologies for the clinician and for the patient. Is the system then designed for us to see more patients, or for me to spend more of my time with that patient discussing something else, which might be the highest and best use of my time and lead to better outcomes for patients? With technology as disruptive as this, we have to think about its impact on so many different levels. So I think that's so important. You've talked about productivity, and one of the ways in which I imagine productivity is enhanced is by automating some of the tasks. But I know there are also challenges when we think about automation. Describe to me a little bit more, go deeper on productivity with AI.
- So that's a great one. Even if we've figured out efficiency versus productivity, we have to double click on productivity. I'll start with an analogy that's very well known in public health. There's this story about a person fishing who sees someone drowning, so they jump in and save the drowning person. Then they see another person, and jump in again, and again, and again, until they're exhausted. What is happening? They walk upstream, and there's a broken bridge from which people are falling. So when we automate a task, we have to ask, is the task being automated the root cause of the problem? In that public health analogy, we could automate by creating a robot to go save drowning people, and that would be a fine automation, but a completely useless one. So when we come to administrative burden in healthcare, we have to ask: are we automating something sensible, such as writing a discharge summary or an end-of-shift summary? Or are we automating something that should not have existed in the first place, where a misguided policy is leading to that work burden, and if we automate it, we're just gonna do the bad thing faster? So there's this automation trap that is a little bit separate from efficiency versus productivity, and in our pursuit to enhance productivity, we might find ourselves developing the equivalent of a robot to save drowning people instead of fixing the bridge.
- Wonderful. That's such a great analogy. It resonates for me 'cause I think a lot in the public health space, and it is a reminder, again, of why we need clinicians, those who've been thinking about healthcare, to be involved. You can only gain the insight that it's the bridge that's broken if you really understand how the whole system is working and how it can work best for patients. So I think that really speaks to your call for all of us to be more involved in understanding what this means for our delivery of healthcare, the care of patients, and the role of clinicians. Thank you so much for pulling it all together for us, Nigam. It's really great to have you here and to have this conversation.
- Absolutely, wonderful.
- And to our audience, thank you for watching and listening, as well as giving us feedback on this series. We also welcome submissions in response to JAMA's AI and medicine call for papers. Until next time, stay informed, and stay inspired. We hope you'll join us for future episodes of the AI and Clinical Practice series. For more videos and podcasts, subscribe to the JAMA Network YouTube channel and follow JAMA Network Podcasts available wherever you get your podcasts.