
AI and Clinical Practice—AI Gaslighting, AI Hallucinations, and GenAI Potential

In this Q&A, JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, interviews Michael Howell, MD, MPH, a pulmonologist and chief clinical officer at Google, to discuss the evolution of AI and what we should expect next for AI and health care.



[This transcript is auto-generated and unedited.]

- AI gaslighting, AI hallucinations: phrases that we were unfamiliar with just a few months ago. How do we understand the evolution of AI and recognize what AI means for healthcare? I'm Dr. Kirsten Bibbins-Domingo, and I'm the editor-in-chief of JAMA and the JAMA Network. This conversation is part of a series of videos and podcasts hosted by JAMA in which we explore the issues surrounding the rapidly evolving intersection of artificial intelligence and medicine. Today I'm speaking with Dr. Michael Howell, a pulmonologist and chief clinical officer at Google Health. Before joining Google Health, Dr. Howell spent many years as a professor at Harvard University and at the University of Chicago. He also served as the University of Chicago Medicine's Chief Quality Officer. He's an active investigator with more than a hundred research articles, editorials and book chapters, and a book entitled Understanding Healthcare Delivery Science. Dr. Howell, thank you for joining me here today.

- Thanks for having me.

- I hope we can do this on a first name basis, if you don't mind. Is that okay?

- That would be terrific.

- Okay, terrific. Thank you, Mike. So let's start. You are the Chief Clinical Officer at Google Health, and as we were preparing, people said to me, Google has a chief clinical officer? So tell us what that position is and how you got there.

- So Google has a health team led by Karen DeSalvo, and the health team has a few teams in it. We have a health equity team and a global employee health team and a team that focuses on regulatory affairs. And my team is the clinical team, which is a team of doctors and nurses and psychologists and health economists. And when Google is making products that have a lot of impact on health, we try to work, you know, shoulder to shoulder and elbow to elbow with the engineers and the product managers and the researchers to make sure that it's not Silicon Valley and it's not healthcare. It's a blended voice, a third way between the two.

- A lot of the conversations we've been having are about how to bring those who are working in healthcare closer to the people who are actually innovating in this space of artificial intelligence, because we need both to work together. And it sounds like that was what's attractive about the position you're in right now?

- It is. It was a chance, in the middle of my career, to have my learning curve get really steep for a while. It was a chance to try to be part of the team that was making the things that we're seeing come to fruition now in healthcare. Artificial intelligence is going to change a lot of things, and it should happen with clinicians, not to clinicians. And so, you know, part of my job is to be that voice inside the company. It's been a very easy thing to talk through at the company.

- Artificial intelligence has been around for decades, but we are in a time when things have clearly taken a large leap forward, and since that leap we are moving at a pace that is really not what we've seen before. Give us an example now of the applications. Describe for those of us who are in practice and want to understand what's around the corner: what can we expect these models to do soon, and not do, and where are the danger zones? So what's the application?

- Yeah, the easiest way to get a sense of why these things are different is to go play with one of the chatbots. They have some capabilities that haven't existed before. They can seem like they understand very complex questions. They can remember things over many turns of dialogue. They can give responses that sound like they're from people. They can adjust things: you can say, write this like you're writing a dissertation for a PhD; now write it like you're writing it for a sixth grader. They're very good at that. And other models can deal with not just language but also photos and videos. We have one that makes music. So they're beginning to have a representation of the world, not just from text but from other things like that. If you're thinking about making something, understanding the capabilities is really important, and those are totally different from what came before. I've read a bunch of old papers, and I've sometimes thought about what it must have been like to be in practice when penicillin showed up. You're like, okay, that's different. I don't know all the things. I may overuse it a little bit. But it's a marked moment. So in December of 2022, we put out a preprint on a model called Med-PaLM. And then we had Med-PaLM 2 come out in a preprint in May, just five months apart. What the team did was take a foundation model called PaLM, and in the second one, PaLM 2. Those models have gone and read everything they can get their hands on, and they learn a representation of language. Then the team did some prompt tuning and fine-tuning to get them to answer questions well in the medical domain. And then they did two experiments that I think teach us a lot about what's likely to come. The first is that there are sets of questions that are open sourced and are roughly like questions we might get on a medical licensing exam.
They looked at ones from the US and ones from India in the first paper. They're not exactly a medical licensing exam, but they're pretty close; anybody who's taken a bunch of these would recognize them. The nice thing is that the data set has been benchmarked over a number of years. People have been working on it, making progress a few percentage points at a time. People say that roughly the equivalent of passing the exam is about 60%. And until November of 2022, the best in the world was 50%. The Med-PaLM paper was 67%. And then just in May, a few single-digit months later, it was, you know, 86, 87% correct, roughly the equivalent of a top-quartile physician test taker. So that's remarkably fast. And I think you and I would both agree that if someone passed the USMLE, we wouldn't just let them out to practice, right? There's a lot that goes between there. So the really interesting part of this paper was that they took questions that real people ask. People come to Google and ask us questions all the time. They open sourced these questions, and they're things like, is there a cure for incontinence? If I have rosacea, what's the best diet? Things that people really come and ask. In December they gave each question to the model and said, write a long-form answer, like a couple of paragraphs. And the model wrote an answer. Then they hired physicians and said, physician, write an answer like you are answering a patient. And then they took those answers and gave them to another physician, blinded, and said, grade this on a number of dimensions. Evaluating is tricky, but: is this consistent with scientific consensus? If the patient followed it, are they likely to be harmed? If they were harmed, how bad? Is there evidence of demographic or racial bias in the answer? And in December, blinded physicians preferred other physicians' answers on most dimensions by a little bit. In May, blinded physicians preferred the answer from the model over physicians' answers, by a lot, on eight of nine dimensions. So that is an example of how fast things are moving. It's something that would've been impossible two years ago for sure, a year ago for sure. But now we're seeing it, really fast.

- So, there are a few things that I was really struck by. One is, you're describing Med-PaLM as a really good test-taking doctor, and a doctor that's really good at answering patient questions, what patients really want to know. Almost the perfect doctor, right? It can get the test questions right and can respond to the patient. But of course, I've also heard you say you don't think AI is going to replace doctors. So how would we think about using this remarkable thing? Is it just something cool to think about? What's the real-world application of these things if you don't think the goal is to replace the doctor?

- Yeah. So, you know, we're working with a bunch of partners who are helping to figure that out. And I think any of us who've practiced for any length of time know that there are some things that are likely to be early targets for improvement here. There are ample studies showing that nurses and physicians and respiratory therapists and everyone spend a huge amount of time documenting, and then being unable to find the most critical information later. So we're likely to see from a number of companies, and we're already seeing this in a number of areas, a lot of work on assisting people with tasks that take them away from the bedside and away from the cognitive or procedural or emotional work of being a clinician. I think that's gonna be number one. People talk about things like prior auth as an example. And I think that over time we're likely to see tools that help support clinicians in avoiding things like diagnostic anchoring or diagnostic delay. Any of us who've practiced for any length of time have had a nurse tap us on the shoulder and go, hey doc, did you mean to do that? Hey doc, did you think about this? I've been saved, right?

- Oh yes, we all have by the ICU nurse tapping you on the shoulder.

- Or anybody in primary care clinic: do you really mean to send this person there? All the places. And so AI has this chance to be vigilant, to not get tired, to be able to look for things that might be buried in the record that you may not have seen. And so I do think that over time we'll see it as an assistive tool. You know, my mom was an accountant, and I worked for her for a couple of summers doing bookkeeping. I am old enough that at the time we did bookkeeping for her little small-business clients, you had this big sheet of paper called a ledger, and you wrote numbers down and added things up on an adding machine. And then somebody invented, you know, Lotus 1-2-3, and eventually QuickBooks or whatever all the accounting software is, and the work of accountants changed, but we didn't have fewer accountants. We also saw something I think we're likely to see with AI, which is that the ability to keep high-quality books became democratized. You didn't need to have an accountant; you could do that for yourself. And then accountants could make sure that things were on the right track. So I think we're likely to see things like that in healthcare, but it's a new kind of technology, and it's gonna be interesting to figure it out.

- Wonderful. You're so good at explaining some of these words and concepts that are a little bit mystifying, and you've used some that I'd love you to explain a little more. We've heard about this concept called AI gaslighting, where the AI has learned to do things very well, to a high degree of accuracy, and then all of a sudden is giving you exactly the wrong answer, right? So explain how that comes about and how we guard against it, and then we'll tackle hallucinating next.

- Yeah. There are a couple of related things here that are a little tricky to disentangle. The models are predicting the next word. That's what they're doing at their core, and they're hopping around that embedding space: oh, usually people go here next; this looks like a math problem; this looks like you should give a medical citation. If we step back for a second and talk about the stages of these models: there's the foundation model stage, where you have the model read everything it can get its hands on, and it learns a representation of the world. There's a stage that's sometimes used, which is fine-tuning with other data, and that can up-weight some of the parameters in the model for something you care about. Then there's prompt tuning: hey, model, act like you're a TV host; hey, model, act like you're teaching a sixth-grade course, and they will behave differently. And then there's a really important concept called reinforcement learning with human feedback. The model gives an answer, and somebody at the end says good or bad, and maybe why, and the model can take that information and move it back through the chain. It says, oh, pay more attention to this neuron and less attention to that neuron over time. And if you get reinforcement learning with human feedback wrong, then models can change over time. And when you update anything in the model and get better in one area, sometimes it'll get worse in others. Not that different than, you know, the longer I was into working in the ICU, the worse of a primary care doc I would've been.
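The "predicting the next word" idea Mike describes can be sketched with a toy bigram model: count which word most often follows each word in a tiny corpus, then emit the most frequent continuation. This is a vast simplification of a transformer (no embedding space, no attention, no feedback loop); the corpus and function names here are illustrative only.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """For each word, count which words follow it across the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation, or None if the word was never seen mid-sentence."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = [
    "the patient has a fever",
    "the patient has a cough",
    "the patient was discharged",
]
model = train_bigram(corpus)
print(predict_next(model, "patient"))  # "has" follows "patient" twice, "was" once -> "has"
```

The point of the toy: the model emits what is *statistically plausible* given what it has read, which is exactly why a plausible-looking but nonexistent citation can come out of a real model.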

- Oh, interesting. That's a great analogy. What about the hallucinations? I'm sure it goes along similar lines, but it's the one thing that anyone who's played around with any of the chatbots sort of knows. It's the thing we worry about as publishers now: that people over-rely on these wonderful tools and don't realize that they have to make sure that the citations they're giving us really exist, and things like that.

- Yeah. I'll add that in any domain, but in healthcare in particular, there's a concept called automation bias: people trust the thing that comes out of the machine. And this is a really important patient safety issue. Like with EHRs: they reduced many kinds of medical errors, no one dies of handwriting anymore, right? Which they used to do with some regularity. But they increased the likelihood of other kinds of errors. So automation bias is a really important thing. And when the model is responding and sounds like a person might sound, it's an even bigger risk. So hallucinations are really important, and what they are is the model just predicting the next word. If there's one thing for people who are watching this to remember, it's that the model doesn't go look things up in PubMed. It doesn't go ask a calculator, and I'm gonna come back to that in a second. It just remembers stuff out of that embedding space, or the concept space. So it'll be reading along, predicting the next word, doing a good job, and then it'll say, oh, this looks like it should be a medical journal citation; that's the kind of thing that comes next; here are words that are plausible for a medical journal citation. And that will look just like a medical journal citation. It remains a big problem. It was a big problem in the earlier versions. There are a few ways, from a technical standpoint, that this is getting better, but it remains an important issue. One example: it turns out that these things are bad at math. They're good at two plus two equals four, because there's a lot of that on the internet. But if you give them, you know, whatever, 13,127 plus 18,123, they say, oh, that looks like it should be a five-digit number; let me get a plausible five-digit number. They don't actually do the math. So what folks are doing to mitigate that is to say, oh, this looks like a math problem; ask a calculator, and the calculator will get the answer, and then to put that in.
Or: this looks like you should do a journal citation; go look it up in the source of record and then report back. So that's one area. For folks who want to look at more research in this, the evolving areas are called grounding, consistency, and attribution. Grounding is things like: the model wrote a paragraph; let's take this sentence, and let's say the model thinks it came from this journal article. You have a second model that says, can I find evidence that the ideas in that sentence, not the words themselves but the ideas, are actually reflected in this journal article? And then with high probability you can say, yeah, this is where it came from, let me cite my sources, or, no, I can't find evidence of that. So we'll see continuing evolution there. It's getting better, but it remains a fundamental issue with these models.
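The calculator hand-off described above can be sketched as a simple router: detect arithmetic in the prompt and delegate it to exact code rather than letting a language model guess a plausible number. The regex and function name are illustrative, not any production system's API.

```python
import re

def solve_arithmetic(prompt):
    """If the prompt contains a simple addition like '13,127 + 18,123',
    compute it exactly; otherwise return None so the request would fall
    through to the language model."""
    match = re.search(r"([\d,]+)\s*\+\s*([\d,]+)", prompt)
    if match is None:
        return None  # not a math problem for this toy router
    a = int(match.group(1).replace(",", ""))
    b = int(match.group(2).replace(",", ""))
    return a + b

print(solve_arithmetic("What is 13,127 + 18,123?"))  # 31250
print(solve_arithmetic("Is there a cure for incontinence?"))  # None
```

Real tool-use systems classify and route far more than addition, but the division of labor is the same: the model recognizes the kind of problem, and a deterministic tool supplies the answer.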

- Wow, how interesting. It's gonna change how we think about these tools in the publishing space as well, and what authors use, or don't use correctly, right now. I want to take you back to how you started, or at least before you came to Google Health, as a person in charge of quality and safety in a hospital setting. Fundamental to that is thinking about what we do in those settings to protect patients. We want to take all of the benefits from new technologies, but at the core we want to make sure that patients are really protected and get the best, while we mitigate all of the worst possible things that can happen. And one of the things, both in the application of these technologies and in their development, is how we think about patient privacy and patient data, the data from patients that these machines are being trained on. Given that you've spent so much time thinking about patients and quality and safety, how do you think about that in an environment where these technologies are moving at lightning speed? What do we do to protect patients?

- Yeah, so privacy in the US has this great foundation with HIPAA, and in other parts of the world a great foundation with things like GDPR. And so as health systems or clinicians are looking at tools like this, it's really important to make sure that whatever system they're using has the ability to isolate all of these things. It's fairly straightforward to isolate, in your cloud bucket, the data that's related to a patient. But these models create a new kind of risk, which is that if you have a big model and you do some training on private data, the model can leak across customers through its weights: if the model learns, it may memorize something that's over in one area and then leak that out in a response later. So it's very important for health systems to work with partners who have the technical ability to isolate that kind of learning, so that it stays within the boundaries of a health system. On our end, we've done a lot of work on how you have a HIPAA-compliant stack, how you sign business associate agreements, all of those things. But it's important for people to realize the risk: if you just go type PHI into a chatbot, you don't have those guarantees. You need to be using a hardened set of infrastructure. It sounds nuanced, but it's really important. We get asked a lot why we invest in publishing in journals like JAMA and Science and Nature, and it's because we think many of these are first-in-human-history problems. We think it's important to show our work and to get the math right, and we think peer review helps with getting the math right. If we step back for a moment to that embedding space: we basically say, go try to read the internet. Well, there are many dark corners of the internet. And even if you filter out the dark corners, our language is still full of bias.
So the models learn the bias in the language because they learn the language. The second thing: we have many products that more than a billion people use every month. That tells you, by definition, just by math, that the majority of our users are not in the United States. How do we protect the privacy and safety of patients while we try, together, to figure out the chance to improve health on a planetary scale?
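To illustrate the point about not typing PHI into a consumer chatbot, here is a deliberately naive redaction pass. The patterns and placeholder names are invented for this sketch; real de-identification (HIPAA's Safe Harbor method covers 18 identifier types) requires a hardened, audited pipeline, not a few regexes.

```python
import re

# Illustrative only: nowhere near sufficient for HIPAA compliance.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b"),          # medical record numbers
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # slash-formatted dates
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),     # US phone numbers
}

def redact(text):
    """Replace a few obvious identifier patterns with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt MRN: 48213, seen 03/14/2023, callback 617-555-0142."
print(redact(note))  # "Pt [MRN], seen [DATE], callback [PHONE]."
```

Even a pass like this only masks what it anticipates, which is why the interview's advice is to keep PHI inside contractually and technically isolated infrastructure rather than to rely on scrubbing.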

- Thank you so much for joining me today, Mike. I've really enjoyed this conversation. I hope you'll come back and tell us more. It sounds like you're moving at lightning speed, so it'll probably be just three weeks from now that we'll have something else to talk about.

- Thanks for having me.

- Thank you to our audience for watching and listening. We welcome comments on this series. We also welcome submissions in response to JAMA's AI and medicine call for papers. Until next time, stay informed and stay inspired. We hope you'll join us for future episodes of the AI in Clinical Practice series where we will continue to discuss the opportunities and challenges posed by AI.

