Gordon Guyatt, MD, MSc, discusses Discrimination and Calibration of Clinical Prediction Models with Ana Carolina Alba, MD, PhD.
Dr. Gordon Guyatt (00:02):
Hello, I would like to welcome you to the JAMA Evidence Users' Guide to the Medical Literature podcast. I'm your host, Gordon Guyatt. I am a Distinguished Professor of Health Research Methods at McMaster University in Hamilton, Canada, and today I'm being joined by Dr. Carolina Alba. She's an Associate Professor in the Heart Failure and Transplant Program at the Ted Rogers Centre for Heart Research at the Toronto General Hospital, part of the University Health Network in Toronto. Dr. Alba was the lead author who led us in looking at prognosis and outcomes in the Users' Guide, and that's what we're going to be talking about today. Dr. Alba, welcome.
Dr. Carolina Alba (00:54):
Thank you, Gordon and JAMA, for giving us the opportunity to share the results and explanation of this study.
So let's jump right in now. What's this article about?
So this Users' Guide helps clinicians to understand the available metrics for assessing model performance in terms of discrimination and calibration, and for comparing the performance of different prediction models. Informing patients about their prognosis is part of the daily dialogue between physicians and patients. However, assessing prognosis is very complex because multiple factors interplay in the risk of future events. Luckily for physicians and patients, prediction models can assist them in this activity. There is a wide variety of models that can be applied to estimate risk in different diseases, and the accuracy of these models is very diverse. So the ultimate goal of this guide is to help clinicians make optimal use of existing prediction models.
So you're going to tell us more about how to make the best use of predictive models, but why should clinicians bother in the first place to use models? As a matter of fact, why should they be so interested in prognosis?
So accurate prognostic information is vitally important for patients and physicians to make optimal health-related and life decisions. For example, if a patient is at very low risk of events, the absolute benefit of an effective treatment may be very small in relation to the potential harm, burden, or cost of that medication. Among higher-risk patients, the opposite could be true: the same treatment may offer substantial benefit. So providing accurate prognostic information can help patients and physicians in shared decision making. It can help avoid testing or the use of costly and risky therapies in low-risk patients, and avoid delays in treatment or in the use of effective therapies in patients who are at very high risk of events and would benefit from an effective therapy.
Okay. You've done a great job of telling us why clinicians should be interested in prognosis and in the introduction you told us that we're going to be focusing on prognostic models, but prognostic models aren't the only ways to estimate a patient's prognosis. Can you take us through what the options are as to how clinicians can go about estimating a patient's prognosis?
So there are very different ways to estimate prognosis, and each is associated with advantages and disadvantages. One may be to use just the physician's judgment or the patient's intuitive estimate of their own risk. These have proven to be very limited. For example, from the physician's point of view, in general we tend to overestimate risk substantially. A second way could be to take the estimate of the average risk from observational studies. For example, there may exist a registry describing prognosis or the risk of events in patients with a specific disease. We may take the risk from these registries; however, these registries often fail to report risk across different patient characteristics.
So a physician could apply the effect of different factors that are known to be associated with the risk of events to this average risk, to try to estimate how some patient characteristics, for example age, may change this average risk. But a better way is to use prediction models that combine the effect of these multiple risk factors into a single estimate. By applying a mathematical formula behind the scenes, these risk models can provide the absolute risk of future events for a patient with a combination of different predictive factors. We call these mathematical equations prediction or prognostic rules, guides, or models.
You've told us that physician intuition is one way to estimate prognosis, but physician intuition is often misguided. You've also told us that there are observational studies, but observational studies have considerable limitations and you've implied, and I think you are right, that the best way for clinicians to estimate prognosis is to have a model available. But some models may be preferable, some may be very satisfactory and some may be unsatisfactory. How can clinicians differentiate a good model from a model that they should stay away from?
The ideal model correctly distinguishes every single patient who is going to have an event from every single patient who is not going to have an event, without misclassifying any patients in this exercise. However, this model does not exist, unfortunately. So the extent to which a model comes close to achieving this goal can be characterized by two main properties. One is discrimination and the other one is calibration. Discrimination refers to how well the model differentiates high-risk from low-risk patients. Discrimination depends on the distribution of patient characteristics, and it has two main limitations. One is that the model can differentiate very well in a very heterogeneous population with widely different values of the predictors included in the model. For example, if the model relies only on age, the model will differentiate patients very well if the age range of that population is very wide, for example, from 20 to 90 years of age.
However, the model will not be able to perform very well if the age range is very narrow, for example, only including patients between 50 and 60 years. The other problem with discrimination is that a model could have very good discrimination and tell us that a patient is at higher risk of having an event in comparison to another patient who may be at lower risk. However, it does not tell us anything about the absolute risk.
So a model could predict that the risk of one patient is 1% and that the risk of another patient, who is at higher risk, is 2%, and that can show very good discrimination. But in fact, when we follow the patients for some period of time, we observe that the true risks were 10% and 20%. So the model's absolute risk prediction was very poor, very limited, and it does not help us to make decisions. This brings us to the second most important characteristic, which is calibration. Calibration tells us how similar the predicted risk derived from the model is to the true risk in groups of patients classified by risk strata. To the extent that the estimates are accurate, we say that the model is well calibrated.
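The 1% versus 2% example can be written out in a few lines of Python. This is only an illustrative sketch using the numbers mentioned in the conversation; the group labels and variable names are hypothetical and do not come from the article.

```python
# Numbers from the spoken example: the model predicts 1% and 2%,
# but the risks observed on follow-up are 10% and 20%.
predicted = {"lower-risk group": 0.01, "higher-risk group": 0.02}
observed = {"lower-risk group": 0.10, "higher-risk group": 0.20}

# Discrimination: does the model put the two groups in the right order?
ranking_preserved = (
    (predicted["higher-risk group"] > predicted["lower-risk group"])
    == (observed["higher-risk group"] > observed["lower-risk group"])
)

# Calibration: how far is the observed risk from the predicted risk?
observed_to_predicted = {
    group: observed[group] / predicted[group] for group in predicted
}

print("ranking preserved:", ranking_preserved)
for group, ratio in observed_to_predicted.items():
    print(f"{group}: observed risk is about {ratio:.0f} times the predicted risk")
```

The ordering is correct, so discrimination looks fine, yet the observed risk is roughly ten times the predicted risk in both groups, which is what poor calibration means here.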
You have identified the two key characteristics: discrimination, which tells you whether my risk is greater than your risk, and calibration, which tells you, if the model says not only that my risk is greater than yours but that my risk is 2% and yours is 1%, whether those are really the right numbers. If the true risks are 20% and 10%, mine would still be twice as much as yours, so that is good discrimination. But the calibration would be very off, because the risk would actually be tenfold higher than the model tells you. So we've identified those two key characteristics, discrimination and calibration. Now the clinician is looking at the model. What tests will they find to assess discrimination, and how should they use the tests of discrimination in deciding whether or not to use the model and how to use it?
Discrimination can be assessed in different ways. For a binary outcome, the commonly reported metrics are the area under the receiver operating characteristic (ROC) curve or, for survival analysis, the C statistic. This metric is obtained by taking all possible pairs of patients and comparing their predicted probabilities. If the model cannot discriminate at all between patients who had events and those who did not, the C statistic or area under the ROC curve will be close to 0.5, which means that the model is no better than chance. If the model always produces a higher probability for patients having events in comparison to those not having events, the C statistic will be one, and that means perfect discrimination. Usually the C statistic will fall between 0.5 and one. It is very rarely close to one. There is a generally accepted approach suggesting that if the area under the ROC curve or the C statistic is less than 0.6, the model shows very poor discrimination, may not add to the prediction, and may not have any clinical utility.
However, if the C statistic is between 0.6 and 0.75, it is possible that the model may help to guide care if used in clinical practice, or may help to inform patients about their prognosis. If the C statistic is higher than 0.75, there is a general understanding that the model has good discrimination and can provide useful clinical information. Such thresholds are arbitrary, and the concept of the C statistic is very hard to apply clinically because it doesn't take into account the consequences of misclassification.
For example, a model may misclassify patients who have events as lower risk, or may more frequently misclassify patients who did not have an event as having higher risk. The consequences of misclassifying patients with events and patients without events will be different, and the C statistic does not take that into account. That is something that should be taken into account, and there are different metrics to do that, but they are probably still developing and still hard to apply in the clinical setting. But those are general rules that a clinician can follow to help understand whether a model provides good discrimination.
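The pairwise idea behind the C statistic for a binary outcome can be sketched as follows. This is a minimal illustration rather than the method used in any particular study: the function name `c_statistic` and the data are hypothetical, tied predictions are counted as one half (a common convention), and survival data with censoring would need a different calculation.

```python
from itertools import product


def c_statistic(predicted_risks, had_event):
    """Proportion of (event, non-event) pairs in which the patient with the
    event received the higher predicted risk; ties count as one half."""
    event_risks = [p for p, e in zip(predicted_risks, had_event) if e]
    non_event_risks = [p for p, e in zip(predicted_risks, had_event) if not e]
    if not event_risks or not non_event_risks:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event, p_non_event in product(event_risks, non_event_risks):
        if p_event > p_non_event:
            score += 1.0
        elif p_event == p_non_event:
            score += 0.5
    return score / (len(event_risks) * len(non_event_risks))


# Hypothetical cohort of five patients: 1 = had the event, 0 = did not.
outcomes = [1, 1, 0, 0, 0]

perfect = c_statistic([0.9, 0.8, 0.3, 0.2, 0.1], outcomes)        # events always ranked higher
uninformative = c_statistic([0.5, 0.5, 0.5, 0.5, 0.5], outcomes)  # same prediction for everyone
mixed = c_statistic([0.9, 0.2, 0.3, 0.1, 0.4], outcomes)          # some pairs ranked wrongly

print(perfect, uninformative, round(mixed, 2))
```

Note that the function only looks at the ordering of predictions within each pair, which is why it says nothing about whether the absolute risks are right or about the consequences of each kind of misclassification.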
So you've told us how to identify good discrimination, this C statistic or area under the ROC curve and if we're over 0.75, we're in good shape except we may not be in such good shape because you've also told us that the discrimination can be good but the calibration may not. So how can physicians identify well calibrated models?
So calibration is the most important property of them all and, unfortunately, it is usually underreported. Assessing calibration can be done in two different ways. One is for the whole population, which we refer to as average or mean calibration. The other one, which is probably the most accurate and important one, is to report calibration at different risk strata. A model could have very good average calibration but could be miscalibrated at the extremes, for example, in low-risk or high-risk patients. If we don't report calibration across different risk strata, we will miss that important information.
Calibration could be excellent for some patients. For example, it could be very good for patients who are at low risk of an event, say lower than 10%, but the model could significantly overestimate risk in those who are at higher risk than 10%. We may want to know that information, depending on the risk threshold that we are going to use to make a clinical decision. So reporting calibration across different risk strata is the most important property of them all, and it should always be reported and looked for when we are assessing the performance of a model.
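The difference between mean calibration and calibration by risk stratum can be illustrated with a small sketch. The cohort below is invented so that the average looks perfect while both strata are off; the function name `calibration_by_stratum`, the data, and the single 10% cut point (borrowed from the spoken example) are assumptions for illustration only.

```python
def calibration_by_stratum(predicted_risks, had_event, cut_points):
    """Group patients by predicted risk and compare the mean predicted risk
    with the observed event proportion in each stratum."""
    edges = [0.0] + sorted(cut_points) + [1.0]
    rows = []
    for index in range(len(edges) - 1):
        low, high = edges[index], edges[index + 1]
        is_last = index == len(edges) - 2
        members = [
            (risk, event)
            for risk, event in zip(predicted_risks, had_event)
            # the last stratum also includes a predicted risk of exactly 1.0
            if low <= risk < high or (is_last and risk == high)
        ]
        if not members:
            continue
        count = len(members)
        rows.append({
            "low": low,
            "high": high,
            "n": count,
            "mean_predicted": sum(risk for risk, _ in members) / count,
            "observed": sum(event for _, event in members) / count,
        })
    return rows


# Hypothetical cohort: 20 patients predicted at 5% (3 events observed)
# and 10 patients predicted at 40% (2 events observed).
predicted_risks = [0.05] * 20 + [0.40] * 10
had_event = [1] * 3 + [0] * 17 + [1] * 2 + [0] * 8

# Mean calibration: one comparison for the whole cohort.
mean_predicted = sum(predicted_risks) / len(predicted_risks)
mean_observed = sum(had_event) / len(had_event)
print(f"overall: predicted {mean_predicted:.3f}, observed {mean_observed:.3f}")

# Calibration by stratum, split at a predicted risk of 10%.
strata = calibration_by_stratum(predicted_risks, had_event, cut_points=[0.10])
for row in strata:
    print(f"{row['low']:.2f} to {row['high']:.2f}: n={row['n']}, "
          f"predicted {row['mean_predicted']:.2f}, observed {row['observed']:.2f}")
```

In this made-up cohort the overall predicted and observed risks agree, but the model underestimates risk in the low stratum and overestimates it in the high stratum, which is the kind of information that is lost when only mean calibration is reported.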
Okay, great. Now you've shown us how to identify a model that we might consider using and the discrimination looks pretty good area under the ROC curve of .75 and the calibration looks pretty good, but now we have two models that look pretty good. How are we going to choose between the two of them?
Oh, that's a very difficult question to answer, but in general terms, a physician could qualitatively compare the discrimination and calibration of the two models that are being compared. If one model shows better discrimination and better calibration than the comparator, it's an easy pick. We know that the new model, or the second model, is better than the one that is showing worse discrimination and calibration. However, sometimes these two metrics are very close, or one model is showing better discrimination and worse calibration and vice versa, which makes it difficult for physicians to choose. There are many different statistical techniques that can be used to compare the performance of two models.
One is called risk reclassification analysis. Results from risk reclassification analysis can be summarized in different metrics. Some of them even consider the weight of misclassifying patients with and without events, which makes the selection between two models much easier, based on the possible consequences of one versus the other misclassification. So there are different metrics that can be used and, depending on the discrimination and calibration of our model, we may report one or the other, or one or the other could provide more useful information when applying the model in a clinical setting.
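The speakers do not name a specific reclassification metric. One commonly used summary is the net reclassification improvement (NRI), and the sketch below assumes that metric purely for illustration. The function name, the two-level risk categories, and the data are all hypothetical.

```python
def net_reclassification_improvement(old_category, new_category, had_event):
    """Risk categories are integers ordered from lowest (0) to highest risk.
    Returns (event component, non-event component, overall NRI)."""

    def net_upward(rows):
        up = sum(1 for old, new in rows if new > old)
        down = sum(1 for old, new in rows if new < old)
        return (up - down) / len(rows)

    triples = list(zip(old_category, new_category, had_event))
    events = [(old, new) for old, new, event in triples if event]
    non_events = [(old, new) for old, new, event in triples if not event]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    # Moving up is an improvement for patients who had the event;
    # moving down is an improvement for patients who did not.
    nri_events = net_upward(events)
    nri_non_events = -net_upward(non_events)
    return nri_events, nri_non_events, nri_events + nri_non_events


# Hypothetical cohort: category 0 = low risk, 1 = high risk.
old_category = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
new_category = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
had_event = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

nri_events, nri_non_events, nri = net_reclassification_improvement(
    old_category, new_category, had_event
)
print(round(nri_events, 3), round(nri_non_events, 3), round(nri, 3))
```

Keeping the event and non-event components separate is a deliberate choice: it lets a reader weigh the two kinds of misclassification differently instead of relying only on their unweighted sum.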
So let me see if I've understood your key points. Number one, when we are considering whether to use a predictive model, the first thing is: is it worth the effort? Is the patient's prognosis something we need to know, is the patient interested in it, do we need the information to best manage the patient? If we're in that situation, we may do it by our intuition, but we're liable to be misguided. We may pick a single observational study, but that probably is not the best way. If there is a good model with adequate discrimination and calibration, that's probably where we should go, if I got it right.
Great summary, Gordon. I think that one point to highlight is that if there is a model that has good accuracy, the most important metric to use to decide whether or not to use it is calibration.
Excellent additional point to make. Thank you very much for joining us. The assessment of prognosis is a very crucial aspect of medical care, one to which we perhaps don't pay as much attention as we should. There are a lot of models now coming up to help us to focus on prediction, and you've shown us how to make the best use of them. So thanks very much for joining us.
Thank you, Gordon. And thank you to JAMA for the opportunity.
This episode was produced by Shelly Steffens at the JAMA Network. The audio team here also includes Jesse McQuarters, Daniel Morrow, Lisa Hardin, Audrey Foreman, and Mary Lynn Ferkaluk. Dr. Robert Golub is the JAMA Executive Deputy Editor. To follow this and other JAMA Network podcasts, please visit us online at JAMA Network Audio. Thanks very much for listening.