A simple
measure of calibration is the slope of the regression line
of observations against predictions. When predictions are too
extreme (high predictions too high and low predictions too low),
the slope will be less than one. We refer to this slope as the
"calibration slope." The calibration slope is easily
visualized in plots of predicted against observed values
(see graph: Illustration
of calibration above).
This form
of miscalibration can be tested statistically by comparison
with a model with intercept = 0 and slope = 1 (Miller
et al., 1991; Harrell
et al., 1996).
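As a rough illustration of the calibration slope, the sketch below regresses observed values on predictions with ordinary least squares. The function name `calibration_line` and the toy data are assumptions made for this example. For binary outcomes the slope is usually estimated with a logistic model on the linear predictor instead, so this is a simplified stand-in rather than the method of the cited papers.

```python
import numpy as np


def calibration_line(predicted, observed):
    """Least-squares intercept and slope of observed regressed on predicted.

    Hypothetical helper for illustration; perfect calibration corresponds
    to intercept = 0 and slope = 1.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # polyfit returns coefficients from the highest degree down: slope first.
    slope, intercept = np.polyfit(predicted, observed, deg=1)
    return float(intercept), float(slope)


# Toy data (assumed): observed risks per group, and predictions that are
# twice as far from 0.5 as they should be, i.e. too extreme.
observed = np.array([0.3, 0.4, 0.5, 0.6, 0.7])
too_extreme = 0.5 + 2.0 * (observed - 0.5)

intercept, slope = calibration_line(too_extreme, observed)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

In this toy case the predictions are stretched by a factor of two around 0.5, so the fitted slope comes out at one half, well below the ideal value of one.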
The calibration
plot may also provide useful information on the distribution
of predicted values, e.g. by plotting symbols for groups of
patients. The spread of predictions brings us to the other aspect
of model performance: discrimination.
Discrimination
refers to the ability to distinguish high risk subjects from
low risk subjects, and is commonly quantified by a measure of
concordance, the c statistic. For binary outcomes, c is identical
to the area under the receiver operating characteristic (ROC)
curve (Hanley
and McNeil, 1982). The c statistic varies between 0.5 and 1.0 for
sensible models; the higher the better. For an example of a
ROC curve see the graph, "Illustration of 2 ROC curves"
below (Steyerberg
et al., 2001).
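To make the link between the ROC curve and its area concrete, here is a minimal sketch that traces the curve by lowering the classification threshold step by step and integrates it with the trapezoidal rule. The function names `roc_points` and `auc_trapezoid` and the four-patient data set are assumptions for illustration only.

```python
import numpy as np


def roc_points(predictions, outcomes):
    """Return (false positive rate, true positive rate) arrays for a ROC curve.

    Hypothetical helper: outcomes are coded 1 (outcome present) or 0 (absent),
    and a patient is called positive when the prediction is >= the threshold.
    """
    predictions = np.asarray(predictions, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    n_pos = int((outcomes == 1).sum())
    n_neg = int((outcomes == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one patient with and one without the outcome")
    fpr, tpr = [0.0], [0.0]  # the curve starts at the origin
    # Lower the threshold through every distinct predicted value.
    for threshold in np.unique(predictions)[::-1]:
        called_positive = predictions >= threshold
        tpr.append((called_positive & (outcomes == 1)).sum() / n_pos)
        fpr.append((called_positive & (outcomes == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy data (assumed): two patients with the outcome, two without.
predictions = [0.10, 0.40, 0.35, 0.80]
outcomes = [0, 0, 1, 1]

fpr, tpr = roc_points(predictions, outcomes)
auc = auc_trapezoid(fpr, tpr)
print(f"area under the ROC curve = {auc:.2f}")
```

Grouping by distinct predicted values means tied predictions move the curve diagonally, which is the usual way ties are handled when the area is computed this way.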
The c statistic
is calculated from all pairs of patients in which one has the
outcome and the other does not: it is the fraction of such pairs
in which the patient with the outcome has the higher prediction.
Hence, when a model provides no information,
c=0.5. The c statistic has been generalized for survival analysis.
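The pairwise definition can be sketched directly in code. The function name `c_statistic` and the example data are assumptions for illustration; counting tied predictions as half a concordant pair is a common convention rather than something stated in the text.

```python
from itertools import product


def c_statistic(predictions, outcomes):
    """Fraction of (outcome, no-outcome) pairs ranked correctly by the model.

    Hypothetical helper: outcomes are coded 1 (outcome present) or 0 (absent).
    Tied predictions count as half a concordant pair (assumed convention).
    """
    with_outcome = [p for p, y in zip(predictions, outcomes) if y == 1]
    without_outcome = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not with_outcome or not without_outcome:
        raise ValueError("need at least one patient with and one without the outcome")

    concordant = 0.0
    for p_with, p_without in product(with_outcome, without_outcome):
        if p_with > p_without:
            concordant += 1.0
        elif p_with == p_without:
            concordant += 0.5
    return concordant / (len(with_outcome) * len(without_outcome))


# Toy data (assumed): two patients with the outcome, two without.
c_informative = c_statistic([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])

# A model that gives every patient the same prediction carries no information.
c_uninformative = c_statistic([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])

print(c_informative, c_uninformative)
```

In the first example three of the four pairs are ranked correctly, giving c = 0.75. In the second every pair is tied, giving c = 0.5.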
Figure
7.2: Calibration
Illustration
of 2 ROC curves. The curves refer to a development
population of 544 patients, where the area under the curve was
0.83, and a validation population of 172 patients, where the area under the
curve was 0.80. The underlying model predicts a benign histology in a
residual mass after chemotherapy for metastatic testicular cancer. More
details can be found elsewhere (Steyerberg
et al., 2001).
The c statistic
is related in some respects to R².
Both approach 1 for perfectly discriminating models. An important
difference is that c does not depend on the frequency of the
outcome, while R² is smaller
when the outcome is infrequent, and larger when the outcome
is more frequent (Ash
and Shwartz, 1999).