Small Data Sets
A common problem in the development of a predictive model is that the data set
available for analysis is relatively small. When dichotomous
outcomes are studied, the number of events is usually the determining
factor for the effective size of the data set.
What counts as a small data set for logistic regression analysis?
The 1:10 rule is
quite superficial and has been refined for logistic regression
modeling. It was found that the 1:10 rule is actually a minimum
for reliable modeling of fully pre-specified predictors (Steyerberg
et al., 2000a). When the 1:10 rule is violated, the
number of candidate predictors is in fact too large for the
data set, and overfitting will occur.
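The 1:10 rule can be checked directly from the outcome counts. The sketch below is illustrative (the function name and example numbers are not from the source); it uses the common convention that "events" means the smaller of the two outcome frequencies.

```python
# Events per variable (EPV) check for a logistic regression model.
# Illustrative sketch; function name and numbers are assumptions.

def events_per_variable(n_events, n_nonevents, n_candidate_predictors):
    """Return EPV, taking 'events' as the smaller outcome count."""
    effective_events = min(n_events, n_nonevents)
    return effective_events / n_candidate_predictors

# Example: 50 events, 450 non-events, 8 candidate predictors.
epv = events_per_variable(50, 450, 8)
print(epv)  # 6.25 -> below 10, so the 1:10 rule is violated
```

With an EPV below 10, the number of candidate predictors should be reduced before modeling rather than relying on the fitted coefficients.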
Another limit is
the 1:20 rule; when this rule is violated, shrinkage
of regression coefficients is required to obtain well-calibrated
predictions. When this rule is satisfied, shrinkage is not necessary
for pre-specified models.
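Uniform (linear) shrinkage multiplies all regression coefficients by a single factor below 1. One common choice is the heuristic estimator s = (model chi-square − df) / model chi-square (Van Houwelingen and Le Cessie); the sketch below assumes that estimator, and the numbers are illustrative.

```python
# Minimal sketch of uniform shrinkage of logistic regression
# coefficients, using the heuristic shrinkage factor
# s = (model chi-square - df) / model chi-square.
# Function names and example values are illustrative assumptions.

def heuristic_shrinkage_factor(model_chi2, df):
    """Heuristic uniform shrinkage factor for a pre-specified model."""
    return (model_chi2 - df) / model_chi2

def shrink_coefficients(coefs, s):
    """Multiply each coefficient by s; the intercept should be
    re-estimated afterwards to keep predictions calibrated."""
    return [s * b for b in coefs]

s = heuristic_shrinkage_factor(model_chi2=40.0, df=8)
print(round(s, 2))  # 0.8
print(shrink_coefficients([0.5, -1.2, 0.9], s))
```

Note that after shrinking the coefficients, the model intercept must be re-estimated so that the average predicted probability matches the observed event rate.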
A final limit is
the 1:50 rule; when this rule is satisfied, stepwise
selection with the default p-value of 5% can safely be applied,
since the power for selection of true predictors is large, and
the biases caused by stepwise selection are limited.
The 1:10, 1:20, and
1:50 criteria are shown schematically below. It is assumed that
the number of candidate predictors has already been limited
as far as possible. Elimination may for example be based on
external information, a narrow distribution of a covariable,
or a large number of missing values.
Rule          Stepwise selection with p<0.05   Shrinkage of regression coefficients
1:10 events   Dangerous                        Required
1:20 events   Dangerous                        Advisable
1:50 events   Safe                             Not necessary
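The 1:10, 1:20, and 1:50 rules of thumb above can be encoded in a small helper. The thresholds come from the text; the function name and the wording of the returned advice are illustrative assumptions.

```python
# Rule-of-thumb advice based on events per candidate predictor (EPV).
# Thresholds (10, 20, 50) follow the text; strings are illustrative.

def epv_advice(events, n_candidate_predictors):
    epv = events / n_candidate_predictors
    if epv < 10:
        return "below 1:10 - too many candidate predictors; expect overfitting"
    if epv < 20:
        return "1:10 to 1:20 - pre-specify the model; shrinkage required"
    if epv < 50:
        return "1:20 to 1:50 - shrinkage advisable; stepwise selection dangerous"
    return "1:50 or more - stepwise selection with p<0.05 can safely be applied"

# Example: 60 events and 4 candidate predictors give an EPV of 15.
print(epv_advice(events=60, n_candidate_predictors=4))
```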
Note that these rules
are no more than rules of thumb, based on limited research.
It may, for example, be possible to develop a reliable model
by combining individual patient data from a small study with
data on a group level from larger series in the literature (Steyerberg
et al., 2000b).