Small Data Sets
A major problem in the development of a predictive model is that
the data set available for analysis is often relatively small.
When dichotomous outcomes are studied, the number of events,
rather than the total number of patients, is usually the
determining factor for the effective size of the data set.
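To make the point about events concrete, the following minimal
Python sketch counts the events in a dichotomous outcome. The
function name `limiting_sample_size` is hypothetical, and taking
the smaller of the number of events and non-events is an
assumption added here for illustration; the text itself speaks
only of the number of events.

```python
from typing import Sequence


def limiting_sample_size(outcomes: Sequence[int]) -> int:
    """Return the count that limits model complexity for a 0/1 outcome.

    Assumption (not stated in the text): when events are more common
    than non-events, the non-events become the limiting count, so the
    smaller of the two counts is returned.
    """
    if any(y not in (0, 1) for y in outcomes):
        raise ValueError("outcomes must be coded as 0 (no event) or 1 (event)")
    events = sum(outcomes)
    non_events = len(outcomes) - events
    return min(events, non_events)


if __name__ == "__main__":
    # Hypothetical data: 1000 patients, of whom 30 had the event.
    outcomes = [1] * 30 + [0] * 970
    print(len(outcomes), "patients,", limiting_sample_size(outcomes), "limiting events")
```

In this hypothetical example the data set looks large, but only
the 30 events count toward the rules discussed in this section.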
QUESTION 8.1
Which data set is the smallest for logistic regression analysis?
The 1:10 rule (at most one candidate predictor per 10 events) is
rather crude and has been refined for logistic regression
modeling. It was found that the 1:10 rule is actually a minimum
for reliable modeling of fully prespecified predictors (Steyerberg
et al., 2000a). When the 1:10 rule is violated, the number of
candidate predictors is too large for the data set, and
overfitting will occur.
Another limit is the 1:20 rule (one candidate predictor per 20
events); when this rule is violated, shrinkage of regression
coefficients is required to obtain well-calibrated predictions.
When this rule is satisfied, shrinkage is not necessary for
prespecified models.
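The text does not say which shrinkage method is meant. As a
minimal sketch, the Python code below uses one commonly used
heuristic, a uniform shrinkage factor computed as (model
chi-square minus number of predictors) divided by model
chi-square. This choice of method is an assumption made here for
illustration, and the function names and the numbers in the
example are hypothetical.

```python
from typing import List, Sequence


def heuristic_shrinkage_factor(model_chi2: float, n_predictors: int) -> float:
    """Uniform shrinkage factor s = (chi2 - p) / chi2.

    model_chi2 is assumed to be the likelihood-ratio chi-square of the
    fitted model and n_predictors its degrees of freedom. The factor is
    floored at 0 so that a very weak model cannot flip coefficient signs.
    """
    if model_chi2 <= 0:
        raise ValueError("model_chi2 must be positive")
    if n_predictors < 0:
        raise ValueError("n_predictors must be non-negative")
    return max(0.0, (model_chi2 - n_predictors) / model_chi2)


def shrink_coefficients(coefficients: Sequence[float], factor: float) -> List[float]:
    """Multiply every regression coefficient (not the intercept) by the factor."""
    return [factor * b for b in coefficients]


if __name__ == "__main__":
    # Hypothetical fitted model: chi-square of 40 with 8 predictors.
    s = heuristic_shrinkage_factor(model_chi2=40.0, n_predictors=8)
    print("shrinkage factor:", s)
    print("shrunken coefficients:", shrink_coefficients([0.5, -1.0], s))
```

With this heuristic the factor approaches 1 (little shrinkage)
when the model chi-square is large relative to the number of
predictors. After shrinking the coefficients, the intercept is
usually re-estimated so that the average predicted probability
still matches the observed event rate.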
A final limit is the 1:50 rule (one candidate predictor per 50
events); when this rule is satisfied, stepwise selection with the
default p-value of 5% can safely be applied, since the power for
selection of true predictors is large, and the biases caused by
stepwise selection are limited.
The 1:10, 1:20, and
1:50 criteria are shown schematically below. It is assumed that
the number of candidate predictors has already been limited
as far as possible. Elimination may for example be based on
external information, a narrow distribution of a covariable,
or a large number of missing values.
Candidate             Stepwise selection    Shrinkage of
covariables           with p<0.05           regression coefficients

>1:10 events          Very dangerous        Necessary
1:10 - 1:20 events    Dangerous             Advisable
1:20 - 1:50 events    Not advisable         Not necessary
<1:50 events          No problem            Not necessary
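
The scheme above can be sketched as a small lookup in Python.
This is only an illustration of the rules of thumb: the names
`Advice` and `small_sample_advice` are hypothetical, and the
handling of the exact boundaries (a ratio of exactly 1:10, 1:20,
or 1:50 counts as satisfying the rule) is an assumption based on
the wording "when this rule is satisfied".

```python
from typing import NamedTuple


class Advice(NamedTuple):
    zone: str                # row label from the scheme
    stepwise_selection: str  # advice on stepwise selection with p<0.05
    shrinkage: str           # advice on shrinkage of coefficients


def small_sample_advice(n_events: int, n_candidates: int) -> Advice:
    """Map events and candidate covariables to the row of the scheme."""
    if n_events <= 0 or n_candidates <= 0:
        raise ValueError("n_events and n_candidates must be positive")
    events_per_candidate = n_events / n_candidates
    if events_per_candidate < 10:   # more than 1 covariable per 10 events
        return Advice(">1:10", "Very dangerous", "Necessary")
    if events_per_candidate < 20:   # between 1:10 and 1:20
        return Advice("1:10 - 1:20", "Dangerous", "Advisable")
    if events_per_candidate < 50:   # between 1:20 and 1:50
        return Advice("1:20 - 1:50", "Not advisable", "Not necessary")
    return Advice("<1:50", "No problem", "Not necessary")


if __name__ == "__main__":
    # Hypothetical study: 60 events and 8 candidate covariables.
    print(small_sample_advice(n_events=60, n_candidates=8))
```

For the hypothetical study with 60 events and 8 candidate
covariables, there are 7.5 events per covariable, which falls in
the first row of the scheme.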

Note that these rules are no more than rules of thumb, based on
limited research. It may, for example, be possible to develop a
reliable model by combining individual patient data from a small
study with group-level data from larger series in the literature
(Steyerberg et al., 2000b).