Interactive Textbook on Clinical Symptom Research

Chapter 8: Statistical Models for Prognostication: Problems with Regression Models

Small Data Sets

A major problem in developing a predictive model is that the data set available for analysis is often relatively small. When dichotomous outcomes are studied, the number of events, rather than the number of patients, is usually the determining factor for the effective size of the data set.


Which data set is the smallest for logistic regression analysis?

A. 250 patients, 25 events
B. 500 patients, 25 events
C. 1000 patients, 40 events
D. 2000 patients, 20 events

Some guidelines may be given regarding the number of candidate predictors that can reliably be studied in relation to the size of the data set. A well-known rule of thumb is the 1 in 10 rule (Harrell et al., 1984; Harrell et al., 1996; Peduzzi et al., 1996; Laupacis et al., 1997).

  • For linear models, this means that 1 candidate predictor can be studied for every 10 patients.
  • For logistic or Cox models, approximately 1 candidate predictor can be studied for every 10 events. For example, when 20 patients die in a study of 2000 patients, 2 candidate predictors can reliably be studied.
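
The arithmetic behind the 1 in 10 rule can be sketched in a few lines of Python. This is only an illustration of the rule of thumb as stated above; the function name max_candidate_predictors and its parameters are hypothetical and do not come from the cited papers.

```python
def max_candidate_predictors(n_patients, n_events=None, per_predictor=10):
    """Rule-of-thumb upper limit on the number of candidate predictors.

    Assumption (hypothetical helper, not from the cited papers):
    - linear models: pass only n_patients; the limit is based on patients.
    - logistic or Cox models: also pass n_events; the limit is then based
      on the number of events, not on the number of patients.
    """
    if per_predictor <= 0:
        raise ValueError("per_predictor must be positive")
    effective_size = n_patients if n_events is None else n_events
    if effective_size < 0:
        raise ValueError("counts must be non-negative")
    # Integer division: only whole predictors can be studied.
    return effective_size // per_predictor


# The example from the text: 20 deaths among 2000 patients.
print(max_candidate_predictors(2000, n_events=20))

# A linear model fitted on 250 patients.
print(max_candidate_predictors(250))
```

Note that the 2000 patients do not enter the calculation for the logistic or Cox case at all; only the 20 events do.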

The 1:10 rule is fairly crude and has been refined for logistic regression modeling. It was found that the 1:10 rule is actually a minimum for reliable modeling with fully pre-specified predictors (Steyerberg et al., 2000a). When the 1:10 rule is violated, there are too many candidate predictors for the data set, and overfitting will occur.

Another limit is the 1:20 rule; when this rule is violated, shrinkage of regression coefficients is required to obtain well-calibrated predictions. When this rule is satisfied, shrinkage is not necessary for pre-specified models.

A final limit is the 1:50 rule; when this rule is satisfied, stepwise selection with the default p-value of 5% can safely be applied, since the power for selection of true predictors is large, and the biases caused by stepwise selection are limited.

The 1:10, 1:20, and 1:50 criteria are shown schematically below. It is assumed that the number of candidate predictors has already been limited as far as possible. Candidates may be eliminated, for example, on the basis of external information, a narrow distribution of a covariable, or a large number of missing values.

Candidate covariables    Stepwise selection with p<0.05    Shrinkage of regression coefficients
>1:10 events             Very dangerous                    Necessary
1:10 - 1:20 events       Dangerous                         Advisable
1:20 - 1:50 events       Not advisable                     Not necessary
<1:50 events             No problem                        Not necessary
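
As a rough illustration, the scheme above can be written as a small lookup in Python. The function name modelling_advice is hypothetical, and the handling of values that fall exactly on a boundary (10, 20, or 50 events per candidate) is an assumption: a ratio that exactly meets a rule is treated as satisfying it.

```python
def modelling_advice(n_events, n_candidates):
    """Look up the rule-of-thumb advice for a logistic regression model.

    Returns a tuple (stepwise selection with p<0.05, shrinkage of
    regression coefficients), using the wording of the scheme in the text.
    Hypothetical helper; boundary handling is an assumption.
    """
    if n_candidates <= 0:
        raise ValueError("n_candidates must be positive")
    events_per_candidate = n_events / n_candidates
    if events_per_candidate < 10:    # more than 1 candidate per 10 events
        return ("Very dangerous", "Necessary")
    if events_per_candidate < 20:    # between 1:10 and 1:20
        return ("Dangerous", "Advisable")
    if events_per_candidate < 50:    # between 1:20 and 1:50
        return ("Not advisable", "Not necessary")
    return ("No problem", "Not necessary")  # 1:50 rule satisfied


# 20 events and 4 candidate covariables give 5 events per candidate.
stepwise, shrinkage = modelling_advice(n_events=20, n_candidates=4)
print(f"Stepwise selection: {stepwise}; shrinkage: {shrinkage}")
```

Such a lookup only restates the rules of thumb; it does not replace judgment about the data set at hand.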

Note that these rules are no more than rules of thumb, based on limited research. It may, for example, be possible to develop a reliable model by combining individual patient data from a small study with group-level data from larger series in the literature (Steyerberg et al., 2000b).

