The coding
of both categorical and continuous covariables merits attention
in the development of a predictive model.
Categorical covariables
may have just two values (binary),
e.g. gender (male/female) or presence of a risk factor (yes/no),
in which case the coding is easy. Categorical covariables may
also have more than two values, either with or without ordering (ordinal/nominal).
Categorical covariables may be coded as factor variables, which
implies that a reference category is chosen and that other values
are contrasted against this category by "dummy" variables.
Dummy
variables indicate whether a value is in a certain category
or not. For example, a three-category variable such as smoking
(current, ex-smoker, non-smoker) might be coded with two dummy
variables, one indicating whether the patient is a current smoker
and another indicating whether the patient is an ex-smoker,
with non-smokers as the reference category. Technically the
coding can be represented as follows:
| | Current | Exsmk |
|---|---|---|
| Non-smoker | 0 | 0 |
| Ex-smoker | 0 | 1 |
| Current | 1 | 0 |
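The coding in the table can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, not tied to any particular statistics package; the helper name `dummy_code` and the example patient list are hypothetical and chosen only for illustration.

```python
def dummy_code(values, categories, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference category."""
    if reference not in categories:
        raise ValueError(f"reference {reference!r} is not a known category")
    # One dummy variable per category other than the reference.
    non_reference = [c for c in categories if c != reference]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        rows.append({c: int(value == c) for c in non_reference})
    return rows


categories = ["non-smoker", "ex-smoker", "current"]
patients = ["non-smoker", "ex-smoker", "current"]  # hypothetical example data

coded = dummy_code(patients, categories, reference="non-smoker")
for status, row in zip(patients, coded):
    print(status, row)
```

A patient in the reference category has 0 on every dummy variable, so a variable with three categories needs only two dummies.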
For the
analysis, we might also combine the ex-smokers and current smokers
into a single category. This collapsing might be based on findings
in other studies or on the observed frequency distribution (e.g.
few ex-smokers). Because such a decision does not use the observed
relation between smoking and the outcome, it does not lead to bias
in the estimated coefficient.
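As a small illustration, collapsing replaces the two dummy variables by a single indicator. The sketch below assumes the same three smoking labels as above; the function name `ever_smoker` is hypothetical.

```python
def ever_smoker(status):
    """Collapse ex-smokers and current smokers into one category (1) versus non-smokers (0)."""
    known = {"non-smoker", "ex-smoker", "current"}
    if status not in known:
        raise ValueError(f"unknown category: {status!r}")
    return int(status in {"ex-smoker", "current"})


patients = ["non-smoker", "ex-smoker", "current"]  # hypothetical example data
collapsed = [ever_smoker(status) for status in patients]
```

The collapsed variable uses one degree of freedom instead of two, at the price of assuming that ex-smokers and current smokers have the same effect.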
The decision
to collapse may also be based on the observed regression coefficients
for "current" and "exsmk" (in either univariable or multivariable
analysis). The resulting regression coefficient will then no
longer be unbiased, since the decision to collapse was not taken
independently of the data being analysed. (For more about the philosophy of such aspects, see Problems
with Regression Models: Model
Uncertainty).
Continuous covariables
are frequently coded as linear terms in a regression model.
This linearity may be tested as described before (see: Theoretical
Aspects of Predictive Modeling: Linearity
Assumption). Alternatively, one might specify a
certain transformation in advance, based on biological reasoning or prior
knowledge.
Another
possibility would be to include flexible functions with a pre-specified
number of degrees of freedom, irrespective of the statistical
significance of certain non-linear terms. For example, if age
is known to be an important predictor, one might specify that
a restricted cubic spline function is fitted with 4 knots (Harrell
et al., 1988). This function has 3 degrees of freedom
(df), which implies that it has the flexibility to have two bends,
as illustrated in Figure 7.1.
Figure 7.1: Restricted Cubic Splines

Illustration of 3 restricted cubic spline functions with 4 knots
(3 degrees of freedom). The lines with markers represent the fitted
functions, which closely follow the underlying mathematically defined
functions.
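To make the link between 4 knots and 3 degrees of freedom concrete, the sketch below builds the columns of a restricted cubic spline in plain Python. It uses one common formulation (a truncated power basis, without the rescaling that some software applies). The function name `rcs_basis` and the knot locations are hypothetical; in practice knots are usually placed at quantiles of the observed covariable.

```python
def rcs_basis(x, knots):
    """Return the restricted cubic spline terms for one value x.

    With k knots the result has k - 1 entries: the linear term x
    plus k - 2 non-linear terms. The function is linear below the
    first knot and above the last knot.
    """
    t = sorted(float(k) for k in knots)
    if len(t) < 3 or len(set(t)) != len(t):
        raise ValueError("need at least 3 distinct knots")

    def pos_cube(u):
        # Truncated cubic: u**3 if u > 0, otherwise 0.
        return max(u, 0.0) ** 3

    last, penultimate = t[-1], t[-2]
    denom = last - penultimate
    terms = [float(x)]
    for tj in t[:-2]:
        terms.append(
            pos_cube(x - tj)
            - pos_cube(x - penultimate) * (last - tj) / denom
            + pos_cube(x - last) * (penultimate - tj) / denom
        )
    return terms


knots = [30, 45, 60, 75]  # hypothetical knot locations for age
design_rows = [rcs_basis(age, knots) for age in range(20, 101, 10)]
```

Each row of `design_rows` has three entries, so the regression model estimates three coefficients for age regardless of whether the non-linear terms turn out to be statistically significant.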