Should I use sum-to-zero contrasts?
A sum-to-zero contrast codes a categorical variable as deviations from a grand mean. Social scientists use them extensively. Should ecologists?
EDIT the first version of this post that went up had some bugs in it. Hopefully all fixed now.
More accurately, should I tell my students in Ecological Statistics to use them? Sum-to-zero contrasts are conceptually similar to centering a continuous variable by subtracting the mean from your predictor variable prior to analysis. Discussions of centering often end up conflated with scaling, which is dividing your predictor variable by a constant, usually the standard deviation, prior to analysis. Always scaling covariates prior to regression analysis is controversial advice. See for example Andrew Gelman’s blogpost and comments, or many crossvalidated questions such as this one which has links to many others. There is a good reference as well as some useful discussion in the comments of this question. In this post I focus on the effects of sum to zero contrasts for categorical variables and interactions with continuous variables.1
Here is my summary of the pros and cons of centering drawn from those references above.2
- Centering continuous variables eliminates collinearity between interaction and polynomial terms and the individual covariates that make them up.
- Centering does not affect inference about the covariates.
- Centering can improve the interpretability of the coefficients in a regression model, particularly because the intercept represents the value of the response at the mean of the predictor variables.
- Predicting out of sample data with a model fitted to centered data must be done carefully because the center of the out of sample data will be different from the fitted data.
- There may be some constant value other than the sample mean that makes more sense based on domain knowledge.
To make the discussion concrete, let me demonstrate with an example of the interaction between a continuous covariate and a categorical one. In the following I refer to the effect of individual covariates outside the interaction as the “simple effect” of that covariate.
The intercept of this model isn’t directly interpretable because it gives the average width at a length of zero, which is impossible. In addition, both the intercept and simple effect of length represent the change in width for only one species, setosa. The default contrast in R estimates coefficients for $k - 1$ levels of a factor. In the simple effect of a factor each coefficient is the difference between the first level (estimated by the intercept), and the named level. In the above, Speciesversicolor
has sepals that are 1.4 mm wider than setosa. The interaction coefficients
such as Sepal.Length:Speciesversicolor
tell us how much the effect of Sepal.Length in versicolor changes from that in setosa. So every mm of sepal length in versicolor increases sepal width by $0.8 - 0.48 = 0.32$ mm.
Maybe a plot will help.
I did something rather silly looking there because I wanted to see where the curves cross x = 0. That is where the estimates for the intercept and simple effect of species are calculated. The intercept is negative, and the line for setosa crosses x = 0 well below y = 0. The simple effect estimates of Species are both positive, with virginica being larger, and indeed the line for virginica crosses x = 0 at the highest point. Similarly, the simple effect of length is the slope of the line for setosa, and it is larger than the slopes of the other two species because the estimated interactions are both negative. But not centering really makes things ugly for direct interpretation of the estimated coefficients.
Centering the continuous variable gives us
Although the coefficients change because now the model estimates the differences between the species at the mean length, the t-statistics for the continuous covariate, including the interaction terms, do not change. The t-statistics for the intercept and simple effect of species do change.
That’s OK, because they are really testing something quite different from before. Just to confirm that the graph isn’t different.
What happens if we use sum to zero contrasts for species?
I can now directly interpret the intercept as the average width at the average length, averaged over species. Similarly the simple effect of length as the change in width averaged across species. This seems like a very useful set of coefficients to look at, particularly if my categorical covariate has many levels. You might ask (I did), what do the categorical coefficients mean? They represent deviations from the mean intercept for each species. OK, you say (I did), then where’s species virginica? After much head scratching, the answer is that it is the negative of the sum of the other two coefficients.
I just thought of something else. Are those “average” intercept and slope the same as what I would get if I only use cSepal.Length
?
Whoa! It is not the same, in fact it is radically different. Totally different conclusion about the “average” effect.
This is closer to the conclusion obtained with the interaction model.
I have seen assertions in some papers (particularly from social sciences), that using sum to zero contrasts (also called effects coding, I believe), allows the direct interpretation of lower order terms in an ANOVA table even in the presence of interactions.
If so, in this case I could say “Sepal Width differs significantly between species.” I’m not sure I believe that. The ANOVA table is identical between all 3 models, whether I use sum-to-zero contrasts or not. Why should the interpretation differ if I just change the parameterization of the model?
Explaining treatment contrasts to students is a pain. I’m not sure that these are any easier. I have a few thoughts about the effects of sum-to-zero contrasts and model averaging, but that will have to wait for a different post.
-
All the code for this post, including that not shown, can be found here. ↩
-
This stuff first appeared as a question on CrossValidated, but received no attention. ↩