An Introduction to Mixed Models for Experimental Psychology
Henrik Singmann and David Kellen
In order to increase statistical power and precision, many psychological experiments collect more than one data point from each participant, often across different experimental conditions. Such repeated measures pose a problem for most standard statistical procedures, such as ordinary least-squares regression or (between-subjects) analysis of variance (ANOVA), because those procedures assume that the data points are independent and identically distributed (henceforth iid). The iid assumption is composed of two parts: The assumption of identical distribution simply means that all observations are samples from the same underlying distribution. The independence assumption states that the probability of a data point taking on a specific value is independent of the values taken by all other data points. 1 In this chapter we are mainly concerned with the latter assumption.
It is easy to see that in the case of repeated measures the independence assumption is expected to be violated. Observations coming from the same participant are usually correlated; that is, they are more likely to be similar to each other than two observations coming from two different participants. For example, when measuring response latencies, a participant who is generally slower than his or her peers will respond comparatively slowly across conditions, thus making the data points from this participant correlated and nonindependent (i.e., a participant's rank in one condition is predictive of his or her rank in other conditions). More generally, one can expect violations of the iid assumption whenever data are collected from units of observation that are clustered in groups. Other examples are data from experiments collected in group settings, students within classrooms, or patients within hospitals. In such situations, one would expect observations within each cluster (i.e., a specific group, classroom, or hospital) to be more similar to each other than observations across clusters.
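The intuition behind this correlation can be made concrete with a toy simulation in R (a sketch we constructed for illustration; all names and numbers here are ours, not from the chapter). A shared per-participant "speed" component enters every observation from that participant, which by itself makes those observations correlated:

```r
# Illustrative simulation: a shared participant effect induces correlation
# between observations from the same participant.
set.seed(1)
n_participants <- 200
participant_speed <- rnorm(n_participants, sd = 1)  # slow vs. fast participants

# Two observations per participant, each = shared speed + independent noise:
obs1 <- participant_speed + rnorm(n_participants, sd = 1)
obs2 <- participant_speed + rnorm(n_participants, sd = 1)

cor(obs1, obs2)  # clearly positive (0.5 in expectation), violating independence
```

With equal variances for the shared and observation-specific components, the expected correlation between two observations from the same participant is 0.5; a procedure that treats all 400 observations as independent would therefore be overconfident.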
Unfortunately, compared to violations of other assumptions, such as the normality assumption or the assumption of variance homogeneity in ANOVA, standard statistical procedures are usually not robust to violations of the independence assumption (Judd, Westfall, & Kenny, 2012; Kenny & Judd, 1986). In a frequentist statistical framework such violations often lead to considerably inflated Type I error rates (i.e., false positives). More generally, such violations can produce overconfident results (e.g., standard errors that are too narrow).
In this chapter we describe a class of statistical models that is able to account for most of the cases of nonindependence typically encountered in psychological experiments: linear mixed-effects models (LMMs; e.g., Baayen, Davidson, & Bates, 2008), or mixed models for short. Mixed models are a generalization of ordinary regression that explicitly captures the dependency among data points via random-effects parameters. Compared to traditional analyses that ignore these dependencies, mixed models provide more accurate (and generalizable) estimates of the effects, improved statistical power, and noninflated Type I error rates. The recent popularity of mixed models largely comes down to the computational resources they require: In the absence of such resources, data analysts had to rely on simpler models that ignored the dependencies in the data and relied on closed-form estimates and asymptotic results. Fortunately, today one can fit most LMMs on any recent computer with sufficient RAM.
The remainder of this chapter is structured as follows: First, we introduce the concepts underlying mixed models and how they allow accounting for different types of nonindependence that can occur in psychological data. Next, we discuss how to set up a mixed model and how to perform statistical inference with it. Then, we discuss how to estimate a mixed model using the lme4 (Bates, Mächler, Bolker, & Walker, 2015) and afex (Singmann, Bolker, Westfall, & Aust, 2017) packages for the statistical programming language R (R Core Team, 2016). Finally, we provide an outlook on how to extend mixed models to handle nonnormal data (e.g., categorical responses).
Fixed Effects, Random Effects, and Nonindependence
The most important concept for understanding how to estimate and how to interpret mixed models is the distinction between fixed and random effects. 2 In experimental settings fixed effects are often of primary interest to the researcher and represent the overall or population-level average effect of a specific model term (i.e., main effect or interaction) or parameter on the dependent variable, irrespective of the random or stochastic variability that is present in the data. A statistically significant fixed effect should be interpreted in essentially the same way as a statistically significant test result for any given term in a standard ANOVA or regression model. Furthermore, for fixed effects one can easily test specific hypotheses among the factor levels (e.g., planned contrasts).
In contrast, random effects capture random or stochastic variability in the data that comes from different sources, such as participants or items. These sources of stochastic variability are the grouping variables or grouping factors for the random effects and always concern categorical variables (i.e., nominal variables such as condition, participant, or item); continuous variables cannot serve as grouping factors for random effects. In experimental settings, it is often useful to think about the random effects grouping factors as the part of the design a researcher wants to generalize over. For example, one is usually not interested in knowing whether or not two factor levels differ for a specific sample of participants (after all, this could be done simply by looking at the obtained means in a descriptive manner) but whether the data provide evidence that a difference holds in the population of participants the sample is drawn from. By specifying random effects in our model, we are able to factor out the idiosyncrasies of our sample and obtain a more general estimate of the fixed effects of interest. 3
The independence assumption of standard statistical models implies that one can only generalize across exactly one source of stochastic variability: the population from which each observation (i.e., row in most statistical software packages) is sampled. In psychology, the unit of observation is usually participants, but occasionally other units, such as items, are used instead (e.g., having words as the unit of observation is fairly common in psycholinguistic research). Importantly, the notion that the unit of observation represents a random effect is usually only an implicit part of a statistical model. In contrast, mixed models require an explicit specification of the random-effects structure embedded in the experimental design. As described earlier, the benefit of this extra step is that one can adequately capture a variety of dependencies that standard models cannot.
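To preview what such an explicit specification looks like, the following sketch shows lme4's formula syntax for declaring a random effect (variable names here are hypothetical placeholders, not from a real data set). The term `(1 | participant)` adds a random intercept for each participant, telling the model that observations sharing a participant are correlated:

```r
# Implicit independence vs. explicit random-effects specification.
# Both are ordinary R formula objects; only the second declares a
# grouping factor (participant) for a random effect.
f_standard <- rt ~ condition                       # standard model: rows assumed independent
f_mixed    <- rt ~ condition + (1 | participant)   # mixed model: random intercept per participant

# Fitting the mixed model would use lme4, e.g.:
# lme4::lmer(rt ~ condition + (1 | participant), data = dat)
```

The syntax to the right of the `|` names the grouping factor, which is why grouping factors must be categorical variables.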
In order to make the distinction and the role of random effects in mixed models clearer, let us consider a simple example (constructed after Baayen et al., 2008; Barr, Levy, Scheepers, & Tily, 2013). Assume you have obtained response latency data from I participants in K = 2 difficulty conditions, an easy condition that leads to fast responses and a hard condition that produces slow responses. For example, in both conditions, participants have to make binary judgments on the same groups of words: In the easy condition, they have to make animacy judgments (i.e., whether the word denotes a living thing). In the hard condition, participants have to (a) judge whether the object the word refers to is larger than a soccer ball and (b) whether it appears in the Northern Hemisphere; participants should only press a specific key if both judgments are positive. Moreover, assume that each participant provides responses to the same J words in each difficulty condition. Thus, difficulty is a repeated-measures factor, more specifically a within-subjects factor with J replicates for each participant in each cell of the design, but also a within-words factor with I replicates for each word in each cell of the design. Note that the cells of a design are given by the combination of all (fixed) factor levels. In the present example there are two cells, corresponding to the easy condition and the difficult condition, but in a 2 × 2 design we would have four cells instead. 4
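The data layout implied by this design can be sketched in a few lines of base R (a minimal illustration; the small values of I and J are chosen for display only and are not from the chapter):

```r
# Crossed design: I participants x J words x K = 2 difficulty conditions.
I <- 3; J <- 4  # tiny values for illustration; real experiments use many more
dat <- expand.grid(
  participant = factor(paste0("p", seq_len(I))),
  word        = factor(paste0("w", seq_len(J))),
  difficulty  = factor(c("easy", "hard"))
)

nrow(dat)  # I * J * 2 rows in total

# Difficulty is within-subjects: J replicates per participant per cell ...
table(dat$participant, dat$difficulty)
# ... and within-words: I replicates per word per cell.
table(dat$word, dat$difficulty)
```

The two tables make the two sources of clustering visible at once: every participant and every word appears in both cells of the design, so the data are nonindependent with respect to both grouping factors.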
Figure 1.1 illustrates the response ...