Getting my head around regression analysis Pt 1: Setting the problem

Regression analysis, and specifically mixed effect linear models (LMMs), is hard – harder than I thought based on what I learned in traditional statistics classes. ‘Modern’ mixed model approaches, although more powerful (they can handle more complex designs, lack of balance, crossed random factors, some kinds of non-normally distributed responses, etc.), also require a new set of conceptual tools.

This first post covers my process of understanding how to apply the magic of multiple regression to my experimental data. The next post will cover how the analysis was done using R.

The Basics

Before tackling my specific problem as a mixed effect linear model, it is important to review the basic building block: linear regression.

Linear regression is a standard way to build a model of your variables. You want to do this when [source]:

  • You have two variables: one dependent variable and one independent variable. Both variables are interval.
  • You want to express the relationship between the dependent variable and independent variable in a form of a line. That is, you want to express the relationship like y = ax + b, where x and y are the independent and dependent variables, respectively.

Also, there are four key concepts in linear regression that should be clear before you attempt extended techniques like LMM or GLM [source]:

1. Understand what centring does to your variables: Intercepts are pretty important in multilevel models, so centring is often required to make intercepts meaningful.
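As a small sketch (in Python, with made-up numbers), centring a predictor leaves the slope untouched but moves the intercept from "the prediction at x = 0" to "the prediction at the average x", which is what makes it meaningful:

```python
from statistics import mean

# Made-up data purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

def ols(xs, ys):
    """Simple least-squares fit returning (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / \
            sum((a - mx) ** 2 for a in xs)
    return my - slope * mx, slope

b0_raw, b1_raw = ols(x, y)            # intercept = prediction at x = 0
x_centred = [a - mean(x) for a in x]  # subtract the mean from each value
b0_cen, b1_cen = ols(x_centred, y)    # intercept = prediction at the mean of x

# The slope is unchanged, and the centred intercept equals mean(y)
```

The same logic carries over to multilevel models, where the intercept (and its random variation) is only interpretable if x = 0 is a meaningful value.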

2. Work with categorical and continuous predictors: You will want to use both dummy and effect coding in different situations.  Likewise, you want to be able to understand what it means if you make a variable continuous or categorical.  What different information do you get from it and what does it mean?  Even if you’re a regular ANOVA user, it may make sense to treat time as continuous, not categorical.
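To make the two coding schemes concrete, here is a minimal Python sketch for a hypothetical three-level factor (the level names are made up). Dummy (treatment) coding compares each level against a reference level; effect (sum) coding compares each level against the grand mean:

```python
def dummy_code(level, levels):
    # Treatment coding: the first level is the reference and maps to all
    # zeros; each resulting coefficient compares a level to that reference
    return [1 if level == other else 0 for other in levels[1:]]

def effect_code(level, levels):
    # Sum (effect) coding: the reference maps to all -1s; each resulting
    # coefficient compares a level to the grand mean
    if level == levels[0]:
        return [-1] * (len(levels) - 1)
    return [1 if level == other else 0 for other in levels[1:]]

levels = ["control", "treatA", "treatB"]  # hypothetical factor levels
```

The fitted model is the same either way; what changes is what the intercept and each coefficient mean when you read them off.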

3. Interactions: Make sure you can interpret interactions regardless of how many categorical and continuous variables they contain.  And make sure you can interpret an interaction regardless of whether the variables in the interaction are both continuous, both categorical, or one of each.
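As a minimal sketch (with hypothetical coefficient values), an interaction between a continuous x and a dummy-coded group d means the slope of x is different in each group:

```python
# Hypothetical fitted coefficients for y = b0 + b1*x + b2*d + b3*x*d,
# where x is continuous and d is a dummy-coded group (0 or 1)
b0, b1, b2, b3 = 2.0, 0.5, 1.0, -0.3

def predict(x, d):
    return b0 + b1 * x + b2 * d + b3 * x * d

# Slope of x within each group: b1 when d = 0, b1 + b3 when d = 1
slope_group0 = predict(1, 0) - predict(0, 0)
slope_group1 = predict(1, 1) - predict(0, 1)
```

So b3 is not "the effect of the interaction" in isolation: it is the *difference* in the slope of x between the two groups.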

4. Polynomial terms: Random slopes can be hard enough to grasp, and random curvature is worse. Be comfortable with polynomial functions if you have complex data (e.g. the Wundt curve, the bell-shaped relationship between positive affect and complexity in music).
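A quadratic term is how a bell-shaped relationship like the Wundt curve enters a linear model. A tiny Python sketch with hypothetical coefficients (a negative squared term gives the curve a peak):

```python
# Hypothetical coefficients for y = b0 + b1*x + b2*x^2;
# b2 < 0 bends the line into a bell-like curve with a single peak
b0, b1, b2 = 0.0, 4.0, -1.0

def predict(x):
    return b0 + b1 * x + b2 * x ** 2

# The curve peaks where the derivative b1 + 2*b2*x is zero
peak_x = -b1 / (2 * b2)
```

The model is still "linear" because it is linear in the coefficients; the curvature comes entirely from the x² column in the design matrix.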

Finally, understand how all these concepts fit together. This means understanding what the estimates in your model mean and how to interpret them.

What is a mixed effect linear model?

Simply put, they are statistical models of parameters that vary at more than one level. They are a generalised form of linear regression that builds multiple linked linear models to describe how predictors relate to the response.

Many kinds of data, including observational data collected in experiments, have a hierarchical or clustered structure. For example, children with the same parents tend to be more alike in their physical and mental characteristics than individuals chosen at random from the population at large. Individuals may be further nested within demographic and psychometric features. Multilevel data structures also arise in longitudinal studies where an individual’s responses over time are correlated with each other. In experimental data, an LMM is a good way to account for individual differences between participants. For example, some participants may be more comfortable with using touchscreens than others, and thus their performance in a task might have been better. If we tried to represent this with linear regression, the model would have to represent the data with one line. This aggressively aggregates differences that may matter if the results are to be effective and contextually understood.

Multilevel regression, intuitively, allows us to have a model for each group represented in the within-subject factors. In this way, we can also consider the individual differences of the participants (they will be described as differences between the models). What multilevel regression actually does sits somewhere between completely ignoring the within-subject factors (sticking with one model) and building a separate model for every single group (making n separate models for n participants). LMMs control for non-independence among the repeated observations for each individual by adding one or more random effects for individuals to the model. These take the form of additional residual terms, each of which has its own variance to be estimated. Roughly speaking, there are two strategies you can take for random effects: varying-intercept or varying-slope (or both). Varying-intercept means differences between groups are described as differences in intercepts. Varying-slope means the opposite: differences are described as changes in the coefficients (slopes) of some factors.
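That "somewhere in between" can be sketched as partial pooling: each group's estimate is shrunk toward the grand mean, and sparsely observed groups are shrunk harder. This toy Python version (made-up counts, an arbitrary fixed prior weight) only illustrates the idea; a real LMM estimates the amount of shrinkage from the data:

```python
from statistics import mean

# Hypothetical per-participant speech counts across trials
groups = {
    "p1": [12, 15, 11],
    "p2": [30, 28, 33],
    "p3": [20],  # few observations -> shrunk harder toward the grand mean
}

grand = mean(v for vals in groups.values() for v in vals)

def partial_pool(values, grand_mean, prior_weight=2.0):
    # A weighted compromise between the group mean ("no pooling") and the
    # grand mean ("complete pooling"); prior_weight is an arbitrary stand-in
    # for what an LMM would estimate from the between-group variance
    n = len(values)
    return (sum(values) + prior_weight * grand_mean) / (n + prior_weight)

estimates = {g: partial_pool(vals, grand) for g, vals in groups.items()}
```

Every estimate lands between its own group mean and the grand mean, which is exactly the behaviour of random intercepts in an LMM.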


Dependent/Response variable: the variable that you measure and expect to vary given experimental manipulation.

Independent/Explanatory/exogenous variables and Fixed effects are all variables that we expect to have an effect on the dependent/response variable. They are factors whose levels are experimentally determined, or whose interest lies in the specific effects of each level, such as effects of covariates, differences among treatments, and interactions.

Random effects are usually grouping factors for which we are trying to control. In repeated measures designs, they can be either crossed or hierarchical/nested, more on that later. Random effects are factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level. The parameters of random effects are the standard deviations of variation at a particular level (e.g. among experimental blocks).

The precise definitions of ‘fixed’ and ‘random’ are controversial; the status of particular variables depends on experimental design and context.

My Research Problem

In an experiment comparing Desktop (DT) computer and VR interfaces in a collaborative music-making task, I think that individual users and the dyadic session dynamics affect the amount of speech when doing the task and that the amount of talk will also be affected by media (DT/VR). Basically, the mixture of people and experimental condition will both have effects, but I really want to know the specific effect of media on speech amount.

Data structure

The dependent variable is the frequency of coded speech per user, while demographic surveys produced multiple explanatory variables along with the independent variable of media. So we also have a series of other variables that may affect the volume of communication. Altogether, the variables of interest for linear modelling include:

  • Media: media condition DT or VR.
  • User: repeated measure grouping by the participant ID.
  • Session: categorical dyad grouping e.g. A, B, C.
  • Utterance: a section of transcribed speech, a sentence or comparable unit. Utterance frequencies form the measured response.
  • Pam: Personal acquaintance measure, a psychometric method of evaluating how much you know another person.
  • VrScore: level of experience with VR, a simple one-to-seven score.
  • MsiPa: Musical sophistication index perceptual ability factor for each user.
  • MsiMt: Musical sophistication index musical training factor for each user.

Using the right tool

As I used a repeated measures design for the experiment, where each participant used both interfaces, Media is a within-subject factor. This means I need a statistical method that can account for it. A simple paired t-test or repeated measures ANOVA may be of use, but neither can include all of the explanatory variables; this leaves us with regression analysis. This decision tree highlights how to proceed with choosing the right form of regression analysis:

  1. If you have one independent variable and do not have any within-subject factor, consider Linear regression. If your dependent variable is binomial, Logistic regression may be more appropriate.
  2. If you have multiple independent variables and do not have any within-subject factor, consider Multiple linear regression.
  3. If you have any within-subject factor, consider Multi-level linear regression (mixed-effect linear model).
  4. For some special cases, consider the Generalized Linear Model (GLM) or Generalized Linear Mixed Model (GLMM).

So, at first I chose to use a mixed-effect linear model (LMM), as I am trying to fit a model that has two random intercepts, i.e. two grouping factors. As such, we are trying to fit a model with nested random effects.

Crossed or Nested random effects

As each User only appears once in each Session, the data can be treated as nested. For nested random effects, a factor appears only within a particular level of another factor; for crossed effects, a given factor appears in more than one level of another factor (e.g. Users appearing within more than one Session). An easy rule of thumb is that if your random effects aren’t nested, then they are crossed!
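One way to check the rule of thumb against your own data (a small Python sketch with hypothetical user/session pairs): the grouping is nested if every user maps to exactly one session:

```python
def is_nested(observations):
    """observations: (user, session) pairs. Returns True if each user
    appears in only one session (nested), False otherwise (crossed)."""
    seen = {}
    for user, session in observations:
        if user in seen and seen[user] != session:
            return False  # this user spans multiple sessions -> crossed
        seen[user] = session
    return True

# Hypothetical data: repeated observations are fine, switching sessions is not
nested_data = [("u1", "A"), ("u1", "A"), ("u2", "A"), ("u3", "B")]
crossed_data = [("u1", "A"), ("u1", "B"), ("u2", "A")]
```

The distinction matters because nested and crossed structures are declared differently when specifying the random effects of the model.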

Special Cases…GLMM

After a bit of further reading, I found out that my dependent variable meant a standard LMM was not suitable. As the response variable is count data of speech, it violates the assumptions of normal LMMs: when your dependent variable is not continuous, unbounded, and measured on an interval or ratio scale, your model will never meet the assumptions of linear mixed models. In steps the flexible, but highly sensitive, Generalised Linear Mixed Model (GLMM). The difference between LMMs and GLMMs is that the response variables can come from distributions besides Gaussian; for count data this is often a Poisson distribution. There are a few issues to keep in mind, though.

  1. Rather than modelling the responses directly, some link function is often applied, such as a log link. For Poisson, the link function (the transformation of Y) is the natural log. So all parameter estimates are on the log scale and need to be transformed for interpretation; this means applying the inverse of the link function, which for the log link is the exponential.
  2. It is often necessary to include an offset parameter in the model to account for the amount of exposure each individual had to the event; practically, this is a normalising factor such as the total number of utterances across the repeated conditions.
  3. One assumption of Poisson Models is that the mean and the variance are equal, but this assumption is often violated.  This can be dealt with by using a dispersion parameter if the difference is small or a negative binomial regression model if the difference is large.
  4. Sometimes there are many, many more zeros than even a Poisson model would indicate. This generally means there are two processes going on: there is some threshold that needs to be crossed before an event can occur. A Zero-Inflated Poisson model is a mixture model that simultaneously estimates the probability of crossing the threshold and, once crossed, how many events occur.
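To make the log-link arithmetic in point 1 concrete, here is a small Python sketch with hypothetical coefficient values: estimates live on the log scale, exp() recovers expected counts, exp(coefficient) is a rate ratio, and the offset from point 2 enters additively as log(exposure):

```python
import math

# Hypothetical coefficients on the log scale from a Poisson model:
b0 = 1.2      # intercept (log expected count for the reference media, DT)
b_vr = 0.35   # effect of VR relative to DT, still on the log scale

# Inverting the log link with exp() gives expected counts
rate_dt = math.exp(b0)
rate_vr = math.exp(b0 + b_vr)

# exp(coefficient) is a rate ratio: the multiplier VR applies to the count
rate_ratio = math.exp(b_vr)

# An offset such as log(total utterances) turns the model into one for
# rates; on the count scale it simply multiplies the expected value
exposure = 50
expected_count_vr = math.exp(b0 + b_vr + math.log(exposure))
```

This is why GLMM coefficients are reported as rate ratios after exponentiating: additive effects on the log scale become multiplicative effects on the count scale.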

Moving forward

In the next post, I will cover how this analysis is done in the R environment using the lme4 package.