This R notebook can be used as a template to compute regression with multiple explanatory variables. It is usable in RStudio. The content is adapted from the Multilevel Modelling online course from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).

Packages

library(xlsx) #read xlsx files

Data importation

#Text file
mydata <- read.table(file = "3.3.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")

Data exploration

dim(mydata) #number of rows and columns
## [1] 33988     9
str(mydata) #basic description of the dataframe
## 'data.frame':    33988 obs. of  9 variables:
##  $ caseid  : int  339 340 345 346 352 353 354 361 362 363 ...
##  $ score   : int  49 18 46 43 17 29 15 19 45 12 ...
##  $ cohort90: int  -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 ...
##  $ female  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sclass  : int  2 3 4 3 3 2 3 2 3 1 ...
##  $ sclass1 : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ sclass2 : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ sclass3 : int  0 1 0 1 1 0 1 0 1 0 ...
##  $ sclass4 : int  0 0 1 0 0 0 0 0 0 0 ...
mydata[1:20, ] #first 20 rows of the data
##    caseid score cohort90 female sclass sclass1 sclass2 sclass3 sclass4
## 1     339    49       -6      0      2       0       1       0       0
## 2     340    18       -6      0      3       0       0       1       0
## 3     345    46       -6      0      4       0       0       0       1
## 4     346    43       -6      0      3       0       0       1       0
## 5     352    17       -6      0      3       0       0       1       0
## 6     353    29       -6      0      2       0       1       0       0
## 7     354    15       -6      0      3       0       0       1       0
## 8     361    19       -6      0      2       0       1       0       0
## 9     362    45       -6      0      3       0       0       1       0
## 10    363    12       -6      0      1       1       0       0       0
## 11   6824     0       -4      0      1       1       0       0       0
## 12   6826     0       -4      0      3       0       0       1       0
## 13   6827    20       -4      0      2       0       1       0       0
## 14   6828    32       -4      0      1       1       0       0       0
## 15   6829     0       -4      0      2       0       1       0       0
## 16   6834    24       -4      0      3       0       0       1       0
## 17   6836    23       -4      0      2       0       1       0       0
## 18  13206     7       -2      0      3       0       0       1       0
## 19  13209    38       -2      0      3       0       0       1       0
## 20  13215    46       -2      0      1       1       0       0       0

Multiple linear regression model

\[y_i = \beta_0 + \beta_1 * x_{1i} + \beta_2 * x_{2i} + \beta_3 * x_{3i} + ... + \beta_p * x_{pi} + e_i\] Having viewed the data, we will now fit a multiple regression model that explains score (ratio variable: point score calculated from awards in Standard grades, ranging from 0 to 75, with a higher score indicating higher attainment) using cohort90 (ordinal variable: calculated by subtracting 1990 from the cohort year, ranging from -6 (1984) to 8 (1998), with 1990 coded as zero), gender (categorical variable: girls 1; boys 0) and social class (categorical variable: managerial and professional 1; intermediate 2; working 3; unclassified 4).

\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * sclass1_{i} + \\ \beta_4 * sclass2_{i} + \beta_5 * sclass4_{i} + e_i\] sclass3 is taken as the reference category for social class (this necessitates the creation of dummy variables); see Regression with a Single Categorical Explanatory Variable.
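As a sketch (toy data, hypothetical variable names), R can also build the dummy variables automatically from a factor, with relevel() choosing the reference category:

```r
# Sketch (toy data): instead of hand-made dummies, R builds them
# automatically from a factor; relevel() sets the reference category.
set.seed(1)
toy <- data.frame(score  = rnorm(100, mean = 30, sd = 10),
                  sclass = factor(sample(1:4, 100, replace = TRUE)))
toy$sclass <- relevel(toy$sclass, ref = "3")  # sclass3 as reference
fit_toy <- lm(score ~ sclass, data = toy)
names(coef(fit_toy))  # "(Intercept)" "sclass1" "sclass2" "sclass4"
```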

Fit of the model

fit <- lm(score ~ cohort90 + female + sclass1 + sclass2 + sclass4, 
data = mydata) 
summary(fit)
## 
## Call:
## lm(formula = score ~ cohort90 + female + sclass1 + sclass2 + 
##     sclass4, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.604 -10.645   0.535  10.576  53.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.70118    0.17240  137.47   <2e-16 ***
## cohort90     1.21136    0.01487   81.48   <2e-16 ***
## female       2.03169    0.15828   12.84   <2e-16 ***
## sclass1     13.18046    0.20472   64.38   <2e-16 ***
## sclass2      6.97933    0.20891   33.41   <2e-16 ***
## sclass4     -4.06504    0.29353  -13.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.56 on 33982 degrees of freedom
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.2932 
## F-statistic:  2821 on 5 and 33982 DF,  p-value: < 2.2e-16

The estimated intercept is 23.70, which is interpreted as the predicted attainment score when all explanatory variables are equal to zero, i.e. cohort90 = 0 (1990), female = 0 (boy), sclass1 = 0, sclass2 = 0, sclass4 = 0. In other words, it is the predicted score for a boy from the 1990 cohort with working-class (sclass3) parents.

\[ \hat{score}_i = 23.70 + 1.21 * cohort90 + 2.03 * female + \\ 13.18 * sclass1 + 6.98 * sclass2 - 4.07 * sclass4\]

Predictions

predscore <- predict(fit) #Predictions for every observation in the data
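As a sketch (toy data, hypothetical names), predict() on the estimation data is simply the design matrix multiplied by the fitted coefficients:

```r
# Sketch (toy data): predict() on the estimation data equals the
# design matrix times the fitted coefficient vector.
set.seed(2)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(40)
fit_toy <- lm(y ~ x1 + x2, data = toy)
manual <- as.vector(model.matrix(fit_toy) %*% coef(fit_toy))
all.equal(unname(predict(fit_toy)), manual)  # TRUE
```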

Residual (unexplained) variance

model <- summary(fit)
model$sigma^2
## [1] 211.887

% Explained variance vs. Unexplained variance

model$r.squared 
## [1] 0.2933134
1 - model$r.squared
## [1] 0.7066866
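These two quantities can be verified from first principles on a toy fit (hypothetical names): R-squared is 1 - RSS/TSS, and sigma^2 is RSS divided by the residual degrees of freedom.

```r
# Sketch (toy data): R-squared = 1 - RSS/TSS, sigma^2 = RSS / df.residual.
set.seed(3)
toy <- data.frame(x = rnorm(80))
toy$y <- 1.5 * toy$x + rnorm(80)
fit_toy <- lm(y ~ x, data = toy)
s <- summary(fit_toy)
rss <- sum(resid(fit_toy)^2)                      # residual sum of squares
tss <- sum((toy$y - mean(toy$y))^2)               # total sum of squares
all.equal(s$r.squared, 1 - rss / tss)             # TRUE
all.equal(s$sigma^2, rss / df.residual(fit_toy))  # TRUE
```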

Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these data are reported in summary(fit).

#t statistic for female: estimate divided by standard error
model$coefficients[3,1] / model$coefficients[3,2]
## [1] 12.83632
#significant if larger than 1.96 or smaller than -1.96, the critical values at the 5% level of significance. Girls score significantly higher than boys, holding cohort and social class constant.
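As a sketch (toy data, hypothetical names), the t value reported by summary() is exactly the estimate divided by its standard error, and the two-sided p-value follows from the t distribution:

```r
# Sketch (toy data): reproduce summary()'s t value and p-value by hand.
set.seed(4)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50)
ctab <- summary(lm(y ~ x, data = toy))$coefficients
manual_t <- ctab["x", "Estimate"] / ctab["x", "Std. Error"]
all.equal(manual_t, ctab["x", "t value"])  # TRUE
2 * pt(-abs(manual_t), df = 48)            # two-sided p-value, df = n - 2
```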

Assessing interaction effects (example 1)

mydata <- read.table(file = "3.4.txt", sep=",", header = TRUE)

Model without interaction effects

Multiple regression model with cohort and gender effects

\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + e_i\]

Fit of the model

fit <- lm(score ~ cohort90 + female, data = mydata) 
summary(fit)
## 
## Call:
## lm(formula = score ~ cohort90 + female, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.162 -11.739   0.629  12.175  47.175 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.72097    0.12449  238.75   <2e-16 ***
## cohort90     1.31592    0.01599   82.30   <2e-16 ***
## female       1.91324    0.17147   11.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.77 on 33985 degrees of freedom
## Multiple R-squared:  0.1702, Adjusted R-squared:  0.1702 
## F-statistic:  3486 on 2 and 33985 DF,  p-value: < 2.2e-16

\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * female_{i}\] For female = 0 (boys)

\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 0 = 29.72 + 1.32 * cohort90_{i}\] For female = 1 (girls)

\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * 1 = 29.72 + 1.91 + 1.32 * cohort90_{i} \\ = 31.63 + 1.32 * cohort90_{i} \] Boys and girls have different intercepts but the same slope

Plot of regression lines

predscore <- predict(fit) #Predictions for every observation in the data

# matrix preddata with a row for each combination of the explanatory variables
preddata <- cbind(mydata$cohort90, mydata$female, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "female", "predscore") 
preddata <- data.frame(preddata) 
preddata
##       cohort90 female predscore
## 1           -6      0  21.82547
## 11          -4      0  24.45730
## 18          -2      0  27.08913
## 23           0      0  29.72097
## 27           6      0  37.61647
## 31           8      0  40.24831
## 16056       -6      1  23.73871
## 16072       -4      1  26.37054
## 16079       -2      1  29.00238
## 16087        0      1  31.63421
## 16092        6      1  39.52972
## 16099        8      1  42.16155
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", 
ylab = "fitted values") 
lines(preddata$cohort90[preddata$female == 0], 
preddata$predscore[preddata$female == 0], lty = "solid") 
lines(preddata$cohort90[preddata$female == 1], 
preddata$predscore[preddata$female == 1], lty = "dashed") 
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05) 

Model with interaction effects

The interaction term is the product of the two explanatory variables; in R it can be specified directly in the formula as cohort90:female, without creating a new column.
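As a sketch (toy data, hypothetical names), specifying cohort90:female in the formula is equivalent to adding a hand-made product column; the two fits give identical coefficients:

```r
# Sketch (toy data): formula interaction vs. hand-made product column.
set.seed(5)
toy <- data.frame(cohort90 = sample(-6:8, 200, replace = TRUE),
                  female   = rbinom(200, 1, 0.5))
toy$score <- 30 + 1.3 * toy$cohort90 + 2 * toy$female +
  0.1 * toy$cohort90 * toy$female + rnorm(200, sd = 5)
toy$cohort90Xfemale <- toy$cohort90 * toy$female
fit_a <- lm(score ~ cohort90 + female + cohort90:female, data = toy)
fit_b <- lm(score ~ cohort90 + female + cohort90Xfemale, data = toy)
all.equal(unname(coef(fit_a)), unname(coef(fit_b)))  # TRUE
```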

Multiple regression model with cohort, gender and interaction between cohort and gender effects

\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * cohort90Xfemale_{i} + e_i\]

Fit of the model

fit <- lm(score ~ cohort90 + female + cohort90:female, data = mydata) 
summary(fit)
## 
## Call:
## lm(formula = score ~ cohort90 + female + cohort90:female, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.539 -11.725   0.584  12.193  46.830 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     29.72546    0.12448 238.806  < 2e-16 ***
## cohort90         1.25927    0.02335  53.941  < 2e-16 ***
## female           1.88607    0.17164  10.988  < 2e-16 ***
## cohort90:female  0.10665    0.03203   3.329 0.000871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.77 on 33984 degrees of freedom
## Multiple R-squared:  0.1705, Adjusted R-squared:  0.1704 
## F-statistic:  2328 on 3 and 33984 DF,  p-value: < 2.2e-16

\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * female_{i} + 0.11 * cohort90Xfemale_i\]

For female = 0 (boys)

\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 0 + 0.11 * cohort90_{i} * 0 \\ = 29.72 + 1.26 * cohort90_{i}\]

For female = 1 (girls)

\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 1 + 0.11 * cohort90_{i} * 1 \\ = (29.72 + 1.89) + (1.26 + 0.11) * cohort90_{i} \\ = 31.61 + 1.37 * cohort90_{i}\]

Boys and girls now have different intercepts and different slopes

Plot of regression lines

predscore <- predict(fit) 
preddata <- cbind(mydata$cohort90, mydata$female, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "female", "predscore") 
preddata <- data.frame(preddata) 
preddata
##       cohort90 female predscore
## 1           -6      0  22.16981
## 11          -4      0  24.68836
## 18          -2      0  27.20691
## 23           0      0  29.72546
## 27           6      0  37.28110
## 31           8      0  39.79965
## 16056       -6      1  23.41596
## 16072       -4      1  26.14782
## 16079       -2      1  28.87967
## 16087        0      1  31.61153
## 16092        6      1  39.80709
## 16099        8      1  42.53895
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", 
ylab = "fitted values") 
lines(preddata$cohort90[preddata$female == 0], 
preddata$predscore[preddata$female == 0], lty = "solid") 
lines(preddata$cohort90[preddata$female == 1], 
preddata$predscore[preddata$female == 1], lty = "dashed") 
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05) 

model <- summary(fit)
model$sigma^2
## [1] 248.7021

% Explained variance vs. Unexplained variance

model$r.squared 
## [1] 0.1704787
1 - model$r.squared
## [1] 0.8295213

Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the interaction effect between cohort and gender. All these values are reported in summary(fit).

model$coefficients[4,1] / model$coefficients[4,2]
## [1] 3.329339
#significant if larger than 1.96 or smaller than -1.96, the critical values at the 5% level of significance. There is an interaction effect between gender and cohort: the gap between girls and boys widens across cohorts in favour of girls.

Assessing interaction effects (example 2)

We will now consider whether the effect of cohort on attainment depends on parental social class.

Creation of dummy variables

mydata$cohort90Xsclass1 <- mydata$cohort90*mydata$sclass1 
mydata$cohort90Xsclass2 <- mydata$cohort90*mydata$sclass2 
mydata$cohort90Xsclass4 <- mydata$cohort90*mydata$sclass4 

Multiple regression model with cohort, social class and interaction between cohort and social class effects

\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * sclass1_{i} + \beta_3 * sclass2_{i} + \\ \beta_4 * sclass4_{i} + \beta_5 * cohort90_{i}Xsclass1_{i} + \beta_6 * cohort90_{i}Xsclass2_{i} \\ + \beta_7 * cohort90_{i}Xsclass4_{i} + e_i\]

Fit of the model

fit <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
summary(fit)
## 
## Call:
## lm(formula = score ~ cohort90 + sclass1 + sclass2 + sclass4 + 
##     cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46.54 -10.62   0.46  10.51  53.75 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.89051    0.15042 165.471  < 2e-16 ***
## cohort90          1.37829    0.02883  47.803  < 2e-16 ***
## sclass1          13.22412    0.20583  64.246  < 2e-16 ***
## sclass2           6.86521    0.20941  32.783  < 2e-16 ***
## sclass4          -4.39256    0.29465 -14.908  < 2e-16 ***
## cohort90Xsclass1 -0.32508    0.03863  -8.415  < 2e-16 ***
## cohort90Xsclass2 -0.24468    0.03987  -6.137 8.49e-10 ***
## cohort90Xsclass4  0.18398    0.05352   3.438 0.000587 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.56 on 33980 degrees of freedom
## Multiple R-squared:  0.2928, Adjusted R-squared:  0.2926 
## F-statistic:  2010 on 7 and 33980 DF,  p-value: < 2.2e-16

\[\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 * sclass1_{i} + 6.86 * sclass2_{i} \\ - 4.39 * sclass4_{i} - 0.33 * cohort90Xsclass1_{i} - 0.24 * cohort90Xsclass2_{i} \\ + 0.18 * cohort90Xsclass4_{i} \]

For sclass = 1

\[\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 - 0.33 * cohort90_{i} * 1 \\ = (24.90 + 13.22) + (1.38 - 0.33) * cohort90_{i} \\ = 38.12 + 1.05 * cohort90_{i}\]

For sclass = 2

\[\hat{score_i} = 31.756 + 1.134 * cohort90_i\]

For sclass = 3

\[\hat{score_i} = 24.891 + 1.378 * cohort90_i\]

For sclass = 4

\[\hat{score_i} = 20.498 + 1.562 * cohort90_i\]
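The intercept and slope of each class line can be checked by adding the relevant main effect and interaction term to the reference (sclass3) values; the estimates below are copied from the summary above.

```r
# Sketch: derive each class line from the fitted coefficients
# (estimates copied from the regression summary).
b <- c(intercept = 24.89051, cohort90 = 1.37829,
       sclass2   = 6.86521,  c90Xs2   = -0.24468,
       sclass4   = -4.39256, c90Xs4   = 0.18398)
round(unname(b["intercept"] + b["sclass2"]), 3)  # 31.756 (sclass2 intercept)
round(unname(b["cohort90"]  + b["c90Xs2"]), 3)   # 1.134  (sclass2 slope)
round(unname(b["intercept"] + b["sclass4"]), 3)  # 20.498 (sclass4 intercept)
round(unname(b["cohort90"]  + b["c90Xs4"]), 3)   # 1.562  (sclass4 slope)
```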

Plot of regression lines

predscore <- predict(fit) 
preddata <- cbind(mydata$cohort90, mydata$sclass, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "sclass", "predscore") 
preddata <- data.frame(preddata) 
preddata 
##     cohort90 sclass predscore
## 1         -6      2  24.95404
## 2         -6      3  16.62074
## 3         -6      4  11.12430
## 10        -6      1  31.79533
## 11        -4      1  33.90177
## 12        -4      3  19.37733
## 13        -4      2  27.22127
## 18        -2      3  22.13392
## 20        -2      1  36.00820
## 23         0      1  38.11463
## 24         0      3  24.89051
## 27         6      4  29.87160
## 28         6      1  44.43392
## 29         6      3  33.16028
## 30         6      2  38.55741
## 31         8      1  46.54035
## 32         8      3  35.91687
## 33         8      2  40.82464
## 52        -2      2  29.48850
## 59         0      2  31.75572
## 64         0      4  20.49795
## 96        -4      4  14.24885
## 111       -2      4  17.37340
## 132        8      4  32.99614
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(10, 50)) 
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1],lty = "solid") 
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2],lty = "dashed") 
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") 
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass = 4"), lty = c("solid", "dashed", "dotted", "dotdash"), inset = 0.05) 

The graph shows a reduction in social class achievement gaps over successive cohorts.

Plot of regression lines (with sclass3 as reference)

preddata$predscore <- preddata$predscore - (24.891 + 1.378*preddata$cohort90)
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(-10, 15), xlim = c(-6, 10)) 
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1], lty = "solid") 
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2], lty = "dashed") 
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") 
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass = 4"), lty = c("solid", "dashed", "dotted", "dotdash"), inset = 0.05) 

The horizontal line with a fitted value of 0 represents the reference category social class 3. This graph shows clearly the reduction in the social class achievement gaps over time.

Hypothesis testing

We want to test the significance of the interaction model against the main-effects model. The null hypothesis for a significance test of the cohort-by-class interaction is that the coefficients of all interaction terms are equal to zero, i.e. \(H_0 : \beta_5 = \beta_6 = \beta_7 = 0\) vs. \(H_1 :\) at least one of \(\beta_5, \beta_6, \beta_7\) is different from zero.

The standard way of testing whether a set of regression coefficients are equal to zero is to use a nested F-test.

The F-statistic can be computed with the anova command, which tests the restricted model (main-effects model) against the unrestricted model (interaction model).

fitrestricted <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4, data = mydata) 
anova(fitrestricted, fit) 
## Analysis of Variance Table
## 
## Model 1: score ~ cohort90 + sclass1 + sclass2 + sclass4
## Model 2: score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + 
##     cohort90Xsclass2 + cohort90Xsclass4
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  33983 7235256                                  
## 2  33980 7205635  3     29620 46.561 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is tiny, so we can safely reject the null and conclude that there is evidence of an interaction effect between cohort and social class.
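The nested F statistic can be reproduced by hand from the two residual sums of squares in the anova table above:

```r
# Sketch: nested F test by hand, using the RSS and df values from anova().
rss_r <- 7235256; df_r <- 33983  # restricted (main-effects) model
rss_u <- 7205635; df_u <- 33980  # unrestricted (interaction) model
f_stat <- ((rss_r - rss_u) / (df_r - df_u)) / (rss_u / df_u)
round(f_stat, 2)  # 46.56
pf(f_stat, df1 = df_r - df_u, df2 = df_u, lower.tail = FALSE)  # < 2.2e-16
```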