This R notebook can be used as a template for computing a regression with multiple explanatory variables. It is usable in RStudio. The content is adapted from the Multilevel Modelling online course from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).

# Packages

library(xlsx) #read xlsx files

# Data importation

#Text file
mydata <- read.table(file = "3.3.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")

# Data exploration

dim(mydata) #number of rows and columns
## [1] 33988     9
str(mydata) #basic description of the dataframe
## 'data.frame':    33988 obs. of  9 variables:
##  $ caseid  : int  339 340 345 346 352 353 354 361 362 363 ...
##  $ score   : int  49 18 46 43 17 29 15 19 45 12 ...
##  $ cohort90: int  -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 ...
##  $ female  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sclass  : int  2 3 4 3 3 2 3 2 3 1 ...
##  $ sclass1 : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ sclass2 : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ sclass3 : int  0 1 0 1 1 0 1 0 1 0 ...
##  $ sclass4 : int  0 0 1 0 0 0 0 0 0 0 ...

mydata[1:20, ] #first 20 rows of the data
##    caseid score cohort90 female sclass sclass1 sclass2 sclass3 sclass4
## 1     339    49       -6      0      2       0       1       0       0
## 2     340    18       -6      0      3       0       0       1       0
## 3     345    46       -6      0      4       0       0       0       1
## 4     346    43       -6      0      3       0       0       1       0
## 5     352    17       -6      0      3       0       0       1       0
## 6     353    29       -6      0      2       0       1       0       0
## 7     354    15       -6      0      3       0       0       1       0
## 8     361    19       -6      0      2       0       1       0       0
## 9     362    45       -6      0      3       0       0       1       0
## 10    363    12       -6      0      1       1       0       0       0
## 11   6824     0       -4      0      1       1       0       0       0
## 12   6826     0       -4      0      3       0       0       1       0
## 13   6827    20       -4      0      2       0       1       0       0
## 14   6828    32       -4      0      1       1       0       0       0
## 15   6829     0       -4      0      2       0       1       0       0
## 16   6834    24       -4      0      3       0       0       1       0
## 17   6836    23       -4      0      2       0       1       0       0
## 18  13206     7       -2      0      3       0       0       1       0
## 19  13209    38       -2      0      3       0       0       1       0
## 20  13215    46       -2      0      1       1       0       0       0

# Multiple linear regression model

$y_i = \beta_0 + \beta_1 * x_{1i} + \beta_2 * x_{2i} + \beta_3 * x_{3i} + ... + \beta_p * x_{pi} + e_i$

Having viewed the data, we will now fit a multiple regression model that explains score (ratio variable: point score calculated from awards in Standard grades; scores range from 0 to 75, with a higher score indicating higher attainment) from cohort (ordinal variable: calculated by subtracting 1990 from each value, ranging from -6 -corresponding to 1984- to 8 -1998-, with 1990 coded as zero), gender (categorical variable: girls 1; boys 0) and social class (categorical variable: managerial and professional 1; intermediate 2; working 3; unclassified 4).
$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * sclass1_{i} + \\ \beta_4 * sclass2_{i} + \beta_5 * sclass4_{i} + e_i$

sclass3 is taken as the reference category for social class (this necessitates the creation of dummy variables). See Regression with a Single Categorical Explanatory Variable.

## Fit of the model

fit <- lm(score ~ cohort90 + female + sclass1 + sclass2 + sclass4, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female + sclass1 + sclass2 +
##     sclass4, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.604 -10.645   0.535  10.576  53.209
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.70118    0.17240  137.47   <2e-16 ***
## cohort90     1.21136    0.01487   81.48   <2e-16 ***
## female       2.03169    0.15828   12.84   <2e-16 ***
## sclass1     13.18046    0.20472   64.38   <2e-16 ***
## sclass2      6.97933    0.20891   33.41   <2e-16 ***
## sclass4     -4.06504    0.29353  -13.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 33982 degrees of freedom
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.2932
## F-statistic:  2821 on 5 and 33982 DF,  p-value: < 2.2e-16

The estimated intercept is 23.70, which is interpreted as the predicted attainment score when all explanatory variables are equal to zero, i.e. cohort90 = 0 (1990), female = 0 (boy), sclass1 = 0, sclass2 = 0, sclass4 = 0. It is the expected score for a boy in the 1990 cohort with a working class (sclass3) parent.

$\hat{score}_i = 23.70 + 1.21 * cohort90 + 2.03 * female + \\ 13.18 * sclass1 + 6.98 * sclass2 - 4.07 * sclass4$

## Predictions

predscore <- predict(fit) #Predictions for every observation in the data

## Residual (unexplained) variance

model <- summary(fit)
model$sigma^2
## [1] 211.887

### % Explained variance vs. Unexplained variance

model$r.squared
## [1] 0.2933134
1 - model$r.squared
## [1] 0.7066866

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these values are reported in summary(fit).

#t statistic for the female coefficient (girls vs. boys)
model$coefficients[3,1] / model$coefficients[3,2]
## [1] 12.83632
#significant if larger than 1.96 or smaller than -1.96, the critical values at the 5% level of significance.

Being a girl increases the expected score, holding cohort and social class constant.
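Rather than comparing the test statistic with 1.96 by eye, the two-sided p-value can be computed directly; a minimal sketch, using the `model` object from above and the normal approximation (with 33982 residual degrees of freedom the normal and t distributions are practically identical):

```r
#Two-sided p-value for the female coefficient from its t statistic
t_female <- model$coefficients[3, 1] / model$coefficients[3, 2]
2 * pnorm(-abs(t_female)) #effectively zero, matching Pr(>|t|) in summary(fit)
```

`t_female` is an illustrative name; the same value appears in the "t value" column of summary(fit).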

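As an aside, the hand-made dummy variables sclass1, sclass2 and sclass4 used above can also be generated automatically by R when social class is declared as a factor. A sketch, assuming the same mydata with sclass coded 1-4 (sclassf and fit2 are illustrative names):

```r
#Declare sclass as a factor with level 3 (working class) as the reference,
#so lm() builds the dummy variables itself
mydata$sclassf <- relevel(factor(mydata$sclass), ref = "3")
fit2 <- lm(score ~ cohort90 + female + sclassf, data = mydata)
summary(fit2) #same fit as with the manual sclass1/sclass2/sclass4 dummies
```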
# Assessing interaction effects (example 1)

mydata <- read.table(file = "3.4.txt", sep=",", header = TRUE)

## Model without interaction effects

Multiple regression model with cohort and gender effects

$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + e_i$

### Fit of the model

fit <- lm(score ~ cohort90 + female, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -42.162 -11.739   0.629  12.175  47.175
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.72097    0.12449  238.75   <2e-16 ***
## cohort90     1.31592    0.01599   82.30   <2e-16 ***
## female       1.91324    0.17147   11.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.77 on 33985 degrees of freedom
## Multiple R-squared:  0.1702, Adjusted R-squared:  0.1702
## F-statistic:  3486 on 2 and 33985 DF,  p-value: < 2.2e-16

$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * female_{i}$

For female = 0 (boys)

$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 0 = 29.72 + 1.32 * cohort90_{i}$

For female = 1 (girls)

$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * 1 = 29.72 + 1.91 + 1.32 * cohort90_{i} \\ = 31.63 + 1.32 * cohort90_{i}$

Boys and girls have different intercepts but the same slope.

### Plot of regression lines

predscore <- predict(fit) #Predictions for every observation in the data

# matrix preddata with a row for each combination of the explanatory variables
preddata <- cbind(mydata$cohort90, mydata$female, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "female", "predscore")
preddata <- data.frame(preddata)
preddata
##       cohort90 female predscore
## 1           -6      0  21.82547
## 11          -4      0  24.45730
## 18          -2      0  27.08913
## 23           0      0  29.72097
## 27           6      0  37.61647
## 31           8      0  40.24831
## 16056       -6      1  23.73871
## 16072       -4      1  26.37054
## 16079       -2      1  29.00238
## 16087        0      1  31.63421
## 16092        6      1  39.52972
## 16099        8      1  42.16155
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90",
ylab = "fitted values")
lines(preddata$cohort90[preddata$female == 0],
preddata$predscore[preddata$female == 0], lty = "solid")
lines(preddata$cohort90[preddata$female == 1],
preddata$predscore[preddata$female == 1], lty = "dashed")
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05)

## Model with interaction effects

### Adding a column being the product of the two explanatory variables

Multiple regression model with cohort, gender and interaction between cohort and gender effects

$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * cohort90Xfemale_{i} + e_i$

### Fit of the model

fit <- lm(score ~ cohort90 + female + cohort90:female, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female + cohort90:female, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -42.539 -11.725   0.584  12.193  46.830
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     29.72546    0.12448 238.806  < 2e-16 ***
## cohort90         1.25927    0.02335  53.941  < 2e-16 ***
## female           1.88607    0.17164  10.988  < 2e-16 ***
## cohort90:female  0.10665    0.03203   3.329 0.000871 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.77 on 33984 degrees of freedom
## Multiple R-squared:  0.1705, Adjusted R-squared:  0.1704
## F-statistic:  2328 on 3 and 33984 DF,  p-value: < 2.2e-16

$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * female_{i} + 0.11 * cohort90Xfemale_i$

For female = 0 (boys)

$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 0 + 0.11 * cohort90_{i} * 0 \\ = 29.72 + 1.26 * cohort90_{i}$

For female = 1 (girls)

$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 1 + 0.11 * cohort90_{i} * 1 \\ = (29.72 + 1.89) + (1.26 + 0.11) * cohort90_{i} \\ = 31.61 + 1.37 * cohort90_{i}$

Boys and girls now have different intercepts and different slopes.

### Plot of regression lines

predscore <- predict(fit)
preddata <- cbind(mydata$cohort90, mydata$female, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "female", "predscore")
preddata <- data.frame(preddata)
preddata
##       cohort90 female predscore
## 1           -6      0  22.16981
## 11          -4      0  24.68836
## 18          -2      0  27.20691
## 23           0      0  29.72546
## 27           6      0  37.28110
## 31           8      0  39.79965
## 16056       -6      1  23.41596
## 16072       -4      1  26.14782
## 16079       -2      1  28.87967
## 16087        0      1  31.61153
## 16092        6      1  39.80709
## 16099        8      1  42.53895
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90",
ylab = "fitted values")
lines(preddata$cohort90[preddata$female == 0],
preddata$predscore[preddata$female == 0], lty = "solid")
lines(preddata$cohort90[preddata$female == 1],
preddata$predscore[preddata$female == 1], lty = "dashed")
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05)

model <- summary(fit)
model$sigma^2
## [1] 248.7021

### % Explained variance vs. Unexplained variance

model$r.squared
## [1] 0.1704787
1 - model$r.squared
## [1] 0.8295213

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the interaction effect in the model. All these values are reported in summary(fit).

#t statistic for the cohort90:female interaction
model$coefficients[4,1] / model$coefficients[4,2]
## [1] 3.329339
#significant if larger than 1.96 or smaller than -1.96, the critical values at the 5% level of significance.

There is an interaction effect between gender and cohort: the difference between girls and boys increases across cohorts, in favour of girls.

# Assessing interaction effects (example 2)

We will now consider whether the effect of cohort on attainment depends on parental social class.

## Creation of interaction variables

mydata$cohort90Xsclass1 <- mydata$cohort90*mydata$sclass1
mydata$cohort90Xsclass2 <- mydata$cohort90*mydata$sclass2
mydata$cohort90Xsclass4 <- mydata$cohort90*mydata$sclass4
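The hand-made product columns are equivalent to letting R's formula interface build the interaction. A sketch, assuming sclass is first converted to a factor with level 3 as the reference (sclassf and fit_int are illustrative names):

```r
#cohort90 * sclassf expands to cohort90 + sclassf + cohort90:sclassf,
#reproducing the main effects plus all cohort-by-class interaction terms
mydata$sclassf <- relevel(factor(mydata$sclass), ref = "3")
fit_int <- lm(score ~ cohort90 * sclassf, data = mydata)
summary(fit_int)
```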

Multiple regression model with cohort, social class and interaction between cohort and social class effects

$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * sclass1_{i} + \beta_3 * sclass2_{i} + \\ \beta_4 * sclass4_{i} + \beta_5 * cohort90Xsclass1_{i} + \beta_6 * cohort90Xsclass2_{i} \\ + \beta_7 * cohort90Xsclass4_{i} + e_i$

## Fit of the model

fit <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + sclass1 + sclass2 + sclass4 +
##     cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -46.54 -10.62   0.46  10.51  53.75
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)      24.89051    0.15042 165.471  < 2e-16 ***
## cohort90          1.37829    0.02883  47.803  < 2e-16 ***
## sclass1          13.22412    0.20583  64.246  < 2e-16 ***
## sclass2           6.86521    0.20941  32.783  < 2e-16 ***
## sclass4          -4.39256    0.29465 -14.908  < 2e-16 ***
## cohort90Xsclass1 -0.32508    0.03863  -8.415  < 2e-16 ***
## cohort90Xsclass2 -0.24468    0.03987  -6.137 8.49e-10 ***
## cohort90Xsclass4  0.18398    0.05352   3.438 0.000587 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 33980 degrees of freedom
## Multiple R-squared:  0.2928, Adjusted R-squared:  0.2926
## F-statistic:  2010 on 7 and 33980 DF,  p-value: < 2.2e-16

$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 * sclass1_{i} + 6.86 * sclass2_{i} \\ - 4.39 * sclass4_{i} - 0.33 * cohort90Xsclass1_{i} - 0.24 * cohort90Xsclass2_{i} \\ + 0.18 * cohort90Xsclass4_{i}$

For sclass = 1

$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 - 0.33 * cohort90_{i} * 1 \\ = (24.90 + 13.22) + (1.38 - 0.33) * cohort90_{i} \\ = 38.12 + 1.05 * cohort90_{i}$

For sclass = 2

$\hat{score_i} = 31.756 + 1.133 * cohort90_i$

For sclass = 3

$\hat{score_i} = 24.891 + 1.378 * cohort90_i$

For sclass = 4

$\hat{score_i} = 20.498 + 1.562 * cohort90_i$

## Plot of regression lines

predscore <- predict(fit)
preddata <- cbind(mydata$cohort90, mydata$sclass, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "sclass", "predscore")
preddata <- data.frame(preddata)
preddata 
##     cohort90 sclass predscore
## 1         -6      2  24.95404
## 2         -6      3  16.62074
## 3         -6      4  11.12430
## 10        -6      1  31.79533
## 11        -4      1  33.90177
## 12        -4      3  19.37733
## 13        -4      2  27.22127
## 18        -2      3  22.13392
## 20        -2      1  36.00820
## 23         0      1  38.11463
## 24         0      3  24.89051
## 27         6      4  29.87160
## 28         6      1  44.43392
## 29         6      3  33.16028
## 30         6      2  38.55741
## 31         8      1  46.54035
## 32         8      3  35.91687
## 33         8      2  40.82464
## 52        -2      2  29.48850
## 59         0      2  31.75572
## 64         0      4  20.49795
## 96        -4      4  14.24885
## 111       -2      4  17.37340
## 132        8      4  32.99614
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(10, 50))
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1],lty = "solid")
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2],lty = "dashed")
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted")
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass = 4"), lty = c("solid", "dashed", "dotted", "dotdash"), inset = 0.05)

The graph shows a reduction in social class achievement gaps over successive cohorts.

## Plot of regression lines (with sclass3 as reference)

preddata$predscore <- preddata$predscore - (24.891 + 1.378*preddata$cohort90)
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90",
ylab = "fitted values", ylim = c(-10, 15), xlim = c(-6, 10))
lines(preddata$cohort90[preddata$sclass == 1],
preddata$predscore[preddata$sclass == 1], lty = "solid")
lines(preddata$cohort90[preddata$sclass == 2],
preddata$predscore[preddata$sclass == 2], lty = "dashed")
lines(preddata$cohort90[preddata$sclass == 3],
preddata$predscore[preddata$sclass == 3], lty = "dotted")
lines(preddata$cohort90[preddata$sclass == 4],
preddata$predscore[preddata$sclass == 4], lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass = 4"),
lty = c("solid", "dashed", "dotted", "dotdash"), inset = 0.05)

The horizontal line with a fitted value of 0 represents the reference category social class 3. This graph shows clearly the reduction in the social class achievement gaps over time.

## Hypothesis testing

We want to test the significance of the interaction model against the main effects model. The null hypothesis for a significance test of the cohort-by-class interaction is that the coefficients of all interaction terms are equal to zero, i.e. $H_0 : \beta_5 = \beta_6 = \beta_7 = 0$ vs. $H_1 :$ at least one of $\beta_5, \beta_6, \beta_7$ is non-zero.

The standard way of testing whether a set of regression coefficients are equal to zero is to use a nested F-test.

The calculation of the F-statistic can be done with the anova command, which tests the restricted model (main effects model) against the unrestricted model (interaction model).

fitrestricted <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4, data = mydata) 
anova(fitrestricted, fit) 
## Analysis of Variance Table
##
## Model 1: score ~ cohort90 + sclass1 + sclass2 + sclass4
## Model 2: score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 +
##     cohort90Xsclass2 + cohort90Xsclass4
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1  33983 7235256
## 2  33980 7205635  3     29620 46.561 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is tiny, so we can safely reject the null and conclude that there is evidence of an interaction effect between cohort and social class.
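As a check, the F statistic in the anova table can be reproduced by hand from the two residual sums of squares, using the fitrestricted and fit objects above (rss_r, rss_f and Fstat are illustrative names):

```r
#F = ((RSS_restricted - RSS_full) / extra df) / (RSS_full / residual df of full model)
rss_r <- sum(residuals(fitrestricted)^2)
rss_f <- sum(residuals(fit)^2)
Fstat <- ((rss_r - rss_f) / 3) / (rss_f / fit$df.residual)
Fstat #about 46.6, as in the anova output
pf(Fstat, 3, fit$df.residual, lower.tail = FALSE) #the corresponding p-value
```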