This R notebook can be used as a template for computing regression with multiple explanatory variables. It can be run in RStudio. The content is adapted from the Multilevel Modelling online course from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).
library(xlsx) #read xlsx files
#Text file
mydata <- read.table(file = "3.3.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook
#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")
dim(mydata) #number of rows and columns
## [1] 33988 9
str(mydata) #basic description of the dataframe
## 'data.frame': 33988 obs. of 9 variables:
## $ caseid : int 339 340 345 346 352 353 354 361 362 363 ...
## $ score : int 49 18 46 43 17 29 15 19 45 12 ...
## $ cohort90: int -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 ...
## $ female : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sclass : int 2 3 4 3 3 2 3 2 3 1 ...
## $ sclass1 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ sclass2 : int 1 0 0 0 0 1 0 1 0 0 ...
## $ sclass3 : int 0 1 0 1 1 0 1 0 1 0 ...
## $ sclass4 : int 0 0 1 0 0 0 0 0 0 0 ...
mydata[1:20, ] #first 20 rows of the data
## caseid score cohort90 female sclass sclass1 sclass2 sclass3 sclass4
## 1 339 49 -6 0 2 0 1 0 0
## 2 340 18 -6 0 3 0 0 1 0
## 3 345 46 -6 0 4 0 0 0 1
## 4 346 43 -6 0 3 0 0 1 0
## 5 352 17 -6 0 3 0 0 1 0
## 6 353 29 -6 0 2 0 1 0 0
## 7 354 15 -6 0 3 0 0 1 0
## 8 361 19 -6 0 2 0 1 0 0
## 9 362 45 -6 0 3 0 0 1 0
## 10 363 12 -6 0 1 1 0 0 0
## 11 6824 0 -4 0 1 1 0 0 0
## 12 6826 0 -4 0 3 0 0 1 0
## 13 6827 20 -4 0 2 0 1 0 0
## 14 6828 32 -4 0 1 1 0 0 0
## 15 6829 0 -4 0 2 0 1 0 0
## 16 6834 24 -4 0 3 0 0 1 0
## 17 6836 23 -4 0 2 0 1 0 0
## 18 13206 7 -2 0 3 0 0 1 0
## 19 13209 38 -2 0 3 0 0 1 0
## 20 13215 46 -2 0 1 1 0 0 0
\[y_i = \beta_0 + \beta_1 * x_{1i} + \beta_2 * x_{2i} + \beta_3 * x_{3i} + ... + \beta_p * x_{pi} + e_i\] Having viewed the data, we will now fit a multiple regression model that explains score (a ratio variable: a point score calculated from awards in Standard grades, ranging from 0 to 75, with a higher score indicating higher attainment) as a function of cohort90 (an ordinal variable: calculated by subtracting 1990 from the year, so values range from -6 (1984) to 8 (1998), with 1990 coded as zero), gender (a categorical variable: girls 1, boys 0) and social class (a categorical variable: managerial and professional 1, intermediate 2, working 3, unclassified 4).
\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * sclass1_{i} + \\ \beta_4 * sclass2_{i} + \beta_5 * sclass4_{i} + e_i\] sclass3 (working) is taken as the reference category for social class, which necessitates the creation of dummy variables (see Regression with a Single Categorical Explanatory Variable).
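The dataset already contains the dummy columns sclass1–sclass4, but building such indicators yourself is straightforward. A minimal sketch on toy values (not the real data):

```r
# Illustrative sketch on toy values: building 0/1 dummy variables from a
# categorical social-class code, with sclass3 as the reference category.
sclass <- c(2, 3, 4, 3, 1)               # hypothetical class codes
sclass1 <- as.integer(sclass == 1)
sclass2 <- as.integer(sclass == 2)
sclass4 <- as.integer(sclass == 4)       # no dummy for sclass3, the reference
cbind(sclass, sclass1, sclass2, sclass4)
```

Converting `sclass` to a factor and letting `lm()` handle the coding is an equivalent alternative; the explicit dummies simply make the reference category visible.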
fit <- lm(score ~ cohort90 + female + sclass1 + sclass2 + sclass4,
data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female + sclass1 + sclass2 +
## sclass4, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.604 -10.645 0.535 10.576 53.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.70118 0.17240 137.47 <2e-16 ***
## cohort90 1.21136 0.01487 81.48 <2e-16 ***
## female 2.03169 0.15828 12.84 <2e-16 ***
## sclass1 13.18046 0.20472 64.38 <2e-16 ***
## sclass2 6.97933 0.20891 33.41 <2e-16 ***
## sclass4 -4.06504 0.29353 -13.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 33982 degrees of freedom
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2932
## F-statistic: 2821 on 5 and 33982 DF, p-value: < 2.2e-16
The estimated intercept is 23.70, which is interpreted as the predicted attainment score when all explanatory variables are equal to zero, i.e. cohort90 = 0 (1990), female = 0 (boy), sclass1 = sclass2 = sclass4 = 0 (working class, the reference category). In other words, it is the predicted score for a boy in the 1990 cohort from a working-class household.
\[ \hat{score}_i = 23.70 + 1.21 * cohort90 + 2.03 * female + \\ 13.18 * sclass1 + 6.98 * sclass2 - 4.07 * sclass4\]
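As a worked check of the fitted equation, consider a hypothetical girl (female = 1) in the 1990 cohort (cohort90 = 0) from a managerial/professional household (sclass1 = 1). The coefficient values are copied from the summary output above:

```r
# Worked example: plugging values into the fitted equation by hand.
b <- c(intercept = 23.70, cohort90 = 1.21, female = 2.03,
       sclass1 = 13.18, sclass2 = 6.98, sclass4 = -4.07)
# Girl, 1990 cohort, managerial/professional social class:
pred <- b["intercept"] + b["cohort90"] * 0 + b["female"] * 1 + b["sclass1"] * 1
unname(pred)  # 23.70 + 2.03 + 13.18 = 38.91
```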
predscore <- predict(fit) #Predictions for every observation in the data
model <- summary(fit)
model$sigma^2
## [1] 211.887
model$r.squared
## [1] 0.2933134
1 - model$r.squared
## [1] 0.7066866
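model$sigma^2 is the estimated residual variance, and 1 - R² is the proportion of variance left unexplained. The residual variance reported by summary() is simply RSS divided by the residual degrees of freedom, which can be illustrated on a toy model (simulated data, not the real dataset):

```r
# Sketch: summary()$sigma^2 equals RSS / residual df, shown on simulated data.
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 3 * toy$x + rnorm(20)
m <- summary(lm(y ~ x, data = toy))
rss <- sum(m$residuals^2)
all.equal(m$sigma^2, rss / m$df[2])  # df[2] is the residual degrees of freedom
```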
The test statistic is calculated by dividing the estimated coefficient by its standard error. For example, we can test the effect of gender in the model. These values are reported by summary(fit).
#t-statistic for the female coefficient (row 3 of the coefficient table)
model$coefficients[3, 1] / model$coefficients[3, 2]
## [1] 12.83632
#significant if the statistic exceeds 1.96 in absolute value, the critical value at the 5% level of significance. Girls score significantly higher than boys, holding cohort and social class constant.
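The same estimate and standard error give an approximate 95% confidence interval for the gender effect. A sketch using the values reported in summary(fit) above (in practice, confint(fit) computes this directly, using t rather than normal quantiles):

```r
# Approximate 95% CI for the female coefficient: estimate +/- 1.96 * SE,
# using the values from the coefficient table above.
est <- 2.03169
se  <- 0.15828
ci  <- est + c(-1.96, 1.96) * se
round(ci, 3)  # interval excludes zero, consistent with the significant t-test
```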
mydata <- read.table(file = "3.4.txt", sep=",", header = TRUE)
### Multiple regression model with cohort and gender effects
\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + e_i\]
### Fit of the model
fit <- lm(score ~ cohort90 + female, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.162 -11.739 0.629 12.175 47.175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.72097 0.12449 238.75 <2e-16 ***
## cohort90 1.31592 0.01599 82.30 <2e-16 ***
## female 1.91324 0.17147 11.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.77 on 33985 degrees of freedom
## Multiple R-squared: 0.1702, Adjusted R-squared: 0.1702
## F-statistic: 3486 on 2 and 33985 DF, p-value: < 2.2e-16
\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * female_{i}\] For female = 0 (boys)
\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 0 = 29.72 + 1.32 * cohort90_{i}\] For female = 1 (girls)
\[\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * 1 = 29.72 + 1.91 + 1.32 * cohort90_{i} \\ = 31.63 + 1.32 * cohort90_{i} \] Boys and girls have different intercepts but the same slope.
predscore <- predict(fit) #Predictions for every observation in the data
# matrix preddata with a row for each combination of the explanatory variables
preddata <- cbind(mydata$cohort90, mydata$female, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "female", "predscore")
preddata <- data.frame(preddata)
preddata
## cohort90 female predscore
## 1 -6 0 21.82547
## 11 -4 0 24.45730
## 18 -2 0 27.08913
## 23 0 0 29.72097
## 27 6 0 37.61647
## 31 8 0 40.24831
## 16056 -6 1 23.73871
## 16072 -4 1 26.37054
## 16079 -2 1 29.00238
## 16087 0 1 31.63421
## 16092 6 1 39.52972
## 16099 8 1 42.16155
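Above, the prediction table is built from the combinations actually observed in the data. An alternative is to predict on a regular grid with expand.grid() and predict(fit, newdata = ...). A self-contained sketch on simulated data (the variable names mirror the real dataset, but the values are made up):

```r
# Sketch: predictions on a regular grid via expand.grid(), using simulated
# data with the same structure as the real dataset.
set.seed(1)
toy <- data.frame(cohort90 = rep(seq(-6, 8, by = 2), each = 40),
                  female   = rep(0:1, times = 160))
toy$score <- 29.7 + 1.32 * toy$cohort90 + 1.9 * toy$female +
             rnorm(nrow(toy), sd = 15)
fit_toy <- lm(score ~ cohort90 + female, data = toy)

# One row per combination of cohort90 and female, regardless of the data:
newdata <- expand.grid(cohort90 = seq(-6, 8, by = 2), female = 0:1)
newdata$predscore <- predict(fit_toy, newdata = newdata)
newdata
```

This avoids the cbind/unique step and also produces predictions for combinations that happen to be absent from the data.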
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90",
ylab = "fitted values")
lines(preddata$cohort90[preddata$female == 0],
preddata$predscore[preddata$female == 0], lty = "solid")
lines(preddata$cohort90[preddata$female == 1],
preddata$predscore[preddata$female == 1], lty = "dashed")
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05)
### Multiple regression model with cohort, gender and interaction between cohort and gender effects
\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * cohort90Xfemale_{i} + e_i\]
fit <- lm(score ~ cohort90 + female + cohort90:female, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + female + cohort90:female, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.539 -11.725 0.584 12.193 46.830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.72546 0.12448 238.806 < 2e-16 ***
## cohort90 1.25927 0.02335 53.941 < 2e-16 ***
## female 1.88607 0.17164 10.988 < 2e-16 ***
## cohort90:female 0.10665 0.03203 3.329 0.000871 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.77 on 33984 degrees of freedom
## Multiple R-squared: 0.1705, Adjusted R-squared: 0.1704
## F-statistic: 2328 on 3 and 33984 DF, p-value: < 2.2e-16
\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * female_{i} + 0.11 * cohort90Xfemale_i\]
For female = 0 (boys)
\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 0 + 0.11 * cohort90_{i} * 0 \\ = 29.72 + 1.26 * cohort90_{i}\]
For female = 1 (girls)
\[\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * 1 + 0.11 * cohort90_{i} * 1 \\ = (29.72 + 1.89) + (1.26 + 0.11) * cohort90_{i} \\ = 31.61 + 1.37 * cohort90_{i}\]
Boys and girls now have different intercepts and different slopes.
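The girls' intercept and slope can be recovered directly from the coefficient table above: the girls' line at cohort90 = 0 should match the fitted value 31.61 reported in the prediction table below.

```r
# Arithmetic check of the girls' fitted line from the coefficient estimates
# reported in summary(fit) above.
b0       <- 29.72546   # (Intercept)
b_cohort <- 1.25927    # cohort90
b_female <- 1.88607    # female
b_int    <- 0.10665    # cohort90:female
girls_intercept <- b0 + b_female        # intercept of the girls' line
girls_slope     <- b_cohort + b_int     # slope of the girls' line
round(c(girls_intercept, girls_slope), 2)
```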
predscore <- predict(fit)
preddata <- cbind(mydata$cohort90, mydata$female, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "female", "predscore")
preddata <- data.frame(preddata)
preddata
## cohort90 female predscore
## 1 -6 0 22.16981
## 11 -4 0 24.68836
## 18 -2 0 27.20691
## 23 0 0 29.72546
## 27 6 0 37.28110
## 31 8 0 39.79965
## 16056 -6 1 23.41596
## 16072 -4 1 26.14782
## 16079 -2 1 28.87967
## 16087 0 1 31.61153
## 16092 6 1 39.80709
## 16099 8 1 42.53895
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90",
ylab = "fitted values")
lines(preddata$cohort90[preddata$female == 0],
preddata$predscore[preddata$female == 0], lty = "solid")
lines(preddata$cohort90[preddata$female == 1],
preddata$predscore[preddata$female == 1], lty = "dashed")
legend("bottomright", c("Boys", "Girls"), lty = c("solid", "dashed"), inset = 0.05)
model <- summary(fit)
model$sigma^2
## [1] 248.7021
model$r.squared
## [1] 0.1704787
1 - model$r.squared
## [1] 0.8295213
The test statistic is calculated by dividing the estimated coefficient by its standard error. For example, we can test the interaction between cohort and gender. These values are reported by summary(fit).
#t-statistic for the cohort90:female interaction (row 4 of the coefficient table)
model$coefficients[4, 1] / model$coefficients[4, 2]
## [1] 3.329339
#significant if the statistic exceeds 1.96 in absolute value, the critical value at the 5% level of significance. There is an interaction between gender and cohort: the gap between girls and boys widens across cohorts in favour of girls.
We will now consider whether the effect of cohort on attainment depends on parental social class.
mydata$cohort90Xsclass1 <- mydata$cohort90*mydata$sclass1
mydata$cohort90Xsclass2 <- mydata$cohort90*mydata$sclass2
mydata$cohort90Xsclass4 <- mydata$cohort90*mydata$sclass4
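Creating the products by hand makes the interaction terms explicit in the data frame, but R's formula interface builds the same columns automatically with the `:` operator. A self-contained sketch on toy values showing the equivalence:

```r
# Sketch: the ':' operator in a formula generates the same interaction column
# as the manual product above (toy values, not the real data).
toy <- data.frame(cohort90 = c(-6, 0, 8), sclass1 = c(0, 1, 1))
X <- model.matrix(~ cohort90 + sclass1 + cohort90:sclass1, data = toy)
X[, "cohort90:sclass1"]            # identical to toy$cohort90 * toy$sclass1
```

The manual approach used in this notebook is convenient when you later want to drop interaction terms individually or inspect them as ordinary columns.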
### Multiple regression model with cohort, social class and interaction between cohort and social class effects
\[score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * sclass1_{i} + \beta_3 * sclass2_{i} + \\ \beta_4 * sclass4_{i} + \beta_5 * cohort90Xsclass1_{i} + \beta_6 * cohort90Xsclass2_{i} \\ + \beta_7 * cohort90Xsclass4_{i} + e_i\]
fit <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90 + sclass1 + sclass2 + sclass4 +
## cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.54 -10.62 0.46 10.51 53.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.89051 0.15042 165.471 < 2e-16 ***
## cohort90 1.37829 0.02883 47.803 < 2e-16 ***
## sclass1 13.22412 0.20583 64.246 < 2e-16 ***
## sclass2 6.86521 0.20941 32.783 < 2e-16 ***
## sclass4 -4.39256 0.29465 -14.908 < 2e-16 ***
## cohort90Xsclass1 -0.32508 0.03863 -8.415 < 2e-16 ***
## cohort90Xsclass2 -0.24468 0.03987 -6.137 8.49e-10 ***
## cohort90Xsclass4 0.18398 0.05352 3.438 0.000587 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 33980 degrees of freedom
## Multiple R-squared: 0.2928, Adjusted R-squared: 0.2926
## F-statistic: 2010 on 7 and 33980 DF, p-value: < 2.2e-16
\[\hat{score_i} = 24.89 + 1.38 * cohort90_{i} + 13.22 * sclass1_{i} + 6.86 * sclass2_{i} \\ - 4.39 * sclass4_{i} - 0.33 * cohort90Xsclass1_{i} - 0.24 * cohort90Xsclass2_{i} \\ + 0.18 * cohort90Xsclass4_{i} \]
For sclass = 1
\[\hat{score_i} = 24.89 + 1.38 * cohort90_{i} + 13.22 * 1 - 0.33 * cohort90_{i} * 1 \\ = (24.89 + 13.22) + (1.38 - 0.33) * cohort90_{i} \\ = 38.11 + 1.05 * cohort90_{i}\]
For sclass = 2
\[\hat{score_i} = 31.756 + 1.133 * cohort90_i\]
For sclass = 3
\[\hat{score_i} = 24.891 + 1.378 * cohort90_i\]
For sclass = 4
\[\hat{score_i} = 20.498 + 1.562 * cohort90_i\]
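The intercepts and slopes of all four lines follow the same pattern: add the class dummy coefficient to the overall intercept, and the class-by-cohort interaction coefficient to the overall slope (sclass3, the reference, keeps the base values). A quick arithmetic check using the estimates from the coefficient table above:

```r
# Intercept and slope of each social-class line, derived from the
# coefficient estimates reported in summary(fit) above.
b0  <- 24.89051                  # intercept (sclass3, cohort90 = 0)
b_c <- 1.37829                   # cohort90 slope for sclass3
b_s  <- c(sclass1 = 13.22412, sclass2 = 6.86521, sclass4 = -4.39256)
b_cx <- c(sclass1 = -0.32508, sclass2 = -0.24468, sclass4 = 0.18398)
intercepts <- b0 + b_s           # class-specific intercepts
slopes     <- b_c + b_cx         # class-specific slopes
round(rbind(intercepts, slopes), 3)
```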
predscore <- predict(fit)
preddata <- cbind(mydata$cohort90, mydata$sclass, predscore)
preddata <- unique(preddata)
colnames(preddata) <- c("cohort90", "sclass", "predscore")
preddata <- data.frame(preddata)
preddata
## cohort90 sclass predscore
## 1 -6 2 24.95404
## 2 -6 3 16.62074
## 3 -6 4 11.12430
## 10 -6 1 31.79533
## 11 -4 1 33.90177
## 12 -4 3 19.37733
## 13 -4 2 27.22127
## 18 -2 3 22.13392
## 20 -2 1 36.00820
## 23 0 1 38.11463
## 24 0 3 24.89051
## 27 6 4 29.87160
## 28 6 1 44.43392
## 29 6 3 33.16028
## 30 6 2 38.55741
## 31 8 1 46.54035
## 32 8 3 35.91687
## 33 8 2 40.82464
## 52 -2 2 29.48850
## 59 0 2 31.75572
## 64 0 4 20.49795
## 96 -4 4 14.24885
## 111 -2 4 17.37340
## 132 8 4 32.99614
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(10, 50))
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1],lty = "solid")
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2],lty = "dashed")
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted")
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05)
The graph shows a reduction in social class achievement gaps over successive cohorts.
preddata$predscore <- preddata$predscore - (24.891 + 1.378*preddata$cohort90)
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(-10, 15), xlim = c(-6, 10))
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1], lty = "solid")
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2], lty = "dashed")
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted")
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05)
The horizontal line at a fitted value of 0 represents the reference category, social class 3. This graph clearly shows the reduction in the social class achievement gaps over time.
We want to test the significance of the interaction model against the main-effects model. The null hypothesis for a significance test of the cohort-by-class interaction is that the coefficients of all interaction terms are equal to zero, i.e. \(H_0 : \beta_5 = \beta_6 = \beta_7 = 0\) vs. \(H_1 :\) at least one of \(\beta_5, \beta_6, \beta_7 \neq 0\).
The standard way of testing whether a set of regression coefficients are equal to zero is to use a nested F-test.
The F-statistic can be computed with the anova command, which tests the restricted model (main-effects model) against the unrestricted model (interaction model).
fitrestricted <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4, data = mydata)
anova(fitrestricted, fit)
## Analysis of Variance Table
##
## Model 1: score ~ cohort90 + sclass1 + sclass2 + sclass4
## Model 2: score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 +
## cohort90Xsclass2 + cohort90Xsclass4
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33983 7235256
## 2 33980 7205635 3 29620 46.561 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is tiny, so we can safely reject the null and conclude that there is evidence of an interaction effect between cohort and social class.
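The F-statistic in the anova table can be reproduced by hand from the residual sums of squares and degrees of freedom printed above (small rounding differences aside, since the printed RSS values are themselves rounded):

```r
# Sketch: the nested F-test computed directly from the anova() output above.
rss_restricted <- 7235256; df_restricted <- 33983   # main-effects model
rss_full       <- 7205635; df_full       <- 33980   # interaction model
F_stat <- ((rss_restricted - rss_full) / (df_restricted - df_full)) /
          (rss_full / df_full)
round(F_stat, 2)   # approximately 46.56, matching the anova table
# p-value from the F distribution with 3 and 33980 degrees of freedom:
pf(F_stat, df_restricted - df_full, df_full, lower.tail = FALSE)
```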