---
title: "Regression with More than One Explanatory  Variable (Multiple Regression)"
subtitle: Analysis template
author: Sunny Avry
date: "November 22th, 2020"
output:
  html_document:
    toc: false
    theme: united
---

This R notebook can be used as a template to compute regression with multiple explanatory variables. It is usable in Rstudio. The content is adapated from the mooc Multilevel modelling online course from the Centre for Mutlilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).

# Packages
```{r message=FALSE, warning=FALSE}
library(xlsx) #read xlsx files
```
# Data importation
```{r}
 #Text file
mydata <- read.table(file = "3.3.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")
```
# Data exploration

```{r}
dim(mydata) #number of rows and columns
```
```{r}
 str(mydata) #basic description of the dataframe
```

```{r}
mydata[1:20, ] #first 20 rows of the data
```
# Multiple linear regression model

$$y_i = \beta_0 + \beta_1 * x_{1i} + \beta_2 * x_{2i} + \beta_3 * x_{3i} + ... + \beta_p * x_{pi} + e_i$$
Having viewed the data we will now fit a multiple regression model that explained score (ratio variable : point score calculated from awards in Standard grades. Scores range  from  0  to  75,  with  a  higher  score  indicating  a  higher attainment) containing cohort (ordinal variable: calculated by subtracting 1990 from each value, ranging from -6 -corresponding to 1984- to 8 -1998-, with 1990 coded as zero), gender (categorical variable: girls 1; boys 0) and social class (categorical variable :  managerial and professional 1; intermediate 2; working 3; unclassified 4).

$$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * sclass1_{i} + \\
\beta_4 * sclass2_{i} + \beta_5 * sclass4_{i} + e_i$$
sclass3 is taken as a reference for the categorical variable social class (this necessitates the creation of dummy variables) See [Regression with a Single Categorical Explanatory Variable](https://seodle.info/toolbox/R_regression_with_a_single_categorical_explanatory_variable.html#categorical-variable-with-more-than-2-categories)

## Fit of the model

```{r}
fit <- lm(score ~ cohort90 + female + sclass1 + sclass2 + sclass4, 
data = mydata) 
summary(fit)
```

The estimated intercept is 23.70 which is interpreted as the predicted attainment score for sclass3 (working class) when all explanatory variables are equal to zero, i.e. cohort90 = 0 (1990), female = 0 (boy), sclass1 = 0, sclass2 = 0, sclass4 = 0. It is basically the score for a boy in the 1990 cohort with a working class parent.

$$ \hat{score}_i = 23.70 + 1.21 * cohort90 + 2.03 * female + \\  
13.18 * sclass1 + 6.98 * sclass2 - 4.07 * sclass4$$ 


## Predictions

```{r}
predscore <- predict(fit) #Predictions for every observation in the data
```
## Residual (unexplained) variance

```{r}
model <- summary(fit)
model$sigma^2
```

### % Explained variance vs. Unexplained variance

```{r}
model$r.squared 
1 - model$r.squared
```

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these data are reported in summary(fit).

```{r}

#sclass3 vs. female
model$coefficients[3,1] / model$coefficients[3,2]

#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance. Gender increases the score of an individual in the 1990 cohort with a working class parent.
```

# Assessing interaction effects (example 1)

```{r}
mydata <- read.table(file = "3.4.txt", sep=",", header = TRUE)

```
## Model without interaction effects

Multiple regression model with cohort and gender effects 

$$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + e_i$$
### Fit of the model

```{r}
fit <- lm(score ~ cohort90 + female, data = mydata) 
summary(fit)
```

$$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * female_{i}$$
For female = 0 (boys)

$$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 0 = 29.72 + 1.32 * cohort90_{i}$$
For female = 1 (girls)

$$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * 1 = 29.72 + 1.91 + 1.32 * cohort90_{i} \\ 
= 31.63 + 1.32 * cohort90_{i}
$$
Boys and girs have different intercepts but the same slope

### Plot of regression lines

```{r}
predscore <- predict(fit) #Predictions for every observation in the data

# matrix preddata with a row for each combination of the explanatory variables
preddata <- cbind(mydata$cohort90, mydata$female, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "female", "predscore") 
preddata <- data.frame(preddata) 
preddata
```

```{r}
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", 
ylab = "fitted values") 
lines(preddata$cohort90[preddata$female == 0], 
preddata$predscore[preddata$female == 0], lty = "solid") 
lines(preddata$cohort90[preddata$female == 1], 
preddata$predscore[preddata$female == 1], lty = "dashed") 
legend("bottomright", c("Boys", "Girl"), lty = c("solid", "dashed"), inset = 0.05) 
```

## Model with interaction effects

### Adding a column being the product of the two explanatory variables

Multiple regression model with cohort, gender and interaction between cohort and gender effects

$$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * cohort90Xfemale_{i} +  e_i$$

### Fit of the model

```{r}
fit <- lm(score ~ cohort90 + female + cohort90:female, data = mydata) 
summary(fit)
```

$$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * female_{i} + 0.11 * cohort90Xfemale_i$$

For female = 1 (girls)

$$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.86 * 0 + 0.11 * cohort90_{i} * 0 \\
= 29.72 + 1.26 * cohort90_{i}$$

For female = 0 (boys)

$$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.86 * 1 + 0.11 * cohort90_{i} * 1 \\
= (29.72 + 1.86) + (1.26 + 0.11) * cohort90_{i} \\ 
= 31.58 + 1.37 * cohort90_{i}$$

Boys and girs have now different intercepts and different slopes

### Plot of regression lines

```{r}
predscore <- predict(fit) 
preddata <- cbind(mydata$cohort90, mydata$female, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "female", "predscore") 
preddata <- data.frame(preddata) 
preddata
```

```{r}
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", 
ylab = "fitted values") 
lines(preddata$cohort90[preddata$female == 0], 
preddata$predscore[preddata$female == 0], lty = "solid") 
lines(preddata$cohort90[preddata$female == 1], 
preddata$predscore[preddata$female == 1], lty = "dashed") 
legend("bottomright", c("Boys", "Girl"), lty = c("solid", "dashed"), inset = 0.05) 
```

```{r}
model <- summary(fit)
model$sigma^2
```

### % Explained variance vs. Unexplained variance

```{r}
model$r.squared 
1 - model$r.squared
```

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these data are reported in summary(fit).

```{r}

model$coefficients[4,1] / model$coefficients[4,2]

#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance. There is an interaction effect between gender and cohort so that the difference between girls and boys increases across cohorts in favor of girls
```

# Assessing interaction effects (example 2)

We  will  now  consider  whether  the  effect  of  cohort  on  attainment  depends  on 
parental  social  class.

## Creation of dummy variables
```{r}
mydata$cohort90Xsclass1 <- mydata$cohort90*mydata$sclass1 
mydata$cohort90Xsclass2 <- mydata$cohort90*mydata$sclass2 
mydata$cohort90Xsclass4 <- mydata$cohort90*mydata$sclass4 
```

Multiple regression model with cohort, social class and interaction between cohort and social class effects effects 

$$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * sclass1_{i} + \beta_3 * sclass2_{i} \\
\beta_4 * sclass4_{i} + \beta_5 * cohort90_{i}Xsclass1_{i} + \beta_6 * cohort90_{i}Xsclass2_{i} \\
+ \beta_7 * cohort90_{i}Xsclass4_{i} + e_i$$

## Fit of the model

```{r}
fit <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata)
summary(fit)
```

$$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 * sclass1_{i} + 6.86 * sclass2_{i} \\
- 4.39 * sclass4_{i} - 0.33 * cohort90Xsclass1_{i} - 0.24 * cohort90Xsclass2_{i} \\
- 0.18 * cohort90Xsclass4_{i} $$
 
 For sclass = 1
 
 $$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 - 0.33 * cohort90_{i} * 1   \\
 = (24.90 + 13.22) + (1.38 - 0.33) * cohort90_{i} \\
 = 38.12 + 1.05 * cohort90_{i}$$
 
 For sclass = 2
 
 $$\hat{score_i} = 31.756 + 1.133 * cohort90_i$$
 
 
 For sclass = 3
 
 $$\hat{score_i} = 24.891 + 1.378 * cohort90_i$$
 
 For sclass = 4

$$\hat{score_i} = 20.498 + 1.562 * cohort90_i$$

## Plot of regression lines

```{r}
predscore <- predict(fit) 
preddata <- cbind(mydata$cohort90, mydata$sclass, predscore) 
preddata <- unique(preddata) 
colnames(preddata) <- c("cohort90", "sclass", "predscore") 
preddata <- data.frame(preddata) 
preddata 
```

```{r}
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(10, 50)) 
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1],lty = "solid") 
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2],lty = "dashed") 
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") 
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05) 

```

The graph shows a reduction in social class achievement gaps over successive cohorts. 

## Plot of regression lines (with sclass3 as reference)

```{r}
preddata$predscore <- preddata$predscore - (24.891 + 1.378*preddata$cohort90)
```
```{r}
plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(-10, 15), xlim = c(-6, 10)) 
lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1], lty = "solid") 
lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2], lty = "dashed") 
lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") 
lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash")
legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05) 
```

The horizontal line with a fitted value of 0 represents the reference category social class 3. This graph shows clearly the reduction in the social class achievement gaps over time.  

## Hypothesis testing

We want to test the significance of the interaction model against the main effects model
The null hypothesis for a significance test of the cohort-by-class interaction is that 
the coefficients of all interactions terms are equal to zero, i.e. H0 : $\beta_5 = \beta_6 = \beta_7 = 0\ vs. H1 : \beta_5 = \beta_6 = \beta_7 = fitted\ values$.  

The standard way of testing whether a set of regression coefficients are equal to 
zero is to use a nested F-test.

The  calculation  of  the  F-statistic  can  be  done  with  the  anova  command.  The 
command  tests  the  restricted  model (main effects models)  against  the  unrestricted  model (interaction model).

```{r}
fitrestricted <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4, data = mydata) 
```
```{r}
anova(fitrestricted, fit) 
```

The p-value is tiny, so we can safely reject the null and conclude that there is evidence 
of an interaction effect between cohort and social class.