--- title: "Regression with More than One Explanatory Variable (Multiple Regression)" subtitle: Analysis template author: Sunny Avry date: "November 22th, 2020" output: html_document: toc: false theme: united --- This R notebook can be used as a template to compute regression with multiple explanatory variables. It is usable in Rstudio. The content is adapated from the mooc Multilevel modelling online course from the Centre for Mutlilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/). # Packages ```{r message=FALSE, warning=FALSE} library(xlsx) #read xlsx files ``` # Data importation ```{r} #Text file mydata <- read.table(file = "3.3.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook #Excel file #mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname") ``` # Data exploration ```{r} dim(mydata) #number of rows and columns ``` ```{r} str(mydata) #basic description of the dataframe ``` ```{r} mydata[1:20, ] #first 20 rows of the data ``` # Multiple linear regression model $$y_i = \beta_0 + \beta_1 * x_{1i} + \beta_2 * x_{2i} + \beta_3 * x_{3i} + ... + \beta_p * x_{pi} + e_i$$ Having viewed the data we will now fit a multiple regression model that explained score (ratio variable : point score calculated from awards in Standard grades. Scores range from 0 to 75, with a higher score indicating a higher attainment) containing cohort (ordinal variable: calculated by subtracting 1990 from each value, ranging from -6 -corresponding to 1984- to 8 -1998-, with 1990 coded as zero), gender (categorical variable: girls 1; boys 0) and social class (categorical variable : managerial and professional 1; intermediate 2; working 3; unclassified 4). $$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * sclass1_{i} + \\ \beta_4 * sclass2_{i} + \beta_5 * sclass4_{i} + e_i$$ sclass3 is taken as a reference for the categorical variable social class (this necessitates the creation of dummy variables) See [Regression with a Single Categorical Explanatory Variable](https://seodle.info/toolbox/R_regression_with_a_single_categorical_explanatory_variable.html#categorical-variable-with-more-than-2-categories) ## Fit of the model ```{r} fit <- lm(score ~ cohort90 + female + sclass1 + sclass2 + sclass4, data = mydata) summary(fit) ``` The estimated intercept is 23.70 which is interpreted as the predicted attainment score for sclass3 (working class) when all explanatory variables are equal to zero, i.e. cohort90 = 0 (1990), female = 0 (boy), sclass1 = 0, sclass2 = 0, sclass4 = 0. It is basically the score for a boy in the 1990 cohort with a working class parent. $$ \hat{score}_i = 23.70 + 1.21 * cohort90 + 2.03 * female + \\ 13.18 * sclass1 + 6.98 * sclass2 - 4.07 * sclass4$$ ## Predictions ```{r} predscore <- predict(fit) #Predictions for every observation in the data ``` ## Residual (unexplained) variance ```{r} model <- summary(fit) model$sigma^2 ``` ### % Explained variance vs. Unexplained variance ```{r} model$r.squared 1 - model$r.squared ``` ## Hypothesis testing The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these data are reported in summary(fit). ```{r} #sclass3 vs. female model$coefficients[3,1] / model$coefficients[3,2] #significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance. Gender increases the score of an individual in the 1990 cohort with a working class parent. ``` # Assessing interaction effects (example 1) ```{r} mydata <- read.table(file = "3.4.txt", sep=",", header = TRUE) ``` ## Model without interaction effects Multiple regression model with cohort and gender effects $$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + e_i$$ ### Fit of the model ```{r} fit <- lm(score ~ cohort90 + female, data = mydata) summary(fit) ``` $$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * female_{i}$$ For female = 0 (boys) $$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 0 = 29.72 + 1.32 * cohort90_{i}$$ For female = 1 (girls) $$\hat{score_i} = 29.72 + 1.32 * cohort90_{i} + 1.91 * 1 = 29.72 + 1.91 + 1.32 * cohort90_{i} \\ = 31.63 + 1.32 * cohort90_{i} $$ Boys and girs have different intercepts but the same slope ### Plot of regression lines ```{r} predscore <- predict(fit) #Predictions for every observation in the data # matrix preddata with a row for each combination of the explanatory variables preddata <- cbind(mydata$cohort90, mydata$female, predscore) preddata <- unique(preddata) colnames(preddata) <- c("cohort90", "female", "predscore") preddata <- data.frame(preddata) preddata ``` ```{r} plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values") lines(preddata$cohort90[preddata$female == 0], preddata$predscore[preddata$female == 0], lty = "solid") lines(preddata$cohort90[preddata$female == 1], preddata$predscore[preddata$female == 1], lty = "dashed") legend("bottomright", c("Boys", "Girl"), lty = c("solid", "dashed"), inset = 0.05) ``` ## Model with interaction effects ### Adding a column being the product of the two explanatory variables Multiple regression model with cohort, gender and interaction between cohort and gender effects $$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * female_{i} + \beta_3 * cohort90Xfemale_{i} + e_i$$ ### Fit of the model ```{r} fit <- lm(score ~ cohort90 + female + cohort90:female, data = mydata) summary(fit) ``` $$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.89 * female_{i} + 0.11 * cohort90Xfemale_i$$ For female = 1 (girls) $$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.86 * 0 + 0.11 * cohort90_{i} * 0 \\ = 29.72 + 1.26 * cohort90_{i}$$ For female = 0 (boys) $$\hat{score_i} = 29.72 + 1.26 * cohort90_{i} + 1.86 * 1 + 0.11 * cohort90_{i} * 1 \\ = (29.72 + 1.86) + (1.26 + 0.11) * cohort90_{i} \\ = 31.58 + 1.37 * cohort90_{i}$$ Boys and girs have now different intercepts and different slopes ### Plot of regression lines ```{r} predscore <- predict(fit) preddata <- cbind(mydata$cohort90, mydata$female, predscore) preddata <- unique(preddata) colnames(preddata) <- c("cohort90", "female", "predscore") preddata <- data.frame(preddata) preddata ``` ```{r} plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values") lines(preddata$cohort90[preddata$female == 0], preddata$predscore[preddata$female == 0], lty = "solid") lines(preddata$cohort90[preddata$female == 1], preddata$predscore[preddata$female == 1], lty = "dashed") legend("bottomright", c("Boys", "Girl"), lty = c("solid", "dashed"), inset = 0.05) ``` ```{r} model <- summary(fit) model$sigma^2 ``` ### % Explained variance vs. Unexplained variance ```{r} model$r.squared 1 - model$r.squared ``` ## Hypothesis testing The test statistic is calculated by dividing the estimated slope by its standard error. For example, we can test the effect of gender in the model. All these data are reported in summary(fit). ```{r} model$coefficients[4,1] / model$coefficients[4,2] #significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance. There is an interaction effect between gender and cohort so that the difference between girls and boys increases across cohorts in favor of girls ``` # Assessing interaction effects (example 2) We will now consider whether the effect of cohort on attainment depends on parental social class. ## Creation of dummy variables ```{r} mydata$cohort90Xsclass1 <- mydata$cohort90*mydata$sclass1 mydata$cohort90Xsclass2 <- mydata$cohort90*mydata$sclass2 mydata$cohort90Xsclass4 <- mydata$cohort90*mydata$sclass4 ``` Multiple regression model with cohort, social class and interaction between cohort and social class effects effects $$score_i = \beta_0 + \beta_1 * cohort90_{i} + \beta_2 * sclass1_{i} + \beta_3 * sclass2_{i} \\ \beta_4 * sclass4_{i} + \beta_5 * cohort90_{i}Xsclass1_{i} + \beta_6 * cohort90_{i}Xsclass2_{i} \\ + \beta_7 * cohort90_{i}Xsclass4_{i} + e_i$$ ## Fit of the model ```{r} fit <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4 + cohort90Xsclass1 + cohort90Xsclass2 + cohort90Xsclass4, data = mydata) summary(fit) ``` $$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 * sclass1_{i} + 6.86 * sclass2_{i} \\ - 4.39 * sclass4_{i} - 0.33 * cohort90Xsclass1_{i} - 0.24 * cohort90Xsclass2_{i} \\ - 0.18 * cohort90Xsclass4_{i} $$ For sclass = 1 $$\hat{score_i} = 24.90 + 1.38 * cohort90_{i} + 13.22 - 0.33 * cohort90_{i} * 1 \\ = (24.90 + 13.22) + (1.38 - 0.33) * cohort90_{i} \\ = 38.12 + 1.05 * cohort90_{i}$$ For sclass = 2 $$\hat{score_i} = 31.756 + 1.133 * cohort90_i$$ For sclass = 3 $$\hat{score_i} = 24.891 + 1.378 * cohort90_i$$ For sclass = 4 $$\hat{score_i} = 20.498 + 1.562 * cohort90_i$$ ## Plot of regression lines ```{r} predscore <- predict(fit) preddata <- cbind(mydata$cohort90, mydata$sclass, predscore) preddata <- unique(preddata) colnames(preddata) <- c("cohort90", "sclass", "predscore") preddata <- data.frame(preddata) preddata ``` ```{r} plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(10, 50)) lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1],lty = "solid") lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2],lty = "dashed") lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash") legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05) ``` The graph shows a reduction in social class achievement gaps over successive cohorts. ## Plot of regression lines (with sclass3 as reference) ```{r} preddata$predscore <- preddata$predscore - (24.891 + 1.378*preddata$cohort90) ``` ```{r} plot(preddata$cohort90, preddata$predscore, type = "n", xlab = "cohort90", ylab = "fitted values", ylim = c(-10, 15), xlim = c(-6, 10)) lines(preddata$cohort90[preddata$sclass == 1], preddata$predscore[preddata$sclass == 1], lty = "solid") lines(preddata$cohort90[preddata$sclass == 2], preddata$predscore[preddata$sclass == 2], lty = "dashed") lines(preddata$cohort90[preddata$sclass == 3], preddata$predscore[preddata$sclass == 3],lty = "dotted") lines(preddata$cohort90[preddata$sclass == 4], preddata$predscore[preddata$sclass == 4],lty = "dotdash") legend("bottomright", c("sclass = 1", "sclass = 2", "sclass = 3", "sclass=4"), lty = c("solid","dashed","dotted", "dotdash"), inset = 0.05) ``` The horizontal line with a fitted value of 0 represents the reference category social class 3. This graph shows clearly the reduction in the social class achievement gaps over time. ## Hypothesis testing We want to test the significance of the interaction model against the main effects model The null hypothesis for a significance test of the cohort-by-class interaction is that the coefficients of all interactions terms are equal to zero, i.e. H0 : $\beta_5 = \beta_6 = \beta_7 = 0\ vs. H1 : \beta_5 = \beta_6 = \beta_7 = fitted\ values$. The standard way of testing whether a set of regression coefficients are equal to zero is to use a nested F-test. The calculation of the F-statistic can be done with the anova command. The command tests the restricted model (main effects models) against the unrestricted model (interaction model). ```{r} fitrestricted <- lm(score ~ cohort90 + sclass1 + sclass2 + sclass4, data = mydata) ``` ```{r} anova(fitrestricted, fit) ``` The p-value is tiny, so we can safely reject the null and conclude that there is evidence of an interaction effect between cohort and social class.