--- title: "Regression with a Single Categorical Explanatory Variable" subtitle: Analysis template author: Sunny Avry date: "January 27th, 2020" output: html_document: toc: false theme: united --- This R notebook can be used as a template to compute regression with a single categorical explanatory variable. It is usable in Rstudio. The content is adapated from the mooc Multilevel modelling online course from the Centre for Mutlilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/). # Packages ```{r message=FALSE, warning=FALSE} library(xlsx) #read xlsx files ``` # Data importation ```{r} #Text file mydata <- read.table(file = "3.2.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook #Excel file #mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname") ``` # Data exploration ```{r} dim(mydata) #number of rows and columns ``` ```{r} str(mydata) #basic description of the dataframe ``` ```{r} mydata[1:20, ] #first 20 rows of the data ``` # Dummy categorical variable Having viewed the data we will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range from 0 to 75, with a higher score indicating a higher attainment ) and female (categorical variable : girls 1; boys 0 ## Dependent variable (score) ### Histogram of the data ```{r} #continuous variable hist(mydata$score, xlim = c(0,80)) ``` ### Summary of the data ```{r} summary(mydata$score) ``` ### Standard deviation of the data ```{r} sd(mydata$score) ``` ## Independent (explanatory) variable (female) ### Frequencies, percentages and cumulated percentages ```{r} mytable <- table(mydata$female) mytablecomb <- cbind(mytable, prop.table(mytable), cumsum(prop.table(mytable))) colnames(mytablecomb) <- c("Freq", "Perc", "Cum") mytablecomb ``` ## Relationship between explanatory and dependant variables ### Frequence, mean and sd of the dependent variable for each value of the explanatory variable ```{r} l <- tapply(mydata$score, factor(mydata$female), length) m <- tapply(mydata$score, factor(mydata$female), mean) s <- tapply(mydata$score, factor(mydata$female), sd) tableScore <- cbind("Freq" = l, "mean(score)" = m, "sd(score)" = s) tableScore <- rbind(tableScore, c(sum(tableScore[, 1]), mean(mydata$score), sd(mydata$score))) rownames(tableScore)[dim(tableScore)[1]] <- "Total" tableScore ``` ## Linear regression $$dependent_i = \beta_0 + \beta_1 * explanatory_i + e_i$$ ```{r} fit <- lm(score ~ female, data = mydata) summary(fit) ``` $$ \hat{score}_i = 29.83 + 2.41 * female_i$$ ### Residual (unexplained) variance ```{r} model <- summary(fit) model$sigma^2 ``` ### % Explained variance ```{r} model$r.squared ``` ## Hypothesis testing The test statistic is calculated by dividing the estimated slope by its standard error ```{r} model$coefficients[2,1] / model$coefficients[2,2] #significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance ``` # Categorical variable with more than 2 categories We will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range from 0 to 75, with a higher score indicating a higher attainment) and parental social class (categorical variable : managerial and professional 1; intermediate 2; working 3; unclassified 4) ## Dependent variable (score) ### Histogram of the data ```{r} #continuous variable hist(mydata$score, xlim = c(0,80)) ``` ### Summary of the data ```{r} summary(mydata$score) ``` ### Standard deviation of the data ```{r} sd(mydata$score) ``` ## Independent (explanatory) variable (parental social class) ### Frequencies, percentages and cumulated percentages ```{r} mytable <- table(mydata$sclass) mytablecomb <- cbind(mytable, prop.table(mytable), cumsum(prop.table(mytable))) colnames(mytablecomb) <- c("Freq", "Perc", "Cum") mytablecomb ``` ## Relationship between explanatory and dependant variables ### Frequence, mean and sd of the dependent variable for each value of the explanatory variable ```{r} l <- tapply(mydata$score, factor(mydata$sclass), length) m <- tapply(mydata$score, factor(mydata$sclass), mean) s <- tapply(mydata$score, factor(mydata$sclass), sd) tableScore <- cbind("Freq" = l, "mean(score)" = m, "sd(score)" = s) tableScore <- rbind(tableScore, c(sum(tableScore[, 1]), mean(mydata$score), sd(mydata$score))) rownames(tableScore)[dim(tableScore)[1]] <- "Total" tableScore ``` ### Creation of dummy variables ```{r} mydata$sclass1 <- mydata$sclass == 1 mydata$sclass2 <- mydata$sclass == 2 mydata$sclass3 <- mydata$sclass == 3 mydata$sclass4 <- mydata$sclass == 4 mydata[1:20, ] ``` ## Linear regression ### A first model with sclass3 as reference $$score_i = \beta_0 + \beta_1 * sclass1_i + \beta_2 * sclass2_i + \beta_3 * sclass4_i + e_i$$ ```{r} fit <- lm(score ~ sclass1 + sclass2 + sclass4, data = mydata) summary(fit) ``` $$ \hat{score}_i = 24.1 + 15.14 * sclass1_i + 7.77 * sclass2_i - 2.75 * sclass4_i$$ ### Residual (unexplained) variance ```{r} model <- summary(fit) model$sigma^2 ``` ### % Explained variance ```{r} model$r.squared ``` ## Hypothesis testing The test statistic is calculated by dividing the estimated slope by its standard error ```{r} #sclass3 vs.sclass1 model$coefficients[2,1] / model$coefficients[2,2] #sclass3 vs.sclass2 model$coefficients[3,1] / model$coefficients[3,2] #sclass3 vs.sclass4 model$coefficients[4,1] / model$coefficients[4,2] #significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance ``` ### A second model with sclass2 as reference $$score_i = \beta_0 + \beta_1 * sclass1_i + \beta_2 * sclass3_i + \beta_3 * sclass4_i + e_i$$ ```{r} fit <- lm(score ~ sclass1 + sclass3 + sclass4, data = mydata) summary(fit) ``` $$ \hat{score}_i = 31.87 + 7.37 * sclass1_i - 7.77 * sclass3_i - 10.52 * sclass4_i$$ ### Residual (unexplained) variance ```{r} model <- summary(fit) model$sigma^2 ``` ### % Explained variance ```{r} model$r.squared ``` ## Hypothesis testing The test statistic is calculated by dividing the estimated slope by its standard error ```{r} #sclass2 vs.sclass1 model$coefficients[2,1] / model$coefficients[2,2] #sclass2 vs.sclass3 model$coefficients[3,1] / model$coefficients[3,2] #sclass2 vs.sclass4 model$coefficients[4,1] / model$coefficients[4,2] #significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance ``` ## Non-linear regression We will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range from 0 to 75, with a higher score indicating a higher attainment ) and cohort90 taken as a categorical variable (cohorts: 1984, 1986, 1988, 1990, 1996 and 1998). The cohort90 variable is calculated by subtracting 1990 from each value. Thus values range from -6 (corresponding to 1984) to 8 (1998), with 1990 coded as zero. ### Frequencies ```{r} table(mydata$cohort90) ``` ### Creation of dummy variables ```{r} mydata$cohort90yr84 <- mydata$cohort90 == -6 mydata$cohort90yr86 <- mydata$cohort90 == -4 mydata$cohort90yr88 <- mydata$cohort90 == -2 mydata$cohort90yr90 <- mydata$cohort90 == 0 mydata$cohort90yr96 <- mydata$cohort90 == 6 mydata$cohort90yr98 <- mydata$cohort90 == 8 mydata[1:20, ] ``` ### A model with cohort90yr88 as a reference $$score_i = \beta_0 + \beta_1 * cohort90yr84_i + \beta_2 * cohort90yr86_i + \\ \beta_3 * cohort90yr90_i + \beta_4 * cohort90yr96_i + \beta_5 * cohort90yr98_i + e_i$$ ```{r} fit <- lm(score ~ cohort90yr84 + cohort90yr86 + cohort90yr90 + cohort90yr96 + cohort90yr98, data = mydata) summary(fit) ``` $$ \hat{score}_i = 28.52 + -4.87 * cohort90yr84_i - 3.75 * cohort90yr86_i + \\ 0.58 * cohort90yr86_i + 10.91 * cohort90yr96_i + 12.81 * cohort90yr98_i$$ The coefficients increase with year, as assumed in a linear model. We can treat cohort90 as continuous in all subsequent analyses. # Sources - Multilevel modelling online course from the Centre for Mutlilevel Modelling of the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).