---
title: "Regression with a Single Categorical Explanatory Variable"
subtitle: Analysis template
author: Sunny Avry
date: "January 27th, 2020"
output:
  html_document:
    toc: false
    theme: united
---

This R notebook can be used as a template to compute regression with a single categorical explanatory variable. It is usable in Rstudio. The content is adapated from the mooc Multilevel modelling online course from the Centre for Mutlilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/). 

# Packages
```{r message=FALSE, warning=FALSE}
library(xlsx) #read xlsx files
```
# Data importation
```{r}
 #Text file
mydata <- read.table(file = "3.2.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")
```
# Data exploration

```{r}
dim(mydata) #number of rows and columns
```
```{r}
 str(mydata) #basic description of the dataframe
```

```{r}
mydata[1:20, ] #first 20 rows of the data
```

# Dummy categorical variable

Having viewed the data we will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range  from  0  to  75,  with  a  higher  score  indicating  a  higher attainment ) and female (categorical variable : girls 1; boys 0 

## Dependent variable (score)
### Histogram of the data
```{r}
#continuous variable
 hist(mydata$score, xlim = c(0,80)) 
```
### Summary of the data
```{r}
summary(mydata$score)
```

### Standard deviation of the data
```{r}
sd(mydata$score)
```

## Independent (explanatory) variable (female)

### Frequencies, percentages and cumulated percentages
```{r}
mytable <- table(mydata$female)
mytablecomb <- cbind(mytable, prop.table(mytable), 
cumsum(prop.table(mytable))) 
colnames(mytablecomb) <- c("Freq", "Perc", "Cum") 
mytablecomb
```

## Relationship between explanatory and dependant variables 

### Frequence, mean and sd of the dependent variable for each value of the explanatory variable
```{r}
l <- tapply(mydata$score, factor(mydata$female), length) 
m <- tapply(mydata$score, factor(mydata$female), mean) 
s <- tapply(mydata$score, factor(mydata$female), sd) 
tableScore <- cbind("Freq"  = l, "mean(score)" = m, "sd(score)" = s) 

tableScore <- rbind(tableScore, c(sum(tableScore[, 1]), mean(mydata$score), 
sd(mydata$score))) 

rownames(tableScore)[dim(tableScore)[1]] <- "Total" 

tableScore
```

## Linear regression

$$dependent_i = \beta_0 + \beta_1 * explanatory_i + e_i$$   

```{r}
fit <- lm(score ~ female, data = mydata) 
summary(fit) 
```

$$ \hat{score}_i = 29.83 + 2.41 * female_i$$

### Residual (unexplained) variance

```{r}
model <- summary(fit)
model$sigma^2
```

### % Explained variance

```{r}
model$r.squared
```

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error

```{r}
model$coefficients[2,1] / model$coefficients[2,2]

#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance
```

# Categorical variable with more than 2 categories

We will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range from  0  to  75,  with  a  higher  score  indicating  a  higher attainment) and parental social class (categorical variable :  managerial and professional 1; intermediate 2; working 3; unclassified 4) 

## Dependent variable (score)
### Histogram of the data
```{r}
#continuous variable
 hist(mydata$score, xlim = c(0,80)) 
```
### Summary of the data
```{r}
summary(mydata$score)
```

### Standard deviation of the data
```{r}
sd(mydata$score)
```
## Independent (explanatory) variable (parental social class)

### Frequencies, percentages and cumulated percentages
```{r}
mytable <- table(mydata$sclass)
mytablecomb <- cbind(mytable, prop.table(mytable), 
cumsum(prop.table(mytable))) 
colnames(mytablecomb) <- c("Freq", "Perc", "Cum") 
mytablecomb
```

## Relationship between explanatory and dependant variables 

### Frequence, mean and sd of the dependent variable for each value of the explanatory variable
```{r}
l <- tapply(mydata$score, factor(mydata$sclass), length) 
m <- tapply(mydata$score, factor(mydata$sclass), mean) 
s <- tapply(mydata$score, factor(mydata$sclass), sd) 
tableScore <- cbind("Freq"  = l, "mean(score)" = m, "sd(score)" = s) 

tableScore <- rbind(tableScore, c(sum(tableScore[, 1]), mean(mydata$score), 
sd(mydata$score))) 

rownames(tableScore)[dim(tableScore)[1]] <- "Total" 

tableScore
```

### Creation of dummy variables

```{r}
mydata$sclass1 <- mydata$sclass == 1 
  
mydata$sclass2 <- mydata$sclass == 2 
  
mydata$sclass3 <- mydata$sclass == 3 
  
mydata$sclass4 <- mydata$sclass == 4 

mydata[1:20, ] 

```

## Linear regression

### A first model with sclass3 as reference

$$score_i = \beta_0 + \beta_1 * sclass1_i + \beta_2 * sclass2_i + \beta_3 * sclass4_i + e_i$$   

```{r}
fit <- lm(score ~ sclass1 + sclass2 + sclass4, data = mydata) 
summary(fit) 
```

$$ \hat{score}_i = 24.1 + 15.14 * sclass1_i + 7.77 * sclass2_i - 2.75 * sclass4_i$$
### Residual (unexplained) variance

```{r}
model <- summary(fit)
model$sigma^2
```

### % Explained variance

```{r}
model$r.squared
```

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error

```{r}
#sclass3 vs.sclass1
model$coefficients[2,1] / model$coefficients[2,2]

#sclass3 vs.sclass2
model$coefficients[3,1] / model$coefficients[3,2]

#sclass3 vs.sclass4
model$coefficients[4,1] / model$coefficients[4,2]

#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance
```

### A second model with sclass2 as reference

$$score_i = \beta_0 + \beta_1 * sclass1_i + \beta_2 * sclass3_i + \beta_3 * sclass4_i + e_i$$   

```{r}
fit <- lm(score ~ sclass1 + sclass3 + sclass4, data = mydata) 
summary(fit) 
```

$$ \hat{score}_i = 31.87 + 7.37 * sclass1_i - 7.77 * sclass3_i - 10.52 * sclass4_i$$
### Residual (unexplained) variance

```{r}
model <- summary(fit)
model$sigma^2
```

### % Explained variance

```{r}
model$r.squared
```

## Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error

```{r}
#sclass2 vs.sclass1
model$coefficients[2,1] / model$coefficients[2,2]

#sclass2 vs.sclass3
model$coefficients[3,1] / model$coefficients[3,2]

#sclass2 vs.sclass4
model$coefficients[4,1] / model$coefficients[4,2]

#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance
```

## Non-linear regression

We will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range  from  0  to  75,  with  a  higher  score  indicating  a  higher attainment ) and cohort90 taken as a categorical variable (cohorts: 1984, 1986, 1988, 1990, 1996 and 1998). The cohort90 variable is calculated by subtracting 1990 from each value. Thus values range from -6 (corresponding to 1984) to 8 (1998), with 1990 coded as zero. 

### Frequencies
```{r}
 table(mydata$cohort90) 
```

### Creation of dummy variables
```{r}
mydata$cohort90yr84 <- mydata$cohort90 == -6 
  
mydata$cohort90yr86 <- mydata$cohort90 == -4 
  
mydata$cohort90yr88 <- mydata$cohort90 == -2 
  
mydata$cohort90yr90 <- mydata$cohort90 == 0 
  
mydata$cohort90yr96 <- mydata$cohort90 == 6 
  
mydata$cohort90yr98 <- mydata$cohort90 == 8 

mydata[1:20, ] 
```

### A model with  cohort90yr88 as a reference

$$score_i = \beta_0 + \beta_1 * cohort90yr84_i + \beta_2 * cohort90yr86_i + \\  \beta_3 * cohort90yr90_i + \beta_4 * cohort90yr96_i + \beta_5 * cohort90yr98_i + e_i$$   

```{r}
fit <- lm(score ~ cohort90yr84 + cohort90yr86 + cohort90yr90 + cohort90yr96 + cohort90yr98, data = mydata)
summary(fit) 
```

$$ \hat{score}_i = 28.52 + -4.87 * cohort90yr84_i - 3.75 * cohort90yr86_i + \\  
0.58 * cohort90yr86_i + 10.91 * cohort90yr96_i + 12.81 * cohort90yr98_i$$  


The  coefficients  increase  with  year,  as  assumed  in  a  linear  model. We can treat cohort90 as continuous in all subsequent analyses. 

# Sources

- Multilevel modelling online course from the Centre for Mutlilevel Modelling of the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).