---
title: "Regression with a Single Ordinal Explanatory Variable"
subtitle: Analysis template
author: Sunny Avry
date: "November 21st, 2020"
output:
  html_document:
    toc: false
    theme: united
---

This R notebook is a template for fitting a regression with a single ordinal explanatory variable. It is designed to be used in RStudio. The content is adapted from the Multilevel Modelling online course (a MOOC) from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).

# Packages
```{r message=FALSE, warning=FALSE}
library(xlsx) #read xlsx files
```
# Data importation
```{r}
#Text file
mydata <- read.table(file = "3.1.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")
```
# Data exploration

```{r}
dim(mydata) #number of rows and columns
```
```{r}
str(mydata) #basic description of the dataframe
```

```{r}
mydata[1:20, ] #first 20 rows of the data
```

Having viewed the data, we now examine score and cohort90. The score variable is a ratio variable: a point score calculated from awards in Standard grades, ranging from 0 to 75, with a higher score indicating higher attainment. The cohort90 variable is an ordinal variable identifying the cohort (1984, 1986, 1988, 1990, 1996 or 1998); it is calculated by subtracting 1990 from the cohort year. Its values therefore range from -6 (corresponding to 1984) to 8 (1998), with 1990 coded as zero.

# Dependent variable (score)
## Histogram of the data
```{r}
#continuous variable
hist(mydata$score, xlim = c(0,80))
```
## Summary of the data
```{r}
summary(mydata$score)
```

## Standard deviation of the data
```{r}
sd(mydata$score)
```

# Independent (explanatory) variable (cohort90)

## Frequencies, percentages and cumulative percentages
```{r}
mytable <- table(mydata$cohort90)
mytablecomb <- cbind(mytable, prop.table(mytable), 
cumsum(prop.table(mytable))) 
colnames(mytablecomb) <- c("Freq", "Perc", "Cum") 
mytablecomb
```

# Relationship between explanatory and dependent variables

## Scatterplot
```{r}
plot(mydata$cohort90, mydata$score, ylim = c(0,80)) 
```

## Frequency, mean and standard deviation of the dependent variable for each value of the explanatory variable
```{r}
l <- tapply(mydata$score, factor(mydata$cohort90), length) 
m <- tapply(mydata$score, factor(mydata$cohort90), mean) 
s <- tapply(mydata$score, factor(mydata$cohort90), sd) 
tableScore <- cbind("Freq"  = l, "mean(score)" = m, "sd(score)" = s) 
tableScore
```

## Pearson correlation

```{r}
cor(mydata$score, mydata$cohort90) 
```

# Linear regression

$$\text{dependent}_i = \beta_0 + \beta_1 \times \text{explanatory}_i + e_i$$
```{r}
fit <- lm(score ~ cohort90, data = mydata) 
summary(fit) 
```

$$\widehat{\text{score}}_i = 30.73 + 1.32 \times \text{cohort90}_i$$
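
As a quick worked example, the fitted equation can be evaluated by hand for a few cohorts. This is only a sketch: it uses the rounded coefficients printed in the equation above rather than the full-precision values stored in `fit`, and the names `b0`, `b1`, `cohort90_values` and `predicted` are introduced here for illustration.

```{r}
# Rounded coefficients copied from the fitted equation above
b0 <- 30.73  # intercept: predicted score for the 1990 cohort (cohort90 = 0)
b1 <- 1.32   # slope: predicted change in score per one-unit increase in cohort90

cohort90_values <- c(-6, 0, 8)  # the 1984, 1990 and 1998 cohorts
predicted <- b0 + b1 * cohort90_values
names(predicted) <- cohort90_values + 1990  # label each prediction by cohort year
predicted
```

With these rounded values the predicted scores are 22.81 for 1984, 30.73 for 1990 and 41.29 for 1998.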

## Predictions

```{r}
predscore <- predict(fit) #Predictions for every observation in the data
```
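
Predictions can also be requested for chosen values of the explanatory variable rather than for every observation. The sketch below assumes a small simulated data set (`toy`, not the course data) so that the chunk runs on its own; `toy`, `toy_fit`, `new_cohorts` and `toy_pred` are hypothetical names. In the notebook itself, the same `predict()` call can be applied to `fit`.

```{r}
# Toy data (simulated, NOT the course data) so the chunk runs on its own
set.seed(1)
toy <- data.frame(cohort90 = rep(c(-6, -4, -2, 0, 6, 8), each = 20))
toy$score <- 30 + 1.3 * toy$cohort90 + rnorm(nrow(toy), sd = 15)
toy_fit <- lm(score ~ cohort90, data = toy)

# Predict for chosen cohorts, with 95% confidence intervals for the mean score
new_cohorts <- data.frame(cohort90 = c(-6, 0, 8))
toy_pred <- predict(toy_fit, newdata = new_cohorts, interval = "confidence")
cbind(new_cohorts, toy_pred)  # columns: cohort90, fit, lwr, upr
```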
## Regression line

```{r}
plot(mydata$cohort90, mydata$score, ylim = c(0,80)) 
abline(fit) #add the fitted regression line
```

## Residual (unexplained) variance

```{r}
model <- summary(fit)
model$sigma^2
```
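
For reference, the value returned by `model$sigma^2` in a simple regression is the residual sum of squares divided by the residual degrees of freedom, which is $n - 2$ here because two coefficients are estimated. The symbols $n$ (number of observations) and $\hat{\sigma}_e^2$ are notation introduced for this note:

$$\hat{\sigma}_e^2 = \frac{1}{n - 2}\sum_{i=1}^{n}\left(\text{score}_i - \widehat{\text{score}}_i\right)^2$$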

## Proportion of explained variance (R-squared)

```{r}
model$r.squared
```
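
In a regression with a single explanatory variable and an intercept, R-squared equals the squared Pearson correlation between the two variables, and also one minus the ratio of the residual to the total sum of squares. The sketch below checks this on simulated data (`toy`, not the course data); all object names in the chunk are hypothetical.

```{r}
# Toy data (simulated, NOT the course data) so the chunk runs on its own
set.seed(1)
toy <- data.frame(cohort90 = rep(c(-6, -4, -2, 0, 6, 8), each = 20))
toy$score <- 30 + 1.3 * toy$cohort90 + rnorm(nrow(toy), sd = 15)
toy_fit <- lm(score ~ cohort90, data = toy)

r2_model <- summary(toy_fit)$r.squared          # as reported by summary()
r2_cor <- cor(toy$score, toy$cohort90)^2        # squared Pearson correlation
rss <- sum(residuals(toy_fit)^2)                # residual sum of squares
tss <- sum((toy$score - mean(toy$score))^2)     # total sum of squares
c(model = r2_model, cor_squared = r2_cor, one_minus_rss_tss = 1 - rss / tss)
```

The three values agree up to floating-point error. Multiplying by 100 gives the percentage of variance explained.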

# Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error.

```{r}
model$coefficients[2,1] / model$coefficients[2,2]

#significant at the 5% level if the absolute value exceeds 1.96, the critical value of the standard normal distribution
```
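
The 1.96 threshold is the large-sample (normal) critical value. A two-sided p-value based on the t distribution can also be computed directly from the test statistic, which is what `summary()` reports in the `Pr(>|t|)` column. The sketch below assumes simulated data (`toy`, not the course data); all object names in the chunk are hypothetical.

```{r}
# Toy data (simulated, NOT the course data) so the chunk runs on its own
set.seed(1)
toy <- data.frame(cohort90 = rep(c(-6, -4, -2, 0, 6, 8), each = 20))
toy$score <- 30 + 1.3 * toy$cohort90 + rnorm(nrow(toy), sd = 15)
toy_fit <- lm(score ~ cohort90, data = toy)

toy_coefs <- summary(toy_fit)$coefficients
# Test statistic: estimated slope divided by its standard error
t_value <- toy_coefs["cohort90", "Estimate"] / toy_coefs["cohort90", "Std. Error"]
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df.residual(toy_fit))
c(t = t_value, p = p_value)
```

With a large sample the t and normal critical values are nearly identical, so both approaches lead to the same conclusion.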

# Sources

- Multilevel Modelling online course from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).