This R notebook can be used as a template to compute regression with a single ordinal explanatory variable. It is usable in Rstudio. The content is adapated from the mooc Multilevel modelling online course from the Centre for Mutlilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).
library(xlsx) #read xlsx files
#Text file
mydata <- read.table(file = "3.1.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook
#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")
dim(mydata) #number of rows and columns
## [1] 33988 5
str(mydata) #basic description of the dataframe
## 'data.frame': 33988 obs. of 5 variables:
## $ caseid : int 339 340 345 346 352 353 354 361 362 363 ...
## $ score : int 49 18 46 43 17 29 15 19 45 12 ...
## $ cohort90: int -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 ...
## $ female : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sclass : int 2 3 4 3 3 2 3 2 3 1 ...
mydata[1:20, ] #first 20 rows of the data
## caseid score cohort90 female sclass
## 1 339 49 -6 0 2
## 2 340 18 -6 0 3
## 3 345 46 -6 0 4
## 4 346 43 -6 0 3
## 5 352 17 -6 0 3
## 6 353 29 -6 0 2
## 7 354 15 -6 0 3
## 8 361 19 -6 0 2
## 9 362 45 -6 0 3
## 10 363 12 -6 0 1
## 11 6824 0 -4 0 1
## 12 6826 0 -4 0 3
## 13 6827 20 -4 0 2
## 14 6828 32 -4 0 1
## 15 6829 0 -4 0 2
## 16 6834 24 -4 0 3
## 17 6836 23 -4 0 2
## 18 13206 7 -2 0 3
## 19 13209 38 -2 0 3
## 20 13215 46 -2 0 1
Having viewed the data we will now examine score (ratio variable : point score calculated from awards in Standard grades. Scores range from 0 to 75, with a higher score indicating a higher attainment ) and cohort90 (ordinal variable : cohorts: 1984, 1986, 1988, 1990, 1996 and 1998). The cohort90 variable is calculated by subtracting 1990 from each value. Thus values range from -6 (corresponding to 1984) to 8 (1998), with 1990 coded as zero,
#continuous variable
hist(mydata$score, xlim = c(0,80))
## Summary of the data
summary(mydata$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.00 33.00 31.09 45.00 75.00
sd(mydata$score)
## [1] 17.31437
mytable <- table(mydata$cohort90)
mytablecomb <- cbind(mytable, prop.table(mytable),
cumsum(prop.table(mytable)))
colnames(mytablecomb) <- c("Freq", "Perc", "Cum")
mytablecomb
## Freq Perc Cum
## -6 6478 0.1905967 0.1905967
## -4 6325 0.1860951 0.3766918
## -2 5245 0.1543192 0.5310109
## 0 4371 0.1286042 0.6596152
## 6 4244 0.1248676 0.7844828
## 8 7325 0.2155172 1.0000000
plot(mydata$cohort90, mydata$score, ylim = c(0,80))
l <- tapply(mydata$score, factor(mydata$cohort90), length)
m <- tapply(mydata$score, factor(mydata$cohort90), mean)
s <- tapply(mydata$score, factor(mydata$cohort90), sd)
tableScore <- cbind("Freq" = l, "mean(score)" = m, "sd(score)" = s)
tableScore
## Freq mean(score) sd(score)
## -6 6478 23.65545 18.07995
## -4 6325 24.77265 17.37533
## -2 5245 28.52450 15.93629
## 0 4371 29.10043 15.76355
## 6 4244 39.43473 13.55147
## 8 7325 41.33065 13.00926
cor(mydata$score, mydata$cohort90)
## [1] 0.4088625
\[dependent_i = \beta_0 + \beta_1 * explanatory_i + e_i\]
fit <- lm(score ~ cohort90, data = mydata)
summary(fit)
##
## Call:
## lm(formula = score ~ cohort90, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.31 -11.73 0.56 12.20 46.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.72873 0.08582 358.04 <2e-16 ***
## cohort90 1.32214 0.01601 82.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 33986 degrees of freedom
## Multiple R-squared: 0.1672, Adjusted R-squared: 0.1671
## F-statistic: 6822 on 1 and 33986 DF, p-value: < 2.2e-16
\[ \hat{score}_i = 30.73 + 1.32 * cohort90_i\]
predscore <- predict(fit) #Predictions for every observation in the data
plot(mydata$cohort90, mydata$score, ylim = c(0,80))
abline(lm(score ~ cohort90, data = mydata))
model <- summary(fit)
model$sigma^2
## [1] 249.6798
model$r.squared
## [1] 0.1671685
The test statistic is calculated by dividing the estimated slope by its standard error
model$coefficients[2,1] / model$coefficients[2,2]
## [1] 82.59405
#significant if larger than 1.96 or -1.96, the critical value at the 5% level of significance