This R notebook can be used as a template for regression with a single ordinal explanatory variable. It can be run in RStudio. The content is adapted from the Multilevel Modelling online course (MOOC) from the Centre for Multilevel Modelling at the University of Bristol (http://www.bristol.ac.uk/cmm/learning/online-course/).

Packages

library(xlsx) #read xlsx files

Data import

#Text file
mydata <- read.table(file = "3.1.txt", sep = ",", header = TRUE) #put the file in the same directory as this notebook

#Excel file
#mydata <- read.xlsx("myfile.xlsx", sheetName = "mysheetname")

Data exploration

dim(mydata) #number of rows and columns
## [1] 33988     5
str(mydata) #basic description of the dataframe
## 'data.frame':    33988 obs. of  5 variables:
##  $ caseid  : int  339 340 345 346 352 353 354 361 362 363 ...
##  $ score   : int  49 18 46 43 17 29 15 19 45 12 ...
##  $ cohort90: int  -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 ...
##  $ female  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sclass  : int  2 3 4 3 3 2 3 2 3 1 ...
mydata[1:20, ] #first 20 rows of the data
##    caseid score cohort90 female sclass
## 1     339    49       -6      0      2
## 2     340    18       -6      0      3
## 3     345    46       -6      0      4
## 4     346    43       -6      0      3
## 5     352    17       -6      0      3
## 6     353    29       -6      0      2
## 7     354    15       -6      0      3
## 8     361    19       -6      0      2
## 9     362    45       -6      0      3
## 10    363    12       -6      0      1
## 11   6824     0       -4      0      1
## 12   6826     0       -4      0      3
## 13   6827    20       -4      0      2
## 14   6828    32       -4      0      1
## 15   6829     0       -4      0      2
## 16   6834    24       -4      0      3
## 17   6836    23       -4      0      2
## 18  13206     7       -2      0      3
## 19  13209    38       -2      0      3
## 20  13215    46       -2      0      1

Having viewed the data, we will now examine score (ratio variable: point score calculated from awards in Standard grades; scores range from 0 to 75, with a higher score indicating higher attainment) and cohort90 (ordinal variable: the cohorts 1984, 1986, 1988, 1990, 1996 and 1998). The cohort90 variable is calculated by subtracting 1990 from each cohort year, so values range from -6 (corresponding to 1984) to 8 (1998), with 1990 coded as zero.
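As a minimal sketch of that coding (the year vector below is hypothetical and not a column of this dataset):

year <- c(1984, 1986, 1988, 1990, 1996, 1998) #cohort years
year - 1990 #gives -6 -4 -2 0 6 8, i.e. the cohort90 coding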

Dependent variable (score)

Histogram of the data

#continuous variable
hist(mydata$score, xlim = c(0,80))

Summary of the data

summary(mydata$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.00   33.00   31.09   45.00   75.00

Standard deviation of the data

sd(mydata$score)
## [1] 17.31437

Independent (explanatory) variable (cohort90)

Frequencies, proportions and cumulative proportions

mytable <- table(mydata$cohort90)
mytablecomb <- cbind(mytable, prop.table(mytable), 
cumsum(prop.table(mytable))) 
colnames(mytablecomb) <- c("Freq", "Perc", "Cum") 
mytablecomb
##    Freq      Perc       Cum
## -6 6478 0.1905967 0.1905967
## -4 6325 0.1860951 0.3766918
## -2 5245 0.1543192 0.5310109
## 0  4371 0.1286042 0.6596152
## 6  4244 0.1248676 0.7844828
## 8  7325 0.2155172 1.0000000

Relationship between explanatory and dependent variables

Scatterplot

plot(mydata$cohort90, mydata$score, ylim = c(0,80)) 

Frequency, mean and SD of the dependent variable for each value of the explanatory variable

l <- tapply(mydata$score, factor(mydata$cohort90), length) 
m <- tapply(mydata$score, factor(mydata$cohort90), mean) 
s <- tapply(mydata$score, factor(mydata$cohort90), sd) 
tableScore <- cbind("Freq"  = l, "mean(score)" = m, "sd(score)" = s) 
tableScore
##    Freq mean(score) sd(score)
## -6 6478    23.65545  18.07995
## -4 6325    24.77265  17.37533
## -2 5245    28.52450  15.93629
## 0  4371    29.10043  15.76355
## 6  4244    39.43473  13.55147
## 8  7325    41.33065  13.00926

Pearson correlation

cor(mydata$score, mydata$cohort90) 
## [1] 0.4088625

Linear regression

\[dependent_i = \beta_0 + \beta_1 * explanatory_i + e_i\]

fit <- lm(score ~ cohort90, data = mydata) 
summary(fit) 
## 
## Call:
## lm(formula = score ~ cohort90, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.31 -11.73   0.56  12.20  46.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.72873    0.08582  358.04   <2e-16 ***
## cohort90     1.32214    0.01601   82.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.8 on 33986 degrees of freedom
## Multiple R-squared:  0.1672, Adjusted R-squared:  0.1671 
## F-statistic:  6822 on 1 and 33986 DF,  p-value: < 2.2e-16

\[ \hat{score}_i = 30.73 + 1.32 * cohort90_i\]

Predictions

predscore <- predict(fit) #Predictions for every observation in the data
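predict() can also take a newdata data frame to obtain fitted values for chosen cohorts; a small sketch (the newcohorts name is illustrative), whose results can be compared with the observed means in the table above:

newcohorts <- data.frame(cohort90 = c(-6, 0, 8)) #1984, 1990 and 1998 cohorts
predict(fit, newdata = newcohorts) #roughly 22.8, 30.7 and 41.3, from 30.73 + 1.32 * cohort90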

Regression line

plot(mydata$cohort90, mydata$score, ylim = c(0,80)) 
abline(lm(score ~ cohort90, data = mydata))

Residual (unexplained) variance

model <- summary(fit)
model$sigma^2
## [1] 249.6798
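This is the residual sum of squares divided by the residual degrees of freedom, which can be checked directly (a quick sketch, not part of the original course material):

sum(residuals(fit)^2) / df.residual(fit) #should reproduce model$sigma^2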

Proportion of explained variance (R-squared)

model$r.squared
## [1] 0.1671685
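With a single explanatory variable, R-squared is simply the squared Pearson correlation computed above, which gives a quick consistency check (sketch):

cor(mydata$score, mydata$cohort90)^2 #0.4088625^2, matches model$r.squared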

Hypothesis testing

The test statistic is calculated by dividing the estimated slope by its standard error.

model$coefficients[2,1] / model$coefficients[2,2]
## [1] 82.59405
#significant at the 5% level if larger than 1.96 or smaller than -1.96 (the critical values for a two-sided test)
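The exact two-sided p-value can also be obtained from the t statistic and the residual degrees of freedom (a sketch assuming the usual t reference distribution; summary(fit) already reports it as Pr(>|t|)):

tstat <- model$coefficients[2, 1] / model$coefficients[2, 2]
2 * pt(-abs(tstat), df = df.residual(fit)) #two-sided p-value, effectively zero here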

Sources

Centre for Multilevel Modelling, University of Bristol, Multilevel Modelling online course: http://www.bristol.ac.uk/cmm/learning/online-course/