Readability Model


Readability

The term “readability” describes how difficult a written passage is to read. Word choice, sentence length, and grammatical attributes all contribute to the overall difficulty of a passage. In traditional school settings, leveled books help children expand their reading skills within their designated grade level. These books are written to guidelines aligned with state and national English Language Arts standards. But what if a passage of text does not have a designated grade level? A teacher could manually read through the passage and check it for indicators, but this takes significant time and an extensive knowledge of grammar. This project explores the use of linear regression and text features to model readability scores.

Significance

The readability of a passage determines how well a reader interacts with its content. Readers who struggle to understand the basic words and structure will not be able to comprehend and absorb the overall storyline. All readers are unique in the way they approach a text, and the impact that a text’s readability and features have varies from student to student (1). It is important for teachers and literacy practitioners to understand not only their students' reading levels but also the readability scores of the passages being presented. This score helps teachers and practitioners provide reading assistance and resources calibrated to the individual student.

Model Description

The readability model featured in this series was developed in EDUC 654: Machine Learning, taught by Dr. Cengiz Zopluoglu at the University of Oregon. Both the text features and the model were created with text data from the CommonLit Readability Kaggle Competition. In the previous post, I showed how to generate text features to help predict the readability of a passage. In this post, I demonstrate how the readability prediction model was constructed.

Ridge Penalty

Performance metrics were used to determine which linear regression model produced the best results. The models compared were linear regression with no penalty, with a ridge penalty, with a lasso penalty, and with an elastic net penalty. For each model we recorded R-squared, the Mean Absolute Error (MAE), and the Root Mean Squared Error (RMSE). Of these four models, linear regression with a ridge penalty performed best, with the following statistics:

R-squared	MAE	RMSE
0.7274006	0.4347177	0.5355444
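As a minimal sketch of how these three metrics can be computed, the snippet below uses a small set of made-up observed and predicted scores (not the CommonLit data). R-squared is taken here as the squared correlation between observed and predicted values, which is an assumption about the definition used rather than something stated in this post.

```r
# Toy observed and predicted scores (made-up numbers, not the CommonLit data)
observed  <- c(1, 2, 3, 4)
predicted <- c(1.5, 1.5, 3.5, 3.5)

# R-squared, taken here as the squared correlation between observed and predicted
rsq  <- cor(observed, predicted)^2

# Mean Absolute Error: average size of the prediction errors
mae  <- mean(abs(observed - predicted))

# Root Mean Squared Error: like MAE, but large errors are weighted more heavily
rmse <- sqrt(mean((observed - predicted)^2))

round(c(rsq = rsq, mae = mae, rmse = rmse), 3)
```

Higher R-squared and lower MAE and RMSE indicate predictions that track the observed scores more closely.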

A ridge penalty term is added to the loss function to discourage large coefficients. By including a ridge penalty, we reduce model variance in exchange for added bias (Lecture 5a).
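To illustrate how the penalty shrinks coefficients, here is a small sketch on simulated data. It uses the textbook closed-form ridge solution, which minimizes the sum of squared errors plus lambda times the sum of squared coefficients, with standardized predictors and no intercept. The helper name ridge_coef is hypothetical, and glmnet scales its loss function differently, so the lambda values below are not comparable to the ones used later in this post.

```r
set.seed(1)

# Simulated data: 50 observations, 5 standardized predictors
n <- 50
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)  # center the outcome so no intercept is needed

# Hypothetical helper: closed-form ridge estimate (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Overall size (Euclidean norm) of the coefficient vector at each penalty level
lambdas    <- c(0, 1, 10, 100)
coef_norms <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))

data.frame(lambda = lambdas, coef_norm = coef_norms)
```

With lambda set to 0 this is ordinary least squares. As lambda grows, the coefficients are pulled toward zero, which is the source of both the lower variance and the added bias.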

Building the Model

To build the model, we began by loading the necessary packages.

require(caret)
require(recipes)
require(finalfit)
require(glmnet)

The data set is then imported. Readability is the data set obtained from the CommonLit Readability Kaggle Competition. We set the seed to allow for reproducibility.

readability <- read.csv('https://raw.githubusercontent.com/uo-datasci-specialization/c4-ml-fall-2021/main/data/readability_features.csv',header=TRUE)

set.seed(10152021)

Training and Test Sets

A crucial step in building the model is fitting it on one part of the available data and then testing its accuracy on the other part. We begin by drawing a random sample of row numbers, from 1 to the number of rows in the data set, and place 90% of the rows in the training data. The remaining 10% serves as our test data.

loc      <- sample(1:nrow(readability), round(nrow(readability) * 0.9))
read_tr  <- readability[loc, ]
read_te  <- readability[-loc, ]
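A quick way to convince yourself that this kind of split behaves as intended is to run it on a small toy data frame and check the sizes and the overlap. The toy data frame and its id column are made up for illustration.

```r
set.seed(10152021)

# Toy data frame with 100 rows; `id` lets us track which rows end up where
toy <- data.frame(id = 1:100, x = rnorm(100))

# Same splitting logic as above: sample 90% of the row numbers for training
loc    <- sample(1:nrow(toy), round(nrow(toy) * 0.9))
toy_tr <- toy[loc, ]
toy_te <- toy[-loc, ]

# Sizes of the two sets, and the number of rows that appear in both
c(train = nrow(toy_tr), test = nrow(toy_te),
  overlap = length(intersect(toy_tr$id, toy_te$id)))
```

Because sample() draws without replacement by default, no row can land in both sets.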

Blueprint

We use the recipes package in R to develop a blueprint for our model. In this package, a recipe is a description of the steps applied to a data set to prepare it for analysis. The recipe() function used here takes a data set, vars, and roles. In this case, vars is a character vector of column names corresponding to the variables in the data set, and roles is a character vector of the same length as vars that describes the role each variable takes in the model.

Once these parameters were established, steps were added to the blueprint. These are similar to steps in a cooking recipe. For example, when making banana bread, you gather the ingredients and add them to a mixing bowl in a particular order. Shown below, we added our steps in an order that was ideal for our model.

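# Every column is a predictor (990 of them here) except the last, which is the outcome
blueprint <- recipe(x     = readability,
                    vars  = colnames(readability),
                    roles = c(rep('predictor', ncol(readability) - 1), 'outcome')) %>%
  step_zv(all_numeric()) %>%                    # drop single-value columns
  step_nzv(all_numeric()) %>%                   # drop sparse, unbalanced columns
  step_impute_mean(all_numeric()) %>%           # fill missing values with the mean
  step_normalize(all_numeric_predictors()) %>%  # mean 0, sd 1
  step_corr(all_numeric(), threshold = 0.9)     # drop highly correlated columns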

The table below gives a basic description of each step used in this blueprint. For more information and other available steps, see the recipes package documentation.

Step	Description
step_zv	removes variables containing only a single value
step_nzv	removes highly sparse and unbalanced variables
step_impute_mean	substitutes missing values of numeric variables with the mean of training set variables
step_normalize	normalizes data to have an sd of 1 and a mean of 0
step_corr	removes variables that have large absolute correlations with other variables
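To see what these steps do in practice, the sketch below applies a shortened version of the blueprint to a tiny made-up data frame and inspects the result with prep() and bake(). The toy columns are assumptions chosen to trigger each step. Unlike the blueprint above, this sketch applies every step to all_numeric_predictors() so that the outcome column is left untouched.

```r
library(recipes)

set.seed(1)
x1  <- rnorm(20)
toy <- data.frame(
  constant = rep(1, 20),                      # only one value: step_zv target
  x1       = x1,
  x2       = 2 * x1 + rnorm(20, sd = 0.01),   # nearly a copy of x1: step_corr target
  x3       = c(NA, rnorm(19)),                # one missing value: step_impute_mean target
  y        = rnorm(20)                        # outcome
)

toy_blueprint <- recipe(x     = toy,
                        vars  = colnames(toy),
                        roles = c(rep("predictor", 4), "outcome")) %>%
  step_zv(all_numeric_predictors()) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

# prep() estimates what each step needs (means, sds, columns to drop);
# bake() applies the prepared steps to a data set
prepped <- prep(toy_blueprint, training = toy)
baked   <- bake(prepped, new_data = toy)

colnames(baked)
```

The constant column and one of the two near-duplicate columns should be gone from the result, and the missing value in x3 should be filled in.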

Cross-Validation and Grid Tuning

Cross-validation was conducted to judge the overall performance and accuracy of the model.

# Cross-validation settings

# Randomly shuffle the training data
read_tr <- read_tr[sample(nrow(read_tr)), ]

# Create 10 folds of (nearly) equal size
folds <- cut(seq(1, nrow(read_tr)), breaks = 10, labels = FALSE)

# For each fold, store the row indices used for training (all rows outside that fold)
my.indices <- vector('list', 10)
for (i in 1:10) {
  my.indices[[i]] <- which(folds != i)
}

cv <- trainControl(method = "cv",
                   index  = my.indices)
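The fold construction can be checked in isolation. The sketch below repeats the same logic for a hypothetical data set of 100 rows and confirms that each row is held out exactly once.

```r
# Hypothetical number of rows, for illustration only
n <- 100

# Assign each row position to one of 10 folds, as in the code above
folds <- cut(seq(1, n), breaks = 10, labels = FALSE)

# Training indices for each fold: every row that is not in the fold
train_idx <- lapply(1:10, function(i) which(folds != i))

# Held-out indices for each fold: the rows left out of training
held_out <- lapply(1:10, function(i) which(folds == i))

table(folds)               # number of rows per fold
sapply(train_idx, length)  # number of training rows per fold
```

Each list element passed to the index argument of trainControl() holds the rows used for training in that resample. caret treats the remaining rows as the held-out set.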

By tuning the strength of the ridge penalty, we can typically obtain a model that performs better than linear regression with no regularization. In our case, tuning identified an optimal lambda of 0.57, so the grid below fixes alpha at 0 (a pure ridge penalty) and lambda at that value.

grid <- data.frame(alpha = 0, lambda = 0.57) 
grid

The final step in building the model is to train it.

ridge <- caret::train(blueprint, 
                      data      = read_tr, 
                      method    = "glmnet", 
                      trControl = cv,
                      tuneGrid  = grid)
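The single-row grid above fixes lambda at its already tuned value. As a rough sketch of how such a value might be found and then checked on held-out data, the example below tunes lambda over a range on simulated data. The data, the column name target, the grid range, and the object names are all assumptions made for illustration, not the settings used for the readability model.

```r
library(caret)
library(glmnet)

set.seed(10152021)

# Simulated data: 200 rows, 10 predictors, continuous outcome `target`
n   <- 200
p   <- 10
sim <- as.data.frame(matrix(rnorm(n * p), n, p))
sim$target <- as.numeric(as.matrix(sim) %*% c(1, -1, 0.5, rep(0, 7)) + rnorm(n))

# 90/10 train/test split, as in the post
loc    <- sample(1:n, round(n * 0.9))
sim_tr <- sim[loc, ]
sim_te <- sim[-loc, ]

# 10-fold cross-validation over a grid of lambda values; alpha = 0 gives ridge
cv_sim   <- trainControl(method = "cv", number = 10)
grid_sim <- expand.grid(alpha = 0, lambda = seq(0.01, 2, by = 0.05))

ridge_sim <- caret::train(target ~ .,
                          data      = sim_tr,
                          method    = "glmnet",
                          trControl = cv_sim,
                          tuneGrid  = grid_sim)

# Lambda with the best cross-validated performance
ridge_sim$bestTune

# Performance on the held-out test set
pred_te <- predict(ridge_sim, newdata = sim_te)
rmse_te <- sqrt(mean((sim_te$target - pred_te)^2))
rmse_te
```

Once a best lambda has been identified this way, refitting with a one-row grid, as done above, trains the final model at that single value.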

Please move on to the next post in this series, Model Validation with easyCBM, to read about validating the model developed here with a set of texts and accompanying student data. For more information on regularized regression, refer to Dr. Zopluoglu’s lectures “Regularization in Linear Regression” and “Logistic Regression and Regularization.”

Resources

Francis, D. J., Kulesz, P. A., & Benoit, J. S. (2018). Extending the simple view of reading to account for variation within readers and across texts: The complete view of reading (CVRi). Remedial and Special Education, 39(5), 274-288. doi:/doi.org/10.1177/07419325187729
