Source: https://github.com/markziemann/bioinformatics_intro_workshop
Caret.
Caret is short for Classification And REgression Training. It is designed to wrangle functions from many different machine learning (ML) packages into one consistent interface, providing the "glue" between them. It has functions for data preparation, visualisation, pre-processing, filtering, classification performance metrics, hyperparameter tuning, class imbalance correction, fine-tuning of classifier thresholds, and training and resampling of multiple models.
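To give a feel for a couple of these helper functions, here is a minimal sketch of pre-processing and feature visualisation using R's built-in iris data (this snippet is just an illustration and is not part of the workshop dataset):
library("caret")
# learn centering/scaling parameters from the predictors, then apply them
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris[, 1:4])
summary(iris_scaled)
# pairwise feature plot of the predictors, coloured by species
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs")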
According to the Caret documentation it supports over 200 different model types, including some well-known ones (you can also query the full list from within R, as shown after the table):
| Model Type | Method | Description |
|---|---|---|
| Linear Models | `lm`, `glm`, `glmnet` | Linear regression, generalized linear models, elastic net |
| Tree-based Models | `rpart`, `rf`, `xgbTree`, `gbm` | Decision trees, random forests, gradient boosting |
| Support Vector Machines | `svmLinear`, `svmRadial` | Linear and RBF kernel SVMs |
| K-Nearest Neighbors | `knn` | Simple distance-based classifier |
| Neural Networks | `nnet`, `mlp` | Shallow neural networks |
| Naive Bayes | `naive_bayes`, `nb` | Probabilistic classifiers |
| Ensemble Methods | `stacking`, `bagFDA`, `ada` | Stacking, bagging, boosting |
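To see which models your installed copy of caret supports, and which hyperparameters a given method exposes for tuning, you can ask caret directly:
library("caret")
# list the codes of all model types caret knows about
names(getModelInfo())
# show the tunable hyperparameters for a particular method, e.g. random forest
modelLookup("rf")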
A typical machine learning workflow looks a bit like this:
Typical ML workflow
After data cleaning, the data is split into training (~70%) and testing (~30%) sets. In the world of omics, this division is usually based on samples, but different approaches can be applied; for example, we may want to classify genes or proteins based on some features. The data needs to be labelled, for example "disease" and "normal".
It is important that the training data is not used in the testing stage, as that leads to inflated performance metrics. Also, as the process of splitting the data is random and may influence the performance, you could consider repeating with different splits to see how the models and their performance vary. This is the basis of k-fold cross-validation.
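Caret provides helpers for both steps. Here is a small sketch using the built-in iris data (just for illustration; the workshop examples below use the same functions):
library("caret")
set.seed(42)
# hold out ~30% of samples for testing, stratified by the class label
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[idx, ]
test_set <- iris[-idx, ]
# create 5 cross-validation folds from the training set
folds <- createFolds(train_set$Species, k = 5)
str(folds)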
Model testing is done based on different performance metrics (a small worked example follows the figures below):

- Accuracy (+/- 95% CI): how often the model is correct.
- Sensitivity (recall, or true positive rate): proportion of positives that are classified as positive.
- Specificity: proportion of negatives that are classified as negative.
- Positive predictive value (precision): proportion of positive classifications that are correct.
- Negative predictive value: proportion of negative classifications that are correct.
- F1: harmonic mean of precision and recall.
- False positive rate (FPR): proportion of negative cases that are incorrectly classified as positive.
- Receiver operating characteristic (ROC) curve: a visual representation with FPR on the x-axis and TPR on the y-axis, showing classification performance across all possible thresholds. We summarise the performance by calculating the area under the ROC curve (AUROC).
A ROC schematic
A real ROC Curve
Explore an example ROC curve here.
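Here is a small worked example showing how these metrics fall out of a 2x2 confusion matrix, and how AUROC can be computed with the pROC package (the counts and scores are made up purely for illustration):
library("pROC")
# made-up confusion matrix counts from 100 test samples
TP <- 40 ; FN <- 10 # true positives, false negatives
FP <- 5 ; TN <- 45 # false positives, true negatives
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # recall / true positive rate
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP) # positive predictive value
npv <- TN / (TN + FN) # negative predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
fpr <- FP / (FP + TN)
c(accuracy, sensitivity, specificity, precision, npv, f1, fpr)
# AUROC from simulated class labels and prediction scores
set.seed(1)
labels <- factor(rep(c("neg", "pos"), each = 50))
scores <- c(rnorm(50, 0), rnorm(50, 1.5))
auc(roc(labels, scores))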
As caret has a lot of dependencies, it is best to use a pre-made container image.
I have set up an apptainer image which is available from my lab’s website:
wget https://ziemann-lab.net/public/bioinfo_workshop/r-caret.sif
If for some reason this image isn’t available, you can build it from the docker image:
apptainer pull r-caret.sif docker://mziemann/r-caret:latest
Alternatively, you can make your own Docker image with the provided Dockerfile.
Once that is sorted, we can start a shell using the image (for HPC).
apptainer run --writable-tmpfs r-caret.sif /bin/bash
If you're working on a PC and want the RStudio IDE, then you can use the following command, then point your browser to localhost:PORT
apptainer run --writable-tmpfs r-caret.sif
Then in R try to load the caret library.
library("caret")
Here we will use the `glm` method with cross-validation to analyse the iris data. The model will predict the species based on petal/sepal length/width.
library("caret")
library("pROC")
set.seed(123)
# 1. Prepare data (binary classification)
iris2 <- subset(iris, Species != "setosa")
iris2$Species <- factor(iris2$Species) # ensure 2-class factor
# 2. Train/test split
trainIndex <- createDataPartition(iris2$Species, p = 0.8, list = FALSE)
trainData <- iris2[trainIndex, ]
testData <- iris2[-trainIndex, ]
# 3. Define training control with ROC metric
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary, # enables ROC metric
savePredictions = TRUE
)
# 4. Train a generalised linear model
fit <- train(
Species ~ .,
data = trainData,
method = "glm",
metric = "ROC",
trControl = ctrl,
preProcess = c("center", "scale")
)
fit
# 5. Predict class probabilities (probability of the second class level) and class labels
probs <- predict(fit, testData, type = "prob")[,2]
pred <- predict(fit, testData)
confusionMatrix(pred, testData$Species)
# 6. Compute ROC curve
roc_glm <- roc(testData$Species, probs)
plot(roc_glm, main = "ROC Curve - GLM (caret)")
auc(roc_glm)
# Save the ROC curve to a PDF file
pdf("roc_glm.pdf")
plot(roc_glm, main = "ROC Curve - GLM (caret)")
auc(roc_glm)
dev.off()
Follow the example in the Caret Vignette, which I’ve summarised below.
It involves the PLS (partial least squares) model.
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
# hold out 25% of samples for testing, stratified by class
inTrain <- createDataPartition(
y = Sonar$Class,
p = 0.75,
list = FALSE
)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
nrow(training)
nrow(testing)
# repeated cross-validation, optimising the ROC metric
ctrl <- trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
plsFit <- train(
Class ~ .,
data = training,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 15,
trControl = ctrl,
metric = "ROC"
)
plsFit
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
head(plsProbs)
confusionMatrix(data = plsClasses, testing$Class)
# ROC curve based on the predicted probability of the "R" (rock) class
roc_pls <- roc(testing$Class, plsProbs$R)
plot(roc_pls, main = "ROC Curve - PLS (caret)")
auc(roc_pls)
pdf("roc_pls.pdf")
plot(roc_pls, main = "ROC Curve - PLS (caret)")
auc(roc_pls)
dev.off()
Now repeat this, but with a different model, regularised discriminant analysis (rda), using the code below.
rdaGrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
set.seed(123)
rdaFit <- train(
Class ~ .,
data = training,
method = "rda",
tuneGrid = rdaGrid,
trControl = ctrl,
metric = "ROC"
)
rdaFit
rdaClasses <- predict(rdaFit, newdata = testing)
confusionMatrix(rdaClasses, testing$Class)
rdaProbs <- predict(rdaFit, newdata = testing, type = "prob")
head(rdaProbs)
roc_rda <- roc(testing$Class, rdaProbs$R)
plot(roc_rda, main = "ROC Curve - RDA (caret)")
auc(roc_rda)
pdf("roc_rda.pdf")
plot(roc_rda, main = "ROC Curve - RDA (caret)")
auc(roc_rda)
dev.off()
Now repeat this once more, this time with the random forest method, using the code below.
ctrl <- trainControl(
method = "cv", # Cross-validation
number = 5, # 5-fold CV
savePredictions = TRUE,
classProbs = TRUE, # Save class probabilities
summaryFunction = twoClassSummary
)
rfFit <- train(
Class ~ ., # Predict Class using all features
data = training,
method = "rf", # Random Forest
trControl = ctrl,
tuneLength = 5, # Try 5 different mtry values
metric = "ROC", # Optimize for ROC
importance = TRUE
)
rfFit
rfClasses <- predict(rfFit, newdata = testing)
confusionMatrix(rfClasses, testing$Class)
rfProbs <- predict(rfFit, newdata = testing, type = "prob")
head(rfProbs)
roc_rf <- roc(testing$Class, rfProbs$R)
plot(roc_rf, main = "ROC Curve - RF (caret)")
auc(roc_rf)
Now make a single chart with different series overlaid, including the models you ran already on the sonar data (PLS, RDA and RF) as well as a new one (try XGBoost or GLM).
plot(roc_pls, col="black", main="Comparison of models on Sonar data")
lines(roc_rda, col="blue")
lines(roc_rf, col="red")
lines(roc_glm, col="orange") # you will need to train a GLM on the Sonar data first (see below)
leglab <- c(paste("PLS",signif(auc(roc_pls),3)),
  paste("RDA",signif(auc(roc_rda),3)),
  paste("RF",signif(auc(roc_rf),3)),
  paste("GLM","XX") ) # replace "XX" with the GLM AUC
legend(0.3, 0.3, legend=leglab,
  col=c("black", "blue","red","orange"), cex=1, lty=1, lwd=2,
  title="Model (AUC)")
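If you want a reference point, training the GLM on the Sonar data might look something like the sketch below; it simply re-uses the training, testing and ctrl objects defined above (try it yourself before peeking):
set.seed(123)
# logistic regression (GLM) on the Sonar training set
glmFit <- train(
  Class ~ .,
  data = training,
  method = "glm",
  preProc = c("center", "scale"),
  trControl = ctrl,
  metric = "ROC"
)
glmProbs <- predict(glmFit, newdata = testing, type = "prob")
roc_glm <- roc(testing$Class, glmProbs$R)
auc(roc_glm)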
Let’s try to use this workflow to classify cancer and normal lung tissues based on RNA-seq data.
library("caret")
library("mlbench")
library("DESeq2")
library("pROC")
download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE158420&format=file&file=GSE158420%5Fcounts%2Etxt%2Egz",
destfile="GSE158420.counts.txt.gz")
x <- read.csv("GSE158420.counts.txt.gz",header=TRUE,sep="\t")
# sum the counts of duplicated gene symbols
xa <- aggregate(. ~ X, x, sum)
rownames(xa) <- xa$X
xa$X=NULL
# remove genes with <10 reads on average
table(rowMeans(xa)>=10)
xaf <- xa[(rowMeans(xa)>=10),]
# normalise to reads per million (RPM)
rpm <- apply(X=xaf, MARGIN=2, FUN=function(x) { x/sum(x) * 1e6 } )
# run DE to get informative genes
# sample sheet: tumour status and patient ID parsed from the column names
ss <- data.frame(colnames(xaf))
ss$tumor <- factor(grepl("T",ss[,1]))
ss$patient <- factor(gsub("N","",gsub("T","",ss[,1])))
dds <- DESeqDataSetFromMatrix(countData = xaf , colData=ss, design = ~ patient + tumor)
res <- DESeq(dds)
z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <- cbind(as.data.frame(z),assay(vsd))
dge <- as.data.frame(zz[order(zz$pvalue),])
topgenes <- head(rownames(dge),100)
topmx <- rpm[which(rownames(rpm) %in% topgenes),]
dim(topmx)
topmx2 <- as.data.frame(t(topmx))
# class labels can't be TRUE/FALSE; recode as a two-level factor
topmx2$tumor <- factor(gsub("FALSE","control",gsub("TRUE","tumor",as.character(grepl("T",rownames(topmx2))))))
set.seed(107)
inTrain <- createDataPartition(
y = topmx2$tumor,
p = 0.75,
list = FALSE
)
training <- topmx2[ inTrain,]
testing <- topmx2[-inTrain,]
dim(training)
dim(testing)
ctrl <- trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
plsFit <- caret::train(
tumor ~ .,
data = training,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 15,
trControl = ctrl,
metric = "ROC"
)
plsFit
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
head(plsProbs)
confusionMatrix(data = plsClasses, testing$tumor)
roc_pls <- roc(testing$tumor, plsProbs$tumor)
plot(roc_pls, main = "ROC Curve - Cancer PLS (caret)")
auc(roc_pls)
pdf("roc_pls_cancer.pdf")
plot(roc_pls, main = "ROC Curve - Cancer PLS (caret)")
auc(roc_pls)
dev.off()
There are multiple ways to extend upon this work:

- Explore different model types.
- Try using a smaller training set, which leaves a larger testing set and therefore gives a more reliable estimate of model performance.
- See if a smaller number of genes can be just as accurate as the 100 used here (see the sketch below).
- See if the model is still accurate with other lung cancer RNA-seq datasets.
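As a starting point for the gene-panel question, here is a minimal sketch that re-fits the PLS model with panels of different sizes and compares the held-out AUROC. It re-uses the dge, rpm and ctrl objects created above; the panel sizes are arbitrary choices:
panel_sizes <- c(10, 25, 50, 100)
aucs <- sapply(panel_sizes, function(n) {
  genes <- head(rownames(dge), n) # top n genes ranked by DE p-value
  mx <- as.data.frame(t(rpm[rownames(rpm) %in% genes, ]))
  mx$tumor <- factor(ifelse(grepl("T", rownames(mx)), "tumor", "control"))
  set.seed(107)
  idx <- createDataPartition(mx$tumor, p = 0.75, list = FALSE)
  fit <- caret::train(tumor ~ ., data = mx[idx, ], method = "pls",
    preProc = c("center", "scale"), tuneLength = 5,
    trControl = ctrl, metric = "ROC")
  probs <- predict(fit, newdata = mx[-idx, ], type = "prob")
  as.numeric(auc(roc(mx[-idx, "tumor"], probs$tumor)))
})
names(aucs) <- panel_sizes
aucs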