Source: https://github.com/markziemann/bioinformatics_intro_workshop
Caret.
Caret is short for Classification And REgression Training. It is designed to wrangle functions from many different machine learning (ML) packages into one consistent interface, providing the "glue" between them. It has functions for data preparation, visualisation, pre-processing, filtering, classification performance metrics, hyperparameter tuning, class imbalance correction, fine-tuning of classifier thresholds, and training and resampling of multiple models.
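To give a feel for a couple of these helper functions, here is a minimal sketch of pre-processing and feature visualisation using R's built-in iris data (this snippet is just an illustration and is not part of the workshop dataset):
library("caret")
# learn centering/scaling parameters from the predictors, then apply them
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris[, 1:4])
summary(iris_scaled)
# pairwise feature plot of the predictors, coloured by species
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs")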
According to the Caret documentation it supports over 200 different model types, including some well-known ones (you can also query the full list from within R, as shown after the table):
| Model Type | Method | Description |
|---|---|---|
| Linear Models | `lm`, `glm`, `glmnet` | Linear regression, generalized linear models, elastic net |
| Tree-based Models | `rpart`, `rf`, `xgbTree`, `gbm` | Decision trees, random forests, gradient boosting |
| Support Vector Machines | `svmLinear`, `svmRadial` | Linear and RBF kernel SVMs |
| K-Nearest Neighbors | `knn` | Simple distance-based classifier |
| Neural Networks | `nnet`, `mlp` | Shallow neural networks |
| Naive Bayes | `naive_bayes`, `nb` | Probabilistic classifiers |
| Ensemble Methods | `stacking`, `bagFDA`, `ada` | Stacking, bagging, boosting |
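To see which models your installed copy of caret supports, and which hyperparameters a given method exposes for tuning, you can ask caret directly:
library("caret")
# list the codes of all model types caret knows about
names(getModelInfo())
# show the tunable hyperparameters for a particular method, e.g. random forest
modelLookup("rf")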
A typical machine learning workflow looks a bit like this:
Typical ML workflow
After data cleaning, the data is split into training (~70%) and testing (~30%) sets. In the world of omics, this division is usually based on samples, but different approaches can be applied; for example, we may want to classify genes or proteins based on some features. The data needs to be labelled, for example "disease" and "normal".
It is important that the training data is not used in the testing stage, as that leads to inflated performance metrics. Also, as the process of splitting the data is random and may influence the performance, you could consider repeating with different splits to see how the models and their performance vary. This is the basis of k-fold cross-validation.
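Caret provides helpers for both steps. Here is a small sketch using the built-in iris data (just for illustration; the workshop examples below use the same functions):
library("caret")
set.seed(42)
# hold out ~30% of samples for testing, stratified by the class label
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[idx, ]
test_set <- iris[-idx, ]
# create 5 cross-validation folds from the training set
folds <- createFolds(train_set$Species, k = 5)
str(folds)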
Model testing is done based on different performance metrics (a small worked example follows the figures below):

- Accuracy (+/- 95% CI): how often the model is correct.
- Sensitivity (recall, or true positive rate): proportion of positives that are classified as positive.
- Specificity: proportion of negatives that are classified as negative.
- Positive predictive value (precision): proportion of positive classifications that are correct.
- Negative predictive value: proportion of negative classifications that are correct.
- F1: harmonic mean of precision and recall.
- False positive rate (FPR): proportion of negative cases that are incorrectly classified as positive.
- Receiver operating characteristic (ROC) curve: a visual representation with FPR on the x-axis and TPR on the y-axis, showing classification performance across all possible thresholds. We summarise the performance by calculating the area under the ROC curve (AUROC).
A ROC schematic
A real ROC Curve
Explore an example ROC curve here.
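Here is a small worked example showing how these metrics fall out of a 2x2 confusion matrix, and how AUROC can be computed with the pROC package (the counts and scores are made up purely for illustration):
library("pROC")
# made-up confusion matrix counts from 100 test samples
TP <- 40 ; FN <- 10 # true positives, false negatives
FP <- 5 ; TN <- 45 # false positives, true negatives
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # recall / true positive rate
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP) # positive predictive value
npv <- TN / (TN + FN) # negative predictive value
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
fpr <- FP / (FP + TN)
c(accuracy, sensitivity, specificity, precision, npv, f1, fpr)
# AUROC from simulated class labels and prediction scores
set.seed(1)
labels <- factor(rep(c("neg", "pos"), each = 50))
scores <- c(rnorm(50, 0), rnorm(50, 1.5))
auc(roc(labels, scores))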
As caret has a lot of dependencies, it is best to use a pre-made container image.
I have set up an apptainer image which is available from my lab’s website:
wget https://ziemann-lab.net/public/bioinfo_workshop/r-caret.sif
If for some reason this image isn’t available, you can build it from the docker image:
apptainer pull r-caret.sif docker://mziemann/r-caret:latest
Alternatively, you can make your own Docker image with the provided Dockerfile.
Once that is sorted, we can start a shell using the image (for HPC).
apptainer run --writable-tmpfs r-caret.sif /bin/bash
If you're working on a PC and want the RStudio IDE, then you can use the following command, then point your browser to localhost:PORT
apptainer run --writable-tmpfs r-caret.sif
Then in R try to load the caret library.
library("caret")
Here we will use the `glm` method with cross-validation to analyse the iris data. The model will predict the species based on petal/sepal length/width.
library("caret")
library("pROC")
set.seed(123)
# 1. Prepare data (binary classification)
iris2 <- subset(iris, Species != "setosa")
iris2$Species <- factor(iris2$Species) # ensure 2-class factor
# 2. Train/test split
trainIndex <- createDataPartition(iris2$Species, p = 0.8, list = FALSE)
trainData <- iris2[trainIndex, ]
testData <- iris2[-trainIndex, ]
# 3. Define training control with ROC metric
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary, # enables ROC metric
savePredictions = TRUE
)
# 4. Train a generalised linear model
fit <- train(
Species ~ .,
data = trainData,
method = "glm",
metric = "ROC",
trControl = ctrl,
preProcess = c("center", "scale")
)
fit
# 5. Predict class probabilities (probability of the second class level) and class labels
probs <- predict(fit, testData, type = "prob")[,2]
pred <- predict(fit, testData)
confusionMatrix(pred, testData$Species)
# 6. Compute ROC curve
roc_glm <- roc(testData$Species, probs)
plot(roc_glm, main = "ROC Curve - GLM (caret)")
auc(roc_glm)
# Save the ROC curve to a PDF file
pdf("roc_glm.pdf")
plot(roc_glm, main = "ROC Curve - GLM (caret)")
auc(roc_glm)
dev.off()
Follow the example in the Caret Vignette, which I’ve summarised below.
It involves the PLS (partial least squares) model.
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
# hold out 25% of samples for testing, stratified by class
inTrain <- createDataPartition(
y = Sonar$Class,
p = 0.75,
list = FALSE
)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
nrow(training)
nrow(testing)
# repeated cross-validation, optimising the ROC metric
ctrl <- trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
plsFit <- train(
Class ~ .,
data = training,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 15,
trControl = ctrl,
metric = "ROC"
)
plsFit
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
head(plsProbs)
confusionMatrix(data = plsClasses, testing$Class)
# ROC curve based on the predicted probability of the "R" (rock) class
roc_pls <- roc(testing$Class, plsProbs$R)
plot(roc_pls, main = "ROC Curve - PLS (caret)")
auc(roc_pls)
pdf("roc_pls.pdf")
plot(roc_pls, main = "ROC Curve - PLS (caret)")
auc(roc_pls)
dev.off()
Now repeat this, but with a different model, regularised discriminant analysis (rda), using the code below.
rdaGrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
set.seed(123)
rdaFit <- train(
Class ~ .,
data = training,
method = "rda",
tuneGrid = rdaGrid,
trControl = ctrl,
metric = "ROC"
)
rdaFit
rdaClasses <- predict(rdaFit, newdata = testing)
confusionMatrix(rdaClasses, testing$Class)
rdaProbs <- predict(rdaFit, newdata = testing, type = "prob")
head(rdaProbs)
roc_rda <- roc(testing$Class, rdaProbs$R)
plot(roc_rda, main = "ROC Curve - RDA (caret)")
auc(roc_rda)
pdf("roc_rda.pdf")
plot(roc_rda, main = "ROC Curve - RDA (caret)")
auc(roc_rda)
dev.off()
Now repeat this once more, this time with the random forest method, using the code below.
ctrl <- trainControl(
method = "cv", # Cross-validation
number = 5, # 5-fold CV
savePredictions = TRUE,
classProbs = TRUE, # Save class probabilities
summaryFunction = twoClassSummary
)
rfFit <- train(
Class ~ ., # Predict Class using all features
data = training,
method = "rf", # Random Forest
trControl = ctrl,
tuneLength = 5, # Try 5 different mtry values
metric = "ROC", # Optimize for ROC
importance = TRUE
)
rfFit
rfClasses <- predict(rfFit, newdata = testing)
confusionMatrix(rfClasses, testing$Class)
rfProbs <- predict(rfFit, newdata = testing, type = "prob")
head(rfProbs)
roc_rf <- roc(testing$Class, rfProbs$R)
plot(roc_rf, main = "ROC Curve - RF (caret)")
auc(roc_rf)
Now make a single chart with different series overlaid, including the models you ran already on the sonar data (PLS, RDA and RF) as well as a new one (try XGBoost or GLM).
plot(roc_pls, col="black", main="Comparison of models on Sonar data")
lines(roc_rda, col="blue")
lines(roc_rf, col="red")
lines(roc_glm, col="orange") # you will need to train a GLM on the Sonar data first (see below)
leglab <- c(paste("PLS",signif(auc(roc_pls),3)),
  paste("RDA",signif(auc(roc_rda),3)),
  paste("RF",signif(auc(roc_rf),3)),
  paste("GLM","XX") ) # replace "XX" with the GLM AUC
legend(0.3, 0.3, legend=leglab,
  col=c("black", "blue","red","orange"), cex=1, lty=1, lwd=2,
  title="Model (AUC)")
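If you want a reference point, training the GLM on the Sonar data might look something like the sketch below; it simply re-uses the training, testing and ctrl objects defined above (try it yourself before peeking):
set.seed(123)
# logistic regression (GLM) on the Sonar training set
glmFit <- train(
  Class ~ .,
  data = training,
  method = "glm",
  preProc = c("center", "scale"),
  trControl = ctrl,
  metric = "ROC"
)
glmProbs <- predict(glmFit, newdata = testing, type = "prob")
roc_glm <- roc(testing$Class, glmProbs$R)
auc(roc_glm)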
Let’s try to use this workflow to classify cancer and normal lung tissues based on RNA-seq data.
library("caret")
library("mlbench")
library("DESeq2")
library("pROC")
download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE158420&format=file&file=GSE158420%5Fcounts%2Etxt%2Egz",
destfile="GSE158420.counts.txt.gz")
x <- read.csv("GSE158420.counts.txt.gz",header=TRUE,sep="\t")
# sum the counts of duplicated gene symbols
xa <- aggregate(. ~ X, x, sum)
rownames(xa) <- xa$X
xa$X=NULL
# remove genes with <10 reads on average
table(rowMeans(xa)>=10)
xaf <- xa[(rowMeans(xa)>=10),]
# normalise to reads per million (RPM)
rpm <- apply(X=xaf, MARGIN=2, FUN=function(x) { x/sum(x) * 1e6 } )
# run DE to get informative genes
# sample sheet: tumour status and patient ID parsed from the column names
ss <- data.frame(colnames(xaf))
ss$tumor <- factor(grepl("T",ss[,1]))
ss$patient <- factor(gsub("N","",gsub("T","",ss[,1])))
dds <- DESeqDataSetFromMatrix(countData = xaf , colData=ss, design = ~ patient + tumor)
res <- DESeq(dds)
z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <- cbind(as.data.frame(z),assay(vsd))
dge <- as.data.frame(zz[order(zz$pvalue),])
topgenes <- head(rownames(dge),100)
topmx <- rpm[which(rownames(rpm) %in% topgenes),]
dim(topmx)
topmx2 <- as.data.frame(t(topmx))
# class labels can't be TRUE/FALSE; recode as a two-level factor
topmx2$tumor <- factor(gsub("FALSE","control",gsub("TRUE","tumor",as.character(grepl("T",rownames(topmx2))))))
set.seed(107)
inTrain <- createDataPartition(
y = topmx2$tumor,
p = 0.75,
list = FALSE
)
training <- topmx2[ inTrain,]
testing <- topmx2[-inTrain,]
dim(training)
dim(testing)
ctrl <- trainControl(
method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
plsFit <- caret::train(
tumor ~ .,
data = training,
method = "pls",
preProc = c("center", "scale"),
tuneLength = 15,
trControl = ctrl,
metric = "ROC"
)
plsFit
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
head(plsProbs)
confusionMatrix(data = plsClasses, testing$tumor)
roc_pls <- roc(testing$tumor, plsProbs$tumor)
plot(roc_pls, main = "ROC Curve - Cancer PLS (caret)")
auc(roc_pls)
pdf("roc_pls_cancer.pdf")
plot(roc_pls, main = "ROC Curve - Cancer PLS (caret)")
auc(roc_pls)
dev.off()
There are multiple ways to extend upon this work:

- Explore different model types.
- Try using a smaller training set, which leaves a larger testing set and therefore gives a more reliable estimate of model performance.
- See if a smaller number of genes can be just as accurate as the 100 used here (see the sketch below).
- See if the model is still accurate with other lung cancer RNA-seq datasets.
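As a starting point for the gene-panel question, here is a minimal sketch that re-fits the PLS model with panels of different sizes and compares the held-out AUROC. It re-uses the dge, rpm and ctrl objects created above; the panel sizes are arbitrary choices:
panel_sizes <- c(10, 25, 50, 100)
aucs <- sapply(panel_sizes, function(n) {
  genes <- head(rownames(dge), n) # top n genes ranked by DE p-value
  mx <- as.data.frame(t(rpm[rownames(rpm) %in% genes, ]))
  mx$tumor <- factor(ifelse(grepl("T", rownames(mx)), "tumor", "control"))
  set.seed(107)
  idx <- createDataPartition(mx$tumor, p = 0.75, list = FALSE)
  fit <- caret::train(tumor ~ ., data = mx[idx, ], method = "pls",
    preProc = c("center", "scale"), tuneLength = 5,
    trControl = ctrl, metric = "ROC")
  probs <- predict(fit, newdata = mx[-idx, ], type = "prob")
  as.numeric(auc(roc(mx[-idx, "tumor"], probs$tumor)))
})
names(aucs) <- panel_sizes
aucs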