To go back to the Home page, click here.
This notebook was inspired by this post from the website TowardsDataScience. The dataset can be found here.
Our goal is twofold:
We want to build a predictive model able to classify a patient as infected by the covid 19 or not based on his/her blood’s data.
We also want to identify diseases that have symptoms similar to the Covid 19. In that way, we can recommend to medical centers to treat separately people infected by these diseases. This would also allow to avoid to build dataset with false instances (i.e. with false discoveries).
<- read.csv('../../data/covid.csv');
data <- data.frame(data);
data $SARS.Cov.2.exam.result = 1*(data$SARS.Cov.2.exam.result == "positive") data
The dataset contains many NaN values. This issue of missing values must be corrected before moving forward. We also split the dataset in two parts. The fist DataFrame will be used to build a prediction model to classify patients as infected by the SARS Cov 2 (or not) based on their blood’s data. The second part will be used to find diseases with symptoms similar to the ones caused by the SARS Cov 2.
= data[,append(c(2,3),seq(22,39))];
dfDiseases = dfDiseases[,c(-7,-9)];
dfDiseases for (disease in colnames(dfDiseases)){
if (!(disease %in% c("SARS.Cov.2.exam.result","Patient.age.quantile"))){
=='detected',disease] = 1;
dfDiseases[dfDiseases[[disease]]=='',disease] = NA;
dfDiseases[dfDiseases[[disease]]
}
}= na.omit(dfDiseases);
dfDiseases = data[,append(c(2,3,40,42),seq(7,20))];
dfPrediction = na.omit(dfPrediction); dfPrediction
# attaching the final dataset
attach(dfPrediction)
# Correlation between variables
= cor(dfPrediction)
dfPrediction.corr 'SARS.Cov.2.exam.result'] dfPrediction.corr[,
## Patient.age.quantile
## 0.15935628
## SARS.Cov.2.exam.result
## 1.00000000
## Neutrophils
## -0.05666059
## Proteina.C.reativa.mg.dL
## 0.13497681
## Hematocrit
## 0.09013687
## Hemoglobin
## 0.10785913
## Platelets
## -0.30087670
## Mean.platelet.volume
## 0.08260132
## Red.blood.Cells
## 0.10400123
## Lymphocytes
## 0.03645045
## Mean.corpuscular.hemoglobin.concentrationÂ..MCHC.
## 0.08351100
## Leukocytes
## -0.35832387
## Basophils
## -0.12702656
## Mean.corpuscular.hemoglobin..MCH.
## -0.01176143
## Eosinophils
## -0.21243742
## Mean.corpuscular.volume..MCV.
## -0.05874675
## Monocytes
## 0.23547535
## Red.blood.cell.distribution.width..RDW.
## -0.08749211
# Correlation plot
corrplot(dfPrediction.corr, method = "square", type = "lower", tl.cex = 0.5);
We can take a first glance at the dataset and immediately identify some variables that are correlated with the target value (i.e. the variable SARS.Cov.2.exam.result).
A first visual approach to identify the most relevant variables to predict infections by the SARS Cov 2 consists in fitting a logistic model with only one variable and to plot the probability of being infected by the Covid with respect to the value of this predictor. Let us illustrate this method for some features.
# Plotting Separate probability graphs
<- function(var,variableORrank){
plotting.function = paste("SARS.Cov.2.exam.result", "~", as.character(var))
dfPrediction.sep.function = glm(as.formula(dfPrediction.sep.function), data = dfPrediction , family = binomial)
dfPrediction.sep.glm cv.glm(dfPrediction,dfPrediction.sep.glm,K=10)$delta[1]
<- data.frame(
dfPrediction.predicted.data probability.of.SARS=dfPrediction.sep.glm$fitted.values,
variable=dfPrediction[,as.character(var)])
<- dfPrediction.predicted.data[
dfPrediction.predicted.data order(dfPrediction.predicted.data$variable, decreasing=FALSE),]
$rank <- 1:nrow(dfPrediction.predicted.data)
dfPrediction.predicted.data
ggplot(data=dfPrediction.predicted.data, aes(x= variable, y=probability.of.SARS)) +
geom_point(aes(color=variable), size=3) +
xlab(as.character(var)) +
ylab("Probability of having SARS CoV-2") +
scale_colour_gradient(low = "darkgreen", high = "darkred", na.value = NA) +
ggtitle(coef(summary(dfPrediction.sep.glm))[,'Pr(>|z|)'])
}= plotting.function("Platelets")
plotfun1 = plotting.function("Monocytes")
plotfun2 = plotting.function("Hemoglobin")
plotfun3 = plotting.function("Red.blood.cell.distribution.width..RDW.")
plotfun4 grid.arrange(plotfun1, plotfun2, plotfun3, plotfun4, ncol=2 , nrow = 2)
Except for platelets, all predictors have a positive correlation on the probability of having SARS CoV-2. Moreover, Platelets seems to have a large influence on the probability to be infected by the covid, while the variable Monocytes seems to be less informative.
If this method is a good way to grasp a first intuition, it remains qualitative and scaling aspects can be confusing. Let us now consider a more rigorous approach to identify relevant predictors by using the Likelihood Ratio Test.
# Likelihood Ratio Tests with H0 being the global Null
<- function(var){
pvalue_LRTest <- glm(SARS.Cov.2.exam.result ~ 1, data=dfPrediction);
globalnull <- glm(paste("SARS.Cov.2.exam.result", "~", as.character(var)),data=dfPrediction);
alternative <- lrtest(globalnull, alternative);
res_lrtest $`Pr(>Chisq)`[2];
res_lrtest
}<- colnames(dfPrediction);
names_features <- names_features[names_features != 'SARS.Cov.2.exam.result'];
names_features <- sapply(names_features, pvalue_LRTest);
pvalues names(pvalues) <- names_features;
order(pvalues, decreasing=FALSE)]; pvalues[
## Leukocytes
## 3.027628e-14
## Platelets
## 2.737168e-10
## Monocytes
## 9.840969e-07
## Eosinophils
## 1.062583e-05
## Patient.age.quantile
## 1.013128e-03
## Proteina.C.reativa.mg.dL
## 5.453885e-03
## Basophils
## 8.952467e-03
## Hemoglobin
## 2.662828e-02
## Red.blood.Cells
## 3.258267e-02
## Hematocrit
## 6.416599e-02
## Red.blood.cell.distribution.width..RDW.
## 7.241497e-02
## Mean.corpuscular.hemoglobin.concentrationÂ..MCHC.
## 8.644399e-02
## Mean.platelet.volume
## 8.993860e-02
## Mean.corpuscular.volume..MCV.
## 2.282069e-01
## Neutrophils
## 2.451826e-01
## Lymphocytes
## 4.549064e-01
## Mean.corpuscular.hemoglobin..MCH.
## 8.095195e-01
Based of the results provided by Likelihood Ratio Tests, we will only consider the predictors leading to a p-value at most equal to 0.05. So for the analysis, we will use Leukocytes, Platelets, Monocytes, Eosinophils, Patient.Age.Quantile, Proteina.C.reativa.mg.dL, Basophils, Hemoglobin and Red.blood.Cells as predictors.
We split the dataset in two parts. Taking 80% of patients, we build a training dataset on which we will fit the parameters of our logistic model. On the remaining 20% of patients, we will evaluate the performance of our model.
# Dividing Train/Test data with 80% training dataset
<- floor(0.8 * nrow(dfPrediction))
sample_size <- sample(nrow(dfPrediction), size = sample_size)
train_ind <- as.data.frame(dfPrediction[train_ind,])
dfPrediction.train <- as.data.frame(dfPrediction[-train_ind,])
dfPrediction.test
# Logistic regression considering all the variables on the target variable SARS_COV2_Result
= paste("SARS.Cov.2.exam.result", "~", "Leukocytes + Platelets + Monocytes + Eosinophils + Patient.age.quantile + Proteina.C.reativa.mg.dL + Basophils + Hemoglobin + Red.blood.Cells")
dfPrediction.function = glm(as.formula(dfPrediction.function), data = dfPrediction.train , family = binomial)
dfPrediction.glm summary(dfPrediction.glm)
##
## Call:
## glm(formula = as.formula(dfPrediction.function), family = binomial,
## data = dfPrediction.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2836 -0.3371 -0.1300 -0.0396 3.1803
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.05388 0.84330 -5.993 2.06e-09 ***
## Leukocytes -2.65092 0.58280 -4.549 5.40e-06 ***
## Platelets -0.56822 0.40671 -1.397 0.16238
## Monocytes 0.09540 0.19986 0.477 0.63313
## Eosinophils -1.21933 0.44935 -2.714 0.00666 **
## Patient.age.quantile 0.09785 0.04983 1.964 0.04955 *
## Proteina.C.reativa.mg.dL 0.70595 0.32469 2.174 0.02969 *
## Basophils -0.44670 0.34147 -1.308 0.19082
## Hemoglobin 0.43039 0.54724 0.786 0.43159
## Red.blood.Cells 0.26184 0.52074 0.503 0.61509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.33 on 335 degrees of freedom
## Residual deviance: 149.83 on 326 degrees of freedom
## AIC: 169.83
##
## Number of Fisher Scoring iterations: 7
# 10 fold cross-validation to verify the model
cv.glm(dfPrediction.train,dfPrediction.glm,K=10)$delta[1]
## [1] 0.08234896
Let us predict the error on our test dataset.
# Predicting on test data based on training set
<- predict(dfPrediction.glm,dfPrediction.test,type = "response")
dfPrediction.glm.predict summary(dfPrediction.glm.predict)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000064 0.0045178 0.0402410 0.1411901 0.1388168 0.8848365
tapply(dfPrediction.glm.predict, dfPrediction.test$SARS.Cov.2.exam.result, mean)
## 0 1
## 0.08366047 0.45539063
# Confusion matrix for threshold of 1%
= table(dfPrediction.test$SARS.Cov.2.exam.result, dfPrediction.glm.predict > 0.01)
dfPrediction.confusion rownames(dfPrediction.confusion) <- c("Predicted FALSE","Predicted TRUE");
print(dfPrediction.confusion);
##
## FALSE TRUE
## Predicted FALSE 27 44
## Predicted TRUE 0 13
# False negative error rate (Type II error)
= dfPrediction.confusion[1,1]/ (dfPrediction.confusion[1,1]+dfPrediction.confusion[2,2])
dfPrediction.type2error print(paste("The proportion of errors of Type II is ",as.character(dfPrediction.type2error)));
## [1] "The proportion of errors of Type II is 0.675"
Using the confusion matrix, we understand that the Type I error using our model is really small. Hence, if patient infected by the covid with our model will be classified by our model as infected with high probability.
However, Type II errors are likely. This means that our prediction should be trust only when a patient is classified as non-infected.
Now we plot below the ROC curve for our model. Let us recall that a random classifier is expected to give points lying along the diagonal (FPR = TPR). A typical method to compare the performance of classifiers is to measure the area under the ROC curve (AUC) for each of them and to keep with the one with the larger value. In practice, the AUC performs well as a general measure of predictive accuracy.
# Plotting ROCR curve
= prediction(dfPrediction.glm.predict, dfPrediction.test$SARS.Cov.2.exam.result)
dfPrediction.ROCRpred = performance(dfPrediction.ROCRpred, "tpr", "fpr")
dfPrediction.ROCRperf plot(dfPrediction.ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
As already seen, our model is able to extract relevant insight from blood’s data of the patient. To conclude, let us plot the final results.
# Creating a dataframe with variables and predicted values of SARS results
<- data.frame(
dfPrediction.predict.dataframe probability.of.having.SARS=dfPrediction.glm$fitted.values,
Leukocytes=dfPrediction.train$Leukocytes,
Patient.age.quantile = dfPrediction.train$Patient.age.quantile,
Eosinophils = dfPrediction.train$Eosinophils,
Monocytes = dfPrediction.train$Monocytes,
Platelets = dfPrediction.train$Platelets,
Proteina.C.reativa.mg.dL = dfPrediction.train$Proteina.C.reativa.mg.dL)
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Patient.age.quantile, y=probability.of.having.SARS)) +
plot1 geom_point(aes(color=Patient.age.quantile), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Leukocytes, y=probability.of.having.SARS)) +
plot2 geom_point(aes(color=Leukocytes), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Monocytes, y=probability.of.having.SARS)) +
plot3 geom_point(aes(color=Monocytes), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Eosinophils, y=probability.of.having.SARS)) +
plot4 geom_point(aes(color=Eosinophils), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Platelets, y=probability.of.having.SARS)) +
plot5 geom_point(aes(color=Platelets), size=4) + theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
= ggplot(data=dfPrediction.predict.dataframe, aes(x=Proteina.C.reativa.mg.dL, y=probability.of.having.SARS)) +
plot6 geom_point(aes(color=Proteina.C.reativa.mg.dL), size=4) + theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
# Plotting the values
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol=2 , nrow = 3)
In this section, we want to find disease(s) with symptoms similar to the one of the Covid 19. Such information could be useful to decrease the rate of false positive case in hospitals and also to improve the quality of datasets provided on the covid 19.
We have to skip Parainfluenza 2 as there are no cases for it.
<- dfDiseases[,colnames(dfDiseases) != 'Parainfluenza.2']; dfDiseases
Now let’s plot the CramerV’s correlation plot to check the correlation between categorical variables.
# CramerV Correlation to check for any correlation between catagorical variables
# Except Patient's age, all other variables are catagorical(binary)
= PairApply(dfDiseases[,names(dfDiseases) != "Patient.age.quantile"], CramerV, symmetric = TRUE)
dfDiseases.corr # Displaying correlation with variable SARS_COV2_Result
'SARS.Cov.2.exam.result'] dfDiseases.corr[,
## SARS.Cov.2.exam.result Respiratory.Syncytial.Virus
## 1.000000000 0.060107431
## Influenza.A Influenza.B
## 0.034910535 0.038396638
## Parainfluenza.1 Rhinovirus.Enterovirus
## 0.014172707 0.151723838
## Coronavirus.HKU1 Parainfluenza.3
## 0.036826552 0.025943102
## Chlamydophila.pneumoniae Adenovirus
## 0.024602623 0.029612806
## Parainfluenza.4 Coronavirus229E
## 0.035880616 0.008396133
## CoronavirusOC43 Inf.A.H1N1.2009
## 0.023186945 0.084016085
## Bordetella.pertussis Metapneumovirus
## 0.011567680 0.030742142
# Correlation plot
corrplot(dfDiseases.corr , method = "square", type = "lower", number.cex=0.5)
Rhinovirus_OR_Enterovirus seems to be the disease the most correlated with our target (with the correlation value of 0.1517). In the next section, we conduct a statistical test to identify if this correlation is statically significant.
We perform Likelihood Ratio Tests to identify diseases that show a significant correlation with the infection status of patient by the SARS Cov 2.
<- function(var){
pvalue_LRTest <- glm(SARS.Cov.2.exam.result ~ 1, data=dfDiseases);
globalnull <- glm(paste("SARS.Cov.2.exam.result", "~", as.character(var)),data=dfDiseases);
alternative <- lrtest(globalnull, alternative);
res_lrtest $`Pr(>Chisq)`[2];
res_lrtest
}<- colnames(dfDiseases);
names_features <- names_features[ !(names_features %in% c('SARS.Cov.2.exam.result','Patient.age.quantile'))];
names_features <- sapply(names_features, pvalue_LRTest);
pvalues names(pvalues) <- names_features;
order(pvalues, decreasing=FALSE)]; pvalues[
## Rhinovirus.Enterovirus Inf.A.H1N1.2009
## 2.007768e-08 1.970079e-03
## Respiratory.Syncytial.Virus Influenza.B
## 2.695813e-02 1.578475e-01
## Coronavirus.HKU1 Parainfluenza.4
## 1.755593e-01 1.869221e-01
## Influenza.A Metapneumovirus
## 1.991293e-01 2.582058e-01
## Adenovirus Parainfluenza.3
## 2.761149e-01 3.400448e-01
## Chlamydophila.pneumoniae CoronavirusOC43
## 3.655902e-01 3.938322e-01
## Parainfluenza.1 Bordetella.pertussis
## 6.022621e-01 6.705796e-01
## Coronavirus229E
## 7.575288e-01
Our intuition of the previous section is confirmed by the Likelihood Ratio Test. The symptoms of Rhinovirus are close to the SARS CoV-2 virus and mostly found in the nose (Rhinovirus). Therefore, the probability of having SARS decreases as Rhinovirus is detected.
Hence hospitals should distinguish between a person with Rhinovirus and a person with COVID-19. Separating patients with Rhinovirus and Covid-19 could help to relieve covid medical groups and to end up with datasets with less false positive patients.