To go back to the Home page, click here.
This notebook was inspired by this post from the website TowardsDataScience. The dataset can be found here.
Our goal is twofold:
We want to build a predictive model able to classify a patient as infected by the covid 19 or not based on his/her blood’s data.
We also want to identify diseases that have symptoms similar to the Covid 19. In that way, we can recommend to medical centers to treat separately people infected by these diseases. This would also allow to avoid to build dataset with false instances (i.e. with false discoveries).
data <- read.csv('../../data/covid.csv');
data <- data.frame(data);
data$SARS.Cov.2.exam.result = 1*(data$SARS.Cov.2.exam.result == "positive")The dataset contains many NaN values. This issue of missing values must be corrected before moving forward. We also split the dataset in two parts. The fist DataFrame will be used to build a prediction model to classify patients as infected by the SARS Cov 2 (or not) based on their blood’s data. The second part will be used to find diseases with symptoms similar to the ones caused by the SARS Cov 2.
dfDiseases = data[,append(c(2,3),seq(22,39))];
dfDiseases = dfDiseases[,c(-7,-9)];
for (disease in colnames(dfDiseases)){
  if (!(disease %in% c("SARS.Cov.2.exam.result","Patient.age.quantile"))){
    dfDiseases[dfDiseases[[disease]]=='detected',disease] = 1; 
    dfDiseases[dfDiseases[[disease]]=='',disease] = NA; 
  }
}
dfDiseases = na.omit(dfDiseases);
dfPrediction  = data[,append(c(2,3,40,42),seq(7,20))];
dfPrediction = na.omit(dfPrediction);# attaching the final dataset
attach(dfPrediction)
# Correlation between variables
dfPrediction.corr = cor(dfPrediction)
dfPrediction.corr[,'SARS.Cov.2.exam.result']##                              Patient.age.quantile 
##                                        0.15935628 
##                            SARS.Cov.2.exam.result 
##                                        1.00000000 
##                                       Neutrophils 
##                                       -0.05666059 
##                          Proteina.C.reativa.mg.dL 
##                                        0.13497681 
##                                        Hematocrit 
##                                        0.09013687 
##                                        Hemoglobin 
##                                        0.10785913 
##                                         Platelets 
##                                       -0.30087670 
##                              Mean.platelet.volume 
##                                        0.08260132 
##                                   Red.blood.Cells 
##                                        0.10400123 
##                                       Lymphocytes 
##                                        0.03645045 
## Mean.corpuscular.hemoglobin.concentrationÂ..MCHC. 
##                                        0.08351100 
##                                        Leukocytes 
##                                       -0.35832387 
##                                         Basophils 
##                                       -0.12702656 
##                 Mean.corpuscular.hemoglobin..MCH. 
##                                       -0.01176143 
##                                       Eosinophils 
##                                       -0.21243742 
##                     Mean.corpuscular.volume..MCV. 
##                                       -0.05874675 
##                                         Monocytes 
##                                        0.23547535 
##           Red.blood.cell.distribution.width..RDW. 
##                                       -0.08749211# Correlation plot
corrplot(dfPrediction.corr, method = "square", type = "lower", tl.cex = 0.5);We can take a first glance at the dataset and immediately identify some variables that are correlated with the target value (i.e. the variable SARS.Cov.2.exam.result).
A first visual approach to identify the most relevant variables to predict infections by the SARS Cov 2 consists in fitting a logistic model with only one variable and to plot the probability of being infected by the Covid with respect to the value of this predictor. Let us illustrate this method for some features.
# Plotting Separate probability graphs
plotting.function <- function(var,variableORrank){
  dfPrediction.sep.function = paste("SARS.Cov.2.exam.result", "~", as.character(var))
  dfPrediction.sep.glm = glm(as.formula(dfPrediction.sep.function), data = dfPrediction , family = binomial)
  cv.glm(dfPrediction,dfPrediction.sep.glm,K=10)$delta[1]
  
  dfPrediction.predicted.data <- data.frame(
    probability.of.SARS=dfPrediction.sep.glm$fitted.values,
    variable=dfPrediction[,as.character(var)])
  
  dfPrediction.predicted.data <- dfPrediction.predicted.data[
    order(dfPrediction.predicted.data$variable, decreasing=FALSE),]
  
  dfPrediction.predicted.data$rank <- 1:nrow(dfPrediction.predicted.data)
  
  ggplot(data=dfPrediction.predicted.data, aes(x= variable, y=probability.of.SARS)) +
    geom_point(aes(color=variable), size=3) +
    xlab(as.character(var)) +
    ylab("Probability of having SARS CoV-2") +
    scale_colour_gradient(low = "darkgreen", high = "darkred", na.value = NA) +
    ggtitle(coef(summary(dfPrediction.sep.glm))[,'Pr(>|z|)'])
}
plotfun1 = plotting.function("Platelets")
plotfun2 = plotting.function("Monocytes")
plotfun3 = plotting.function("Hemoglobin")
plotfun4 = plotting.function("Red.blood.cell.distribution.width..RDW.")
grid.arrange(plotfun1, plotfun2, plotfun3, plotfun4, ncol=2 , nrow = 2)Except for platelets, all predictors have a positive correlation on the probability of having SARS CoV-2. Moreover, Platelets seems to have a large influence on the probability to be infected by the covid, while the variable Monocytes seems to be less informative.
If this method is a good way to grasp a first intuition, it remains qualitative and scaling aspects can be confusing. Let us now consider a more rigorous approach to identify relevant predictors by using the Likelihood Ratio Test.
# Likelihood Ratio Tests with H0 being the global Null
pvalue_LRTest <- function(var){
  globalnull <- glm(SARS.Cov.2.exam.result  ~ 1, data=dfPrediction);
  alternative <- glm(paste("SARS.Cov.2.exam.result", "~", as.character(var)),data=dfPrediction);
  res_lrtest <- lrtest(globalnull, alternative);
  res_lrtest$`Pr(>Chisq)`[2];
}
names_features <- colnames(dfPrediction);
names_features <- names_features[names_features != 'SARS.Cov.2.exam.result'];
pvalues <- sapply(names_features, pvalue_LRTest);
names(pvalues) <- names_features;
pvalues[order(pvalues, decreasing=FALSE)];##                                        Leukocytes 
##                                      3.027628e-14 
##                                         Platelets 
##                                      2.737168e-10 
##                                         Monocytes 
##                                      9.840969e-07 
##                                       Eosinophils 
##                                      1.062583e-05 
##                              Patient.age.quantile 
##                                      1.013128e-03 
##                          Proteina.C.reativa.mg.dL 
##                                      5.453885e-03 
##                                         Basophils 
##                                      8.952467e-03 
##                                        Hemoglobin 
##                                      2.662828e-02 
##                                   Red.blood.Cells 
##                                      3.258267e-02 
##                                        Hematocrit 
##                                      6.416599e-02 
##           Red.blood.cell.distribution.width..RDW. 
##                                      7.241497e-02 
## Mean.corpuscular.hemoglobin.concentrationÂ..MCHC. 
##                                      8.644399e-02 
##                              Mean.platelet.volume 
##                                      8.993860e-02 
##                     Mean.corpuscular.volume..MCV. 
##                                      2.282069e-01 
##                                       Neutrophils 
##                                      2.451826e-01 
##                                       Lymphocytes 
##                                      4.549064e-01 
##                 Mean.corpuscular.hemoglobin..MCH. 
##                                      8.095195e-01Based of the results provided by Likelihood Ratio Tests, we will only consider the predictors leading to a p-value at most equal to 0.05. So for the analysis, we will use Leukocytes, Platelets, Monocytes, Eosinophils, Patient.Age.Quantile, Proteina.C.reativa.mg.dL, Basophils, Hemoglobin and Red.blood.Cells as predictors.
We split the dataset in two parts. Taking 80% of patients, we build a training dataset on which we will fit the parameters of our logistic model. On the remaining 20% of patients, we will evaluate the performance of our model.
# Dividing Train/Test data with 80% training dataset
sample_size <- floor(0.8 * nrow(dfPrediction))
train_ind <- sample(nrow(dfPrediction), size = sample_size)
dfPrediction.train <- as.data.frame(dfPrediction[train_ind,])
dfPrediction.test <- as.data.frame(dfPrediction[-train_ind,])
# Logistic regression considering all the variables on the target variable SARS_COV2_Result
dfPrediction.function = paste("SARS.Cov.2.exam.result", "~", "Leukocytes + Platelets + Monocytes + Eosinophils + Patient.age.quantile + Proteina.C.reativa.mg.dL + Basophils + Hemoglobin + Red.blood.Cells")
dfPrediction.glm = glm(as.formula(dfPrediction.function), data = dfPrediction.train , family = binomial)
summary(dfPrediction.glm)## 
## Call:
## glm(formula = as.formula(dfPrediction.function), family = binomial, 
##     data = dfPrediction.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2836  -0.3371  -0.1300  -0.0396   3.1803  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -5.05388    0.84330  -5.993 2.06e-09 ***
## Leukocytes               -2.65092    0.58280  -4.549 5.40e-06 ***
## Platelets                -0.56822    0.40671  -1.397  0.16238    
## Monocytes                 0.09540    0.19986   0.477  0.63313    
## Eosinophils              -1.21933    0.44935  -2.714  0.00666 ** 
## Patient.age.quantile      0.09785    0.04983   1.964  0.04955 *  
## Proteina.C.reativa.mg.dL  0.70595    0.32469   2.174  0.02969 *  
## Basophils                -0.44670    0.34147  -1.308  0.19082    
## Hemoglobin                0.43039    0.54724   0.786  0.43159    
## Red.blood.Cells           0.26184    0.52074   0.503  0.61509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.33  on 335  degrees of freedom
## Residual deviance: 149.83  on 326  degrees of freedom
## AIC: 169.83
## 
## Number of Fisher Scoring iterations: 7# 10 fold cross-validation to verify the model
cv.glm(dfPrediction.train,dfPrediction.glm,K=10)$delta[1]## [1] 0.08234896Let us predict the error on our test dataset.
# Predicting on test data based on training set
dfPrediction.glm.predict <- predict(dfPrediction.glm,dfPrediction.test,type = "response")
summary(dfPrediction.glm.predict)##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000064 0.0045178 0.0402410 0.1411901 0.1388168 0.8848365tapply(dfPrediction.glm.predict, dfPrediction.test$SARS.Cov.2.exam.result, mean)##          0          1 
## 0.08366047 0.45539063# Confusion matrix for threshold of 1%
dfPrediction.confusion = table(dfPrediction.test$SARS.Cov.2.exam.result, dfPrediction.glm.predict > 0.01)
rownames(dfPrediction.confusion) <- c("Predicted FALSE","Predicted TRUE");
print(dfPrediction.confusion);##                  
##                   FALSE TRUE
##   Predicted FALSE    27   44
##   Predicted TRUE      0   13# False negative error rate (Type II error)
dfPrediction.type2error = dfPrediction.confusion[1,1]/ (dfPrediction.confusion[1,1]+dfPrediction.confusion[2,2])
print(paste("The proportion of errors of Type II is ",as.character(dfPrediction.type2error)));## [1] "The proportion of errors of Type II is  0.675"Using the confusion matrix, we understand that the Type I error using our model is really small. Hence, if patient infected by the covid with our model will be classified by our model as infected with high probability.
However, Type II errors are likely. This means that our prediction should be trust only when a patient is classified as non-infected.
Now we plot below the ROC curve for our model. Let us recall that a random classifier is expected to give points lying along the diagonal (FPR = TPR). A typical method to compare the performance of classifiers is to measure the area under the ROC curve (AUC) for each of them and to keep with the one with the larger value. In practice, the AUC performs well as a general measure of predictive accuracy.
# Plotting ROCR curve
dfPrediction.ROCRpred = prediction(dfPrediction.glm.predict, dfPrediction.test$SARS.Cov.2.exam.result)
dfPrediction.ROCRperf = performance(dfPrediction.ROCRpred, "tpr", "fpr")
plot(dfPrediction.ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))As already seen, our model is able to extract relevant insight from blood’s data of the patient. To conclude, let us plot the final results.
# Creating a dataframe with variables and predicted values of SARS results
dfPrediction.predict.dataframe <- data.frame(
  probability.of.having.SARS=dfPrediction.glm$fitted.values,
  Leukocytes=dfPrediction.train$Leukocytes,
  Patient.age.quantile = dfPrediction.train$Patient.age.quantile,
  Eosinophils = dfPrediction.train$Eosinophils,
  Monocytes = dfPrediction.train$Monocytes,
  Platelets = dfPrediction.train$Platelets,
  Proteina.C.reativa.mg.dL = dfPrediction.train$Proteina.C.reativa.mg.dL)
plot1 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Patient.age.quantile, y=probability.of.having.SARS)) +
  geom_point(aes(color=Patient.age.quantile), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
plot2 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Leukocytes, y=probability.of.having.SARS)) +
  geom_point(aes(color=Leukocytes), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
plot3 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Monocytes, y=probability.of.having.SARS)) +
  geom_point(aes(color=Monocytes), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
plot4 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Eosinophils, y=probability.of.having.SARS)) +
  geom_point(aes(color=Eosinophils), size=4)+ theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
plot5 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Platelets, y=probability.of.having.SARS)) +
  geom_point(aes(color=Platelets), size=4) + theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
plot6 = ggplot(data=dfPrediction.predict.dataframe, aes(x=Proteina.C.reativa.mg.dL, y=probability.of.having.SARS)) +
  geom_point(aes(color=Proteina.C.reativa.mg.dL), size=4) + theme(axis.title.y = element_text(size=9)) + theme(legend.title=element_text(size=9))
# Plotting the values
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol=2 , nrow = 3)In this section, we want to find disease(s) with symptoms similar to the one of the Covid 19. Such information could be useful to decrease the rate of false positive case in hospitals and also to improve the quality of datasets provided on the covid 19.
We have to skip Parainfluenza 2 as there are no cases for it.
dfDiseases <- dfDiseases[,colnames(dfDiseases) != 'Parainfluenza.2'];Now let’s plot the CramerV’s correlation plot to check the correlation between categorical variables.
# CramerV Correlation to check for any correlation between catagorical variables
# Except Patient's age, all other variables are catagorical(binary)
dfDiseases.corr  = PairApply(dfDiseases[,names(dfDiseases) != "Patient.age.quantile"], CramerV, symmetric = TRUE)
# Displaying correlation with variable SARS_COV2_Result
dfDiseases.corr[,'SARS.Cov.2.exam.result']##      SARS.Cov.2.exam.result Respiratory.Syncytial.Virus 
##                 1.000000000                 0.060107431 
##                 Influenza.A                 Influenza.B 
##                 0.034910535                 0.038396638 
##             Parainfluenza.1      Rhinovirus.Enterovirus 
##                 0.014172707                 0.151723838 
##            Coronavirus.HKU1             Parainfluenza.3 
##                 0.036826552                 0.025943102 
##    Chlamydophila.pneumoniae                  Adenovirus 
##                 0.024602623                 0.029612806 
##             Parainfluenza.4             Coronavirus229E 
##                 0.035880616                 0.008396133 
##             CoronavirusOC43             Inf.A.H1N1.2009 
##                 0.023186945                 0.084016085 
##        Bordetella.pertussis             Metapneumovirus 
##                 0.011567680                 0.030742142# Correlation plot
corrplot(dfDiseases.corr , method = "square", type = "lower", number.cex=0.5)Rhinovirus_OR_Enterovirus seems to be the disease the most correlated with our target (with the correlation value of 0.1517). In the next section, we conduct a statistical test to identify if this correlation is statically significant.
We perform Likelihood Ratio Tests to identify diseases that show a significant correlation with the infection status of patient by the SARS Cov 2.
pvalue_LRTest <- function(var){
  globalnull <- glm(SARS.Cov.2.exam.result  ~ 1, data=dfDiseases);
  alternative <- glm(paste("SARS.Cov.2.exam.result", "~", as.character(var)),data=dfDiseases);
  res_lrtest <- lrtest(globalnull, alternative);
  res_lrtest$`Pr(>Chisq)`[2];
}
names_features <- colnames(dfDiseases);
names_features <- names_features[ !(names_features %in% c('SARS.Cov.2.exam.result','Patient.age.quantile'))];
pvalues <- sapply(names_features, pvalue_LRTest);
names(pvalues) <- names_features;
pvalues[order(pvalues, decreasing=FALSE)];##      Rhinovirus.Enterovirus             Inf.A.H1N1.2009 
##                2.007768e-08                1.970079e-03 
## Respiratory.Syncytial.Virus                 Influenza.B 
##                2.695813e-02                1.578475e-01 
##            Coronavirus.HKU1             Parainfluenza.4 
##                1.755593e-01                1.869221e-01 
##                 Influenza.A             Metapneumovirus 
##                1.991293e-01                2.582058e-01 
##                  Adenovirus             Parainfluenza.3 
##                2.761149e-01                3.400448e-01 
##    Chlamydophila.pneumoniae             CoronavirusOC43 
##                3.655902e-01                3.938322e-01 
##             Parainfluenza.1        Bordetella.pertussis 
##                6.022621e-01                6.705796e-01 
##             Coronavirus229E 
##                7.575288e-01Our intuition of the previous section is confirmed by the Likelihood Ratio Test. The symptoms of Rhinovirus are close to the SARS CoV-2 virus and mostly found in the nose (Rhinovirus). Therefore, the probability of having SARS decreases as Rhinovirus is detected.
Hence hospitals should distinguish between a person with Rhinovirus and a person with COVID-19. Separating patients with Rhinovirus and Covid-19 could help to relieve covid medical groups and to end up with datasets with less false positive patients.