COVID Chest X-Ray Prediction Neural Network

 


COVID Chest Imaging Data Convolutional Neural Network Experimentation - Northwestern Consulting Research for Medical Company XYZ


Paul Kellar

MSDS 458: Artificial Intelligence and Deep Learning

5/27/23


Abstract

Northwestern Consulting has recently been employed by Medical Company XYZ to build a neural network that distinguishes diseases by analyzing image data, and to demonstrate the efficacy of such a pathology Artificial Intelligence model on x-ray imaging data. Specifically, since Northwestern Consulting has access to 29,986 chest x-ray images (approximately half of which are classified as positive for COVID-19, with the rest testing negative), the goal is to build a proof-of-concept model on this data to better understand how neural networks can benefit medical imaging and virus/disease categorization. This dataset is a strong use case for Convolutional Neural Network (CNN) architectures, which excel at recognizing both local and global patterns within images and categorizing images based on those patterns. Medical Company XYZ seeks to expand its automated pathology services soon to compete with rivals that are beginning to leverage Artificial Intelligence more often in their practices.

In order to do so, Northwestern Consulting explores various types of such frameworks, including a baseline Dense Neural Network (DNN) as well as Convolutional Neural Networks (CNNs), and compares their predictive performance on validation data. As discussed within the Literature Review, both ‘built-from-scratch’ CNNs and various CNN model types that have historically been successful in medical image classification were tested, including DenseNets, ImageNet-based models, and Visual Geometry Group (VGG) models. After landing upon a promising neural network architecture for this use case, Northwestern Consulting further iterated upon the number of layers, regularization techniques, and the number of input nodes in order to converge upon an optimized model for predicting whether a chest x-ray image is COVID positive or negative.

Introduction

Medical Company XYZ seeks to expand its use of neural networks for computer vision prediction. In the short term, this means building neural network models that can predict the presence of COVID-19 in a chest x-ray image based on pixel-level patterns identified in the image itself; in the long term, it means expanding pathology use cases to other diseases, viruses, and x-ray image types. Thus, Northwestern Consulting has provided various proof-of-concept neural network models for the short-term purpose of identifying COVID-19 from chest x-ray images based on the aforementioned dataset. When testing and optimizing such a model, evaluation metrics like validation accuracy, validation loss, root mean squared error (RMSE), and F1 score are used to evaluate the outputs of the following experiments:

  • Experiment 1: 2-layer DNN as a baseline predictive model

  • Experiment 2: CNN with 2 hidden layers, batch normalization, 20% dropout, and global average pooling

  • Experiment 3: CNN with 4 hidden layers, batch normalization, 5% dropout, and global max pooling

  • Experiment 4: CNN with 3 hidden layers, batch normalization, 20% dropout, global average pooling, and Leaky ReLU activation rather than ReLU activation

  • Experiments 5-6: VGG models with 13 total CNN layers, experimenting with the number of hidden layers and percent of dropout

  • Experiment 7: DenseNet model leveraging ImageNet weights and global average pooling

  • Experiment 8: ResNet model leveraging ImageNet weights and global average pooling

  • Experiments 9-10: ImageNet CNNs with various levels of hidden nodes and dropout proportions

After comparing various model architectures and parameter iterations, the optimized ImageNet-based CNN from experiment 10 (Model 10) achieved 96.0% accuracy on unseen chest x-ray data. Further, this ImageNet model outperformed the baseline DNN model in terms of relative predictive accuracy by over 9%.
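The "over 9%" figure is a relative difference rather than an absolute one. A quick check, using only the two accuracies reported in this paper (note that one is a test accuracy and the other a validation accuracy, so the comparison is approximate):

```python
# Accuracies reported in this paper, expressed as fractions.
baseline_dnn_accuracy = 0.873  # Model 1 (baseline DNN), validation accuracy
model_10_accuracy = 0.960      # Model 10 (ImageNet-based CNN), accuracy on unseen data

# Absolute gain is measured in percentage points; relative gain is measured
# as a fraction of the baseline's accuracy.
absolute_gain = model_10_accuracy - baseline_dnn_accuracy
relative_gain = absolute_gain / baseline_dnn_accuracy

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain * 100:.1f}%")
```

The absolute gain is 8.7 percentage points, and dividing by the baseline accuracy gives a relative gain of roughly 10%.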

Literature Review

         To demonstrate the efficacy of CNNs in classifying medical image data with a high degree of accuracy, a combination of from-scratch CNNs, experimentally refined CNNs, and historically successful CNN types was tested for COVID-19 categorization. There is always room to build CNN models from scratch and improve their predictive accuracy and loss on validation data in that way, yet it can also make sense to reuse and improve upon models that have historically categorized medical imaging data well, so that strong results can be reached more efficiently (which is especially valuable in a proof-of-concept use case like this one). Therefore, ImageNet-based, DenseNet, and VGG models were compared against net-new CNN and DNN models on validation/test data.

         Leveraging ImageNet models’ weights for a CNN’s input layer (that is, weights pre-trained on images from the ImageNet dataset) can often provide a better starting point for model training than a network learning from the training dataset alone (Xie 2019). Specifically, since neural networks tend to need extremely large datasets in order to learn good weights (and this dataset contains fewer than 30,000 chest x-ray images), having a semi-optimized starting point upon which the model has been “pre-trained” can be extremely beneficial. Furthermore, Xie and Richmond explain that grayscale ImageNet images can provide the best initial weights for pre-training medical predictive models, whereas color images “introduce artifacts and inefficiencies into models that are intended for single-channel medical images” (Xie 2019). Though grayscale versus color was not specified for the model weights leveraged here, this would be an important consideration when improving this model in future analyses.

         Northwestern Consulting is also testing DenseNet architectures and weights because of their historical accuracy in learning from medical image data. In a very similar study, Tavishee Chauhan and team built a CNN using DenseNet weights and achieved 98.3% predictive accuracy in categorizing COVID-19 cases based upon chest x-ray images (Chauhan 2021). DenseNet CNNs tend to need fewer parameters than conventional CNNs because they are “not required to learn unessential feature maps”, potentially allowing for much higher accuracy in separating COVID-19 from negative cases, since most pixels in a chest x-ray image could likely be considered noise when comparing these classes. In addition, DenseNets are organized into compact, dense blocks, which collectively allow information to be conveyed more seamlessly between layers. The denser the blocks are, the more of the relevant pixel-level information should be passed through the CNN, which in turn should support higher predictive accuracy (Chauhan 2021). Figure 1 below (from Chauhan’s research) illustrates the DenseNet block structure. It will be interesting to compare the performance of such a model to the other pre-built models and their input weights, as well as to the more original CNNs built for this use case.

Figure 1. Block Diagram of DenseNet

         Finally, various VGG models are tested in this research project when evaluating predictive performance. In another similar study, the article titled "Modified Vgg Deep Learning Architecture for Covid-19 Classification Using Bio-Medical Images" proposes a modified version of the VGG deep learning architecture for COVID-19 classification using biomedical images. This particular study achieved optimized predictive accuracy of 85.3% in terms of categorizing chest x-ray images as COVID-19, viral pneumonia, bacterial pneumonia, and normal chest x-rays (Al-Azzawi 2021). VGG neural networks were specifically chosen for this use case due to their ease in being modified for the specific predictive use case at hand, including changes to the network architecture and the addition of extra convolutional layers (Al-Azzawi 2021).

Even when looking at a small sample of images (2,503 images, as compared to over 29,000 chest x-ray images for Medical Company XYZ’s purposes), Al-Azzawi and team’s modified VGG Net model achieved 95.4% accuracy in specifically predicting COVID-19 cases, whereas other models they tested, like GoogLeNet, Inception-v4, AlexNet, and DenseNet201, could not achieve 95% accuracy (Al-Azzawi 2021). Thus, Northwestern Consulting’s research tested various VGG Net architectures with varied numbers of hidden nodes, convolutional layers, and percent of dropout. Furthermore, given that Northwestern Consulting’s prior projects have all focused on optimizing built-from-scratch models, it was a great learning experience to compare such models against pre-built layer architectures and initial weights.

Methods

Like the prior research conducted by Northwestern Consulting this quarter, various iterations of CNNs (both simple CNN structures and more complicated structures such as the DenseNet) are explored to converge upon a solution that best fits the COVID-19 chest x-ray image data, maximizing predictive accuracy while keeping the design simple enough that the model’s predictions scale to unseen data and future use cases. However, significantly more data pre-processing had to be conducted in order to prepare this input data for a CNN framework. Though this data was pulled directly from a Kaggle open-source dataset, more exploratory data analysis and data cleaning were needed than for datasets like the MNIST digit recognizer. Pulling the data into Python was difficult to begin with, as the image files themselves were not labeled as COVID-19 positive or negative; instead, there were train and test text files which held the images’ file paths and their associated classes, with the actual images stored in separate folders. Since this was a newer exercise in loading data as compared to prior projects, many code iterations were explored, and this structure tended to make data pre-processing and model ingestion a bit more difficult. A sample of the input data, showing both COVID positive and negative chest x-rays, can be seen below in Figure 2.

Figure 2. Sample Images of COVID-19 Positive and Negative/Normal Lung X-Ray Classes
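The file layout described above (label text files pointing at images stored in separate folders) can be joined with a small helper. The following is a minimal sketch, not the code used in this project: it assumes each line of the label file holds a filename followed by a class name, and the names read_label_file, "positive", and "negative" are illustrative assumptions, since the exact column layout of the Kaggle files is not given here.

```python
from pathlib import Path
import tempfile

# Assumed class names; the real label files may use different strings.
CLASS_TO_LABEL = {"positive": 1, "negative": 0}

def read_label_file(label_path, image_dir):
    """Return (image_path, label) pairs from a whitespace-separated label file.

    Assumed line format: "<filename> <class>".
    """
    pairs = []
    for line in Path(label_path).read_text().splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        filename, class_name = fields[0], fields[1].lower()
        if class_name not in CLASS_TO_LABEL:
            raise ValueError(f"Unknown class {class_name!r} for {filename}")
        pairs.append((Path(image_dir) / filename, CLASS_TO_LABEL[class_name]))
    return pairs

# Demonstration on a tiny temporary label file with made-up filenames.
with tempfile.TemporaryDirectory() as tmp:
    label_file = Path(tmp) / "train.txt"
    label_file.write_text("img_001.png positive\nimg_002.png negative\n\n")
    train_pairs = read_label_file(label_file, Path(tmp) / "train")

labels = [label for _, label in train_pairs]
print(labels)
```

Raising an error on an unrecognized class name, rather than silently defaulting to negative, guards against mislabeled training data if the file format differs from what is assumed.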

After preprocessing the data for model ingestion, the images were rescaled (divided by 255), and augmented images were added so that the models had a larger training set from which to learn local and global pixel-level patterns. Specifically, new images were created by flipping some images and changing the brightness of others. In future iterations of this image augmentor function, however, it would be worthwhile to investigate why only 2,000 new images were generated rather than ~30,000.
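The rescaling and augmentation steps can be illustrated with a dependency-free sketch. This is not the augmentor function used in this project: plain nested lists stand in for image arrays, and the brightness range (0.8 to 1.2), the 50/50 choice between flipping and brightening, and all function names are assumptions made for illustration.

```python
import random

def rescale(image):
    """Map 8-bit pixel values (0-255) into the range [0.0, 1.0]."""
    return [[pixel / 255 for pixel in row] for row in image]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def adjust_brightness(image, factor):
    """Scale already-rescaled pixels by `factor`, clipping to [0.0, 1.0]."""
    return [[min(1.0, max(0.0, pixel * factor)) for pixel in row] for row in image]

def augment(images, copies_per_image=1, seed=0):
    """Return the originals plus `copies_per_image` augmented variants of each."""
    rng = random.Random(seed)
    augmented = list(images)
    for image in images:
        for _ in range(copies_per_image):
            if rng.random() < 0.5:
                augmented.append(flip_horizontal(image))
            else:
                augmented.append(adjust_brightness(image, rng.uniform(0.8, 1.2)))
    return augmented

raw_image = [[0, 128, 255], [64, 32, 16]]  # a tiny 2x3 stand-in for an x-ray
scaled_image = rescale(raw_image)
dataset = augment([scaled_image], copies_per_image=2)
print(len(dataset))
```

In a loop of this shape, the number of new images is the number of originals passed in multiplied by copies_per_image, so the size of the augmented set is controlled directly by those two quantities.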

From there, the data was split into train/validation/test datasets and converted from a two-class output to a single-output format, in which a prediction closer to one indicated COVID positive and a prediction closer to zero indicated a negative x-ray. This was done in an attempt to speed up computation, as training time was a hurdle even when leveraging a Google Colab Pro high-RAM instance. Given these long training times, the ten models tested (DNN, CNNs, ImageNet-based CNNs, DenseNet, and VGG Net) were run for only 10 epochs. The exception was the optimized model, which after testing was re-run for 25 epochs with early stopping, with the best model being saved for more extensive evaluation. The models were evaluated in terms of validation accuracy, validation loss, F1 scores, and Area Under the Curve (AUC).
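As a rough illustration of the label conversion and data split described above, the sketch below collapses two-column one-hot labels into single 0/1 targets and shuffles indices into three partitions. The 80/10/10 proportions, the 0.5 decision threshold, the random seed, and the function names are assumptions; the actual split ratios used in this project are not stated.

```python
import random

def to_single_output(one_hot_labels, positive_index=1):
    """Collapse [negative, positive] one-hot rows into single 0/1 targets."""
    return [int(row[positive_index] == 1) for row in one_hot_labels]

def predict_class(probability, threshold=0.5):
    """Turn a single sigmoid-style output into a class (1 = COVID positive)."""
    return 1 if probability >= threshold else 0

def split_indices(n_items, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle item indices and partition them into train, validation, and test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    n_val = int(n_items * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

targets = to_single_output([[1, 0], [0, 1], [0, 1]])
train_idx, val_idx, test_idx = split_indices(29986)
print(targets, len(train_idx), len(val_idx), len(test_idx))
```

Under the assumed 80/10/10 split, the 29,986 images yield 23,990 training images, which is in line with the roughly 24,000 training images mentioned later in this paper.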

Results

As a starting point, a 2-layer Dense Neural Network (DNN) with Global Average Pooling, Batch Normalization, and 20% dropout was studied. Since DNNs are generally not as strong as CNNs in terms of extracting local and spatial patterns amongst pixels in images, starting with a model like this can provide solid baseline predictive accuracy and loss rates to improve upon through experimentation. This model actually performed relatively well in terms of predicting COVID vs negative cases, achieving 87.3% validation accuracy and 0.38 validation loss. Though not necessarily the ideal neural network model type to leverage for image classification, it is evident that DNNs can still perform well in this task (potentially aided by the fact that there were only two classes to predict for). However, the model did achieve 87.1% predictive accuracy in its first epoch, and then only marginally improved to 87.3% by the fifth epoch, which could illustrate that this model type can excel at capturing high-level patterns in image recognition but cannot discern more nuanced local patterns like some CNN models can on this dataset. The marginal improvement in validation accuracy per epoch is shown in Figure 3 below.


Figure 3. DNN Baseline Model’s Validation Accuracy per Epoch

From there, three CNNs were built ‘from scratch’ in order to compare predictive performance to the DNN model. Specifically, parameters including the number of hidden layers, the number of hidden nodes, the percent of dropout, and the activation function (ReLU vs Leaky ReLU) were compared. The best of these models, Model 2 with two hidden layers, only slightly outperformed the DNN model, with predictive accuracy of 88.8% and validation loss of 0.3545. Further, when experimenting with these model architectures, adding more hidden layers seemed to be a major detriment to predictive performance. Model 3, with four hidden layers, drastically overfit to the training images while performing very erratically on validation images. On the training data, this model achieved as high as 93% accuracy, but it alternated between 12.9% and 87.1% validation accuracy from epoch to epoch. Similarly (though not as drastically), Model 4 with three hidden layers achieved 92% accuracy and 0.176 loss on training data while only reaching 87.1% validation accuracy and 0.915 validation loss. The fact that these more complex models topped out at the same validation accuracy as the simple baseline DNN model may illustrate that information is being lost or not identified by these CNN models as they become more complex (in turn leading to this overfitting). Thus, Model 2 is marginally better than the baseline DNN, yet there is still room for improvement by experimenting with differing model frameworks.
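One possible reading of the 12.9% / 87.1% alternation, offered as a hypothesis rather than a finding: the two values sum to 100%, which is the pattern produced when a classifier assigns every validation image to a single class and switches which class from one epoch to the next. The sketch below reproduces that pattern on a hypothetical validation set whose class balance was chosen to match those numbers; the true class balance of the validation split is not stated in this paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical validation labels: 871 images of one class, 129 of the other.
y_val = [0] * 871 + [1] * 129

# Two degenerate classifiers that ignore the image entirely.
predict_all_zero = [0] * len(y_val)
predict_all_one = [1] * len(y_val)

acc_all_zero = accuracy(y_val, predict_all_zero)
acc_all_one = accuracy(y_val, predict_all_one)
print(acc_all_zero, acc_all_one)
```

Inspecting the distribution of predicted classes per epoch, or the confusion matrix on validation data, would confirm or rule out this explanation for a given model.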

Next, two VGG Net models were tested, since these models tend to be strong medical image prediction models, as explained in the Literature Review. Though it was initially somewhat surprising that Model 5 slightly underperformed the baseline with 87.1% validation accuracy (and that Model 6 significantly underperformed with 13% validation accuracy), this experimentation reaffirmed for Northwestern Consulting that overly complex neural network architectures can quickly overfit to training data (especially when the training dataset is not extremely large). These VGG Nets each had 13 total convolutional layers, given that having at least five VGG Net “blocks” tended to be standard practice in the research reviewed for models like these (Al-Azzawi 2021). Though Model 5 achieved 87.1% validation accuracy, this may have been a fluke of sorts, because its validation loss ballooned to 14.44 by the fifth epoch. And Model 6 achieved >90% training accuracy with just 13% validation accuracy, clearly demonstrating that these overly complex models were similarly overfitting. Though building these models seemed like a ‘failure’ at the time as a result (especially because the models took >10 minutes to train per epoch, making it more difficult to experiment with model layers iteratively), it was a valuable lesson that these more complex models did not necessarily produce more accurate results on validation data than simple DNN and CNN models.
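Some arithmetic helps show how much capacity 13 convolutional layers represent relative to a training set of roughly 24,000 images. The sketch below assumes the standard VGG16 convolutional layout (five blocks of 3x3 convolutions) and a 3-channel input; this paper only states that Models 5 and 6 had 13 convolutional layers, so their exact filter counts may differ.

```python
# Standard VGG16 convolutional layout: (conv layers in block, filters per layer).
# Assumed here for illustration; not confirmed as the layout of Models 5 and 6.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def conv_parameter_count(blocks, input_channels=3, kernel_size=3):
    """Count trainable weights and biases across all convolutional layers."""
    total = 0
    in_channels = input_channels
    for n_layers, filters in blocks:
        for _ in range(n_layers):
            weights = kernel_size * kernel_size * in_channels * filters
            total += weights + filters  # one bias term per filter
            in_channels = filters
    return total

n_conv_layers = sum(n_layers for n_layers, _ in VGG16_BLOCKS)
n_parameters = conv_parameter_count(VGG16_BLOCKS)
print(n_conv_layers, n_parameters)
```

Under these assumptions the convolutional stack alone has about 14.7 million trainable parameters, several hundred times the number of training images, which is consistent with the overfitting described above when such a network is trained on a dataset of this size.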

Given these new learnings, it made sense that a simpler CNN with only one convolutional layer, Global Average Pooling, Batch Normalization, Dropout, and initialized ImageNet weights achieved significantly higher accuracy in predicting COVID-19 positive/negative cases from chest x-ray imaging. For one, a simpler network structure was able to recognize specific patterns in the images without losing too much valuable information, which makes sense especially since this model had ~24,000 training images, whereas image datasets such as the MNIST Digit Recognizer dataset provide 60,000 training images. Moreover, transfer learning was an important factor in this model achieving 96% accuracy on test data. And as seen in Figure 4 below, this model further achieved 96% F1 scores for both the positive and negative COVID-19 categories. Instantiating this CNN with ImageNet weights effectively compensated for the rather small size of this dataset by starting from pre-trained weights learned on the ImageNet dataset, with only the weights of the subsequent model layers being further modified during training. These weights are learned and generalized from an extremely large dataset, which provides a better starting point from which the model can learn and adapt to the COVID-19 chest x-ray dataset. This model, therefore, is able to balance simplicity and generalizability in predicting COVID in chest x-ray images without suffering from overfitting (while also significantly beating the 87.3% validation accuracy baseline set by the initial DNN model). Figures 5 and 6 illustrate the high test accuracy of this model via a confusion matrix and complementary ROC curve chart, while Figure 7 shows the model reaching its best validation accuracy in the 19th epoch (given more computing power, it is possible that even higher validation accuracy could have been achieved by testing more epochs).

Figure 4. Classification Report for Optimized ImageNet CNN


Figure 5. Confusion Matrix for Optimized ImageNet CNN
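For readers less familiar with how the figures in a classification report follow from a confusion matrix, the sketch below computes precision, recall, F1, and accuracy from the four cell counts. The counts used are hypothetical and are not the values shown in Figure 5; they were simply chosen so that every metric comes out to 96%.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive class from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for illustration only (not taken from Figure 5).
metrics = classification_metrics(tp=96, fp=4, fn=4, tn=96)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

F1 is the harmonic mean of precision and recall, so it equals 96% only when the model is doing well on both false positives and false negatives, which is why it complements raw accuracy.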

Figure 6. ROC Curve for Optimized ImageNet CNN

Figure 7. Accuracy and Loss for Optimized ImageNet CNN per Learning Epoch
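The early stopping used in the 25-epoch run can be sketched as a simple patience rule: keep the epoch with the best validation accuracy and stop once no improvement has been seen for a fixed number of epochs. The patience value, the monitored metric, and the accuracy history below are all assumptions for illustration; the actual settings and the curve in Figure 7 are not reproduced here.

```python
def early_stopping(val_accuracy_by_epoch, patience=5):
    """Return (best_epoch, stopped_epoch), both 1-indexed.

    Training stops at the first epoch that is `patience` epochs past the
    best epoch without a strict improvement in validation accuracy.
    """
    best_accuracy = float("-inf")
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracy_by_epoch, start=1):
        if acc > best_accuracy:
            best_accuracy, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_accuracy_by_epoch)

# Hypothetical validation accuracies per epoch (not the values in Figure 7).
history = [0.90, 0.92, 0.91, 0.93, 0.92, 0.92, 0.93, 0.91]
best_epoch, stopped_epoch = early_stopping(history, patience=3)
print(best_epoch, stopped_epoch)
```

In this made-up history the best epoch is the fourth, and training halts at the seventh because three epochs pass without beating it; the weights saved from the best epoch are the ones carried forward for evaluation.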

Therefore, Northwestern Consulting recommends that Medical Company XYZ leverage Model 10 (the optimized single-layer CNN utilizing ImageNet instantiation weights) because it balances simplicity and generalizability on the test data while achieving the highest predictive accuracy. Figure 8 below shows this recommended model alongside some of the other models highlighted in this paper. Comparing the percent difference in validation accuracy between these models makes it clear that Model 10 is best suited for Medical Company XYZ’s proof-of-concept use case.

Figure 8. Stack-Ranked Model Recommendation List (Sorted by Validation Accuracy and Validation Loss)

Conclusion

To conclude, Northwestern Consulting recommends Model 10 for Medical Company XYZ’s proof-of-concept CNN computer vision use case due to its superior predictive performance as well as its scalability for future uses. Longer term, models like this could likely be further improved with additional training data (either by collecting more images, or by improving the data augmentation process in order to double or triple the number of synthetic and actual training images available). It would also be very interesting to add additional classes of chest x-ray images in order to more accurately identify the nuanced patterns seen in a COVID-19 patient versus one with viral or bacterial pneumonia, for example. From there, it would be interesting to explore visualization functions like heatmaps in order to pinpoint the specific pixel-level patterns distinguishing COVID-19 from other illnesses (Northwestern Consulting experimented with various heatmapping functions but was ultimately unable to converge on a working solution). Finally, in order to further improve experimentation opportunities on this dataset, Northwestern Consulting would also recommend utilizing GPUs/TPUs with very high RAM, so that models like these can be trained and iterated upon more quickly.

 

References

Al-Azzawi, A. A. F., and A. H. Al-Bayati. “Modified Vgg Deep Learning Architecture for Covid-19 Classification Using Bio-Medical Images.” IOP Conference Series: Materials Science and Engineering, vol. 1084, no. 1, 2021, p. 012001, https://doi.org/10.1088/1757-899X/1084/1/012001.

Chauhan, Tavishee, et al. “Optimization and Fine-Tuning of DenseNet Model for Classification of COVID-19 Cases in Medical Imaging.” International Journal of Information Management Data Insights, vol. 1, no. 2, 2021, p. 100020, https://doi.org/10.1016/j.jjimei.2021.100020.

Xie, Yiting, and David Richmond. “Pre-Training on Grayscale ImageNet Improves Medical Image Classification.” Lecture Notes in Computer Science, 2019, pp. 476–484, https://doi.org/10.1007/978-3-030-11024-6_37.

 

 

 

 


