Predictive Image Detection - CIFAR-10 Dataset - Neural Network Model
CIFAR-10 Dataset Convolutional Neural Network Experimentation - Northwestern Consulting Research for Company XYZ
Paul Kellar
MSDS 458: Artificial Intelligence and Deep Learning
4/25/23
Abstract
Given the successful implementation of a neural network model for Company XYZ in part one of Northwestern Consulting's research contract, the goal of this deep learning research project is to deepen our understanding of neural networks, this time by experimenting with the inner workings of deep neural networks (DNNs) versus convolutional neural networks (CNNs). This project leverages another popular computer vision dataset, CIFAR-10, made up of 60,000 images across ten classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Northwestern Consulting has again been tasked by Company XYZ with building a proof-of-concept, optimized neural network for discriminating between and predicting different image classes from the CIFAR-10 pixel data (with the long-term objective of expanding such modeling applications to other use cases at the company). To do so, Northwestern Consulting explores various neural network frameworks on this dataset, including iterations of DNNs as well as a range of CNNs, comparing predictive performance on validation data. As discussed in the Literature Review, CNNs excel at extracting localized patterns from image data and at building spatial hierarchical representations via shared weights. For both the DNN and CNN frameworks, Northwestern Consulting varies the number of layers in the model architectures, the use of regularization and dropout to penalize overfitting, and the number of input units in order to converge on an optimized model for predicting CIFAR-10 images.
Introduction
Company XYZ seeks to expand its usage of neural networks for computer vision, specifically by building image recognition algorithms that can distinguish between objects such as ships, airplanes, and trucks. Northwestern Consulting has therefore provided various proof-of-concept models for this purpose based on the open-source CIFAR-10 dataset. This dataset is made up of 60,000 images (50,000 for training and 10,000 for validation), an ample sample size for testing various model architectures within the DNN and CNN frameworks. Evaluation metrics such as validation accuracy, validation loss, root mean squared error (RMSE), and F1 score were leveraged to select a model across the following experiments:
Experiment 1: DNN with 256 input units, 2 hidden layers (no regularization) and 10 output nodes
Experiment 2: DNN with 256 input units, 3 hidden layers (no regularization) and 10 output nodes
Experiment 3: CNN with 256 input units, 2 hidden layers (no regularization) and 10 output nodes
Experiment 4: CNN with 256 input units, 3 hidden layers (no regularization) and 10 output nodes
Experiments 5a-5d: Experiments 1-4 repeated with L2 regularization and dropout
Experiment 5e: CNN with 128 input units, 3 hidden layers, dropout, and L2 regularization, with a Global Average Pooling layer before the output
Experiment 5f: CNN with 128 input units, 3 hidden layers, dropout, and L2 regularization, with a Dense layer before the output
After comparing various model architectures and parameter iterations, the optimized CNN from experiment 5f achieved 81.5% validation accuracy on unseen CIFAR-10 data. Further, the CNN experiments outperformed the DNN experiments in predictive accuracy by almost 40% in relative terms on average. This makes logical sense: CNNs tend to be more effective than DNNs at extracting localized image patterns through their convolutional filtering structure, sharing and translating these weights across different regions of an image to improve predictive accuracy. Therefore, as demonstrated in the following sections, CNNs can be seen as an ideal neural network framework for this kind of computer vision problem.
Literature Review
Since CIFAR-10 is a very popular open-source deep learning dataset, many iterations of predictive models have been tested on it for image recognition purposes, with CNN models tending to produce the most accurate results by predictive accuracy and validation loss. CNNs are specifically useful image recognition models because the "convolution layer is responsible for extracting important features with the help of filters" (Doon 2018). Essentially, CNNs are made up of filters that capture spatial patterns within specific subsets of image pixels and can transfer the weights of the most relevant local features across similar images in order to more easily learn pixel patterns (Doon 2018). Since a computer vision dataset of only 50,000 training images would be categorized as relatively small, leveraging these shared weights across images tends to drive improved image recognition with a lower chance of overfitting an overly complicated neural network on such a dataset.
Further, in terms of layer architecture, Max Pooling layers are key features of CNNs that simplify the network by shrinking the image stack and retaining only the maximum activation value within each pooling window (Doon 2018). Other common techniques used to decrease the likelihood of overfitting include dropout and L2 regularization. Dropout randomly drops some neurons on each training pass, in turn encouraging neurons to "learn useful features on their own without relying on other neurons" (Doon 2018). L2 regularization also combats overfitting by penalizing large weight values, ideally yielding a simpler model that is not so over-trained on the input data that it underperforms on validation data (Doon 2018).
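To make these techniques concrete, the following is a minimal Keras sketch of a convolutional block combining Max Pooling, dropout, and an L2 weight penalty; the filter count, dropout rate, and L2 coefficient are illustrative assumptions, not values from the cited study.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Sketch: one convolutional block with the anti-overfitting pieces
# described above (Max Pooling, dropout, L2 weight penalty).
block = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(0.001),  # penalizes large weights
                  input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),  # keeps only the maximum activation per 2x2 window
    layers.Dropout(0.25),         # randomly zeroes 25% of activations each training step
])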
Leveraging batch normalization in CNNs on the CIFAR-10 dataset can also significantly improve predictive performance. For each layer introduced in a neural network, batch normalization re-normalizes its outputs in order to improve model performance by "reducing the internal covariate shift" (Thakkar 2018). This reduction in internal covariate shift allows more relevant information about the image data to be transferred between training layers (essentially standardizing the activations between layers so that each layer trains on more stable input distributions from which to learn pixel patterns), which in turn can engender improved pattern recognition. Specifically, Vignesh Thakkar and team studied the impact of batch normalization on the CIFAR-10 dataset and observed 10% to 20% absolute improvements in predictive accuracy with its introduction (Thakkar 2018). Though this was one of the first research studies of its kind to focus specifically on the impact of batch normalization on predictive accuracy for neural networks, it generally affirmed its usefulness in CNN architectures.
Another relevant CNN architectural component tested by Northwestern Consulting to combat model overfitting is Global Average Pooling. Like batch normalization and CNN filters/Max Pooling layers, which seek to optimize and simplify a model's framework, Global Average Pooling flattens the feature maps into a single vector by calculating the average value for each input channel; these averages can then be combined via a concatenate layer. This framework, in tandem with CNN filtering, Max Pooling, and batch normalization, was actively employed in Saikat Islam Khan and team's research on deep neural network approaches for detecting breast cancer. Leveraging these overfitting-fighting techniques together, their optimized model achieved over 99% predictive accuracy in detecting breast cancer from image data (Khan 2022). Various iterations and combinations of these frameworks will similarly be tested by Northwestern Consulting when building an optimized model for image recognition on CIFAR-10.
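A hedged sketch of where these two components typically sit in a small CNN (the filter counts are illustrative assumptions): batch normalization follows each convolution to stabilize the activations the next layer sees, and Global Average Pooling collapses each feature map to its mean just before classification.

from tensorflow import keras
from tensorflow.keras import layers

# Sketch: batch normalization after each convolution, plus Global
# Average Pooling in place of a large flatten-and-dense step.
model = keras.Sequential([
    layers.Conv2D(64, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),  # one average value per channel
    layers.Dense(10, activation="softmax"),
])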
Methods
As in the prior research conducted for Company XYZ, various iterations of DNNs and CNNs will be explored to converge on an optimized solution that best fits the data and maximizes predictive accuracy while balancing simplicity of design (so that the model's predictions generalize to unseen data and future use cases). Specifically, Northwestern Consulting tested DNNs versus CNNs, the number of hidden layers, and various neural network regularization techniques to combat model overfitting. It was generally found that CNNs with regularization were the most performant on the CIFAR-10 dataset. Various other hyperparameters and model architectures could be tested in future experimentation (such as the number of hidden nodes and the optimization functions used when compiling the model), but in order to isolate the impact of the factors above, Northwestern Consulting purposely kept such parameters static.
The data was explored and preprocessed before any modeling took place. After loading the CIFAR-10 dataset and the necessary Python packages, key information such as the shape of the datasets (these images are sized 32x32x3, which is critical for feeding the image data into the neural networks correctly) as well as a preview of the images (shown in Table 1 below) was examined. After rescaling the data by dividing the pixel values by 255, the maximum intensity value for an 8-bit channel, model training could begin.
Table 1. Sample Images from the CIFAR-10 Dataset
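The loading and rescaling steps described above can be reproduced in a few lines of Keras; the sketch below is a minimal version of this preprocessing.

from tensorflow.keras.datasets import cifar10

# CIFAR-10 ships pre-split: 50,000 training and 10,000 validation
# images, each 32x32 pixels with 3 color channels.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3)

# Rescale 8-bit pixel intensities from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0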
Generally, significantly more preprocessing would be necessary before image data is ready for neural network modeling, but the CIFAR-10 dataset has already been extensively cleaned and organized, as it is a popular open-source image dataset. The overall preprocessing methodology is visualized in Table 2 below. Further preprocessing in the form of image augmentation was also tested (essentially 'doubling' the number of training images by creating flipped/rotated replicas to synthetically produce more training samples), but this was ultimately unsuccessful due to limited RAM in Northwestern Consulting's initial Google Colab instance. Finally, in terms of model evaluation and selection methodology, each experimental iteration was run through multiple epochs, with the epoch achieving the best validation accuracy being saved. From there, the resulting ten models were compared, with the 'best' one selected after evaluating validation accuracy, validation loss, F1 scores, and class-level performance via confusion matrices.
Table 2. CIFAR-10 Dataset Preprocessing Visualized for Neural Network
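The per-epoch selection procedure described above maps to a checkpoint callback in Keras. The sketch below assumes a compiled Keras model named model and the rescaled arrays from the preprocessing sketch earlier; the file name, epoch count, and batch size are illustrative assumptions.

from tensorflow.keras.callbacks import ModelCheckpoint

# Save only the best epoch's weights, judged by validation accuracy.
checkpoint = ModelCheckpoint("best_model.keras",
                             monitor="val_accuracy",
                             save_best_only=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=20, batch_size=64,
                    callbacks=[checkpoint])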
Results
To establish a baseline performance as a starting point for this project, a DNN with 2 hidden layers (and no regularization) was studied. Any model with greater than 10% predictive accuracy on this dataset could theoretically be considered 'performant,' since a random-guess model would achieve approximately that level of accuracy on CIFAR-10; however, based on prior research, Northwestern Consulting would expect a neural network to perform significantly better. The baseline DNN (Model 1) therefore provides a more relevant evaluation benchmark against which additional predictive accuracy from model tuning can be measured. This model has 256 nodes in its first hidden layer and 128 in its second, and achieved 50.4% accuracy on validation data.
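A minimal sketch of the baseline architecture as described (flattening each 32x32x3 image into a vector, then 256 and 128 ReLU units, then the 10-class softmax output); the optimizer and loss choices are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# Baseline two-hidden-layer DNN: each pixel is treated as an
# independent input feature, with no spatial structure retained.
dnn = keras.Sequential([
    layers.Flatten(input_shape=(32, 32, 3)),  # 3,072 input features
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
dnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",  # integer class labels
            metrics=["accuracy"])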
Given the research mentioned in the Literature Review, which has achieved upwards of 80% predictive accuracy on CIFAR-10 validation data, it is evident that further model improvements could be made (Thakkar 2018). Regardless, testing various DNN models against CNN models allows for relevant performance benchmarking. As illustrated in Table 3 below, predictive performance for Model 1 improves with each epoch until around the 11th, as the model continues to learn more about the relevant image patterns in each iteration. At the 11th epoch, validation loss begins to climb from its minimum and validation accuracy stagnates (whereas training accuracy and loss continue to improve; beyond this point of divergence, the model overfits).
Table 3. DNN Model 1 - Validation Accuracy and Validation Loss by Epoch
The performance of the two-layer DNN was then compared to a three-layer DNN (Model 2). The third hidden layer had 64 nodes, but the model was otherwise structured the same (ReLU was again used for the hidden layers' activation functions, with softmax as the output layer function). Overall, the three-layer DNN actually performed slightly worse than the two-layer DNN, with an optimized validation accuracy of 49.6% (reached at the 16th epoch). The hypothesis was that adding more hidden layers could produce stronger predictive accuracy, with the trade-off that the model would become harder to interpret due to the extra layer. However, the decrease in performance could be attributed to the extra layer making the network too complex from a training perspective, such that the three-layer DNN could not generalize patterns as well on this dataset. Given that computer vision problems generally demand very large datasets, overfitting may be occurring here (suggested by the fact that training accuracy exceeded validation accuracy by more than 6 percentage points). Alternatively, the deeper network may suffer from vanishing gradients, in which the nodes' weight updates become so small that convergence is very difficult to achieve (Doon 2018). Regardless, although this is not an optimal model, it points to the value of testing both different network structures (e.g., CNNs) and regularization techniques to deter overfitting.
Next, two- and three-layer CNNs were tested (Model 3 and Model 4, respectively, similarly built without regularization) to isolate the impact of the convolutional structure in capturing local image patterns versus the structure of DNNs. Leveraging CNNs in this context led to approximately a 40% relative improvement in validation accuracy (70.3% and 71.0% for the two- and three-layer CNNs, respectively), and similarly a 38% relative improvement in validation loss (0.89 and 0.87, respectively). As mentioned in previous sections, CNNs were expected to be more performant because they are structured to identify and correlate spatial pixel-level relationships (whereas DNNs evaluate each pixel individually to predict the class outputs).
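For contrast with the DNN sketch above, here is a hedged sketch of a two-convolutional-layer CNN in the spirit of Model 3; the filter counts and kernel sizes are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# Two-convolutional-layer CNN: filters share weights across the image,
# so local patterns learned in one region transfer to others.
cnn = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])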
As illustrated in Table 4 below, these spatial relationships become evident when plotting the activated regions of the convolutional and max pooling layers in a grid format alongside key features of the original images; these regions correspond to the logical features making up a specific class. Specifically, this output corresponds to the 'horse' class (class 7) via the two-layer CNN model. Looking at the first convolutional and max pooling layers, the outline of a horse can clearly be seen, illustrating that the CNN captures the outline of a horse in much the way a human might recognize it. However, in the grid output for the second convolutional layer, the pattern is less clear, which could indicate either that the second layer is further abstracting the first layer's features (such that the horse outline is no longer visible to the human eye) or that the model has room to improve (via regularization techniques). In the classification report shown in Table 5 below, the horse class achieved a 0.72 F1 score in the two-layer CNN, demonstrating solid accuracy on this class, though the higher F1 scores of some other classes suggest there are still model improvements worth testing.
Table 4. Grid Outputs of Two-Layered CNN Layers
Table 5. Two-Layered CNN Classification Report
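Grids like those in Table 4 can be produced by building an auxiliary model that exposes each intermediate layer's output. The sketch below assumes the cnn model from the earlier sketch and a sample image img already rescaled to [0, 1]; the number of plotted feature maps is arbitrary.

import numpy as np
from tensorflow import keras
import matplotlib.pyplot as plt

# Build a model that returns every intermediate layer's output, then
# plot a few feature maps from the first convolutional layer.
activation_model = keras.Model(inputs=cnn.inputs,
                               outputs=[layer.output for layer in cnn.layers])
activations = activation_model.predict(img[np.newaxis, ...])  # img: (32, 32, 3)

first_conv = activations[0]  # shape (1, 30, 30, 32)
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.imshow(first_conv[0, :, :, i], cmap="viridis")
    plt.axis("off")
plt.show()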
As a next iteration, the previous four models were replicated with regularization techniques included (dropout and L2 regularization, both meant to prevent model overfitting). These models are labeled Models 5a, 5b, 5c, and 5d, respectively. As shown in Table 6, these changes surprisingly led to lower predictive performance for each of the aforementioned DNNs and CNNs. Though the models still performed relatively well and did not lose substantial predictive accuracy, the combination of regularization techniques may have led to underfitting. For one, the 25% dropout rate applied at multiple layers may have been too high. Further, applying L2 regularization at every layer may have similarly over-penalized the models, producing node weights too small to yield as many relevant predictions. Thus, less aggressive regularization was then tested, using only 5% dropout per layer and a single L2-regularized layer (the model's Dense layer). Leaky ReLU was also tested for the CNN (which is meant to be less sensitive to hyperparameter tuning), as was Global Average Pooling (for feature extraction and dimensionality reduction before the final classification layer), as alternative ways to prevent overfitting.
Table 6. Accuracy/Loss for Two- and Three-Layer Neural Networks With/Without Regularization
Decreasing the amount of regularization led to the most optimized models, which Northwestern Consulting would recommend to Company XYZ's management team. The CNN leveraging 5% dropout, one L2-regularized layer, Leaky ReLU, and Global Average Pooling achieved 80.8% validation accuracy (Model 5e), and a similar model with the same features except Global Average Pooling achieved 81.5% validation accuracy (Model 5f). These models also leveraged batch normalization, which standardizes activations between layers so that less information is lost in translation between them (Thakkar 2018). This clearly demonstrates the necessity of iterating on model hyperparameters and structures: regularization initially hurt model performance when applied too heavily, yet the most optimized models emerged once various combinations of regularization techniques were tested.
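A hedged sketch of a Model 5f-style architecture based on the description above; the exact filter counts, dropout placement, and L2 coefficient are assumptions rather than the exact configuration.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Model 5f-style CNN: light 5% dropout, batch normalization,
# Leaky ReLU activations, and a single L2-regularized Dense layer.
model_5f = keras.Sequential([
    layers.Conv2D(128, (3, 3), input_shape=(32, 32, 3)),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.05),
    layers.Conv2D(64, (3, 3)),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.05),
    layers.Flatten(),
    layers.Dense(128, kernel_regularizer=regularizers.l2(0.001)),
    layers.LeakyReLU(),
    layers.Dense(10, activation="softmax"),
])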
In summation, many different neural network techniques can be applied to the CIFAR-10 dataset, so it is important to experiment with various model types and techniques in order to converge on an optimized result. Due to their ability to extract spatial patterns within image data, a CNN is the recommended model type for Company XYZ management's use case. Northwestern Consulting's optimized CNN leveraging dropout, L2 regularization, and Leaky ReLU activation (Model 5f) is the specific model recommended, due to its superior predictive accuracy as well as its resistance to overfitting. Table 7 below compares all of the aforementioned models in terms of validation accuracy, validation loss, and compile time, which led to the final recommendation. Model 5f boasts the highest predictive accuracy and lowest validation loss, as well as a relatively small gap of 3.5% between training and validation accuracy, whereas the optimized non-regularized CNN had a 7% gap. This smaller gap illustrates that Model 5f is not overfitting to the training dataset as severely as the non-regularized CNN. Model 5f's predictive accuracy and minimized overfitting both make a model like this much more scalable to future data and use cases for Company XYZ.
Table 7. Neural Network Performances by Validation Accuracy, Validation Loss, Compile Time
Conclusion
To conclude, Northwestern Consulting recommends Model 5f for Company XYZ's proof-of-concept CNN computer vision use case due to its superior predictive performance and its scalability for future uses. Long term, models like this could likely be further improved with more training data (either by collecting more images or by augmenting data to produce extra synthetic samples). If this is done, Northwestern Consulting would also recommend utilizing GPUs/TPUs with very high RAM so that models like these can be trained and iterated upon more quickly (especially for high-RAM procedures like data augmentation). This would also allow for automated model experimentation such that hyperparameters could be continuously improved over time.
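If data augmentation is revisited with more memory available, one option is to apply it on the fly with Keras preprocessing layers rather than holding an enlarged copy of the dataset in RAM; a minimal sketch follows, where the flip mode and rotation factor are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# On-the-fly augmentation: random flips and small rotations are applied
# per batch during training, so no enlarged dataset is stored in memory.
augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),  # rotate by up to +/-10% of a full turn
])

These layers can be placed at the front of a model so they are active only during training and are skipped at inference time.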
References
Doon, Raveen, et al. “CIFAR-10 Classification Using Deep Convolutional Neural Network.” 2018 IEEE Punecon, 2018, https://doi.org/10.1109/punecon.2018.8745428.
Khan, Saikat Islam, et al. “MultiNet: A Deep Neural Network Approach for Detecting Breast Cancer through Multi-Scale Feature Fusion.” Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 8, 2022, pp. 6217–6228, https://doi.org/10.1016/j.jksuci.2021.08.004.
Thakkar, Vignesh, et al. “Batch Normalization in Convolutional Neural Networks — a Comparative Study with CIFAR-10 Data.” 2018 Fifth International Conference on Emerging Applications of Information Technology (EAIT), 2018, https://doi.org/10.1109/eait.2018.8470438.