Neural Networks for Handwritten Digit Classification
Deep Learning Project: MNIST Dataset Neural Network Experimentation - Northwestern Consulting Research for Company XYZ
Paul Kellar
MSDS 458: Artificial Intelligence and Deep Learning
4/12/23
Abstract
The goal of this deep learning research project is to gain a deeper understanding of the inner workings of neural networks, leveraging the popular computer vision-focused MNIST dataset to identify handwritten digits in a hypothetical project for Company XYZ. Specifically, this means exploring how neural networks are structured and how they work (for instance, how hyperparameters such as hidden layers and hidden nodes function and how changing them impacts how the model learns from the data), examining various single-hidden-layer network structures, and learning how to fit and run a neural network on the MNIST dataset using Python’s TensorFlow and Keras packages.
Since the class lectures thus far have focused on the mathematical frameworks behind how a neural network builds its predictions - using backpropagation to converge upon neuron/node weights that minimize the error between actual and predicted results - various experiments are run to test the impact of different hidden node/layer configurations and feature selections in obtaining an ‘optimal’ model that is most predictive on the MNIST validation data. The following models are built to evaluate predictive performance:
Experiment 1: A dense neural network with 784 input nodes, a hidden layer with 1 hidden node, and 10 output nodes
Experiment 2: A dense neural network with 784 input nodes, a hidden layer with 2 hidden nodes, and 10 output nodes
Experiment 3: Dense neural networks of the same structure as above but instead testing 4, 8, 16, 32, 64, 128, and 256 hidden nodes, respectively
Experiment 4: A dense neural network using Principal Components Analysis (PCA) to reduce the number of input dimensions from 784 to 154
Experiment 5: A dense neural network with model inputs being the 70 most ‘important’ features extracted from a Random Forest classifier
Introduction
In this hypothetical scenario, Company XYZ wants to automate the data entry of handwritten digit data into its databases. Both time and employee resources are hard constraints, and given an upcoming external audit, Company XYZ has realized that it will not have enough time to manually enter 60,000,000+ handwritten records’ worth of data into its systems before the deadline. Given the tight timeframe and importance of this work, the goal is to input as much data as possible into the databases before the deadline. Accuracy is key, but in this situation some incorrect/misclassified data is allowable as long as the maximum number of records can be entered in time. Thus, Northwestern Consulting has been engaged to predictively automate this task, utilizing a single-hidden-layer neural network model for digit classification that can be leveraged for this specific project as well as for future digit-recognition projects at Company XYZ (given that much of its historical quantitative data is still handwritten). Northwestern Consulting’s goal is to maximize the predictive accuracy of the model such that incorrect predictions are minimized (knowing that 100% accuracy is not required), while secondarily building a scalable model for future such use cases at Company XYZ given new data.
In order to jumpstart this process, Northwestern Consulting is leveraging the MNIST digit recognizer dataset (consisting of 60,000 handwritten training digits) as input data for this model, while experimenting with various model parameters - such as the number of hidden nodes and the number of features inputted into the model (leveraging PCA and random forests to consolidate/simplify the model’s input features) - to ship the most predictive model on holdout/validation data. TensorFlow and Keras, two popular neural network Python packages, are leveraged to better understand how various deep learning algorithms respond to hyperparameter tweaking when optimizing the result. Similar research is discussed in the Literature Review, followed by the Methods, Results, and Conclusions for Company XYZ based on this work.
Literature Review
Collected from 250 writers (50% high school students, 50% US Census Bureau employees), the MNIST dataset comprises 70,000 handwritten digits as images - 60,000 in the “training” set and 10,000 in the “test” set (Chen 2019). This large image dataset quickly became popular among predictive modeling competitors and researchers in the Artificial Intelligence field, as it was one of the largest such image datasets at the time of publication. Further, interest in using neural networks and deep learning to predict image recognition-based outcomes was rising at the same time, as researchers obtained massive breakthroughs in predictive accuracy by using neural networks in similar competitions such as ImageNet in 2012 (Chen 2019). Specific to this project, the vast majority of scholarly articles regarding the MNIST dataset leverage versions of Convolutional Neural Networks (CNNs) - “feed-forward artificial neural networks, most commonly applied to analyzing visual imagery” - rather than the basic dense networks expounded upon in this project (Chen 2019). Specifically, Feiyang Chen and team leveraged a version of a CNN that they called “CapsNet” which achieved 99.75% predictive accuracy on the MNIST dataset, the highest such rate at the time of publication. It will be interesting to revisit this dataset when studying CNNs and their behaviors later in this course.
Methods
Experimentation is often necessary in order to produce an optimized neural network that maximizes predictive accuracy (or minimizes error/loss) on test data, as squeezing the most ‘bang for your buck’ out of a predictive model is often a very iterative process. Therefore, in order to ship the best solution for Company XYZ’s purposes, Northwestern Consulting focused on building a simple, single-hidden-layer neural network trained on the MNIST dataset, and then toggling specific parameters/features to select a model that balances predictive accuracy with simplicity/scalability for future use cases at Company XYZ.
This process began with exploring and preprocessing the MNIST digit dataset for neural network modeling purposes. First, visualizing a snapshot of the data yielded useful insights regarding the variation in how digits can be written. As seen in Table 1 below, there are a plethora of different ways in which the digit “1” is written, for example. Given that not all digits are written exactly the same, it will be important to experiment with models that extract only the most relevant features from an image (in this case, the pixels that most define a number’s appearance) and evaluate their performance versus neural network models leveraging all inputs/pixels when learning how to recognize digits effectively.
Table 1 - Snapshot of Image Input Data
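For reference, a minimal sketch of how such a snapshot can be produced with Keras and matplotlib is shown below; the exact plotting code behind Table 1 is not reproduced here, so the layout (a single row of ten training images) is an illustrative assumption.

```python
# A minimal, assumed sketch of the Table 1 snapshot: load MNIST and plot ten digits.
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Plot the first ten training images with their labels.
fig, axes = plt.subplots(1, 10, figsize=(12, 2))
for ax, image, label in zip(axes, x_train[:10], y_train[:10]):
    ax.imshow(image, cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
plt.show()
```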
From here, simple summaries were built to count the occurrence of each digit in the MNIST dataset. In both the training dataset (consisting of 60,000 digit images) and the test dataset (consisting of 10,000 digit images), no digit appeared significantly more often than the others. If the opposite were the case, it might make sense to rebalance the dataset or test different predictive weights for each digit so that no digit was being overpredicted. Then, the data was preprocessed for the model by applying one-hot encoding to the digit labels (which converts each label into a binary vector that is easier for a neural network to interpret), and finally reshaping and re-scaling the pixel data for model input.
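A minimal sketch of this preprocessing pipeline, assuming the standard Keras MNIST loader and utilities (the exact notebook code may differ):

```python
import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Count how often each digit appears in the training labels.
digits, counts = np.unique(y_train, return_counts=True)
print(dict(zip(digits, counts)))

# Flatten each 28x28 image into a 784-value vector and rescale pixels to [0, 1].
x_train = x_train.reshape((60000, 784)).astype("float32") / 255.0
x_test = x_test.reshape((10000, 784)).astype("float32") / 255.0

# One-hot encode the digit labels (e.g. 3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]).
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
```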
In terms of building and evaluating the models, relatively similar frameworks were leveraged across all model iterations so that more focus could be placed on understanding how specific changes to the hidden layer’s number of nodes (and/or the number and type of features selected for input) would in turn affect the model results, with less noise from other differences. Specifically, all models leveraged the ReLU activation function for the hidden layer and softmax for the output layer, and were optimized using RMSprop with categorical cross-entropy as the loss function. Then, when fitting each model on the training data, the number of epochs was governed by validation accuracy: training stopped once two consecutive epochs passed without exceeding the best validation accuracy seen so far, and the model from that best epoch was saved for comparison against the other experiments’ outputs. From there, validation accuracy and loss were compared, as were each model’s confusion matrices of predicted versus actual results across the 10 digits (to understand whether a specific model was overpredicting certain digits or performing worse on certain digits specifically).
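A sketch of this shared model setup in Keras follows; the architecture, activations, optimizer, and loss match the description above, while the batch size and maximum epoch count are illustrative assumptions, and the stopping rule is expressed via Keras’ EarlyStopping callback with a patience of two epochs.

```python
from tensorflow.keras import models, layers
from tensorflow.keras.callbacks import EarlyStopping

def build_model(hidden_nodes, input_dim=784):
    """Single-hidden-layer dense network, as described in the Methods section."""
    model = models.Sequential([
        layers.Dense(hidden_nodes, activation="relu", input_shape=(input_dim,)),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Stop training two epochs after validation accuracy stops improving,
# and keep the weights from the best epoch for comparison across experiments.
early_stop = EarlyStopping(monitor="val_accuracy", patience=2,
                           restore_best_weights=True)

model = build_model(hidden_nodes=16)
history = model.fit(x_train, y_train,
                    epochs=50,            # assumed upper bound; early stopping ends sooner
                    batch_size=128,       # assumed batch size
                    validation_data=(x_test, y_test),
                    callbacks=[early_stop])
```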
Results
In order to better understand the impact of the number of hidden nodes on the model’s performance, Northwestern Consulting first tested a single node in the hidden layer. Given that the input data consists of 784 pixels per image, it makes sense that this model did not perform very well when predicting test digits. Specifically, this model could only accurately predict 48.7% of test records, and although this is much better than random guessing, it would not be reasonable to leverage for Company XYZ’s purposes. And as illustrated in Table 2 below, the model similarly predicts fewer than half of the sampled digits correctly while making mistakes that make little sense logically, such as predicting a “0” to be a “7”.
Table 2 - Experiment 1 Snapshot of Digit Predictions
Furthermore, when looking at the model’s confusion matrix, it only tends to predict some digits well (such as “2” and “6”) while not being able to predict any “0”s or “9”s. Oversimplifying the input data by leveraging only one hidden node in this context seems to lead to certain digits being overpredicted at the expense of others not being predicted at all. It seems that trying to boil down all 784 pixels in an image into a single hidden node (in order to produce optimized neural network weights for predicting each digit) leads to only certain distinct digits being predicted accurately, whereas most others cannot be accurately patterned/mapped out by a model that is likely too simple. This can further be illustrated by the boxplot in Table 3 below, which maps out the distribution of the hidden node’s activation values across digits 0-9. In an ‘ideal’ solution, we might expect the overlap between the ranges of values in the boxes to be minimal, so the large overlaps amongst the boxes for digits “0”, “2”, and “6” illustrate that this is similarly not an ideal neural network output in its current form.
Table 3 - Experiment 1 (1 Hidden Node) Activation Values Boxplot
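One plausible way to generate the activation distributions behind a boxplot like Table 3 is sketched below. It assumes exp1_model is the trained Experiment 1 network (built via build_model(hidden_nodes=1) and fit as in the earlier sketch); the exact plotting code is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import models

# exp1_model is assumed to be the trained Experiment 1 network (1 hidden node).
hidden_layer = models.Model(inputs=exp1_model.input,
                            outputs=exp1_model.layers[0].output)
activations = hidden_layer.predict(x_test).flatten()

# Group the single node's activations by true digit and plot their distributions.
true_digits = np.argmax(y_test, axis=1)          # undo the one-hot encoding
per_digit = [activations[true_digits == d] for d in range(10)]
plt.boxplot(per_digit)
plt.xticks(range(1, 11), [str(d) for d in range(10)])
plt.xlabel("True digit")
plt.ylabel("Hidden node activation")
plt.show()
```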
Experiment 2’s output, leveraging two hidden nodes, performed significantly better from a predictive accuracy perspective (almost twice as predictive, at a validation accuracy of 95.8%). It seems that with two hidden nodes rather than just one, the neural network can learn and differentiate between the major characteristics of digits like “0”, “2”, and “6”, which it had struggled to predict previously. This rings true when analyzing the resulting confusion matrix and classification report as well: every digit except for “8” boasts at least 92% precision. Thus, by just changing the number of hidden nodes slightly, the model vastly improved, but there are still improvement opportunities. Specifically, “8” and “9” were incorrectly predicted as a number of other digits when analyzing the confusion matrix in Table 4 below (when looking at how often these digits were predicted incorrectly along the x-axis). In an ideal scenario for Northwestern Consulting, this model would generally only make a few predictive errors on digits that objectively could look similar, such as “4” and “9”, and given that only 2 hidden nodes are being used here, it is likely that significant predictive accuracy gains remain from further experimenting with this parameter. Further, from Table 5 (which color-codes each digit based on hidden nodes 1 and 2’s outputs in a scatterplot), it seems that digit outputs are generally clustered in the same areas with little overlap, but there are likely improvements that can still be made to further separate these dots into more distinct groups.
Table 4 - Experiment 2 (2 Hidden Nodes) Confusion Matrix
Table 5 - Scatter-Plot of Hidden Node Output, Color-Coded by Digit Class
Based on the aforementioned assumption that increasing the number of hidden nodes will improve the performance of this digit recognizer neural network model for Company XYZ (up to a point), Northwestern Consulting iterated on various numbers of nodes in order to converge on a more optimized model in this regard. As shown in Table 6 below, adding more hidden nodes tended to lead to an improvement in validation accuracy, with the most accurate model having 64 nodes and 98.0% accuracy. However, there did seem to be some degree of diminishing returns in adding nodes versus validation accuracy, as can be seen in the top right section of this visual. Specifically, the model with 16 hidden nodes boasted 97.6% accuracy, which is only 0.4% lower than the model leveraging 64 nodes (a negligible difference for a much simpler model with only 16 hidden nodes). Furthermore, the model with 16 hidden nodes had the lowest loss of all models tested in Experiment 3, at 0.0847. Northwestern Consulting, therefore, is moving forward with the 16 hidden node model because of these metrics as well as its relative simplicity, but it will still be valuable to compare this model to more simplified ones based on PCA and random forest consolidation techniques.
Table 6 - Validation Accuracy by # of Hidden Nodes in Neural Network
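A sketch of the Experiment 3 sweep over hidden-layer widths, reusing the build_model helper and early-stopping setup from the earlier sketch (the batch size and epoch cap remain illustrative assumptions):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Sweep over hidden-layer widths, recording each run's best validation accuracy.
results = {}
for hidden_nodes in [4, 8, 16, 32, 64, 128, 256]:
    model = build_model(hidden_nodes)            # helper from the earlier sketch
    history = model.fit(x_train, y_train,
                        epochs=50, batch_size=128,
                        validation_data=(x_test, y_test),
                        callbacks=[EarlyStopping(monitor="val_accuracy",
                                                 patience=2,
                                                 restore_best_weights=True)],
                        verbose=0)
    results[hidden_nodes] = max(history.history["val_accuracy"])

for nodes, acc in sorted(results.items()):
    print(f"{nodes:>3} hidden nodes: best validation accuracy {acc:.4f}")
```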
Northwestern Consulting then looked to experiment on the optimal number of input features, having already selected a 16-node hidden layer as ‘optimal’. Experiment 4 specifically entailed leveraging PCA to create a more streamlined set of feature inputs (154 components rather than 784 raw pixels, with approximately 95% of the training images’ variance lying along these components). Comparing this PCA-based model’s performance to the ‘best’ one chosen from Experiment 3, in absolute percentages it was about 3% less predictive on holdout data (95% predictive accuracy) and carried a 0.4% higher root mean square error (0.9% root mean square error). Based on these metrics, as well as its confusion matrix shown in Table 7 below, the PCA-based model also seemed to mis-predict records across all label types (i.e. predicting some “9”s incorrectly as “8”s, “4”s, and “3”s), whereas the optimized model from Experiment 3 tended to misidentify a given digit as only one other digit (such as “9”s being incorrectly predicted as “4”s, which makes logical sense and may also be easier to drill down upon to identify and minimize the specific reason why this issue happens repeatedly). Given that all models in Experiment 3 outperformed the PCA model in terms of validation accuracy, it would likely make sense to continue with Experiment 3’s optimized 16-hidden-node model at this point. If this percentage difference were closer to zero, Northwestern Consulting might lean towards the PCA-based model given that it is a slightly simpler abstraction of the inputs and therefore potentially more scalable for future use cases.
Table 7 - PCA Model (16 Hidden Nodes) Confusion Matrix
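A sketch of how the Experiment 4 inputs can be constructed with scikit-learn’s PCA, reusing the build_model helper and preprocessed arrays from the earlier sketches; requesting 95% explained variance is what lands at roughly 154 components for MNIST:

```python
from sklearn.decomposition import PCA
from tensorflow.keras.callbacks import EarlyStopping

# Keep enough principal components to explain 95% of the training variance
# (roughly 154 components for the flattened 784-pixel MNIST images).
pca = PCA(n_components=0.95)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)
print(pca.n_components_)

# Refit the same single-hidden-layer architecture on the reduced inputs.
pca_model = build_model(hidden_nodes=16, input_dim=pca.n_components_)
pca_model.fit(x_train_pca, y_train,
              epochs=50, batch_size=128,
              validation_data=(x_test_pca, y_test),
              callbacks=[EarlyStopping(monitor="val_accuracy", patience=2,
                                       restore_best_weights=True)])
```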
Finally, when comparing Experiments 3 and 4’s results to those of Experiment 5, in which Northwestern Consulting utilized a random forest to select the top 70 features (examples of such features/pixels are illustrated in Table 8 below) as neural network model inputs, predictive accuracy was only 93.1%. Generally, it seemed that this random forest-based model oversimplified the input data to the point that predictions became more difficult. For example, as illustrated in Table 8 below, 988 images were correctly identified as the digit “7”, and yet 979 images that were really “9”s were predicted as “7”s and 966 that were really “7”s were predicted as “0”s.
Table 8 - Random Forest (16 Hidden Nodes) Confusion Matrix
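Similarly, a sketch of the Experiment 5 setup, in which a Random Forest ranks the 784 pixels by importance and the top 70 are fed into the same 16-node network (the forest size and random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.callbacks import EarlyStopping

# Rank the 784 pixels by Random Forest importance and keep the 70 most important.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train, np.argmax(y_train, axis=1))      # the forest expects integer labels
top_pixels = np.argsort(rf.feature_importances_)[::-1][:70]

x_train_rf = x_train[:, top_pixels]
x_test_rf = x_test[:, top_pixels]

# Refit the 16-node network on just those 70 pixel features.
rf_model = build_model(hidden_nodes=16, input_dim=70)
rf_model.fit(x_train_rf, y_train,
             epochs=50, batch_size=128,
             validation_data=(x_test_rf, y_test),
             callbacks=[EarlyStopping(monitor="val_accuracy", patience=2,
                                      restore_best_weights=True)])
```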
Looking at the Experiment 5 model in its current form, it would be difficult to select it over the optimized one from Experiment 3 due to the lack of consistent predictions across digits, even though the overall accuracy was relatively strong at 93.1%. However, if Company XYZ’s goal were to select a neural network model that more evenly balances predictive accuracy with simplicity, it could move forward with the PCA or even the random forest model, because each of them simplifies the data being inputted. Shipping a simpler, accurate model could scale better on future data points that were not seen in the original MNIST dataset. In terms of final recommendations, Northwestern Consulting recommends the following model rankings to Company XYZ in the order laid out in Table 9.
Table 9 - Model Recommendations & Rationale
Conclusion
In summary, Company XYZ needed an automated model to input 60,000,000+ records of handwritten numerical data with a quick turnaround, which led to Northwestern Consulting’s recommendation of using a neural network to predict handwritten digits with 98% accuracy on test data. As demonstrated by Yann LeCun and others in their research leveraging neural networks on the MNIST digits dataset, this approach, though relatively new at the time, achieved degrees of predictive accuracy that had not been encountered before, which is why these types of models were recommended for this use case (Chen 2019). In order to converge upon the optimal neural network solution for Company XYZ, experiments on the number of hidden nodes in the single-hidden-layer neural network, as well as on the number of input features leveraged, led to the recommendation of the most performant model. For future use cases, it would make sense to experiment and iterate upon other aspects of the model’s hyperparameters, such as the optimizer type or loss metric, with the goal of potentially obtaining an even more accurate end result. Another way to improve would be zeroing in on the digits that are commonly misidentified as others (such as “4”s vs “9”s) and identifying further opportunities to improve the model such that these errors could be further minimized.
References
Chen, Feiyang, et al. “Assessing Four Neural Networks on Handwritten Digit Recognition Dataset (MNIST).” ArXiv.org, 20 July 2019, https://arxiv.org/abs/1811.08278.