NLP Neural Network - AG News Predictive Article Categorization Model





AG News Dataset NLP Neural Network Experimentation - Northwestern Consulting Research for Company XYZ



Paul Kellar

MSDS 458: Artificial Intelligence and Deep Learning

5/8/23



Abstract

In this third iteration of neural network work conducted by Northwestern Consulting for Company XYZ, the client is now interested in developing a conversational agent, or chatbot, to assist customer support representatives. This is a strong use case for Natural Language Processing (NLP), which here means predicting responses and categorizations from text data rather than the image data used in the prior projects. NLP, specifically predicting outputs from words, phrases, and sentences as training data, is an extremely fast-growing field, as evidenced by the emergence of widely available NLP-trained models such as ChatGPT. Thus, Company XYZ seeks to explore NLP for customer service purposes, but first wishes to learn about the intricacies of a simple NLP model built with neural networks. The AG News dataset, a collection of more than one hundred thousand news articles categorized as either “World”, “Sports”, “Business”, or “Sci/Tech”, provides Northwestern Consulting with an ample corpus for demonstrating the efficacy of neural networks in identifying sequential patterns among words and phrases.

To do so, Northwestern Consulting explores several such frameworks, including iterations of Recurrent Neural Networks (RNNs) as well as Convolutional Neural Networks (CNNs), and compares their predictive performance on validation data. As discussed in the Literature Review, RNNs excel at extracting sequential patterns from text data, in which the order of phrases and sentences is essential to meaning. To arrive at an optimized neural network architecture for this use case, Northwestern Consulting further tests varying numbers of layers, regularization techniques, and varying numbers of hidden nodes in order to converge on an optimized model for predicting a news article’s category.

 

Introduction

Company XYZ seeks to expand its usage of neural networks beyond computer vision and into predictive NLP - specifically, building NLP-based products that can predict the category of text articles from the text itself in the short term, and building NLP-based chatbots that can respond to customers in real time in the long term. Thus, Northwestern Consulting has provided several proof-of-concept neural network models for the short-term goal of predicting articles’ categories using the AG News dataset, an open-source corpus of news articles. When testing and optimizing such models for Company XYZ’s purposes, evaluation metrics such as validation accuracy, validation loss, root mean squared error (RMSE), and F1 scores are leveraged, specifically in evaluating the following experiments’ outputs:

  • Experiment 1: 1-dimensional CNN as a baseline predictive model

  • Experiment 2: Simple RNN with batch normalization and dropout (256 and 128 hidden nodes)

  • Experiment 3: Simple RNN with batch normalization and dropout (16 and 8 hidden nodes)

  • Experiment 4: Simple RNN with batch normalization and dropout (32 and 16 hidden nodes)

  • Experiment 5: 3-layer RNN with batch normalization and dropout

  • Experiments 6-8: Various bi-directional RNNs with Long Short-Term Memory (LSTM) layers, dropout, and various optimizers

  • Experiments 9-10: Multi-layer RNNs with varying numbers of hidden nodes and dropout proportions

After comparing various model architectures and parameter iterations, the optimized LSTM-based RNN from Experiment 10 achieved 86.5% validation accuracy on unseen AG News data. Further, this RNN achieved more than three times the predictive accuracy of the baseline CNN model.

Literature Review

NLP is a relatively nascent field for a number of reasons. For one, accurately predicting the categorization of a document from the text within it generally requires an extremely large text corpus to obtain relevant results, and it is often difficult to obtain datasets of sufficient size for this purpose. The open storage and flow of information on the internet creates increasing opportunities for NLP use cases, yet extremely large labeled text datasets remain few and far between. Additionally, it can be difficult to parse context from text data, which is necessary to convey the sequential meaning among words in a sentence or paragraph. Many predictive models treat each individual data point in a dataset independently; however, text loses meaning when individual words are considered in isolation (Noguti et al., 2020).

Law is a great example of a field that possesses massive amounts of categorized text documents, and as demonstrated by Mariana Y. Noguti and team, RNNs can excel at sequential data processing and prediction in legal document classification (Noguti et al., 2020). When the research team compared linear models, boosted models, and RNN models in classifying legal documents into 18 distinct categories, they achieved over 90% accuracy when leveraging a combination of Word2Vec word embeddings and an LSTM-based RNN model (Noguti et al., 2020). When pre-processing the dataset, additional context was mapped between words via word embedding algorithms such as Word2Vec, which learns low-dimensional vector representations for each word in the corpus such that words that are semantically similar or co-occur frequently are represented by similar vectors (Noguti et al., 2020).
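As a point of reference, the sketch below shows one common way such embeddings could be trained with the gensim library; the toy corpus and hyperparameters are illustrative assumptions only, not the configuration used by Noguti et al.

# Minimal Word2Vec sketch (assumes gensim is installed; toy corpus and settings are illustrative).
from gensim.models import Word2Vec

# tokenized_articles: a list of token lists, standing in for a preprocessed corpus.
tokenized_articles = [
    ["stocks", "rally", "on", "wall", "street"],
    ["team", "wins", "championship", "game"],
]

w2v = Word2Vec(
    sentences=tokenized_articles,
    vector_size=50,   # dimensionality of the learned word vectors
    window=5,         # context window size
    min_count=1,      # keep every token in this tiny example
    sg=1,             # skip-gram variant
)

# Words that co-occur in similar contexts end up with similar vectors.
print(w2v.wv.most_similar("stocks", topn=3))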

Given more time and computing resources to build an NLP neural network for Company XYZ, adopting a similar word embedding algorithm would likely increase the predictive accuracy of the resulting models. After employing Word2Vec embeddings, the LSTM-based RNN achieved the highest predictive accuracy on the legal corpus. LSTM RNNs can capture long-term dependencies between words in documents by maintaining both short-term and long-term memory states (Noguti et al., 2020), allowing them to extract patterns from short text sequences as well as longer ones so that key information is not lost across a long span of text. Thus, it makes logical sense that an LSTM-based RNN similarly achieved the highest predictive accuracy in categorizing news articles from the AG News dataset.

Methods

As in the prior research conducted for Company XYZ, various iterations of RNNs (both simple RNN structures and LSTM structures) and CNNs were explored to converge upon an optimized solution that best fits the AG News data and maximizes predictive accuracy while balancing simplicity of design (so that the model’s predictions scale to unseen data and future use cases). Specifically, Northwestern Consulting varied architectural choices such as the number of hidden layers, the number of hidden nodes, regularization methods such as batch normalization and dropout, and unidirectional versus bidirectional layers within both simple RNN and LSTM RNN frameworks. To compare these results against a simpler baseline, a 1D CNN model was used as the reference for predictive accuracy. In this experimentation, it was generally found that RNNs leveraging regularization could achieve upwards of 80% predictive accuracy, with the best models reaching approximately 86% validation accuracy after hyperparameter tuning.

Before modeling, however, extensive data preprocessing took place so that relevant sequential relationships could be more accurately extracted from the AG News text. First, all non-alphabetic tokens were removed from the corpus. Next, one of the most important preprocessing steps for text data is removing stopwords, which are extremely common in any text document yet convey very little thematic meaning on their own; the stopwords package was leveraged to eliminate words such as “the”, “for”, and “and” from the corpus. Then, infrequent words (any word occurring fewer than ten times across all AG News articles) were likewise eliminated so that pattern extraction could focus on more relevant keywords. Finally, the remaining words were stemmed, reducing each word to its root in order to shrink the dimensionality of the feature space by collapsing similar words into a common representation. The text corpus was then encoded as integer sequences so that it could be fed into neural networks. After these preprocessing steps, masking was implemented within each model to speed up learning by ignoring the zeros introduced by padding the text vectors.
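The sketch below illustrates one way these preprocessing steps might be implemented with NLTK and Keras; the library choices, toy articles, and padded sequence length are assumptions for illustration, and the minimum-frequency threshold is lowered so the toy corpus is not filtered away entirely.

# Illustrative preprocessing sketch (assumes NLTK and TensorFlow/Keras are available).
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    # Keep alphabetic tokens only, drop stopwords, and stem what remains.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

articles = [
    "Wall St. bears claw back into the black.",        # stand-ins for the AG News corpus
    "Rocket launch delayed by poor weather forecast.",
]
cleaned = [clean(a) for a in articles]

# The report dropped words occurring fewer than 10 times across the corpus;
# the threshold is set to 1 here only so the toy example keeps its tokens.
MIN_COUNT = 1
counts = Counter(tok for doc in cleaned for tok in doc)
cleaned = [[tok for tok in doc if counts[tok] >= MIN_COUNT] for doc in cleaned]

# Integer-encode and pad; the zeros added by padding are later ignored via masking.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
padded = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=50, padding="post")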

Results

To establish a baseline performance for this project, a 1D CNN (composed of an embedding layer and two 1-dimensional convolutional layers with max pooling and dropout) was studied first. Since a random-guessing algorithm would achieve around 25% accuracy (the dataset has an equal distribution across its four categories), this is the lowest bar against which any model architecture should be judged. With this in mind, the baseline CNN model achieved only 25% validation accuracy on the AG News dataset. Though it was somewhat surprising that a neural network would perform no better than random guessing (a more carefully tuned CNN could likely have done somewhat better), it generally makes sense that it struggled to predict outputs from sequential text data. As mentioned in the literature review, prediction models built on text need to associate ordered patterns among words, phrases, sentences, and paragraphs, yet this CNN architecture largely discards the longer-range ordering of its inputs. CNNs excel at recognizing local patterns within a fixed window but do not model how those patterns are ordered across an entire sequence, which is why CNNs tend to perform better as image processing models than as NLP prediction models. RNNs, on the other hand, do take the order of the input data into account, so they are expected to perform much better in this use case. As visualized in Figure 1 below, an RNN’s nodes learn from each data point in sequence, conditioning on the prior step, in order to account for data where order matters.

Figure 1. Representation of RNN nodes’ learning pattern (Mishra 2018)
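For concreteness, the sketch below shows roughly how a baseline 1D CNN of the kind described above could be defined in Keras; the vocabulary size, filter counts, and dropout rates are assumptions rather than the exact configuration tested.

# Illustrative baseline 1D CNN for AG News classification (layer sizes are assumptions).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size after preprocessing

cnn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.3),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),   # four AG News categories
])

cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])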

Next, various simple RNNs were tested to compare performance across different hyperparameter settings. To start, an architecture consisting of two SimpleRNN layers, two batch normalization layers, and two dropout layers was tested, achieving 83.1% validation accuracy and 0.5013 validation loss. The efficacy of RNNs for NLP-based text prediction is further evidenced by the fact that this baseline RNN achieved more than three times the validation accuracy of the baseline CNN model. This also illustrates the opportunity for further experimentation to reach even higher accuracy and lower loss. At a high level, when testing various numbers of hidden nodes, dropout percentages, and the rmsprop versus adam optimizers on a two-layer simple RNN architecture, validation accuracy of up to 84.1% was achieved. The evaluation of these models in terms of validation accuracy and validation loss is broken down in Table 2 below. And though these models were very performant on the AG News dataset, adding a third SimpleRNN layer quickly led to poor generalization: that model achieved only 31% validation accuracy, likely because the three-layer model became too complex and/or suffered from vanishing gradients. This suggests that two-layer RNNs are optimal for this use case when combined with other optimization techniques such as bidirectional and LSTM learning layers.

Table 2. Summary table of varied simple RNN model architectures’ evaluation metrics
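A minimal sketch of this two-layer SimpleRNN architecture is shown below; the hidden-node counts and dropout rate correspond to just one of the configurations summarized in Table 2 and should be read as illustrative.

# Two-layer SimpleRNN with batch normalization and dropout (one illustrative configuration).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size after preprocessing

rnn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),   # masking skips padded zeros
    layers.SimpleRNN(32, return_sequences=True),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.SimpleRNN(16),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),
])

# The experiments compared the "rmsprop" and "adam" optimizers at compile time.
rnn.compile(optimizer="rmsprop",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])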

Leveraging bi-directional learning layers in subsequent RNNs produced further predictive improvements due to their ability to learn sentence meaning through both forward and backward passes over the text. Essentially, two independent RNNs read the training sequence in opposite directions, and the sequential patterns recognized by both are combined into one learning layer (Chollet, 2018). By default, an RNN layer processes a sequence in only one direction rather than bi-directionally unless it is explicitly configured to do so. The ensuing model, leveraging a single bi-directional LSTM learning layer, achieved 86.3% validation accuracy at a low loss of 0.4003. The classification report for this initial model can be seen in Figure 3 below; these learning techniques produced high precision, recall, and F1-scores across all four article categories.

Figure 3. Classification report for 1-layer RNN with a bi-directional LSTM layer and dropout
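A minimal sketch of this one-layer bi-directional LSTM classifier is shown below, assuming Keras; the embedding and LSTM unit counts are illustrative, and the commented lines indicate how a per-class report like Figure 3 could be generated with scikit-learn.

# One-layer bidirectional LSTM classifier with dropout (unit counts are illustrative assumptions).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000

bilstm = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    # Two LSTMs read the sequence forwards and backwards; their final states are concatenated.
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),
])

bilstm.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# A classification report like Figure 3 could then be produced with scikit-learn, e.g.:
# from sklearn.metrics import classification_report
# print(classification_report(y_val, bilstm.predict(x_val).argmax(axis=1)))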

Considering further architectural tweaks from here, the next experiment removed dropout entirely and achieved the same validation accuracy of 86.3% (at a slightly lower loss of 0.3814). Though this difference is relatively negligible, it suggests that the type of learning layer (bidirectional LSTM as compared to simple RNN) drives a larger improvement in accuracy than the presence or absence of a dropout layer. Further, testing a different number of hidden nodes in the bidirectional LSTM layer similarly yielded 86.3% accuracy with a loss of 0.3862. Ultimately, the optimal (and recommended) model for Company XYZ on the AG News dataset was a two-layer RNN employing bi-directional LSTM learning layers, with a validation accuracy of 86.5% and a validation loss of 0.3027. This optimized model’s architecture is shown in Figure 4 below, and Figure 5 shows its validation accuracy per epoch. Validation accuracy tended to increase with each learning epoch, and though training was stopped at only 10 epochs due to runtime and RAM constraints, it is possible that predictive performance could be further improved with additional epochs.

Figure 4. Optimized bi-directional LSTM model architecture 

Figure 5. Bi-directional LSTM RNN accuracy rate on training and validation data per epoch
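As a rough illustration of the recommended architecture, the sketch below stacks two bi-directional LSTM layers in Keras; the unit counts and dropout rates are assumptions, and only the overall structure (including the 10-epoch cap noted above) follows the report.

# Recommended two-layer bidirectional LSTM architecture (unit counts and dropout rates are assumptions).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    # return_sequences=True so the second bidirectional layer receives the full sequence.
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(16)),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training was capped at 10 epochs in the reported experiments, e.g.:
# history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)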

In summary, various types of neural networks were tested in order to converge upon the optimal solution for Company XYZ’s purposes. Across ten model experiments, it was evident that the choice of model type and learning architecture has the greatest impact on NLP predictive performance. The optimized simple RNN achieved 83% validation accuracy, while the optimized two-layer RNN leveraging bi-directional LSTM layers achieved the highest accuracy at 86.5%. Figure 6 below illustrates the optimized model’s per-class predictive performance via its confusion matrix, and Figure 7 breaks down the train, test, and validation accuracy of all tested models alongside each model’s compile time. RNN models trained on text data are complex, leading to extended compile times; regardless, the bidirectional LSTM models achieved both the best accuracy and the lowest compile times of all model types on average. Thus, Model 10 is the model Northwestern Consulting recommends to Company XYZ for its proof-of-concept NLP predictive categorization neural network.

Figure 6. Optimized RNN model confusion matrix

Figure 7. Neural Network Performances by Validation Accuracy, Validation Loss, Compile Time


Conclusion

To conclude, Northwestern Consulting recommends Model 10 for Company XYZ’s proof-of-concept NLP categorization use case due to its superior predictive performance as well as its scalability for future uses. Long term, models like this could likely be further improved through better training data and richer text representations - either by collecting more articles or by leveraging techniques such as TF-IDF weighting or word embedding methods like Word2Vec and Doc2Vec to improve the identification of text patterns within the AG News corpus. Finally, to further expand experimentation on this dataset, Northwestern Consulting also recommends utilizing GPUs/TPUs with ample RAM so that models like these can be trained and iterated upon more quickly.


References

Chollet, F. (2018). Deep Learning with Python (2nd ed.). Shelter Island, NY: Manning Publications.

Mishra, A. (2018, October 2). Recurrent Neural Networks (RNNs). Towards Data Science. https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85

Noguti, M. Y., et al. (2020). Legal document classification: An application to law area prediction of petitions to public prosecution service. 2020 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn48605.2020.9207211







