Natural Language Processing for Topic Modeling - Pickleball Article Thematic Categorization

 Natural Language Processing for Topic Modeling and Predictive Categorization: a Case Study using a Corpus of Pickleball Articles

Paul Kellar ¹ ² †

8 E Long St, Columbus, OH 43215 ¹

Northwestern University School of Professional Studies ²

Master of Science in Data Science Program

633 Clark St, Evanston, IL 60208




† Address to which correspondence should be addressed:

pekellar@msn.com

Abstract

Various Natural Language Processing (NLP) algorithms and techniques are tested on a text corpus of articles about the sport of pickleball in order to categorize these documents into optimized topic mappings. Specifically, Latent Dirichlet Allocation (LDA) is employed to obtain optimal word clusters from the corpus, which can then be mapped in two-dimensional space to summarize the most common and recently popular ‘topics’ and key terms discussed in online news articles about pickleball. These generalized topics are then used to train various supervised learning models, such as Stochastic Gradient Descent and Random Forest models, to predict topic categorizations for unseen articles.

Pickleball provides a great case study for NLP topic modeling analysis, as the sport has seen a 39% increase in participation since 2020, and a similar uptick in media coverage during the same time period (Letourneau 2022). Furthermore, documents within the corpus outline various distinct subjects ranging from the growth of the sport, the health benefits of it, professional pickleball (and the resulting business/investment opportunities involved), responses to the sport’s positive/negative impacts in local communities, and its similarities/coexistences with sports like tennis. This makes pickleball a very relevant subject for mapping distinct categories and their relationships via NLP. Said corpus consists of 50 different articles, ranging from <500 to >2500 words in length, with the average document ranging between 1000 - 1500 words. Various unsupervised NLP algorithms and techniques are utilized to optimize the cluster of topics outputted from the corpus, including but not limited to Latent Dirichlet Allocation (LDA), term-frequency times inverse document frequency (TF-IDF), and Doc2Vec word embedding methodologies. These topics will then be leveraged as inputs for supervised prediction of topic categorization.  

Keywords: Pickleball, Natural Language Processing (NLP), term frequency - inverse document frequency (TF-IDF), Bag of Words (BOW), Doc2Vec, Latent Dirichlet Allocation (LDA), word embedding, sentiment analysis, Stochastic Gradient Descent (SGD), topic modeling, Random Forest, Naive Bayes, unsupervised learning, supervised learning, GloVe pre-trained word embedding

Introduction

Pickleball was invented in 1965 by Joel Pritchard in Seattle, Washington as a new game to entertain his kids while on vacation (Eisenberg 2023). But it was not until 2020, driven by the desire for an outdoor and socially-distanced activity accessible to all ages at the onset of the COVID-19 pandemic, that pickleball quickly became the fastest-growing sport in America (Glusac 2023). This has drawn many new participants to the sport - both professionals playing in newly formed leagues as well as casual fans/investors in the sport’s growth - alongside some detractors who compare it to the “fad” of racquetball’s popularity in the 1980s (Sheyner 2022). Media coverage has paralleled the sport’s rise in interest; pickleball-related Google searches have increased by over 400% since the start of 2020, and the number of pickleball news articles searched weekly similarly has seen an over 1,300% spike in the same period (“Trends” 2023). 

Coupled with humans’ generally limited bandwidth and time available to consume information, the general topic of pickleball represents a strong use case for Natural Language Processing (NLP) to combat information overload in terms of media/articles published regarding the sport. Topic modeling is meant to summarize and visualize the most relevant terms and topics in a text corpus, such as the general topics currently being discussed about pickleball in the news online. NLP is utilized in this analysis both for unsupervised learning of key topics gleaned from the text corpus, as well as for supervised prediction of an article’s topic categorization based on the terms within a pickleball article’s contents. NLP allows for efficient and accurate summarizations of text data into optimized topics of words/phrases, and combined with supervised learning models such as Random Forests and Stochastic Gradient Descent, can allow for topic prediction of an article (based upon the content of the document). 

The objective, therefore, is to investigate various word-embedding, topic modeling, and document prediction techniques on a text corpus focusing on pickleball-specific content. A simple decomposition of the key topics found within the pickleball text corpus can be found below in Table 1, color-coded by potential topic categories at first glance (green highlighting pickleball’s potential benefits from casual and professional perspectives, red highlighting potential themes regarding opposition to the sport, and yellow highlighting local plans about growing and integrating the sport into communities). This top-down categorization also demonstrates how topics interact across categories, as well as the plethora of keywords worthy of analysis within the corpus itself.


Table 1. Pickleball Corpus - Ontological Breakdown

The combination of the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm and Latent Dirichlet Allocation (LDA) is hyper-tuned and optimized for topic mapping, with the resulting topics then being leveraged as supervised learning inputs for topic prediction on unseen pickleball articles. Relevant peer-reviewed research highlights current applications of NLP for topic mapping of news articles as well as technical detail regarding the most prevalent algorithms currently used in practice for such an exercise. Methodology, source data, methods and results, analysis of outcomes, and future work are also discussed at length.

Literature Review

Topic modeling via unsupervised NLP is a rapidly-improving means to summarize extremely large text corpuses into rich, well-defined themes. These algorithms can quickly scan thousands to millions of densely-populated documents, and extract key terms that are both most common and relevant as hierarchical themes from the corpus. Research conducted by Griciute, Han, and Nenadic on topic modeling regarding the Swedish news’s coverage of the COVID-19 pandemic is a great example of this in practice (Griciute, Han, Nenadic 2023). The research group extracted key topics discussed amongst 6,515 Swedish articles from January 17th, 2020 to March 13th, 2021. Key results included converging upon the optimal number of relevant topics gleaned from the corpus, both in terms of common words/phrases found within the topic clusters themselves as well as the trending prevalence of such topics within the corpus over the fourteen month analysis period (Griciute, Han, Nenadic 2023). And while the pickleball-specific topic modeling project will not focus on topic relevance over time (as this initial corpus is only made up of 50 distinct documents from 2022 to 2023), the goal is to similarly extract a refined list of key themes that “are being breath[ed] in” from the corpus (Griciute, Han, Nenadic 2023). Utilizing NLP for such purposes can better inform an end user about the key topics discussed about pickleball in the news by accurately collating a large corpus of articles into distinctly summarized clusters/themes.

Griciute, Han, and Nenadic focused on one of the most commonly used topic modeling methodologies, Latent Dirichlet Allocation (LDA) (Griciute, Han, Nenadic 2023). At a high level, LDA “treats documents as probabilistic distribution sets of words or topics … identified on the basis of the likelihood of co-occurrences of words contained in them.” (Zhao 2017). So when applied to a corpus such as the pickleball corpus, this methodology excels both at placing documents into specific clusters and at ranking the key words/terms found within the extracted topics themselves via the aforementioned probabilistic outputs (Zhao 2017). The Dirichlet distribution that gives the method its name is primarily controlled by the parameter α, which indicates how a corpus’s documents relate to the multiple topics outputted by the LDA algorithm. Specifically, α governs the distribution of the documents amongst the generalized topics extracted from unsupervised learning across a k-dimensional vector, where k is the number of topics (Zvornicanin 2022). A value of α = 1 signifies that documents are distributed evenly through the space, α > 1 means that documents cluster more towards the middle than towards any specific topic, and α < 1 means that documents cluster more closely to the corners of the space, each corner representing a single topic (Zvornicanin 2022). Table 2 below gives an example of the importance of α in mapping the relationship of documents across three example topics (given that documents in NLP corpuses often do not map strongly to a single topic). Another important parameter within the LDA algorithm is θ, which holds the probabilities of a document being assigned to each individual topic; these probabilities sum to 1 (Zvornicanin 2022).

Table 2. Example of α breakdown by distribution type (Zvornicanin 2022)
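The effect of α described above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a symmetric Dirichlet over three topics and uses the standard construction of normalizing independent Gamma draws. The function names, sample size, and seed are illustrative choices.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k topics."""
    # Standard construction: normalize k independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_weight(alpha, k=3, n_docs=2000, seed=0):
    """Average weight of the dominant topic across simulated documents."""
    rng = random.Random(seed)
    maxima = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs))
    return sum(maxima) / n_docs

# Small alpha pushes documents toward a single topic (a corner);
# large alpha pulls them toward an even mixture (the middle).
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: mean dominant-topic weight = {mean_max_weight(alpha):.2f}")
```

A dominant-topic weight near 1 corresponds to documents sitting in a corner of the space, while a weight near 1/3 corresponds to documents sitting near the middle of the three topics.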

LDA also samples the topic node repeatedly within a document, allowing for documents to be associated with multiple topics (since it is very common for news articles within a corpus to speak to more than one distinct topic per document). The aforementioned k-dimensional vector used in LDA modeling further reinforces this concept by “treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters” (Zvornicanin 2022). This hidden random variable is the “latent” aspect of Latent Dirichlet Allocation, and is important because documents within the pickleball corpus will almost always fall under more than one topic due to the interactions of similar words/phrases/terms seen throughout the corpus’s documents. From here, after building an optimized LDA model, the resulting topics can be summarized and visualized in terms of their relationships to one another, as well as the most prevalent words within each topic itself based upon the probability outputs of documents and words falling under various topics (Zvornicanin 2022).

Another common technique for NLP topic modeling, which can either be leveraged in-depth alongside or independently of LDA, is word embedding. Argued to be a vast improvement for topic modeling use cases over the more simplistic bag-of-words and term frequency–inverse document frequency methodologies alone (which treat words/phrases as being independent of their surrounding words in the context of the document), word embedding is meant to provide a denser word representation capturing meaning within the context of the vector itself (Brownlee 2014). Word2Vec (a word embedding algorithm created by Google) and Global Vectors for Word Representation (a similar word embedding algorithm created by Stanford, called GloVe for short) are two of the most popular techniques currently given the accuracy of results. Both are also very easily iterable based on exploratory data analysis given that the embeddings can be visualized after dimensionality reduction (Brownlee 2014). 

These word embedding algorithms can be used to train NLP models from scratch, but they may require very large corpuses for training, as large as “the entire Wikipedia corpus”, for example, in order to obtain truly accurate and relevant results (Brownlee 2014). In turn, a relevant application is utilizing Stanford’s GloVe pre-trained word embedding vectors to add another level of value on top of the pickleball text corpus. Given this corpus’s relative sparseness, which stems from the general nascency of the sport’s growth, it can be particularly useful to layer on pre-trained word embeddings to add context beyond the meanings derived from the corpus itself. In the context of this analysis, GloVe pre-trained word embeddings will be leveraged alongside supervised learning in order to provide enhanced accuracy in predicting the LDA topic associated with each corpus document.

Data

After collecting and initially trimming documents within the pickleball corpus (in order to eliminate any content not directly related to pickleball in scope), basic preprocessing was conducted. This included removing stopwords (both common English stopwords and stopwords manually identified as relevant to remove from the pickleball corpus specifically), eliminating punctuation and numbers, deleting any words of fewer than three letters, and converting all words to lowercase. Next, the top 250 words by frequency were analyzed and manually consolidated into relevant equivalence classes where applicable to portray similarity in meaning amongst relatively synonymous words/terms in context (for example, categorizing terms like “board” and “council” as “government”). In order to avoid word equivalence overfitting in this process, various iterations were evaluated by resulting predictive accuracy and coherence scores of the topics obtained. The words were then stemmed to further consolidate similar terms in the corpus. Table 3 below summarizes the most frequent words in the stemmed, cleaned text corpus in a word cloud visual.

Table 3. Word Cloud of Pickleball Corpus
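The preprocessing steps described in this section can be sketched as follows. This is an illustrative outline under stated assumptions rather than the code used in the analysis: the stopword set and the equivalence-class mapping are hypothetical placeholders (only the “board”/“council” to “government” mapping comes from the text), and the stemming step is omitted to keep the example free of external dependencies.

```python
import re

# Hypothetical stopword list and equivalence classes, for illustration only.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "said"}
EQUIVALENCE = {"board": "government", "council": "government"}

def preprocess(text):
    """Lowercase, strip punctuation and numbers, drop short words and
    stopwords, then map remaining words onto their equivalence class."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = text.split()
    tokens = [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
    return [EQUIVALENCE.get(t, t) for t in tokens]

print(preprocess("The city council met in 2023 to discuss noise!"))
```

In a fuller pipeline a stemmer would be applied to the returned tokens as the final step, as described above.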

Methods 

The cleaned corpus was then converted from text to vectorized, numerical data. Bag of Words (BOW) vectorization was compared to Term Frequency Inverse Document Frequency (TF-IDF) vectorization, with TF-IDF being selected because it outputted an 11% higher coherence score than the BOW vectorization. For context, coherence score is defined as the “average or median of pairwise word similarities formed by top words of a given topic” (Akdogan 2021). In non-technical terms, this is often described as the human interpretability score of a cluster/topic of terms in an NLP algorithm, based upon the similarity of the top words found within it (Kumar 2020). Extreme outliers outside of the general distribution of TF-IDF vectorizations were also manually analyzed and removed on a case-by-case basis. This TF-IDF vectorized corpus was then evaluated for unsupervised learning performance/initial clustering coherence versus a Doc2Vec model, which vectorizes a corpus by determining every target word within the overall theme/meaning of the document itself (Johnson 2023). The TF-IDF-based vector also outperformed the Doc2Vec model by 58% in predictive accuracy when inputting both vectors into a simple Random Forest model. Therefore, the optimized TF-IDF vectorization of the pickleball text corpus was selected for LDA topic modeling (given that LDA is one of the most common NLP topic mapping techniques, and can easily ingest TF-IDF vectors as inputs).
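To make the TF-IDF weighting concrete, the sketch below computes the textbook form, in which a term’s weight is its within-document frequency multiplied by log(N / document frequency). This is an assumption for illustration: library implementations typically apply smoothed or normalized variants, and the exact variant used in this analysis is not stated. The three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical tokenized documents.
docs = [
    ["court", "noise", "court"],
    ["court", "league"],
    ["court", "sponsor", "league"],
]
weights = tfidf(docs)
for vector in weights:
    print(vector)
```

Note that “court”, which appears in every toy document, receives a weight of zero. This is the behavior that distinguishes TF-IDF from a plain Bag of Words count, where the most frequent term would dominate regardless of how informative it is.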

The grid search method was then employed to converge upon an optimal LDA model in terms of the highest coherence score. The ensuing grid search process involved scanning all possible combinations of parameter value inputs given, and ultimately utilizing the optimized model for evaluation (Zvornicanin 2021). Specifically, various values of alpha, beta, and the number of LDA output topics were tested. The top five models in terms of coherence score were then visualized in a two-dimensional graph, with the most readable and well-balanced topic set being selected (this will be further described in the Discussion section). 
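The grid search procedure can be outlined as below. This is a minimal sketch and not the code used in the analysis: `toy_coherence` is a hypothetical stand-in for “fit an LDA model with these parameters and return its coherence score”, and the parameter values in `grid` are illustrative rather than the values actually searched.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every parameter combination in `grid`, sorted best-first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

def toy_coherence(alpha, beta, num_topics):
    """Hypothetical scoring function; a real one would train and score LDA."""
    return (0.4 - 0.1 * abs(alpha - 0.5) - 0.1 * abs(beta - 0.1)
            - 0.01 * abs(num_topics - 5))

grid = {"alpha": [0.1, 0.5, 1.0], "beta": [0.01, 0.1], "num_topics": [3, 5, 7]}
ranked = grid_search(toy_coherence, grid)
top_five = ranked[:5]  # candidates kept for visual inspection
for score, params in top_five:
    print(round(score, 3), params)
```

Keeping the top five candidates rather than only the single best score mirrors the selection step described above, where the final model was chosen by inspecting readability and balance in addition to coherence.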

Finally, in order to convert the LDA topic results into a supervised learning input, each document was assigned its most relevant topic by maximum probability associated (meaning if document 1 had the highest probability of being associated with topic 2, for example, then document 1 was assigned as a topic 2 category document). As discussed in the literature review section, LDA is a popular NLP technique because one-to-many relationships often exist between an individual document and the LDA model’s topics outputted, but assigning documents to topics by maximum probability is likely still a good proxy. Various supervised learning models were then evaluated on these documents’ cleaned text content, including k-Nearest Neighbors clustering, Naive Bayes, Random Forests, and Stochastic Gradient Descent (SGD). These models were hypertuned by log loss in order to converge upon optimal prediction models, and then evaluated via confusion matrices to analyze precision, recall, and F1 scores. Table 4 below summarizes this end-to-end analysis process, as well as the algorithm methodologies leveraged via both unsupervised and supervised learning from there, in order to converge on optimal topic models and predictive models for categorizing unseen documents into topics.  

Table 4. End-to-End Algorithm Pickleball Corpus Topic Modeling Process
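The step that converts LDA output into supervised labels, assigning each document to its maximum-probability topic, can be sketched as follows. The per-document topic probabilities here are hypothetical values chosen for illustration.

```python
def dominant_topic(topic_probs):
    """Return the 1-based index of the highest-probability topic."""
    best_index = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return best_index + 1

# Hypothetical per-document topic distributions (each row sums to 1).
doc_topic = [
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.35, 0.05, 0.30, 0.20, 0.10],
]
labels = [dominant_topic(row) for row in doc_topic]
print(labels)
```

The second row illustrates the information lost by this proxy: the document is labeled with its first topic even though another topic carries nearly as much probability.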

Results

To initially analyze the efficacy of various vector types to leverage in tandem with the LDA algorithm, exploratory clusters were created for both the TF-IDF and Doc2Vec vectors via the k-Nearest Neighbors (KNN) algorithm. When comparing these clustering results (using three clusters as a baseline for selecting the TF-IDF or Doc2Vec-based vector), the TF-IDF pickleball vector significantly outperformed the Doc2Vec vector in terms of clustering for both two-dimensionally graphed grouping and predictive purposes. As seen in Table 5 below, the TF-IDF vector produced more closely associated clusters from a visual perspective than the Doc2Vec clustering output (in terms of the gray, blue, and red topics clustering into relatively similar spaces, whereas the Doc2Vec clustering seemed only loosely associated). Further, when testing these clusters for prediction in a generic Random Forest model, the TF-IDF output was 58% more accurate in outputting correct cluster predictions versus the Doc2Vec model. As mentioned in the Literature Review section, embedding algorithms such as Doc2Vec tend to underperform on small text corpuses, as there is significantly less training data available to associate words and terms to the associated themes of respective documents, so it makes sense that TF-IDF is more performant for vectorization.

Table 5. TF-IDF (Left) vs Doc2Vec (Right) Clusters & Random Forest Predictive Accuracy (Bottom)

When analyzing the key terms from the three-cluster TF-IDF output, the top ten terms for each topic also make inherent/logical sense given the makeup of news articles selected for the text corpus. As displayed below in Table 6, topic 1 seems to align with words alluding to local opposition to the sport because it is loud/takes up court space for tennis players, topic 2 seems to align with potential benefits of the sport and its growth, and topic 3 generally with professional pickleball in terms of leagues/tournaments/sponsors. Thus, this added further confidence that TF-IDF vectorization made sense to use in accordance with LDA for topic modeling.

Table 6. Top 10 Words per TF-IDF KNN Cluster (Topic Names Manually Assigned)

Then, both BOW and TF-IDF vectors were tested in a generic three-topic LDA model to compare baseline versus hypertuned models’ coherence scores. As displayed in Table 7 below, the BOW baseline model produced 33.7% coherence, whereas the TF-IDF baseline model achieved 37.8% coherence. This provided another strong rationale for TF-IDF as the vectorization option for LDA analysis over a simpler BOW vector. Table 7 also summarizes the top three TF-IDF hypertuned LDA model results. TF-IDF #2 was ultimately selected for topic categorization over TF-IDF #1 based on manual analysis of the resulting topic outputs. Specifically, though TF-IDF #1 outperformed TF-IDF #2 in terms of coherence score, its three-topic LDA model seemed to oversimplify the topic results (and disproportionately categorized over 85% of all key terms into a single topic, which did not seem to be very useful). TF-IDF #2 was therefore selected, since it provided simple, well-balanced, readable topic outputs while still providing a 10.5% lift in coherence score performance versus the baseline model.

Table 7. Grid Search LDA Results (Hypertuned) vs Baseline TF-IDF and BOW LDA Models

When visualizing the hypertuned TF-IDF + LDA model in a two-dimensional space, more granular information can be gleaned about the topic compositions, the topic distribution/similarities, and the key terms that make up each topic. Table 8 demonstrates the flattened distribution of key topics, and further supports the efficacy of the optimized TF-IDF + LDA output in that the topics are relatively well distributed graphically and are generally similar in size (where size measures the density of terms found within each topic as a percent of the total text corpus). Optimizing the LDA model by coherence score seemed to engender generally interpretable topics based on their term composition, which will be discussed further in the Discussion section.

Table 8. Hypertuned LDA Model Topics by Intertopic Distance & Top 10 Weighted LDA Terms by Topic

From here, for document-level topic prediction purposes, these optimized topic mappings were assigned to each document in the training corpus, and the TF-IDF vectors for each document were fed into a range of supervised learning models. The resulting models are summarized in Table 9, sorted in descending order by predictive accuracy and F1 scores. Leveraging GloVe pre-trained word embeddings alongside the pickleball TF-IDF text vectors significantly improved supervised model performance (versus the same model outputs not employing GloVe), specifically driving a 95% increase in accuracy for the Stochastic Gradient Descent model, a 35% increase for the k-Nearest Neighbor model, and a 22% increase for the Random Forest model. However, none of these supervised models were particularly performant in predicting a document’s associated LDA topic, as accuracy was very low and sporadic across all models. The Stochastic Gradient Descent + GloVe model led to the ‘best’ outputs in terms of accuracy and topic-specific F1 scores overall. However, this model (like all of the predictive models tested) was not very predictive of all individual topics, generally over-predicting specific topics at the expense of inaccurately predicting/ignoring others.

Table 9. Supervised model evaluation metrics
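One common way to combine pre-trained word embeddings with TF-IDF weights is to represent each document as the TF-IDF-weighted average of its word vectors. The sketch below illustrates that approach only; it is an assumption, since the exact way GloVe was combined with the TF-IDF vectors in this analysis is not specified here. The three-dimensional `toy_embeddings` are hypothetical stand-ins for real GloVe vectors, which would be loaded from Stanford’s published files.

```python
def document_embedding(tfidf_weights, embeddings, dim):
    """TF-IDF-weighted average of the word vectors found in `embeddings`."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, weight in tfidf_weights.items():
        vector = embeddings.get(term)
        if vector is None:  # out-of-vocabulary terms are skipped
            continue
        for i in range(dim):
            total[i] += weight * vector[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known terms: all-zero vector
    return [value / weight_sum for value in total]

# Hypothetical embeddings and TF-IDF weights for a single document.
toy_embeddings = {"league": [1.0, 0.0, 0.0], "sponsor": [0.0, 1.0, 0.0]}
doc_weights = {"league": 0.3, "sponsor": 0.1, "unknownterm": 0.2}
vec = document_embedding(doc_weights, toy_embeddings, dim=3)
print(vec)
```

The resulting dense vector could then be passed to a classifier in place of, or concatenated with, the sparse TF-IDF vector.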

Discussion

Obtaining a large, meaningful, and diverse sample of pickleball news articles was a more difficult process than anticipated at the start of analysis. Though there are thousands of such articles available to search online, it was rather arduous to manually collect a text corpus of articles that were dense enough to glean meaningful topic content (specifically, articles relating directly to pickleball made up of around 500 or more words). A great deal of initial time and effort was spent on collecting a wide array of pickleball themes/topics for corpus selection where possible, such that a smaller corpus of 50 distinct articles would not over-represent specific topics, such as professional pickleball and its leagues/tournaments, at the expense of others. This semi-arbitrary process of data selection could be improved upon in any upcoming iterations of pickleball topic modeling, which is discussed further in the section regarding Future Work. This small sample size for modeling input generally led to sporadic/inconsistent topic outputs depending on the LDA hyperparameters being tuned, which also likely negatively impacted the topic predictiveness from a supervised modeling evaluation perspective.

Regardless, some valuable initial results were still gleaned from an unsupervised learning/clustering perspective, and the topic modeling results were deemed generally successful/significant. Optimizing the LDA algorithms both by coherence score and by the resulting two-dimensional topic distributions and term compositions produced a relatively meaningful thematic layout of the most prevalent topics being discussed in the news online regarding pickleball. Specifically, Topic 1 was labeled “Pickleball Local Benefits” based on key terms like “health”, “exercise”, “benefit”, and “social”; Topic 2 was labeled “Pickleball Local Opposition” based on terms such as “loud”, “disrupt”, “local”, and “decibel”; Topic 3 was labeled “Pickleball Tournaments” because of terms like “tournament”, “cities”, and “sponsor”; Topic 4 was labeled “Professional Pickleball” via terms like “league” and “professional”; and finally Topic 5 was labeled “Pickleball Local/Government Plans” because of terms such as “local”, “counties”, and “plan”. Compared with the initial manual ontology presented in the Introduction section, the topic results illustrated in Table 10 below are relatively well aligned with expectations, further validating the potential efficacy of these topics in practice.

Local Benefits Topic (1) = 0.00074*board + 0.00074*health + 0.00071*heart + 0.00071*cities + 0.00069*benefit + 0.00069*great + 0.00069*exercise + 0.00068*park + 0.00068*moderate + 0.00067*social (25.2% of terms in cleaned text corpus within this topic)

Local Opposition Topic (2) = 0.0007*resort + 0.00066*mayor + 0.00058*cities + 0.00057*noise + 0.00057*eardrum + 0.00055*disrupt + 0.00055*town + 0.00053*decibel + 0.00053*council + 0.00052*local (25.1% of terms in cleaned text corpus within this topic)

Tournament Topic (3) = 0.00146*sponsor + 0.00073*park + 0.0007*cities + 0.00069*tournament + 0.00067*life + 0.00067*florida + 0.00064*league + 0.00064*design + 0.00064*footwear + 0.00063*court (20.2% of terms in cleaned text corpus within this topic)

Professional Topic (4) = 0.0008*league + 0.0007*local + 0.00061*plan + 0.00061*fast + 0.00059*urban + 0.00059*board + 0.00058*park + 0.00058*professional + 0.00058*america + 0.00058*noise (14.8% of terms in cleaned text corpus within this topic)

Local/Government Plans Topic (5) = 0.00087*township + 0.00072*park + 0.00071*local + 0.0007*counties + 0.00068*sound + 0.00068*neighborhood + 0.00066*noise + 0.00066*project + 0.00062*fast + 0.00061*plan (14.7% of terms in cleaned text corpus within this topic)

Table 10. Top Ten Terms per Pickleball LDA Topic, Manually Labeled and Ordered by Topic Weight 

Thus, even though the LDA topic weights of the top ten terms are relatively low (and anomalous terms like “life” and “resort” still exist within some of the topics’ top ten terms), it could be interpreted that these five topics reasonably make up the most relevant themes of pickleball coverage in the news based upon the training corpus. When comparing the topics’ compositions by top terms and their weights, it also becomes clear why the “Professional Pickleball” and “Pickleball Local/Government Plans” topics heavily intersect when visualized - both topics share the common terms “park”, “local”, and “plan” within their top ten terms alone. This demonstrates another potential flaw in these topic outputs. It is likely that much more text data is needed to properly train and separate these two topics, which do not seem to have a lot in common from a logical perspective (the first is focused more on professional pickleball, whereas the second focuses more on local issues/plans). The current iteration of these topic models may also not provide the LDA algorithm with enough data to properly discern context from terms like “plan” and “local” in terms of plans for new, local professional leagues vs plans to build new pickleball courts in local communities. On the other hand, all other topics only share “cities” as a common term, which helps explain why they are more distributed graphically.

Although the pickleball LDA models provided some meaningful and humanly interpretable topics via the text corpus, these did not translate very well into predictive variables on a document level. As elucidated in the Results section, none of the models achieved over 40% accuracy in classifying validation documents by topic. At 37.5% accuracy, the SGD + GloVe model did perform almost twice as well as blind/random topic selection (which would be 20% for the five potential topics on a document-level); however, when analyzing this model’s confusion matrix in Table 11, it is evident that there is still vast room for predictive improvement. As seen below, this model correctly predicted two of the five documents to fall under the first topic (“Local Benefits” topic) as well as three out of four documents correctly being predicted to fall under the fourth topic (“Professional” topic) but was not very predictive of any other topic. Additionally, this model predicted that 13 of the 16 test documents would fall under Topic 1 (“Local Opposition”), even though only one document in the validation set truly belonged to this topic.

Table 11. SGD + GloVe Confusion Matrix
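For reference, the per-topic precision, recall, and F1 scores discussed here follow directly from a confusion matrix. The sketch below uses a hypothetical three-class matrix, not the values from Table 11, and assumes rows hold the true class and columns hold the predicted class.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = count of documents with true class i predicted as j.
    Returns one (precision, recall, f1) tuple per class."""
    k = len(matrix)
    metrics = []
    for c in range(k):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(k))  # column total
        actual = sum(matrix[c])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def accuracy(matrix):
    """Share of all documents that sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 3-class matrix (not the values from Table 11).
toy = [
    [2, 3, 0],
    [0, 1, 0],
    [1, 2, 1],
]
print(accuracy(toy))
for precision, recall, f1 in per_class_metrics(toy):
    print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy matrix the second class is over-predicted: it has perfect recall but low precision because most documents from the other classes are assigned to it, which is the same pattern of behavior described for the models above.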

Since similar over-prediction behavior was seen across all predictive model iterations (all such models had at least two topics with F1 scores of 0%), it may be asserted that these sporadic topic-level predictions are a result of the small training sample size more so than the supervised models themselves. Moreover, selecting a higher percentage of documents for training the supervised models (in terms of train/test split) led to more volatile predictive accuracy on the validation set, since very few documents were left to validate against. Conversely, predictive accuracy tended to decrease dramatically when a higher percentage of documents was held out for validation of the models. Aside from collecting a larger training set of text, another technique for ensuring optimized predictive topic models would be selecting topic models optimized by supervised predictive accuracy, rather than by topic coherence score alone. Further, it could make sense to test various vectorization options for the supervised model training itself (rather than using TF-IDF vectorization simply because it was used in LDA topic modeling). Since the GloVe word embedding vector significantly boosted accuracy, there may be further opportunities to tune the vector itself in this way to compare predictive performance. Regardless, this unsupervised + supervised topic modeling and prediction could generally be deemed a successful initial framework, provided that a larger sample of documents is leveraged for future iterations of analysis.

Conclusion

There are many different topic modeling techniques that can be explored, of which only a few were explored in depth on the pickleball text corpus (BOW, TF-IDF, Doc2Vec, k-Nearest Neighbor clustering, and LDA). The pickleball text corpus used for topic modeling and topic prediction was based upon 50 distinct news articles published between 2022 and 2023. Based on the topics outputted by the LDA algorithm, the most relevant themes discussed in the news online about pickleball concern the sport’s local supporters (in terms of mental/physical health, social opportunities, and accessibility), its local detractors (in terms of noise and coexistence with tennis), its tournaments (amongst celebrities and professionals), its professional leagues/seasons, and its local expansion plans in regards to community government efforts in growing the sport. And though the optimized TF-IDF + LDA model boasted relatively high coherence scores per topic, supervised learning models were not very predictive of an individual news article’s associated topic. A Stochastic Gradient Descent model (complemented with GloVe pre-trained word embeddings to add word context to an otherwise sparse training set) was selected as the most predictive model, but this model only achieved 37.5% validation accuracy on unseen articles.

Accuracy of the respective NLP models leveraged in this analysis was most positively impacted by equivalence classes and word embeddings, both of which added relevant context to the pickleball corpus text. Specifically, adding equivalence classes to the text corpus led to significantly more meaningful term extractions across the iterations compared. Adding more equivalence classes continued to improve the coherence of the topics, but at the expense of oversimplifying the terms collected for various topics. This oversimplification was likely especially evident given the small training corpus. With a larger corpus available in future iterations, the use of equivalence classes could likely be minimized so that the model can focus on learning word relationships from the text itself. Additionally, GloVe pre-trained word embeddings added significant context to the text corpus when predicting individual documents’ topics (leading to a 95% increase in predictive accuracy for the Stochastic Gradient Descent model alone).
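To make the term concrete, an equivalence class here maps several surface forms of a word to one canonical token before vectorization. The following is a minimal sketch; the mapping is hypothetical and is not the set of equivalence classes used in this analysis.

```python
import re

# Hypothetical equivalence classes: canonical token -> surface forms.
EQUIVALENCE_CLASSES = {
    "tournament": ["tournaments", "tourney", "tourneys"],
    "player": ["players", "pickleballer", "pickleballers"],
    "court": ["courts"],
}

# Invert the mapping into a lookup from each surface form to its canonical token.
CANONICAL = {
    variant: canonical
    for canonical, variants in EQUIVALENCE_CLASSES.items()
    for variant in variants
}


def apply_equivalence_classes(text):
    """Lowercase, tokenize on letters, and replace variants with canonical tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [CANONICAL.get(token, token) for token in tokens]


print(apply_equivalence_classes("Players packed the courts for the tourney."))
```

The more variants are folded into one token, the denser the term counts become, which is consistent with both the coherence gains and the loss of term detail noted above.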

Future Work

A number of improvements could be made to the current form of the pickleball topic modeling and predictive topic selection algorithms. Given 500 or 5,000 training documents rather than 50, a TF-IDF vector (and/or a Doc2Vec-based vector) likely would have provided significantly more meaningful topic outputs, both when evaluated by top terms and when viewed two-dimensionally. Although the topics output by LDA did yield relatively interpretable key terms from a human ‘eye test’ perspective, there were many anomalous terms within the topics, such as “pavillion”, “life”, and “matter” within the topic regarding tournaments, which can hurt confidence in the topics overall. Moreover, the LDA topic weights were very low, further illustrating that the algorithm was not very confident in its topics’ key terms (specifically, the term “sponsor” in the tournament topic was the only term with a weight of at least 0.001). Gathering a larger text corpus likely would have mitigated these issues, so for future work it would make sense to gather a much larger corpus, ideally by web scraping articles or collecting a large number of tweets mentioning pickleball. A larger corpus would also likely have allowed for fewer equivalence classes in pre-processing the data, in that the LDA algorithm could have learned these associations from the larger body of text by itself. Finally, incorporating other topic evaluation criteria (such as optimizing the LDA topic models by their predictive accuracy in a baseline supervised learning model) may have led to stronger topic outputs, which in turn may have led to more predictable topics per document from a supervised learning perspective.

In terms of new/expanded opportunities for analysis in future work, it would be interesting to productize the LDA topic models, a more optimized sentence summarizer algorithm, and sentiment analysis in a simple user interface that accepts a previously unseen pickleball text document as input. Imagine a website where a user could simply attach a pickleball file/document link, and then see a relevant summary of the document in terms of its predicted topic (in human terms rather than just the topic number), its similarity scores and top terms as compared to other documents within the training corpus, a three to five sentence summary of the document based on the relevance of key words found within these sentences, and a positive/negative/neutral sentiment rating regarding the document’s ‘opinion’ on pickleball. Provided that the documents linked by users were also relevant pickleball articles (perhaps a decision model on the back-end of this website would only accept documents with a certain similarity score as compared to others found within the training corpus), this could be a great way for the unsupervised and supervised NLP models to continuously learn from new, user-contributed training text. Finally, it would be interesting to apply a similar topic modeling exercise to scraped tweets about pickleball in order to compare and contrast the results with the key topics gleaned from online news articles. This was a very interesting initial exercise that lends itself to a myriad of new applications and growth opportunities over time.
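The back-end acceptance check mentioned above could be sketched as a simple similarity gate. This is an illustration under stated assumptions: it uses raw term counts and cosine similarity from the Python standard library rather than the TF-IDF or Doc2Vec vectors from the analysis, and the function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def accept_document(new_text, corpus_texts, threshold=0.2):
    """Accept a document if it resembles at least one corpus document.

    The threshold is a hypothetical value that would need tuning.
    Returns (accepted, best_similarity).
    """
    new_vector = term_counts(new_text)
    best = max(
        (cosine_similarity(new_vector, term_counts(doc)) for doc in corpus_texts),
        default=0.0,
    )
    return best >= threshold, best


corpus = [
    "pickleball courts are noisy for neighbors",
    "professional pickleball league signs sponsor",
]
print(accept_document("new pickleball courts planned", corpus))
print(accept_document("stock market closes higher", corpus))
```

A gate of this kind only filters by vocabulary overlap; in practice it would likely be combined with the vectorization already used for the topic models so that accepted documents are comparable to the training corpus.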


References

Akdogan, Adem. “Word Embedding Techniques: Word2vec and TF-IDF Explained.” Medium. Towards Data Science, July 22, 2021. https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08

Ayari, Rabeh. “NLP: Word Embedding Techniques Demystified.” Medium. Towards Data Science, March 8, 2020. https://towardsdatascience.com/nlp-embedding-techniques-51b7e6ec9f92#:~:text=Doc2Vec%20is%20another%20widely%20used,every%20document%20in%20the%20corpus

Brownlee, Jason. Deep Learning for Natural Language Processing. Shelter Island, NY: Manning Publications Co., 2014.

Griciute, Bernadeta, Lifeng Han, and Goran Nenadic. “Topic Modelling of Swedish Newspaper Articles about Coronavirus: a Case Study Using Latent Dirichlet Allocation Method.” arXiv, January 17, 2023. https://arxiv.org/pdf/2301.03029

Johnson, Sean. “3.2. Tuning the Hyper-Parameters of an Estimator.” scikit-learn. Accessed March 7, 2023. https://scikit-learn.org/stable/modules/grid_search.html

Kumar, Anjani. “TF-IDF in Natural Language Processing.” Medium. DataDrivenInvestor, July 3, 2020. https://medium.datadriveninvestor.com/tf-idf-in-natural-language-processing-8db8ef4a7736

“Trends.google.com.” Accessed February 5, 2023. https://trends.google.com/trends/investigate. 

Wali, Kartik. “Explained: Stemming vs Lemmatization in NLP.” Analytics India Magazine, May 4, 2022. https://analyticsindiamag.com/explained-stemming-vs-lemmatization-in-nlp/#:~:text=Lemmatization%20has%20higher%20accuracy%20than,the%20context%20is%20not%20important

Zhao, Yanchang. Data Mining Applications with R. Elsevier Academic Press, 2017.

Zvornicanin, Enes. “Topic Modeling and Latent Dirichlet Allocation (LDA).” DataScience+, 2022. https://datascienceplus.com/topic-modeling-and-latent-dirichlet-allocation-lda/

Zvornicanin, Enes. “When Coherence Score Is Good or Bad in Topic Modeling?” Baeldung on Computer Science, November 6, 2021. https://www.baeldung.com/cs/topic-modeling-coherence-score



