Topic Modelling - Latent Dirichlet Allocation
Topic Modelling: is used to extract topics from a collection of documents. The topics are fundamentally clusters of similar words. This helps in understanding the hidden semantic structure of a large collection of texts at an aggregate level.
Latent Dirichlet Allocation: is a probabilistic modelling technique under topic modelling. The topics emerge during the statistical modelling process and are therefore referred to as latent.
LDA tries to map N documents to a fixed number of topics k, such that the words in each document are explainable by the assigned topics. Each topic has a set of specific words with assigned weights, based on which it describes the probability of a document belonging to that topic.
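For concreteness, here is a minimal sketch of this mapping using gensim; the tiny corpus and the choice of k = 2 are purely illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats are popular pets".split(),
    "stock markets fell sharply today".split(),
    "investors worry about market volatility".split(),
]

dictionary = Dictionary(docs)                        # word <-> id mapping (the vocabulary)
corpus = [dictionary.doc2bow(doc) for doc in docs]   # each document as (word_id, count) pairs

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Each document is described by a distribution over the k topics ...
for i, bow in enumerate(corpus):
    print("doc", i, lda.get_document_topics(bow))

# ... and each topic is a distribution over the vocabulary.
for topic_id in range(2):
    print("topic", topic_id, lda.show_topic(topic_id, topn=5))
```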
Assumptions of LDA (a toy illustration follows this list):
- Documents exhibit multiple topics
- A topic is a distribution over a fixed vocabulary
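A toy NumPy illustration of these two assumptions, with made-up numbers: each row of `topics` is one topic's distribution over the fixed vocabulary, and each row of `doc_topic` is one document's mixture over topics.

```python
import numpy as np

vocabulary = ["cat", "dog", "market", "stock"]

# Topic-word distributions: each row sums to 1 over the fixed vocabulary.
topics = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: mostly pet words
    [0.05, 0.05, 0.45, 0.45],   # topic 1: mostly finance words
])

# Document-topic mixtures: each document exhibits both topics to some degree.
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

# Expected word distribution implied for each document by its topic mixture.
print(doc_topic @ topics)
```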
LDA Learning Process (a rough code sketch follows this list):
- Move through the N documents and randomly assign each word to one of the k topics.
- This gives an initial topic distribution over the N documents and a vocabulary distribution for each topic, although, the assignments being random, both are erroneous.
- Therefore, to improve, the process moves through each document D, each word w in D, and each topic t to calculate:
- p(topic t | document D): the proportion of words in document D that are currently assigned to topic t.
- p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from the word w.
- Then reassign w to a new topic: the topic with the highest score for p(topic t | document D) * p(word w | topic t).
- After a large number of iterations, the model reaches a steady state where the assignment of words to topics is fairly stable.
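A rough, unoptimized sketch of this loop in plain Python (not a full collapsed Gibbs sampler, and not how libraries implement LDA in practice):

```python
import random

def lda_learning_sketch(docs, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each word occurrence in each document to one of k topics.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                best_topic, best_score = assignments[d][i], -1.0
                for t in range(k):
                    # p(topic t | document D): proportion of words in D currently assigned to t.
                    p_t_d = sum(a == t for a in assignments[d]) / len(doc)
                    # p(word w | topic t): proportion of all assignments to t that come from w.
                    n_t = sum(a == t for da in assignments for a in da)
                    n_wt = sum(a == t and word == w
                               for dd, da in zip(docs, assignments)
                               for word, a in zip(dd, da))
                    p_w_t = n_wt / n_t if n_t else 0.0
                    if p_t_d * p_w_t > best_score:
                        best_topic, best_score = t, p_t_d * p_w_t
                # Reassign w to the highest-scoring topic.
                assignments[d][i] = best_topic

    return assignments

docs = [d.split() for d in ["cat dog cat pet", "stock market stock fund", "dog pet market"]]
print(lda_learning_sketch(docs, k=2))
```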
LDA Modelling Parameters (see the gensim sketch after this list):
- Number of Topics k: the number of topics the model is asked to assign words to.
- Alpha Hyperparameter: controls the mixture of topics for any given document.
a. Higher alpha value: documents will have a greater mixture of topics.
b. Lower alpha value: documents will have less of a mixture of topics.
- Beta Hyperparameter: controls the distribution of words per topic.
a. Higher beta value: topics will be composed of more words.
b. Lower beta value: topics will be composed of fewer words.
- Number of Iterations: the number of iterations over which the assignments of words to topics become stable.
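A minimal sketch of how these knobs map onto gensim's LdaModel, assuming gensim is the library in use; note that gensim calls the beta hyperparameter `eta`, and splits the iteration budget between `passes` over the corpus and per-document `iterations`:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [doc.split() for doc in [
    "cats and dogs", "stocks and markets", "dogs chase cats", "markets move stocks",
]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # k: the number of topics
    alpha=0.1,         # lower alpha -> each document dominated by fewer topics
    eta=0.01,          # beta (called eta in gensim): lower eta -> topics concentrate on fewer words
    passes=20,         # full sweeps over the corpus
    iterations=100,    # maximum inference iterations per document
    random_state=42,
)
```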
Input Vocabulary:
The input vocabulary fed to the model can be pruned to restrict the modelling of topics to selected words only.
Pruning Approaches (see the sketch after this list):
- Bag of Words (BOW): pruning of the vocabulary based on the frequency of words across documents.
- TF-IDF: pruning of the vocabulary based on the TF-IDF score of words.
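A sketch of both pruning approaches with gensim; the frequency thresholds and the 0.1 TF-IDF cut-off are hypothetical values chosen only for illustration:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [doc.split() for doc in [
    "the cat sat on the mat and the cat slept",
    "the dog sat on the rug and the dog barked",
    "the market fell and investors sold stock",
    "the market rose and investors bought stock",
]]

dictionary = Dictionary(docs)
# BOW-based pruning: drop words appearing in fewer than 2 documents or in more
# than 60% of them, then keep at most the 10,000 most frequent of the rest.
dictionary.filter_extremes(no_below=2, no_above=0.6, keep_n=10_000)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# TF-IDF-based pruning: re-weight the BOW counts and keep only terms whose
# TF-IDF score clears the (hypothetical) threshold.
tfidf = TfidfModel(bow_corpus)
tfidf_corpus = [[(term_id, score) for term_id, score in tfidf[bow] if score > 0.1]
                for bow in bow_corpus]
print(tfidf_corpus)
```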
Quality of Topic Modelling:
Problem: there is no ground truth; the model runs unsupervised, so there is no cross-validation.
Solution: relative metrics are used to compare model performance (a gensim sketch follows this list).
- Perplexity: hold out a subset of documents, then check their per-word likelihood under the resulting model.
- Perplexity = exp(-1 * log-likelihood per word)
- The lower the perplexity, the better the model.
- Coherence Score: is used for assessing the quality of the learned topics.
- For one topic, the words i, j being scored in ∑_{i<j} Score(w_i, w_j) have the highest probability of occurring for that topic.
- The higher the coherence score, the better the topic quality.
- Coherence is also used to decide the required number of topics for modelling.
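A sketch of computing both metrics with gensim; `log_perplexity` returns the per-word log-likelihood bound, which is plugged into the perplexity formula above, and the `c_v` coherence measure is one common choice:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

docs = [doc.split() for doc in [
    "cats and dogs make good pets",
    "dogs chase cats around the house",
    "stock markets fell on bad news",
    "investors sold stock as markets fell",
]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Perplexity, using the formula from these notes: exp(-1 * log-likelihood per word).
# Lower is better; ideally compute it on held-out documents, not the training corpus.
per_word_bound = lda.log_perplexity(corpus)
print("perplexity:", np.exp(-per_word_bound))

# Coherence: scores how strongly the top words of each topic co-occur; higher is better.
# Running this for several values of k is a common way to choose the number of topics.
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
print("coherence:", coherence.get_coherence())
```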