Topic Modelling - Latent Dirichlet Allocation
Topic Modelling: is used to extract topics from a collection of documents. The topics are fundamentally clusters of similar words. This helps in understanding the hidden semantic structure of a large collection of texts at an aggregate level.
Latent Dirichlet Allocation: is a probabilistic modelling technique under topic modelling. The topics emerge during the statistical modelling process and are therefore referred to as latent.
LDA tries to map N documents to a fixed number of topics k, such that the words in each document are explainable by the assigned topics. Each topic has a set of specific words with assigned weights, based on which it describes the probability of a document belonging to that topic.
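For concreteness, here is a minimal sketch of this mapping using gensim; the tiny corpus and the choice of k = 2 are purely illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats are popular pets".split(),
    "stock markets fell sharply today".split(),
    "investors worry about market volatility".split(),
]

dictionary = Dictionary(docs)                        # word <-> id mapping (the vocabulary)
corpus = [dictionary.doc2bow(doc) for doc in docs]   # each document as (word_id, count) pairs

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Each document is described by a distribution over the k topics ...
for i, bow in enumerate(corpus):
    print("doc", i, lda.get_document_topics(bow))

# ... and each topic is a distribution over the vocabulary.
for topic_id in range(2):
    print("topic", topic_id, lda.show_topic(topic_id, topn=5))
```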
Assumptions of LDA (a toy illustration follows this list):
- Documents exhibit multiple topics
- A topic is a distribution over a fixed vocabulary
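A toy NumPy illustration of these two assumptions, with made-up numbers: each row of `topics` is one topic's distribution over the fixed vocabulary, and each row of `doc_topic` is one document's mixture over topics.

```python
import numpy as np

vocabulary = ["cat", "dog", "market", "stock"]

# Topic-word distributions: each row sums to 1 over the fixed vocabulary.
topics = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: mostly pet words
    [0.05, 0.05, 0.45, 0.45],   # topic 1: mostly finance words
])

# Document-topic mixtures: each document exhibits both topics to some degree.
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

# Expected word distribution implied for each document by its topic mixture.
print(doc_topic @ topics)
```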
LDA Learning Process (a rough code sketch follows this list):
- Move through the N documents and randomly assign each word to one of the k topics.
- This gives an initial topic distribution over the N documents and a vocabulary distribution for each topic, although, the assignments being random, both are erroneous.
- Therefore, to improve, the process moves through each document D, each word w in D, and each topic t to calculate:
- p(topic t | document D): the proportion of words in document D that are currently assigned to topic t.
- p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from the word w.
- Then reassign w to a new topic: the topic with the highest score for p(topic t | document D) * p(word w | topic t).
- After a large number of iterations, the model reaches a steady state where the assignment of words to topics is fairly stable.
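A rough, unoptimized sketch of this loop in plain Python (not a full collapsed Gibbs sampler, and not how libraries implement LDA in practice):

```python
import random

def lda_learning_sketch(docs, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each word occurrence in each document to one of k topics.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                best_topic, best_score = assignments[d][i], -1.0
                for t in range(k):
                    # p(topic t | document D): proportion of words in D currently assigned to t.
                    p_t_d = sum(a == t for a in assignments[d]) / len(doc)
                    # p(word w | topic t): proportion of all assignments to t that come from w.
                    n_t = sum(a == t for da in assignments for a in da)
                    n_wt = sum(a == t and word == w
                               for dd, da in zip(docs, assignments)
                               for word, a in zip(dd, da))
                    p_w_t = n_wt / n_t if n_t else 0.0
                    if p_t_d * p_w_t > best_score:
                        best_topic, best_score = t, p_t_d * p_w_t
                # Reassign w to the highest-scoring topic.
                assignments[d][i] = best_topic

    return assignments

docs = [d.split() for d in ["cat dog cat pet", "stock market stock fund", "dog pet market"]]
print(lda_learning_sketch(docs, k=2))
```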
LDA Modelling Parameters (see the gensim sketch after this list):
- Number of Topics k: the number of topics the model is asked to assign words to.
- Alpha Hyperparameter: controls the mixture of topics for any given document.
a. Higher alpha value: documents will have a greater mixture of topics.
b. Lower alpha value: documents will have less of a mixture of topics.
- Beta Hyperparameter: controls the distribution of words per topic.
a. Higher beta value: topics will be composed of more words.
b. Lower beta value: topics will be composed of fewer words.
- Number of Iterations: the number of iterations over which the assignments of words to topics become stable.
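A minimal sketch of how these knobs map onto gensim's LdaModel, assuming gensim is the library in use; note that gensim calls the beta hyperparameter `eta`, and splits the iteration budget between `passes` over the corpus and per-document `iterations`:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [doc.split() for doc in [
    "cats and dogs", "stocks and markets", "dogs chase cats", "markets move stocks",
]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # k: the number of topics
    alpha=0.1,         # lower alpha -> each document dominated by fewer topics
    eta=0.01,          # beta (called eta in gensim): lower eta -> topics concentrate on fewer words
    passes=20,         # full sweeps over the corpus
    iterations=100,    # maximum inference iterations per document
    random_state=42,
)
```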
Input Vocabulary:
The input vocabulary fed to the model can be pruned to restrict the modelling of topics to selected words only.
Pruning Approaches (see the sketch after this list):
- Bag of Words (BOW): pruning of the vocabulary based on the frequency of words across documents.
- TF-IDF: pruning of the vocabulary based on the TF-IDF score of words.
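A sketch of both pruning approaches with gensim; the frequency thresholds and the 0.1 TF-IDF cut-off are hypothetical values chosen only for illustration:

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [doc.split() for doc in [
    "the cat sat on the mat and the cat slept",
    "the dog sat on the rug and the dog barked",
    "the market fell and investors sold stock",
    "the market rose and investors bought stock",
]]

dictionary = Dictionary(docs)
# BOW-based pruning: drop words appearing in fewer than 2 documents or in more
# than 60% of them, then keep at most the 10,000 most frequent of the rest.
dictionary.filter_extremes(no_below=2, no_above=0.6, keep_n=10_000)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# TF-IDF-based pruning: re-weight the BOW counts and keep only terms whose
# TF-IDF score clears the (hypothetical) threshold.
tfidf = TfidfModel(bow_corpus)
tfidf_corpus = [[(term_id, score) for term_id, score in tfidf[bow] if score > 0.1]
                for bow in bow_corpus]
print(tfidf_corpus)
```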
Quality of Topic Modelling:
Problem: there is no ground truth; the model runs unsupervised, so there is no cross-validation.
Solution: relative metrics are used to compare model performance (a gensim sketch follows this list).
- Perplexity: hold out a subset of documents, then check their per-word likelihood under the resulting model.
- Perplexity = exp(-1 * log-likelihood per word)
- The lower the perplexity, the better the model.
- Coherence Score: is used for assessing the quality of the learned topics.
- For one topic, the words i, j being scored in ∑_{i<j} Score(w_i, w_j) have the highest probability of occurring for that topic.
- The higher the coherence score, the better the topic quality.
- Coherence is also used to decide the required number of topics for modelling.
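A sketch of computing both metrics with gensim; `log_perplexity` returns the per-word log-likelihood bound, which is plugged into the perplexity formula above, and the `c_v` coherence measure is one common choice:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

docs = [doc.split() for doc in [
    "cats and dogs make good pets",
    "dogs chase cats around the house",
    "stock markets fell on bad news",
    "investors sold stock as markets fell",
]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Perplexity, using the formula from these notes: exp(-1 * log-likelihood per word).
# Lower is better; ideally compute it on held-out documents, not the training corpus.
per_word_bound = lda.log_perplexity(corpus)
print("perplexity:", np.exp(-per_word_bound))

# Coherence: scores how strongly the top words of each topic co-occur; higher is better.
# Running this for several values of k is a common way to choose the number of topics.
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
print("coherence:", coherence.get_coherence())
```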