
Latent Dirichlet Allocation (LDA) using the Gibbs sampling technique is a framework for analyzing hidden/latent topic structures of large-scale datasets such as a collection of text documents.

Input to the LDA Algorithm:


LDA is used for parameter estimation and inference, as shown below; an example invocation follows the parameter list.

a) Parameter estimation from scratch:
> lda -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dfile <string>

b) Parameter estimation from a previously estimated model:
> lda -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]

c) Inference for new data:
> lda -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>

Parameters ([] indicates optional):

-est – estimate a new model from scratch
-estc – continue estimating a previously saved model
-inf – inference for new data
-alpha – value of alpha (LDA hyperparameter)
-beta – value of beta (LDA hyperparameter)
-ntopics – number of topics
-niters – number of Gibbs sampling iterations
-savestep – number of iterations after which the current model is saved to disk
-twords – number of most likely words to print for each topic
-dfile – input data file
-dir – directory containing the data and model files
-model – name of the previously estimated model
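
For example, an estimation run matching the parameter values shown in the sample outputs below could be started as follows (the -savestep and -twords values and the file name trndocs.dat are only illustrative placeholders):

> lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile trndocs.dat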
Outputs of Latent Dirichlet Allocation

The following files are the outputs of LDA.


1) <model_name>.others -> contains the parameters of the LDA model, for example:
alpha=0.500000
beta=0.100000
ntopics=100
ndocs=1000
nwords=5
liter=1000
2) <model_name>.phi -> word-topic distribution (rows -> topics, cols -> words in the vocabulary)
0.112849 0.001117 0.883799 0.001117 0.001117
0.001143 0.561143 0.046857 0.389714 0.001143
0.164444 0.045926 0.001481 0.075556 0.712593
3) <model_name>.theta -> document-topic distribution, p(topic | document) (rows -> documents, cols -> topics)
0.008621 0.008621 0.008621 0.008621 0.008621 0.008621 …….
4) <model_name>.tassign -> contains the topic assignments, one line per document, as word_i:topic_of_word_i pairs
0:10 1:95 2:5 2:57 3:95 3:69 3:4 4:98
0:28 1:96 2:85 2:7 3:14 3:28 3:13 4:8
5) <model_name>.twords -> contains the most likely words of each topic
Topic 0th:
acquisit 0.883799
abil 0.112849
absenc 0.001117
agreem 0.001117
ail 0.001117
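
Because the .phi and .theta files are plain whitespace-separated numeric matrices (as in the samples above), they are easy to post-process. A minimal Python sketch follows; it is not part of the LDA tool itself, and the file names are placeholders.

import numpy as np

# phi: K x V matrix, each row is one topic's distribution over the vocabulary
phi = np.loadtxt("model-final.phi")
# theta: M x K matrix, each row is one document's distribution over the topics
theta = np.loadtxt("model-final.theta")

# every row of phi and theta is a probability distribution, so it sums to ~1
assert np.allclose(phi.sum(axis=1), 1.0, atol=1e-3)

# indices of the 5 most probable words for topic 0 (compare with the .twords file)
top_words_topic0 = np.argsort(phi[0])[::-1][:5]
# the single most probable topic for each document
best_topic_per_doc = theta.argmax(axis=1)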
Important Parameters and Variables:

M – number of documents
V – vocabulary size
K – number of topics
alpha, beta – LDA hyperparameters
z – matrix containing the topic assignment of each word in each document
nw – nw[i][j]: number of instances of word i assigned to topic j [size V x K]
nd – nd[i][j]: number of words in document i assigned to topic j [size M x K]
nwsum – nwsum[j]: total number of words assigned to topic j [size K]
ndsum – ndsum[i]: total number of words in document i [size M]
theta – document-topic distributions [size M x K]
phi – topic-word distributions [size K x V]
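
Tying these variables together, the following is a minimal Python sketch (not the tool's actual C++ implementation) of one collapsed Gibbs sampling update for a single word, using the count matrices exactly as defined above; it assumes they are numpy arrays with the stated sizes.

import numpy as np

def sample_topic(d, w, old_k, nw, nd, nwsum, ndsum, alpha, beta):
    # one collapsed Gibbs sampling step for word w in document d
    V, K = nw.shape
    # remove the word's current topic assignment from the counts
    nw[w, old_k] -= 1
    nd[d, old_k] -= 1
    nwsum[old_k] -= 1
    # full conditional p(z = k | all other assignments), up to a constant
    p = ((nw[w, :] + beta) / (nwsum + V * beta)) * \
        ((nd[d, :] + alpha) / (ndsum[d] - 1 + K * alpha))
    new_k = np.random.choice(K, p=p / p.sum())
    # add the word back under its newly sampled topic
    nw[w, new_k] += 1
    nd[d, new_k] += 1
    nwsum[new_k] += 1
    return new_k

After the final iteration, phi and theta are read off the counts:
phi[k, w] = (nw[w, k] + beta) / (nwsum[k] + V * beta)
theta[d, k] = (nd[d, k] + alpha) / (ndsum[d] + K * alpha)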
