0% found this document useful (0 votes)

12 views

Probabilistic Topic Models

The document discusses probabilistic topic models for text mining, specifically probabilistic latent semantic analysis (PLSA). It provides an overview of PLSA and how it can be used to discover latent topics in a collection of documents using a mixture modeling approach.

Uploaded by

mausam

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views

Probabilistic Topic Models

Uploaded by

mausam

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 78

Probabilistic Topic Models for

Text Mining
ChengXiang Zhai ( 翟成祥 )
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 1
What Is Text Mining?
“The objective of Text Mining is to exploit information
contained in textual documents in various ways, including
…discovery of patterns and trends in data, associations
among entities, predictive rules, etc.” (Grobelnik et al.,
2001)

“Another way to view text data mining is as a process of

exploratory data analysis that leads to heretofore
unknown information, or to answers for questions for
which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 2
Two Different Views of Text Mining

• Data Mining View: Explore patterns in textual data

– Find latent topics
Shallow mining
– Find topical trends
– Find outliers and other hidden patterns

• Natural Language Processing View: Make

inferences based on partial understanding natural
language text
– Information extraction Deep mining
– Question answering

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 3
Applications of Text Mining

• Direct applications: Go beyond search to find knowledge

– Question-driven (Bioinformatics, Business Intelligence, etc):
We have specific questions; how can we exploit data mining to
answer the questions?
– Data-driven (WWW, literature, email, customer reviews, etc):
We have a lot of data; what can we do with it?

• Indirect applications
– Assist information access (e.g., discover latent topics to better
summarize search results)
– Assist information organization (e.g., discover hidden
structures)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 4
Text Mining Methods
• Data Mining Style: View text as high dimensional data
– Frequent pattern finding
– Association analysis
– Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
– Topic extraction
– Exploit term weighting and text similarity measures
• Natural Language Processing Style: Information Extraction
– Entity extraction
– Relation extraction
– Sentiment analysis
– Question answering
• Machine Learning Style: Unsupervised or semi-supervised learning
– Mixture models
Topic of this lecture
– Dimension reduction

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 5
Outline
• The Basic Topic Models:
– Probabilistic Latent Semantic Analysis (PLSA) [Hofmann
99]

– Latent Dirichlet Allocation (LDA) [Blei et al. 02]

• Extensions
– Contextual Probabilistic Latent Semantic Analysis
(CPLSA) [Mei & Zhai 06]

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 6
Basic Topic Model: PLSA

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 7
PLSA: Motivation
What did people say in their blog articles about “Hurricane Katrina”?

Query = “Hurricane Katrina”

Results:
Government Response New Orleans Oil Price Praying and Blessing Aid and Donation Personal
bush 0.071 city 0.063 price 0.077 god 0.141 donate 0.120 i 0.405
president 0.061 orleans 0.054 oil 0.064 pray 0.047 relief 0.076 my 0.116
federal 0.051 new 0.034 gas 0.045 prayer 0.041 red 0.070 me 0.060
government 0.047 louisiana 0.023 increase 0.020 love 0.030 cross 0.065 am 0.029
fema 0.047 flood 0.022 product 0.020 life 0.025 help 0.050 think 0.015
administrate 0.023 evacuate 0.021 fuel 0.018 bless 0.025 victim 0.036 feel 0.012
response 0.020 storm 0.017 company 0.018 lord 0.017 organize 0.022 know 0.011
brown 0.019 resident 0.016 energy 0.017 jesus 0.016 effort 0.020 something 0.007
blame 0.017 center 0.016 market 0.016 will 0.013 fund 0.019 guess 0.007
governor 0.014 rescue 0.012 gasoline 0.012 faith 0.012 volunteer 0.019 myself 0.006

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 8
Probabilistic Latent Semantic
Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

• Mix k multinomial distributions to generate a document

• Each document has a potentially different set of mixing
weights which captures the topic coverage
• When generating words in a document, each word may be
generated using a DIFFERENT multinomial distribution
(this is in contrast with the document clustering model
where, once a multinomial distribution is chosen, all the
words in a document would be generated using the same
model)
• We may add a background distribution to “attract”
background words
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 9
PLSA as a Mixture Model
k
pd ( w)  B p ( w |  B )  (1  B )  d , j p ( w |  j )
j 1
k
log p (d )   c( w, d ) log[B p ( w |  B )  (1  B )  d , j p ( w |  j )]
wV j 1

Document d
warning 0.3
?
Topic 1 system 0.2.. d,1
? 1 “Generating” word w
in doc d in the collection
aid 0.1
? 2
Topic 2 ?
donation 0.05 d,2 1 - B
support 0.02
? ..
d, k W
… k
statistics 0.2
?
loss 0.1? B
Topic k dead 0.05? .. B
is 0.05
?
Background B ?
the 0.04 Parameters:
a 0.03
? .. B=noise-level (manually set)
’s and ’s are estimated with Maximum Likelihood
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 10
How to Estimate j: EM Algorithm

the 0.2
Known a 0.1
Background we 0.01 Observed Doc(s)
p(w | B) to 0.02
…

Unknown … Suppose,
topic model text =? we know
mining =? ML
p(w|1)=? association =? the identity Estimator
word =? of each
“Text mining” …
word ...
…
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =?
document =?
“information
…
retrieval”

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 11
How the Algorithm Works
c(w,d)(1 - p(zd,w = B))p(zd,w=j)
c(w,d)p(zd,w = B) πd1,1 πd1,2
c(w, d) ( P(θ1|d1) ) ( P(θ2|d1) )
aid 7
d1 price 5
Initial value
oil 6
πd2,1 πd2,2
aid 8 ( P(θ1|d2) ) ( P(θ2|d2) )
d2 price 7
oil 5 Initial value

P(w| θ) Topic 1 Topic 2

Initializing
Iteration
Iteration
Iteration
1:
2:
1:
2:πd,E
3,
M
j Step:
and
4,
Step:
5,P(w|
…
split
re-θj)
aid estimate
word Until
counts
with πconverging
d, j and
random withvalues
different
P(w| θj) by
price Initial value topics (by
adding andcomputing
normalizing z’ the
s)12
oil splitted word counts

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 12
Parameter Estimation
E-Step:
Word w in doc d is generated Application of Bayes rule
- from cluster j
- from background
 d( n, )j p ( n ) ( w |  j )
p( zd ,w  j ) 

k
j '1
 d( n, )j ' p ( n ) ( w |  j ' )
B p ( w |  B )
p( zd ,w  B) 
B p( w |  B )  (1  B ) j 1  d( n, )j p ( n ) ( w |  j )
k

M-Step:  d( n, j 1) 
 wV
c( w, d )(1  p ( z d , w  B)) p ( z d , w  j )

Re-estimate  j' wV

c( w, d )(1  p ( z d , w  B)) p ( z d , w  j ' )
- mixing weights  
m
c( w, d )(1  p ( z  B)) p ( z  j )
i 1 d Ci d ,w d ,w
- cluster LM p ( n 1)
(w |  j )
   c(w' , d )(1  p( z  B)) p( z  j )
m
w 'V i 1 d Ci d , w' d , w'

Sum over all docs Fractional counts contributing to

- using cluster j in generating d
(in multiple collections)
- generating w from cluster j
m = 1 if one collection
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 13
PLSA with Prior Knowledge

• There are different ways of choosing aspects (topics)

– Google = Google News + Google Map + Google scholar, …
– Google = Google US + Google France + Google China, …

• Users have some domain knowledge in mind, e.g.,

– We expect to see “retrieval models” as a topic in IR.
– We want to show the aspects of “history” and “statistics” for
Youtube

• A flexible way to incorporate such knowledge as priors of

PLSA model
• In Bayesian, it’s your “belief” on the topic distributions
14
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 14
Adding Prior
*  arg max p ( ) p ( Data |  )

Most likely 
Document d
warning 0.3
Topic 1 system 0.2.. d,1
1 “Generating” word w
in doc d in the collection
aid 0.1 2
Topic 2 donation 0.05 d,2 1 - B
support 0.02 ..
d, k W
… k
statistics 0.2
loss 0.1 B
Topic k dead 0.05 .. B
is 0.05
Background B the 0.04 Parameters:
a 0.03 .. B=noise-level (manually set)
’s and ’s are estimated with Maximum Likelihood
15
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 15
Adding Prior as Pseudo Counts
Observed Doc(s)
the 0.2
Known a 0.1
Background we 0.01
p(w | B) to 0.02
…
MAP
Unknown … Suppose, Estimator
topic model text =? we know
mining =?
p(w|1)=? association =? the identity
word =? of each
“Text mining” …
word ... Pseudo Doc
…
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =? Size = μ
document =? text
“information
…
retrieval”
16
mining
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 16
Maximum A Posterior (MAP) Estimation
 d( n, )j p ( n ) ( w |  j )
p( zd ,w  j ) 

k
j '1
 (n) (n)
d , j' p (w |  j ' )
B p ( w |  B )
p( zd ,w  B) 
B p( w |  B )  (1  B ) j 1  d( n, )j p ( n ) ( w |  j )
k

 d( n, j 1) 
 wV
c( w, d )(1  p ( z d , w  B )) p ( z d , w  j )
Pseudo counts of w from prior ’
j' wV
c( w, d )(1  p ( z d , w  B )) p ( z d , w  j ' )

 
m
c( w, d )(1  p ( z  B )) p ( z  j ) +p(w|’ )
i 1 d Ci d ,w d ,w
p ( n 1)
(w |  )  j

   c(w' , d )(1  p( z  B)) p( z  j ) +

j m
w 'V i 1 d Ci d , w' d , w'

Sum of all pseudo counts

What if =0? What if =+?

17
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 17
Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer’s course lecture
http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 18
LDA: Motivation
– “Documents have no generative probabilistic semantics”
•i.e., document is just a symbol
– Model has many parameters
•linear in number of documents
•need heuristic methods to prevent overfitting
– Cannot generalize to new documents

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Unigram Model

N
p (w )   p ( wn )
n 1
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Mixture of Unigrams

N
p (w )   p ( z ) p ( wn | z )
z n 1

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Topic Model / Probabilistic LSI

p (d , wn )  p (d ) p ( wn | z ) p ( z | d )
z

•d is a localist representation of (trained) documents

•LDA provides a distributed representation
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
LDA
•Vocabulary of |V| words
•Document is a collection of words from vocabulary.
•N words in document
•w = (w1, ..., wN)

•Latent topics
•random variable z, with values 1, ..., k

•Like topic model, document is generated by sampling a

topic from a mixture and then sampling a word from a
mixture.
•But topic model assumes a fixed mixture of topics (multinomial
distribution) for each document.
•LDA assumes a random mixture of topics (Dirichlet distribution) for each
topic.

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Generative Model

•“Plates” indicate looping structure

•Outer plate replicated for each document
•Inner plate replicated for each word
•Same conditional distributions apply for each replicate

•Document probability

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Fancier Version

(ik1 i )  1
p (  )  k 1  k
 1 k 1

i 1 ( i )

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Inference

p ( , z | w , ,  )  p ( , z , w | ,  )
p ( w | ,  )
N
p ( , z, w  ,  )  p (  ) p ( z n  ) p ( wn z n ,  )
n 1

 N  k
p (w  ,  )   p (  )  p ( z n  ) p ( wn z n ,  ) d 

 n 1 z n 

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Inference
•In general, this formula is intractable:
 N  k
p (w  ,  )   p (  )  p ( z n  ) p ( wn z n ,  ) d 
 n 1 z n 

•Expanded version: 1 if wn is the j'th vocab word

(i  i )  k  i 1  N k V wnj 
p (w |  ,  )      i   ( i  ij ) d
i ( i )  i 1  n1 i 1 j 1 

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Variational Approximation
•Computing log likelihood and introducing Jensen's inequality:
log(E[x]) >= E[log(x)]

•Find variational distribution q such that the above equation is

computable.
– q parameterized by γ and φn
– Maximize bound with respect to γ and φn to obtain best approximation
to p(w | α, β)
– Lead to variational EM algorithm

•Sampling algorithms (e.g., Gibbs sampling) are also common

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Data Sets

C. Elegans Community abstracts

 5,225 abstracts
 28,414 unique terms
TREC AP corpus (subset)
 16,333 newswire articles
 23,075 unique terms

Held-out data – 10%

Removed terms
50 stop words, words appearing once

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
C. Elegans

Note: fold in hack for pLSI to allow it to handle novel documents.

Involves refitting p(z|dnew) parameters -> sort of a cheat
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
AP

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008
Summary: PLSA vs. LDA
• LDA adds a Dirichlet distribution on top of PLSA to
regularize the model
• Estimation of LDA is more complicated than PLSA
• LDA is a generative model, while PLSA isn’t
• PLSA is more likely to over-fit the data than LDA
• Which one to use?
– If you need generalization capacity, LDA
– If you want to mine topics from a collection, PLSA may
be better (we want overfitting!)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 32
Extension of PLSA:
Contextual Probabilistic Latent
Semantic Analysis (CPLSA)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 33
A General Introduction to EM

Data: X (observed) + H(hidden) Parameter: 

“Incomplete” likelihood: L( )= log p(X| )

“Complete” likelihood: Lc( )= log p(X,H| )
EM tries to iteratively maximize the incomplete likelihood:

Starting with an initial guess (0),

1. E-step: compute the expectation of the complete likelihood

Q( ; ( n 1) )  E [ Lc ( ) | X ]   p( H  hi | X , ( n 1) ) log P( X , hi )
( n1)
hi

2. M-step: compute (n) by maximizing the Q-function

 ( n )  arg max Q( ; ( n 1) )  arg max  p( H  hi | X , ( n 1) ) log P( X , hi )
hi

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 34
Convergence Guarantee
Goal: maximizing “Incomplete” likelihood: L( )= log p(X| )
I.e., choosing (n), so that L((n))-L((n-1))0

Note that, since p(X,H| ) =p(H|X, ) P(X| ) , L() =Lc() -log p(H|X, )
L((n))-L((n-1)) = Lc((n))-Lc( (n-1))+log [p(H|X,  (n-1) )/p(H|X, (n))]

Taking expectation w.r.t. p(H|X, (n-1)),

L((n))-L((n-1)) = Q((n);  (n-1))-Q( (n-1);  (n-1)) + D(p(H|X,  (n-1))||p(H|X,  (n)))

Doesn’t contain H

EM chooses (n) to maximize Q KL-divergence, always non-negative

Therefore, L((n))  L((n-1))!

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 35
Another way of looking at EM
Likelihood p(X| )
L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) ) + D(p(H|X,  (n-1) )||p(H|X,  ))

L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) )

next guess
current
guess

Lower bound
(Q function)

E-step = computing the lower bound
M-step = maximizing the lower bound
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 36
Why Contextual PLSA?

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 37
Motivating Example:
Comparing Product Reviews
IBM Laptop APPLE Laptop DELL Laptop
Reviews Reviews Reviews

Common Themes “IBM” specific “APPLE” specific “DELL” specific

Battery Life Long, 4-3 hrs Medium, 3-2 hrs Short, 2-1 hrs

Hard disk Large, 80-100 GB Small, 5-10 GB Medium, 20-50 GB

Speed Slow, 100-200 Mhz Very Fast, 3-4 Ghz Moderate, 1-2 Ghz

Unsupervised discovery of common topics and their variations

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 38
Motivating Example:
Comparing News about Similar Topics
Vietnam War Afghan War Iraq War

Common Themes “Vietnam” specific “Afghan” specific “Iraq” specific

United nations … … …
Death of people … … …
… … … …

Unsupervised discovery of common topics and their variations

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 39
Motivating Example:
Discovering Topical Trends in Literature
Theme Strength

Time
1980 1990 1998 2003
TF-IDF Retrieval Language Model
IR Applications Text Categorization

Unsupervised discovery of topics and their temporal variations

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 40
Motivating Example:
Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to

topics such as “oil price increase during Hurricane
Karina”?
• Unsupervised discovery of topics and their variations
in different locations
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 41
Motivating Example:
Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and

different sentiments of the topics

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 42
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified
approach?
• How can we bring human into the loop?

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 43
Contextual Text Mining
• Given collections of text with contextual information (meta-
data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
– Summarizing search results
– Federation of text information
– Opinion analysis
– Social network analysis
– Business intelligence
– ..

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 44
Context Features of Text (Meta-data)

Weblog Article
communities

Author

source

Time Location
Author’s Occupation

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 45
Context = Partitioning of Text

papers written
1998 in 1998

1999 Papers about Web

…… ……

2005

2006 papers written by

authors in US

WWW SIGIR ACL KDD SIGMOD

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 46
Themes/Topics
Theme 1 government 0.3 [ Criticism of government response to the
hurricane primarily consisted of criticism of its
response 0.2..
response to the approach of the storm and its
donate 0.1
Theme 2 relief 0.05 aftermath, specifically in the delayed
help 0.02 .. response ] to the [ flooding of New Orleans. …
… 80% of the 1.3 million residents of the greater
city 0.2 New Orleans metropolitan area evacuated ] …
new 0.1 [ Over seventy countries pledged monetary
Theme k
orleans 0.05 .. donations or other assistance]. …

Is 0.05
Background B the 0.04 • Uses of themes:
a 0.03 .. – Summarize topics/subtopics
– Navigate in a document space
– Retrieve documents
– Segment documents
– …
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 47
View of Themes:
Context-Specific Version of Views
Context: After 1998 (Language models)

vector
space
TF-IDF
Okapi Theme 2: vector
retrieve LSI
Theme 1: feedback Rocchio
model retrieval
judge Feedback
weighting
relevance feedback
Retrieval documen expansion term
Model t query pseudo
query

language mixture
model model
smoothing estimate
query EM
generation feedback
pseudo

Context: Before 1998 (Traditional models)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 48
Coverage of Themes:
Distribution over Themes
Oil Price
Government Criticism of government response to the
Response hurricane primarily consisted of criticism of its
Aid and response to … The total shut-in oil production
donation
Background from the Gulf of Mexico … approximately 24% of
the annual production and the shut-in gas
production … Over seventy countries pledged
Context: Texas monetary donations or other assistance. …
Oil Price
Government
Response

Aid and
• Theme coverage can depend on
context
donation
Background

Context: Louisiana

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 49
General Tasks of Contextual Text Mining

• Theme Extraction: Extract the global salient themes

– Common information shared over all contexts
• View Comparison: Compare a theme from different views
– Analyze the content variation of themes over contexts
• Coverage Comparison: Compare the theme coverage of different
contexts
– Reveal how closely a theme is associated to a context
• Others:
– Causal analysis
– Correlation analysis

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 50
A General Solution: CPLSA
• CPLAS = Contextual Probabilistic Latent Semantic Analysis
• An extension of PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics

• Process of contextual text mining

– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compute probabilistic topic patterns

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 51
“Generation” Process of CPLSA
View1 View2 View3 Choose a theme
Themes Draw a word from i
Criticism of government
government 0.3 response togovernment
the hurricane
government
response 0.2.. Document
primarily consisted of
response
criticism of its response
context:
to … The total shut-in oil
donate 0.1 Time = from
production July the
2005
Gulf
relief 0.05 Location
of Mexico …donate= Texas
approximately
donation 24% ofAuthor = xxx
help 0.02 .. aid
the annual
help
Occup. = and
production Sociologist
the shut-
in gas
Ageproduction
Group = … 45+Over
city 0.2
seventy countries pledged
New new 0.1 …Orleans
monetary donations or
Orleans orleans 0.05 .. new …
other assistance.

Texas July sociolo

2005 gist
Choose a view

Theme
coverages …… Choose a
: Coverage
Texas July 2005 document

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 52
Probabilistic Model
• To generate a document D with context feature set C:
– Choose a view vi according to the view distribution p (vi | D, C )
– Choose a coverage кj according to the coverage distribution
p ( j | D, C )

– Choose a theme  il according to the coverage кj

– Generate a word using  il
– The likelihood of the document collection is:

n m k
log p(D)    c(w, D) log( p(v | D, C ) p(
( D ,C )D wV i 1
i
j 1
j | D, C ) p(l |  j ) p( w |  il ))
l 1

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 53
Parameter Estimation: EM Algorithm
• Interesting patterns:
 c( w, D) j 1 l 1 p ( z w,i , j ,l  1)
m k

– Theme content p ( t 1)

(v | D, C ) 
wV

  c( w, D) j '1 l '1 p ( z w,i ', j ',l '  1)

i n m k

variation for each i '1 wV

 c( w, D)i 1 l 1 p( z w,i , j ,l  1)
n k

– Theme strength p (t 1) ( j | D, C )  wV

  c( w, D)i '1 l '1 p ( z w,i ', j ',l '  1)

m n k

variation for each j '1 wV

context   c( w, D)i 1 p ( z w,i , j ,l  1)

p (l |  j )
( D ,C )D wV
p ( t 1)
(l |  j )
   c( w, D)i '1 p ( z w,i ', j ,l '  1)
l n
l '1 ( D ,C )D w 'V

• Prior from a user can p ( t 1) ( w |  il )

 ( D ,C )D
c( w, D) j 1 p ( z w,i , j ,l  1)
m

be incorporated using   w 'V ( D ,C )D

c( w' , D) j '1 p ( z w',i , j ',l  1)
m

MAP estimation

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 54
Regularization of the Model
• Why?
– Generality high complexity (inefficient, multiple local maxima)
– Real applications have domain constraints/knowledge

• Two useful simplifications:

– Fixed-Coverage: Only analyze the content variation of themes (e.g.,
author-topic analysis, cross-collection comparative analysis )
– Fixed-View: Only analyze the coverage variation of themes (e.g.,
spatiotemporal theme analysis)

• In general
– Impose priors on model parameters
– Support the whole spectrum from unsupervised to supervised
learning

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 55
Interpretation of Topics
term 0.1599 term 0.1599 term 0.1599
Statistical relevance
weight
0.0752
0.0660
relevance
weight
0.0752
0.0660
relevance
weight
0.0752
0.0660
topic models feedback 0.0372
independence 0.0311
feedback 0.0372
independence 0.0311
feedback 0.0372
independence 0.0311
model 0.0310 model 0.0310 model 0.0310
frequent 0.0233 frequent 0.0233 frequent 0.0233
probabilistic 0.0188 probabilistic 0.0188 probabilistic 0.0188
document 0.0173 document 0.0173 document 0.0173
… … …
Multinomial topic models

Collection (Context) Coverage; Discrimination

Relevance Score Re-ranking

clustering algorithm;
distance measure;
database system, clustering algorithm, …
NLP Chunker r tree, functional dependency, iceberg
Ngram stat. cube, concurrency control, Ranked List
index structure … of Labels
Candidate label pool

• Intuition: prefer phrases covering high probability words

Clustering

Good Label (l1):

dimensional “clustering
p (l |  ) algorithm”
p (l )
Latent algorithm …
Topic 
birch

shape
Bad Label (l2):
… “body shape”
p(w| body
)
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 57
Relevance: the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

Clustering
Clustering Clustering

dimension
dimension dimension
Good Label (l1): Bad Label (l2):
“clustering “hash join”
Topic  …
partition partition algorithm”
algorithm
algorithm algorithm
… … join
Score (l,  )
C: SIGMOD
hash hash Proceedings
hash   p ( w |  ) PMI ( w, l | C )
P(w|) P(w|l1) P(w|l2) w

D(||l1) < D(||l2)

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 58
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 59
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars

Cluster 1 Cluster 2 Cluster 3

Common united 0.042 killed 0.035 …
nations 0.04 month 0.032
Theme … deaths 0.023
…
Iraq n 0.03 troops 0.016 …
Weapons 0.024 hoon 0.015
Theme Inspections 0.023 sanches 0.012
… …
Northern 0.04 taleban 0.026 …
alliance 0.04 rumsfeld 0.02
Afghan kabul 0.03 hotel 0.012
taleban 0.025 front 0.011
Theme aid 0.02 …
…
Collection-specific themes indicate different roles of “United Nations” in the two wars

Top words serve as “labels” for common themes

(e.g., [sound, speakers], [battery, hours], [cd,drive])

These word distributions can be used to segment text and

add hyperlinks between documents

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 61
Spatiotemporal Patterns in Blog Articles
• Query= “Hurricane Katrina”
• Topics in the results:
Government Response New Orleans Oil Price Praying and Blessing Aid and Donation Personal
bush 0.071 city 0.063 price 0.077 god 0.141 donate 0.120 i 0.405
president 0.061 orleans 0.054 oil 0.064 pray 0.047 relief 0.076 my 0.116
federal 0.051 new 0.034 gas 0.045 prayer 0.041 red 0.070 me 0.060
government 0.047 louisiana 0.023 increase 0.020 love 0.030 cross 0.065 am 0.029
fema 0.047 flood 0.022 product 0.020 life 0.025 help 0.050 think 0.015
administrate 0.023 evacuate 0.021 fuel 0.018 bless 0.025 victim 0.036 feel 0.012
response 0.020 storm 0.017 company 0.018 lord 0.017 organize 0.022 know 0.011
brown 0.019 resident 0.016 energy 0.017 jesus 0.016 effort 0.020 something 0.007
blame 0.017 center 0.016 market 0.016 will 0.013 fund 0.019 guess 0.007
governor 0.014 rescue 0.012 gasoline 0.012 faith 0.012 volunteer 0.019 myself 0.006

• Spatiotemporal patterns
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 62
Theme Life Cycles for Hurricane
Katrina
Oil Price
price 0.0772
oil 0.0643
New Orleans gas 0.0454
increase 0.0210
product 0.0203
fuel 0.0188
company 0.0182
…

city 0.0634
orleans 0.0541
new 0.0342
louisiana 0.0235
flood 0.0227
evacuate 0.0211
storm 0.0177
…

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 63
Theme Snapshots for Hurricane
Katrina
Week2: The discussion moves towards the north and west

Week1: The theme is the strongest along the Gulf of Mexico Week3: The theme distributes more uniformly over the states

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week5: The theme fades out in most states

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 64
Theme Life Cycles: KDD
gene 0.0173
0.02 expressions 0.0096
Biology Data
0.018
probability 0.0081
W eb Information
microarray 0.0038
0.016 Time Series
Normalized Strength of Theme

Classification
…
0.014
Association Rule marketing 0.0087
0.012
Clustering customer 0.0086
0.01 Bussiness model 0.0079
0.008 business 0.0048
0.006
…
0.004
rules 0.0142
association 0.0064
0.002
support 0.0053
0 …
1999 2000 2001 2002 2003 2004
Time (year)

Global Themes life cycles of KDD Abstracts

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 65
Theme Evolution Graph: KDD
1999 2000 2001 2002 2003 2004 T

web 0.009
SVM 0.007 classifica – mixture 0.005
criteria 0.007 tion 0.007 random 0.006
classifica – features0.006 cluster 0.006
tion 0.006 topic 0.005 clustering 0.005
linear 0.005 … variables 0.005
… … topic 0.010
mixture 0.008
decision 0.006 LDA 0.006
tree 0.006 … semantic
classifier 0.005 Classifica 0.005
class 0.005 - tion 0.015 Informa …
Bayes 0.005 text 0.013 - tion 0.012
… unlabeled 0.012 web 0.010
document 0.008 social 0.008
labeled 0.008 retrieval 0.007
learning 0.007 distance 0.005
…
… … networks 0.004
…
…

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 66
Blog Sentiment Summary
(query=“Da Vinci Code”)
Neutral Positive Negative
... Ron Howards selection of Tom Hanks stars in the But the movie might get
Tom Hanks to play Robert movie,who can be mad at delayed, and even killed off if
Langdon. that? he loses.

Directed by: Ron Howard Tom Hanks, who is my protesting ... will lose your faith
Facet 1:
Writing credits: Akiva favorite movie star act by ... watching the movie.
Movie
Goldsman ... the leading role.

After watching the movie I Anybody is interested in ... so sick of people making
went online and some it? such a big deal about a
research on ... FICTION book and movie.

I remembered when i first Awesome book. ... so sick of people making

read the book, I finished the such a big deal about a
book in two days. FICTION book and movie.
Facet 2:
Book I’m reading “Da Vinci Code” So still a good book to This controversy book cause
now. past time. lots conflict in west society.
…

Facet: the book “ the da vinci Facet: religious beliefs ( Bursts

code”. ( Bursts during the during the movie, Neg >
movie, Pos > Neg ) Pos )

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 68
Event Impact Analysis: IR Research
xml 0.0678
vector 0.0514 email 0.0197
concept 0.0298 model 0.0191
extend 0.0297 collect 0.0187
Theme: retrieval model 0.0291 judgment 0.0102 SIGIR papers
models space 0.0236 rank 0.0097
boolean 0.0151 subtopic 0.0079
function 0.0123 …
term 0.1599 feedback 0.0077
relevance 0.0752 … Publication of the paper “A language
weight 0.0660 1992 modeling approach to information retrieval”
feedback 0.0372
independence 0.0311
year
model 0.0310 Starting of the TREC conferences
1998
frequent 0.0233 model 0.1687
probabilistic 0.0188 language 0.0753
probabilist 0.0778 estimate 0.0520
document 0.0173
model 0.0432 parameter 0.0281
…
logic 0.0404 distribution 0.0268
ir 0.0338 probable 0.0205
boolean 0.0281 smooth 0.0198
algebra 0.0200 markov 0.0137
estimate 0.0119 likelihood 0.0059
weight 0.0111 …
…
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 69
Temporal-Author-Topic Analysis
close 0.0805 index 0.0440
project 0.0444 pattern 0.0720 graph 0.0343
itemset 0.0433 sequential 0.0462 web 0.0307
intertransaction 0.0397 min_support 0.0353 gspan 0.0273
support 0.0264 threshold 0.0207 substructure 0.0201
associate 0.0258 top-k 0.0176 gindex 0.0164
frequent 0.0181 fp-tree 0.0102 bide 0.0115
closet 0.0176 Author … xml 0.0109
prefixspan 0.0170 Jiawei Han …
…

Author A
Global theme:
frequent patterns time
2000
pattern 0.1107 Author B research 0.0551
frequent 0.0406 Rakesh Agrawal next 0.0308
frequent-pattern 0.039 transaction 0.0308
sequential 0.0360 panel 0.0275
pattern-growth 0.0203 technical 0.0275
constraint 0.0184 article 0.0258
push 0.0138 revolution 0.0154
… innovate 0.0154
…

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 70
Modeling Topical Communities
(Mei et al. 08)

Community 1:
Information Retrieval
Community 2:
Data Mining

Community 3:
Machine Learning
71
2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 71
Other Extensions (LDA Extensions)

• Many extensions of LDA, mostly done by David

Blei, Andrew McCallum and their co-authors
• Some examples:
– Hierarchical topic models [Blei et al. 03]
– Modeling annotated data [Blei & Jordan 03]
– Dynamic topic models [Blei & Lafferty 06]
– Pachinko allocation [Li & McCallum 06])
• Also, some specific context extension of PLSA,
e.g., author-topic model [Steyvers et al. 04]

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 72
Future Research Directions
• Topic models for text mining
– Evaluation of topic models
– Improve the efficiency of estimation and inferences
– Incorporate linguistic knowledge
– Applications in new domains and for new tasks

• Text mining in general

– Combination of NLP-style and DM-style mining algorithms
– Integrated mining of text (unstructured) and unstructured data
(e.g., Text OLAP)
– Interactive mining:
• Incorporate user constraints and support iterative mining
• Design and implement mining languages

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 73
Lecture 5: Key Points
• Topic models coupled with topic labeling are quite
useful for extracting and modeling subtopics in text
• Adding context variables significantly increases a
topic model’s capacity of performing text mining
– Enable interpretation of topics in context
– Accommodate variation analysis and correlation
analysis of topics over context

• User’s preferences and domain knowledge can be

added as prior or soft constraint

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 74
Readings
• PLSA:
– http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

• LDA:
– http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
– Many recent extensions, mostly done by David Blei and Andrew
McCallums

• CPLSA:
– http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf
– http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 75
Discussion
• Topic models for mining multimedia data
– Simultaneous modeling of text and images

• Cross-media analysis
– Text provides context to analyze images and vice
versa

2008 © ChengXiang Zhai China-US-France Summer School, Lotus Hill Inst., 2008 76
Course Summary
Integrated Multimedia Data AnalysisScope of the course
-Mutual reinforcement (e.g., text images)
Information Retrieval
-Simultaneous mining of text + images +video…
1. Evaluation
Multimedia Data Retrieval models/framework
Text Data
2. User modeling Evaluation Feedback
3. Ranking Contextual topic models
4. Computer Vision
Learning with little Natural Language Processing
supervision
Machine Learning

Looking forward to collaborations…

Statistics

Hourglass Workout Program by Luisagiuliet 2
76% (21)
Hourglass Workout Program by Luisagiuliet 2
51 pages
12 Week Program: Summer Body Starts Now
87% (46)
12 Week Program: Summer Body Starts Now
70 pages
Read People Like A Book by Patrick King-Edited
57% (83)
Read People Like A Book by Patrick King-Edited
12 pages
Livingood, Blake - Livingood Daily Your 21-Day Guide To Experience Real Health
77% (13)
Livingood, Blake - Livingood Daily Your 21-Day Guide To Experience Real Health
260 pages
Cheat Code To The Universe
94% (79)
Cheat Code To The Universe
34 pages
Facial Gains Guide (001 081)
91% (45)
Facial Gains Guide (001 081)
81 pages
Curse of Strahd
95% (467)
Curse of Strahd
258 pages
The Psychiatric Interview - Daniel Carlat
91% (34)
The Psychiatric Interview - Daniel Carlat
473 pages
The Borax Conspiracy
91% (57)
The Borax Conspiracy
14 pages
The Secret Language of Attraction
86% (108)
The Secret Language of Attraction
278 pages
How To Develop and Write A Grant Proposal
83% (542)
How To Develop and Write A Grant Proposal
17 pages
Penis Enlargement Secret
60% (124)
Penis Enlargement Secret
12 pages
Workbook For The Body Keeps The Score
89% (53)
Workbook For The Body Keeps The Score
111 pages
Donald Trump & Jeffrey Epstein Rape Lawsuit and Affidavits
83% (1016)
Donald Trump & Jeffrey Epstein Rape Lawsuit and Affidavits
13 pages
KamaSutra Positions
78% (69)
KamaSutra Positions
55 pages
7 Hermetic Principles
93% (30)
7 Hermetic Principles
3 pages
27 Feedback Mechanisms Pogil Key
77% (13)
27 Feedback Mechanisms Pogil Key
6 pages
Frank Hammond - List of Demons
92% (92)
Frank Hammond - List of Demons
3 pages
Phone Codes
79% (28)
Phone Codes
5 pages
36 Questions That Lead To Love
91% (35)
36 Questions That Lead To Love
3 pages
How 2 Setup Trust
97% (307)
How 2 Setup Trust
3 pages
100 Questions To Ask Your Partner
78% (36)
100 Questions To Ask Your Partner
2 pages
The 36 Questions That Lead To Love - The New York Times
91% (35)
The 36 Questions That Lead To Love - The New York Times
3 pages
Satanic Calendar
25% (56)
Satanic Calendar
4 pages
The 36 Questions That Lead To Love - The New York Times
95% (21)
The 36 Questions That Lead To Love - The New York Times
3 pages
14 Easiest & Hardest Muscles To Build (Ranked With Solutions)
100% (8)
14 Easiest & Hardest Muscles To Build (Ranked With Solutions)
27 pages
Jeffrey Epstein39s Little Black Book Unredacted PDF
75% (12)
Jeffrey Epstein39s Little Black Book Unredacted PDF
95 pages
1001 Songs
70% (73)
1001 Songs
1,798 pages
The 4 Hour Workweek, Expanded and Updated by Timothy Ferriss - Excerpt
23% (954)
The 4 Hour Workweek, Expanded and Updated by Timothy Ferriss - Excerpt
38 pages
Zodiac Sign & Their Most Common Addictions
63% (30)
Zodiac Sign & Their Most Common Addictions
9 pages
Probabilistic Topic Models
No ratings yet
Probabilistic Topic Models
78 pages
Topic Models in Natural Language Processing
No ratings yet
Topic Models in Natural Language Processing
64 pages
SNLP Overview
No ratings yet
SNLP Overview
43 pages
Unit 2, Part 2:topic Modeling
No ratings yet
Unit 2, Part 2:topic Modeling
26 pages
WINSEM2018-19 - CSE6019 - ETH - SJT421 - VL2018195001554 - Reference Material I - 3.3 PLSI
No ratings yet
WINSEM2018-19 - CSE6019 - ETH - SJT421 - VL2018195001554 - Reference Material I - 3.3 PLSI
22 pages
01_Introduction to Text Analytics_part2
No ratings yet
01_Introduction to Text Analytics_part2
48 pages
Apex Institute of Technology Natural Language Processing (CST-354)
No ratings yet
Apex Institute of Technology Natural Language Processing (CST-354)
22 pages
Information Retrieval Using Effective Bigram Topic Modeling
No ratings yet
Information Retrieval Using Effective Bigram Topic Modeling
8 pages
Introduction To Text Mining
No ratings yet
Introduction To Text Mining
82 pages
1 Text Mining Review Slides
No ratings yet
1 Text Mining Review Slides
78 pages
Data Mining:: Concepts and Techniques
No ratings yet
Data Mining:: Concepts and Techniques
37 pages
Ai Lecture22
No ratings yet
Ai Lecture22
32 pages
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
No ratings yet
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
29 pages
Text Mining
No ratings yet
Text Mining
85 pages
Natural Language Processing
No ratings yet
Natural Language Processing
49 pages
Session 11-12 - Text Analytics
No ratings yet
Session 11-12 - Text Analytics
38 pages
Applied Natural Language Processing: Barbara Rosario
No ratings yet
Applied Natural Language Processing: Barbara Rosario
39 pages
Statistical Language Processing
No ratings yet
Statistical Language Processing
32 pages
Pert23 - NLP
No ratings yet
Pert23 - NLP
30 pages
module5_DS_ppt
No ratings yet
module5_DS_ppt
38 pages
Introduction To Text Mining
No ratings yet
Introduction To Text Mining
45 pages
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
No ratings yet
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
5 pages
Text Mining - Analytics
No ratings yet
Text Mining - Analytics
35 pages
BDA3
No ratings yet
BDA3
61 pages
An Introduction To Text: Mining
No ratings yet
An Introduction To Text: Mining
39 pages
CSC 528 Lecture 3
No ratings yet
CSC 528 Lecture 3
42 pages
Data Mining:: Concepts and Techniques
No ratings yet
Data Mining:: Concepts and Techniques
37 pages
Topic Model For LDA
No ratings yet
Topic Model For LDA
9 pages
Vector Semantics
No ratings yet
Vector Semantics
83 pages
Word and Document Embeddings
No ratings yet
Word and Document Embeddings
94 pages
Project Example
No ratings yet
Project Example
19 pages
Predictive Methods For Text Mining
No ratings yet
Predictive Methods For Text Mining
75 pages
2023F DM (E) 13
No ratings yet
2023F DM (E) 13
66 pages
CS109/Stat121/AC209/E-109 Data Science: Bayesian Methods Continued, Text Data
No ratings yet
CS109/Stat121/AC209/E-109 Data Science: Bayesian Methods Continued, Text Data
35 pages
Exploring Trends in A Topic-Based Search Engine: Wray Buntine, Jukka Perki O, Sami Perttu
No ratings yet
Exploring Trends in A Topic-Based Search Engine: Wray Buntine, Jukka Perki O, Sami Perttu
7 pages
Combining Lexical and Semantic Features For Short Text Classification
No ratings yet
Combining Lexical and Semantic Features For Short Text Classification
9 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
60 pages
Topic Modelling Using NLP
No ratings yet
Topic Modelling Using NLP
18 pages
Lecture 6 - From Unstructured Texts to Structure Data I
No ratings yet
Lecture 6 - From Unstructured Texts to Structure Data I
17 pages
Text Classification MLND Project Report Prasann Pandya
No ratings yet
Text Classification MLND Project Report Prasann Pandya
17 pages
AI Lec 04+05 - Naive Bayes
No ratings yet
AI Lec 04+05 - Naive Bayes
55 pages
AI Unit V
No ratings yet
AI Unit V
64 pages
IR Chapter 2
No ratings yet
IR Chapter 2
37 pages
Keyword 2
No ratings yet
Keyword 2
5 pages
Lecture 02
No ratings yet
Lecture 02
31 pages
Dealing With Textual Data
No ratings yet
Dealing With Textual Data
67 pages
Data Science Interview Preparation Questions (#Day06)
No ratings yet
Data Science Interview Preparation Questions (#Day06)
10 pages
Information Retrieval - Lsi, Plsi and Lda: Jian-Yun Nie
No ratings yet
Information Retrieval - Lsi, Plsi and Lda: Jian-Yun Nie
34 pages
Module_5-Natural_language_processing[1]
No ratings yet
Module_5-Natural_language_processing[1]
13 pages
Intro to statistical nlp
No ratings yet
Intro to statistical nlp
57 pages
Research Paper Reading-2024
No ratings yet
Research Paper Reading-2024
52 pages
Bag of Words
No ratings yet
Bag of Words
32 pages
Efficient Preprocessing and Patterns Identification Approach For Text Mining
No ratings yet
Efficient Preprocessing and Patterns Identification Approach For Text Mining
6 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
64 pages
NLP Text Preprocessing
No ratings yet
NLP Text Preprocessing
19 pages
Sentiment Analysis
100% (1)
Sentiment Analysis
19 pages
Session 2
No ratings yet
Session 2
58 pages
Lecture03 Naive Bayes
No ratings yet
Lecture03 Naive Bayes
33 pages
Comments on Paul Cobley's Essay (2018) "Human Understanding: A Key Triad"
From Everand
Comments on Paul Cobley's Essay (2018) "Human Understanding: A Key Triad"
Razie Mah
No ratings yet
Fourier Analysis on Groups
From Everand
Fourier Analysis on Groups
Walter Rudin
No ratings yet
Topoc Modeling PDF
No ratings yet
Topoc Modeling PDF
120 pages
18 Rnns
No ratings yet
18 Rnns
57 pages
1507965390
No ratings yet
1507965390
17 pages
GUNA1
No ratings yet
GUNA1
46 pages
1624338842
No ratings yet
1624338842
53 pages
Final 50
No ratings yet
Final 50
20 pages
Earthquake Final Report
No ratings yet
Earthquake Final Report
42 pages
Sentiment Analysis
No ratings yet
Sentiment Analysis
15 pages
Deep Learning
No ratings yet
Deep Learning
189 pages
Eurolan 07 Wiebe
No ratings yet
Eurolan 07 Wiebe
220 pages
255986458
No ratings yet
255986458
44 pages
12 Subrata DL
No ratings yet
12 Subrata DL
25 pages
A Joint Representation of Rényi's and Tsalli's Entropy With Application in Coding Theory - 2017 - International Journal of Mathematics A
No ratings yet
A Joint Representation of Rényi's and Tsalli's Entropy With Application in Coding Theory - 2017 - International Journal of Mathematics A
6 pages
1612644858
No ratings yet
1612644858
33 pages
PCFG
No ratings yet
PCFG
79 pages
Ling571 Class12 Sem SRL
No ratings yet
Ling571 Class12 Sem SRL
136 pages
Ling571 Class14 Distr Thes
No ratings yet
Ling571 Class14 Distr Thes
122 pages
1 s2.0 S016516841630295X Main
No ratings yet
1 s2.0 S016516841630295X Main
12 pages
Ling571 Class13 Lex Sem
No ratings yet
Ling571 Class13 Lex Sem
82 pages
Creative Accounting
No ratings yet
Creative Accounting
14 pages
Fire Pump Controllers
No ratings yet
Fire Pump Controllers
4 pages
Abstract Iqbal
No ratings yet
Abstract Iqbal
2 pages
Lab Report 4
100% (1)
Lab Report 4
16 pages
Activity 2022 Nov. 11 To 2022 Nov. 15
No ratings yet
Activity 2022 Nov. 11 To 2022 Nov. 15
2 pages
GK002 Pi SP 003 (1C3)
No ratings yet
GK002 Pi SP 003 (1C3)
8 pages
Bekele Article Review
No ratings yet
Bekele Article Review
34 pages
City Beautiful by DR Kesava Reddypdf
No ratings yet
City Beautiful by DR Kesava Reddypdf
116 pages
Literacy-Practice-Test - ACER
No ratings yet
Literacy-Practice-Test - ACER
12 pages
Unesco - Eolss Sample Chapters: Stress Classification in Pressure Vessels and Piping
No ratings yet
Unesco - Eolss Sample Chapters: Stress Classification in Pressure Vessels and Piping
7 pages
Algorithms Design and Analysis Mcqs With Answers Set 1
60% (5)
Algorithms Design and Analysis Mcqs With Answers Set 1
10 pages