Part II

Entity Retrieval
Krisztian Balog 

University of Stavanger

Half-day tutorial at the WSDM’14 conference | New York City, USA, 2014
Entity retrieval

Addressing information needs that are better
answered by returning specific objects
(entities) instead of just any type of documents.
Distribution of web search queries [Pound et al. 2010]

- Entity ("1978 cj5 jeep"): 41%
- Type ("doctors in barcelona"): 12%
- Attribute ("zip code waterville Maine"): 5%
- Relation ("tom cruise katie holmes"): 1%
- Other ("nightlife in Barcelona"): 36%
- Uninterpretable: 6%
Distribution of web search queries [Lin et al. 2011]

- Query categories: Entity, Entity+refiner, Category, Category+refiner, Other, Website
- Shares (pie chart): 28%, 29%, 15%, 14%, 10%, 4%
What’s so special here?
- Entities are not always directly represented
  - Recognize and disambiguate entities in text (that is, entity linking)
  - Collect and aggregate information about a given entity from multiple documents and even multiple data collections
- More structure than in document-based IR
  - Types (from some taxonomy)
  - Attributes (from some ontology)
  - Relationships to other entities (“typed links”)
Semantics in our context
- Working definition: references to meaningful structures
- How to capture, represent, and use structure?
- It concerns all components of the retrieval process!

(diagram: an information need is matched against an entity; in a text-only representation both the query and the entity are plain text, while in a text+structure representation both sides carry structure in addition to text)
Overview of core tasks

Task                       Queries                             Data set                      Results
(adhoc) entity retrieval   keyword                             unstructured/semistructured   ranked list
adhoc object retrieval     keyword                             structured                    ranked list
list completion            keyword++ (examples)                (semi)structured              ranked list
related entity finding     keyword++ (target type, relation)   unstructured & structured     ranked list
In this part
- Input: keyword(++) query
- Output: a ranked list of entities

- Data collection: unstructured and
(semi)structured data sources (and their
combinations)

- Main RQ: How to incorporate structure into
text-based retrieval models?
Outline

1. Ranking based on entity descriptions (attributes/descriptions)
2. Incorporating entity types
3. Entity relationships

Ranking entity descriptions
Task: ad-hoc entity retrieval
- Input: unconstrained natural language query

- “telegraphic” queries (neither well-formed nor
grammatically correct sentences or questions)

- Output: ranked list of entities

- Collection: unstructured and/or semistructured documents
Example information needs
american embassy nairobi
ben franklin
Chernobyl
meg ryan war
Worst actor century
Sweden Iceland currency
Two settings

1. With ready-made entity descriptions
2. Without explicit entity representations

(diagram: in the first setting each entity e comes with its own description document; in the second, entities are only mentioned inside ordinary documents)
Ranking with ready-made
entity descriptions
This is not unrealistic...
Document-based entity
representations
- Most entities have a “home page”

- I.e., each entity is described by a document

- In this scenario, ranking entities is much like
ranking documents

- unstructured
- semi-structured
Evaluation initiatives
- INEX Entity Ranking track (2007-09)

- Collection is the (English) Wikipedia
- Entities are represented by Wikipedia articles

- Semantic Search Challenge (2010-11)

- Collection is a Semantic Web crawl (BTC2009)
- ~1 billion RDF triples

- Entities are represented by URIs

- INEX Linked Data track (2012-13)

- Wikipedia enriched with RDF properties from
DBpedia and YAGO
Standard Language Modeling approach
- Rank documents d according to their likelihood of being relevant given a query q: P(d|q)

P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)

- Query likelihood P(q|d): probability that query q was “produced” by document d
- Document prior P(d): probability of the document being relevant to any query

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
Standard Language Modeling approach (2)

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
- n(t,q): number of times t appears in q
- P(t|θ_d): document language model, a multinomial probability distribution over the vocabulary of terms

P(t|θ_d) = (1-λ) P(t|d) + λ P(t|C)
- λ: smoothing parameter
- Empirical document model (maximum-likelihood estimate): P(t|d) = n(t,d) / |d|
- Collection model (maximum-likelihood estimate): P(t|C) = Σ_d n(t,d) / Σ_d |d|
Here, documents == entities, so

P(e|q) ∝ P(e) P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)}
- Entity prior P(e): probability of the entity being relevant to any query
- Entity language model P(t|θ_e): multinomial probability distribution over the vocabulary of terms
Semi-structured entity representation
- Entity description documents are rarely unstructured
- Representing entities as
  - Fielded documents: the IR approach
  - Graphs: the DB/SW approach

Example (dbpedia:Audi_A4):
- foaf:name: Audi A4
- rdfs:label: Audi A4
- rdfs:comment: “The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]”
- dbpprop:production: 1994, 2001, 2005, 2008
- rdf:type: dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
- dbpedia-owl:manufacturer: dbpedia:Audi
- dbpedia-owl:class: dbpedia:Compact_executive_car
- owl:sameAs: freebase:Audi A4
- is dbpedia-owl:predecessor of: dbpedia:Audi_A5
- is dbpprop:similar of: dbpedia:Cadillac_BLS
Mixture of Language Models [Ogilvie & Callan 2003]
- Build a separate language model for each field
- Take a linear combination of them

P(t|θ_d) = Σ_{j=1}^m μ_j P(t|θ_{d_j})
- Field language model P(t|θ_{d_j}): smoothed with a collection model built from all document representations of the same type in the collection
- Field weights μ_j, with Σ_{j=1}^m μ_j = 1
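A minimal sketch of MLM scoring, assuming documents arrive as a dict mapping field name to token list; the helper names and the smoothing weight are made up for illustration:

```python
def field_lm(field_tokens, coll_field_tokens, t, lam=0.1):
    """P(t|theta_dj): field language model, smoothed with a collection model
    built from all representations of the same field type."""
    p_ml = field_tokens.count(t) / max(len(field_tokens), 1)    # ML estimate
    p_coll = coll_field_tokens.count(t) / max(len(coll_field_tokens), 1)
    return (1 - lam) * p_ml + lam * p_coll

def mlm_term_prob(t, doc_fields, coll_fields, weights):
    """P(t|theta_d) = sum_j mu_j * P(t|theta_dj), with the mu_j summing to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[f] * field_lm(doc_fields[f], coll_fields[f], t)
               for f in weights)
```

Shifting weight toward the field a term actually occurs in raises its probability, which is exactly the knob the field weights expose.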
Comparison of models

(diagram: three document representations; the unstructured document model pools all terms of d into one distribution, the fielded document model keeps one term distribution per field, and the hierarchical document model groups fields under field types)
Setting field weights
- Heuristically
  - Proportional to the length of text content in that field, to the field’s individual performance, etc.
- Empirically (using training queries)
- Problems
  - Number of possible fields is huge
  - It is not possible to optimise their weights directly
  - Entities are sparse w.r.t. different fields
    - Most entities have only a handful of predicates
Predicate folding
- Idea: reduce the number of fields by grouping them together
- Grouping based on (in BM25F-style models)
  - type [Pérez-Agüera et al. 2010]
  - manually determined importance [Blanco et al. 2011]
Hierarchical Entity Model
[Neumayer et al. 2012]

- Organize fields into a 2-level hierarchy

- Field types (4) on the top level
- Individual fields of that type on the bottom level

- Estimate field weights

- Using training data for field types
- Using heuristics for the individual bottom-level fields
Two-level hierarchy [Neumayer et al. 2012]

(example, dbpedia:Audi_A4)
- Name: foaf:name, rdfs:label
- Attributes: rdfs:comment, dbpprop:production
- Out-relations: rdf:type, dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs
- In-relations: is dbpedia-owl:predecessor of, is dbpprop:similar of
Comparison of models (recap: unstructured vs. fielded vs. hierarchical document models)
Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009]
- Extension to the Mixture of Language Models
- Find which document field each query term may be associated with
- The fixed field weights μ_j in P(t|θ_d) = Σ_{j=1}^m μ_j P(t|θ_{d_j}) are replaced by a mapping probability, estimated for each query term:

P(t|θ_d) = Σ_{j=1}^m P(d_j|t) P(t|θ_{d_j})
Estimating the mapping probability

P(d_j|t) = P(t|d_j) P(d_j) / P(t) = P(t|d_j) P(d_j) / Σ_{d_k} P(t|d_k) P(d_k)
- Term likelihood P(t|d_j): probability of a query term occurring in a given field type, estimated from collection statistics: P(t|C_j) = Σ_d n(t,d_j) / Σ_d |d_j|
- Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
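The mapping probability can be estimated as sketched below; the `coll_fields` layout (field type mapped to the concatenated collection tokens of that field) and the uniform prior are illustrative assumptions:

```python
def mapping_probability(t, coll_fields, field_priors=None):
    """P(d_j|t) proportional to P(t|C_j) * P(d_j): which field type a query
    term most likely belongs to, from per-field collection statistics."""
    fields = list(coll_fields)
    if field_priors is None:
        field_priors = {f: 1.0 / len(fields) for f in fields}  # uniform prior
    likelihood = {f: coll_fields[f].count(t) / max(len(coll_fields[f]), 1)
                  for f in fields}                             # P(t|C_j)
    z = sum(likelihood[f] * field_priors[f] for f in fields)   # P(t)
    if z == 0.0:
        return dict(field_priors)  # term unseen anywhere: fall back to prior
    return {f: likelihood[f] * field_priors[f] / z for f in fields}
```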
Example: query “meg ryan war”, per-term mapping probabilities P(d_j|t)

  meg:   cast 0.407, team 0.382, title 0.187
  ryan:  cast 0.601, team 0.381, title 0.017
  war:   genre 0.927, title 0.07, location 0.002
Ranking without explicit
entity representations
Scenario

- Entity descriptions are not readily available

- Entity occurrences are annotated

- manually
- automatically (~entity linking)
TREC Enterprise track
- Expert finding task (2005-08)

- Enterprise setting (intranet of a large organization)
- Given a query, return people who are experts on the
query topic
- List of potential experts is provided

- We assume that the collection has been
annotated with <person>...</person> tokens
The basic idea

Use documents to go from queries to entities

(diagram: query q is matched against documents; documents in turn mention entities e)
- Query-document association: the document’s relevance
- Document-entity association: how well the document characterises the entity
Two principal approaches
- Profile-based methods

- Create a textual profile for entities, then rank them
(by adapting document retrieval techniques)

- Document-based methods

- Indirect representation based on mentions identified
in documents
- First ranking documents (or snippets) and then
aggregating evidence for associated entities
Profile-based methods

(diagram: mentions of each entity e are collected from across the documents into a textual profile per entity, which is then matched against the query q)
Document-based methods

(diagram: documents are first ranked with respect to the query q; evidence from the entity mentions in the top-ranked documents is then aggregated per entity)
Many possibilities in terms of
modeling
- Generative (probabilistic) models

- Discriminative (probabilistic) models

- Voting models

- Graph-based models
Generative probabilistic
models
- Candidate generation models (P(e|q))

- Two-stage language model

- Topic generation models (P(q|e))

- Candidate model, a.k.a. Model 1
- Document model, a.k.a. Model 2
- Proximity-based variations

- Both families of models can be derived from the
Probability Ranking Principle [Fang & Zhai 2007]
Candidate models (“Model 1”) [Balog et al. 2006]

P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)}

- Smoothing with a collection-wide background model: P(t|θ_e) = (1-λ) P(t|e) + λ P(t)
- Term-candidate co-occurrence: P(t|e) = Σ_d P(t|d,e) P(d|e)
  - P(t|d,e): co-occurrence in a particular document; in the simplest case P(t|d)
  - P(d|e): document-entity association
Document models (“Model 2”) [Balog et al. 2006]

P(q|e) = Σ_d P(q|d,e) P(d|e)
- Document relevance P(q|d,e) = ∏_{t∈q} P(t|d,e)^{n(t,q)}: how well document d supports the claim that e is relevant to q
- Document-entity association P(d|e)
- Simplifying assumption (t and e are conditionally independent given d): P(t|d,e) = P(t|θ_d)
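Under the simplifying assumption, Model 2 can be sketched as follows, assuming pre-computed smoothed document language models and document-entity association strengths (all names hypothetical):

```python
def model2_score(query_terms, doc_lms, assoc):
    """'Model 2': P(q|e) = sum_d P(q|d) * P(d|e), with
    P(q|d) = prod_{t in q} P(t|theta_d), using P(t|d,e) ~ P(t|theta_d).
    doc_lms: doc_id -> {term: smoothed probability}
    assoc:   doc_id -> P(d|e) for the entity being scored."""
    score = 0.0
    for d, p_de in assoc.items():
        p_qd = 1.0
        for t in query_terms:
            p_qd *= doc_lms[d].get(t, 1e-9)  # small floor for unseen terms
        score += p_qd * p_de
    return score
```

Each associated document votes with its own query likelihood, weighted by how strongly it characterises the entity.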
Document-entity associations

- Boolean (or set-based) approach

- Weighted by the confidence in entity linking

- Consider other entities mentioned in the
document
Proximity-based variations
- So far, we assumed conditional independence between candidates and terms when computing the probability P(t|d,e)
- The relationship between terms and entities that appear in the same document is ignored
  - The entity is equally strongly associated with everything discussed in that document
- Let’s capture the dependence between entities and terms
  - Use their distance in the document
Using proximity kernels [Petkova & Croft 2007]

P(t|d,e) = (1/Z) Σ_{i=1}^N δ_d(i,t) k(i,e)
- Z: normalizing constant
- δ_d(i,t): indicator function, 1 if the term at position i is t, 0 otherwise
- k(i,e): proximity-based kernel over the distance between position i and the entity mention, e.g.
  - constant function
  - triangle kernel
  - Gaussian kernel
  - step function

Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
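A sketch in this spirit using a Gaussian kernel; the kernel width `sigma`, the nearest-mention simplification, and all names are assumptions, not the paper’s exact estimator:

```python
import math

def proximity_term_prob(doc_tokens, entity_positions, t, sigma=20.0):
    """Kernel-based P(t|d,e): each occurrence of a term is weighted by a
    Gaussian kernel of its distance to the nearest entity mention, then
    normalized over the vocabulary (the 1/Z factor)."""
    def k(i):  # Gaussian proximity kernel to the closest entity mention
        dist = min(abs(i - p) for p in entity_positions)
        return math.exp(-dist ** 2 / (2 * sigma ** 2))

    weights = {}
    for i, tok in enumerate(doc_tokens):
        weights[tok] = weights.get(tok, 0.0) + k(i)
    z = sum(weights.values())
    return weights.get(t, 0.0) / z
```

Terms near the entity mention receive more probability mass than equally frequent terms far away, which is exactly the dependence the independence assumption threw away.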
Many possibilities in terms of
modeling
- Generative probabilistic models

- Discriminative probabilistic models

- Voting models

- Graph-based models
Discriminative models
- Vs. generative models:

- Fewer assumptions (e.g., term independence)
- “Let the data speak”
- Sufficient amounts of training data required

- Incorporating more document features, multiple
signals for document-entity associations
- Estimating P(r=1|e,q) directly (instead of P(e,q|r=1))
- Optimization can get trapped in a local maximum/
minimum
Arithmetic Mean Discriminative (AMD) model [Yang et al. 2010]

P_θ(r=1|e,q) = Σ_d P(r_1=1|q,d) P(r_2=1|e,d) P(d)
- Query-document relevance P(r_1=1|q,d): standard logistic function over a linear combination of features, σ(Σ_{i=1}^{N_f} α_i f_i(q,d))
- Document-entity relevance P(r_2=1|e,d): σ(Σ_{j=1}^{N_g} β_j g_j(e,d))
- α_i, β_j: weight parameters (learned); f_i, g_j: features
- P(d): document prior
Learning to rank & entity retrieval
- Pointwise

- AMD, GMD [Yang et al. 2010]
- Multilayer perceptrons, logistic regression [Sorg &
Cimiano 2011]
- Additive Groves [Moreira et al. 2011]

- Pairwise

- Ranking SVM [Yang et al. 2009]
- RankBoost, RankNet [Moreira et al. 2011]

- Listwise

- AdaRank, Coordinate Ascent [Moreira et al. 2011]
Voting models

[Macdonald & Ounis 2006]

- Inspired by techniques from data fusion

- Combining evidence from different sources

- Documents ranked w.r.t. the query are seen as
“votes” for the entity
Voting models

Many different variants, including...
- Votes
  - Number of retrieved documents mentioning the entity
  - Score(e,q) = |M(e) ∩ R(q)|
- Reciprocal Rank
  - Sum of inverse ranks of those documents
  - Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} 1 / rank(d,q)
- CombSUM
  - Sum of retrieval scores of those documents
  - Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} s(d,q)
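The three variants are straightforward to sketch, assuming the ranking arrives as a score-sorted list of (doc_id, score) pairs and M(e) as a set of document ids (names are illustrative):

```python
def votes(entity_docs, ranking):
    """Votes: number of retrieved documents mentioning the entity."""
    retrieved = {d for d, _ in ranking}
    return len(entity_docs & retrieved)

def reciprocal_rank(entity_docs, ranking):
    """RR: sum of inverse ranks (1-based) of retrieved documents
    mentioning the entity."""
    return sum(1.0 / (rank + 1)
               for rank, (d, _) in enumerate(ranking) if d in entity_docs)

def comb_sum(entity_docs, ranking):
    """CombSUM: sum of retrieval scores of those documents."""
    return sum(score for d, score in ranking if d in entity_docs)
```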
Graph-based models
[Serdyukov et al. 2008]

- One particular way of constructing graphs

- Vertices are documents and entities
- Only document-entity edges

- Search can be approached as a random walk
on this graph

- Pick a random document or entity
- Follow links to entities or other documents
- Repeat it a number of times
Infinite random walk [Serdyukov et al. 2008]

P_i(d) = λ P_J(d) + (1-λ) Σ_{e→d} P(d|e) P_{i-1}(e)
P_i(e) = Σ_{d→e} P(e|d) P_{i-1}(d)
P_J(d) = P(d|q)
- P_J(d): probability of jumping to a query-relevant document
- λ: probability of jumping rather than following a document-entity edge
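A power-iteration sketch of such a walk on a bipartite document-entity graph, with uniform conditional edge probabilities P(e|d) and P(d|e) as a simplifying assumption (all names hypothetical):

```python
def random_walk(doc_query_rel, doc_entity, lam=0.5, iters=50):
    """Alternate between documents and entities: with probability lam the
    walker jumps to a query-relevant document (P_J(d) = P(d|q)); otherwise
    it follows document<->entity edges with uniform probabilities."""
    docs = {d for d, _ in doc_entity}
    ents = {e for _, e in doc_entity}
    out_d = {d: [e for dd, e in doc_entity if dd == d] for d in docs}
    out_e = {e: [d for d, ee in doc_entity if ee == e] for e in ents}
    p_d = dict(doc_query_rel)
    p_e = {e: 0.0 for e in ents}
    for _ in range(iters):
        # P_i(e) = sum_{d->e} P(e|d) P_{i-1}(d)
        p_e = {e: sum(p_d[d] / len(out_d[d]) for d in out_e[e]) for e in ents}
        # P_i(d) = lam * P_J(d) + (1-lam) * sum_{e->d} P(d|e) P_{i-1}(e)
        p_d = {d: lam * doc_query_rel[d]
               + (1 - lam) * sum(p_e[e] / len(out_e[e]) for e in out_d[d])
               for d in docs}
    return p_e
```

Entities reachable from query-relevant documents accumulate probability mass over the iterations.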
Incorporating entity types
For a handful of types, grouping results by entity type is a viable solution

But what about very many types, which are typically hierarchically organized?
Challenges
- Users are not familiar with the type system

- (Often) user input is to be treated as a hint, not as a
strict filter

- Type system is imperfect

- Inconsistencies
- Missing assignments
- Granularity issues
- Entities labeled with too general or too specific types

- In general, categorizing things can be hard 

- E.g. is King Arthur “British royalty”, “fictional character”,
or “military person”?
Two settings

- Target type(s) are provided by the user

- keyword++ query

- Target types need to be automatically identified

- keyword query
Target type(s) are provided

faceted search, form fill-in, etc.
INEX Entity Ranking track
- Entities are represented by Wikipedia articles
- Topic definition includes target categories
- Example topic: “Movies with eight or more Academy Awards”; target categories: best picture oscar, british films, american films
Using target type information
- Constraining results
  - Soft/hard filtering
  - Different ways to measure type similarity
    - Set-based
    - Content-based
    - Lexical similarity of type labels
    - Distance based on the hierarchy
- Query expansion
  - Adding terms from type names to the query
- Entity expansion
  - Types added as a separate metadata field
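Soft filtering can be sketched as score interpolation rather than discarding non-matching entities; Jaccard overlap is just one of the set-based similarity options above, and every name below is hypothetical:

```python
def soft_type_filter(entity_scores, entity_types, target_types, alpha=0.5):
    """Interpolate the retrieval score with a type-match score (soft filter);
    a hard filter would instead drop entities whose types do not match."""
    reranked = {}
    for e, score in entity_scores.items():
        etypes = entity_types.get(e, set())
        # set-based type similarity (Jaccard overlap with the target types)
        overlap = len(etypes & target_types) / max(len(etypes | target_types), 1)
        reranked[e] = (1 - alpha) * score + alpha * overlap
    return reranked
```

With alpha = 1 this degenerates into ranking by type match alone; with alpha = 0 the type hint is ignored.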
Modeling terms and categories [Balog et al. 2011]

P(e|q) ∝ P(q|e) P(e)
P(q|e) = (1-λ) P(θ_q^T | θ_e^T) + λ P(θ_q^C | θ_e^C)

- Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T || θ_e^T)
- Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C || θ_e^C)
Advantages
- Transparent combination of term-based and
category-based information

- Sound modeling of uncertainty associated with
category information

- Category-based feedback is possible
(analogously to the term-based case)
Expanding target types

- Pseudo relevance feedback

- Based on hierarchical structure

- Using lexical similarity of type labels
Two settings

- Target type(s) are provided by the user

- keyword++ query

- Target types need to be automatically identified

- keyword query
Identifying target types for
queries
- Types of top ranked entities [Vallet & Zaragoza
2008]

- Types can be ranked much like entities [Balog &
Neumayer 2012]

- Direct term-based vs. indirect entity-based
representations (“Model 1 vs. Model 2”)
- Hierarchical case is difficult
Joint type detection and entity
ranking [Sawant & Chakrabarti 2013]
- Assumes “telegraphic” queries with target type

- woodrow wilson president university
- dolly clone institute
- lead singer led zeppelin band

- Type detection is integrated into the ranking

- Multiple query interpretations are considered

- Both generative and discriminative
formulations
Approach
- Each query term is either a “type hint” (h(q⃗,z⃗)) or a “word matcher” (s(q⃗,z⃗))
  - Number of possible partitions is manageable (at most 2^|q|)

Example: “losing baseball team world series 1998”
- Type: Major league baseball teams; entity (instanceOf): San Diego Padres
- Evidence snippet (mentionOf): “By comparison, the Padres have been to two World Series, losing in 1984 and 1998.”
Generative approach
Generate query from entity

(diagram: entity E = San Diego Padres with type T = Major league baseball team; a switch variable Z assigns each query term either to the type model, yielding the type hints “baseball, team”, or to the entity context model, yielding the context matchers “losing, 1998, world series”)

Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13. (see presentation)
Generative formulation

P(e|q) ∝ P(e) Σ_{t,z⃗} P(t|e) P(z⃗) P(h(q⃗,z⃗)|t) P(s(q⃗,z⃗)|e)
- P(e): entity prior
- P(t|e): type prior, estimated from answer types in the past
- P(z⃗): query switch, probability of the interpretation
- P(h(q⃗,z⃗)|t): type model, probability of observing the hint terms in the type model
- P(s(q⃗,z⃗)|e): entity model, probability of observing the matcher terms in the entity model
Discriminative approach
Separate correct and incorrect entities

(diagram: for q = “losing team baseball world series 1998”, candidate interpretations pair an entity with a type, e.g. San_Diego_Padres with t = baseball team vs. 1998_World_Series with t = series; training separates correct from incorrect entity-interpretation pairs)

Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13. (see presentation)
Discriminative formulation

φ(q,e,t,z⃗) = ⟨φ_1(q,e), φ_2(t,e), φ_3(q,z⃗,t), φ_4(q,z⃗,e)⟩
- φ_1(q,e): models the entity prior P(e)
- φ_2(t,e): models the type prior P(t|e)
- φ_3(q,z⃗,t): compatibility between hint words and the type
- φ_4(q,z⃗,e): compatibility between matchers and snippets that mention e
Entity relationships
Related entities
TREC Entity track
- Related Entity Finding task

- Given

- Input entity (defined by name and homepage)
- Type of the target entity (PER/ORG/LOC)
- Narrative (describing the nature of the relation in free
text)

- Return (homepages of) related entities
Example information needs
- “airlines that currently use Boeing 747 planes” (target type: ORG; input entity: Boeing 747)
- “Members of The Beaux Arts Trio” (target type: PER; input entity: The Beaux Arts Trio)
- “What countries does Eurail operate in?” (target type: LOC; input entity: Eurail)
A typical pipeline

Input (entity, target type, relation)
  → Candidate entities (retrieving docs/snippets, query expansion, ...)
  → Ranked list of entities (type filtering, deduplication, exploiting lists, ...)
  → Entity homepages (heuristic rules, learning, ...)
Modeling related entity finding [Bron et al. 2010]
- Three-component model

p(e|E,T,R) ∝ p(e|E) · p(T|e) · p(R|E,e)
- p(e|E): co-occurrence model
- p(T|e): type filtering
- p(R|E,e): context model
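The three components can be instantiated with deliberately simple stand-in estimators (co-occurrence counts, a hard type filter, term overlap for the context model); everything below is an illustrative assumption, not the paper’s estimators:

```python
def related_entity_score(E_docs, e_docs, e_type, target_type,
                         relation_terms, context_terms):
    """Sketch of p(e|E,T,R) ~ p(e|E) * p(T|e) * p(R|E,e):
    co-occurrence of e with the input entity E, a hard target-type filter,
    and overlap between the relation narrative and co-occurrence contexts."""
    co_docs = E_docs & e_docs
    p_cooc = len(co_docs) / max(len(E_docs), 1)       # p(e|E)
    p_type = 1.0 if e_type == target_type else 0.0    # p(T|e), hard filter
    hits = sum(1 for t in relation_terms if t in context_terms)
    p_context = hits / max(len(relation_terms), 1)    # p(R|E,e)
    return p_cooc * p_type * p_context
```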
Wrapping up
- Increasingly, discriminative approaches are favored over generative ones
  - Increasing number of components (and parameters)
  - Easier to incrementally add informative but correlated features
  - But massive amounts of training data are required!
Future challenges
- It’s “easy” when the “query intent” is known
  - Desired results: single entity, ranked list, set, ...
  - Query type: ad-hoc, list search, related entity finding, ...
  - Methods specifically tailored to specific types of requests
- Understanding query intent still has a long way to go
  • 3. Distribution of web search queries [Pound et al. 2010] 6% 41% 36% 1% 5% 12% Entity (“1978 cj5 jeep”) Type (“doctors in barcelona”) Attribute (“zip code waterville Maine”) Relation (“tom cruise katie holmes”) Other (“nightlife in Barcelona”) Uninterpretable
  • 4. Distribution of web search queries [Lin et al. 2011] 28% 29% 15% 14% 10% 4% Entity Entity+refiner Category Category+refiner Other Website
  • 5. What’s so special here? - Entities are not always directly represented - Recognize and disambiguate entities in text (that is, entity linking) - Collect and aggregate information about a given entity from multiple documents and even multiple data collections - More structure than in document-based IR - Types (from some taxonomy) - Attributes (from some ontology) - Relationships to other entities (“typed links”)
  • 6. Semantics in our context - working definition: references to meaningful structures - How to capture, represent, and use structure? - It concerns all components of the retrieval process! [figure: text-only vs. text+structure representations of info need, matching, and entity]
  • 7. Overview of core tasks
    Task | Queries | Data set | Results
    (adhoc) entity retrieval | keyword | unstructured/semistructured | ranked list
    adhoc object retrieval | keyword | structured | ranked list
    list completion | keyword++ (examples) | (semi)structured | ranked list
    related entity finding | keyword+++ (target type, relation) | unstructured & structured | ranked list
  • 8. In this part - Input: keyword(++) query - Output: a ranked list of entities - Data collection: unstructured and (semi)structured data sources (and their combinations) - Main RQ: How to incorporate structure into text-based retrieval models?
  • 9. Outline 1. Ranking based on entity descriptions 2. Incorporating entity types 3. Entity relationships
  • 11. Task: ad-hoc entity retrieval - Input: unconstrained natural language query - “telegraphic” queries (neither well-formed nor grammatically correct sentences or questions) - Output: ranked list of entities - Collection: unstructured and/or semistructured documents
  • 12. Example information needs american embassy nairobi ben franklin Chernobyl meg ryan war Worst actor century Sweden Iceland currency
  • 13. Two settings 1. With ready-made entity descriptions [figure: each entity e is described by its own document] 2. Without explicit entity representations [figure: entity mentions scattered across multiple documents]
  • 15. This is not unrealistic...
  • 16. Document-based entity representations - Most entities have a “home page” - I.e., each entity is described by a document - In this scenario, ranking entities is much like ranking documents - unstructured - semi-structured
  • 17. Evaluation initiatives - INEX Entity Ranking track (2007-09) - Collection is the (English) Wikipedia - Entities are represented by Wikipedia articles - Semantic Search Challenge (2010-11) - Collection is a Semantic Web crawl (BTC2009) - ~1 billion RDF triples - Entities are represented by URIs - INEX Linked Data track (2012-13) - Wikipedia enriched with RDF properties from DBpedia and YAGO
  • 18. Standard Language Modeling approach - Rank documents d according to their likelihood of being relevant given a query q: P(d|q) = P(q|d)P(d) / P(q) ∝ P(q|d)P(d) - Query likelihood P(q|d): probability that query q was “produced” by document d - Document prior P(d): probability of the document being relevant to any query
  • 19. Standard Language Modeling approach (2) - Query likelihood: P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}, where n(t,q) is the number of times t appears in q - Document language model θ_d: multinomial probability distribution over the vocabulary of terms, smoothed with a collection model: P(t|θ_d) = (1−λ)P(t|d) + λP(t|C), with smoothing parameter λ - Maximum likelihood estimates: empirical document model P(t|d) = n(t,d)/|d|; collection model P(t|C) = Σ_d n(t,d) / Σ_d |d|
  • 20. Here, documents==entities, so P(e|q) ∝ P(e)P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)} - Entity prior P(e): probability of the entity being relevant to any query - Entity language model θ_e: multinomial probability distribution over the vocabulary of terms
  • 21. Semi-structured entity representation - Entity description documents are rarely unstructured - Representing entities as - Fielded documents – the IR approach - Graphs – the DB/SW approach
  • 22. Example: the entity dbpedia:Audi_A4 - foaf:name, rdfs:label: “Audi A4” - rdfs:comment: “The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]” - dbpprop:production: 1994, 2001, 2005, 2008 - rdf:type: dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile - dbpedia-owl:manufacturer: dbpedia:Audi - dbpedia-owl:class: dbpedia:Compact_executive_car - owl:sameAs: freebase:Audi A4 - is dbpedia-owl:predecessor of: dbpedia:Audi_A5 - is dbpprop:similar of: dbpedia:Cadillac_BLS
  • 23. Mixture of Language Models [Ogilvie & Callan 2003] - Build a separate language model for each field - Take a linear combination of them: P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j}) - Field weights sum to one: Σ_{j=1}^{m} μ_j = 1 - Each field language model P(t|θ_{d_j}) is smoothed with a collection model built from all document representations of the same type in the collection
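The field mixture can be sketched as follows (a minimal illustration with hypothetical field names; the per-field probabilities are assumed to be already smoothed):

```python
def mlm_term_prob(term, field_models, weights):
    """Mixture of Language Models: P(t|theta_d) = sum_j mu_j * P(t|theta_{d_j}).
    field_models: field -> {term: smoothed probability P(t|theta_{d_j})}
    weights:      field -> mu_j, required to sum to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(mu * field_models[f].get(term, 0.0) for f, mu in weights.items())
```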
  • 24. Comparison of models [figure: unstructured document model (d → t), fielded document model (d → fields → t), hierarchical document model (d → field types → fields → t)]
  • 25. Setting field weights - Heuristically - Proportional to the length of text content in that field, to the field’s individual performance, etc. - Empirically (using training queries) - Problems - Number of possible fields is huge - It is not possible to optimise their weights directly - Entities are sparse w.r.t. different fields - Most entities have only a handful of predicates
  • 26. Predicate folding - Idea: reduce the number of fields by grouping them together - Grouping (with BM25F) based on - type [Pérez-Agüera et al. 2010] - manually determined importance [Blanco et al. 2011]
  • 27. Hierarchical Entity Model [Neumayer et al. 2012] - Organize fields into a 2-level hierarchy - Field types (4) on the top level - Individual fields of that type on the bottom level - Estimate field weights - Using training data for field types - Using heuristics for bottom-level fields
  • 28. Two-level hierarchy [Neumayer et al. 2012] (example: dbpedia:Audi_A4) - Name: foaf:name, rdfs:label - Attributes: rdfs:comment, dbpprop:production, rdf:type - Out-relations: dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs - In-relations: is dbpedia-owl:predecessor of, is dbpprop:similar of
  • 29. Comparison of models [figure: unstructured document model (d → t), fielded document model (d → fields → t), hierarchical document model (d → field types → fields → t)]
  • 30. Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009] - Extension to the Mixture of Language Models P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j}) - Find which document field each query term may be associated with: the static field weights are replaced by a mapping probability, estimated for each query term: P(t|θ_d) = Σ_{j=1}^{m} P(d_j|t) P(t|θ_{d_j})
  • 31. Estimating the mapping probability P(d_j|t) = P(t|C_j)P(d_j) / P(t), with P(t) = Σ_{d_k} P(t|C_k)P(d_k) - Term likelihood P(t|C_j) = Σ_d n(t,d_j) / Σ_d |d_j|: probability of a query term occurring in a given field type - Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
  • 32. Example: “meg ryan war”; mapping probabilities P(d_j|t) per field d_j
    meg: cast 0.407, team 0.382, title 0.187
    ryan: cast 0.601, team 0.381, title 0.017
    war: genre 0.927, title 0.070, location 0.002
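The mapping probability and the resulting term probability can be sketched as follows (toy numbers, not the actual collection statistics; `field_lms` holds the field-type collection models P(t|C_j)):

```python
def mapping_prob(term, field_lms, field_priors):
    """PRMS mapping: P(d_j|t) = P(t|C_j) P(d_j) / sum_k P(t|C_k) P(d_k).
    field_lms:    field -> {term: P(t|C_j)}  (collection-level field models)
    field_priors: field -> P(d_j)            (prior field probabilities)"""
    joint = {f: field_lms[f].get(term, 0.0) * field_priors[f] for f in field_priors}
    z = sum(joint.values())
    return {f: (v / z if z > 0 else 0.0) for f, v in joint.items()}

def prms_term_prob(term, field_models, field_lms, field_priors):
    """P(t|theta_d) = sum_j P(d_j|t) * P(t|theta_{d_j})."""
    mp = mapping_prob(term, field_lms, field_priors)
    return sum(mp[f] * field_models[f].get(term, 0.0) for f in mp)
```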
  • 34. Scenario - Entity descriptions are not readily available - Entity occurrences are annotated - manually - automatically (~entity linking)
  • 35. TREC Enterprise track - Expert finding task (2005-08) - Enterprise setting (intranet of a large organization) - Given a query, return people who are experts on the query topic - List of potential experts is provided - We assume that the collection has been annotated with <person>...</person> tokens
  • 36. The basic idea: use documents to go from queries to entities (q → d → e) - Query–document association: the document’s relevance - Document–entity association: how well the document characterises the entity
  • 37. Two principal approaches - Profile-based methods - Create a textual profile for entities, then rank them (by adapting document retrieval techniques) - Document-based methods - Indirect representation based on mentions identified in documents - First ranking documents (or snippets) and then aggregating evidence for associated entities
  • 38. Profile-based methods [figure: the documents mentioning each entity e are aggregated into a textual profile, which is then matched against the query q]
  • 39. Document-based methods [figure: documents are ranked for the query q, and the entities e mentioned in the top-ranked documents accumulate evidence]
  • 40. Many possibilities in terms of modeling - Generative (probabilistic) models - Discriminative (probabilistic) models - Voting models - Graph-based models
  • 41. Generative probabilistic models - Candidate generation models (P(e|q)) - Two-stage language model - Topic generation models (P(q|e)) - Candidate model, a.k.a. Model 1 - Document model, a.k.a. Model 2 - Proximity-based variations - Both families of models can be derived from the Probability Ranking Principle [Fang & Zhai 2007]
  • 42. Candidate models (“Model 1”) [Balog et al. 2006] P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)} - Smoothing with a collection-wide background model: P(t|θ_e) = (1−λ)P(t|e) + λP(t) - P(t|e) = Σ_d P(t|d,e)P(d|e), where P(t|d,e) is the term–candidate co-occurrence in a particular document (in the simplest case P(t|d)) and P(d|e) is the document–entity association
  • 43. Document models (“Model 2”) [Balog et al. 2006] P(q|e) = Σ_d P(q|d,e)P(d|e) - Document relevance P(q|d,e) = ∏_{t∈q} P(t|d,e)^{n(t,q)}: how well document d supports the claim that e is relevant to q - Document–entity association P(d|e) - Simplifying assumption (t and e are conditionally independent given d): P(t|d,e) = P(t|θ_d)
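Under that simplifying assumption, Model 2 reduces to a weighted sum of document scores; a minimal sketch (document query-likelihoods and associations are assumed precomputed, with made-up identifiers):

```python
def model2_score(p_q_given_d, assoc, entity):
    """Document model ("Model 2"): P(q|e) = sum_d P(q|theta_d) * P(d|e).
    p_q_given_d: doc -> query likelihood P(q|theta_d)
    assoc:       entity -> {doc: P(d|e)}  (document-entity associations)"""
    return sum(p_q_given_d.get(d, 0.0) * w
               for d, w in assoc.get(entity, {}).items())
```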
  • 44. Document-entity associations - Boolean (or set-based) approach - Weighted by the confidence in entity linking - Consider other entities mentioned in the document
  • 45. Proximity-based variations - So far, conditional independence assumption between candidates and terms when computing the probability P(t|d,e) - The relationship between terms and entities that occur in the same document is ignored - The entity is equally strongly associated with everything discussed in that document - Let’s capture the dependence between entities and terms - Use their distance in the document
  • 46. Using proximity kernels [Petkova & Croft 2007] P(t|d,e) = (1/Z) Σ_{i=1}^{N} δ_d(i,t) k(i,e) - Normalizing constant Z - Indicator function δ_d(i,t): 1 if the term at position i is t, 0 otherwise - Proximity-based kernel k: constant function, triangle kernel, Gaussian kernel, step function
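As a toy illustration of the kernel idea (here a Gaussian kernel with a single entity mention position; the paper's estimator sums over all mentions, so this is a simplified sketch):

```python
import math

def gaussian_kernel(i, entity_pos, sigma=50.0):
    """k(i, e): weight of position i by its distance to the entity mention."""
    return math.exp(-((i - entity_pos) ** 2) / (2 * sigma ** 2))

def proximity_term_prob(doc_terms, term, entity_pos, sigma=50.0):
    """P(t|d,e) = (1/Z) * sum_i delta_d(i,t) * k(i,e),
    where Z sums the kernel over all positions in the document."""
    z = sum(gaussian_kernel(i, entity_pos, sigma) for i in range(len(doc_terms)))
    mass = sum(gaussian_kernel(i, entity_pos, sigma)
               for i, w in enumerate(doc_terms) if w == term)
    return mass / z if z else 0.0
```

Terms close to the entity mention thus contribute more probability mass than distant ones.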
  • 47. Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
  • 48. Many possibilities in terms of modeling - Generative probabilistic models - Discriminative probabilistic models - Voting models - Graph-based models
  • 49. Discriminative models - Vs. generative models: - Fewer assumptions (e.g., term independence) - “Let the data speak” - Sufficient amounts of training data required - Incorporating more document features, multiple signals for document-entity associations - Estimating P(r=1|e,q) directly (instead of P(e,q|r=1)) - Optimization can get trapped in a local maximum/minimum
  • 50. Arithmetic Mean Discriminative (AMD) model [Yang et al. 2010] P_θ(r=1|e,q) = Σ_d P(r_1=1|q,d) P(r_2=1|e,d) P(d) - Query–document relevance: standard logistic function over a linear combination of features, P(r_1=1|q,d) = σ(Σ_{i=1}^{N_f} α_i f_i(q,d)) - Document–entity relevance: P(r_2=1|e,d) = σ(Σ_{j=1}^{N_g} β_j g_j(e,d)) - Weight parameters α, β are learned; P(d) is a document prior
  • 51. Learning to rank && entity retrieval - Pointwise - AMD, GMD [Yang et al. 2010] - Multilayer perceptrons, logistic regression [Sorg & Cimiano 2011] - Additive Groves [Moreira et al. 2011] - Pairwise - Ranking SVM [Yang et al. 2009] - RankBoost, RankNet [Moreira et al. 2011] - Listwise - AdaRank, Coordinate Ascent [Moreira et al. 2011]
  • 52. Voting models [Macdonald & Ounis 2006] - Inspired by techniques from data fusion - Combining evidence from different sources - Documents ranked w.r.t. the query are seen as “votes” for the entity
  • 53. Voting models Many different variants, including... - Votes: number of documents mentioning the entity, Score(e,q) = |M(e) ∩ R(q)| - Reciprocal Rank: sum of inverse ranks of documents, Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} 1/rank(d,q) - CombSUM: sum of scores of documents, Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} s(d,q) - where M(e) is the set of documents mentioning e and R(q) the document ranking for q
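The three variants can be sketched in one small function (hypothetical inputs; `ranking` plays the role of R(q) and `mentions` induces M(e)):

```python
def vote_scores(ranking, mentions, variant="votes"):
    """Aggregate document 'votes' into entity scores.
    ranking:  list of (doc, retrieval_score) pairs, best first (R(q)).
    mentions: doc -> set of entities mentioned in it.
    variant:  'votes', 'rr' (reciprocal rank), or 'combsum'."""
    scores = {}
    for rank, (doc, s) in enumerate(ranking, start=1):
        for e in mentions.get(doc, ()):
            if variant == "votes":      # counts |M(e) ∩ R(q)|
                contrib = 1.0
            elif variant == "rr":       # sums 1/rank(d,q)
                contrib = 1.0 / rank
            else:                       # CombSUM: sums s(d,q)
                contrib = s
            scores[e] = scores.get(e, 0.0) + contrib
    return scores
```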
  • 54. Graph-based models [Serdyukov et al. 2008] - One particular way of constructing graphs - Vertices are documents and entities - Only document-entity edges - Search can be approached as a random walk on this graph - Pick a random document or entity - Follow links to entities or other documents - Repeat it a number of times
  • 55. Infinite random walk [Serdyukov et al. 2008] P_i(e) = Σ_{d→e} P(e|d) P_{i−1}(d); P_i(d) = λ P_J(d) + (1−λ) Σ_{e→d} P(d|e) P_{i−1}(e); jump probability P_J(d) = P(d|q)
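A sketch of the iterated walk on a toy graph (the transition probabilities and jump distribution `p_dq` = P(d|q) are assumed given; document and entity names are made up):

```python
def infinite_random_walk(p_dq, p_ed, p_de, lam=0.2, iters=100):
    """P_i(e) = sum_d P(e|d) P_{i-1}(d)
       P_i(d) = lam * P_J(d) + (1-lam) * sum_e P(d|e) P_{i-1}(e), P_J(d) = P(d|q).
    p_dq: doc -> P(d|q); p_ed: doc -> {entity: P(e|d)}; p_de: entity -> {doc: P(d|e)}."""
    pd = dict(p_dq)   # start the walk from the query-based jump distribution
    pe = {}
    for _ in range(iters):
        pe = {}
        for d, dist in p_ed.items():
            for e, p in dist.items():
                pe[e] = pe.get(e, 0.0) + p * pd.get(d, 0.0)
        pd = {d: lam * p_dq[d] + (1 - lam) * sum(p_de.get(e, {}).get(d, 0.0) * pe.get(e, 0.0)
                                                 for e in pe)
              for d in p_dq}
    return pe
```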
  • 57. For a handful of types grouping results by entity type is a viable solution
  • 58. For a handful of types grouping results by entity type is a viable solution
  • 59. But what about very many types? (which are typically hierarchically organized)
  • 60. Challenges - Users are not familiar with the type system - (Often) user input is to be treated as a hint, not as a strict filter - Type system is imperfect - Inconsistencies - Missing assignments - Granularity issues - Entities labeled with too general or too specific types - In general, categorizing things can be hard - E.g. is King Arthur “British royalty”, “fictional character”, or “military person”?
  • 61. Two settings - Target type(s) are provided by the user - keyword++ query - Target types need to be automatically identified - keyword query
  • 62. Target type(s) are provided: faceted search, form fill-in, etc.
  • 63. INEX Entity Ranking track - Entities are represented by Wikipedia articles - Topic definition includes target categories - Example query: “Movies with eight or more Academy Awards”; categories: best picture oscar, british films, american films
  • 65. Using target type information - Constraining results - Soft/hard filtering - Different ways to measure type similarity: set-based, content-based, lexical similarity of type labels, distance based on the hierarchy - Query expansion - Adding terms from type names to the query - Entity expansion - Types added as a separate metadata field
  • 66. Modeling terms and categories [Balog et al. 2011] P(e|q) ∝ P(q|e)P(e), with P(q|e) = (1−λ)P(θ_q^T|θ_e^T) + λP(θ_q^C|θ_e^C) - Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T||θ_e^T) - Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C||θ_e^C)
  • 67. Advantages - Transparent combination of term-based and category-based information - Sound modeling of uncertainty associated with category information - Category-based feedback is possible (analogously to the term-based case)
  • 68. Expanding target types - Pseudo relevance feedback - Based on hierarchical structure - Using lexical similarity of type labels
  • 69. Two settings - Target type(s) are provided by the user - keyword++ query - Target types need to be automatically identified - keyword query
  • 70. Identifying target types for queries - Types of top ranked entities [Vallet & Zaragoza 2008] - Types can be ranked much like entities [Balog & Neumayer 2012] - Direct term-based vs. indirect entity-based representations (“Model 1 vs. Model 2”) - Hierarchical case is difficult
  • 71. Joint type detection and entity ranking [Sawant & Chakrabarti 2013] - Assumes “telegraphic” queries with target type - woodrow wilson president university - dolly clone institute - lead singer led zeppelin band - Type detection is integrated into the ranking - Multiple query interpretations are considered - Both generative and discriminative formulations
  • 72. Approach - Each query term is either a “type hint” (h(q⃗,z⃗)) or a “word matcher” (s(q⃗,z⃗)) - Number of possible partitions is manageable (2^|q|) - Example: “losing baseball team world series 1998” - Type: Major league baseball teams - instanceOf - Entity: San Diego Padres - mentionOf - Evidence snippet: “By comparison, the Padres have been to two World Series, losing in 1984 and 1998.”
  • 73. Generative approach: generate the query from the entity - Entity E: San Diego Padres; type T: Major league baseball team - The type model generates the type hints (“baseball”, “team”); the entity context model generates the context matchers (“lost”, “1998”, “world series”); a switch variable z decides the role of each term of q: “losing team baseball world series 1998” - Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13.
  • 74. Generative formulation P(e|q) ∝ P(e) Σ_{t,z⃗} P(t|e) P(z⃗) P(h(q⃗,z⃗)|t) P(s(q⃗,z⃗)|e) - Entity prior P(e) - Type prior P(t|e): probability of the answer type given the entity - Query switch P(z⃗): estimated from query interpretations in the past - Type model P(h(q⃗,z⃗)|t): probability of observing the hint words in the type model - Entity model P(s(q⃗,z⃗)|e): probability of observing the matcher words in the entity model
  • 75. Discriminative approach: separate correct and incorrect entities - q: “losing team baseball world series 1998” - Candidate interpretations pair an entity with a type, e.g. San_Diego_Padres (t = baseball team) vs. 1998_World_Series (t = series) - Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13.
  • 76. Discriminative formulation φ(q,e,t,z⃗) = ⟨φ_1(q,e), φ_2(t,e), φ_3(q,z⃗,t), φ_4(q,z⃗,e)⟩ - φ_1 models the entity prior P(e) - φ_2 models the type prior P(t|e) - φ_3: compatibility between hint words and the type - φ_4: compatibility between matchers and snippets that mention e
  • 81. TREC Entity track - Related Entity Finding task - Given - Input entity (defined by name and homepage) - Type of the target entity (PER/ORG/LOC) - Narrative (describing the nature of the relation in free text) - Return (homepages of) related entities
  • 82. Example information needs - “airlines that currently use Boeing 747 planes” (ORG; input entity: Boeing 747) - “Members of The Beaux Arts Trio” (PER; input entity: The Beaux Arts Trio) - “What countries does Eurail operate in?” (LOC; input entity: Eurail)
  • 83. A typical pipeline Input (entity, target type, relation) → Candidate entities (retrieving docs/snippets, query expansion, ...) → Ranked list of entities (type filtering, deduplication, exploiting lists, ...) → Entity homepages (heuristic rules, learning, ...)
  • 84. Modeling related entity finding [Bron et al. 2010] - Three-component model p(e|E,T,R) ∝ p(e|E) · p(T|e) · p(R|E,e) - Co-occurrence model p(e|E), type filtering p(T|e), context model p(R|E,e)
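A minimal sketch of ranking with the three-component model (the component probabilities are assumed precomputed; the candidate entities are made up):

```python
def rank_related_entities(candidates):
    """candidates: entity -> (p_e_given_E, p_T_given_e, p_R_given_Ee).
    Scores each entity by the product of the co-occurrence, type-filter,
    and context components: p(e|E,T,R) ∝ p(e|E) * p(T|e) * p(R|E,e)."""
    scored = {e: p1 * p2 * p3 for e, (p1, p2, p3) in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Note how a hard type filter falls out naturally: an entity with p(T|e) = 0 gets score zero regardless of the other components.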
  • 85. Wrapping up - Increasingly more discriminative approaches over generative ones - Increasing number of components (and parameters) - Easier to incrementally add informative but correlated features - But, (massive amounts of) training data is required!
  • 86. Future challenges - It’s “easy” when the “query intent” is known - Desired results: single entity, ranked list, set, … - Query type: ad-hoc, list search, related entity finding, … - Methods specifically tailored to specific types of requests - Understanding query intent still has a long way to go