Part II

Entity Retrieval
Krisztian Balog 

University of Stavanger

Half-day tutorial at the WSDM’14 conference | New York City, USA, 2014
Entity retrieval

Addressing information needs that are better
answered by returning specific objects
(entities) instead of just any type of documents.
Distribution of web search queries [Pound et al. 2010]

- Entity ("1978 cj5 jeep"): 41%
- Type ("doctors in barcelona"): 12%
- Attribute ("zip code waterville Maine"): 5%
- Relation ("tom cruise katie holmes"): 1%
- Other ("nightlife in Barcelona"): 36%
- Uninterpretable: 6%
Distribution of web search queries [Lin et al. 2011]

- Query categories: Entity, Entity+refiner, Category, Category+refiner, Other, Website
- Shares (pie chart): 28%, 29%, 15%, 14%, 10%, 4%
What’s so special here?
- Entities are not always directly represented
  - Recognize and disambiguate entities in text (that is, entity linking)
  - Collect and aggregate information about a given entity from multiple documents and even multiple data collections
- More structure than in document-based IR
  - Types (from some taxonomy)
  - Attributes (from some ontology)
  - Relationships to other entities (“typed links”)
Semantics in our context
- Working definition: references to meaningful structures
- How to capture, represent, and use structure?
- It concerns all components of the retrieval process!

(diagram: an information need is matched against an entity; in a text-only representation both the query and the entity are plain text, while in a text+structure representation both sides carry structure in addition to text)
Overview of core tasks

Task                       Queries                             Data set                      Results
(adhoc) entity retrieval   keyword                             unstructured/semistructured   ranked list
adhoc object retrieval     keyword                             structured                    ranked list
list completion            keyword++ (examples)                (semi)structured              ranked list
related entity finding     keyword++ (target type, relation)   unstructured & structured     ranked list
In this part
- Input: keyword(++) query
- Output: a ranked list of entities

- Data collection: unstructured and
(semi)structured data sources (and their
combinations)

- Main RQ: How to incorporate structure into
text-based retrieval models?
Outline

1. Ranking based on entity descriptions (attributes/descriptions)
2. Incorporating entity types
3. Entity relationships

Ranking entity descriptions
Task: ad-hoc entity retrieval
- Input: unconstrained natural language query

- “telegraphic” queries (neither well-formed nor
grammatically correct sentences or questions)

- Output: ranked list of entities

- Collection: unstructured and/or semistructured documents
Example information needs
american embassy nairobi
ben franklin
Chernobyl
meg ryan war
Worst actor century
Sweden Iceland currency
Two settings

1. With ready-made entity descriptions
2. Without explicit entity representations

(diagram: in the first setting each entity e comes with its own description document; in the second, entities are only mentioned inside ordinary documents)
Ranking with ready-made
entity descriptions
This is not unrealistic...
Document-based entity
representations
- Most entities have a “home page”

- I.e., each entity is described by a document

- In this scenario, ranking entities is much like
ranking documents

- unstructured
- semi-structured
Evaluation initiatives
- INEX Entity Ranking track (2007-09)

- Collection is the (English) Wikipedia
- Entities are represented by Wikipedia articles

- Semantic Search Challenge (2010-11)

- Collection is a Semantic Web crawl (BTC2009)
- ~1 billion RDF triples

- Entities are represented by URIs

- INEX Linked Data track (2012-13)

- Wikipedia enriched with RDF properties from
DBpedia and YAGO
Standard Language Modeling approach
- Rank documents d according to their likelihood of being relevant given a query q: P(d|q)

P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)

- Query likelihood P(q|d): probability that query q was “produced” by document d
- Document prior P(d): probability of the document being relevant to any query

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
Standard Language Modeling approach (2)

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
- n(t,q): number of times t appears in q
- P(t|θ_d): document language model, a multinomial probability distribution over the vocabulary of terms

P(t|θ_d) = (1-λ) P(t|d) + λ P(t|C)
- λ: smoothing parameter
- Empirical document model (maximum-likelihood estimate): P(t|d) = n(t,d) / |d|
- Collection model (maximum-likelihood estimate): P(t|C) = Σ_d n(t,d) / Σ_d |d|
Here, documents == entities, so

P(e|q) ∝ P(e) P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)}
- Entity prior P(e): probability of the entity being relevant to any query
- Entity language model P(t|θ_e): multinomial probability distribution over the vocabulary of terms
Semi-structured entity representation
- Entity description documents are rarely unstructured
- Representing entities as
  - Fielded documents: the IR approach
  - Graphs: the DB/SW approach

Example (dbpedia:Audi_A4):
- foaf:name: Audi A4
- rdfs:label: Audi A4
- rdfs:comment: “The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]”
- dbpprop:production: 1994, 2001, 2005, 2008
- rdf:type: dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
- dbpedia-owl:manufacturer: dbpedia:Audi
- dbpedia-owl:class: dbpedia:Compact_executive_car
- owl:sameAs: freebase:Audi A4
- is dbpedia-owl:predecessor of: dbpedia:Audi_A5
- is dbpprop:similar of: dbpedia:Cadillac_BLS
Mixture of Language Models [Ogilvie & Callan 2003]
- Build a separate language model for each field
- Take a linear combination of them

P(t|θ_d) = Σ_{j=1}^m μ_j P(t|θ_{d_j})
- Field language model P(t|θ_{d_j}): smoothed with a collection model built from all document representations of the same type in the collection
- Field weights μ_j, with Σ_{j=1}^m μ_j = 1
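A minimal sketch of MLM scoring, assuming documents arrive as a dict mapping field name to token list; the helper names and the smoothing weight are made up for illustration:

```python
def field_lm(field_tokens, coll_field_tokens, t, lam=0.1):
    """P(t|theta_dj): field language model, smoothed with a collection model
    built from all representations of the same field type."""
    p_ml = field_tokens.count(t) / max(len(field_tokens), 1)    # ML estimate
    p_coll = coll_field_tokens.count(t) / max(len(coll_field_tokens), 1)
    return (1 - lam) * p_ml + lam * p_coll

def mlm_term_prob(t, doc_fields, coll_fields, weights):
    """P(t|theta_d) = sum_j mu_j * P(t|theta_dj), with the mu_j summing to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[f] * field_lm(doc_fields[f], coll_fields[f], t)
               for f in weights)
```

Shifting weight toward the field a term actually occurs in raises its probability, which is exactly the knob the field weights expose.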
Comparison of models

(diagram: three document representations; the unstructured document model pools all terms of d into one distribution, the fielded document model keeps one term distribution per field, and the hierarchical document model groups fields under field types)
Setting field weights
- Heuristically
  - Proportional to the length of text content in that field, to the field’s individual performance, etc.
- Empirically (using training queries)
- Problems
  - Number of possible fields is huge
  - It is not possible to optimise their weights directly
  - Entities are sparse w.r.t. different fields
    - Most entities have only a handful of predicates
Predicate folding
- Idea: reduce the number of fields by grouping them together
- Grouping based on (in BM25F-style models)
  - type [Pérez-Agüera et al. 2010]
  - manually determined importance [Blanco et al. 2011]
Hierarchical Entity Model
[Neumayer et al. 2012]

- Organize fields into a 2-level hierarchy

- Field types (4) on the top level
- Individual fields of that type on the bottom level

- Estimate field weights

- Using training data for field types
- Using heuristics for the individual bottom-level fields
Two-level hierarchy [Neumayer et al. 2012]

(example, dbpedia:Audi_A4)
- Name: foaf:name, rdfs:label
- Attributes: rdfs:comment, dbpprop:production
- Out-relations: rdf:type, dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs
- In-relations: is dbpedia-owl:predecessor of, is dbpprop:similar of
Comparison of models (recap: unstructured vs. fielded vs. hierarchical document models)
Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009]
- Extension to the Mixture of Language Models
- Find which document field each query term may be associated with
- The fixed field weights μ_j in P(t|θ_d) = Σ_{j=1}^m μ_j P(t|θ_{d_j}) are replaced by a mapping probability, estimated for each query term:

P(t|θ_d) = Σ_{j=1}^m P(d_j|t) P(t|θ_{d_j})
Estimating the mapping probability

P(d_j|t) = P(t|d_j) P(d_j) / P(t) = P(t|d_j) P(d_j) / Σ_{d_k} P(t|d_k) P(d_k)
- Term likelihood P(t|d_j): probability of a query term occurring in a given field type, estimated from collection statistics: P(t|C_j) = Σ_d n(t,d_j) / Σ_d |d_j|
- Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
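The mapping probability can be estimated as sketched below; the `coll_fields` layout (field type mapped to the concatenated collection tokens of that field) and the uniform prior are illustrative assumptions:

```python
def mapping_probability(t, coll_fields, field_priors=None):
    """P(d_j|t) proportional to P(t|C_j) * P(d_j): which field type a query
    term most likely belongs to, from per-field collection statistics."""
    fields = list(coll_fields)
    if field_priors is None:
        field_priors = {f: 1.0 / len(fields) for f in fields}  # uniform prior
    likelihood = {f: coll_fields[f].count(t) / max(len(coll_fields[f]), 1)
                  for f in fields}                             # P(t|C_j)
    z = sum(likelihood[f] * field_priors[f] for f in fields)   # P(t)
    if z == 0.0:
        return dict(field_priors)  # term unseen anywhere: fall back to prior
    return {f: likelihood[f] * field_priors[f] / z for f in fields}
```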
Example: query “meg ryan war”, per-term mapping probabilities P(d_j|t)

  meg:   cast 0.407, team 0.382, title 0.187
  ryan:  cast 0.601, team 0.381, title 0.017
  war:   genre 0.927, title 0.07, location 0.002
Ranking without explicit
entity representations
Scenario

- Entity descriptions are not readily available

- Entity occurrences are annotated

- manually
- automatically (~entity linking)
TREC Enterprise track
- Expert finding task (2005-08)

- Enterprise setting (intranet of a large organization)
- Given a query, return people who are experts on the
query topic
- List of potential experts is provided

- We assume that the collection has been
annotated with <person>...</person> tokens
The basic idea

Use documents to go from queries to entities

(diagram: query q is matched against documents; documents in turn mention entities e)
- Query-document association: the document’s relevance
- Document-entity association: how well the document characterises the entity
Two principal approaches
- Profile-based methods

- Create a textual profile for entities, then rank them
(by adapting document retrieval techniques)

- Document-based methods

- Indirect representation based on mentions identified
in documents
- First ranking documents (or snippets) and then
aggregating evidence for associated entities
Profile-based methods

(diagram: mentions of each entity e are collected from across the documents into a textual profile per entity, which is then matched against the query q)
Document-based methods

(diagram: documents are first ranked with respect to the query q; evidence from the entity mentions in the top-ranked documents is then aggregated per entity)
Many possibilities in terms of
modeling
- Generative (probabilistic) models

- Discriminative (probabilistic) models

- Voting models

- Graph-based models
Generative probabilistic
models
- Candidate generation models (P(e|q))

- Two-stage language model

- Topic generation models (P(q|e))

- Candidate model, a.k.a. Model 1
- Document model, a.k.a. Model 2
- Proximity-based variations

- Both families of models can be derived from the
Probability Ranking Principle [Fang & Zhai 2007]
Candidate models (“Model 1”) [Balog et al. 2006]

P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)}

- Smoothing with a collection-wide background model: P(t|θ_e) = (1-λ) P(t|e) + λ P(t)
- Term-candidate co-occurrence: P(t|e) = Σ_d P(t|d,e) P(d|e)
  - P(t|d,e): co-occurrence in a particular document; in the simplest case P(t|d)
  - P(d|e): document-entity association
Document models (“Model 2”) [Balog et al. 2006]

P(q|e) = Σ_d P(q|d,e) P(d|e)
- Document relevance P(q|d,e) = ∏_{t∈q} P(t|d,e)^{n(t,q)}: how well document d supports the claim that e is relevant to q
- Document-entity association P(d|e)
- Simplifying assumption (t and e are conditionally independent given d): P(t|d,e) = P(t|θ_d)
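Under the simplifying assumption, Model 2 can be sketched as follows, assuming pre-computed smoothed document language models and document-entity association strengths (all names hypothetical):

```python
def model2_score(query_terms, doc_lms, assoc):
    """'Model 2': P(q|e) = sum_d P(q|d) * P(d|e), with
    P(q|d) = prod_{t in q} P(t|theta_d), using P(t|d,e) ~ P(t|theta_d).
    doc_lms: doc_id -> {term: smoothed probability}
    assoc:   doc_id -> P(d|e) for the entity being scored."""
    score = 0.0
    for d, p_de in assoc.items():
        p_qd = 1.0
        for t in query_terms:
            p_qd *= doc_lms[d].get(t, 1e-9)  # small floor for unseen terms
        score += p_qd * p_de
    return score
```

Each associated document votes with its own query likelihood, weighted by how strongly it characterises the entity.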
Document-entity associations

- Boolean (or set-based) approach

- Weighted by the confidence in entity linking

- Consider other entities mentioned in the
document
Proximity-based variations
- So far, we assumed conditional independence between candidates and terms when computing the probability P(t|d,e)
- The relationship between terms and entities that appear in the same document is ignored
  - The entity is equally strongly associated with everything discussed in that document
- Let’s capture the dependence between entities and terms
  - Use their distance in the document
Using proximity kernels [Petkova & Croft 2007]

P(t|d,e) = (1/Z) Σ_{i=1}^N δ_d(i,t) k(i,e)
- Z: normalizing constant
- δ_d(i,t): indicator function, 1 if the term at position i is t, 0 otherwise
- k(i,e): proximity-based kernel over the distance between position i and the entity mention, e.g.
  - constant function
  - triangle kernel
  - Gaussian kernel
  - step function

Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
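A sketch in this spirit using a Gaussian kernel; the kernel width `sigma`, the nearest-mention simplification, and all names are assumptions, not the paper’s exact estimator:

```python
import math

def proximity_term_prob(doc_tokens, entity_positions, t, sigma=20.0):
    """Kernel-based P(t|d,e): each occurrence of a term is weighted by a
    Gaussian kernel of its distance to the nearest entity mention, then
    normalized over the vocabulary (the 1/Z factor)."""
    def k(i):  # Gaussian proximity kernel to the closest entity mention
        dist = min(abs(i - p) for p in entity_positions)
        return math.exp(-dist ** 2 / (2 * sigma ** 2))

    weights = {}
    for i, tok in enumerate(doc_tokens):
        weights[tok] = weights.get(tok, 0.0) + k(i)
    z = sum(weights.values())
    return weights.get(t, 0.0) / z
```

Terms near the entity mention receive more probability mass than equally frequent terms far away, which is exactly the dependence the independence assumption threw away.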
Many possibilities in terms of
modeling
- Generative probabilistic models

- Discriminative probabilistic models

- Voting models

- Graph-based models
Discriminative models
- Vs. generative models:

- Fewer assumptions (e.g., term independence)
- “Let the data speak”
- Sufficient amounts of training data required

- Incorporating more document features, multiple
signals for document-entity associations
- Estimating P(r=1|e,q) directly (instead of P(e,q|r=1))
- Optimization can get trapped in a local maximum/
minimum
Arithmetic Mean Discriminative (AMD) model [Yang et al. 2010]

P_θ(r=1|e,q) = Σ_d P(r_1=1|q,d) P(r_2=1|e,d) P(d)
- Query-document relevance P(r_1=1|q,d): standard logistic function over a linear combination of features, σ(Σ_{i=1}^{N_f} α_i f_i(q,d))
- Document-entity relevance P(r_2=1|e,d): σ(Σ_{j=1}^{N_g} β_j g_j(e,d))
- α_i, β_j: weight parameters (learned); f_i, g_j: features
- P(d): document prior
Learning to rank & entity retrieval
- Pointwise

- AMD, GMD [Yang et al. 2010]
- Multilayer perceptrons, logistic regression [Sorg &
Cimiano 2011]
- Additive Groves [Moreira et al. 2011]

- Pairwise

- Ranking SVM [Yang et al. 2009]
- RankBoost, RankNet [Moreira et al. 2011]

- Listwise

- AdaRank, Coordinate Ascent [Moreira et al. 2011]
Voting models

[Macdonald & Ounis 2006]

- Inspired by techniques from data fusion

- Combining evidence from different sources

- Documents ranked w.r.t. the query are seen as
“votes” for the entity
Voting models

Many different variants, including...
- Votes
  - Number of retrieved documents mentioning the entity
  - Score(e,q) = |M(e) ∩ R(q)|
- Reciprocal Rank
  - Sum of inverse ranks of those documents
  - Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} 1 / rank(d,q)
- CombSUM
  - Sum of retrieval scores of those documents
  - Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} s(d,q)
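The three variants are straightforward to sketch, assuming the ranking arrives as a score-sorted list of (doc_id, score) pairs and M(e) as a set of document ids (names are illustrative):

```python
def votes(entity_docs, ranking):
    """Votes: number of retrieved documents mentioning the entity."""
    retrieved = {d for d, _ in ranking}
    return len(entity_docs & retrieved)

def reciprocal_rank(entity_docs, ranking):
    """RR: sum of inverse ranks (1-based) of retrieved documents
    mentioning the entity."""
    return sum(1.0 / (rank + 1)
               for rank, (d, _) in enumerate(ranking) if d in entity_docs)

def comb_sum(entity_docs, ranking):
    """CombSUM: sum of retrieval scores of those documents."""
    return sum(score for d, score in ranking if d in entity_docs)
```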
Graph-based models
[Serdyukov et al. 2008]

- One particular way of constructing graphs

- Vertices are documents and entities
- Only document-entity edges

- Search can be approached as a random walk
on this graph

- Pick a random document or entity
- Follow links to entities or other documents
- Repeat it a number of times
Infinite random walk [Serdyukov et al. 2008]

P_i(d) = λ P_J(d) + (1-λ) Σ_{e→d} P(d|e) P_{i-1}(e)
P_i(e) = Σ_{d→e} P(e|d) P_{i-1}(d)
P_J(d) = P(d|q)
- P_J(d): probability of jumping to a query-relevant document
- λ: probability of jumping rather than following a document-entity edge
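A power-iteration sketch of such a walk on a bipartite document-entity graph, with uniform conditional edge probabilities P(e|d) and P(d|e) as a simplifying assumption (all names hypothetical):

```python
def random_walk(doc_query_rel, doc_entity, lam=0.5, iters=50):
    """Alternate between documents and entities: with probability lam the
    walker jumps to a query-relevant document (P_J(d) = P(d|q)); otherwise
    it follows document<->entity edges with uniform probabilities."""
    docs = {d for d, _ in doc_entity}
    ents = {e for _, e in doc_entity}
    out_d = {d: [e for dd, e in doc_entity if dd == d] for d in docs}
    out_e = {e: [d for d, ee in doc_entity if ee == e] for e in ents}
    p_d = dict(doc_query_rel)
    p_e = {e: 0.0 for e in ents}
    for _ in range(iters):
        # P_i(e) = sum_{d->e} P(e|d) P_{i-1}(d)
        p_e = {e: sum(p_d[d] / len(out_d[d]) for d in out_e[e]) for e in ents}
        # P_i(d) = lam * P_J(d) + (1-lam) * sum_{e->d} P(d|e) P_{i-1}(e)
        p_d = {d: lam * doc_query_rel[d]
               + (1 - lam) * sum(p_e[e] / len(out_e[e]) for e in out_d[d])
               for d in docs}
    return p_e
```

Entities reachable from query-relevant documents accumulate probability mass over the iterations.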
Incorporating entity types
For a handful of types, grouping results by entity type is a viable solution

But what about very many types, which are typically hierarchically organized?
Challenges
- Users are not familiar with the type system

- (Often) user input is to be treated as a hint, not as a
strict filter

- Type system is imperfect

- Inconsistencies
- Missing assignments
- Granularity issues
- Entities labeled with too general or too specific types

- In general, categorizing things can be hard 

- E.g. is King Arthur “British royalty”, “fictional character”,
or “military person”?
Two settings

- Target type(s) are provided by the user

- keyword++ query

- Target types need to be automatically identified

- keyword query
Target type(s) are provided

faceted search, form fill-in, etc.
INEX Entity Ranking track
- Entities are represented by Wikipedia articles
- Topic definition includes target categories
- Example topic: “Movies with eight or more Academy Awards”; target categories: best picture oscar, british films, american films
Using target type information
- Constraining results
  - Soft/hard filtering
  - Different ways to measure type similarity
    - Set-based
    - Content-based
    - Lexical similarity of type labels
    - Distance based on the hierarchy
- Query expansion
  - Adding terms from type names to the query
- Entity expansion
  - Types added as a separate metadata field
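Soft filtering can be sketched as score interpolation rather than discarding non-matching entities; Jaccard overlap is just one of the set-based similarity options above, and every name below is hypothetical:

```python
def soft_type_filter(entity_scores, entity_types, target_types, alpha=0.5):
    """Interpolate the retrieval score with a type-match score (soft filter);
    a hard filter would instead drop entities whose types do not match."""
    reranked = {}
    for e, score in entity_scores.items():
        etypes = entity_types.get(e, set())
        # set-based type similarity (Jaccard overlap with the target types)
        overlap = len(etypes & target_types) / max(len(etypes | target_types), 1)
        reranked[e] = (1 - alpha) * score + alpha * overlap
    return reranked
```

With alpha = 1 this degenerates into ranking by type match alone; with alpha = 0 the type hint is ignored.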
Modeling terms and categories [Balog et al. 2011]

P(e|q) ∝ P(q|e) P(e)
P(q|e) = (1-λ) P(θ_q^T | θ_e^T) + λ P(θ_q^C | θ_e^C)

- Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T || θ_e^T)
- Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C || θ_e^C)
Advantages
- Transparent combination of term-based and
category-based information

- Sound modeling of uncertainty associated with
category information

- Category-based feedback is possible
(analogously to the term-based case)
Expanding target types

- Pseudo relevance feedback

- Based on hierarchical structure

- Using lexical similarity of type labels
Two settings

- Target type(s) are provided by the user

- keyword++ query

- Target types need to be automatically identified

- keyword query
Identifying target types for
queries
- Types of top ranked entities [Vallet & Zaragoza
2008]

- Types can be ranked much like entities [Balog &
Neumayer 2012]

- Direct term-based vs. indirect entity-based
representations (“Model 1 vs. Model 2”)
- Hierarchical case is difficult
Joint type detection and entity
ranking [Sawant & Chakrabarti 2013]
- Assumes “telegraphic” queries with target type

- woodrow wilson president university
- dolly clone institute
- lead singer led zeppelin band

- Type detection is integrated into the ranking

- Multiple query interpretations are considered

- Both generative and discriminative
formulations
Approach
- Each query term is either a “type hint” (h(q⃗,z⃗)) or a “word matcher” (s(q⃗,z⃗))
  - Number of possible partitions is manageable (at most 2^|q|)

Example: “losing baseball team world series 1998”
- Type: Major league baseball teams; entity (instanceOf): San Diego Padres
- Evidence snippet (mentionOf): “By comparison, the Padres have been to two World Series, losing in 1984 and 1998.”
Generative approach
Generate query from entity

(diagram: entity E = San Diego Padres with type T = Major league baseball team; a switch variable Z assigns each query term either to the type model, yielding the type hints “baseball, team”, or to the entity context model, yielding the context matchers “losing, 1998, world series”)

Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13. (see presentation)
Generative formulation

P(e|q) ∝ P(e) Σ_{t,z⃗} P(t|e) P(z⃗) P(h(q⃗,z⃗)|t) P(s(q⃗,z⃗)|e)
- P(e): entity prior
- P(t|e): type prior, estimated from answer types in the past
- P(z⃗): query switch, probability of the interpretation
- P(h(q⃗,z⃗)|t): type model, probability of observing the hint terms in the type model
- P(s(q⃗,z⃗)|e): entity model, probability of observing the matcher terms in the entity model
Discriminative approach
Separate correct and incorrect entities

(diagram: for q = “losing team baseball world series 1998”, candidate interpretations pair an entity with a type, e.g. San_Diego_Padres with t = baseball team vs. 1998_World_Series with t = series; training separates correct from incorrect entity-interpretation pairs)

Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13. (see presentation)
Discriminative formulation

φ(q,e,t,z⃗) = ⟨φ_1(q,e), φ_2(t,e), φ_3(q,z⃗,t), φ_4(q,z⃗,e)⟩
- φ_1(q,e): models the entity prior P(e)
- φ_2(t,e): models the type prior P(t|e)
- φ_3(q,z⃗,t): compatibility between hint words and the type
- φ_4(q,z⃗,e): compatibility between matchers and snippets that mention e
Entity relationships
Related entities
TREC Entity track
- Related Entity Finding task

- Given

- Input entity (defined by name and homepage)
- Type of the target entity (PER/ORG/LOC)
- Narrative (describing the nature of the relation in free
text)

- Return (homepages of) related entities
Example information needs
- “airlines that currently use Boeing 747 planes” (target type: ORG; input entity: Boeing 747)
- “Members of The Beaux Arts Trio” (target type: PER; input entity: The Beaux Arts Trio)
- “What countries does Eurail operate in?” (target type: LOC; input entity: Eurail)
A typical pipeline

Input (entity, target type, relation)
  → Candidate entities (retrieving docs/snippets, query expansion, ...)
  → Ranked list of entities (type filtering, deduplication, exploiting lists, ...)
  → Entity homepages (heuristic rules, learning, ...)
Modeling related entity finding [Bron et al. 2010]
- Three-component model

p(e|E,T,R) ∝ p(e|E) · p(T|e) · p(R|E,e)
- p(e|E): co-occurrence model
- p(T|e): type filtering
- p(R|E,e): context model
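The three components can be instantiated with deliberately simple stand-in estimators (co-occurrence counts, a hard type filter, term overlap for the context model); everything below is an illustrative assumption, not the paper’s estimators:

```python
def related_entity_score(E_docs, e_docs, e_type, target_type,
                         relation_terms, context_terms):
    """Sketch of p(e|E,T,R) ~ p(e|E) * p(T|e) * p(R|E,e):
    co-occurrence of e with the input entity E, a hard target-type filter,
    and overlap between the relation narrative and co-occurrence contexts."""
    co_docs = E_docs & e_docs
    p_cooc = len(co_docs) / max(len(E_docs), 1)       # p(e|E)
    p_type = 1.0 if e_type == target_type else 0.0    # p(T|e), hard filter
    hits = sum(1 for t in relation_terms if t in context_terms)
    p_context = hits / max(len(relation_terms), 1)    # p(R|E,e)
    return p_cooc * p_type * p_context
```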
Wrapping up
- Increasingly, discriminative approaches are favored over generative ones
  - Increasing number of components (and parameters)
  - Easier to incrementally add informative but correlated features
  - But massive amounts of training data are required!
Future challenges
- It’s “easy” when the “query intent” is known
  - Desired results: single entity, ranked list, set, ...
  - Query type: ad-hoc, list search, related entity finding, ...
  - Methods specifically tailored to specific types of requests
- Understanding query intent still has a long way to go
  • 3. Distribution of web search queries [Pound et al. 2010] 6% 41% 36% 1% 5% 12% Entity (“1978 cj5 jeep”) Type (“doctors in barcelona”) Attribute (“zip code waterville Maine”) Relation (“tom cruise katie holmes”) Other (“nightlife in Barcelona”) Uninterpretable
  • 4. Distribution of web search queries [Lin et al. 2011] 28% 29% 15% 14% 10% 4% Entity Entity+refiner Category Category+refiner Other Website
  • 5. What’s so special here? - Entities are not always directly represented - Recognize and disambiguate entities in text (that is, entity linking) - Collect and aggregate information about a given entity from multiple documents and even multiple data collections - More structure than in document-based IR - Types (from some taxonomy) - Attributes (from some ontology) - Relationships to other entities (“typed links”)
  • 6. Semantics in our context - working definition: references to meaningful structures - How to capture, represent, and use structure? - It concerns all components of the retrieval process! [figure: text-only vs. text+structure representations of info need, matching, and entity]
  • 7. Overview of core tasks
    Task | Queries | Data set | Results
    (adhoc) entity retrieval | keyword | unstructured/semistructured | ranked list
    adhoc object retrieval | keyword | structured | ranked list
    list completion | keyword++ (examples) | (semi)structured | ranked list
    related entity finding | keyword+++ (target type, relation) | unstructured & structured | ranked list
  • 8. In this part - Input: keyword(++) query - Output: a ranked list of entities - Data collection: unstructured and (semi)structured data sources (and their combinations) - Main RQ: How to incorporate structure into text-based retrieval models?
  • 9. Outline 1. Ranking based on entity descriptions 2. Incorporating entity types 3. Entity relationships
  • 11. Task: ad-hoc entity retrieval - Input: unconstrained natural language query - “telegraphic” queries (neither well-formed nor grammatically correct sentences or questions) - Output: ranked list of entities - Collection: unstructured and/or semistructured documents
  • 12. Example information needs american embassy nairobi ben franklin Chernobyl meg ryan war Worst actor century Sweden Iceland currency
  • 13. Two settings 1. With ready-made entity descriptions [figure: each entity e is described by its own document] 2. Without explicit entity representations [figure: entity mentions scattered across multiple documents]
  • 15. This is not unrealistic...
  • 16. Document-based entity representations - Most entities have a “home page” - I.e., each entity is described by a document - In this scenario, ranking entities is much like ranking documents - unstructured - semi-structured
  • 17. Evaluation initiatives - INEX Entity Ranking track (2007-09) - Collection is the (English) Wikipedia - Entities are represented by Wikipedia articles - Semantic Search Challenge (2010-11) - Collection is a Semantic Web crawl (BTC2009) - ~1 billion RDF triples - Entities are represented by URIs - INEX Linked Data track (2012-13) - Wikipedia enriched with RDF properties from DBpedia and YAGO
  • 18. Standard Language Modeling approach - Rank documents d according to their likelihood of being relevant given a query q: P(d|q) = P(q|d)P(d) / P(q) ∝ P(q|d)P(d) - Query likelihood P(q|d): probability that query q was “produced” by document d - Document prior P(d): probability of the document being relevant to any query
  • 19. Standard Language Modeling approach (2) - Query likelihood: P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}, where n(t,q) is the number of times t appears in q - Document language model θ_d: multinomial probability distribution over the vocabulary of terms, smoothed with a collection model: P(t|θ_d) = (1−λ)P(t|d) + λP(t|C), with smoothing parameter λ - Maximum likelihood estimates: empirical document model P(t|d) = n(t,d)/|d|; collection model P(t|C) = Σ_d n(t,d) / Σ_d |d|
  • 20. Here, documents==entities, so P(e|q) ∝ P(e)P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)} - Entity prior P(e): probability of the entity being relevant to any query - Entity language model θ_e: multinomial probability distribution over the vocabulary of terms
  • 21. Semi-structured entity representation - Entity description documents are rarely unstructured - Representing entities as - Fielded documents – the IR approach - Graphs – the DB/SW approach
  • 22. Example: the entity dbpedia:Audi_A4 - foaf:name, rdfs:label: “Audi A4” - rdfs:comment: “The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]” - dbpprop:production: 1994, 2001, 2005, 2008 - rdf:type: dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile - dbpedia-owl:manufacturer: dbpedia:Audi - dbpedia-owl:class: dbpedia:Compact_executive_car - owl:sameAs: freebase:Audi A4 - is dbpedia-owl:predecessor of: dbpedia:Audi_A5 - is dbpprop:similar of: dbpedia:Cadillac_BLS
  • 23. Mixture of Language Models [Ogilvie & Callan 2003] - Build a separate language model for each field - Take a linear combination of them: P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j}) - Field weights sum to one: Σ_{j=1}^{m} μ_j = 1 - Each field language model P(t|θ_{d_j}) is smoothed with a collection model built from all document representations of the same type in the collection
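The field mixture can be sketched as follows (a minimal illustration with hypothetical field names; the per-field probabilities are assumed to be already smoothed):

```python
def mlm_term_prob(term, field_models, weights):
    """Mixture of Language Models: P(t|theta_d) = sum_j mu_j * P(t|theta_{d_j}).
    field_models: field -> {term: smoothed probability P(t|theta_{d_j})}
    weights:      field -> mu_j, required to sum to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(mu * field_models[f].get(term, 0.0) for f, mu in weights.items())
```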
  • 24. Comparison of models [figure: unstructured document model (d → t), fielded document model (d → fields → t), hierarchical document model (d → field types → fields → t)]
  • 25. Setting field weights - Heuristically - Proportional to the length of text content in that field, to the field’s individual performance, etc. - Empirically (using training queries) - Problems - Number of possible fields is huge - It is not possible to optimise their weights directly - Entities are sparse w.r.t. different fields - Most entities have only a handful of predicates
  • 26. Predicate folding - Idea: reduce the number of fields by grouping them together - Grouping (with BM25F) based on - type [Pérez-Agüera et al. 2010] - manually determined importance [Blanco et al. 2011]
  • 27. Hierarchical Entity Model [Neumayer et al. 2012] - Organize fields into a 2-level hierarchy - Field types (4) on the top level - Individual fields of that type on the bottom level - Estimate field weights - Using training data for field types - Using heuristics for bottom-level fields
  • 28. Two-level hierarchy [Neumayer et al. 2012] (example: dbpedia:Audi_A4) - Name: foaf:name, rdfs:label - Attributes: rdfs:comment, dbpprop:production, rdf:type - Out-relations: dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs - In-relations: is dbpedia-owl:predecessor of, is dbpprop:similar of
  • 29. Comparison of models [figure: unstructured document model (d → t), fielded document model (d → fields → t), hierarchical document model (d → field types → fields → t)]
  • 30. Probabilistic Retrieval Model for Semistructured data [Kim et al. 2009] - Extension to the Mixture of Language Models P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j}) - Find which document field each query term may be associated with: the static field weights are replaced by a mapping probability, estimated for each query term: P(t|θ_d) = Σ_{j=1}^{m} P(d_j|t) P(t|θ_{d_j})
  • 31. Estimating the mapping probability P(d_j|t) = P(t|C_j)P(d_j) / P(t), with P(t) = Σ_{d_k} P(t|C_k)P(d_k) - Term likelihood P(t|C_j) = Σ_d n(t,d_j) / Σ_d |d_j|: probability of a query term occurring in a given field type - Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
  • 32. Example: “meg ryan war”; mapping probabilities P(d_j|t) per field d_j
    meg: cast 0.407, team 0.382, title 0.187
    ryan: cast 0.601, team 0.381, title 0.017
    war: genre 0.927, title 0.070, location 0.002
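The mapping probability and the resulting term probability can be sketched as follows (toy numbers, not the actual collection statistics; `field_lms` holds the field-type collection models P(t|C_j)):

```python
def mapping_prob(term, field_lms, field_priors):
    """PRMS mapping: P(d_j|t) = P(t|C_j) P(d_j) / sum_k P(t|C_k) P(d_k).
    field_lms:    field -> {term: P(t|C_j)}  (collection-level field models)
    field_priors: field -> P(d_j)            (prior field probabilities)"""
    joint = {f: field_lms[f].get(term, 0.0) * field_priors[f] for f in field_priors}
    z = sum(joint.values())
    return {f: (v / z if z > 0 else 0.0) for f, v in joint.items()}

def prms_term_prob(term, field_models, field_lms, field_priors):
    """P(t|theta_d) = sum_j P(d_j|t) * P(t|theta_{d_j})."""
    mp = mapping_prob(term, field_lms, field_priors)
    return sum(mp[f] * field_models[f].get(term, 0.0) for f in mp)
```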
  • 34. Scenario - Entity descriptions are not readily available - Entity occurrences are annotated - manually - automatically (~entity linking)
  • 35. TREC Enterprise track - Expert finding task (2005-08) - Enterprise setting (intranet of a large organization) - Given a query, return people who are experts on the query topic - List of potential experts is provided - We assume that the collection has been annotated with <person>...</person> tokens
  • 36. The basic idea: use documents to go from queries to entities (q → d → e) - Query–document association: the document’s relevance - Document–entity association: how well the document characterises the entity
  • 37. Two principal approaches - Profile-based methods - Create a textual profile for entities, then rank them (by adapting document retrieval techniques) - Document-based methods - Indirect representation based on mentions identified in documents - First ranking documents (or snippets) and then aggregating evidence for associated entities
  • 38. Profile-based methods [figure: the documents mentioning each entity e are aggregated into a textual profile, which is then matched against the query q]
  • 39. Document-based methods [figure: documents are ranked for the query q, and the entities e mentioned in the top-ranked documents accumulate evidence]
  • 40. Many possibilities in terms of modeling - Generative (probabilistic) models - Discriminative (probabilistic) models - Voting models - Graph-based models
  • 41. Generative probabilistic models - Candidate generation models (P(e|q)) - Two-stage language model - Topic generation models (P(q|e)) - Candidate model, a.k.a. Model 1 - Document model, a.k.a. Model 2 - Proximity-based variations - Both families of models can be derived from the Probability Ranking Principle [Fang & Zhai 2007]
  • 42. Candidate models (“Model 1”) [Balog et al. 2006] P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)} - Smoothing with a collection-wide background model: P(t|θ_e) = (1−λ)P(t|e) + λP(t) - P(t|e) = Σ_d P(t|d,e)P(d|e), where P(t|d,e) is the term–candidate co-occurrence in a particular document (in the simplest case P(t|d)) and P(d|e) is the document–entity association
  • 43. Document models (“Model 2”) [Balog et al. 2006] P(q|e) = Σ_d P(q|d,e)P(d|e) - Document relevance P(q|d,e) = ∏_{t∈q} P(t|d,e)^{n(t,q)}: how well document d supports the claim that e is relevant to q - Document–entity association P(d|e) - Simplifying assumption (t and e are conditionally independent given d): P(t|d,e) = P(t|θ_d)
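Under that simplifying assumption, Model 2 reduces to a weighted sum of document scores; a minimal sketch (document query-likelihoods and associations are assumed precomputed, with made-up identifiers):

```python
def model2_score(p_q_given_d, assoc, entity):
    """Document model ("Model 2"): P(q|e) = sum_d P(q|theta_d) * P(d|e).
    p_q_given_d: doc -> query likelihood P(q|theta_d)
    assoc:       entity -> {doc: P(d|e)}  (document-entity associations)"""
    return sum(p_q_given_d.get(d, 0.0) * w
               for d, w in assoc.get(entity, {}).items())
```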
  • 44. Document-entity associations - Boolean (or set-based) approach - Weighted by the confidence in entity linking - Consider other entities mentioned in the document
  • 45. Proximity-based variations - So far, conditional independence assumption between candidates and terms when computing the probability P(t|d,e) - The relationship between terms and entities that occur in the same document is ignored - The entity is equally strongly associated with everything discussed in that document - Let’s capture the dependence between entities and terms - Use their distance in the document
  • 46. Using proximity kernels [Petkova & Croft 2007] P(t|d,e) = (1/Z) Σ_{i=1}^{N} δ_d(i,t) k(i,e) - Normalizing constant Z - Indicator function δ_d(i,t): 1 if the term at position i is t, 0 otherwise - Proximity-based kernel k: constant function, triangle kernel, Gaussian kernel, step function
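As a toy illustration of the kernel idea (here a Gaussian kernel with a single entity mention position; the paper's estimator sums over all mentions, so this is a simplified sketch):

```python
import math

def gaussian_kernel(i, entity_pos, sigma=50.0):
    """k(i, e): weight of position i by its distance to the entity mention."""
    return math.exp(-((i - entity_pos) ** 2) / (2 * sigma ** 2))

def proximity_term_prob(doc_terms, term, entity_pos, sigma=50.0):
    """P(t|d,e) = (1/Z) * sum_i delta_d(i,t) * k(i,e),
    where Z sums the kernel over all positions in the document."""
    z = sum(gaussian_kernel(i, entity_pos, sigma) for i in range(len(doc_terms)))
    mass = sum(gaussian_kernel(i, entity_pos, sigma)
               for i, w in enumerate(doc_terms) if w == term)
    return mass / z if z else 0.0
```

Terms close to the entity mention thus contribute more probability mass than distant ones.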
  • 47. Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
  • 48. Many possibilities in terms of modeling - Generative probabilistic models - Discriminative probabilistic models - Voting models - Graph-based models
  • 49. Discriminative models - Vs. generative models: - Fewer assumptions (e.g., term independence) - “Let the data speak” - Sufficient amounts of training data required - Incorporating more document features, multiple signals for document-entity associations - Estimating P(r=1|e,q) directly (instead of P(e,q|r=1)) - Optimization can get trapped in a local maximum/minimum
  • 50. Arithmetic Mean Discriminative (AMD) model [Yang et al. 2010] P_θ(r=1|e,q) = Σ_d P(r_1=1|q,d) P(r_2=1|e,d) P(d) - Query–document relevance: standard logistic function over a linear combination of features, P(r_1=1|q,d) = σ(Σ_{i=1}^{N_f} α_i f_i(q,d)) - Document–entity relevance: P(r_2=1|e,d) = σ(Σ_{j=1}^{N_g} β_j g_j(e,d)) - Weight parameters α, β are learned; P(d) is a document prior
  • 51. Learning to rank && entity retrieval - Pointwise - AMD, GMD [Yang et al. 2010] - Multilayer perceptrons, logistic regression [Sorg & Cimiano 2011] - Additive Groves [Moreira et al. 2011] - Pairwise - Ranking SVM [Yang et al. 2009] - RankBoost, RankNet [Moreira et al. 2011] - Listwise - AdaRank, Coordinate Ascent [Moreira et al. 2011]
  • 52. Voting models [Macdonald & Ounis 2006] - Inspired by techniques from data fusion - Combining evidence from different sources - Documents ranked w.r.t. the query are seen as “votes” for the entity
  • 53. Voting models Many different variants, including... - Votes: number of documents mentioning the entity, Score(e,q) = |M(e) ∩ R(q)| - Reciprocal Rank: sum of inverse ranks of documents, Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} 1/rank(d,q) - CombSUM: sum of scores of documents, Score(e,q) = Σ_{d ∈ M(e) ∩ R(q)} s(d,q) - where M(e) is the set of documents mentioning e and R(q) the document ranking for q
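The three variants can be sketched in one small function (hypothetical inputs; `ranking` plays the role of R(q) and `mentions` induces M(e)):

```python
def vote_scores(ranking, mentions, variant="votes"):
    """Aggregate document 'votes' into entity scores.
    ranking:  list of (doc, retrieval_score) pairs, best first (R(q)).
    mentions: doc -> set of entities mentioned in it.
    variant:  'votes', 'rr' (reciprocal rank), or 'combsum'."""
    scores = {}
    for rank, (doc, s) in enumerate(ranking, start=1):
        for e in mentions.get(doc, ()):
            if variant == "votes":      # counts |M(e) ∩ R(q)|
                contrib = 1.0
            elif variant == "rr":       # sums 1/rank(d,q)
                contrib = 1.0 / rank
            else:                       # CombSUM: sums s(d,q)
                contrib = s
            scores[e] = scores.get(e, 0.0) + contrib
    return scores
```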
  • 54. Graph-based models [Serdyukov et al. 2008] - One particular way of constructing graphs - Vertices are documents and entities - Only document-entity edges - Search can be approached as a random walk on this graph - Pick a random document or entity - Follow links to entities or other documents - Repeat it a number of times
  • 55. Infinite random walk [Serdyukov et al. 2008] P_i(e) = Σ_{d→e} P(e|d) P_{i−1}(d); P_i(d) = λ P_J(d) + (1−λ) Σ_{e→d} P(d|e) P_{i−1}(e); jump probability P_J(d) = P(d|q)
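A sketch of the iterated walk on a toy graph (the transition probabilities and jump distribution `p_dq` = P(d|q) are assumed given; document and entity names are made up):

```python
def infinite_random_walk(p_dq, p_ed, p_de, lam=0.2, iters=100):
    """P_i(e) = sum_d P(e|d) P_{i-1}(d)
       P_i(d) = lam * P_J(d) + (1-lam) * sum_e P(d|e) P_{i-1}(e), P_J(d) = P(d|q).
    p_dq: doc -> P(d|q); p_ed: doc -> {entity: P(e|d)}; p_de: entity -> {doc: P(d|e)}."""
    pd = dict(p_dq)   # start the walk from the query-based jump distribution
    pe = {}
    for _ in range(iters):
        pe = {}
        for d, dist in p_ed.items():
            for e, p in dist.items():
                pe[e] = pe.get(e, 0.0) + p * pd.get(d, 0.0)
        pd = {d: lam * p_dq[d] + (1 - lam) * sum(p_de.get(e, {}).get(d, 0.0) * pe.get(e, 0.0)
                                                 for e in pe)
              for d in p_dq}
    return pe
```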
  • 57. For a handful of types grouping results by entity type is a viable solution
  • 58. For a handful of types grouping results by entity type is a viable solution
  • 59. But what about very many types? (which are typically hierarchically organized)
  • 60. Challenges - Users are not familiar with the type system - (Often) user input is to be treated as a hint, not as a strict filter - Type system is imperfect - Inconsistencies - Missing assignments - Granularity issues - Entities labeled with too general or too specific types - In general, categorizing things can be hard - E.g. is King Arthur “British royalty”, “fictional character”, or “military person”?
  • 61. Two settings - Target type(s) are provided by the user - keyword++ query - Target types need to be automatically identified - keyword query
  • 62. Target type(s) are provided: faceted search, form fill-in, etc.
  • 63. INEX Entity Ranking track - Entities are represented by Wikipedia articles - Topic definition includes target categories - Example query: “Movies with eight or more Academy Awards”; categories: best picture oscar, british films, american films
  • 65. Using target type information - Constraining results - Soft/hard filtering - Different ways to measure type similarity: set-based, content-based, lexical similarity of type labels, distance based on the hierarchy - Query expansion - Adding terms from type names to the query - Entity expansion - Types added as a separate metadata field
  • 66. Modeling terms and categories [Balog et al. 2011] P(e|q) ∝ P(q|e)P(e), with P(q|e) = (1−λ)P(θ_q^T|θ_e^T) + λP(θ_q^C|θ_e^C) - Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T||θ_e^T) - Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C||θ_e^C)
  • 67. Advantages - Transparent combination of term-based and category-based information - Sound modeling of uncertainty associated with category information - Category-based feedback is possible (analogously to the term-based case)
  • 68. Expanding target types - Pseudo relevance feedback - Based on hierarchical structure - Using lexical similarity of type labels
  • 69. Two settings - Target type(s) are provided by the user - keyword++ query - Target types need to be automatically identified - keyword query
  • 70. Identifying target types for queries - Types of top ranked entities [Vallet & Zaragoza 2008] - Types can be ranked much like entities [Balog & Neumayer 2012] - Direct term-based vs. indirect entity-based representations (“Model 1 vs. Model 2”) - Hierarchical case is difficult
  • 71. Joint type detection and entity ranking [Sawant & Chakrabarti 2013] - Assumes “telegraphic” queries with target type - woodrow wilson president university - dolly clone institute - lead singer led zeppelin band - Type detection is integrated into the ranking - Multiple query interpretations are considered - Both generative and discriminative formulations
  • 72. Approach - Each query term is either a “type hint” (h(q⃗,z⃗)) or a “word matcher” (s(q⃗,z⃗)) - Number of possible partitions is manageable (2^|q|) - Example: “losing baseball team world series 1998” - Type: Major league baseball teams - instanceOf - Entity: San Diego Padres - mentionOf - Evidence snippet: “By comparison, the Padres have been to two World Series, losing in 1984 and 1998.”
  • 73. Generative approach: generate the query from the entity - Entity E: San Diego Padres; type T: Major league baseball team - The type model generates the type hints (“baseball”, “team”); the entity context model generates the context matchers (“lost”, “1998”, “world series”); a switch variable z decides the role of each term of q: “losing team baseball world series 1998” - Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13.
  • 74. Generative formulation P(e|q) ∝ P(e) Σ_{t,z⃗} P(t|e) P(z⃗) P(h(q⃗,z⃗)|t) P(s(q⃗,z⃗)|e) - Entity prior P(e) - Type prior P(t|e): probability of the answer type given the entity - Query switch P(z⃗): estimated from query interpretations in the past - Type model P(h(q⃗,z⃗)|t): probability of observing the hint words in the type model - Entity model P(s(q⃗,z⃗)|e): probability of observing the matcher words in the entity model
  • 75. Discriminative approach: separate correct and incorrect entities - q: “losing team baseball world series 1998” - Candidate interpretations pair an entity with a type, e.g. San_Diego_Padres (t = baseball team) vs. 1998_World_Series (t = series) - Figure taken from Sawant & Chakrabarti (2013). Learning Joint Query Interpretation and Response Ranking. In WWW ’13.
  • 76. Discriminative formulation φ(q,e,t,z⃗) = ⟨φ_1(q,e), φ_2(t,e), φ_3(q,z⃗,t), φ_4(q,z⃗,e)⟩ - φ_1 models the entity prior P(e) - φ_2 models the type prior P(t|e) - φ_3: compatibility between hint words and the type - φ_4: compatibility between matchers and snippets that mention e
  • 81. TREC Entity track - Related Entity Finding task - Given - Input entity (defined by name and homepage) - Type of the target entity (PER/ORG/LOC) - Narrative (describing the nature of the relation in free text) - Return (homepages of) related entities
  • 82. Example information needs - “airlines that currently use Boeing 747 planes” (ORG; input entity: Boeing 747) - “Members of The Beaux Arts Trio” (PER; input entity: The Beaux Arts Trio) - “What countries does Eurail operate in?” (LOC; input entity: Eurail)
  • 83. A typical pipeline Input (entity, target type, relation) → Candidate entities (retrieving docs/snippets, query expansion, ...) → Ranked list of entities (type filtering, deduplication, exploiting lists, ...) → Entity homepages (heuristic rules, learning, ...)
  • 84. Modeling related entity finding [Bron et al. 2010] - Three-component model p(e|E,T,R) ∝ p(e|E) · p(T|e) · p(R|E,e) - Co-occurrence model p(e|E), type filtering p(T|e), context model p(R|E,e)
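A minimal sketch of ranking with the three-component model (the component probabilities are assumed precomputed; the candidate entities are made up):

```python
def rank_related_entities(candidates):
    """candidates: entity -> (p_e_given_E, p_T_given_e, p_R_given_Ee).
    Scores each entity by the product of the co-occurrence, type-filter,
    and context components: p(e|E,T,R) ∝ p(e|E) * p(T|e) * p(R|E,e)."""
    scored = {e: p1 * p2 * p3 for e, (p1, p2, p3) in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Note how a hard type filter falls out naturally: an entity with p(T|e) = 0 gets score zero regardless of the other components.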
  • 85. Wrapping up - Increasingly more discriminative approaches over generative ones - Increasing number of components (and parameters) - Easier to incrementally add informative but correlated features - But, (massive amounts of) training data is required!
  • 86. Future challenges - It’s “easy” when the “query intent” is known - Desired results: single entity, ranked list, set, … - Query type: ad-hoc, list search, related entity finding, … - Methods specifically tailored to specific types of requests - Understanding query intent still has a long way to go