Lecture: Summarization

Semantic Analysis in Language Technology

http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
 
Summarization
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016

Previous Lecture: Relation Extraction

What’s a relation?

•  A relation can be formally defined in the form of a tuple
    •  t = (e1, e2, …, en)
    •  where the ei are entities in a predefined relation r within document D.
•  Most relation extraction systems focus on extracting binary relations.
•  Examples of binary relations include
    •  located-in(CMU, Pittsburgh),
    •  father-of(Manuel Blum, Avrim Blum).
•  It is also possible to go to higher-order relations as well and extract more complex relations (e.g. in biomedicine).
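As a small illustration of the tuple view, a relation instance can be held as a relation name plus an ordered tuple of entities. The `Relation` class and the `arity` helper are names made up for this sketch; they do not come from any relation extraction toolkit.

```python
from typing import NamedTuple, Tuple


class Relation(NamedTuple):
    """A relation instance t = (e1, ..., en) under a predefined relation name."""
    name: str
    entities: Tuple[str, ...]


def arity(relation: Relation) -> int:
    # Number of entity slots; 2 means a binary relation.
    return len(relation.entities)


located_in = Relation("located-in", ("CMU", "Pittsburgh"))
father_of = Relation("father-of", ("Manuel Blum", "Avrim Blum"))
print(arity(located_in), arity(father_of))
```

A higher-order relation would simply carry more than two entities in the tuple.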

Why Relation Extraction?

•  There exists a vast amount of unstructured electronic text on the Web, including newswire, blogs, emails, governmental documents, chats, and so on.
•  The whole idea of IE is to turn unstructured text into structured text by annotating semantic information.
•  RE is the task of recognizing relations between entities in unstructured text.

If a query to a search engine is “When was Gandhi born?”, then the expected answer would be “Gandhi was born in 1869”. The template of the answer is <PERSON> born-in <YEAR>, which is nothing but the relational triple:
born-in(PERSON, YEAR)
where PERSON and YEAR are the entities.

Watch out!

•  RE = extract facts from unstructured texts, i.e. relations that exist between entities, such as dates, proper names, companies.
•  Other relations (related to word senses): semantic relations between concepts: hyperonyms, hyponyms, etc., as in WordNet.

How to build relation extractors

1.  Hand-written patterns
2.  Supervised machine learning
3.  Semi-supervised and unsupervised
    •  Bootstrapping (using seeds)
    •  Distant supervision
    •  Unsupervised learning from the web

Seed-based or bootstrapping approaches to relation extraction

•  No training set? Maybe you have:
    •  A few seed tuples, or
    •  A few high-precision patterns
•  Can you use those seeds to do something useful?
•  Bootstrapping: use the seeds to directly learn to populate a relation

Roughly said: use seeds to initialize a process of annotation, then refine through iterations

DIPRE: Extract <author, book> pairs

•  Start with 5 seeds:

Author               Book
Isaac Asimov         The Robots of Dawn
David Brin           Startide Rising
James Gleick         Chaos: Making a New Science
Charles Dickens      Great Expectations
William Shakespeare  The Comedy of Errors

•  Find instances:
    The Comedy of Errors, by William Shakespeare, was
    The Comedy of Errors, by William Shakespeare, is
    The Comedy of Errors, one of William Shakespeare's earliest attempts
    The Comedy of Errors, one of William Shakespeare's most
•  Extract patterns (group by middle, take longest common prefix/suffix)
    ?x , by ?y ,
    ?x , one of ?y 's
•  Now iterate, finding new seeds that match the pattern

Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
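The pattern extraction step ("group by middle, take longest common prefix/suffix") can be sketched roughly as follows. This is a simplified illustration and not Brin's actual DIPRE implementation: the function name `induce_patterns` is invented here, it only handles the book appearing before the author, and it ignores the prefix context because the book title opens every example line.

```python
import os
from collections import defaultdict


def induce_patterns(seeds, sentences):
    """Map each 'middle' string to the common start of the text after the author.

    Simplifying assumptions: the book precedes the author, and the prefix
    context (text before the book) is ignored.
    """
    suffixes_by_middle = defaultdict(list)
    for author, book in seeds:
        for sentence in sentences:
            b = sentence.find(book)
            if b == -1:
                continue
            a = sentence.find(author, b + len(book))
            if a == -1:
                continue
            middle = sentence[b + len(book):a]
            suffixes_by_middle[middle].append(sentence[a + len(author):])
    # os.path.commonprefix compares the strings character by character.
    return {m: os.path.commonprefix(s) for m, s in suffixes_by_middle.items()}


seeds = [("William Shakespeare", "The Comedy of Errors")]
sentences = [
    "The Comedy of Errors, by William Shakespeare, was",
    "The Comedy of Errors, by William Shakespeare, is",
    "The Comedy of Errors, one of William Shakespeare's earliest attempts",
    "The Comedy of Errors, one of William Shakespeare's most",
]
patterns = induce_patterns(seeds, sentences)
for middle, suffix in patterns.items():
    print(f"?x{middle}?y{suffix}")
```

On the four instances above this yields the two patterns shown on the slide, up to whitespace. A full bootstrapping loop would then match these patterns against new text to harvest further seed pairs.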

Practical Activity

Search for phrasal patterns on the web

Our seeds:
"* is a novel by *"
"* wrote the novel *"
"the novel * was written by *"
optionally add more phrases…

Further refinements that we felt are needed:
•  get rid of non-informative text included in the returned strings (maybe via adding additional patterns in the regular expressions)
•  Identify named entities
    •  Maybe via regular expressions (e.g. identify words starting with uppercase)
    •  Maybe combining seeds and a NER system
•  etc.
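One way the seed phrases and the uppercase heuristic might be combined is sketched below. The `NAME` expression and the `extract_pairs` function are assumptions made for this illustration; in particular, treating a run of capitalized words as a named entity is crude and misses titles that contain lowercase words such as "of".

```python
import re

# Crude named-entity guess: one or more capitalized words (an assumption).
NAME = r"[A-Z][\w'-]*(?: [A-Z][\w'-]*)*"

# The three seed phrases, with each * replaced by a named capture group.
SEED_PATTERNS = [
    re.compile(rf"(?P<book>{NAME}) is a novel by (?P<author>{NAME})"),
    re.compile(rf"(?P<author>{NAME}) wrote the novel (?P<book>{NAME})"),
    re.compile(rf"the novel (?P<book>{NAME}) was written by (?P<author>{NAME})"),
]


def extract_pairs(text):
    """Return (author, book) pairs matched by any seed pattern."""
    pairs = []
    for pattern in SEED_PATTERNS:
        for m in pattern.finditer(text):
            pairs.append((m.group("author"), m.group("book")))
    return pairs


print(extract_pairs("Great Expectations is a novel by Charles Dickens, first published in 1861."))
print(extract_pairs("David Brin wrote the novel Startide Rising."))
```

Because the capture groups stop at the first non-capitalized word, trailing text such as ", first published in 1861" is left out of the extracted strings.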

Google is fantastic, but also unpredictable… → different behaviours depending on the machines, domains, and some “hidden” criteria…

End of previous lecture

Acknowledgements

Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Some inspiration from Dragomir Radev, Coursera ….
J&M (2009)

Book Summaries

Cliff’s Notes are a series of student study guides available primarily in the United States.

Search Engine Snippets

Types of Summaries

Human Summarization and Abstracting

Extractive Summarization

Question Answering
Summarization in Question Answering
Text Summarization

•  Goal: produce an abridged version of a text that contains information that is important or relevant to a user.
•  Summarization Applications
    •  outlines or abstracts of any document, article, etc.
    •  summaries of email threads
    •  action items from a meeting
    •  simplifying text by compressing sentences

What to summarize? Single vs. multiple documents

•  Single-document summarization
    •  Given a single document, produce
        •  abstract
        •  outline
        •  headline
•  Multiple-document summarization
    •  Given a group of documents, produce a gist of the content:
        •  a series of news stories on the same event
        •  a set of web pages about some topic or question

Query-focused Summarization & Generic Summarization

•  Generic summarization:
    •  Summarize the content of a document
•  Query-focused summarization:
    •  summarize a document with respect to an information need expressed in a user query.
    •  a kind of complex question answering:
        •  Answer a question by summarizing a document that has the information to construct the answer

Summarization for Question Answering: Snippets

•  Create snippets summarizing a web page for a query
    •  Google: 156 characters (about 26 words) plus title and link
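A toy version of snippet creation could look like the sketch below. It is not how Google builds snippets: the `snippet` function, the sentence splitter, and the overlap-count scoring are all assumptions for illustration. Only the 156-character budget comes from the slide.

```python
import re
import textwrap


def snippet(page_text, query, width=156):
    """Pick the sentence sharing most words with the query, cut to `width` chars."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Naive sentence split after ., ! or ? followed by whitespace (assumption).
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    best = max(
        sentences,
        key=lambda s: len(query_terms & set(re.findall(r"\w+", s.lower()))),
    )
    # textwrap.shorten truncates at a word boundary and appends the placeholder.
    return textwrap.shorten(best, width=width, placeholder="...")


page = ("Welcome to our garden blog. Water spinach is a semi-aquatic tropical "
        "plant grown as a vegetable. Contact us for more.")
print(snippet(page, "What is water spinach?"))
```

The selected sentence here is shorter than the budget, so it is returned unchanged; longer sentences are truncated at a word boundary.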

Summarization for Question Answering: Multiple documents

Create answers to complex questions summarizing multiple documents.
•  Instead of giving a snippet for each document
•  Create a cohesive answer that combines information from each document

Extractive summarization & Abstractive summarization

•  Extractive summarization:
    •  create the summary from phrases or sentences in the source document(s)
•  Abstractive summarization:
    •  express the ideas in the source documents using (at least in part) different words

Simple baseline: take the first sentence
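The baseline is short enough to write out in full. In this sketch the sentence splitter is a naive regular expression chosen for illustration, and `first_sentence_baseline` is a made-up name.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace (assumption).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def first_sentence_baseline(text):
    """Return the first sentence of the document as its summary."""
    sentences = split_sentences(text)
    return sentences[0] if sentences else ""


doc = "Water spinach is a leaf vegetable. It is widely grown in Asia. It is not related to spinach."
print(first_sentence_baseline(doc))
```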

Question Answering
Generating Snippets and other Single-Document Answers

Snippets: query-focused summaries

Summarization: Three Stages

1.  content selection: choose sentences to extract from the document
2.  information ordering: choose an order to place them in the summary
3.  sentence realization: clean up the sentences

[Figure: single-document summarization pipeline. Document → Content Selection (Sentence Segmentation → All sentences from documents → Sentence Extraction) → Extracted sentences → Information Ordering → Sentence Realization (Sentence Simplification) → Summary]

Basic Summarization Algorithm

1.  content selection: choose sentences to extract from the document
2.  information ordering: just use document order
3.  sentence realization: keep original sentences
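The three steps of the basic algorithm can be sketched as a single function that takes the sentence scorer as a parameter. The `summarize` name, the regex splitter, and the word-count scorer in the usage example are placeholders invented for this sketch; any content selection score could be plugged in instead.

```python
import re


def summarize(text, score, k=2):
    """Basic extractive summarizer: select k sentences, keep document order."""
    # Naive sentence segmentation (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # 1. Content selection: indices of the k best-scoring sentences.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # 2. Information ordering: just use document order.
    chosen = sorted(ranked[:k])
    # 3. Sentence realization: keep the original sentences.
    return " ".join(sentences[i] for i in chosen)


doc = ("Short one. This is a much longer sentence about summarization. "
       "Tiny. Another fairly long sentence appears here.")
# Placeholder scorer: number of words in the sentence.
print(summarize(doc, score=lambda s: len(s.split()), k=2))
```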

Unsupervised content selection

•  Intuition dating back to Luhn (1958):
    •  Choose sentences that have salient or informative words
•  Two approaches to defining salient words
    1.  tf-idf: weigh each word w_i in document j by tf-idf

$$\mathrm{weight}(w_i) = \mathrm{tf}_{ij} \times \mathrm{idf}_i$$

    2.  topic signature: choose a smaller set of salient words
        •  mutual information
        •  log-likelihood ratio (LLR), Dunning (1993), Lin and Hovy (2000)

$$\mathrm{weight}(w_i) = \begin{cases} 1 & \text{if } -2\log\lambda(w_i) > 10 \\ 0 & \text{otherwise} \end{cases}$$

H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development. 2:2, 159-165.
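Both weighting schemes can be sketched in a few lines. Several choices here are assumptions, since the slide does not fix them: idf is taken as log(N / df), tf as the raw count, and -2 log λ is computed with the binomial likelihood ratio in the style of Dunning (1993), comparing a word's rate in the input text (k1 occurrences out of n1 tokens) against a background corpus (k2 out of n2). All function names are invented for this sketch.

```python
import math
from collections import Counter


def tfidf_weights(docs, j):
    """tf-idf weight of each word in document j; docs is a list of token lists."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    tf = Counter(docs[j])                              # raw term frequency
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, so boundary probabilities do not blow up.
    return 0.0 if x == 0 else x * math.log(y)


def _log_likelihood(p, k, n):
    # log of the binomial kernel p^k * (1 - p)^(n - k)
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)


def minus_2_log_lambda(k1, n1, k2, n2):
    """-2 log lambda for a word seen k1/n1 times in the input, k2/n2 in background."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_log_likelihood(p1, k1, n1) + _log_likelihood(p2, k2, n2)
                - _log_likelihood(p, k1, n1) - _log_likelihood(p, k2, n2))


def signature_weight(k1, n1, k2, n2, threshold=10):
    # 1 if the word passes the LLR threshold from the slide, else 0.
    return 1 if minus_2_log_lambda(k1, n1, k2, n2) > threshold else 0


docs = [["water", "spinach", "is", "a", "vegetable"],
        ["spinach", "is", "green"],
        ["water", "is", "wet"]]
weights = tfidf_weights(docs, 0)
print(sorted(weights, key=weights.get, reverse=True))
print(signature_weight(10, 100, 10, 10000))
```

A word that occurs in every document ("is") gets tf-idf weight 0, and a word that is a hundred times more frequent in the input than in the background clears the threshold of 10 easily.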

Topic signature-based content selection with queries

•  choose words that are informative either
    •  by log-likelihood ratio (LLR)
    •  or by appearing in the query

$$\mathrm{weight}(w_i) = \begin{cases} 1 & \text{if } -2\log\lambda(w_i) > 10 \\ 1 & \text{if } w_i \in \text{question} \\ 0 & \text{otherwise} \end{cases}$$

•  Weigh a sentence (or window) by weight of its words:

$$\mathrm{weight}(s) = \frac{1}{|s|} \sum_{w \in s} \mathrm{weight}(w)$$

(could learn more complex weights)

Conroy, Schlesinger, and O’Leary 2006
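The sentence weight is just the average of the word weights, which can be sketched as below. The sketch assumes the topic signature (the words passing the LLR threshold) has already been computed and is passed in as a set; `word_weight` and `sentence_weight` are illustrative names.

```python
def word_weight(word, signature, query_terms):
    # 1 if the word is in the topic signature or in the query, else 0.
    return 1 if word in signature or word in query_terms else 0


def sentence_weight(tokens, signature, query_terms):
    """Average word weight over the tokens of one sentence."""
    if not tokens:
        return 0.0
    return sum(word_weight(w, signature, query_terms) for w in tokens) / len(tokens)


tokens = ["water", "spinach", "is", "a", "leaf", "vegetable"]
signature = {"spinach", "vegetable"}   # assumed output of the LLR test
query_terms = {"water"}
print(sentence_weight(tokens, signature, query_terms))
```

Three of the six tokens count, so the sentence weight is 0.5.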

Supervised content selection

•  Given:
    •  a labeled training set of good summaries for each document
•  Align:
    •  the sentences in the document with sentences in the summary
•  Extract features
    •  position (first sentence?)
    •  length of sentence
    •  word informativeness, cue phrases
    •  cohesion
•  Train
    •  a binary classifier (put sentence in summary? yes or no)
•  Problems:
    •  hard to get labeled training data
    •  alignment difficult
    •  performance not better than unsupervised algorithms
•  So in practice:
    •  Unsupervised content selection is more common
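The feature extraction step might look like the following sketch, which produces one feature dictionary per sentence for a binary classifier to consume. The feature names, the cue phrase list, and the salient word set are all illustrative assumptions; cohesion is left out because the slide does not say how it is measured.

```python
# Illustrative cue phrases only; a real list would be tuned to the genre.
CUE_PHRASES = ("in summary", "in conclusion")


def sentence_features(sentences, index, salient_words):
    """Features for deciding whether sentence `index` belongs in the summary."""
    text = sentences[index].lower()
    tokens = [w.strip(".,;:!?") for w in text.split()]
    salient = sum(1 for w in tokens if w in salient_words)
    return {
        "is_first": index == 0,                      # position
        "length": len(tokens),                       # sentence length in words
        "salient_ratio": salient / len(tokens) if tokens else 0.0,
        "has_cue_phrase": any(c in text for c in CUE_PHRASES),
    }


sentences = ["Water spinach is a leaf vegetable.",
             "In summary, it is eaten in Asia."]
salient_words = {"spinach", "vegetable"}
for i in range(len(sentences)):
    print(sentence_features(sentences, i, salient_words))
```

Each dictionary would be paired with a yes/no label obtained from the document-summary alignment before training.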

Question Answering
Evaluating Summaries: ROUGE

ROUGE (Recall Oriented Understudy for Gisting Evaluation)

•  Intrinsic metric for automatically evaluating summaries
    •  Based on BLEU (a metric used for machine translation)
    •  Not as good as human evaluation (“Did this answer the user’s question?”)
    •  But much more convenient
•  Given a document D, and an automatic summary X:
    1.  Have N humans produce a set of reference summaries of D
    2.  Run system, giving automatic summary X
    3.  What percentage of the bigrams from the reference summaries appear in X?

Lin and Hovy 2003

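$$\text{ROUGE-2} = \frac{\sum_{S \in \{\text{RefSummaries}\}} \sum_{\text{bigrams } i \in S} \min(\mathrm{count}(i, X), \mathrm{count}(i, S))}{\sum_{S \in \{\text{RefSummaries}\}} \sum_{\text{bigrams } i \in S} \mathrm{count}(i, S)}$$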

A ROUGE example:

Q: “What is water spinach?”

Human 1: Water spinach is a green leafy vegetable grown in the tropics.
Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable.
Human 3: Water spinach is a commonly eaten leaf vegetable of Asia.

•  System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia.
•  ROUGE-2 = (3 + 3 + 6) / (10 + 9 + 9) = 12/28 = .43
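The ROUGE-2 computation can be reproduced with a short script. The tokenization is an assumption of this sketch (lowercasing, dropping the final period, splitting on whitespace), and `rouge_2_counts` is a made-up name. Under this tokenization the second reference yields 10 bigrams rather than 9, so the sketch arrives at 12/29 (about .41) instead of the 12/28 in the worked figure; the matched counts 3, 3 and 6 agree.

```python
from collections import Counter


def bigrams(text):
    # Assumed tokenization: lowercase, drop final period, split on whitespace.
    tokens = text.lower().rstrip(".").split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_counts(system, references):
    """Return (matched, total) reference bigram counts for ROUGE-2."""
    sys_counts = bigrams(system)
    matched = total = 0
    for ref in references:
        ref_counts = bigrams(ref)
        # Clipped match: a reference bigram counts at most as often as in X.
        matched += sum(min(c, sys_counts[bg]) for bg, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched, total


references = [
    "Water spinach is a green leafy vegetable grown in the tropics.",
    "Water spinach is a semi-aquatic tropical plant grown as a vegetable.",
    "Water spinach is a commonly eaten leaf vegetable of Asia.",
]
system = "Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia."
matched, total = rouge_2_counts(system, references)
print(matched, total, round(matched / total, 2))
```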

Question Answering
Summarization for Complex Questions

Definition questions

Q: What is water spinach?

A: Water spinach (ipomoea aquatica) is a semi-aquatic leafy green plant with long hollow stems and spear- or heart-shaped leaves, widely grown throughout Asia as a leaf vegetable. The leaves and stems are often eaten stir-fried flavored with salt or in soups. Other common names include morning glory vegetable, kangkong (Malay), rau muong (Viet.), ong choi (Cant.), and kong xin cai (Mand.). It is not related to spinach, but is closely related to sweet potato and convolvulus.

Medical questions

Q: In children with an acute febrile illness, what is the efficacy of single medication therapy with acetaminophen or ibuprofen in reducing fever?

A: Ibuprofen provided greater temperature decrement and longer duration of antipyresis than acetaminophen when the two drugs were administered in approximately equal doses. (PubMedID: 1621668, Evidence Strength: A)

Demner-Fushman and Lin (2007)

Other complex questions

1.  How is compost made and used for gardening (including different types of compost, their uses, origins and benefits)?
2.  What causes train wrecks and what can be done to prevent them?
3.  Where have poachers endangered wildlife, what wildlife has been endangered and what steps have been taken to prevent poaching?
4.  What has been the human toll in death or injury of tropical storms in recent years?

Modified from the DUC 2005 competition (Hoa Trang Dang 2005)

Answering harder questions: Query-focused multi-document summarization

•  The (bottom-up) snippet method
    •  Find a set of relevant documents
    •  Extract informative sentences from the documents
    •  Order and modify the sentences into an answer
•  The (top-down) information extraction method
    •  build specific answerers for different question types:
        •  definition questions
        •  biography questions
        •  certain medical questions

Query-Focused Multi-Document Summarization

[Figure: query-focused multi-document summarization pipeline. Input Docs and Query → Content Selection (Sentence Segmentation → All sentences from documents → Sentence Simplification → All sentences plus simplified versions → Sentence Extraction: LLR, MMR) → Extracted sentences → Information Ordering → Sentence Realization → Summary]

Information Ordering

•  Chronological ordering:
    •  Order sentences by the date of the document (for summarizing news). (Barzilay, Elhadad, and McKeown 2002)
•  Coherence:
    •  Choose orderings that make neighboring sentences similar (by cosine).
    •  Choose orderings in which neighboring sentences discuss the same entity (Barzilay and Lapata 2007)
•  Topical ordering
    •  Learn the ordering of topics in the source documents
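The cosine-based coherence idea can be sketched with a greedy procedure: start from the first sentence and repeatedly append the remaining sentence most similar to the last one placed. The greedy strategy and the raw word-count vectors are choices made for this illustration only; the cited work does not prescribe them, and `greedy_coherence_order` is a made-up name.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sentences using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def greedy_coherence_order(sentences):
    """Keep the first sentence, then chain each next most similar neighbor."""
    if not sentences:
        return []
    ordered = [sentences[0]]
    remaining = list(sentences[1:])
    while remaining:
        nxt = max(remaining, key=lambda s: cosine(ordered[-1], s))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


extracted = [
    "the storm hit the coast",
    "rescue teams arrived on monday",
    "the coast was flooded by the storm",
    "teams from the rescue service stayed a week",
]
for sentence in greedy_coherence_order(extracted):
    print(sentence)
```

Here the two storm sentences end up adjacent, followed by the rescue sentences.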

Domain-specific answering: The Information Extraction method

•  a good biography of a person contains:
    •  a person’s birth/death, fame factor, education, nationality and so on
•  a good definition contains:
    •  genus or hypernym
        •  The Hajj is a type of ritual
•  a medical answer about a drug’s use contains:
    •  the problem (the medical condition),
    •  the intervention (the drug or procedure), and
    •  the outcome (the result of the study).

Information that should be in the answer for 3 kinds of questions
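In the information extraction method, each question type effectively comes with a template of slots to fill. A minimal sketch of the medical template is shown below; the `MedicalAnswer` class and its `missing_slots` method are hypothetical names, and the slot values are paraphrased from the medical question example earlier in the lecture.

```python
from dataclasses import dataclass


@dataclass
class MedicalAnswer:
    """Template for a medical answer about a drug's use."""
    problem: str       # the medical condition
    intervention: str  # the drug or procedure
    outcome: str       # the result of the study

    def missing_slots(self):
        # Slots that are still empty and need to be extracted.
        return [name for name, value in vars(self).items() if not value.strip()]


answer = MedicalAnswer(
    problem="acute febrile illness in children",
    intervention="acetaminophen or ibuprofen",
    outcome="ibuprofen gave greater temperature decrement and longer antipyresis",
)
print(answer.missing_slots())
```

Biography and definition answers could be modeled the same way, with slots such as birth/death and nationality, or genus.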
