SlideShare a Scribd company logo
FEVER Workshop
9 July 2020
Towards Explainable
Fact Checking
Isabelle Augenstein*
augenstein@di.ku.dk
@IAugenstein
http://isabelleaugenstein.github.io/
*partial slide credit: Pepa Atanasova, Dustin Wright
Overview of Today’s Talk
• Introduction
• Fact checking – where are we at?
• Open research questions
• Part 1: Generating explanations
• A first case study
• Generating explanations inspired by how humans fact-check
• Part 2: Understanding claims
• What even is a check-worthy claim?
• Automatically generating confusing claims
Full Fact Checking Pipeline
Claim Check-
Worthiness Detection
Evidence Document
Retrieval
Stance Detection /
Textual Entailment
Veracity Prediction
“Immigrants	are	a	drain	
on	the	economy”
not check-worthy
check-worthy
“Immigrants	are	a	drain	
on	the	economy”
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
true
false
not enough info“Immigrants	are	a	drain	
on	the	economy”
How is fact checking framed in NLP?
4
Veracity Prediction
true
false
neutral“Immigrants	are	a	drain	
on	the	economy”
• Most early work on detecting false information follows this setup
• Deception detection (Mihalcea & Strapparava, 2009; Feng et al., 2012,
Morris et al., 2012)
• Clickbait detection (Chen et al., 2015; Potthast et al., 2016)
• Fake news detection (Rubin et al., 2016; Volkova et al., 2017; Wang, 2017)
• Option for early detection of fake news or rumours (”zero-shot setting”)
• Prediction not evidence-based, but strictly based on linguistic cues
👍
👎
How is fact checking framed in NLP?
5
• Relatively well established tasks in NLP
• Stance Detection (Anand et al., 2011; Mohammad et al., 2016)
• Recognising Textual Entailment (Dagan et al., 2004; Bowman et al., 2015)
• … more recently also studied in the context of fact checking
• Fake News Challenge (Pomerleau & Rao, 2017)
• Option for fact checking based on one evidence document
• Central component of fact checking can be studied in isolation
• Additional methods are needed for combining predictions of multiple
evidence documents
👍
👎
Stance Detection /
Textual Entailment
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
👍
How is fact checking framed in NLP?
6
• Setup followed by more recent research, e.g. FEVER (Thorne et al., 2018) and
MultiFC (Augenstein et al., 2019)
• Includes most tasks of the fact checking pipeline
• Inter-dependencies of components makes it difficult to understand how a
prediction is derived
• The difficulty of the task means models are relatively easily fooled
👍
👎
Evidence Document
Retrieval
Stance Detection /
Textual Entailment
Veracity Prediction
“Immigrants	are	a	drain	
on	the	economy”
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
true
false
not enough info“Immigrants	are	a	drain	
on	the	economy”
👎
Research Challenges in Fact Checking
7
• Zero-shot or few-shot fact checking
• How to fact-check claims with little available evidence?
• Can we do better than the ”claim-only” setting?
• Another option: prediction based on user responses on social media (e.g. Chen et
al., 2019)
• Combining multiple evidence documents for fact checking
• Strength of evidence can differ
• Learn this as part of the model (Augenstein et al. 2019)
• Documents can play different roles, e.g. evidence, counter-evidence
• Some early work on combining one piece of evidence with one of counter-evidence
(Yin and Roth, 2018)
• Documents can address different parts of the claim
• Multi-hop fact checking (Nishida et al., 2019)
Research Challenges in Fact Checking
8
• Better understanding how a fact checking prediction is derived
• Explainability metrics or tools can be used
• Those do not necessarily highlight inter-dependencies between claim and evidence
or provide an easy human-understandable explanation
Ø Addressed in this talk
• Claim check-worthiness detection
• This is typically seen as a preprocessing task and rarely studied as part of the fact
checking pipeline
• Due to being relatively understudied, the problem is ill-defined
Ø Addressed in this talk
Research Challenges in Fact Checking
9
• Fact-checking systems are easily fooled
• ... even with completely non-sensical inputs
• Studying what types of claims can fool models can help make them more robust
(FEVER 2.0)
• However, it is very challenging to fully automatially create claims that look ”natural”
to humans
Ø Addressed in this talk
Part 1:
Generating Explanations
10
Generating Fact Checking
Explanations
Pepa Atanasova, Jakob Grue Simonsen,
Christina Lioma, Isabelle Augenstein
ACL 2020
11
A Typical Fact Checking Pipeline
Claim Check-
Worthiness Detection
Evidence Document
Retrieval
Stance Detection /
Textual Entailment
Veracity Prediction
“Immigrants	are	a	drain	
on	the	economy”
not check-worthy
check-worthy
“Immigrants	are	a	drain	
on	the	economy”
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
true
false
not enough info“Immigrants	are	a	drain	
on	the	economy”
Fact Checking
https://www.politifact.com/factchecks/2020/apr/30/donald-trump/trumps-boasts-testing-spring-cherry-picking-data/
Terminology
https://www.politifact.com/factchecks/2020/apr/30/donald-trump/trumps-boasts-testing-spring-cherry-picking-data/
Automating Fact Checking
Statement: “The last quarter, it was just
announced, our gross domestic product was below
zero. Who ever heard of this? Its never below zero.”
Speaker: Donald Trump
Context: presidential announcement speech
Label: Pants on Fire
Wang, William Yang. "“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection." Proceedings ACL’2017.
Automating Fact Checking
Claim:
We’ve tested more than every
country combined.
Metadata:
topic=healthcare, Coronavirus,
speaker=Donald Trump,
job=president,
context=White House briefing,
history={4, 10, 14, 21, 34, 14}
Hybrid CNN Veracity Label
0,204 0,208
0,247 0,274
0
0,1
0,2
0,3
Validation Test
Fact checking macro F1
score
Majority Wang et. al
Wang, William Yang. "“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection." Proceedings ACL’2017.
Automating Fact Checking
Alhindi, Tariq, Savvas Petridis, and Smaranda Muresan. "Where is your Evidence: Improving Fact-checking by Justification
Modeling." Proceedings of FEVER’2018.
Claim:
We’ve tested more than every country
combined.
Justification:
Trump claimed that the United States has
"tested more than every country combined."
There is no reasonable way to conclude that the
American system has run more diagnostics than
"all other major countries combined." Just by
adding up a few other nations’ totals, you can
quickly see Trump’s claim fall apart.
Plus, focusing on the 5 million figure distracts
from the real issue — by any meaningful metric
of diagnosing and tracking, the United States is
still well behind countries like Germany and
Canada.
The president’s claim is not only inaccurate but
also ridiculous. We rate it Pants on Fire!
Logistic Regression Veracity Label
0,204 0,208
0,247 0,274
0,370 0,370
0
0,2
0,4
Validation Test
Fact checking macro F1
score
Majority Wang et. al Alhindi et. Al
How about an explanation?
Claim:
We’ve tested more than every country
combined.
Justification:
Trump claimed that the United States has
"tested more than every country combined."
There is no reasonable way to conclude that the
American system has run more diagnostics than
"all other major countries combined." Just by
adding up a few other nations’ totals, you can
quickly see Trump’s claim fall apart.
Plus, focusing on the 5 million figure distracts
from the real issue — by any meaningful metric
of diagnosing and tracking, the United States is
still well behind countries like Germany and
Canada.
The president’s claim is not only inaccurate but
also ridiculous. We rate it Pants on Fire!
Model
Veracity Label
Explanation
Generating Explanations from Ruling Comments
Claim:
We’ve tested more than every country
combined.
Ruling Comments:
Responding to weeks of criticism over his administration’s
COVID-19 response, President Donald Trump claimed at a
White House briefing that the United States has well
surpassed other countries in testing people for the virus.
"We’ve tested more than every country combined,"
Trump said April 27 […] We emailed the White House for
comment but never heard back, so we turned to the data.
Trump’s claim didn’t stand up to scrutiny.
In raw numbers, the United States has tested more
people than any other individual country — but
nowhere near more than "every country combined" or,
as he said in his tweet, more than "all major countries
combined.”[…] The United States has a far bigger
population than many of the "major countries" Trump often
mentions. So it could have run far more tests but still
have a much larger burden ahead than do nations like
Germany, France or Canada.[…]
Joint Model
Veracity Label
Justification/
Explanation
Related Studies on Generating Explanations
• Camburu et. al; Rajani et. al generate abstractive
explanations
• Short input text and explanations;
• Large amount of annotated data.
• Real world fact checking datasets are of limited size and
the input consists of long documents
• We take advantage of the LIAR-PLUS dataset:
• Use the summary of the ruling comments as a gold explanation;
• Formulate the problem as extractive summarization.
• Camburu, Oana-Maria, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. "e-SNLI: Natural language inference with natural language
explanations." In Advances in Neural Information Processing Systems,. 2018.
• Rajani, Nazneen Fatema, Bryan McCann, Caiming Xiong, and Richard Socher. "Explain Yourself! Leveraging Language Models for
Commonsense Reasoning." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932-4942. 2019.
Example of an Oracle’s Gold Summary
Claim: “The president promised that if he spent money on a stimulus program that
unemployment would go to 5.7 percent or 6 percent. Those were his words.”
Label: Mostly-False
Just: Bramnick said “the president promised that if he spent money on a stimulus
program that unemployment would go to 5.7 percent or 6 percent. Those werehis
words.” Two economic advisers estimated in a 2009 report that with the stimulus
plan, the unemployment rate would peak near 8 percent before dropping to less than
6 percent by now. Those are critical details Bramnick’s statement ignores. To comment
on this ruling, go to NJ.com.
Oracle: “The president promised that if he spent money on a stimulus program that
unemployment would go to 5.7 percent or 6 percent. Those were his
words,”Bramnick said in a Sept. 7 interview on NJToday. But with the stimulus plan,
the report projected the nation’s jobless rate would peak near 8 percent in 2009
before falling to about 5.5 percent by now. So the estimates in the report were
wrong.
Which sentences should be selected for the
explanation?
What is the veracity of the claim?
Joint Explanation and Veracity Prediction
Ruder, Sebastian, et al. "Latent multi-task architecture learning." Proceedings of the AAAI’2019.
Cross-stitch layer
Multi-task objective
Fact Checking Results
0,208
0,274
0,323
0,300 0,313
0,370
0,443
0
0,05
0,1
0,15
0,2
0,25
0,3
0,35
0,4
0,45
0,5
Test
Macro F1 score on the test split
Majority Wang et. al MT Ruling Ruling Oracles Ruling Alhindi et. al Justification
28,11
6,96
24,38
43,57
22,23
39,26
35,70
13,51
31,5831,13
12,90
30,93
0,00
5,00
10,00
15,00
20,00
25,00
30,00
35,00
40,00
45,00
50,00
ROUGE-1 ROUGE-2 ROUGE-L
ROUGE scores on the test split
Lead-4 Oracle Extractive Extractive-MT
Explanation Results
*ROUGE-N: Overlap of
N-grams between the
system and reference
summaries.
*ROUGE-L : Longest
Common Subsequence
(LCS) between the
system and reference
summaries.
Manual Evaluation
• Explanation Quality
• Coverage. The explanation contains important, salient
information and does not miss any important points that
contribute to the fact check.
• Non-redundancy. The summary does not contain any
information that is redundant/repeated/not relevant to the
claim and the fact check.
• Non-contradiction. The summary does not contain any pieces
of information contradictory to the claim and the fact check.
• Overall. Rank the explanations by their overall quality.
• Explanation Informativeness. Provide a veracity label
for a claim based on a veracity explanation coming
from the justification, the Explain-MT, or the Explain-
Extractive system.
Explanation Quality
1,48 1,48 1,45
1,58
1,89
1,75
1,40
2,03
1,68
1,79
1,48
1,90
-
0,50
1,00
1,50
2,00
2,50
Coverage Non-redundancy Non-contradicton Overall
Mean Average Rank (MAR).
Lower MAR is better! (higher rank)
Justification Extractive Extractive MT
Explanation Informativeness
-
0,100
0,200
0,300
0,400
0,500
Agree-Correct Agree-Not Correct Agree-Nost
Sufficient
Disagree
Manual veracity labelling, given a particular
explanation as percentages of the dis/agreeing
annotator predictions.
Justification Extractive Extractive-MT
Summary
• First study on generating veracity explanations
• Jointly training veracity prediction and explanation
• improves the performance of the classification system
• improves the coverage and overall performance of the
generated explanations
Future Work
• Can we generate better or even abstractive
explanations given limited resources?
• How to automatically evaluate the properties of the
explanations?
• Can explanations be extracted from evidence pages
only (lots of irrelevant and multi-modal results)?
Part 2:
Understanding Claims
32
Claim Check-Worthiness Detection
as Positive Unlabelled Learning
Dustin Wright, Isabelle Augenstein
Findings of EMNLP, 2020
33
A Typical Fact Checking Pipeline
Claim Check-
Worthiness Detection
Evidence Document
Retrieval
Stance Detection /
Textual Entailment
Veracity Prediction
“Immigrants	are	a	drain	
on	the	economy”
not check-worthy
check-worthy
“Immigrants	are	a	drain	
on	the	economy”
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
true
false
not enough info“Immigrants	are	a	drain	
on	the	economy”
What is Check-Worthiness?
35
- A check-worthy statement is ”an assertion about the world
that is checkable” (Konstantinovskiy et al., 2018)
- However, there is more to this:
- The statement should be factual
- Typically, only sentences unlikely to be believed without
verification are marked as check-worthy
- Coupled with claim importance, which is subjective
- Domain interests skew what is selected as check-worthy (e.g.
only certain political claims)
What is Check-Worthiness?
36
- Many different types of claims (Francis and FullFact,
2016)
- numerical claims involving numerical properties about entities
and comparisons among them
- entity and event properties such as professional qualifications
and event participants
- position statements such as whether a political entity supported
a certain policy
- quote verification assessing whether the claim states precisely
the source of a quote, its content, and the event at which it
supposedly occurred.
Wikipedia Citation Needed Detection
Twitter Rumour Detection
Political Check-Worthiness Detection
Given a statement from Wikipedia, determine if the statement needs a
citation in order to be accepted as true
37
Determine if the content of a tweet was verified as true or false at the
time of posting
Given statements from a political speech or debate, rank them by how
check-worthy they are
What is Check-Worthiness?
What is Check-Worthiness?
38
Baseline Approach for Claim Check-
Worthiness Detection: Binary Classification
39
𝑔 ∶ 𝑥	 → (0, 1)
Reviewers described the
book as “magistral”,
“encyclopaedic” and a
“classic”.
l = 1
As was pointed out
above, Lenten
traditions have
developed over time.
l = 0
A Unified Approach
• Utilise existing published datasets on check-
worthiness detection and related tasks
• Use Positive Unlabelled (PU) learning to overcome
issues with subjective annotation
• Transfer learning from Wikipedia citation needed
detection
• Evaluate how well each dataset reflects a general
definition of check-worthiness
40
Solution: Positive Unlabelled Learning
41
• Assumptions:
• Not all ‘true’ instances are labelled as such
• Labelled samples are selected at random from an
underlying distribution
• If a sample is labelled, then the label cannot be ‘false’
• Intuition:
• Train another model, 𝑓, first, which estimates the
confidence in a sample being ‘true’
• Automatically label unlabelled samples using 𝑓 to train
model 𝑔
A Traditional Classifier
42
𝑔 ∶ 𝑥	 → (0, 1)
Reviewers described the
book as “magistral”,
“encyclopaedic” and a
“classic”.
l = 1
As was pointed out
above, Lenten
traditions have
developed over time.
l = 0
Positive-Negative Classifier
All three of his albums
appeared on Rolling Stone
magazine's list of The 500
Greatest Albums of All
Time in 2003
l = 0
True negative?
Positive Unlabelled Learning
43
𝑓 ∶ 𝑥	 → (0, 1)
s = 0
s = 0
s = 1
Positive-Unlabelled
Classifier
1.0
0.0
0.05
0.95
0.8
0.2
𝑤 𝑥
Positive Unlabelled Learning
44
𝑓
s = 0
s = 0
s = 1
1.0
0.0
0.05
0.95
0.8
0.2
𝑔
l = 1
l = 0
l = 1
l = 0
l = 1
l = 0
𝑤 𝑥
Charles Elkan and Keith Noto. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of KKD.
Positive Unlabelled Learning + Conversion
45
𝑓
s = 0
s = 0
s = 1
1.0
0.0
0.05
0.95
0.8
0.2
𝑔
l = 1
l = 0
l = 1
l = 0
l = 1
l = 0
1.0
Use estimate of 𝑝(𝑦 = 1)
𝑤 𝑥
Cross-Domain Check-Worthiness Detection
46
𝑓0 𝑔0
𝑔1 𝑔2
All models: BERT base
uncased pretrained
Research Questions
• To what degree is check-worthiness detection a PU
learning problem, and does this enable a unified
approach to check-worthiness detection?
• How well do statements in each dataset reflect the
definition: ”makes an assertion about the world that is
checkable?”
47
Results: Citation Needed Detection
48
• Dataset:
• ’Citation needed’ detection (Redi et al., 2019)
• 10k positive, 10k negative for training (featured articles)
• 10k positive, 10k negative for testing (flagged statements)
0,68 0,7 0,72 0,74 0,76 0,78 0,8 0,82 0,84
Redi et al. 2019
BERT
BERT + PU
BERT + PUC
Macro F1
Results: Twitter
49
• Dataset:
• Pheme Rumour Detection (Zubiaga et al., 2017)
• 5802 tweets from 5 events
• 5-fold cross-validation, event hold-out
0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7
BiLSTM
Zubiaga et al. 2017
BERT
BERT + Wiki
BERT + Wiki + PU
BERT + Wiki + PUC
Micro-F1
Results: Political Speeches
50
• Dataset:
• Claims from political speeches (Hansen et al., 2018)
• Consists of 4 political speeches from 2018 Clef CheckThat!
Competition & 3 speeches from ClaimRank
• 2602 statements overall
• 7-fold cross-validation, speech hold-out
0,28 0,29 0,3 0,31 0,32 0,33 0,34 0,35
Hansen et al. 2019
BERT
BERT + Wiki
BERT + Wiki + PU
BERT + Wiki + PUC
Mean Average
Precision
Error Analysis: How Check-Worthy are Check-
Worthy Statements?
51
• Method:
• Get top 100 positive predictions by PUC model on each
dataset
• Re-annotate those manually (2 annotators)
• Compare against ’gold’ annotations
0 0,2 0,4 0,6 0,8 1
Wikipedia
Twitter
Politics
Dataset Analysis
• Wikipedia and Twitter equally well reflect a general
definition of check-worthiness
• Political speeches have high false negative rate
• Also documented in previous literature (Konstantinovskiy et al.
2018)
• Clear examples: these two statements appear in a single
document, only one was labelled as check-worthy
• ”It’s why our unemployment rate is the lowest it’s been in so many decades”
• ”New unemployment claims are near the lowest we’ve seen in almost half a
century”
52
Key Takeaways
• Check-worthiness annotation is highly subjective
• Because of this, automatic check-worthiness detection
is a difficult task
• Approaching the problem as PU learning shows
improvements
• Transferring from Wikipedia improves on downstream
domains when PU learning is used
53
Generating Label Cohesive and
Well-Formed Adversarial Claims
Pepa Atanasova*, Dustin Wright*,
Isabelle Augenstein
Proceedings of EMNLP, 2020
*equal contributions
54
A Typical Fact Checking Pipeline
Claim Check-
Worthiness Detection
Evidence Document
Retrieval
Stance Detection /
Textual Entailment
Veracity Prediction
“Immigrants	are	a	drain	
on	the	economy”
not check-worthy
check-worthy
“Immigrants	are	a	drain	
on	the	economy”
“Immigrants	are	a	drain	on	the	
economy”,	“EU	immigrants	have	a	
positive	impact	on	public	finances”
positive
negative
neutral
true
false
not enough info“Immigrants	are	a	drain	
on	the	economy”
Generating Adversarial Claims
• Fact checking models can overfit to spurious patterns
• Making the right predictions for the wrong reasons
• This leads to vulnerabilities, which can be exploited by adversaries
(e.g. agents spreading mis- and disinformation)
• How can one reveal such vulnerabilities?
• Generating explanations for fact checking models (first part of talk)
• Generating adversarial claims (this work)
Previous Work
• Universal adversarial attacks (Gao and Oates, 2019;
Wallace et al, 2019)
• Single perturbation changes that can be applied to many instances
• Change the meaning of the input instances and thus produce
label-incoherent claims
• Are not per se semantically well-formed
• Rule-based perturbations (Riberio et al., 2018)
• Semantically well-formed, but require hand-crafting patterns
Previous Work
• FEVER 2.0 shared task (Thorne et al., 2019)
• Builders / breakers setup
• Methods of submitted systems:
• Producing claims requiring multi-hop reasoning (Niewinski et al., 2019)
• Generating adversarial claims manually (Kim and Allan, 2019)
Goals of this Work
• Generate claims fully automatically
• Preserve the meaning of the source text
• Produce semantically well-formed claims
Model
RoBERTa-based
FEVER model to
predict FC label
1) HotFlip attack
model to find triggers
2) STS auxiliary model
to preserve FC label
Examples of Generated Claims
61
EMNLP 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Evidence Triggers Generated Claim
SUPPORTS Claims
Since the 19th century, some
Romani have also migrated
to the Americas.
don,already,more,during,home Romani have moved to the
Americas during the 19th
century.
Cyprus is a major tourist
destination in the Mediter-
ranean .
foreign,biggest,major,every,friends Cyprus is a major tourist des-
tination.
The first Nobel Prize in
Chemistry was awarded in
1901 to Jacobus Henricus
van’t Hoff, of the Nether-
lands, “for his discovery of
the laws of chemical dynam-
ics and osmotic pressure in
solutions.”
later,already,quite,altern,whereas Henricus Van’t Hoff was al-
ready awarded the Nobel
Prize.
REFUTES Claims
California Attorney General phys,incarn,not,occasionally,something Kamala Harris did not defeat
Since the 19th century, some
Romani have also migrated
to the Americas.
don,already,more,during,home Romani have moved to the
Americas during the 19th
century.
Cyprus is a major tourist
destination in the Mediter-
ranean .
foreign,biggest,major,every,friends Cyprus is a major tourist des-
tination.
The first Nobel Prize in
Chemistry was awarded in
1901 to Jacobus Henricus
van’t Hoff, of the Nether-
lands, “for his discovery of
the laws of chemical dynam-
ics and osmotic pressure in
solutions.”
later,already,quite,altern,whereas Henricus Van’t Hoff was al-
ready awarded the Nobel
Prize.
REFUTES Claims
California Attorney General
Kamala Harris defeated
Sanchez , 61.6% to 38.4%.
phys,incarn,not,occasionally,something Kamala Harris did not defeat
Sanchez, 61.6% to 38.4%.
Uganda is in the African
Great Lakes region.
unless,endorsed,picks,pref,against Uganda is against the
African Great Lakes region.
Times Higher Education
World University Rankings
is an annual publication
of university rankings by
Times Higher Education
(THE) magazine.
interested,reward,visit,consumer,conclusion Times Higher Education
World University Rankings
is a consumer magazine.
NOT ENOUGH INFO Claims
ranean .
The first Nobel Prize in
Chemistry was awarded in
1901 to Jacobus Henricus
van’t Hoff, of the Nether-
lands, “for his discovery of
the laws of chemical dynam-
ics and osmotic pressure in
solutions.”
later,already,quite,altern,whereas Henricus Van’t Hoff was al-
ready awarded the Nobel
Prize.
REFUTES Claims
California Attorney General
Kamala Harris defeated
Sanchez , 61.6% to 38.4%.
phys,incarn,not,occasionally,something Kamala Harris did not defeat
Sanchez, 61.6% to 38.4%.
Uganda is in the African
Great Lakes region.
unless,endorsed,picks,pref,against Uganda is against the
African Great Lakes region.
Times Higher Education
World University Rankings
is an annual publication
of university rankings by
Times Higher Education
(THE) magazine.
interested,reward,visit,consumer,conclusion Times Higher Education
World University Rankings
is a consumer magazine.
NOT ENOUGH INFO Claims
The KGB was a military ser-
vice and was governed by
army laws and regulations ,
similar to the Soviet Army
or MVD Internal Troops.
nowhere,only,none,no,nothing The KGB was only con-
trolled by a military service.
The first Nobel Prize in
Chemistry was awarded in
1901 to Jacobus Henricus
van’t Hoff, of the Nether-
lands, “for his discovery of
the laws of chemical dynam-
ics and osmotic pressure in
solutions.”
later,already,quite,altern,whereas Henricus Van’t Hoff was al-
ready awarded the Nobel
Prize.
REFUTES Claims
California Attorney General
Kamala Harris defeated
Sanchez , 61.6% to 38.4%.
phys,incarn,not,occasionally,something Kamala Harris did not defeat
Sanchez, 61.6% to 38.4%.
Uganda is in the African
Great Lakes region.
unless,endorsed,picks,pref,against Uganda is against the
African Great Lakes region.
Times Higher Education
World University Rankings
is an annual publication
of university rankings by
Times Higher Education
(THE) magazine.
interested,reward,visit,consumer,conclusion Times Higher Education
World University Rankings
is a consumer magazine.
NOT ENOUGH INFO Claims
The KGB was a military ser-
vice and was governed by
army laws and regulations ,
similar to the Soviet Army
or MVD Internal Troops.
nowhere,only,none,no,nothing The KGB was only con-
trolled by a military service.
The series revolves around says,said,take,say,is Take Me High is about
Results: Automatic Evaluation
• How well can we generate universal adversarial triggers?
• Reduction in F1 for both trigger generation methods
• Trade-off between how potent the attack is (reduction in F1) vs. how
semantically coherent the claim is (STS)
• Most potent attack: S->R for FC Objective (F1: 0.060)
• Adds negation words, which change meaning of claim
0
0,5
1
F1
F1 of FC model
(lower = better)
No Triggers FC Objective FC + STS Objective
4
4,5
5
5,5
STS
Semantic Textual Similarity
with original claim
(higher = better)
No Triggers FC Objective FC + STS Objective
Results: Manual Evaluation
• How well can we generate universal adversarial triggers?
• Confirms results of automatic evaluation
• Macro F1 of generated claim w.r.t. original claim: 56.6 (FC Objective);
60.7 (FC + STS Objective)
• STS Objective preserves meaning more often
0
5
Overall SUPPORTS REFUTES NEI
Semantic Textual Similarity
with original claim
(higher = better)
FC Objective FC + STS Objective
0
0,5
1
Overall SUPPORTS REFUTES NEI
F1 of FC model
(lower = better)
FC Objective FC + STS Objective
Key Take-Aways
• Novel extension to the HotFlip attack for universarial
adversarial trigger generation (Ebrahimi et al., 2018)
• Conditional language model, which takes trigger tokens
and evidence, and generates a semantically coherent claim
• Resulting model generates semantically coherent claims
containing universal triggers, which preserve the label
• Trade-off between how well-formed the claim is and how
potent the attack is
Overall Take-Aways
• Automatic fact checking is still a very involved NLP task
• Different sub-tasks: 1) claim check-worthiness detection; 2)
evidence retrieval; 3) stance detection / textual entailment, 4)
veracity prediction
• What is needed to automatically generate explanations for
fact checking models?
• Common understanding of how sub-tasks should be defined
(Part 2.1 of this talk)
• Lack thereof leads to inconsistent models and findings, limited
transferability of learned representations
Overall Take-Aways
• What is needed to automatically generate explanations for
fact checking models?
• More realistic benchmarks for FC explanation generation
(Part 1 of this talk)
• Generating explanations from ruling comments in the LIAR-PLUS dataset is
somewhat useful for researching models, but not very realistic
• The dataset is also relatively small (10k training instances)
• A more realistic setup and benchmark would inspire research of e.g. multi-
modal models, and models for better understanding evidence documents
Overall Take-Aways
• What is needed to automatically generate explanations for
fact checking models?
• More research on fooling FC models
(Part 2.2 of this talk)
• Adversarial trigger generation can reveal inconsistencies and vulnerabilities
in FC models
• Those can explain what lexical patterns a FC model struggles with
• Most research focuses on instance-level perturbations of claims, whereas
universal triggers reveal more general vulnerabilities of the model
Thank you!
isabelleaugenstein.github.io
augenstein@di.ku.dk
@IAugenstein
github.com/isabelleaugenstein
27/10/2020 68
Thanks to my PhD students and colleagues!
69
CopeNLU
https://copenlu.github.io/
Pepa Atanasova Dustin Wright Jakob Grue Simonsen Christina Lioma
Presented Papers
Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle
Augenstein. Generating Fact Checking Explanations. In Proceedings of
ACL 2020. https://arxiv.org/abs/2004.05773
Dustin Wright, Isabelle Augenstein. Claim Check-Worthiness Detection
as Positive Unlabelled Learning. In Findings of EMNLP 2020.
https://arxiv.org/abs/2003.02736
Pepa Atanasova*, Dustin Wright*, Isabelle Augenstein. Generating Label
Cohesive and Well-Formed Adversarial Claims. In Proceedings of
EMNLP, 2020. https://arxiv.org/abs/2009.08205
*equal contributions
Ad

More Related Content

What's hot (20)

Bias in AI
Bias in AIBias in AI
Bias in AI
Zeydy Ortiz, Ph. D.
 
Explainable AI (XAI)
Explainable AI (XAI)Explainable AI (XAI)
Explainable AI (XAI)
Manojkumar Parmar
 
Implementing Ethics in AI
Implementing Ethics in AIImplementing Ethics in AI
Implementing Ethics in AI
Pekka Abrahamsson / Tampere University
 
Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)
Krishnaram Kenthapadi
 
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Krishnaram Kenthapadi
 
Introduction to common sense reasoning
Introduction to common sense reasoningIntroduction to common sense reasoning
Introduction to common sense reasoning
Martin Molina
 
Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....
EADTU
 
Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)
Aung Thu Rha Hein
 
Responsible AI in Industry (ICML 2021 Tutorial)
Responsible AI in Industry (ICML 2021 Tutorial)Responsible AI in Industry (ICML 2021 Tutorial)
Responsible AI in Industry (ICML 2021 Tutorial)
Krishnaram Kenthapadi
 
STKI Israeli IT Market Study v2 August 2024.pdf
STKI Israeli IT Market Study v2 August 2024.pdfSTKI Israeli IT Market Study v2 August 2024.pdf
STKI Israeli IT Market Study v2 August 2024.pdf
Dr. Jimmy Schwarzkopf
 
Deep Learning in Healthcare
Deep Learning in HealthcareDeep Learning in Healthcare
Deep Learning in Healthcare
Dmytro Fishman
 
Identifying & Combating Misinformation w/ Fact Checking Tools
Identifying & Combating Misinformation w/ Fact Checking ToolsIdentifying & Combating Misinformation w/ Fact Checking Tools
Identifying & Combating Misinformation w/ Fact Checking Tools
Ujjwal Acharya
 
The Five Levels of Generative AI for Games
The Five Levels of Generative AI for GamesThe Five Levels of Generative AI for Games
The Five Levels of Generative AI for Games
Jon Radoff
 
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdfNeo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j
 
Fairness and Bias in Machine Learning
Fairness and Bias in Machine LearningFairness and Bias in Machine Learning
Fairness and Bias in Machine Learning
Surya Dutta
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
ryanorban
 
AI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science ConceptsAI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science Concepts
Dan O'Leary
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
Jeff Z. Pan
 
Scaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphsScaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphs
Alan Morrison
 
Fairness in Machine Learning and AI
Fairness in Machine Learning and AIFairness in Machine Learning and AI
Fairness in Machine Learning and AI
Seth Grimes
 
Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)
Krishnaram Kenthapadi
 
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Krishnaram Kenthapadi
 
Introduction to common sense reasoning
Introduction to common sense reasoningIntroduction to common sense reasoning
Introduction to common sense reasoning
Martin Molina
 
Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....Mike Sharples - Generative AI and Large Language Models in Digital Education....
Mike Sharples - Generative AI and Large Language Models in Digital Education....
EADTU
 
Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)Introduction to Common Weakness Enumeration (CWE)
Introduction to Common Weakness Enumeration (CWE)
Aung Thu Rha Hein
 
Responsible AI in Industry (ICML 2021 Tutorial)
Responsible AI in Industry (ICML 2021 Tutorial)Responsible AI in Industry (ICML 2021 Tutorial)
Responsible AI in Industry (ICML 2021 Tutorial)
Krishnaram Kenthapadi
 
STKI Israeli IT Market Study v2 August 2024.pdf
STKI Israeli IT Market Study v2 August 2024.pdfSTKI Israeli IT Market Study v2 August 2024.pdf
STKI Israeli IT Market Study v2 August 2024.pdf
Dr. Jimmy Schwarzkopf
 
Deep Learning in Healthcare
Deep Learning in HealthcareDeep Learning in Healthcare
Deep Learning in Healthcare
Dmytro Fishman
 
Identifying & Combating Misinformation w/ Fact Checking Tools
Identifying & Combating Misinformation w/ Fact Checking ToolsIdentifying & Combating Misinformation w/ Fact Checking Tools
Identifying & Combating Misinformation w/ Fact Checking Tools
Ujjwal Acharya
 
The Five Levels of Generative AI for Games
The Five Levels of Generative AI for GamesThe Five Levels of Generative AI for Games
The Five Levels of Generative AI for Games
Jon Radoff
 
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdfNeo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
Neo4j
 
Fairness and Bias in Machine Learning
Fairness and Bias in Machine LearningFairness and Bias in Machine Learning
Fairness and Bias in Machine Learning
Surya Dutta
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
ryanorban
 
AI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science ConceptsAI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science Concepts
Dan O'Leary
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
Jeff Z. Pan
 
Scaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphsScaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphs
Alan Morrison
 
Fairness in Machine Learning and AI
Fairness in Machine Learning and AIFairness in Machine Learning and AI
Fairness in Machine Learning and AI
Seth Grimes
 

Similar to Towards Explainable Fact Checking (20)

Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)
Isabelle Augenstein
 
Facts, Figures & Fictions
Facts, Figures & Fictions Facts, Figures & Fictions
Facts, Figures & Fictions
Johnbillett.com
 
Information Sharing, Dot Connecting and Intelligence Failures.docx
Information Sharing, Dot Connecting and Intelligence Failures.docxInformation Sharing, Dot Connecting and Intelligence Failures.docx
Information Sharing, Dot Connecting and Intelligence Failures.docx
annettsparrow
 
Datenjournalismus - Entdecken - Erklären - Erkennen
Datenjournalismus - Entdecken - Erklären - ErkennenDatenjournalismus - Entdecken - Erklären - Erkennen
Datenjournalismus - Entdecken - Erklären - Erkennen
Thomas Jöchler
 
In plain language and in plain sight around the globe
In plain language and in plain sight around the globeIn plain language and in plain sight around the globe
In plain language and in plain sight around the globe
Deborah S. Bosley
 
AAPOR 2012 Langer Probability
AAPOR 2012 Langer ProbabilityAAPOR 2012 Langer Probability
AAPOR 2012 Langer Probability
LangerResearch
 
It's not the documents; it's the DATA
It's not the documents; it's the DATAIt's not the documents; it's the DATA
It's not the documents; it's the DATA
J T "Tom" Johnson
 
Freedom of information and investigative journalism
Freedom of information and investigative journalismFreedom of information and investigative journalism
Freedom of information and investigative journalism
asanders88
 
Mmis 61 so2016
Mmis 61 so2016Mmis 61 so2016
Mmis 61 so2016
Stephen Abram
 
5 Types Of Essay Writing Ppt Www.Yienvisa.Com
5 Types Of Essay Writing Ppt Www.Yienvisa.Com5 Types Of Essay Writing Ppt Www.Yienvisa.Com
5 Types Of Essay Writing Ppt Www.Yienvisa.Com
Katrina Green
 
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdfOSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
NETWAYS
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)
jemille6
 
Persuasive Essay Writing Help UK Persuasive Essa
Persuasive Essay Writing Help UK Persuasive EssaPersuasive Essay Writing Help UK Persuasive Essa
Persuasive Essay Writing Help UK Persuasive Essa
Brittany Jones
 
Qualitative Research Project Analytics
Qualitative Research Project  AnalyticsQualitative Research Project  Analytics
Qualitative Research Project Analytics
Cledor Ndiaye
 
2008 paper ah
2008 paper ah2008 paper ah
2008 paper ah
Dani Cathro
 
The Case for News Literacy (The News Literacy Project)
The Case for News Literacy (The News Literacy Project)The Case for News Literacy (The News Literacy Project)
The Case for News Literacy (The News Literacy Project)
PeterNLP
 
Computer Assisted Reporting Presentation
Computer Assisted Reporting PresentationComputer Assisted Reporting Presentation
Computer Assisted Reporting Presentation
Centre for Investigative Journalism
 
Freedom of information final
Freedom of information finalFreedom of information final
Freedom of information final
asanders88
 
MAYDAY.US Report on 2014 Elections
MAYDAY.US Report on 2014 ElectionsMAYDAY.US Report on 2014 Elections
MAYDAY.US Report on 2014 Elections
MAYDAY.US
 
Student Name CATEGORY AND CRITERIA(A+)15(A).docx
Student Name CATEGORY AND CRITERIA(A+)15(A).docxStudent Name CATEGORY AND CRITERIA(A+)15(A).docx
Student Name CATEGORY AND CRITERIA(A+)15(A).docx
deanmtaylor1545
 
Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)
Isabelle Augenstein
 
Facts, Figures & Fictions
Facts, Figures & Fictions Facts, Figures & Fictions
Facts, Figures & Fictions
Johnbillett.com
 
Information Sharing, Dot Connecting and Intelligence Failures.docx
Information Sharing, Dot Connecting and Intelligence Failures.docxInformation Sharing, Dot Connecting and Intelligence Failures.docx
Information Sharing, Dot Connecting and Intelligence Failures.docx
annettsparrow
 
Datenjournalismus - Entdecken - Erklären - Erkennen
Datenjournalismus - Entdecken - Erklären - ErkennenDatenjournalismus - Entdecken - Erklären - Erkennen
Datenjournalismus - Entdecken - Erklären - Erkennen
Thomas Jöchler
 
In plain language and in plain sight around the globe
In plain language and in plain sight around the globeIn plain language and in plain sight around the globe
In plain language and in plain sight around the globe
Deborah S. Bosley
 
AAPOR 2012 Langer Probability
AAPOR 2012 Langer ProbabilityAAPOR 2012 Langer Probability
AAPOR 2012 Langer Probability
LangerResearch
 
It's not the documents; it's the DATA
It's not the documents; it's the DATAIt's not the documents; it's the DATA
It's not the documents; it's the DATA
J T "Tom" Johnson
 
Freedom of information and investigative journalism
Freedom of information and investigative journalismFreedom of information and investigative journalism
Freedom of information and investigative journalism
asanders88
 
5 Types Of Essay Writing Ppt Www.Yienvisa.Com
5 Types Of Essay Writing Ppt Www.Yienvisa.Com5 Types Of Essay Writing Ppt Www.Yienvisa.Com
5 Types Of Essay Writing Ppt Www.Yienvisa.Com
Katrina Green
 
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdfOSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
OSMC 2024 | The Subtle Art of Lying with Statistics Dave McAllister.pdf
NETWAYS
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)
jemille6
 
Persuasive Essay Writing Help UK Persuasive Essa
Persuasive Essay Writing Help UK Persuasive EssaPersuasive Essay Writing Help UK Persuasive Essa
Persuasive Essay Writing Help UK Persuasive Essa
Brittany Jones
 
Qualitative Research Project Analytics
Qualitative Research Project  AnalyticsQualitative Research Project  Analytics
Qualitative Research Project Analytics
Cledor Ndiaye
 
The Case for News Literacy (The News Literacy Project)
The Case for News Literacy (The News Literacy Project)The Case for News Literacy (The News Literacy Project)
The Case for News Literacy (The News Literacy Project)
PeterNLP
 
Freedom of information final
Freedom of information finalFreedom of information final
Freedom of information final
asanders88
 
MAYDAY.US Report on 2014 Elections
MAYDAY.US Report on 2014 ElectionsMAYDAY.US Report on 2014 Elections
MAYDAY.US Report on 2014 Elections
MAYDAY.US
 
Student Name CATEGORY AND CRITERIA(A+)15(A).docx
Student Name CATEGORY AND CRITERIA(A+)15(A).docxStudent Name CATEGORY AND CRITERIA(A+)15(A).docx
Student Name CATEGORY AND CRITERIA(A+)15(A).docx
deanmtaylor1545
 
Ad

More from Isabelle Augenstein (20)

Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Isabelle Augenstein
 
Automatically Detecting Scientific Misinformation
Automatically Detecting Scientific MisinformationAutomatically Detecting Scientific Misinformation
Automatically Detecting Scientific Misinformation
Isabelle Augenstein
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact Checking
Isabelle Augenstein
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science Communication
Isabelle Augenstein
 
Tracking False Information Online
Tracking False Information OnlineTracking False Information Online
Tracking False Information Online
Isabelle Augenstein
 
What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...
Isabelle Augenstein
 
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Isabelle Augenstein
 
Learning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondLearning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyond
Isabelle Augenstein
 
Learning to read for automated fact checking
Learning to read for automated fact checkingLearning to read for automated fact checking
Learning to read for automated fact checking
Isabelle Augenstein
 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
Isabelle Augenstein
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Isabelle Augenstein
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
Isabelle Augenstein
 
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersUSFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
Isabelle Augenstein
 
Distant Supervision with Imitation Learning
Distant Supervision with Imitation LearningDistant Supervision with Imitation Learning
Distant Supervision with Imitation Learning
Isabelle Augenstein
 
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Isabelle Augenstein
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked Data
Isabelle Augenstein
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured Text
Isabelle Augenstein
 
Relation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionRelation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant Supervision
Isabelle Augenstein
 
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Isabelle Augenstein
 
Automatically Detecting Scientific Misinformation
Automatically Detecting Scientific MisinformationAutomatically Detecting Scientific Misinformation
Automatically Detecting Scientific Misinformation
Isabelle Augenstein
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact Checking
Isabelle Augenstein
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science Communication
Isabelle Augenstein
 
Tracking False Information Online
Tracking False Information OnlineTracking False Information Online
Tracking False Information Online
Isabelle Augenstein
 
What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...What can typological knowledge bases and language representations tell us abo...
What can typological knowledge bases and language representations tell us abo...
Isabelle Augenstein
 
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Isabelle Augenstein
 
Learning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyondLearning with limited labelled data in NLP: multi-task learning and beyond
Learning with limited labelled data in NLP: multi-task learning and beyond
Isabelle Augenstein
 
Learning to read for automated fact checking
Learning to read for automated fact checkingLearning to read for automated fact checking
Learning to read for automated fact checking
Isabelle Augenstein
 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
Isabelle Augenstein
 
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...
Isabelle Augenstein
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Isabelle Augenstein
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
Isabelle Augenstein
 
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersUSFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders
Isabelle Augenstein
 
Distant Supervision with Imitation Learning
Distant Supervision with Imitation LearningDistant Supervision with Imitation Learning
Distant Supervision with Imitation Learning
Isabelle Augenstein
 
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...Extracting Relations between Non-Standard Entities using Distant Supervision ...
Extracting Relations between Non-Standard Entities using Distant Supervision ...
Isabelle Augenstein
 
Information Extraction with Linked Data
Information Extraction with Linked DataInformation Extraction with Linked Data
Information Extraction with Linked Data
Isabelle Augenstein
 
Lodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured TextLodifier: Generating Linked Data from Unstructured Text
Lodifier: Generating Linked Data from Unstructured Text
Isabelle Augenstein
 
Relation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionRelation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant Supervision
Isabelle Augenstein
 
Ad

Recently uploaded (20)

DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 

Towards Explainable Fact Checking

  • 1. FEVER Workshop 9 July 2020 Towards Explainable Fact Checking Isabelle Augenstein* [email protected] @IAugenstein http://isabelleaugenstein.github.io/ *partial slide credit: Pepa Atanasova, Dustin Wright
  • 2. Overview of Today’s Talk • Introduction • Fact checking – where are we at? • Open research questions • Part 1: Generating explanations • A first case study • Generating explanations inspired by how humans fact-check • Part 2: Understanding claims • What even is a check-worthy claim? • Automatically generating confusing claims
  • 3. Full Fact Checking Pipeline Claim Check- Worthiness Detection Evidence Document Retrieval Stance Detection / Textual Entailment Veracity Prediction “Immigrants are a drain on the economy” not check-worthy check-worthy “Immigrants are a drain on the economy” “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral true false not enough info“Immigrants are a drain on the economy”
  • 4. How is fact checking framed in NLP? 4 Veracity Prediction true false neutral“Immigrants are a drain on the economy” • Most early work on detecting false information follows this setup • Deception detection (Mihalcea & Strapparava, 2009; Feng et al., 2012, Morris et al., 2012) • Clickbait detection (Chen et al., 2015; Potthast et al., 2016) • Fake news detection (Rubin et al., 2016; Volkova et al., 2017; Wang, 2017) • Option for early detection of fake news or rumours (”zero-shot setting”) • Prediction not evidence-based, but strictly based on linguistic cues 👍 👎
  • 5. How is fact checking framed in NLP? 5 • Relatively well established tasks in NLP • Stance Detection (Anand et al., 2011; Mohammad et al., 2016) • Recognising Textual Entailment (Dagan et al., 2004; Bowman et al., 2015) • … more recently also studied in the context of fact checking • Fake News Challenge (Pomerleau & Rao, 2017) • Option for fact checking based on one evidence document • Central component of fact checking can be studied in isolation • Additional methods are needed for combining predictions of multiple evidence documents 👍 👎 Stance Detection / Textual Entailment “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral 👍
  • 6. How is fact checking framed in NLP? 6 • Setup followed by more recent research, e.g. FEVER (Thorne et al., 2018) and MultiFC (Augenstein et al., 2019) • Includes most tasks of the fact checking pipeline • Inter-dependencies of components makes it difficult to understand how a prediction is derived • The difficulty of the task means models are relatively easily fooled 👍 👎 Evidence Document Retrieval Stance Detection / Textual Entailment Veracity Prediction “Immigrants are a drain on the economy” “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral true false not enough info“Immigrants are a drain on the economy” 👎
  • 7. Research Challenges in Fact Checking 7 • Zero-shot or few-shot fact checking • How to fact-check claims with little available evidence? • Can we do better than the ”claim-only” setting? • Another option: prediction based on user responses on social media (e.g. Chen et al., 2019) • Combining multiple evidence documents for fact checking • Strength of evidence can differ • Learn this as part of the model (Augenstein et al. 2019) • Documents can play different roles, e.g. evidence, counter-evidence • Some early work on combining one piece of evidence with one of counter-evidence (Yin and Roth, 2018) • Documents can address different parts of the claim • Multi-hop fact checking (Nishida et al., 2019)
  • 8. Research Challenges in Fact Checking 8 • Better understanding how a fact checking prediction is derived • Explainability metrics or tools can be used • Those do not necessarily highlight inter-dependencies between claim and evidence or provide an easy human-understandable explanation Ø Addressed in this talk • Claim check-worthiness detection • This is typically seen as a preprocessing task and rarely studied as part of the fact checking pipeline • Due to being relatively understudied, the problem is ill-defined Ø Addressed in this talk
  • 9. Research Challenges in Fact Checking 9 • Fact-checking systems are easily fooled • ... even with completely non-sensical inputs • Studying what types of claims can fool models can help make them more robust (FEVER 2.0) • However, it is very challenging to fully automatially create claims that look ”natural” to humans Ø Addressed in this talk
  • 11. Generating Fact Checking Explanations Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle Augenstein ACL 2020 11
  • 12. A Typical Fact Checking Pipeline Claim Check- Worthiness Detection Evidence Document Retrieval Stance Detection / Textual Entailment Veracity Prediction “Immigrants are a drain on the economy” not check-worthy check-worthy “Immigrants are a drain on the economy” “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral true false not enough info“Immigrants are a drain on the economy”
  • 15. Automating Fact Checking Statement: “The last quarter, it was just announced, our gross domestic product was below zero. Who ever heard of this? Its never below zero.” Speaker: Donald Trump Context: presidential announcement speech Label: Pants on Fire Wang, William Yang. "“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection." Proceedings ACL’2017.
  • 16. Automating Fact Checking Claim: We’ve tested more than every country combined. Metadata: topic=healthcare, Coronavirus, speaker=Donald Trump, job=president, context=White House briefing, history={4, 10, 14, 21, 34, 14} Hybrid CNN Veracity Label 0,204 0,208 0,247 0,274 0 0,1 0,2 0,3 Validation Test Fact checking macro F1 score Majority Wang et. al Wang, William Yang. "“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection." Proceedings ACL’2017.
  • 17. Automating Fact Checking Alhindi, Tariq, Savvas Petridis, and Smaranda Muresan. "Where is your Evidence: Improving Fact-checking by Justification Modeling." Proceedings of FEVER’2018. Claim: We’ve tested more than every country combined. Justification: Trump claimed that the United States has "tested more than every country combined." There is no reasonable way to conclude that the American system has run more diagnostics than "all other major countries combined." Just by adding up a few other nations’ totals, you can quickly see Trump’s claim fall apart. Plus, focusing on the 5 million figure distracts from the real issue — by any meaningful metric of diagnosing and tracking, the United States is still well behind countries like Germany and Canada. The president’s claim is not only inaccurate but also ridiculous. We rate it Pants on Fire! Logistic Regression Veracity Label 0,204 0,208 0,247 0,274 0,370 0,370 0 0,2 0,4 Validation Test Fact checking macro F1 score Majority Wang et. al Alhindi et. Al
  • 18. How about an explanation? Claim: We’ve tested more than every country combined. Justification: Trump claimed that the United States has "tested more than every country combined." There is no reasonable way to conclude that the American system has run more diagnostics than "all other major countries combined." Just by adding up a few other nations’ totals, you can quickly see Trump’s claim fall apart. Plus, focusing on the 5 million figure distracts from the real issue — by any meaningful metric of diagnosing and tracking, the United States is still well behind countries like Germany and Canada. The president’s claim is not only inaccurate but also ridiculous. We rate it Pants on Fire! Model Veracity Label Explanation
  • 19. Generating Explanations from Ruling Comments Claim: We’ve tested more than every country combined. Ruling Comments: Responding to weeks of criticism over his administration’s COVID-19 response, President Donald Trump claimed at a White House briefing that the United States has well surpassed other countries in testing people for the virus. "We’ve tested more than every country combined," Trump said April 27 […] We emailed the White House for comment but never heard back, so we turned to the data. Trump’s claim didn’t stand up to scrutiny. In raw numbers, the United States has tested more people than any other individual country — but nowhere near more than "every country combined" or, as he said in his tweet, more than "all major countries combined.”[…] The United States has a far bigger population than many of the "major countries" Trump often mentions. So it could have run far more tests but still have a much larger burden ahead than do nations like Germany, France or Canada.[…] Joint Model Veracity Label Justification/ Explanation
  • 20. Related Studies on Generating Explanations • Camburu et. al; Rajani et. al generate abstractive explanations • Short input text and explanations; • Large amount of annotated data. • Real world fact checking datasets are of limited size and the input consists of long documents • We take advantage of the LIAR-PLUS dataset: • Use the summary of the ruling comments as a gold explanation; • Formulate the problem as extractive summarization. • Camburu, Oana-Maria, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. "e-SNLI: Natural language inference with natural language explanations." In Advances in Neural Information Processing Systems,. 2018. • Rajani, Nazneen Fatema, Bryan McCann, Caiming Xiong, and Richard Socher. "Explain Yourself! Leveraging Language Models for Commonsense Reasoning." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932-4942. 2019.
  • 21. Example of an Oracle’s Gold Summary Claim: “The president promised that if he spent money on a stimulus program that unemployment would go to 5.7 percent or 6 percent. Those were his words.” Label: Mostly-False Just: Bramnick said “the president promised that if he spent money on a stimulus program that unemployment would go to 5.7 percent or 6 percent. Those werehis words.” Two economic advisers estimated in a 2009 report that with the stimulus plan, the unemployment rate would peak near 8 percent before dropping to less than 6 percent by now. Those are critical details Bramnick’s statement ignores. To comment on this ruling, go to NJ.com. Oracle: “The president promised that if he spent money on a stimulus program that unemployment would go to 5.7 percent or 6 percent. Those were his words,”Bramnick said in a Sept. 7 interview on NJToday. But with the stimulus plan, the report projected the nation’s jobless rate would peak near 8 percent in 2009 before falling to about 5.5 percent by now. So the estimates in the report were wrong.
  • 22. Which sentences should be selected for the explanation?
  • 23. What is the veracity of the claim?
  • 24. Joint Explanation and Veracity Prediction Ruder, Sebastian, et al. "Latent multi-task architecture learning." Proceedings of the AAAI’2019. Cross-stitch layer Multi-task objective
  • 25. Fact Checking Results 0,208 0,274 0,323 0,300 0,313 0,370 0,443 0 0,05 0,1 0,15 0,2 0,25 0,3 0,35 0,4 0,45 0,5 Test Macro F1 score on the test split Majority Wang et. al MT Ruling Ruling Oracles Ruling Alhindi et. al Justification
  • 26. 28,11 6,96 24,38 43,57 22,23 39,26 35,70 13,51 31,5831,13 12,90 30,93 0,00 5,00 10,00 15,00 20,00 25,00 30,00 35,00 40,00 45,00 50,00 ROUGE-1 ROUGE-2 ROUGE-L ROUGE scores on the test split Lead-4 Oracle Extractive Extractive-MT Explanation Results *ROUGE-N: Overlap of N-grams between the system and reference summaries. *ROUGE-L : Longest Common Subsequence (LCS) between the system and reference summaries.
  • 27. Manual Evaluation • Explanation Quality • Coverage. The explanation contains important, salient information and does not miss any important points that contribute to the fact check. • Non-redundancy. The summary does not contain any information that is redundant/repeated/not relevant to the claim and the fact check. • Non-contradiction. The summary does not contain any pieces of information contradictory to the claim and the fact check. • Overall. Rank the explanations by their overall quality. • Explanation Informativeness. Provide a veracity label for a claim based on a veracity explanation coming from the justification, the Explain-MT, or the Explain- Extractive system.
  • 28. Explanation Quality 1,48 1,48 1,45 1,58 1,89 1,75 1,40 2,03 1,68 1,79 1,48 1,90 - 0,50 1,00 1,50 2,00 2,50 Coverage Non-redundancy Non-contradicton Overall Mean Average Rank (MAR). Lower MAR is better! (higher rank) Justification Extractive Extractive MT
  • 29. Explanation Informativeness - 0,100 0,200 0,300 0,400 0,500 Agree-Correct Agree-Not Correct Agree-Nost Sufficient Disagree Manual veracity labelling, given a particular explanation as percentages of the dis/agreeing annotator predictions. Justification Extractive Extractive-MT
  • 30. Summary • First study on generating veracity explanations • Jointly training veracity prediction and explanation • improves the performance of the classification system • improves the coverage and overall performance of the generated explanations
  • 31. Future Work • Can we generate better or even abstractive explanations given limited resources? • How to automatically evaluate the properties of the explanations? • Can explanations be extracted from evidence pages only (lots of irrelevant and multi-modal results)?
  • 33. Claim Check-Worthiness Detection as Positive Unlabelled Learning Dustin Wright, Isabelle Augenstein Findings of EMNLP, 2020 33
  • 34. A Typical Fact Checking Pipeline Claim Check- Worthiness Detection Evidence Document Retrieval Stance Detection / Textual Entailment Veracity Prediction “Immigrants are a drain on the economy” not check-worthy check-worthy “Immigrants are a drain on the economy” “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral true false not enough info“Immigrants are a drain on the economy”
  • 35. What is Check-Worthiness? 35 - A check-worthy statement is ”an assertion about the world that is checkable” (Konstantinovskiy et al., 2018) - However, there is more to this: - The statement should be factual - Typically, only sentences unlikely to be believed without verification are marked as check-worthy - Coupled with claim importance, which is subjective - Domain interests skew what is selected as check-worthy (e.g. only certain political claims)
  • 36. What is Check-Worthiness? 36 - Many different types of claims (Francis and FullFact, 2016) - numerical claims involving numerical properties about entities and comparisons among them - entity and event properties such as professional qualifications and event participants - position statements such as whether a political entity supported a certain policy - quote verification assessing whether the claim states precisely the source of a quote, its content, and the event at which it supposedly occurred.
  • 37. Wikipedia Citation Needed Detection Twitter Rumour Detection Political Check-Worthiness Detection Given a statement from Wikipedia, determine if the statement needs a citation in order to be accepted as true 37 Determine if the content of a tweet was verified as true or false at the time of posting Given statements from a political speech or debate, rank them by how check-worthy they are What is Check-Worthiness?
  • 39. Baseline Approach for Claim Check- Worthiness Detection: Binary Classification 39 𝑔 ∶ 𝑥 → (0, 1) Reviewers described the book as “magistral”, “encyclopaedic” and a “classic”. l = 1 As was pointed out above, Lenten traditions have developed over time. l = 0
  • 40. A Unified Approach • Utilise existing published datasets on check- worthiness detection and related tasks • Use Positive Unlabelled (PU) learning to overcome issues with subjective annotation • Transfer learning from Wikipedia citation needed detection • Evaluate how well each dataset reflects a general definition of check-worthiness 40
  • 41. Solution: Positive Unlabelled Learning 41 • Assumptions: • Not all ‘true’ instances are labelled as such • Labelled samples are selected at random from an underlying distribution • If a sample is labelled, then the label cannot be ‘false’ • Intuition: • Train another model, 𝑓, first, which estimates the confidence in a sample being ‘true’ • Automatically label unlabelled samples using 𝑓 to train model 𝑔
  • 42. A Traditional Classifier 42 𝑔 ∶ 𝑥 → (0, 1) Reviewers described the book as “magistral”, “encyclopaedic” and a “classic”. l = 1 As was pointed out above, Lenten traditions have developed over time. l = 0 Positive-Negative Classifier All three of his albums appeared on Rolling Stone magazine's list of The 500 Greatest Albums of All Time in 2003 l = 0 True negative?
  • 43. Positive Unlabelled Learning 43 𝑓 ∶ 𝑥 → (0, 1) s = 0 s = 0 s = 1 Positive-Unlabelled Classifier 1.0 0.0 0.05 0.95 0.8 0.2 𝑤 𝑥
  • 44. Positive Unlabelled Learning 44 𝑓 s = 0 s = 0 s = 1 1.0 0.0 0.05 0.95 0.8 0.2 𝑔 l = 1 l = 0 l = 1 l = 0 l = 1 l = 0 𝑤 𝑥 Charles Elkan and Keith Noto. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of KKD.
  • 45. Positive Unlabelled Learning + Conversion 45 𝑓 s = 0 s = 0 s = 1 1.0 0.0 0.05 0.95 0.8 0.2 𝑔 l = 1 l = 0 l = 1 l = 0 l = 1 l = 0 1.0 Use estimate of 𝑝(𝑦 = 1) 𝑤 𝑥
  • 46. Cross-Domain Check-Worthiness Detection 46 𝑓0 𝑔0 𝑔1 𝑔2 All models: BERT base uncased pretrained
  • 47. Research Questions • To what degree is check-worthiness detection a PU learning problem, and does this enable a unified approach to check-worthiness detection? • How well do statements in each dataset reflect the definition: ”makes an assertion about the world that is checkable?” 47
  • 48. Results: Citation Needed Detection 48 • Dataset: • ’Citation needed’ detection (Redi et al., 2019) • 10k positive, 10k negative for training (featured articles) • 10k positive, 10k negative for testing (flagged statements) 0,68 0,7 0,72 0,74 0,76 0,78 0,8 0,82 0,84 Redi et al. 2019 BERT BERT + PU BERT + PUC Macro F1
  • 49. Results: Twitter 49 • Dataset: • Pheme Rumour Detection (Zubiaga et al., 2017) • 5802 tweets from 5 events • 5-fold cross-validation, event hold-out 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,7 BiLSTM Zubiaga et al. 2017 BERT BERT + Wiki BERT + Wiki + PU BERT + Wiki + PUC Micro-F1
  • 50. Results: Political Speeches 50 • Dataset: • Claims from political speeches (Hansen et al., 2018) • Consists of 4 political speeches from 2018 Clef CheckThat! Competition & 3 speeches from ClaimRank • 2602 statements overall • 7-fold cross-validation, speech hold-out 0,28 0,29 0,3 0,31 0,32 0,33 0,34 0,35 Hansen et al. 2019 BERT BERT + Wiki BERT + Wiki + PU BERT + Wiki + PUC Mean Average Precision
  • 51. Error Analysis: How Check-Worthy are Check- Worthy Statements? 51 • Method: • Get top 100 positive predictions by PUC model on each dataset • Re-annotate those manually (2 annotators) • Compare against ’gold’ annotations 0 0,2 0,4 0,6 0,8 1 Wikipedia Twitter Politics
  • 52. Dataset Analysis • Wikipedia and Twitter equally well reflect a general definition of check-worthiness • Political speeches have high false negative rate • Also documented in previous literature (Konstantinovskiy et al. 2018) • Clear examples: these two statements appear in a single document, only one was labelled as check-worthy • ”It’s why our unemployment rate is the lowest it’s been in so many decades” • ”New unemployment claims are near the lowest we’ve seen in almost half a century” 52
  • 53. Key Takeaways • Check-worthiness annotation is highly subjective • Because of this, automatic check-worthiness detection is a difficult task • Approaching the problem as PU learning shows improvements • Transferring from Wikipedia improves on downstream domains when PU learning is used 53
  • 54. Generating Label Cohesive and Well-Formed Adversarial Claims Pepa Atanasova*, Dustin Wright*, Isabelle Augenstein Proceedings of EMNLP, 2020 *equal contributions 54
  • 55. A Typical Fact Checking Pipeline Claim Check- Worthiness Detection Evidence Document Retrieval Stance Detection / Textual Entailment Veracity Prediction “Immigrants are a drain on the economy” not check-worthy check-worthy “Immigrants are a drain on the economy” “Immigrants are a drain on the economy”, “EU immigrants have a positive impact on public finances” positive negative neutral true false not enough info“Immigrants are a drain on the economy”
  • 56. Generating Adversarial Claims • Fact checking models can overfit to spurious patterns • Making the right predictions for the wrong reasons • This leads to vulnerabilities, which can be exploited by adversaries (e.g. agents spreading mis- and disinformation) • How can one reveal such vulnerabilities? • Generating explanations for fact checking models (first part of talk) • Generating adversarial claims (this work)
  • 57. Previous Work • Universal adversarial attacks (Gao and Oates, 2019; Wallace et al, 2019) • Single perturbation changes that can be applied to many instances • Change the meaning of the input instances and thus produce label-incoherent claims • Are not per se semantically well-formed • Rule-based perturbations (Riberio et al., 2018) • Semantically well-formed, but require hand-crafting patterns
  • 58. Previous Work • FEVER 2.0 shared task (Thorne et al., 2019) • Builders / breakers setup • Methods of submitted systems: • Producing claims requiring multi-hop reasoning (Niewinski et al., 2019) • Generating adversarial claims manually (Kim and Allan, 2019)
  • 59. Goals of this Work • Generate claims fully automatically • Preserve the meaning of the source text • Produce semantically well-formed claims
  • 60. Model RoBERTa-based FEVER model to predict FC label 1) HotFlip attack model to find triggers 2) STS auxiliary model to preserve FC label
  • 61. Examples of Generated Claims 61 EMNLP 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE. Evidence Triggers Generated Claim SUPPORTS Claims Since the 19th century, some Romani have also migrated to the Americas. don,already,more,during,home Romani have moved to the Americas during the 19th century. Cyprus is a major tourist destination in the Mediter- ranean . foreign,biggest,major,every,friends Cyprus is a major tourist des- tination. The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van’t Hoff, of the Nether- lands, “for his discovery of the laws of chemical dynam- ics and osmotic pressure in solutions.” later,already,quite,altern,whereas Henricus Van’t Hoff was al- ready awarded the Nobel Prize. REFUTES Claims California Attorney General phys,incarn,not,occasionally,something Kamala Harris did not defeat Since the 19th century, some Romani have also migrated to the Americas. don,already,more,during,home Romani have moved to the Americas during the 19th century. Cyprus is a major tourist destination in the Mediter- ranean . foreign,biggest,major,every,friends Cyprus is a major tourist des- tination. The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van’t Hoff, of the Nether- lands, “for his discovery of the laws of chemical dynam- ics and osmotic pressure in solutions.” later,already,quite,altern,whereas Henricus Van’t Hoff was al- ready awarded the Nobel Prize. REFUTES Claims California Attorney General Kamala Harris defeated Sanchez , 61.6% to 38.4%. phys,incarn,not,occasionally,something Kamala Harris did not defeat Sanchez, 61.6% to 38.4%. Uganda is in the African Great Lakes region. unless,endorsed,picks,pref,against Uganda is against the African Great Lakes region. Times Higher Education World University Rankings is an annual publication of university rankings by Times Higher Education (THE) magazine. interested,reward,visit,consumer,conclusion Times Higher Education World University Rankings is a consumer magazine. NOT ENOUGH INFO Claims ranean . The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van’t Hoff, of the Nether- lands, “for his discovery of the laws of chemical dynam- ics and osmotic pressure in solutions.” later,already,quite,altern,whereas Henricus Van’t Hoff was al- ready awarded the Nobel Prize. REFUTES Claims California Attorney General Kamala Harris defeated Sanchez , 61.6% to 38.4%. phys,incarn,not,occasionally,something Kamala Harris did not defeat Sanchez, 61.6% to 38.4%. Uganda is in the African Great Lakes region. unless,endorsed,picks,pref,against Uganda is against the African Great Lakes region. Times Higher Education World University Rankings is an annual publication of university rankings by Times Higher Education (THE) magazine. interested,reward,visit,consumer,conclusion Times Higher Education World University Rankings is a consumer magazine. NOT ENOUGH INFO Claims The KGB was a military ser- vice and was governed by army laws and regulations , similar to the Soviet Army or MVD Internal Troops. nowhere,only,none,no,nothing The KGB was only con- trolled by a military service. The first Nobel Prize in Chemistry was awarded in 1901 to Jacobus Henricus van’t Hoff, of the Nether- lands, “for his discovery of the laws of chemical dynam- ics and osmotic pressure in solutions.” later,already,quite,altern,whereas Henricus Van’t Hoff was al- ready awarded the Nobel Prize. REFUTES Claims California Attorney General Kamala Harris defeated Sanchez , 61.6% to 38.4%. phys,incarn,not,occasionally,something Kamala Harris did not defeat Sanchez, 61.6% to 38.4%. Uganda is in the African Great Lakes region. unless,endorsed,picks,pref,against Uganda is against the African Great Lakes region. Times Higher Education World University Rankings is an annual publication of university rankings by Times Higher Education (THE) magazine. interested,reward,visit,consumer,conclusion Times Higher Education World University Rankings is a consumer magazine. NOT ENOUGH INFO Claims The KGB was a military ser- vice and was governed by army laws and regulations , similar to the Soviet Army or MVD Internal Troops. nowhere,only,none,no,nothing The KGB was only con- trolled by a military service. The series revolves around says,said,take,say,is Take Me High is about
  • 62. Results: Automatic Evaluation • How well can we generate universal adversarial triggers? • Reduction in F1 for both trigger generation methods • Trade-off between how potent the attack is (reduction in F1) vs. how semantically coherent the claim is (STS) • Most potent attack: S->R for FC Objective (F1: 0.060) • Adds negation words, which change meaning of claim 0 0,5 1 F1 F1 of FC model (lower = better) No Triggers FC Objective FC + STS Objective 4 4,5 5 5,5 STS Semantic Textual Similarity with original claim (higher = better) No Triggers FC Objective FC + STS Objective
  • 63. Results: Manual Evaluation • How well can we generate universal adversarial triggers? • Confirms results of automatic evaluation • Macro F1 of generated claim w.r.t. original claim: 56.6 (FC Objective); 60.7 (FC + STS Objective) • STS Objective preserves meaning more often 0 5 Overall SUPPORTS REFUTES NEI Semantic Textual Similarity with original claim (higher = better) FC Objective FC + STS Objective 0 0,5 1 Overall SUPPORTS REFUTES NEI F1 of FC model (lower = better) FC Objective FC + STS Objective
  • 64. Key Take-Aways • Novel extension to the HotFlip attack for universarial adversarial trigger generation (Ebrahimi et al., 2018) • Conditional language model, which takes trigger tokens and evidence, and generates a semantically coherent claim • Resulting model generates semantically coherent claims containing universal triggers, which preserve the label • Trade-off between how well-formed the claim is and how potent the attack is
  • 65. Overall Take-Aways • Automatic fact checking is still a very involved NLP task • Different sub-tasks: 1) claim check-worthiness detection; 2) evidence retrieval; 3) stance detection / textual entailment, 4) veracity prediction • What is needed to automatically generate explanations for fact checking models? • Common understanding of how sub-tasks should be defined (Part 2.1 of this talk) • Lack thereof leads to inconsistent models and findings, limited transferability of learned representations
  • 66. Overall Take-Aways • What is needed to automatically generate explanations for fact checking models? • More realistic benchmarks for FC explanation generation (Part 1 of this talk) • Generating explanations from ruling comments in the LIAR-PLUS dataset is somewhat useful for researching models, but not very realistic • The dataset is also relatively small (10k training instances) • A more realistic setup and benchmark would inspire research of e.g. multi- modal models, and models for better understanding evidence documents
  • 67. Overall Take-Aways • What is needed to automatically generate explanations for fact checking models? • More research on fooling FC models (Part 2.2 of this talk) • Adversarial trigger generation can reveal inconsistencies and vulnerabilities in FC models • Those can explain what lexical patterns a FC model struggles with • Most research focuses on instance-level perturbations of claims, whereas universal triggers reveal more general vulnerabilities of the model
  • 69. Thanks to my PhD students and colleagues! 69 CopeNLU https://copenlu.github.io/ Pepa Atanasova Dustin Wright Jakob Grue Simonsen Christina Lioma
  • 70. Presented Papers Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle Augenstein. Generating Fact Checking Explanations. In Proceedings of ACL 2020. https://arxiv.org/abs/2004.05773 Dustin Wright, Isabelle Augenstein. Claim Check-Worthiness Detection as Positive Unlabelled Learning. In Findings of EMNLP 2020. https://arxiv.org/abs/2003.02736 Pepa Atanasova*, Dustin Wright*, Isabelle Augenstein. Generating Label Cohesive and Well-Formed Adversarial Claims. In Proceedings of EMNLP, 2020. https://arxiv.org/abs/2009.08205 *equal contributions