Applying Noisy Knowledge Graphs to Real Problems
Mayank Kejriwal
USC Information Sciences Institute
May 2019
Acknowledgements
Real Problems
Web has lowered the barrier to entry!
Pump and dump schemes proliferate online
Quechua
Fula
Odia
Maithili
Bhojpuri
Uyghur
Mayan languages
Aboriginal languages
Tasmanian languages
Fang
Umbundu
Setswana
Afro-Asiatic
Khoisan
Fon
Yoruba
Peulh
Adangme
Erzya
Bashkir
Khakas
Udmurt
Ingush
Tagalog
Hiligaynon
Bikol
Waray
Native American dialects
What do these problems have in common (besides being really hard)?
1. Very messy, raw data, with both redundancy and irrelevance
2. Users are also producers, i.e., we cannot just ‘build’ the system and hand it off
3. Domains are largely non-analytic (e.g., we don’t have a model or equations for human trafficking)
Space of design decisions
[Diagram: Raw data; Search+GUI; Representation + Infrastructure; remaining components marked "?"]
Space of design decisions
Search+GUI
Producer Consumer
Space of design decisions
[Diagram: Raw data; Knowledge Graph; Search+GUI; Representation + Infrastructure; remaining components marked "?"]
Domain-specific Insight Graphs (DIG)
Space of design decisions: example from human trafficking
[Diagram components: Raw data; Domain discovery; Define KG schema; KG Construction; Knowledge Graph (KG); Flexible inputs; Query reformulation; Search+GUI; Representation + Infrastructure]
The Knowledge Graph is noisy… how do we cope?
Answer: Strategize around each triangle
Search+GUI
Consumer Producer
Example from DIG: consumer triangle
Search+GUI
Consumer
Anti-fragile query reformulation to satisfy user intent
SELECT ?ad ?ethnicity
WHERE
{
  ?ad a :Ad ;
      :hair_color 'Auburn' ;
      :review_site_id 'cg9469f' ;
      :price_per_hour '500' ;
      :name 'Claire Gold' ;
      :ethnicity ?ethnicity .
}
Query Reformulation: query 1, query 2, query 3, query 4, ..., query n
Keyword expansion • Context broadening • Constraint relaxation
Precision
Recall
Elasticsearch: 100M entities
Ranked Candidates
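To make the constraint relaxation step concrete, here is a minimal Python sketch. It is an illustration only, not DIG's actual reformulation strategy: the function name `relax_constraints` and the `min_keep` parameter are hypothetical, and the approach shown (dropping constraints one at a time, strictest query first) is just one simple way to turn a single structured query into query 1 ... query n.

```python
from itertools import combinations


def relax_constraints(constraints, min_keep=1):
    """Yield progressively looser versions of a structured query.

    The first query keeps every constraint; later queries drop one
    constraint, then two, and so on, never keeping fewer than
    `min_keep` constraints. (Hypothetical helper for illustration.)
    """
    keys = sorted(constraints)
    for keep in range(len(keys), min_keep - 1, -1):
        for subset in combinations(keys, keep):
            yield {k: constraints[k] for k in subset}


# Constraints taken from the example query on the slide.
query = {
    "hair_color": "Auburn",
    "review_site_id": "cg9469f",
    "price_per_hour": "500",
    "name": "Claire Gold",
}

# Keep at least three of the four constraints.
relaxed = list(relax_constraints(query, min_keep=3))
for q in relaxed:
    print(sorted(q))
```

Running the strict query first and the relaxed ones afterwards favors precision at the top of the ranking while the looser queries recover recall when an extracted field is missing or wrong in the noisy graph.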
Query-centric KG representation
Infrastructure: Leverage existing ecosystems (there are many!)
Prosecutions
User Testimonials
Showcasing THOR
Other domains
Narcotics
Illegal weapons sales
Fraudulent shipments
Securities fraud
Causal exploration
Geopolitical forecasting
Cyberattack prediction
THANK YOU! QUESTIONS...
BACKUP
THOR: Text-enabled Humanitarian Operations in Real-time
Impact and Measurements
Controlled (i.e., academic) measurements
[Chart: DARPA MEMEX Eval (90K pages). X-axis: Average Precision of Retrieved Pages, in buckets from 0 - 0.1 up to <= 1.0. Y-axis: % Questions (0 to 100). Legend: Point Fact, Cluster ID, Aggregate, Facet.]
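For readers unfamiliar with the metric on the chart's axis, the following Python sketch computes average precision for one ranked result list. It is a generic textbook-style illustration, not the MEMEX evaluation code: the function name is hypothetical, and it normalizes by the number of relevant items that were actually retrieved (an assumption; some definitions divide by the total number of relevant items in the corpus instead).

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list.

    `ranked_relevance` is a sequence of booleans, best-ranked first,
    where True marks a relevant page. Precision is sampled at each
    rank that holds a relevant page and then averaged over those
    ranks. Returns 0.0 when nothing relevant was retrieved.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Relevant pages at ranks 1 and 3: (1/1 + 2/3) / 2
score = average_precision([True, False, True])
print(round(score, 4))
```

Computing this score per question and then counting how many questions fall into each score bucket gives a histogram of the kind described in the chart.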
In-use impact (sex trafficking)
100 million+ escort ads
3 years data coverage
2 billion triples
100 law enforcement offices
3 convictions
NY County District Attorney (HTRU)
MEMEX tools getting rolled out
Academic Output
~15 publications over the course of the program
• 7 more currently under review
• 2 best paper awards
• Upcoming special issue call on knowledge construction and management
• 2 upcoming books, incl. graduate-level textbook on knowledge graphs (MIT Press, 2018)
Multiple tutorials/demonstrations at top-tier academic conferences
• Tutorials on knowledge graph construction and data mining over Web corpora/unusual domains in KDD17, ISWC17, AAAI18, WWW18
• At ISWC17, only full-day tutorial accepted; had near-capacity attendance
• Demos at ISWC17, AAAI18 (nominated for Best Demo)
• Case study at CHI18
Selected papers
• Knowledge Graphs for Social Good: An Entity-centric Search Engine for the Human Trafficking Domain (IEEE Transactions on Big Data, 2017)
• Information Extraction in Illicit Domains (WWW, 2017)
• Unsupervised Entity Resolution on Multi-type Graphs (ISWC 2016)
Broker
Rich club effect
Star cluster
Web formation
Social Science Studies
• Subjective issue
• Architecture-level evaluation
• Ablation analysis
“Ideal” Evaluation
Space of design decisions
[Diagram: Raw data; Search+GUI; remaining components marked "?"]
Domain-specific Insight Graphs (DIG)
Structured query execution on noisy data
SELECT ?ad ?ethnicity
WHERE
{
  ?ad a :Ad ;
      :hair_color 'Auburn' ;
      :review_site_id 'cg9469f' ;
      :price_per_hour '500' ;
      :name 'Claire Gold' ;
      :ethnicity ?ethnicity .
}
Query Reformulation: query 1, query 2, query 3, query 4, ..., query n
Keyword expansion • Context broadening • Constraint relaxation
Precision
Recall
Elasticsearch: 100M entities
Ranked Candidates
DIG capabilities
Aggregations
Facets
Dossier Generation
Networks
Provenance
Structured Queries
Interface Customization
• Capabilities that generic search engines like Google do not currently support
• Domain-specific
–Allows a user to specify her schema
–No prior constraints
• Insight
–Supports aggregations, network analysis, faceted search, dossiers...
• Graph
–Uses a knowledge graph representation + efficient NoSQL query reformulation
Users want situational awareness, i.e., to be equipped with actionable insights
• Advanced name matching algorithm based on machine learning, phonetic similarity, and illicit webpage-specific word embeddings
How can we tell when two actors are really one and the same?
Abbie / Abby
Candy / Kandy
Kim / Kimmy
Lea / Leah
Nicki / Nikki
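The phonetic-similarity ingredient of name matching can be illustrated with a small Python sketch. This is not the algorithm used in DIG, which the slides describe as also relying on machine learning and word embeddings; the normalization rules, the `phonetic_key` and `same_actor_candidate` names, and the 0.8 threshold are all assumptions chosen for this example, using only the standard library.

```python
import re
from difflib import SequenceMatcher


def phonetic_key(name):
    """Reduce a name to a rough phonetic key (hand-written rules)."""
    key = re.sub(r"[^a-z]", "", name.lower())
    # Treat common spelling variants of the same sound alike.
    key = key.replace("ph", "f").replace("ck", "k").replace("c", "k")
    key = re.sub(r"(ie|ey|y)$", "i", key)  # Abbie / Abby
    key = re.sub(r"h$", "", key)           # Lea / Leah
    key = re.sub(r"(.)\1+", r"\1", key)    # collapse doubled letters
    return key


def same_actor_candidate(a, b, threshold=0.8):
    """Flag two names as a possible match if their keys are similar."""
    ratio = SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()
    return ratio >= threshold


pairs = [("Abbie", "Abby"), ("Candy", "Kandy"), ("Kim", "Kimmy"),
         ("Lea", "Leah"), ("Nicki", "Nikki")]
for a, b in pairs:
    print(a, b, same_actor_candidate(a, b))
```

A rule-based key like this only proposes candidate matches; deciding that two ads really refer to the same person would still need other evidence such as phone numbers, locations, or the learned components mentioned on the slide.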
• Evaluated on five investigative domains beyond human trafficking, each with its own domain-specific needs
–Narcotics
–Counterfeit Electronics Manufacturing
–Securities Fraud
–Mail Shipment Fraud
–Illegal Weapons Sales
• User engagement was high
–Investigators were able to customize their domain in just one day, with less than an hour of training
–Investigators have expressed interest in continuing to refine and use the search engine internally
Other use-cases
Relevance score
Matching search criteria highlighted
Image extraction + face and pose analytics using deep learning
Original URL
Dossier term
Activity timeline
Co-occurrence statistics
Related ads
