Searching for Meaning:
The hidden structure in unstructured data
Trey Grainger
SVP of Engineering, Lucidworks
Southern Data Science Conference
2018.04.13
About Me
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Advisor to Presearch, the decentralized search engine
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene / Solr contributor
Based in San Francisco, offices
and employees worldwide
Fusion, the platform for building
data-driven, smart apps
Over 400 customers running our
commercial software
Consulting and support for
organizations using Solr
Produces the world’s largest open
source user conference dedicated
to Lucene/Solr
Lucidworks is the primary commercial
contributor to the Apache Solr project
Employs over 40% of the active
committers on the Solr project
Contributes over 70% of Solr's
open source codebase
Fusion powers search for the brightest companies in the world.
Searching for Meaning
most often used in
reference to
My Three Assertions
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Assertion 1:
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Southern
Data Science
Structured Data
Employees Table
id     name           company  start_date
lw100  Trey Grainger  1234     2016-02-01
dis2   Mickey Mouse   9123     1928-11-28
tsla1  Elon Musk      5678     2003-07-01
Companies Table
id    name        start_date
1234  Lucidworks  2016-02-01
5678  Tesla       1928-11-28
9123  Disney      2003-07-01
Callouts: Discrete Values, Continuous Values, Foreign Key
Unstructured Data
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018. Southern Data
Science Conference (SDSC) is being held in Atlanta
April 12-14, 2018. Trey got his master's degree from
Georgia Tech.
Unstructured Data
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his master's degree from
Georgia Tech.
Trey Grainger works at Lucidworks.
He is speaking at the SDSC 2018.
Trey’s Voicemail
Foreign Key?
Fuzzy Foreign Key? (Entity Resolution)
Fuzzier Foreign Key? (metadata, latent features)
Not so Fast!
Searching for Meaning
Giant Graph of Relationships...
Trey Grainger works for Lucidworks.
He is speaking at the SDSC 2018.
Southern Data Science Conference
(SDSC) is being held in Atlanta
April 12-14, 2018.
Trey got his master's degree from
Georgia Tech.
Trey’s Voicemail
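The "fuzzy foreign key" idea can be pictured as a toy entity-resolution pass over the text fragments on this slide. The sketch below is only an illustration: the alias table, the entity identifiers, and the document ids are hypothetical, and pronouns such as "He" are left unresolved.

```python
from collections import defaultdict
from itertools import combinations

# The three text fragments from the slide (document ids are made up).
docs = {
    "doc_a": "Trey Grainger works for Lucidworks. He is speaking at the SDSC 2018.",
    "doc_b": "Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018.",
    "doc_c": "Trey got his master's degree from Georgia Tech.",
}

# Hypothetical alias table: surface form -> canonical entity.
# A real system would resolve these with an entity-resolution model.
aliases = {
    "Trey Grainger": "person:trey_grainger",
    "Trey": "person:trey_grainger",
    "SDSC": "event:sdsc_2018",
    "Southern Data Science Conference": "event:sdsc_2018",
    "Lucidworks": "org:lucidworks",
    "Georgia Tech": "org:georgia_tech",
    "Atlanta": "place:atlanta",
}

def entities_in(text):
    """Return the canonical entities whose surface forms occur in the text."""
    return {entity for surface, entity in aliases.items() if surface in text}

# Entity -> documents that mention it (the "fuzzy foreign key").
mentions = defaultdict(set)
for doc_id, text in docs.items():
    for entity in entities_in(text):
        mentions[entity].add(doc_id)

# Edges between documents that share at least one resolved entity.
edges = defaultdict(set)
for entity, doc_ids in mentions.items():
    for left, right in combinations(sorted(doc_ids), 2):
        edges[(left, right)].add(entity)

for pair, shared in sorted(edges.items()):
    print(pair, sorted(shared))
```

Even with this crude matching, the fragments join up through shared entities rather than through an explicit key column.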
Assertion 1 (Summary):
Unstructured data is actually “hyper-
structured” data. It is a graph that
contains much more structure than
typical “structured data.”
Assertion 2:
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Semantic Data Encoded into Free Text Content
Node Type: Character Sequence
e, en, eng, engi, engineer, engineers, …
Node Type: Term
engineer, engineers, engineering, software, …
Node Type: Term Sequence
software engineer, software engineers, electrical engineering, …
Node Type: Document
id: 1
text: looking for a software engineer with degree in computer science or electrical engineering
id: 2
text: apply to be a software engineer and work with other great software engineers
id: 3
text: start a great career in electrical engineering
…
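The node types on this slide can be derived mechanically from one document. The following is a minimal sketch, assuming whitespace tokenization and two-term sequences; the helper names `character_prefixes` and `term_sequences` are hypothetical.

```python
def character_prefixes(term):
    """All leading character sequences of a term: e, en, eng, ..."""
    return [term[:i] for i in range(1, len(term) + 1)]

def term_sequences(terms, n=2):
    """All runs of n adjacent terms, joined with a space."""
    return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]

doc = {"id": 3, "text": "start a great career in electrical engineering"}
terms = doc["text"].lower().split()

# One entry per node type named on the slide.
nodes = {
    "character_sequence": character_prefixes("engineer"),
    "term": terms,
    "term_sequence": term_sequences(terms, n=2),
    "document": [doc["id"]],
}

for node_type, values in nodes.items():
    print(node_type, values)
```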
How do we easily harness this
“semantic graph” of relationships
within unstructured information?
Search engines are really good at querying
across character sequences, term sequences,
and documents
Example Queries:
Query                       Matches
c?o                         CTO, CEO, CFO, …
"VP Engineering"~2          “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
(Microsoft OR MS) AND Word  “MS Word”, “Microsoft Word”
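To make two of these query types concrete, here is a small stand-in written with the Python standard library rather than Lucene. It only imitates the single-character wildcard and the boolean query; the sample titles and phrases are made up, and the proximity query is not covered.

```python
from fnmatch import fnmatchcase

# Wildcard query c?o: "?" stands for exactly one character.
titles = ["CTO", "CEO", "CFO", "COO", "VP"]
wildcard_hits = [t for t in titles if fnmatchcase(t.lower(), "c?o")]

def terms(text):
    """Lowercased whitespace tokens of a phrase."""
    return set(text.lower().split())

def matches_ms_word(text):
    """Boolean query (Microsoft OR MS) AND Word, evaluated on term sets."""
    t = terms(text)
    return ("microsoft" in t or "ms" in t) and "word" in t

phrases = ["MS Word", "Microsoft Word", "Microsoft Excel", "Word games"]
boolean_hits = [p for p in phrases if matches_ms_word(p)]

print(wildcard_hits)
print(boolean_hits)
```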
An inverted index (“how a search engine works”)
What you SEND to Lucene/Solr:
Document  Content Field
doc1      once upon a time, in a land far, far away
doc2      the cow jumped over the moon.
doc3      the quick brown fox jumped over the lazy dog.
doc4      the cat in the hat
doc5      The brown cow said “moo” once.
…         …
How the content is INDEXED into Lucene/Solr (conceptually):
Term   Documents
a      doc1 [2x]
brown  doc3 [1x], doc5 [1x]
cat    doc4 [1x]
cow    doc2 [1x], doc5 [1x]
…      …
once   doc1 [1x], doc5 [1x]
over   doc2 [1x], doc3 [1x]
the    doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…      …
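The conceptual index above can be reproduced in a few lines. This is a minimal sketch, assuming lowercasing and a simple word-character tokenizer, which is far simpler than a real Lucene analysis chain.

```python
import re
from collections import Counter, defaultdict

# The five documents sent to the engine on this slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}

def tokenize(text):
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

# term -> {doc id -> number of occurrences in that doc}
inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, count in Counter(tokenize(text)).items():
        inverted_index[term][doc_id] = count

for term in ("a", "brown", "once", "the"):
    print(term, inverted_index[term])
```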
Matching queries to documents
/solr/collection/select/?q=apache solr
Term    Documents
…       …
apache  doc1, doc3, doc4, doc5
…       …
hadoop  doc2, doc4, doc6
…       …
solr    doc1, doc3, doc4, doc7, doc8
…       …
Venn diagram of the result sets:
apache only: doc5
solr only: doc7, doc8
apache solr (both): doc1, doc3, doc4
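Matching then reduces to set operations over the postings lists. The sketch below uses the postings shown on this slide; the function names `match_any` and `match_all` are hypothetical, and it covers neither query parsing nor ranking.

```python
# Postings from the slide: term -> set of matching documents.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}

def match_any(query):
    """Documents containing at least one query term (OR semantics)."""
    result = set()
    for term in query.lower().split():
        result |= postings.get(term, set())
    return result

def match_all(query):
    """Documents containing every query term (AND semantics)."""
    term_sets = [postings.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(sorted(match_all("apache solr")))
print(sorted(match_any("apache solr")))
```

The intersection is the overlapping region of the Venn diagram; the union is everything inside either circle.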
Search engines also do relevancy ranking (query to doc)
$$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
Where:
t = term; d = document; q = query; i = index
$\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$
$|d| = \sum_{t \in d} 1$
$\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}$
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
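The scoring formula can be traced with a short function. This sketch follows the definitions given on this slide (square-root tf, the 1 + log idf), which are not necessarily what any particular Lucene version computes; the function name `score` and the tiny corpus are made up for illustration.

```python
import math

def score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one document against a query using the slide's definitions."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs  # average document length
    total = 0.0
    for t in query_terms:
        occurrences = doc_terms.count(t)
        if occurrences == 0:
            continue  # a term absent from the document contributes nothing
        tf = math.sqrt(occurrences)
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        length_norm = 1 - b + b * len(doc_terms) / avgdl
        total += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return total

# Hypothetical corpus of already-tokenized documents.
corpus = [
    ["solr", "search"],
    ["solr", "search", "engine", "for", "apache", "lucene"],
    ["hadoop", "cluster"],
]
scores = [score(["solr"], d, corpus) for d in corpus]
print(scores)
```

With equal term counts, the shorter document scores higher because b penalizes documents longer than the average.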
DOI: 10.1109/DSAA.2016.51
Conference: 2016 IEEE International Conference on
Data Science and Advanced Analytics (DSAA)
What is the Semantic Knowledge Graph?
• “A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”
• A multi-dimensional term-to-term (vs. term-to-document) search engine
• A tool which enables knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems
• It’s kind of like Word2Vec, but vectors (or matrices) are generated on the fly and are better suited for interpreting the nuanced intent of typical search queries
Open Sourced!
Knowledge
Graph
Documents
id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy
id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate
Docs-Terms Forward Index
field      doc  terms
desc       1    a, at, company, engineer, great, software
desc       2    a, at, doing, hard, hospital, nurse, registered, work
desc       3    a, doing, engineer, java, or, software, work
job_title  1    Software Engineer
…          …    …
Terms-Docs Inverted Index
field      term            postings list (doc: pos)
desc       a               1: 4; 2: 1; 3: 1, 5
desc       at              1: 3; 2: 4
desc       company         1: 6
desc       doing           2: 6; 3: 8
desc       engineer        1: 2; 3: 3, 7
desc       great           1: 5
desc       hard            2: 7
desc       hospital        2: 5
desc       java            3: 6
desc       nurse           2: 3
desc       or              3: 4
desc       registered      2: 2
desc       software        1: 1; 3: 2
desc       work            2: 8; 3: 9
job_title  java developer  3: 1
…          …               …
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
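Both index structures for the desc field can be built in one pass over the documents. This is a minimal sketch, assuming whitespace tokenization and 1-based positions; only the desc field is indexed here.

```python
from collections import defaultdict

# The desc field of the three example documents.
docs = {
    1: "software engineer at a great company",
    2: "a registered nurse at hospital doing hard work",
    3: "a software engineer or a java engineer doing work",
}

forward_index = {}                  # doc id -> sorted unique terms
inverted_index = defaultdict(dict)  # term -> {doc id -> [positions]}

for doc_id, desc in docs.items():
    tokens = desc.split()
    forward_index[doc_id] = sorted(set(tokens))
    for position, term in enumerate(tokens, start=1):
        inverted_index[term].setdefault(doc_id, []).append(position)

print(forward_index[1])
print(inverted_index["engineer"])
```

The forward index answers "which terms are in this document?", while the inverted index answers "which documents, and where, contain this term?".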
How the Graph Traversal Works
[Figure: three views of the same data (Set-theory View, Graph View, Data Structure View). The nodes skill: Java, skill: Scala, skill: Hibernate, and skill: Oncology are linked through the documents doc 1 to doc 6 that contain them. Data Structure View labels as extracted: Java, Scala, Hibernate; docs 1, 2, 6; docs 3, 4; Oncology; doc 5.]
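A single traversal hop can be sketched as an inverted-index lookup (term to documents) followed by a forward-index lookup (documents to terms). The assignment of skills to documents below is an assumption made for illustration, since the figure itself did not survive extraction.

```python
from collections import Counter, defaultdict

# Hypothetical forward index: doc id -> skills listed in that doc.
forward = {
    1: {"java"},
    2: {"java"},
    3: {"java", "scala", "hibernate"},
    4: {"java", "scala", "hibernate"},
    5: {"oncology"},
    6: {"java"},
}

# Inverted index derived from it: skill -> doc ids.
inverted = defaultdict(set)
for doc_id, skills in forward.items():
    for skill in skills:
        inverted[skill].add(doc_id)

def related_skills(skill):
    """Hop skill -> docs -> other skills, counting shared documents."""
    counts = Counter()
    for doc_id in inverted[skill]:      # inverted index lookup
        for other in forward[doc_id]:   # forward index lookup
            if other != skill:
                counts[other] += 1
    return counts

print(related_skills("java"))
print(related_skills("oncology"))
```

The edge between two nodes is never stored; it is materialized at query time from the documents the two terms share.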
Multi-level Traversal
[Figure: Data Structure View and Graph View. In the Data Structure View, skill nodes (Java, Scala, Hibernate, Oncology) connect through doc 1 to doc 6 to job_title nodes (Software Engineer, Data Scientist, Java Developer, …), with each hop made of an Inverted Index Lookup and a Forward Index Lookup. In the Graph View, Java connects to Hibernate, Scala, Java Developer, Software Engineer, and Data Scientist, with edges labeled has_related_job_title.]
Scoring of Node Relationships (Edge Weights)
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
$$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$$
Foreground Query: "Hadoop"
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
(Values run from most related, +, at the top to least related, -, at the bottom.)
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
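The z-score can be computed directly from four counts. The sketch below implements only the z formula on this slide; the relatedness values in the JSON lie between -1 and 1, so they presumably involve a further normalization step that is not shown here. The function name and the counts are made up for illustration.

```python
import math

def relatedness_z(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """z-score of a term's foreground count against its background rate."""
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        return 0.0  # term appears in none or all background docs
    return (count_fg - expected) / std_dev

# Hypothetical counts: 100 foreground docs, 10,000 background docs,
# and a term found in 500 background docs (background rate 5%).
over_represented = relatedness_z(40, 100, 500, 10_000)
as_expected = relatedness_z(5, 100, 500, 10_000)
under_represented = relatedness_z(0, 100, 500, 10_000)

print(over_represented, as_expected, under_represented)
```

A term that shows up in the foreground exactly as often as its background rate predicts scores zero, which is why background-typical terms drop out.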
Multi-level Graph Traversal with Scores
[Figure: scored traversal from the Starting Node software engineer* (materialized node) along has_related_skill edges to Skill Nodes (Java, C#, .NET, Hibernate, Scala, VB.NET) and along has_related_job_title edges to Job Title Nodes (.NET Developer, Java Developer, Software Engineer, Data Scientist). Edge weights in the figure range from 0.34 to 0.93.]
Related term vector (for query concept expansion)
http://localhost:8983/solr/stack-exchange-health/skg
Who’s in Love with Jean Grey?
Assertion 2 (Summary):
That graph is very rich, but is a
compression of meaning into a lossy
format. Much of data science is
essentially the decompression from
this lossy format into a reconstituted
form.
Assertion 3:
Every instance of a word or phrase you
ever encounter has a unique meaning.
Thought Exercise
What do you think of when I say the
word “driver”?
Ambiguity
Example    Related Keywords (representing multiple meanings)
driver     truck driver, linux, windows, courier, embedded, cdl, delivery
architect  autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…          …
Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
Use Case: Query Disambiguation
Disambiguated meanings (represented as term vectors)
Example    Related Keywords (Disambiguated Meanings)
architect  1: enterprise architect, java architect, data architect, oracle, java, .net
           2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver     1: linux, windows, embedded
           2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer   1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
           2: graphic, web designer, design, web design, graphic design, graphic designer
           3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…          …
Using the disambiguated meanings
In a situation where a user searches for an ambiguous phrase, what information can we
use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”
2. Context within the query:
User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
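The three rules above can be sketched as a small selection function: context terms from the user profile or the query are compared with each meaning's term vector, and the most common meaning is the fallback. The sense frequencies and the function name `pick_sense` are hypothetical.

```python
# The two disambiguated meanings of "driver" from the slide.
driver_senses = {
    1: {"linux", "windows", "embedded"},
    2: {"truck driver", "cdl driver", "delivery driver",
        "class b driver", "cdl", "courier"},
}

# Hypothetical share of searches per meaning, used only as the fallback.
sense_frequency = {1: 0.4, 2: 0.6}

def pick_sense(context_terms, senses, frequency):
    """Choose the sense with the largest overlap with the context terms."""
    overlaps = {sid: len(terms & context_terms) for sid, terms in senses.items()}
    if max(overlaps.values()) == 0:
        # Rule 3: no usable context, fall back to the most common meaning.
        return max(frequency, key=frequency.get)
    return max(overlaps, key=overlaps.get)

# Rule 1: prior searches for "c++" and "linux" point to the software sense.
print(pick_sense({"c++", "linux"}, driver_senses, sense_frequency))
# Rule 2: context inside the query itself.
print(pick_sense({"courier"}, driver_senses, sense_frequency))
# Rule 3: no context at all.
print(pick_sense(set(), driver_senses, sense_frequency))
```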
Thought Exercise
What do you think of when I say the
word “Apple”?
Every term or phrase is a
context-dependent cluster of
meaning with an ambiguous label
What does “love” mean?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “hug”?
http://localhost:8983/solr/thesaurus/skg
What does “love” mean in the context of “child”?
http://localhost:8983/solr/thesaurus/skg
My Three Assertions (Recap)
1) Unstructured data is actually “hyper-structured” data. It is a
graph that contains much more structure than typical “structured
data.”
2) That graph is very rich, but is a compression of meaning into a
lossy format. Much of data science is essentially the
decompression from this lossy format into a reconstituted form.
3) Most Important: Every instance of a word or phrase you ever
encounter has a unique meaning.
Why do we care?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
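The expansion step can be sketched as string assembly over a table of related terms. In this sketch the table is hard-coded and the helper name `expand` is hypothetical; in practice the related terms would come from something like the knowledge graph traversal, and the geographic filter clause is left out.

```python
def expand(phrase, related, boost=10):
    """Build one boosted OR group: the original phrase plus related terms."""
    def quote(text):
        # Multi-word phrases must be quoted to stay together.
        return f'"{text}"' if " " in text or "," in text else text
    clauses = [f"{quote(phrase)}^{boost}"] + [quote(r) for r in related]
    return "(" + " OR ".join(clauses) + ")"

# Related terms as shown in the expanded query on this slide.
expansions = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}

query = " AND ".join(expand(p, r) for p, r in expansions.items())
print(query)
```

The boost on the original phrase keeps documents that match the user's own words ranked above documents that match only the expansions.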
Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Other presentations:
http://www.treygrainger.com
Discount code: ctwsdsc18

Searching for Meaning

  • 1. Searching for Meaning: The hidden structure in unstructured data Trey Grainger SVP of Engineering, Lucidworks Southern Data Science Conference 2018.04.13
  • 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Information Retrieval & Web Search - Stanford University Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Advisor to Presearch, the decentralized search engine • Founder of Celiaccess.com, the gluten-free search engine • Lucene / Solr contributor About Me
  • 3. Based in San Francisco, offices and employees worldwide Fusion, the platform for building data-driven, smart apps Over 400 customers running our commercial software Consulting and support for organizations using Solr Produces the world’s largest open source user conference dedicated to Lucene/Solr Lucidworks is the primary commercial contributor to the Apache Solr project Employs over 40% of the active committers on the Solr project Contributes over 70% of Solr's open source codebase 40% 70%
  • 4. Fusion powers search for the brightest companies in the world.
  • 6. most often used in reference to
  • 7. My Three Assertions 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  • 8. Assertion 1: Unstructured data is actually “hyper- structured” data. It is a graph that contains much more structure than typical “structured data.” Southern Data Science
  • 9. Structured Data Employees Table id name company start_date lw100 Trey Grainger 1234 2016-02-01 dis2 Mickey Mouse 9123 1928-11-28 tsla1 Elon Musk 5678 2003-07-01 Companies Table id name start_date 1234 Lucidworks 2016-02-01 5678 Tesla 1928-11-28 9123 Disney 2003-07-01 Discrete Values Continuous Values Foreign Key Southern Data Science
  • 10. Unstructured Data Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters from Georgia Tech. Southern Data Science
  • 11. Unstructured Data Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Southern Data Science Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Trey’s Voicemail
  • 12. Foreign Key? Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Southern Data Science Trey’s Voicemail
  • 13. Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Fuzzy Foreign Key? (Entity Resolution) Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Southern Data Science Trey’s Voicemail
  • 14. Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Fuzzier Foreign Key? (metadata, latent features) Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Southern Data Science
  • 15. Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Fuzzier Foreign Key? (metadata, latent features) Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Southern Data Science Not so Fast!
  • 18. Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his masters degree from Georgia Tech. Southern Data Science Trey’s Voicemail
  • 19. Assertion 1 (Summary): Unstructured data is actually “hyper- structured” data. It is a graph that contains much more structure than typical “structured data.” Southern Data Science
  • 20. Assertion 2: That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. Southern Data Science
  • 21. Southern Data Science 01 Semantic Data Encoded into Free Text Content e en eng engi engineer engineers engineer engineersNode Type: Term software engineer software engineers electrical engineering engineer engineering software … … … Node Type: Character Sequence Node Type: Term Sequence Node Type: Document id: 1 text: looking for a software engineerwith degree in computer science or electrical engineering id: 2 text: apply to be a software engineer and work with other great software engineers id: 3 text: start a great careerin electrical engineering … …
  • 22. How do we easily harness this “semantic graph” of relationships within unstructured information?
  • 23. Search engines are really good at querying across character sequences, term sequences, and documents. Example queries:
    c?o matches “CTO”, “CEO”, “CFO”, …
    "VP Engineering"~2 matches “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
    (Microsoft OR MS) AND Word matches “MS Word”, “Microsoft Word”
  • 24. An inverted index (“how a search engine works”)
    What you SEND to Lucene/Solr:
      Document   Content Field
      doc1       once upon a time, in a land far, far away
      doc2       the cow jumped over the moon.
      doc3       the quick brown fox jumped over the lazy dog.
      doc4       the cat in the hat
      doc5       The brown cow said “moo” once.
      …          …
    How the content is INDEXED into Lucene/Solr (conceptually):
      Term    Documents
      a       doc1 [2x]
      brown   doc3 [1x], doc5 [1x]
      cat     doc4 [1x]
      cow     doc2 [1x], doc5 [1x]
      …       …
      once    doc1 [1x], doc5 [1x]
      over    doc2 [1x], doc3 [1x]
      the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
      …       …
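A minimal sketch of how the conceptual index on this slide could be built from the five example documents. The tokenizer here (lowercase, letters only) is an assumption for illustration and is far simpler than a real Lucene analyzer.

```python
import re
from collections import defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


inverted_index = build_inverted_index(docs)
for term in ("a", "brown", "cow", "once", "the"):
    print(term, inverted_index[term])
```

Looking up a term is then a single dictionary access, regardless of how many documents were indexed.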
  • 25. /solr/collection/select/?q=apache solr Term Documents … … apache doc1, doc3, doc4, doc5 … hadoop doc2, doc4, doc6 … … solr doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 solr apache apache solr Matching queries to documents Southern Data Science
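Matching a query against the index amounts to set operations over postings lists. The sketch below uses only the document ids visible on the slide (the slide elides the rest with "…"), and the `match` helper is a hypothetical name, not a Solr API.

```python
# Postings lists as shown on the slide (truncated entries omitted).
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the ids of documents matching all (AND) or any (OR) terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


print(sorted(match(["apache", "solr"], "AND")))
print(sorted(match(["apache", "solr"], "OR")))
```

Whether `q=apache solr` behaves as AND or OR depends on the configured default operator; the intersection corresponds to the overlapping region of the diagram.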
  • 26. Search engines also do relevancy ranking (query to doc)
    $\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \dfrac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \dfrac{|d|}{\mathrm{avgdl}}\right)}$
    Where:
      t = term; d = document; q = query; i = index
      $\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
      $\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$
      $|d| = \sum_{t \in d} 1$
      $\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$
      k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
      b = Free parameter. Usually ~0.75. Increases impact of document normalization.
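The scoring formula can be written out directly in Python. This sketch follows the definitions given on the slide (square-root term frequency, `1 + log` idf), which differ slightly from other published BM25 variants; the function names and the tiny corpus are illustrative assumptions.

```python
import math


def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), as defined on the slide."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Sum the per-term contributions for one document.

    corpus is a list of tokenized documents; doc_terms is one of them.
    """
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs
    length_norm = 1 - b + b * len(doc_terms) / avgdl
    total = 0.0
    for term in query_terms:
        tf = math.sqrt(doc_terms.count(term))
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        doc_freq = sum(1 for d in corpus if term in d)
        total += idf(num_docs, doc_freq) * (tf * (k + 1)) / (tf + k * length_norm)
    return total


corpus = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "cow", "jumped", "over", "the", "moon"],
    ["the", "brown", "cow", "said", "moo", "once"],
]

for doc in corpus:
    print(" ".join(doc), "->", round(score(["brown", "cow"], doc, corpus), 4))
```

Raising `k` lets repeated occurrences keep adding to the score for longer, and raising `b` penalizes documents that are longer than average more strongly.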
  • 27. DOI: 10.1109/DSAA.2016.51 Conference: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
  • 28. What is the Semantic Knowledge Graph? • “A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain” • A multi-dimensional term-to-term (vs. term-to-document) search engine • A tool which enables knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems • It’s kind of like Word2Vec, but vectors (or matrices) are generated on the fly and are better suited for interpreting the nuanced intent of typical search queries
  • 32. Knowledge Graph: Documents, Docs-Terms Forward Index, and Terms-Docs Inverted Index
    Documents:
      id: 1  job_title: Software Engineer  desc: software engineer at a great company  skills: .Net, C#, java
      id: 2  job_title: Registered Nurse  desc: a registered nurse at hospital doing hard work  skills: oncology, phlebotomy
      id: 3  job_title: Java Developer  desc: a software engineer or a java engineer doing work  skills: java, scala, hibernate
    Docs-Terms Forward Index:
      field      doc  term
      desc       1    a, at, company, engineer, great, software
      desc       2    a, at, doing, hard, hospital, nurse, registered, work
      desc       3    a, doing, engineer, java, or, software, work
      job_title  1    Software Engineer
      …          …    …
    Terms-Docs Inverted Index:
      field      term            postings list (doc: pos)
      desc       a               1: 4;  2: 1;  3: 1, 5
      desc       at              1: 3;  2: 4
      desc       company         1: 6
      desc       doing           2: 6;  3: 8
      desc       engineer        1: 2;  3: 3, 7
      desc       great           1: 5
      desc       hard            2: 7
      desc       hospital        2: 5
      desc       java            3: 6
      desc       nurse           2: 3
      desc       or              3: 4
      desc       registered      2: 2
      desc       software        1: 1;  3: 2
      desc       work            2: 10;  3: 9
      job_title  java developer  3: 1
      …          …               …
    Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  • 33. Southern Data Science Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Set-theory View Graph View How the Graph Traversal Works skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology Data Structure View Java Scala Hibernate docs 1, 2, 6 docs 3, 4 Oncology doc 5
  • 34. Southern Data Science Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Multi-level Traversal Data Structure View Graph View doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 job_title: Software Engineer job_title: Data Scientist job_title: Java Developer …… Inverted Index Lookup Forward Index Lookup Forward Index Lookup Inverted Index Lookup Java Java Developer Hibernate Scala Software Engineer Data Scientist has_related_job_title has_related_job_title
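The traversal described on these slides alternates between the two indexes: an inverted-index lookup turns a node into a set of documents, and a forward-index lookup turns those documents into the related nodes of another field. A minimal sketch, assuming the three job documents from the earlier slide; the `traverse` helper and the dictionary layout are illustrative, not the actual Solr data structures.

```python
from collections import defaultdict

documents = {
    1: {"job_title": "Software Engineer", "skills": [".Net", "C#", "java"]},
    2: {"job_title": "Registered Nurse", "skills": ["oncology", "phlebotomy"]},
    3: {"job_title": "Java Developer", "skills": ["java", "scala", "hibernate"]},
}

inverted = defaultdict(set)  # (field, value) -> ids of docs containing it
forward = defaultdict(set)   # (doc id, field) -> values stored in that doc

for doc_id, fields in documents.items():
    for field, raw in fields.items():
        values = raw if isinstance(raw, list) else [raw]
        for value in values:
            value = value.lower()
            inverted[(field, value)].add(doc_id)
            forward[(doc_id, field)].add(value)


def traverse(field, value, target_field):
    """One hop: node -> documents (inverted index) -> related nodes (forward index)."""
    related = set()
    for doc_id in inverted.get((field, value.lower()), set()):
        related |= forward.get((doc_id, target_field), set())
    return related


# Hop 1: skill -> job titles. Hop 2: each job title -> skills.
titles = traverse("skills", "java", "job_title")
print(sorted(titles))
for title in sorted(titles):
    print(title, "->", sorted(traverse("job_title", title, "skills")))
```

Chaining hops in this way gives the multi-level traversal, with each edge materialized at query time from the set of documents the two nodes share.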
  • 35. Knowledge Graph: Scoring of Node Relationships (Edge Weights)
    Foreground vs. Background Analysis: every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.
    $z = \dfrac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$
    Foreground Query: "Hadoop"
      { "type": "keywords", "values": [
        { "value": "hive", "relatedness": 0.9773, "popularity": 369 },
        { "value": "java", "relatedness": 0.9236, "popularity": 15653 },
        { "value": ".net", "relatedness": 0.5294, "popularity": 17683 },
        { "value": "bee", "relatedness": 0.0, "popularity": 0 },
        { "value": "teacher", "relatedness": -0.2380, "popularity": 9923 },
        { "value": "registered nurse", "relatedness": -0.3802, "popularity": 27089 }
      ] }
    We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
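The z-score on this slide can be computed from four counts. This is a sketch under stated assumptions: `probBG(x)` is taken to be the fraction of background documents containing x, the example counts are made up for illustration, and the step that maps z onto the bounded `relatedness` values in the JSON output is not shown on the slide and is not reproduced here.

```python
import math


def z_score(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """Compare a term's foreground count with what the background predicts."""
    prob_bg = count_bg / total_docs_bg
    expected_fg = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        return 0.0  # term never (or always) occurs in the background
    return (count_fg - expected_fg) / std_dev


# Hypothetical counts: 100 foreground docs, 100,000 background docs,
# and a term that appears in 1,000 of the background docs.
print(z_score(50, 100, 1_000, 100_000))  # far more common in the foreground
print(z_score(1, 100, 1_000, 100_000))   # about as common as expected
print(z_score(0, 100, 1_000, 100_000))   # less common than expected
```

Positive scores mark terms over-represented in the foreground, scores near zero mark terms that are equally likely in both, and negative scores mark terms the foreground tends to exclude.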
  • 36. Knowledge Graph: Multi-level Graph Traversal with Scores. Starting Node: software engineer* (materialized node). Levels: Skill Nodes (has_related_skill), Skill Nodes (has_related_skill), Job Title Nodes (has_related_job_title). Nodes: Java, C#, .NET, .NET Developer, Java Developer, Hibernate, Scala, VB.NET, Software Engineer, Data Scientist. Edge weights (as listed in the figure): 0.90, 0.88, 0.93, 0.93, 0.34, 0.74, 0.91, 0.89, 0.74, 0.89, 0.78, 0.72, 0.48, 0.93, 0.76, 0.83, 0.80, 0.64, 0.61, 0.78, 0.55. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  • 37. Related term vector (for query concept expansion) http://localhost:8983/solr/stack-exchange-health/skg
  • 38. Southern Data Science Who’s in Love with Jean Grey?
  • 39. Assertion 2 (Summary): That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. Southern Data Science
  • 40. Assertion 3: Every instance of a word or phrase you ever encounter has a unique meaning. Southern Data Science
  • 41. Thought Exercise What do you think of when I say the word “driver”? Southern Data Science
  • 42. Ambiguity Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Southern Data Science
  • 43. Use Case: Query Disambiguation Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Southern Data Science
  • 44. Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Southern Data Science
  • 45. Using the disambiguated meanings In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning? 1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux” 2. Context within the query: User searched for windows AND driver vs. courier OR driver 3. If all else fails (and there is no context), use the most commonly occurring meaning. driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015. Southern Data Science
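The three-step selection above can be sketched as an overlap test between each candidate meaning and whatever context is available. The sense vectors are copied from the slide; the `pick_sense` helper, the overlap scoring, and treating the first listed sense as the most common one are all assumptions for illustration.

```python
# Disambiguated meanings of "driver", as listed on the slide.
senses = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms, default_index=0):
    """Choose the sense sharing the most terms with the context.

    context_terms may come from the query itself or from what is known
    about the user (profile, earlier searches). With no overlap at all,
    fall back to the sense at default_index, assumed to be the most
    commonly occurring meaning.
    """
    candidates = senses[term]
    context = {c.lower() for c in context_terms}
    overlaps = [len(sense & context) for sense in candidates]
    best = max(overlaps)
    if best == 0:
        return candidates[default_index]
    return candidates[overlaps.index(best)]


print(sorted(pick_sense("driver", ["windows"])))        # query context
print(sorted(pick_sense("driver", ["courier"])))        # query context
print(sorted(pick_sense("driver", ["c++", "linux"])))   # user's past searches
print(sorted(pick_sense("driver", [])))                 # no context: default
```

A production system would likely weight the context sources differently, for example trusting terms in the current query more than older search history.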
  • 46. Thought Exercise What do you think of when I say the word “Apple”? Southern Data Science
  • 47. Every term or phrase is a Context-dependent cluster of meaning with an ambiguous label Southern Data Science
  • 49. What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  • 50. What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg
  • 51. What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  • 52. My Three Assertions (Recap) 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  • 53. Why do we care?
    User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java
    Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
    Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
    Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
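The last step, going from semantically parsed phrases to the expanded query, can be sketched as string assembly. The related terms below are copied from the slide; in practice they would come from a knowledge-graph lookup. The `expand` and `quote` helpers are hypothetical, and the geographic filter clause is left out for brevity.

```python
# Related terms per parsed phrase, taken from the slide's expanded query.
expansions = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(phrase):
    """Wrap multi-word or punctuated phrases in double quotes."""
    return f'"{phrase}"' if " " in phrase or "&" in phrase else phrase


def expand(phrases, boost=10):
    """Build one boosted OR-clause per phrase and AND the clauses together."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]
        alternatives += [quote(alt) for alt in expansions.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


parsed = ["machine learning", "research and development",
          "software engineer", "hadoop", "java"]
print(expand(parsed))
```

Boosting the original phrase keeps exact matches ranked above documents that match only through a related term.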
  • 54. Contact Info Trey Grainger [email protected] @treygrainger http://solrinaction.com Other presentations: http://www.treygrainger.com Discount code: ctwsdsc18