NLP Structured Data Investigation on Non-Text
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) NLP Structured Data Investigation on Non-Text 2016
Table of Contents
Preliminaries
Borrowing from NLP
Demo
Questions
Introduction
Hi, I’m Casey Stella!
Domain Challenges in Data Science
A data scientist has to merge analytical skills with domain expertise.
• Often we’re thrown into places where we have insufficient domain experience.
• Gaining this expertise can be challenging and time-consuming.
• Unsupervised machine learning techniques can be very useful to understand complex
data relationships.
We’ll use an unsupervised structure learning algorithm borrowed from NLP to look at
medical data.
Word2Vec
Word2Vec is a vectorization model created by Google [1] that attempts to learn
relationships between words automatically given a large corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in the vector space
with cosine similarity.
• Uses a neural network to learn vector representations.
• Work by Pennington, Socher, and Manning [2] shows that the word2vec model is
roughly equivalent to weighting a word co-occurrence matrix by window distance and
lowering its dimension by matrix factorization.
Takeaway: The technique boils down, intuitively, to a riff on word co-occurrence. See
here¹ for more.
¹ http://radimrehurek.com/2014/12/making-sense-of-word2vec/
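To make the co-occurrence intuition concrete, here is a minimal sketch in plain Python. It is not word2vec itself: it builds a co-occurrence table in which a pair of words d positions apart contributes 1/d (one possible window-distance weighting, chosen here as an assumption), skips the matrix factorization step entirely, and ranks neighbors by cosine similarity on the raw rows. The toy corpus and the function names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def cooccurrence(sentences, window=2):
    """Weighted co-occurrence counts: a pair d tokens apart adds 1/d."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[word][tokens[j]] += 1.0 / abs(i - j)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(counts, word, k=3):
    """Return the k words whose co-occurrence rows are closest to `word`."""
    target = counts.get(word, {})
    scores = [(other, cosine(target, vec))
              for other, vec in counts.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus (illustrative only): "cat" and "dog" appear in similar contexts.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "mice"],
    ["the", "dog", "chases", "cats"],
]
counts = cooccurrence(sentences, window=2)
print(nearest(counts, "cat", k=3))
```

Words that share contexts end up with similar rows, so "dog" ranks as the nearest neighbor of "cat" even though the two never appear in the same sentence. A factorization step would compress these sparse rows into short dense vectors.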
Clinical Data as Sentences
Clinical encounters form a sort of sentence over time. For a given encounter:
• Vitals are measured (e.g. height, weight, BMI).
• Labs are performed and results are recorded (e.g. blood tests).
• Procedures are performed.
• Diagnoses are made (e.g. Diabetes).
• Drugs are prescribed.
Each of these can be considered clinical “words” and the encounter forms a clinical
“sentence”.
Idea: We can use word2vec to investigate connections between these clinical concepts.
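As a rough sketch of what turning encounters into "sentences" could look like, the snippet below groups flattened event records by encounter and orders them into token lists. The record layout (encounter id, sequence number, kind, value), the `kind:value` token format, and the sample values are all assumptions made for illustration; they are not the schema of the dataset used in the talk.

```python
from collections import defaultdict

# Hypothetical flattened event records: (encounter_id, sequence, kind, value).
# Values are invented for illustration and deliberately listed out of order.
events = [
    ("enc-1", 3, "diagnosis", "diabetes"),
    ("enc-1", 1, "vital", "bmi_high"),
    ("enc-2", 2, "diagnosis", "hypertension"),
    ("enc-1", 4, "drug", "metformin"),
    ("enc-1", 2, "lab", "hba1c_elevated"),
    ("enc-2", 1, "vital", "bp_high"),
    ("enc-2", 3, "drug", "lisinopril"),
]


def to_sentences(events):
    """Group events by encounter and order them into clinical 'sentences'."""
    grouped = defaultdict(list)
    for encounter_id, seq, kind, value in events:
        # Each clinical "word" keeps its kind so that, for example, a drug
        # and a diagnosis with the same name stay distinct tokens.
        grouped[encounter_id].append((seq, f"{kind}:{value}"))
    return {
        encounter_id: [token for _, token in sorted(items)]
        for encounter_id, items in grouped.items()
    }


for encounter_id, sentence in to_sentences(events).items():
    print(encounter_id, " ".join(sentence))
```

Each resulting token list plays the role of one sentence in a word2vec training corpus.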
Demo
As part of a Kaggle competition², Practice Fusion, a digital electronic medical records
provider, released depersonalized clinical records of 10,000 patients. I ingested and
preprocessed these records into 197,340 clinical “sentences” using Pig and Hive.
MLlib from Spark now contains an implementation of word2vec, so let’s use PySpark
and IPython Notebook to explore this dataset on Hadoop.
² https://www.kaggle.com/c/pf2012-diabetes
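The notebook itself is not reproduced in these slides, so the following is only a sketch of how such an exploration might start with the RDD-based `pyspark.mllib.feature.Word2Vec` API. It assumes one whitespace-delimited clinical sentence per line of input; the helper names, the default parameter values, and the input path are hypothetical.

```python
def parse_sentence(line):
    """Split one whitespace-delimited line into a list of clinical tokens."""
    return line.strip().split()


def train_word2vec(sc, path, vector_size=100, min_count=5, seed=42):
    """Fit MLlib's Word2Vec on a text file with one sentence per line.

    `sc` is an existing SparkContext; `path` is a hypothetical input location.
    """
    # Imported here so parse_sentence can be used without Spark installed.
    from pyspark.mllib.feature import Word2Vec

    sentences = (sc.textFile(path)
                 .map(parse_sentence)
                 .filter(lambda tokens: len(tokens) > 0))
    return (Word2Vec()
            .setVectorSize(vector_size)
            .setMinCount(min_count)
            .setSeed(seed)
            .fit(sentences))


def similar_concepts(model, token, n=10):
    """Return (token, cosine similarity) pairs for the n nearest tokens."""
    return list(model.findSynonyms(token, n))
```

In a notebook with a live SparkContext `sc`, a call such as `model = train_word2vec(sc, "hdfs:///path/to/sentences")` followed by `similar_concepts(model, some_token)` would list the clinical concepts closest to a token present in the vocabulary. Both the path and the token are placeholders here.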
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk are available on my GitHub presentation page.³
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
³ http://github.com/cestella/presentations/
Bibliography
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global
vectors for word representation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association
for Computational Linguistics, 2014.
Casey Stella@casey_stella (Hortonworks) NLP Structured Data Investigation on Non-Text 2016
Ad

More Related Content

What's hot (14)

Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth
 
Dt35682686
Dt35682686Dt35682686
Dt35682686
IJERA Editor
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Neo4j
 
Words, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic AnalysisWords, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic Analysis
Ray Poynter
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
Merce Crosas
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
Karry Lu
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
c.titus.brown
 
Detecting word substitution in text
Detecting word substitution in textDetecting word substitution in text
Detecting word substitution in text
abhishek13bansal
 
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use CasesFrom Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
Neo4j
 
Link Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked DataLink Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked Data
Michel Dumontier
 
BIOMAG2018 - Denis Engemann - MNE-HCP
BIOMAG2018 - Denis Engemann - MNE-HCPBIOMAG2018 - Denis Engemann - MNE-HCP
BIOMAG2018 - Denis Engemann - MNE-HCP
Robert Oostenveld
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
Greg Landrum
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013
Sean Connolly
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Paul Groth
 
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph DatabaseAnalyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Analyzing Perturbed Co-Expression Networks in Cancer Using a Graph Database
Neo4j
 
Words, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic AnalysisWords, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic Analysis
Ray Poynter
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
Merce Crosas
 
ODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For GoodODSC East 2017: Data Science Models For Good
ODSC East 2017: Data Science Models For Good
Karry Lu
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
c.titus.brown
 
Detecting word substitution in text
Detecting word substitution in textDetecting word substitution in text
Detecting word substitution in text
abhishek13bansal
 
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use CasesFrom Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
From Queries to Algorithms to Advanced ML: 3 Pharmaceutical Graph Use Cases
Neo4j
 
Link Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked DataLink Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked Data
Michel Dumontier
 
BIOMAG2018 - Denis Engemann - MNE-HCP
BIOMAG2018 - Denis Engemann - MNE-HCPBIOMAG2018 - Denis Engemann - MNE-HCP
BIOMAG2018 - Denis Engemann - MNE-HCP
Robert Oostenveld
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
Greg Landrum
 
Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013Return to the Materials Digital Humanities Conference 2013
Return to the Materials Digital Humanities Conference 2013
Sean Connolly
 

Viewers also liked (10)

LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
DataWorks Summit/Hadoop Summit
 
The Elephant in the Clouds
The Elephant in the CloudsThe Elephant in the Clouds
The Elephant in the Clouds
DataWorks Summit/Hadoop Summit
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.
DataWorks Summit/Hadoop Summit
 
Comparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBaseComparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBase
DataWorks Summit/Hadoop Summit
 
Ad

Similar to NLP Structured Data Investigation on Non-Text (20)

NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
Spark Summit
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
Artificial Intelligence Institute at UofSC
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
Carole Goble
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Open IE tutorial 2018
Open IE tutorial 2018Open IE tutorial 2018
Open IE tutorial 2018
Andre Freitas
 
Data Preparation for Data Science
Data Preparation for Data ScienceData Preparation for Data Science
Data Preparation for Data Science
DataWorks Summit/Hadoop Summit
 
Data Preparation of Data Science
Data Preparation of Data ScienceData Preparation of Data Science
Data Preparation of Data Science
DataWorks Summit/Hadoop Summit
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
csandit
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
Bhaskar Mitra
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
Paul Hofmann
 
Idcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleIdcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckle
Eric Kansa
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
Susanna-Assunta Sansone
 
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic Applications
Andre Freitas
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
Spark Summit
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
Carole Goble
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Open IE tutorial 2018
Open IE tutorial 2018Open IE tutorial 2018
Open IE tutorial 2018
Andre Freitas
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
csandit
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
Bhaskar Mitra
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
Paul Hofmann
 
Idcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleIdcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckle
Eric Kansa
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Anubhav Jain
 
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
NPG Scientific Data; SSP, Boston, May 2014: http://www.sspnet.org/events/annu...
Susanna-Assunta Sansone
 
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic Applications
Andre Freitas
 
Ad

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 

NLP Structured Data Investigation on Non-Text

  • 1. NLP Structured Data Investigation on Non-Text. Casey Stella (@casey_stella), Hortonworks, 2016.
  • 2. Table of Contents: Preliminaries; Borrowing from NLP; Demo; Questions.
  • 3. Introduction: Hi, I’m Casey Stella!
  • 5. Domain Challenges in Data Science. A data scientist has to merge analytical skills with domain expertise.
      • Often we’re thrown into places where we have insufficient domain experience.
      • Gaining this expertise can be challenging and time-consuming.
      • Unsupervised machine learning techniques can be very useful to understand complex data relationships.
    We’ll use an unsupervised structure learning algorithm borrowed from NLP to look at medical data.
  • 9. Word2Vec. Word2Vec is a vectorization model created by Google [1] that attempts to learn relationships between words automatically given a large corpus of sentences.
      • Gives us a way to find similar words by finding near neighbors in the vector space with cosine similarity.
      • Uses a neural network to learn vector representations.
      • Work by Pennington, Socher, and Manning [2] shows that the word2vec model is equivalent to weighting a word co-occurrence matrix by window distance and then lowering its dimension by matrix factorization.
    Takeaway: The technique boils down, intuitively, to a riff on word co-occurrence. See footnote 1 for more.
    1 http://radimrehurek.com/2014/12/making-sense-of-word2vec/
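To make the co-occurrence intuition concrete, here is a minimal sketch in plain Python. It is not word2vec itself and not the code from the talk: it only builds distance-weighted co-occurrence vectors and ranks neighbors by cosine similarity, and it skips the matrix factorization step entirely. The function names, the `1 / distance` weighting, and the toy corpus are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Build one sparse, distance-weighted co-occurrence vector per word."""
    vectors = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Assumed weighting: closer neighbours count more (1 / distance).
                    vectors[word][sentence[j]] += 1.0 / abs(i - j)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=3):
    """Return the k words whose context vectors are closest to `word`."""
    target = vectors.get(word, {})
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus (hypothetical): "cat" and "dog" appear in identical contexts.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
vectors = cooccurrence_vectors(corpus)
print(nearest("cat", vectors))
```

Words that share contexts end up with similar vectors even if they never appear next to each other, which is the property the clinical application relies on.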
  • 11. Clinical Data as Sentences. Clinical encounters form a sort of sentence over time. For a given encounter:
      • Vitals are measured (e.g. height, weight, BMI).
      • Labs are performed and results are recorded (e.g. blood tests).
      • Procedures are performed.
      • Diagnoses are made (e.g. Diabetes).
      • Drugs are prescribed.
    Each of these can be considered clinical “words” and the encounter forms a clinical “sentence”.
    Idea: We can use word2vec to investigate connections between these clinical concepts.
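One way to picture the encounter-as-sentence idea is the sketch below. It is an illustration only: the talk's preprocessing was done with Pig and Hive, and the row layout, the `type:value` token format, and the bucketed values such as `bmi_high` are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical flattened event rows: (encounter_id, event_type, value).
events = [
    ("enc-1", "vital", "bmi_high"),
    ("enc-1", "lab", "hba1c_elevated"),
    ("enc-1", "diagnosis", "diabetes"),
    ("enc-1", "drug", "metformin"),
    ("enc-2", "vital", "bmi_normal"),
    ("enc-2", "diagnosis", "hypertension"),
]


def to_token(event_type, value):
    """Prefix the value with its event type so words from different sources stay distinct."""
    return f"{event_type}:{value}".lower().replace(" ", "_")


def build_sentences(rows):
    """Group event tokens by encounter, keeping the order in which rows arrive."""
    grouped = defaultdict(list)
    for encounter_id, event_type, value in rows:
        grouped[encounter_id].append(to_token(event_type, value))
    return dict(grouped)


sentences = build_sentences(events)
for encounter_id, words in sentences.items():
    print(encounter_id, " ".join(words))
```

Continuous measurements (a raw BMI or lab value) would need to be bucketed into a small vocabulary before they behave like words; how to bucket them is a modelling choice this sketch leaves open.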
  • 13. Demo. As part of a Kaggle competition (footnote 2), Practice Fusion, a digital electronic medical records provider, released depersonalized clinical records of 10,000 patients. I ingested and preprocessed these records into 197,340 clinical “sentences” using Pig and Hive.
    MLlib from Spark now contains an implementation of word2vec, so let’s use pyspark and IPython Notebook to explore this dataset on Hadoop.
    2 https://www.kaggle.com/c/pf2012-diabetes
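A rough sketch of what the pyspark step could look like is shown below. It is not the notebook from the talk. It assumes the clinical sentences have been written out as a text file with one whitespace-separated sentence per line, that `sc` is an existing `SparkContext`, and that the path and parameter values are placeholders. The `Word2Vec` class and its `setVectorSize`, `setMinCount`, `setSeed`, `fit`, and `findSynonyms` methods come from `pyspark.mllib.feature`.

```python
def parse_sentence(line):
    """Split one line of whitespace-separated clinical tokens into a list of words."""
    return line.strip().lower().split()


def train_word2vec(sc, path, vector_size=100, min_count=5, seed=42):
    """Fit MLlib Word2Vec on a text file holding one clinical sentence per line."""
    # Imported lazily so the parsing helper can be used without a Spark install.
    from pyspark.mllib.feature import Word2Vec

    sentences = (sc.textFile(path)
                   .map(parse_sentence)
                   .filter(lambda words: len(words) > 0))
    return (Word2Vec()
            .setVectorSize(vector_size)
            .setMinCount(min_count)
            .setSeed(seed)
            .fit(sentences))


def show_neighbours(model, word, k=10):
    """Print the k tokens closest to `word` by cosine similarity."""
    for neighbour, similarity in model.findSynonyms(word, k):
        print(f"{neighbour}\t{similarity:.3f}")


# Usage inside a pyspark shell or notebook (path and token are hypothetical):
#   model = train_word2vec(sc, "hdfs:///path/to/clinical_sentences.txt")
#   show_neighbours(model, "diagnosis:diabetes")
```

Querying the neighbors of a diagnosis token is the exploratory step: drugs, labs, or other diagnoses that rank highly are candidates for a domain expert to confirm, not conclusions in themselves.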
  • 14. Questions. Thanks for your attention! Questions?
      • Code & scripts for this talk available on my github presentation page (footnote 3).
      • Find me at http://caseystella.com
      • Twitter handle: @casey_stella
      • Email address: [email protected]
    3 http://github.com/cestella/presentations/
  • 15. Bibliography
      [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
      [2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, 2014.