Smart Searching Through
Trillion Research Papers
with Apache Spark ML
Himanshu Gupta, Knoldus Inc.
#SAISEco3
About Me
❑ Lead Consultant (Engineering) at Knoldus Inc.
❑ Works on reactive and streaming fast-data solutions built on the Scala/Spark ecosystem.
Agenda
The Need
Challenges
Our Solution
Future work
The Need: Make Better Decisions Faster
Journalist: How much does it cost to get a car from concept phase to sales floor?
Auto industry expert: Typically, it takes 2-5 years and $1 billion to do that.
The Need: Make Better Decisions Faster (contd.)
Journalist: How long does it take to get a new drug to market?
Pharmaceutical scientist: It takes 10-12 years and $2.5 billion to do that.
Surprises can be Costly
❑ In June 2018, Tata Motors produced just one unit of the Nano (the world's cheapest car).
❑ For a few diseases, the success rate of a new drug being approved is less than 20%.
Best Solution:
Leverage the Work Done
Pharma companies partner with research organizations and academic institutes to reduce R&D cost by up to 30%.
60% of cars in India use common engines.
The Challenge:
It is Difficult
❑ R&D data is extremely complex.
❑ Every research work has a specific aim, which may or may not overlap with other research work.
❑ The test environment of R&D work differs from the real world.
❑ Many factors are either assumed or ignored while conducting research.
❑ Facts are scattered across multiple research works.
Our Solution: Build a Platform
❑ Collate all the work done (research papers/articles) in one place.
❑ Allow easy access to the relevant research work.
❑ Discover new fields and concepts.
Design Philosophy
Step 1: Extract Content
❑ Extracting content from research papers/articles is a time-consuming and tiring process.
❑ It requires the expertise of subject-matter experts (SMEs).
❑ However, if done by systems, it can become blazingly fast and cost-efficient.
❑ Systems extract content from research papers/articles and store it in a database, where it can be explored.
Step 1: Extract Content (Process)
01 Read Documents
   Read research papers/articles from S3/HDFS.
02 Index
   Index the content into:
   • Title
   • Structure
   • Special Objects
03 Enrich
   • Prepare N-grams
   • Create M×N matrix
   • Index the matrix
04 Save
   Save the enriched data into the database (Cassandra).
To scale the extraction process, we leveraged Apache Spark's distributed computing.
Step 1: Extract Content (Output)
        Word1   Word2   Word3
Doc1    Count   Count   Count
Doc2    Count   Count   Count
Doc3    Count   Count   Count
...     ...     ...     ...
DocM    Count   Count   Count
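The Enrich step and its document-by-word count matrix can be sketched in a few lines of plain Python. This is a single-machine illustration under stated assumptions, not the pipeline used in the talk: the whitespace tokenizer, the function names (`tokenize`, `ngrams`, `count_matrix`) and the two sample documents are all invented for the example. In Spark ML, the `NGram` and `CountVectorizer` feature transformers cover similar ground at scale.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def ngrams(tokens, n):
    # All contiguous runs of n tokens, each joined into a single term.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_matrix(docs, max_n=2):
    # Build an M x N matrix: one row per document, one column per term.
    per_doc = []
    for text in docs:
        tokens = tokenize(text)
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        per_doc.append(Counter(terms))
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix

# Hypothetical sample documents.
docs = [
    "iron ore smelting",
    "iron alloy casting iron",
]
vocab, matrix = count_matrix(docs)
```

Each row of `matrix` corresponds to one document and each column to one unigram or bigram in `vocab`, which matches the Doc × Word layout of the output table.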
Step 2: Analyze Content (First Iteration)
❑ The input was a feature vector of word counts (one count for each word in the bag of words).
❑ LDA takes in a collection of documents as vectors of word counts, along with the parameters k, optimizer, docConcentration, topicConcentration, maxIterations, and checkpointInterval.
❑ The results are stored for future reference and tuning.
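To make the role of these parameters concrete, here is a small, self-contained LDA sketch in plain Python. It uses collapsed Gibbs sampling, which is a swapped-in technique chosen for brevity and is not one of Spark's optimizers (Spark offers `em` and `online`). The mapping is an assumption for illustration: `k` is the number of topics, `alpha` plays the role of docConcentration, `beta` plays the role of topicConcentration, and `max_iterations` corresponds to maxIterations; checkpointInterval has no counterpart here. The tiny count matrix is made up.

```python
import random

def lda_gibbs(docs, vocab_size, k, alpha=0.1, beta=0.01,
              max_iterations=50, seed=0):
    # docs: list of documents, each a list of word ids in [0, vocab_size).
    # Returns a k x vocab_size matrix of topic-term weights (rows sum to 1).
    rng = random.Random(seed)
    doc_topic = [[0] * k for _ in docs]                # words per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(k)]  # count per (topic, word)
    topic_total = [0] * k                              # words per topic
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(zs)

    for _ in range(max_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed, normalised topic-term weights.
    return [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(k)
    ]

def counts_to_ids(row):
    # Expand a row of word counts into a list of repeated word ids.
    return [w for w, c in enumerate(row) for _ in range(c)]

# Hypothetical 4-document, 4-word count matrix.
matrix = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
docs = [counts_to_ids(row) for row in matrix]
topics = lda_gibbs(docs, vocab_size=4, k=2)
```

The returned `topics` matrix has one row per topic and one weight per word, so it can be stored and compared across runs while tuning `k` and the two concentration values.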
Step 2: Analyze Content (LDA Output)
         Topic1        Topic2        Topic3
Word1    Term Weight   Term Weight   Term Weight
Word2    Term Weight   Term Weight   Term Weight
Word3    Term Weight   Term Weight   Term Weight
...      ...           ...           ...
WordN    Term Weight   Term Weight   Term Weight
● The words with term weights above are not necessarily the final phrases chosen to identify the cluster(s).
● Because the number of words that belong to a cluster can be high (which is good, considering that several words will be ambiguous), one needs to use different ways to identify phrases.
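One simple first pass over such a weight table is to keep only the highest-weighted words of each topic as candidate cluster terms. The sketch below assumes the weights are held topic by topic (the transpose of the table layout); the function name `top_terms`, the vocabulary and the weights are made up for illustration. Spark's LDA models expose a comparable `describeTopics` call. As the slide notes, these candidates are only a starting point for identifying phrases.

```python
def top_terms(topic_weights, vocab, n=3):
    # topic_weights: one row per topic, one weight per vocabulary word.
    result = []
    for weights in topic_weights:
        ranked = sorted(zip(vocab, weights),
                        key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:n])
    return result

# Hypothetical vocabulary and term weights.
vocab = ["bacillus", "lung", "alloy", "furnace"]
topic_weights = [
    [0.45, 0.40, 0.10, 0.05],
    [0.05, 0.10, 0.35, 0.50],
]
clusters = top_terms(topic_weights, vocab, n=2)
```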
Step 2: Analyze Content (Identify Clusters)
Step 2: Analyze Content (Output)
Cluster of words formed from the research papers on Tuberculosis
Step 3: Store Facts (Indexing Documents)
Step 3: Store Facts (Output)
Now we can search documents on the basis of the terms we want:
select * from facts where coreterms like 'metallurgy'
Doc Id   Content    Cluster ID    Core Terms       Similarity      Index
Doc1     Content1   Cluster Id1   Cluster1 Terms   Between (0-1)   Cluster Id1:Start:Length
Doc2     Content2   Cluster Id2   Cluster2 Terms   Between (0-1)   Cluster Id2:Start:Length
...      ...        ...           ...              ...             ...
DocN     ContentN   Cluster Id1   Cluster1 Terms   Between (0-1)   Cluster Id1:Start:Length
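The term search over the facts table can be tried out locally with Python's built-in `sqlite3` module. SQLite is only a stand-in here for the Cassandra store mentioned earlier, and everything except the table name `facts` and the column `coreterms` is an assumption: the other column names, the sample rows and the `search` helper are invented for the sketch. Note the `%` wildcards, which let `LIKE` match a term anywhere inside the core-terms text; in Cassandra a `LIKE` predicate would typically also need a suitable index on the column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE facts (
           docid TEXT, content TEXT, clusterid TEXT,
           coreterms TEXT, similarity REAL, idx TEXT
       )"""
)
# Hypothetical rows; idx follows the ClusterId:Start:Length pattern.
rows = [
    ("Doc1", "Content1", "C1", "metallurgy alloy smelting", 0.82, "C1:0:120"),
    ("Doc2", "Content2", "C2", "tuberculosis bacillus lung", 0.67, "C2:40:95"),
    ("Doc3", "Content3", "C1", "metallurgy furnace ore", 0.74, "C1:10:60"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)", rows)

def search(term):
    # Return (docid, similarity) for matching documents, best match first.
    cursor = conn.execute(
        "SELECT docid, similarity FROM facts "
        "WHERE coreterms LIKE ? ORDER BY similarity DESC",
        (f"%{term}%",),
    )
    return cursor.fetchall()

hits = search("metallurgy")
```

Ordering by the similarity column puts the documents that sit closest to their cluster at the top of the result list.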
Future Work
Semantic Search
❑ Index data in Elasticsearch/Solr.
❑ Run semantic queries over the indexed data.
❑ For example: "How can we separate gold from mercury?" or "Which compounds have recursive bonding with carbon and iron?"
Quality Workbench
❑ To measure the relevance of search.
❑ To tune the performance of ML algorithms.
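The deck does not say how search relevance would be measured. One common choice, assumed here purely for illustration, is precision at k: the share of the top k results that a reviewer has marked as relevant. The function name, document ids and relevance judgments below are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    # Fraction of the first k results that appear in the relevant set.
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / k

# Hypothetical search result order and reviewer judgments.
ranked = ["Doc1", "Doc3", "Doc7", "Doc2"]
relevant = {"Doc1", "Doc2", "Doc3"}
score = precision_at_k(ranked, relevant, k=3)
```

Tracking a score like this across runs gives a simple signal for whether a change to the ML parameters improved or degraded the search results.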
Thank You!
Stay in Touch
+(1) 647-467-4396
https://www.facebook.com/KnoldusSoftware/
@himanshug735
