Smart Searching Through Trillion of Research Papers with Apache Spark ML

Himanshu Gupta, Knoldus Inc.
#SAISEco3

About Me
❑ Lead Consultant (Engineering) at Knoldus Inc.
❑ Work on reactive and streaming fast-data solutions by leveraging the Scala/Spark ecosystem.

Agenda
The Need
Challenges
Our Solution
Future work

The Need: Make Better Decisions Faster
Journalist: How much does it cost to get a car from the concept phase to the sales floor?
Auto Industry Expert: Typically it takes 2-5 years and $1 billion to do that.

The Need: Make Better Decisions Faster (contd.)
Journalist: How long does it take to get a new drug to market?
Pharmaceutical Scientist: It takes 10-12 years and $2.5 billion to do that.

Surprises can be Costly
❑ In June 2018, Tata Motors produced just one unit of the Nano (the world’s cheapest car).
❑ For some diseases, the success rate of a new drug being approved is less than 20%.

Best Solution: Leverage the Work Done
❑ Pharma companies partner with research organizations and academic institutes to reduce R&D costs by up to 30%.
❑ 60% of cars in India use common engines.

The Challenge: It is Difficult
❑ R&D data is extremely complex.
❑ Each research work has a specific aim, which may or may not overlap with other research work.
❑ The test environment of R&D work differs from the real world.
❑ Many factors are either assumed or ignored while conducting research.
❑ Facts are scattered across multiple research works.

Our Solution: Build a Platform
❑ Where all the work done (research papers/articles) is collated.
❑ That allows easy access to the relevant research work.
❑ That helps discover new fields and concepts.

Step 1: Extract Content
❑ Extracting content from research papers/articles is a time-consuming and tiring process.
❑ It requires the expertise of SME(s).
❑ However, if done by systems, it can become blazingly fast and cost-efficient.
❑ Systems extract content from research papers/articles and store it in a database, from where it can be explored.

Step 1: Extract Content (Process)
01 Read Documents: Read research papers/articles from S3/HDFS.
02 Index: Index the content into title, structure, and special objects.
03 Enrich: Prepare n-grams, create the MxN matrix, and index the matrix.
04 Save: Save the enriched data into the database (Cassandra).
To scale the extraction process, we leveraged Apache Spark’s distributed computing feature.

Step 1: Extract Content (Output)
        Word1   Word2   Word3
Doc1    Count   Count   Count
...     ...     ...     ...
...     ...     ...     ...
DocM    Count   Count   Count
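
As a rough illustration of the enrichment step, the sketch below prepares n-grams and builds a document-by-term count matrix of the shape shown above. It runs on a single machine in plain Python rather than on Spark, and the function names (`ngrams`, `document_term_matrix`), the whitespace tokenizer, and the sample sentences are all assumptions made for this example, not the pipeline used in the talk.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, max_n=2):
    """Build an M x N count matrix: one row per document, one column per n-gram."""
    counts = []
    for text in docs:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        counts.append(Counter(terms))
    # The vocabulary is the sorted union of every n-gram seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Made-up sample documents, for illustration only.
docs = ["gold dissolves in mercury", "mercury forms an amalgam with gold"]
vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` corresponds to one document and each column to one entry of `vocabulary`, so `matrix[d][vocabulary.index(term)]` is the count of `term` in document `d`.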

Step 2: Analyze Content (First Iteration)
❑ The input was a feature vector of word counts (for each word in a bag of words).
❑ LDA takes in a collection of documents as vectors of word counts, along with the parameters k, optimizer, docConcentration, topicConcentration, maxIterations, and checkpointInterval.
❑ Store the results for future reference and tuning.
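
To make the role of these parameters concrete, here is a minimal single-machine sketch of topic modeling over word-count vectors. It is not Spark's implementation: it swaps in collapsed Gibbs sampling, whereas Spark's LDA offers its own optimizers (selected through the optimizer parameter). The function name `lda_gibbs` and the toy counts are assumptions for this example; `k`, `doc_concentration`, `topic_concentration`, and `max_iterations` only loosely mirror the parameters listed above, and optimizer and checkpointInterval are not modeled.

```python
import random


def lda_gibbs(count_vectors, k, doc_concentration, topic_concentration,
              max_iterations, seed=0):
    """Fit a topic model by collapsed Gibbs sampling on rows of word counts.

    Returns (term_weights, doc_topic): term_weights[t][w] is the estimated
    probability of word w under topic t; doc_topic[d][t] counts the words
    of document d currently assigned to topic t.
    """
    rng = random.Random(seed)
    vocab_size = len(count_vectors[0])
    # Expand each count vector into a flat bag of word ids.
    docs = [[w for w, c in enumerate(row) for _ in range(c)]
            for row in count_vectors]
    doc_topic = [[0] * k for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(k)]
    topic_total = [0] * k
    assignments = []
    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(k) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(max_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][t] + doc_concentration)
                    * (topic_word[t][w] + topic_concentration)
                    / (topic_total[t] + vocab_size * topic_concentration)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    # Smoothed per-topic word distributions (each row sums to 1).
    term_weights = [
        [(topic_word[t][w] + topic_concentration)
         / (topic_total[t] + vocab_size * topic_concentration)
         for w in range(vocab_size)]
        for t in range(k)
    ]
    return term_weights, doc_topic


# Toy word-count vectors (4 documents, 4 words), made up for illustration.
counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
term_weights, doc_topic = lda_gibbs(
    counts, k=2, doc_concentration=0.1, topic_concentration=0.01,
    max_iterations=50)
```

Storing `term_weights` and `doc_topic` alongside the parameter values used would be one way to keep results for later comparison and tuning.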

Step 2: Analyze Content (LDA Output)
         Topic1        Topic2        Topic3
Word1    Term Weight   Term Weight   Term Weight
...      ...           ...           ...
...      ...           ...           ...
WordN    Term Weight   Term Weight   Term Weight
● The above words with term weights may not necessarily be the final phrases chosen to identify cluster(s).
● Because the number of words that belong to a cluster can be high (which is good, considering that several words will be ambiguous), one needs to use different ways to identify phrases.
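
One simple way to get from a term-weight matrix to candidate cluster words is to keep only the highest-weighted terms per topic. The sketch below assumes the matrix is stored topic-by-word (the transpose of the table above); the function name `top_terms`, the vocabulary, and the weights are invented for illustration, and this is only one of the possible ways to identify phrases.

```python
def top_terms(term_weights, vocabulary, n):
    """For each topic, return the n (word, weight) pairs with the highest weight."""
    result = []
    for weights in term_weights:
        ranked = sorted(zip(vocabulary, weights),
                        key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:n])
    return result


# Made-up vocabulary and term weights: one row per topic, one column per word.
vocabulary = ["bacillus", "lung", "alloy", "furnace"]
term_weights = [
    [0.50, 0.40, 0.05, 0.05],  # Topic1
    [0.05, 0.05, 0.30, 0.60],  # Topic2
]
clusters = top_terms(term_weights, vocabulary, n=2)
```

A cutoff on rank like this is a blunt instrument: ambiguous words can score highly in several topics, which is why further phrase identification is still needed.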

Step 2: Analyze Content (Identify Clusters)

Step 2: Analyze Content (Output)
Cluster of words formed from the research papers on Tuberculosis

Step 3: Store Facts (Indexing Documents)

Step 3: Store Facts (Output)
Now we can search documents by the terms we want:
select * from facts where coreterms like '%metallurgy%'
Doc Id   Content    Cluster ID    Core Terms       Similarity      Index
Doc1     Content1   Cluster Id1   Cluster1 Terms   Between (0-1)   Cluster Id1:Start:Length
Doc2     Content2   Cluster Id2   Cluster2 Terms   Between (0-1)
...      ...        ...           ...              ...             ...
DocN     ContentN   Cluster Id1   Cluster1 Terms   Between (0-1)
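
The term lookup in the query above can be tried end to end with a small stand-in. The sketch below uses an in-memory SQLite database instead of the Cassandra store mentioned earlier; the column names, the `search_facts` helper, and every row are hypothetical and exist only to show the lookup pattern.

```python
import sqlite3

# In-memory SQLite stands in for the real fact store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts ("
    "doc_id TEXT, content TEXT, cluster_id TEXT, coreterms TEXT, similarity REAL)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        ("Doc1", "Content1", "Cluster1", "metallurgy alloy furnace", 0.82),
        ("Doc2", "Content2", "Cluster2", "tuberculosis bacillus lung", 0.67),
        ("Doc3", "Content3", "Cluster1", "metallurgy smelting ore", 0.74),
    ],
)


def search_facts(connection, term):
    """Return (doc_id, similarity) rows whose core terms contain `term`, best first."""
    cursor = connection.execute(
        "SELECT doc_id, similarity FROM facts "
        "WHERE coreterms LIKE ? ORDER BY similarity DESC",
        (f"%{term}%",),
    )
    return cursor.fetchall()


results = search_facts(conn, "metallurgy")
```

Passing the term as a bound parameter rather than formatting it into the SQL string keeps the lookup safe against malformed or malicious input.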

Future Work
Semantic Search
❑ Index data in Elasticsearch/Solr.
❑ Run semantic queries over the indexed data.
❑ For example: How can we separate gold from mercury? Or: Which compounds have recursive bonding with carbon and iron?
Quality Workbench
❑ To measure the relevance of search.
❑ To tune the performance of ML algorithms.

Thank You!
Stay in Touch
+(1) 647-467-4396
https://www.facebook.com/KnoldusSoftware/
@himanshug735
