How to build your own google ...
artur.grzadziel@gmail.com
Data Wizards
Dec 2015
Artur Grządziel
A few words about me
email: artur.grzadziel@gmail.com
Currently: BigData and Machine Learning Leader
From Jan 2016: BigData Solution Architect at General Electric
PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute
Graduated from Warsaw University of Technology and Warsaw School of Economics
Big Data and Machine Learning enthusiast focused on applying both
to real business cases
Privately, husband and father
pl.linkedin.com/in/ArturGrzadziel
Introduction
Data Wizards
Artur represents "Data Wizards", an informal group of
Big Data, Machine Learning, and Data Science professionals based in
Poland who are interested in sharing knowledge and addressing business
challenges with modern Big Data and Machine Learning
methods.
Agenda
1. Cloudera search
2. How it works?
MySearch
very high-level architecture
[Architecture diagram. Labels: Data Source, Other Sources, Index, Cloudera Search (Apache Solr and Tika)]
Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products.
Cloudera Search enables non-technical users to search and explore data stored
in or ingested into Hadoop and HBase. Users do not need SQL or programming
skills to use Cloudera Search because it provides a simple, full-text interface for
searching.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene,
SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated
with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search
provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
Cloudera search
Tika
https://tika.apache.org/download.html
Cloudera search
Tika – image
Cloudera search
Tika – PDF file
Cloudera search
Tika – gazeta.pl
Cloudera search
Tika – formats
Supported Document Formats
• HyperText Markup Language
• XML and derived formats
• Microsoft Office document formats
• OpenDocument Format
• Portable Document Format
• Electronic Publication Format
• Rich Text Format
• Compression and packaging formats
• Text formats
• Audio formats
• Image formats
• Video formats
• Java class files and archives
• The mbox format
https://tika.apache.org/1.4/formats.html
Cloudera search
Solr – how to start it …
bin/solr start -e cloud -noprompt
http://lucene.apache.org/solr/
Cloudera Search
Administration
Cloudera Search
Data
id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
Cloudera Search
Output format
Cloudera Search
Simple query
Cloudera Search
Simple query
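The query screenshots are not reproduced here, so the following is only a minimal sketch of what a simple query request might look like. It assumes a local Solr node on the default port 8983 and a collection named "gettingstarted"; both are assumptions, not values taken from the slides. The helper build_query_url is hypothetical; q, rows and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint: local Solr node, hypothetical "gettingstarted" collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"


def build_query_url(query, rows=10, fmt="json"):
    """Return a Solr select URL for a free-text or field:value query."""
    params = {"q": query, "rows": rows, "wt": fmt}
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Free-text search across the default search field.
simple_url = build_query_url("foundation")
# Search restricted to one field of the sample book data.
field_url = build_query_url('author:"Isaac Asimov"', rows=5)

print(simple_url)
print(field_url)
```

Opening either URL in a browser (or fetching it with any HTTP client) would return the matching documents, provided a Solr instance with that collection is actually running.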
Cloudera Search
More advanced query
Cloudera Search
Query with facets
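As a rough illustration of a faceted request, the sketch below builds the URL for a query that returns only facet counts per genre and category, limited to books in stock. The endpoint and the helper build_facet_url are assumptions; facet, facet.field and fq are standard Solr parameters, and the field names come from the sample data shown earlier.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: local Solr node, hypothetical "gettingstarted" collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"


def build_facet_url(query, facet_fields, filter_query=None):
    """Return a Solr select URL that asks for facet counts only (rows=0)."""
    params = [("q", query), ("rows", 0), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, so parameters are kept as a list of pairs.
    params += [("facet.field", field) for field in facet_fields]
    if filter_query:
        params.append(("fq", filter_query))
    return SOLR_SELECT_URL + "?" + urlencode(params)


facet_url = build_facet_url("*:*", ["genre_s", "cat"], filter_query="inStock:true")

# Decode the query string again to inspect what would be sent.
decoded = parse_qs(urlsplit(facet_url).query)
print(decoded["facet.field"])
```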
Cloudera search
Solr – other features
The MoreLikeThis search component lets users query for documents
similar to a document in their result list. It uses terms from the
original document to find similar documents in the index.
The SpellCheck component is designed to provide inline query suggestions
based on other, similar terms.
Highlighting in Solr allows fragments of documents that match the user's query
to be included with the query response.
Synonyms, stop words
Cloudera search
Solr – other features – geospatial search
Solr has sophisticated geospatial support, including searching within a
specified distance range of a given location (or within a bounding box),
sorting by distance, or even boosting results by the distance.
http://lucene.apache.org/solr/quickstart.html
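To make "searching within a distance and sorting by distance" concrete, here is a small pure-Python sketch based on the haversine great-circle formula. It only illustrates the idea; it is not how Solr implements spatial search internally. The city coordinates are approximate, illustrative values, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(origin, places, max_km):
    """Return (name, distance) pairs within max_km of origin, nearest first."""
    hits = []
    for name, (lat, lon) in places.items():
        distance = haversine_km(origin[0], origin[1], lat, lon)
        if distance <= max_km:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Approximate coordinates, for illustration only.
places = {
    "Warsaw": (52.2297, 21.0122),
    "Krakow": (50.0647, 19.9450),
    "Gdansk": (54.3520, 18.6466),
}

nearby = within_distance(places["Warsaw"], places, max_km=270)
for name, distance in nearby:
    print(f"{name}: {distance:.0f} km")
```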
Cloudera Search
Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and
easily for a variety of critical use cases all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
Cloudera Search
Other Use Cases
Instagram: Instagram (a Facebook company) uses Solr to power its geosearch API.
WhiteHouse.gov: The Obama administration's website is built on Drupal and
Solr.
Netflix: Solr powers basic movie searching on this extremely busy site.
StubHub.com: This ticket reseller uses Solr to help visitors search for concerts
and sporting events.
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
How it works ... ?
How it works … ?
Data Source – documents …
Document Content
1 John has a cat
2 John has a dog
3 Eva has a cat
4 George has a dog
How it works … ?
Data Source – documents … space of unique terms
Document  Content            Term IDs
1         John has a cat     1 2 3 4
2         John has a dog     1 2 3 5
3         Eva has a cat      6 2 3 4
4         George has a dog   7 2 3 5
List of unique
words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George
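The mapping from words to term IDs can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization and IDs assigned in order of first appearance; the function names are hypothetical.

```python
documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(docs):
    """Assign each unique word an ID, starting at 1, in order of first appearance."""
    vocabulary = {}
    for doc_id in sorted(docs):
        for word in docs[doc_id].split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def encode(docs, vocabulary):
    """Replace every word of every document with its term ID."""
    return {doc_id: [vocabulary[word] for word in text.split()]
            for doc_id, text in docs.items()}


vocabulary = build_vocabulary(documents)
encoded = encode(documents, vocabulary)

print(vocabulary)
for doc_id, term_ids in encoded.items():
    print(doc_id, term_ids)
```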
How it works … ?
Data Source – Documents … boolean search with inverted index
Dictionary              Documents
Term     Tot. freq.     Doc #
John     2              1, 2
has      4              1, 2, 3, 4
a        4              1, 2, 3, 4
cat      2              1, 3
dog      2              2, 4
Eva      1              3
George   1              4
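A minimal sketch of an inverted index and a boolean AND search over it, using the four sample documents. It assumes whitespace tokenization and case-sensitive terms; the function names are hypothetical. Because each term occurs at most once per document here, the length of a posting list equals the total frequency in the dictionary.

```python
from collections import defaultdict

documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


def search_and(index, *terms):
    """Boolean AND: return IDs of documents that contain every given term."""
    if not terms:
        return []
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))


index = build_inverted_index(documents)

print(index["cat"])                       # posting list for "cat"
print(search_and(index, "John", "dog"))   # documents with both terms
```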
How it works … ?
Data Source – documents as vectors
Documents
document 1 John has a cat
document 2 John has a dog
document 3 Eva has a cat
document 4 George has a dog
Space of unique terms    -> John  has  a  cat  dog  Eva  George
vector representing doc1 -> 1     1    1  1    0    0    0
vector representing doc2 -> 1     1    1  0    1    0    0
vector representing doc3 -> 0     1    1  1    0    1    0
vector representing doc4 -> 0     1    1  0    1    0    1
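The vectors above can be built and compared with a short Python sketch. Cosine similarity is used here as one common way to compare such vectors; the slides do not name a specific measure, so treat that choice, the sample query "John cat", and the function names as assumptions.

```python
import math

terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]
documents = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {doc_id: to_vector(text, terms) for doc_id, text in documents.items()}

# Rank documents against a hypothetical query, most similar first.
query = to_vector("John cat", terms)
ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)

for doc_id in ranked:
    print(doc_id, round(cosine(query, vectors[doc_id]), 3))
```

Document 1 contains both query terms and ranks first; document 4 shares no term with the query and ranks last.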
How it works … ?
Data Source – Documents … vectors
Summary
[Architecture diagram repeated. Label: Other Sources]
Thank you
Data Wizards
E-mail: artur.grzadziel@gmail.com
Links:
• Cloudera Search:
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
• Tika
https://tika.apache.org/
• Apache Solr
http://lucene.apache.org/solr/
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
• Vectors, Inverted Index, Frequency Matrix, etc. ...
http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm

More Related Content

What's hot (7)

2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain
Christian Martorella
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked DataAn introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked Data
Fabien Gandon
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
Prof. Wim Van Criekinge
 
Linked Data:Libraries and Beyond
Linked Data:Libraries and BeyondLinked Data:Libraries and Beyond
Linked Data:Libraries and Beyond
Jessica Hedgecock and John Shannon
 
Creating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDFCreating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDF
donaldlsmithjr
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
Prof. Wim Van Criekinge
 
Search engines coh m
Search engines coh mSearch engines coh m
Search engines coh m
cpcmattc
 
2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain
Christian Martorella
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked DataAn introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked Data
Fabien Gandon
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
Prof. Wim Van Criekinge
 
Creating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDFCreating Web APIs with JSON-LD and RDF
Creating Web APIs with JSON-LD and RDF
donaldlsmithjr
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
Prof. Wim Van Criekinge
 
Search engines coh m
Search engines coh mSearch engines coh m
Search engines coh m
cpcmattc
 

Viewers also liked (20)

Ask Data Anything
Ask Data AnythingAsk Data Anything
Ask Data Anything
Data Science Warsaw
 
Małe dane, duży wpływ - Dominik Batorski ICM
Małe dane, duży wpływ - Dominik Batorski ICMMałe dane, duży wpływ - Dominik Batorski ICM
Małe dane, duży wpływ - Dominik Batorski ICM
Data Science Warsaw
 
Oracle Big Data Discovery - ludzka twarz Hadoop'a
Oracle Big Data Discovery - ludzka twarz Hadoop'aOracle Big Data Discovery - ludzka twarz Hadoop'a
Oracle Big Data Discovery - ludzka twarz Hadoop'a
Data Science Warsaw
 
Big Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Big  Data, Wearable, sztuczna inteligencja i ekonomia współpracyBig  Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Big Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Data Science Warsaw
 
Data science warsaw inaugural meetup
Data science warsaw   inaugural meetupData science warsaw   inaugural meetup
Data science warsaw inaugural meetup
Data Science Warsaw
 
Online content popularity prediction
Online content popularity predictionOnline content popularity prediction
Online content popularity prediction
Data Science Warsaw
 
Data Exchange - the missing link in the big data value chain
Data Exchange - the missing link in the big data value chainData Exchange - the missing link in the big data value chain
Data Exchange - the missing link in the big data value chain
Data Science Warsaw
 
Data Science Warsaw
Data Science WarsawData Science Warsaw
Data Science Warsaw
Data Science Warsaw
 
Analiza języka naturalnego
Analiza języka naturalnegoAnaliza języka naturalnego
Analiza języka naturalnego
Data Science Warsaw
 
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Data Science Warsaw
 
unidad 1
unidad 1unidad 1
unidad 1
erika jhoanna vargas rincon
 
Trash Talk
Trash TalkTrash Talk
Trash Talk
Crystal Ouellette
 
unidad 1
unidad 1unidad 1
unidad 1
erika jhoanna vargas rincon
 
Data science w ubezpieczeniach
Data science w ubezpieczeniachData science w ubezpieczeniach
Data science w ubezpieczeniach
Data Science Warsaw
 
Rozwiązywanie problemów optymalizacyjnych
Rozwiązywanie problemów optymalizacyjnychRozwiązywanie problemów optymalizacyjnych
Rozwiązywanie problemów optymalizacyjnych
Data Science Warsaw
 
ARTRITIS – ENCEFALITIS CAPRINA
ARTRITIS – ENCEFALITIS CAPRINAARTRITIS – ENCEFALITIS CAPRINA
ARTRITIS – ENCEFALITIS CAPRINA
Edgar Mrtinez
 
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia SeahorseWizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Data Science Warsaw
 
QIIP
QIIPQIIP
QIIP
Marwa Badra
 
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Data Science Warsaw
 
To się w ram ie nie zmieści
To się w ram ie nie zmieściTo się w ram ie nie zmieści
To się w ram ie nie zmieści
Data Science Warsaw
 
Małe dane, duży wpływ - Dominik Batorski ICM
Małe dane, duży wpływ - Dominik Batorski ICMMałe dane, duży wpływ - Dominik Batorski ICM
Małe dane, duży wpływ - Dominik Batorski ICM
Data Science Warsaw
 
Oracle Big Data Discovery - ludzka twarz Hadoop'a
Oracle Big Data Discovery - ludzka twarz Hadoop'aOracle Big Data Discovery - ludzka twarz Hadoop'a
Oracle Big Data Discovery - ludzka twarz Hadoop'a
Data Science Warsaw
 
Big Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Big  Data, Wearable, sztuczna inteligencja i ekonomia współpracyBig  Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Big Data, Wearable, sztuczna inteligencja i ekonomia współpracy
Data Science Warsaw
 
Data science warsaw inaugural meetup
Data science warsaw   inaugural meetupData science warsaw   inaugural meetup
Data science warsaw inaugural meetup
Data Science Warsaw
 
Online content popularity prediction
Online content popularity predictionOnline content popularity prediction
Online content popularity prediction
Data Science Warsaw
 
Data Exchange - the missing link in the big data value chain
Data Exchange - the missing link in the big data value chainData Exchange - the missing link in the big data value chain
Data Exchange - the missing link in the big data value chain
Data Science Warsaw
 
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Geolokalizacja i analizy przestrzenne: trzy wymiary a ile pracy dla analityka!
Data Science Warsaw
 
Rozwiązywanie problemów optymalizacyjnych
Rozwiązywanie problemów optymalizacyjnychRozwiązywanie problemów optymalizacyjnych
Rozwiązywanie problemów optymalizacyjnych
Data Science Warsaw
 
ARTRITIS – ENCEFALITIS CAPRINA
ARTRITIS – ENCEFALITIS CAPRINAARTRITIS – ENCEFALITIS CAPRINA
ARTRITIS – ENCEFALITIS CAPRINA
Edgar Mrtinez
 
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia SeahorseWizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Wizualne budowanie aplikacji na Sparku przy pomocy narzędzia Seahorse
Data Science Warsaw
 
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Neptune - narzędzie do monitorowania i zarządzania eksperymentami Machine Lea...
Data Science Warsaw
 

Similar to How to build your own google (20)

Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera, Inc.
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Lucidworks
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
gregchanan
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
gregchanan
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
Cloudera, Inc.
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
Trey Grainger
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Trey Grainger
 
Solr 3.1 and beyond
Solr 3.1 and beyondSolr 3.1 and beyond
Solr 3.1 and beyond
Lucidworks (Archived)
 
Introduction to solr
Introduction to solrIntroduction to solr
Introduction to solr
Sematext Group, Inc.
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
Ramzi Alqrainy
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Ecommerce Solution Provider SysIQ
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
Angel Borroy López
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!
Paul Borgermans
 
Building a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation EngineBuilding a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation Engine
lucenerevolution
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Apache solr
Apache solrApache solr
Apache solr
Dipen Rangwani
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera, Inc.
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Lucidworks
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
gregchanan
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
gregchanan
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
Cloudera, Inc.
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
Trey Grainger
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Trey Grainger
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
Ramzi Alqrainy
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Ecommerce Solution Provider SysIQ
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
Angel Borroy López
 
Find it, possibly also near you!
Find it, possibly also near you!Find it, possibly also near you!
Find it, possibly also near you!
Paul Borgermans
 
Building a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation EngineBuilding a Real-time Solr-powered Recommendation Engine
Building a Real-time Solr-powered Recommendation Engine
lucenerevolution
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 

More from Data Science Warsaw (7)

CRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining ProjectsCRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining Projects
Data Science Warsaw
 
Ile informacji jest w danych?
Ile informacji jest w danych?Ile informacji jest w danych?
Ile informacji jest w danych?
Data Science Warsaw
 
Otwarte Miasta
Otwarte MiastaOtwarte Miasta
Otwarte Miasta
Data Science Warsaw
 
Azure - Duże zbiory w chmurze
Azure - Duże zbiory w chmurzeAzure - Duże zbiory w chmurze
Azure - Duże zbiory w chmurze
Data Science Warsaw
 
As simple as Apache Spark
As simple as Apache SparkAs simple as Apache Spark
As simple as Apache Spark
Data Science Warsaw
 
Metody logiczne w analizie danych
Metody logiczne w analizie danych Metody logiczne w analizie danych
Metody logiczne w analizie danych
Data Science Warsaw
 
Haven 2 0
Haven 2 0 Haven 2 0
Haven 2 0
Data Science Warsaw
 

Recently uploaded (20)

Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
yashikanigam1
 
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays
 
The fundamental concept of nature of knowledge
The fundamental concept of nature of knowledgeThe fundamental concept of nature of knowledge
The fundamental concept of nature of knowledge
tarrebulehora
 
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiuLec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
saifalroby72
 
Embracing AI in Project Management: Final Insights & Future Vision
Embracing AI in Project Management: Final Insights & Future VisionEmbracing AI in Project Management: Final Insights & Future Vision
Embracing AI in Project Management: Final Insights & Future Vision
KavehMomeni1
 
time_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptxtime_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptx
stefanopinto1113
 
apidays New York 2025 - Turn API Chaos Into AI-Powered Growth by Jeremy Water...
apidays New York 2025 - Turn API Chaos Into AI-Powered Growth by Jeremy Water...apidays New York 2025 - Turn API Chaos Into AI-Powered Growth by Jeremy Water...
apidays New York 2025 - Turn API Chaos Into AI-Powered Growth by Jeremy Water...
apidays
 
Block chauin techncology by engineer saniya samreen
Block chauin techncology by engineer saniya samreenBlock chauin techncology by engineer saniya samreen
Block chauin techncology by engineer saniya samreen
Shoyeb16
 
ch068.pptnsnsnjsjjzjzjdjdjdjdjdjdjjdjdjdjdjxj
ch068.pptnsnsnjsjjzjzjdjdjdjdjdjdjjdjdjdjdjxjch068.pptnsnsnjsjjzjzjdjdjdjdjdjdjjdjdjdjdjxj
ch068.pptnsnsnjsjjzjzjdjdjdjdjdjdjjdjdjdjdjxj
MikkoPlanas
 
Understanding Large Language Model Hallucinations: Exploring Causes, Detectio...
Understanding Large Language Model Hallucinations: Exploring Causes, Detectio...Understanding Large Language Model Hallucinations: Exploring Causes, Detectio...
Understanding Large Language Model Hallucinations: Exploring Causes, Detectio...
Tamanna36
 
apidays New York 2025 - How AI is Transforming Product Management by Shereen ...
apidays New York 2025 - How AI is Transforming Product Management by Shereen ...apidays New York 2025 - How AI is Transforming Product Management by Shereen ...
apidays New York 2025 - How AI is Transforming Product Management by Shereen ...
apidays
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf
AffinityCore
 
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Ethical Frameworks for Trustworthy AI – Opportunities for Researchers in Huma...
Karim Baïna
 
apidays New York 2025 - To tune or not to tune by Anamitra Dutta Majumdar (In...
apidays New York 2025 - To tune or not to tune by Anamitra Dutta Majumdar (In...apidays New York 2025 - To tune or not to tune by Anamitra Dutta Majumdar (In...
apidays New York 2025 - To tune or not to tune by Anamitra Dutta Majumdar (In...
apidays
 
Role_Based_Permissions_Kick-off_Deck_202203.pptx
Role_Based_Permissions_Kick-off_Deck_202203.pptxRole_Based_Permissions_Kick-off_Deck_202203.pptx
Role_Based_Permissions_Kick-off_Deck_202203.pptx
SystemsBenya
 
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptxrefractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
KannanDamodaram
 
Splunk_ITSI_Interview_Prep_Deck.pptx interview
Splunk_ITSI_Interview_Prep_Deck.pptx interviewSplunk_ITSI_Interview_Prep_Deck.pptx interview
Splunk_ITSI_Interview_Prep_Deck.pptx interview
willmorekanan
 
Lec 12.pdfghhjjhhjkkkkkkkkkkkjfcvhiiugcvvh
Lec 12.pdfghhjjhhjkkkkkkkkkkkjfcvhiiugcvvhLec 12.pdfghhjjhhjkkkkkkkkkkkjfcvhiiugcvvh
Lec 12.pdfghhjjhhjkkkkkkkkkkkjfcvhiiugcvvh
saifalroby72
 
Monterey College of Law’s mission is to z
Monterey College of Law’s mission is to zMonterey College of Law’s mission is to z
Monterey College of Law’s mission is to z
seoali2660
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
Brain, Bytes & Bias: ML Interview Questions You Can’t Miss!
yashikanigam1
 
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays New York 2025 - API Platform Survival Guide by James Higginbotham (La...
apidays
 
The fundamental concept of nature of knowledge
The fundamental concept of nature of knowledgeThe fundamental concept of nature of knowledge
The fundamental concept of nature of knowledge
tarrebulehora
 
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiuLec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
Lec 11.pdfgghjuuyffhkiiiiuuiiiiiiuhffghjiu
saifalroby72
 
Embracing AI in Project Management: Final Insights & Future Vision

How to build your own google

  • 1. How to build your own google ... artur.grzadziel@gmail.com Data Wizards Dec 2015
  • 2. Artur Grządziel, few words about me. Email: artur.grzadziel@gmail.com. Currently: BigData and Machine Learning Leader. From Jan 2016: BigData Solution Architect at General Electric. PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute. Graduated from Warsaw University of Technology and Warsaw School of Economics. BigData & Machine Learning enthusiast focused on leveraging Big Data and Machine Learning in real business cases. Privately, husband and father. pl.linkedin.com/in/ArturGrzadziel
  • 3. Introduction: Data Wizards. Artur represents the „Data Wizards” group, an informal group of BigData/Machine Learning/Data Science professionals located in Poland and interested in sharing knowledge and addressing business challenges with modern BigData and Machine Learning methods.
  • 5. MySearch very high level architecture Data Source Index
  • 6. Cloudera search Apache Solr and Tika 1. Other Sources
  • 7. Cloudera Search. Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching. Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search provides these key capabilities:
      - Near-real-time indexing
      - Batch indexing
      - Simple, full-text data exploration and navigated drill down
      http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
  • 12. Cloudera search Tika – formats. Supported Document Formats:
      • HyperText Markup Language
      • XML and derived formats
      • Microsoft Office document formats
      • OpenDocument Format
      • Portable Document Format
      • Electronic Publication Format
      • Rich Text Format
      • Compression and packaging formats
      • Text formats
      • Audio formats
      • Image formats
      • Video formats
      • Java class files and archives
      • The mbox format
      https://tika.apache.org/1.4/formats.html
  • 13. Cloudera search Solr – how to start it … bin\solr start -e cloud -noprompt (run from the Solr installation directory) http://lucene.apache.org/solr/
  • 15. Cloudera Search Data
      id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
      553573403,book,A Game of Thrones,7.99,TRUE,George R.R. Martin,A Song of Ice and Fire,1,fantasy
      553579908,book,A Clash of Kings,7.99,TRUE,George R.R. Martin,A Song of Ice and Fire,2,fantasy
      055357342X,book,A Storm of Swords,7.99,TRUE,George R.R. Martin,A Song of Ice and Fire,3,fantasy
      553293354,book,Foundation,7.99,TRUE,Isaac Asimov,Foundation Novels,1,scifi
      812521390,book,The Black Company,6.99,FALSE,Glen Cook,The Chronicles of The Black Company,1,fantasy
      812550706,book,Ender's Game,6.99,TRUE,Orson Scott Card,Ender,1,scifi
      441385532,book,Jhereg,7.95,FALSE,Steven Brust,Vlad Taltos,1,fantasy
      380014300,book,Nine Princes In Amber,6.99,TRUE,Roger Zelazny,the Chronicles of Amber,1,fantasy
      805080481,book,The Book of Three,5.99,TRUE,Lloyd Alexander,The Chronicles of Prydain,1,fantasy
      080508049X,book,The Black Cauldron,5.99,TRUE,Lloyd Alexander,The Chronicles of Prydain,2,fantasy
  • 21. Cloudera search Solr – other features. The MoreLikeThis search component lets users query for documents similar to a document in their result list; it uses terms from the original document to find similar documents in the index. The SpellCheck component provides inline query suggestions based on other, similar terms. Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response. Synonyms and stop words are also supported.
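To make the highlighting feature concrete, here is a minimal sketch that only builds a query URL and does not contact a server. The host, port, and the `gettingstarted` collection name are assumptions (they match a default local quickstart setup), and `build_highlight_query` is a hypothetical helper. `hl` and `hl.fl` are the standard Solr highlighting parameters; whether the `name` field can be highlighted depends on the schema in use.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "gettingstarted".
SOLR_SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"


def build_highlight_query(user_query, field):
    """Return a select URL that asks Solr to highlight matches in `field`."""
    params = {
        "q": user_query,   # the user's query
        "hl": "true",      # turn highlighting on
        "hl.fl": field,    # field(s) to build highlighted fragments from
        "wt": "json",      # response format
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_highlight_query("name:game", "name")
print(url)
```

The URL can be pasted into a browser or fetched with any HTTP client once a Solr instance is running.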
  • 22. Cloudera search Solr – other features – geospatial search. Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by distance. http://lucene.apache.org/solr/quickstart.html
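The idea behind "search within a distance and sort by distance" can be sketched in plain Python. This is only an illustration of the concept, not how Solr implements it: it uses the haversine great-circle formula over a tiny in-memory list, and the place list and the helper names `haversine_km` and `within_distance` are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(points, center, max_km):
    """Names of points within max_km of center, nearest first."""
    hits = [(haversine_km(*center, *coords), name) for name, coords in points.items()]
    return [name for dist, name in sorted(hits) if dist <= max_km]


# Illustrative data: approximate city-centre coordinates.
places = {
    "Warsaw": (52.2297, 21.0122),
    "Krakow": (50.0647, 19.9450),
    "Gdansk": (54.3520, 18.6466),
}

nearby = within_distance(places, places["Warsaw"], 260)
print(nearby)
```

A real search engine avoids computing the distance to every document by using a spatial index; the linear scan here is only for clarity.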
  • 23. Cloudera Search Common Use Cases. Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases all within a single platform, including:
      - Threat detection
      - Customer 360-degree visibility
      - Improved user experience
      - Interactive market segmentation
      - Accessible global knowledge base
      https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
  • 24. Cloudera Search Other Use Cases
      - Instagram: Instagram (a Facebook company), one of the best-known sites, uses Solr to power its geosearch API.
      - WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr.
      - Netflix: Solr powers basic movie searching on this extremely busy site.
      - StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events.
      https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
  • 25. How it works ... ?
  • 26. How it works … ? Data Source – documents …
      Document  Content
      1         John has a cat
      2         John has a dog
      3         Eva has a cat
      4         George has a dog
  • 27. How it works … ? Data Source – documents … space of unique terms
      Document  Content           Term IDs
      1         John has a cat    1 2 3 4
      2         John has a dog    1 2 3 5
      3         Eva has a cat     6 2 3 4
      4         George has a dog  7 2 3 5
      List of unique words: 1. John 2. has 3. a 4. cat 5. dog 6. Eva 7. George
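The mapping from words to term IDs can be sketched in a few lines of Python. This is a minimal illustration using the four example documents, assuming whitespace tokenization and IDs assigned in order of first appearance; `build_dictionary` and `encode` are hypothetical helper names.

```python
# The four example documents, keyed by document number.
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_dictionary(documents):
    """Assign each unique word an ID (1, 2, 3, ...) in order of first appearance."""
    term_ids = {}
    for doc_id in sorted(documents):
        for word in documents[doc_id].split():
            if word not in term_ids:
                term_ids[word] = len(term_ids) + 1
    return term_ids


def encode(documents, term_ids):
    """Rewrite every document as the list of its term IDs."""
    return {doc_id: [term_ids[w] for w in text.split()]
            for doc_id, text in documents.items()}


term_ids = build_dictionary(docs)
encoded = encode(docs, term_ids)

for doc_id in sorted(encoded):
    print(doc_id, docs[doc_id], encoded[doc_id])
```

A production analyzer would also lowercase, strip punctuation, and possibly remove stop words such as "a"; that is left out here to stay close to the example.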
  • 28. How it works … ? Data Source – Documents … boolean search with inverted index
      Dictionary            Documents
      Term    Tot. freq.    Doc #
      John    2             1 2
      has     4             1 2 3 4
      a       4             1 2 3 4
      cat     2             1 3
      dog     2             2 4
      Eva     1             3
      George  1             4
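A minimal sketch of an inverted index with boolean AND/OR search over the same four documents follows. It is an illustration of the idea rather than Solr's or Lucene's actual data structure; the helper names are hypothetical and tokenization is plain whitespace splitting.

```python
from collections import defaultdict

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(documents):
    """Return (postings, total_freq): term -> set of doc IDs, term -> occurrence count."""
    postings = defaultdict(set)
    total_freq = defaultdict(int)
    for doc_id, text in documents.items():
        for word in text.split():
            postings[word].add(doc_id)
            total_freq[word] += 1
    return postings, total_freq


def search_and(postings, *terms):
    """Documents containing every term: intersect the posting lists."""
    lists = [postings.get(t, set()) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []


def search_or(postings, *terms):
    """Documents containing at least one term: union of the posting lists."""
    return sorted(set().union(*(postings.get(t, set()) for t in terms)))


postings, total_freq = build_inverted_index(docs)

print(search_and(postings, "John", "dog"))   # documents with both words
print(search_or(postings, "Eva", "George"))  # documents with either word
```

The point of the structure is that a query only touches the posting lists of its own terms, so the documents themselves never need to be scanned at search time.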
  • 29. How it works … ? Data Source – documents as vectors
      Documents
      document 1  John has a cat
      document 2  John has a dog
      document 3  Eva has a cat
      document 4  George has a dog
      Space of unique terms ->      John  has  a  cat  dog  Eva  George
      vector representing doc1 ->   1     1    1  1    0    0    0
      vector representing doc2 ->   1     1    1  0    1    0    0
      vector representing doc3 ->   0     1    1  1    0    1    0
      vector representing doc4 ->   0     1    1  0    1    0    1
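The binary document vectors can be produced with a short sketch like the one below. The cosine similarity at the end is an addition for illustration: it is one common way to compare such vectors and is not stated on the slide. Helper names are hypothetical.

```python
import math

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def vocabulary(documents):
    """Unique terms in order of first appearance."""
    terms = []
    for doc_id in sorted(documents):
        for word in documents[doc_id].split():
            if word not in terms:
                terms.append(word)
    return terms


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    words = set(text.split())
    return [1 if t in words else 0 for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (assumed comparison measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


terms = vocabulary(docs)
vectors = {doc_id: to_vector(text, terms) for doc_id, text in docs.items()}

for doc_id in sorted(vectors):
    print(doc_id, vectors[doc_id])
print(cosine(vectors[1], vectors[2]), cosine(vectors[1], vectors[4]))
```

Documents 1 and 2 share three of their four terms, so they score higher than documents 1 and 4, which share only "has" and "a".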
  • 30. How it works … ? Data Source – Documents … vectors
  • 32. Thank you. Data Wizards. E-mail: artur.grzadziel@gmail.com Links:
      • Cloudera Search: http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
      • Tika: https://tika.apache.org/
      • Apache Solr: http://lucene.apache.org/solr/ and https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
      • Vectors, Inverted Index, Frequency Matrix, etc.: http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm