Search Analytics

      Business Value
            &
      NoSQL Backend


Otis Gospodnetić – Sematext International
  @otisg ◦ @sematext ◦ sematext.com

    sematext.com/search-analytics
About Otis Gospodnetić
• ASF Member: Lucene, Solr, Nutch, Mahout

• Author: Lucene in Action 1 & 2


• Entrepreneur: Sematext, Simpy




                                                                     2
               Copyright 2011 Sematext Int'l. All rights reserved.
Sematext Metrics
●   100% organic: no GMO, no VC
●   4 years old
●   < 10 people
●   7 countries
●   3 timezones
●   2 continents
●   > 100 customers


About Sematext
    Products & Services
    Consulting, Development, Tech Support:

●   Search (Lucene, Solr, ElasticSearch...)
●   Big Data (Hadoop, HBase, Voldemort...)
●   Web Crawling (Nutch, Droids)
●   Machine Learning (Mahout)


Agenda

●   What is Search Analytics and why it matters
●   Example reports and their value
●   What we built, why, and how




Communication
●   twitter.com/sematext
●   twitter.com/otisg
●   hash tags: #stsa or #stanalytics
●   http://sematext.com/search-analytics/index.html
●   Raise your hand!
●   otis@sematext.com



The Compass


     Search logs are your Map
     Search Analytics is your Compass




High Level Why


                         search
                          users


                      search
                    experience



                       search
                      providers




High Level Why
                                                             This search sucks!
                                                   It takes 17 tries to find anything here!
                                                              F!?@#$%^&?!?


                         search
                          users


                      search
                    experience



                       search
                      providers
                                                          Cool, the latest search tweaks
                                                           made our site really sticky!
                                                                     Awesome!



Don't Be Like This Dude




Got Clue?

                Performance Monitoring




    Tuning      Search Analytics                                   UI




                   Quality Assurance




More Concrete Why
●   Measure and monitor everything. Introspection.
●   Supports (re)design, navigation choices
●   Helps with content acquisition & enhancement
●   Improve search experience
●   Moola (more revenue)




The Moment of Truth
       Question for the audience #1

   What do you use for Search Analytics?

   a) Home grown stuff
   b) Google Analytics
   c) Omniture
   d) Webtrends
   e) Other
   f) Nothing

Search Analytics Outline
●   Collect: queries & clicks & interactions & ...
●   Analyze: actions / transactions / conversions
●   Output: reports – over time
●   Output++: feedback loop                                             remember this




●   The means, not the goal
●   Ongoing, not one-off


Search vs. Web Analytics
●   User intent and information needs vs. inferring
●   Hand in hand
●   Ideally you can relate data from both or even
    unify it




Example Core Reports
●   Rate & Volume, Latency (median, avg, 90th percentile)
●   Click Through Rate, Mean Reciprocal Rank
●   Top Queries by count, clicks, 0 hits...
●   Query Trending
●   Top Seen Docs, Top Clicked Docs (msft)
●   Page & Click Depth
●   Facet & Sort Usage
●   ...
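
A minimal sketch of how two of the reports above could be computed, assuming each logged search carries the 1-based ranks of the results that were clicked. The `Search` record and its fields are hypothetical, not Sematext's schema. Click Through Rate is taken here as the share of searches with at least one click, and Mean Reciprocal Rank as the average of 1/rank of the best-ranked click, with click-less searches scoring 0; other definitions are possible.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Search:
    """One logged search (hypothetical record shape, for illustration only)."""
    query: str
    clicked_ranks: List[int] = field(default_factory=list)  # 1-based result positions


def click_through_rate(searches: List[Search]) -> float:
    """Fraction of searches that received at least one click."""
    if not searches:
        return 0.0
    clicked = sum(1 for s in searches if s.clicked_ranks)
    return clicked / len(searches)


def mean_reciprocal_rank(searches: List[Search]) -> float:
    """Average of 1/rank of the best-ranked click; searches with no click score 0."""
    if not searches:
        return 0.0
    total = sum(1.0 / min(s.clicked_ranks) for s in searches if s.clicked_ranks)
    return total / len(searches)


# Made-up sample log: three of four searches got a click.
log = [
    Search("hbase scan", [1]),
    Search("flume sink", [4, 2]),
    Search("lucene", []),
    Search("solr facet", [2]),
]
```

A falling MRR with a steady CTR would suggest users still click, but on results further down the page.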
More Reports in More Detail
●   See Search Analytics What? Why?
    How?

    http://blog.sematext.com/tag/analytics/




Part Dos
     Switching gears... Juno digs NoSQL




What We've Built
●   Search Analytics SaaS
    ●   Numerous reports (e.g. query volume,
        rate, latency, term frequencies /
        comparisons, hit buckets, search origins,
        etc.)
    ●   Trending over time
    ●   Comparisons of time periods
    ●   Top N reports
    ●   Filter, slice and dice
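
As a rough illustration of the Top N and period-comparison reports listed above, the sketch below works on in-memory lists of `(query, hit_count)` pairs. The event shape, the `normalize` rule, and all function names are assumptions made for this example; the actual service aggregates at much larger scale.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each event is (query, number_of_hits); a hypothetical shape for illustration.
Event = Tuple[str, int]


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries group together."""
    return " ".join(query.lower().split())


def top_queries(events: List[Event], n: int) -> List[Tuple[str, int]]:
    """Top N queries by count."""
    return Counter(normalize(q) for q, _ in events).most_common(n)


def top_zero_hit_queries(events: List[Event], n: int) -> List[Tuple[str, int]]:
    """Top N queries that returned no results."""
    return Counter(normalize(q) for q, hits in events if hits == 0).most_common(n)


def compare_periods(before: List[Event], after: List[Event]) -> Dict[str, int]:
    """Change in count per query between two periods (after minus before)."""
    b = Counter(normalize(q) for q, _ in before)
    a = Counter(normalize(q) for q, _ in after)
    return {q: a[q] - b[q] for q in set(a) | set(b)}


# Made-up sample data for two periods.
last_week = [("HBase", 12), ("hbase ", 12), ("flume", 3), ("nutsch", 0)]
this_week = [("hbase", 12), ("flume", 3), ("flume", 3), ("Flume", 3),
             ("nutsch", 0), ("nutsch", 0)]
```

Here the misspelled "nutsch" shows up as a growing zero-hit query, the kind of signal that could feed a spellchecker or content acquisition.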


Who Needs a Compass?
●   We need it
    ●   search-hadoop.com & search-lucene.com

●   Our customers need it!

●   You?




Sematext Search Analytics




Big Dreams
●   SaaS
●   Multitenant
●   Large Scale – Massive Data
●   Cloud




Storage Choices
●   RDBMS: MySQL, PostgreSQL
●   HDFS
●   Hive
●   HBase
●   Cassandra




SaaS vs. In-House
     Question for the audience #2

     SaaS vs in-house Search Analytics?

     a) SaaS
     b) in-house




Data Flow
●   See Search Analytics with Flume and HBase
     http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/




Data Collection
●   See Search Analytics with Flume and HBase
    http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/




Core Tech
●   JavaScript Beacons
●   Metric Capture Web App aka Receiver
●   Flume Agents, Collectors, Sinks
●   HBase
●   MapReduce Aggregations
●   Search Analytics Reporting Web App
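
To make the "MapReduce Aggregations" step more concrete, here is a plain-Python imitation of the map, shuffle, and reduce phases that rolls raw search events up into hourly counts per query. It does not use the Hadoop API; the event shape `(epoch_seconds, query)` and every function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, List, Tuple

RawEvent = Tuple[int, str]   # (epoch seconds, query); hypothetical shape
HourKey = Tuple[str, str]    # (hour bucket such as "2011-05-25T14", query)


def map_event(event: RawEvent) -> Iterator[Tuple[HourKey, int]]:
    """Map phase: emit ((hour, normalized query), 1) for each raw event."""
    ts, query = event
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")
    yield (hour, query.strip().lower()), 1


def shuffle(pairs: Iterable[Tuple[HourKey, int]]) -> Dict[HourKey, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: Dict[HourKey, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: HourKey, values: List[int]) -> Tuple[HourKey, int]:
    """Reduce phase: sum the values for one key."""
    return key, sum(values)


def aggregate_hourly(events: Iterable[RawEvent]) -> Dict[HourKey, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory event list."""
    pairs = (pair for event in events for pair in map_event(event))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


events = [(0, "HBase"), (1800, "hbase "), (3600, "hbase"), (10, "flume")]
hourly = aggregate_hourly(events)
```

Pre-aggregated rows like these are what a reporting app would read, rather than scanning raw events for every chart.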



What is Flume
●   Distributed data/log collection service
●   Scalable, configurable, extensible
●   Centrally manageable, open source

●   Agents get data from app, Collectors save it
●   Abstractions: Source → Decorator(s) → Sink
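
The Source → Decorator(s) → Sink chain can be pictured with a small Python sketch. This only mirrors the shape of the abstraction; it is not Flume's Java API or its configuration syntax, and all names below are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw lines into events."""
    for line in lines:
        yield {"body": line}


def add_host(host: str) -> Decorator:
    """Decorator factory: annotates every event with the originating host."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "host": host}
    return decorate


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events with blank bodies."""
    for event in events:
        if event["body"].strip():
            yield event


class MemorySink:
    """Sink: collects events in memory; a real sink would write to durable storage."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def append(self, event: Event) -> None:
        self.events.append(event)


def run_flow(source: Iterator[Event], decorators: List[Decorator], sink: MemorySink) -> None:
    """Wire source through each decorator in order, then drain into the sink."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    for event in stream:
        sink.append(event)


sink = MemorySink()
run_flow(list_source(["q=hbase", "  ", "q=flume"]), [drop_empty, add_host("web1")], sink)
```

Because decorators compose, filtering or enrichment can be added without touching the source or the sink.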



What is HBase
●   Scalable, reliable, distributed, column-oriented DB
●   On top of HDFS
●   MapReducible




Data Flow, Detailed




Why Flume
●   Reliable delivery
    ●   e.g. queue msgs locally if destination unreachable
●   Easy, centralized management via Web UI or
    console
●   Good community, good progress, now @ASF
●   But: more complex, more moving parts
●   On Flume: slideshare.net/cloudera/inside-flume
●   Alternatives: Kafka, Scribe...
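
The "queue msgs locally if destination unreachable" idea can be sketched as follows. This is a simplified, in-memory illustration and not how Flume implements it; `BufferingSender` and `FlakyDestination` are hypothetical names, and a durable version would persist the queue to disk so that messages survive a restart.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferingSender:
    """Queues messages locally and retries them, in order, on the next send or flush."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._deliver = deliver            # expected to raise ConnectionError when unreachable
        self._pending: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._pending.append(message)
        self.flush()

    def flush(self) -> int:
        """Try to deliver queued messages; returns how many are still pending."""
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except ConnectionError:
                break                      # keep the message and retry later
            self._pending.popleft()        # remove only after successful delivery
        return len(self._pending)


class FlakyDestination:
    """Test double standing in for a collector that may be down."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[str] = []

    def __call__(self, message: str) -> None:
        if not self.up:
            raise ConnectionError("destination unreachable")
        self.received.append(message)
```

Usage: while the destination is down, `send` only grows the local queue; once it is reachable again, the next `send` or `flush` drains the backlog in the original order.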

Why HBase
●   Scalable raw & aggregate data storage
●   MapReduce data input
●   Fast scans for time ranges, fast key lookups
●   Easy storage and compute power expansion
●   Good looking roadmap, community, progress
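
Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key. The sketch below simulates that with a sorted in-memory list; it does not use the HBase client API, and the key layout (tenant id followed by a big-endian timestamp) is an assumption for illustration, not Sematext's actual schema.

```python
import bisect
import struct
from typing import List, Optional


def row_key(tenant_id: int, epoch_seconds: int) -> bytes:
    """Fixed-width big-endian fields, so byte order matches (tenant, time) order."""
    return struct.pack(">IQ", tenant_id, epoch_seconds)


class SortedStore:
    """Stand-in for a table kept sorted by row key."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: List[str] = []

    def put(self, key: bytes, value: str) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value        # overwrite existing row
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key: bytes) -> Optional[str]:
        """Point lookup by exact key."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: bytes, stop: bytes) -> List[str]:
        """Values with start <= key < stop, located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._values[lo:hi]


store = SortedStore()
store.put(row_key(7, 300), "c")
store.put(row_key(7, 100), "a")
store.put(row_key(8, 150), "x")
store.put(row_key(7, 200), "b")
```

One trade-off of such a layout: keys that grow monotonically with time tend to concentrate writes on a single region, so a real schema may need to spread them out.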




Open Sourcing
●   2 open-source projects:
    github.com/sematext/HBaseWD
    github.com/sematext/HBaseHUT
●   See sematext.com/open-source/index.html

●   Patches for Flume and HBase
    blog.sematext.com/tag/flume/


Challenges
●   Data size. Solutions:
    ●   Compression (4-5x smaller with LZO)
    ●   Data pruning (variable levels)
●   Query string distribution: very long-tail
    ●   Lots of data to process, update, aggregate
●   Young tools: Flume, HBase
●   Poor IO on EC2
●   Hadoop distributions

Output++
●   AutoComplete - $MM improvement
●   Better DYM Spellchecker
●   Related Searches
●   Recommendations
●   Relevance Feedback
●   ...
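
As one example of the feedback loop, logged queries can be turned into AutoComplete suggestions. The sketch below ranks past queries by frequency and filters them by prefix; it is a deliberately naive illustration with hypothetical names, and a production suggester would likely use a dedicated index and also weigh clicks or conversions.

```python
from collections import Counter
from typing import Callable, Iterable, List


def build_suggester(queries: Iterable[str]) -> Callable[..., List[str]]:
    """Build a prefix suggester from logged queries, most frequent first."""
    counts = Counter(" ".join(q.lower().split()) for q in queries)
    # Sort once: highest count first, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    def suggest(prefix: str, limit: int = 5) -> List[str]:
        p = " ".join(prefix.lower().split())
        return [q for q, _ in ranked if q.startswith(p)][:limit]

    return suggest


suggest = build_suggester(
    ["hbase", "HBase", "hbase scan", "hadoop", "hadoop", "hadoop", "flume"]
)
```

With this sample log, typing "h" would offer "hadoop" before "hbase", because it was searched more often.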



Closing the Loop

                         search
                          users



                      search
                    experience




                        search
                       providers




Resource
                                      Search Analytics for Your Site
                                                  Louis Rosenfeld




           http://rosenfeldmedia.com/books/searchanalytics/




We're Hiring
    Dig Search?
    Dig Analytics?
    Dig Big Data?
    Dig Performance?
    Dig working with and in open-source?
    We're hiring world-wide!
    http://sematext.com/about/jobs.html


Contact
      sematext.com
      blog.sematext.com
      @sematext
      @otisg
      otis@sematext.com

      Want SA? Grab me or go to:
          sematext.com/search-analytics

      Hash tags: #stsa or #stanalytics
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Ad

Search Analytics Business Value & NoSQL Backend

  • 1. Search Analytics Business Value & NoSQL Backend Otis Gospodnetić – Sematext International @otisg ◦ @sematext ◦ sematext.com sematext.com/search-analytics
  • 2. About Otis Gospodnetić • ASF Member: Lucene, Solr, Nutch, Mahout • Author: Lucene in Action 1 & 2 • Entrepreneur: Sematext, Simpy 2 Copyright 2011 Sematext Int'l. All rights reserved.
  • 3. Sematext Metrics ● 100% organic: no GMO, no VC ● 4 years old ● < 10 people ● 7 countries ● 3 timezones ● 2 continents ● > 100 customers
  • 4. About Sematext Products & Services Consulting, Development, Tech Support: ● Search (Lucene, Solr, ElasticSearch...) ● Big Data (Hadoop, HBase, Voldemort...) ● Web Crawling (Nutch, Droids) ● Machine Learning (Mahout)
  • 5. Agenda ● What is Search Analytics and why it matters ● Example reports and their value ● What we built, why, and how
  • 6. Communication ● twitter.com/sematext ● twitter.com/otisg ● hash tags: #stsa or #stanalytics ● http://sematext.com/search-analytics/index.html ● Raise your hand! ● otis@sematext.com
  • 7. The Compass Search logs are your Map Search Analytics is your Compass
  • 8. High Level Why search users search experience search providers
  • 9. High Level Why ● search users: "This search sucks! It takes 17 tries to find anything here! F!?@#$%^&?!?" ● search experience ● search providers: "Cool, the latest search tweaks made our site really sticky! Awesome!"
  • 10. Don't Be Like This Dude
  • 11. Got Clue? Performance Monitoring Tuning Search Analytics UI Quality Assurance
  • 12. More Concrete Why ● Measure and monitor everything. Introspection. ● Supports (re)design, navigation choices ● Helps with content acquisition & enhancement ● Improve search experience ● Mula
  • 13. The Moment of Truth Question for the audience #1 What do you use for Search Analytics? a) Home grown stuff b) Google Analytics c) Omniture d) Webtrends e) Other f) Nothing
  • 14. Search Analytics Outline ● Collect: queries & clicks & interactions & ... ● Analyze: actions / xactions / conversions ● Output: reports – over time ● Output++: feedback loop (remember this) ● The means, not the goal ● Ongoing, not one-off
  • 15. Search vs. Web Analytics ● User intent and information needs vs. inferring ● Hand in hand ● Ideally you can relate data from both or even unify it
  • 16. Example Core Reports ● Rate & Volume, Latency (mean, avg, 90%) ● Click Through Rate, Mean Reciprocal Rank ● Top Queries by count, clicks, 0 hits... ● Query Trending ● Top Seen Docs, Top Clicked Docs (msft) ● Page & Click Depth ● Facet & Sort Usage ● ...
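A few of the reports named on this slide can be sketched in a handful of lines. The example below is a minimal illustration, not Sematext's implementation: the `SearchEvent` record and its fields are hypothetical, and Mean Reciprocal Rank is computed here from the rank of the first clicked result, which is one common way to apply it to click logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SearchEvent:
    """One logged search. Hypothetical schema used only for this sketch."""
    query: str
    hits: int                    # number of results the search returned
    clicked_rank: Optional[int]  # 1-based rank of the first click, None if no click


def click_through_rate(events: List[SearchEvent]) -> float:
    """Fraction of searches followed by at least one click."""
    if not events:
        return 0.0
    clicked = sum(1 for e in events if e.clicked_rank is not None)
    return clicked / len(events)


def mean_reciprocal_rank(events: List[SearchEvent]) -> float:
    """Average of 1/rank of the first click; searches without a click count as 0."""
    if not events:
        return 0.0
    total = sum(1.0 / e.clicked_rank for e in events if e.clicked_rank)
    return total / len(events)


def top_zero_hit_queries(events: List[SearchEvent], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent queries that returned no results."""
    counts = Counter(e.query for e in events if e.hits == 0)
    return counts.most_common(n)
```

A low Mean Reciprocal Rank alongside a healthy Click Through Rate would suggest that users do find something to click, but only far down the result list.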
  • 17. More Reports in More Detail ● See Search Analytics What? Why? How? http://blog.sematext.com/tag/analytics/
  • 18. Part Dos Switching gears... Juno digs NoSQL
  • 19. What We've Built ● Search Analytics SaaS ● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.) ● Trending over time ● Comparisons of time periods ● Top N reports ● Filter, slice and dice
  • 20. Who Needs a Compass? ● We need it ● search-hadoop.com & search-lucene.com ● Our customers need it! ● You?
  • 21. Sematext Search Analytics
  • 22. Big Dreams ● SaaS ● Multitenant ● Large Scale – Massive Data ● Cloud
  • 23. Storage Choices ● RDBMS: MySQL, PostgreSQL ● HDFS ● Hive ● HBase ● Cassandra
  • 24. SaaS vs. In-House Question for the audience #2 SaaS vs in-house Search Analytics? a) SaaS b) in-house
  • 25. Sematext Search Analytics
  • 26. Sematext Search Analytics
  • 27. Sematext Search Analytics
  • 28. Sematext Search Analytics
  • 29. Data Flow ● See Search Analytics with Flume and HBase http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
  • 30. Data Collection ● See Search Analytics with Flume and HBase http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
  • 31. Core Tech ● JavaScript Beacons ● Metric Capture Web App aka Receiver ● Flume Agents, Collectors, Sinks ● HBase ● MapReduce Aggregations ● Search Analytics Reporting Web App
  • 32. What is Flume ● Distributed data/log collection service ● Scalable, configurable, extensible ● Centrally manageable, open source ● Agents get data from app, Collectors save it ● Abstractions: Source → Decorator(s) → Sink
  • 33. What is HBase ● Scalable, reliable, distributed, column-oriented DB ● On top of HDFS ● MapReducable
  • 34. Data Flow, Detailed
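The Source → Decorator(s) → Sink abstraction on this slide can be illustrated with plain Python generators. This is a conceptual sketch only: it does not use Flume's API or configuration syntax, and every name in it (`list_source`, `drop_empty_queries`, `add_field`, `memory_sink`, `run_flow`) is hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Source = Callable[[], Iterable[Event]]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]
Sink = Callable[[Iterable[Event]], None]


def list_source(records: List[Event]) -> Source:
    """Source: emits events from an in-memory list (stands in for an app log)."""
    def source() -> Iterable[Event]:
        return iter(records)
    return source


def drop_empty_queries(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events whose query is missing or blank."""
    for event in events:
        if event.get("query", "").strip():
            yield event


def add_field(name: str, value: str) -> Decorator:
    """Decorator factory: tags every event with an extra field."""
    def decorator(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, name: value}
    return decorator


def memory_sink(store: List[Event]) -> Sink:
    """Sink: appends events to a list (stands in for a real destination)."""
    def sink(events: Iterable[Event]) -> None:
        store.extend(events)
    return sink


def run_flow(source: Source, decorators: List[Decorator], sink: Sink) -> None:
    """Wire source -> decorator(s) -> sink and drain the stream."""
    stream = iter(source())
    for decorator in decorators:
        stream = decorator(stream)
    sink(stream)
```

Because each decorator only wraps an event stream, filters and enrichers can be added or reordered without touching the source or the sink.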
  • 35. Why Flume ● Reliable delivery ● e.g. queue msgs locally if destination unreachable ● Easy, centralized management via Web UI or console ● Good community, good progress, now @ASF ● But: more complex, more moving parts ● On Flume: slideshare.net/cloudera/inside-flume ● Alternatives: Kafka, Scribe...
  • 36. Why HBase ● Scalable raw & aggregate data storage ● MapReduce data input ● Fast scans for time ranges, fast key lookups ● Easy storage and compute power expansion ● Good looking roadmap, community, progress
  • 37. Open Sourcing ● 2 open-source projects: github.com/sematext/HBaseWD github.com/sematext/HBaseHUT ● See sematext.com/open-source/index.html ● Patches for Flume and HBase blog.sematext.com/tag/flume/
  • 38. Challenges ● Data size. Solutions: ● Compression (4-5x smaller with LZO) ● Data pruning (variable levels) ● Query string distribution: very long-tail ● Lots of data to process, update, aggregate ● Young tools: Flume, HBase ● Poor IO on EC2 ● Hadoop distributions
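HBase keeps rows sorted by row key, which is what makes time-range scans and key lookups fast when the key is designed for them. The sketch below simulates that idea in pure Python with a sorted key list. It is not the HBase client API, and the key layout (site id followed by a zero-padded timestamp) is an assumption for illustration, not the schema used in the talk.

```python
import bisect
from typing import Dict, List, Tuple


def row_key(site_id: str, epoch_seconds: int) -> bytes:
    """Hypothetical key layout: site id, then a zero-padded timestamp.

    Zero-padding makes lexicographic byte order match numeric time order.
    """
    return f"{site_id}|{epoch_seconds:010d}".encode("ascii")


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: Dict[bytes, str] = {}

    def put(self, key: bytes, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes) -> str:
        """Point lookup by exact key."""
        return self._values[key]

    def scan(self, start: bytes, stop: bytes) -> List[Tuple[bytes, str]]:
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def scan_time_range(table: SortedTable, site_id: str,
                    start_ts: int, end_ts: int) -> List[str]:
    """All values for one site with start_ts <= timestamp < end_ts."""
    rows = table.scan(row_key(site_id, start_ts), row_key(site_id, end_ts))
    return [value for _, value in rows]
```

One trade-off this sketch ignores: strictly increasing keys tend to concentrate writes on a single region, so real deployments often add a prefix or bucket to spread the load.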
  • 39. Output++ ● AutoComplete - $MM improvement ● Better DYM Spellchecker ● Related Searches ● Recommendations ● Relevance Feedback ● ...
  • 40. Closing the Loop search users search experience search providers
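As one example of feeding analytics back into search, logged queries can drive AutoComplete suggestions. The following is a minimal sketch under simple assumptions (suggestions are past queries matching a prefix, ranked by frequency); the function names are hypothetical, and a production system would use a proper index rather than a linear scan.

```python
from collections import Counter
from typing import Callable, Iterable, List


def build_suggester(logged_queries: Iterable[str]) -> Callable[..., List[str]]:
    """Build a prefix suggester from raw logged query strings."""
    # Normalize case and whitespace so "Solr" and "solr " count as one query.
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix: str, limit: int = 5) -> List[str]:
        p = prefix.strip().lower()
        if not p:
            return []
        matches = [(q, n) for q, n in counts.items() if q.startswith(p)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest
```

Usage: `suggest = build_suggester(queries)` once per refresh of the query log, then `suggest("so")` for each keystroke.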
  • 41. Resource Search Analytics for Your Site Louis Rosenfeld http://rosenfeldmedia.com/books/searchanalytics/
  • 42. We're Hiring Dig Search? Dig Analytics? Dig Big Data? Dig Performance? Dig working with and in open-source? We're hiring world-wide! http://sematext.com/about/jobs.html
  • 43. Contact sematext.com blog.sematext.com @sematext @otisg [email protected] Want SA? Grab me or go to: sematext.com/search-analytics Hash tags: #stsa or #stanalytics