1
Searching Wikipedia with Amazon CloudSearch
2
Agenda
• Project Background
• High-level Architecture
• Summary & Observations
3
Project Background
• Amazon contracted with Search Technologies
to help with beta-testing, prior to the launch of
Amazon CloudSearch
• Decision to use Wikipedia as a convenient data
set for testing purposes
4
High-level Architecture
5
Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform:
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
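
The processing tasks listed above can be sketched in a few lines of Python. This is an illustration only, not Aspire and not the code used in the project: it assumes a MediaWiki-style dump with `<page>`, `<title>`, `<id>`, `<timestamp>` and `<text>` elements, and the `<batch>`/`<add>`/`<field>` output layout, the `wiki_` ID prefix and the field names are assumptions that should be checked against the CloudSearch documentation for the API version in use.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny stand-in for a Wikipedia dump file (assumed layout).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Ada Lovelace</title>
    <id>12</id>
    <revision>
      <id>900</id>
      <timestamp>2012-03-01T10:15:00Z</timestamp>
      <text>Ada Lovelace was a mathematician.</text>
    </revision>
  </page>
  <page>
    <title>Charles Babbage</title>
    <id>34</id>
    <revision>
      <id>901</id>
      <timestamp>2012-03-02T08:00:00Z</timestamp>
      <text>Charles Babbage designed the Analytical Engine.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def normalize_date(raw):
    """Date normalization: parse the dump timestamp, emit ISO 8601 in UTC."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


def iter_pages(stream):
    """Splitting: yield one dict per <page> without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for node in elem.iter():
            name = local_name(node.tag)
            if name in ("id", "title", "timestamp", "text") and node.text:
                # setdefault keeps the first <id>, which is the page ID.
                fields.setdefault(name, node.text.strip())
        if "timestamp" in fields:
            fields["timestamp"] = normalize_date(fields["timestamp"])
        yield fields
        elem.clear()  # free the memory held by the finished page


def to_batch_xml(pages):
    """Mapping: wrap pages in an assumed add/field batch layout."""
    batch = ET.Element("batch")
    for page in pages:
        attrs = {"id": f"wiki_{page['id']}", "version": "1", "lang": "en"}
        add = ET.SubElement(batch, "add", attrs)
        for name in ("title", "timestamp", "text"):
            if name in page:
                field = ET.SubElement(add, "field", {"name": name})
                field.text = page[name]
    return ET.tostring(batch, encoding="unicode")


if __name__ == "__main__":
    pages = iter_pages(io.BytesIO(SAMPLE.encode("utf-8")))
    print(to_batch_xml(pages))
```

Streaming with `iterparse` matters here because the dump files are large; each page is released as soon as it has been converted.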
6
Aspire in Brief
• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
• Tested with Elasticsearch and SharePoint 2013
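
The idea of pipelines built from small, individually testable stages with conditional branching can be illustrated generically. The sketch below is not Aspire's API or configuration format; every function name in it is hypothetical, and it only shows the pattern: a common pipeline, a branch chosen per document, and a thread pool for parallel processing.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_whitespace(doc):
    """Stage: collapse runs of whitespace in the text field."""
    return {**doc, "text": " ".join(doc.get("text", "").split())}


def detect_redirect(doc):
    """Stage: flag wiki redirect stubs so they can be routed separately."""
    return {**doc, "is_redirect": doc["text"].upper().startswith("#REDIRECT")}


def add_length(doc):
    """Stage: record the cleaned text length as simple metadata."""
    return {**doc, "length": len(doc["text"])}


def mark(action):
    """Build a stage that records what should happen to the document."""
    def stage(doc):
        return {**doc, "action": action}
    return stage


def run_pipeline(doc, stages):
    """Apply each stage in order; stages return new dicts, never mutate."""
    for stage in stages:
        doc = stage(doc)
    return doc


COMMON = [clean_whitespace, detect_redirect]
SKIP = [mark("skip")]
INDEX = [add_length, mark("index")]


def process(doc):
    """Run the common pipeline, then branch on its result."""
    doc = run_pipeline(doc, COMMON)
    branch = SKIP if doc["is_redirect"] else INDEX
    return run_pipeline(doc, branch)


def process_all(docs, workers=4):
    """Process documents concurrently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, docs))
```

Because each stage is a plain function that returns a new dictionary, a stage can be unit-tested or replaced on its own, and sharing stages across threads is safe.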
7
XML Input
8
Indexing
• Streaming Wikipedia Dump Files directly into
CloudSearch
• 500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
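
To put the 500 docs/second figure in context, a rough estimate of total indexing time is simply document count divided by throughput. The document count below is a placeholder assumption, not a number from the project.

```python
def indexing_time_hours(doc_count, docs_per_second=500):
    """Estimate wall-clock hours needed to index doc_count documents."""
    seconds = doc_count / docs_per_second
    return seconds / 3600


# Hypothetical corpus size, used only to illustrate the arithmetic.
ASSUMED_DOC_COUNT = 4_000_000

if __name__ == "__main__":
    hours = indexing_time_hours(ASSUMED_DOC_COUNT)
    print(f"{ASSUMED_DOC_COUNT} docs at 500 docs/s: about {hours:.1f} hours")
```

Under that assumption the full load takes 8,000 seconds, a little over two hours.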
9
Searching
• Amazon CloudSearch provides a RESTful/XML
interface for search purposes
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at
http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
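
A RESTful search interface of this kind is driven by an HTTP GET with query parameters. The sketch below only builds such a URL; it is not the Java API mentioned above. The endpoint host is hypothetical, and the version path and parameter names (`q`, `start`, `size`, `return-fields`, `results-type`) are assumptions to verify against the CloudSearch documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical search endpoint; a real domain has its own host name.
SEARCH_ENDPOINT = "http://search-example-domain.us-east-1.cloudsearch.amazonaws.com"
API_VERSION = "2011-02-01"  # assumed API version path segment


def build_search_url(query, start=0, size=10, return_fields=("title",)):
    """Build a GET URL for a text query with paging and an XML response."""
    params = {
        "q": query,
        "start": start,
        "size": size,
        "return-fields": ",".join(return_fields),
        "results-type": "xml",
    }
    return f"{SEARCH_ENDPOINT}/{API_VERSION}/search?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("ada lovelace", size=25))
```

A UI layer or client library would issue this request and parse the XML that comes back.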
10
Searching
• Supports navigators and
relevancy customization
• E.g. a “PageRank”-style link
analysis was performed
• Limits set high: e.g.
retrieve 500,000 results in a
single list, delivered in just a
few seconds
• Very useful for analysis
applications
• So, what does it look like?
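
For readers unfamiliar with the "PageRank" style link analysis mentioned above, here is a minimal sketch of the textbook power-iteration algorithm on a toy link graph. It is a generic illustration, not the analysis actually run in the project; the damping factor, iteration count and graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a dict mapping page -> linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

A score like this could be stored as a numeric field on each document and blended into the ranking, which is one way such link analysis feeds relevancy customization.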
11
[Image slide: wikipedia.searchtechnologies.com]
12
[Image slide: wikipedia.searchtechnologies.com]
13
Summary & Observations
• A capable and scalable “raw” engine
• XML in, RESTful/XML out
• Easy to set up – much the same as an EC2
instance
• Elastic scalability
14
Summary & Observations
• Cost effective
• From $75 per month, including management /
maintenance
• Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility

Wikipedia Cloud Search Webinar

Editor's Notes

  • #7: For further information about Aspire, see http://www.searchtechnologies.com/aspire.html
  • #10: The Java API for Amazon CloudSearch can be downloaded from http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html