Wikipedia Cloud Search Webinar

Searching Wikipedia with Amazon CloudSearch

Agenda
• Project Background
• High-level Architecture
• Summary & Observations

Project Background
• Amazon contracted with Search Technologies to help with beta-testing prior to the launch of Amazon CloudSearch
• Wikipedia was chosen as a convenient data set for testing

Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform:
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
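
The processing tasks listed above can be sketched in a few lines of Python. This is an illustration only, not the Aspire pipeline: it assumes the MediaWiki dump layout (a `page` element containing `title`, `timestamp` and `text`), and the function names `iter_pages` and `normalize_date` are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Split a dump stream into one dict of fields per <page> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {}
        for node in elem.iter():
            name = local(node.tag)
            if name in ("title", "timestamp", "text") and node.text:
                fields[name] = node.text.strip()
        yield fields
        elem.clear()  # release the finished page to keep memory flat


def normalize_date(timestamp):
    """Convert a timestamp like 2012-04-01T12:34:56Z to epoch seconds (UTC)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())
```

Streaming with `iterparse` matters here because the dump files are far too large to load as a single tree.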

Aspire in Brief
• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch and GSA
• Tested with Elasticsearch and SharePoint 2013

Indexing
• Streaming Wikipedia dump files directly into CloudSearch
• 500 docs/second achieved without much effort
• Using 4 x XL instances of CloudSearch
• 1 x XL EC2 instance for Aspire
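
As a rough sketch of what streaming documents into the engine involves, the Python below groups documents into batches and serializes each batch as XML. The `batch` / `add` / `field` element names and the `version` and `lang` attributes are assumptions modeled on the XML batch format of the original CloudSearch API; check them against the current documentation. The function names are hypothetical, and this is not the Aspire component used in the project.

```python
import xml.etree.ElementTree as ET


def batches(docs, size):
    """Group an iterable of documents into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_batch(docs, version=1, lang="en"):
    """Serialize (doc_id, fields) pairs as one XML batch string.

    Element and attribute names are assumptions, not a verified schema.
    """
    root = ET.Element("batch")
    for doc_id, fields in docs:
        add = ET.SubElement(root, "add", id=doc_id,
                            version=str(version), lang=lang)
        for name, value in fields.items():
            field = ET.SubElement(add, "field", name=name)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Each batch string would then be sent to the domain's document endpoint in a single HTTP POST, so throughput depends mostly on batch size and the number of parallel senders.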

Searching
• Amazon CloudSearch provides a RESTful/XML interface for search
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
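
To illustrate what a client for a RESTful search interface does, here is a minimal Python sketch that assembles a search URL. It is not the Java API mentioned above. The endpoint hostname is a made-up placeholder, and the parameter names (`q`, `size`, `start`, `return-fields`, `facet`) are assumptions modeled on the original CloudSearch search API; verify them against the API version in use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: every real search domain has its own hostname.
SEARCH_ENDPOINT = (
    "http://search-example.us-east-1.cloudsearch.amazonaws.com"
    "/2011-02-01/search"
)


def build_search_url(query, fields=("title",), facets=(), size=10, start=0,
                     endpoint=SEARCH_ENDPOINT):
    """Return a GET URL for a text query with optional facets (navigators)."""
    params = [("q", query), ("size", size), ("start", start)]
    if fields:
        params.append(("return-fields", ",".join(fields)))
    if facets:
        params.append(("facet", ",".join(facets)))
    return endpoint + "?" + urlencode(params)


url = build_search_url("solar system", fields=("title", "text"), size=25)
```

A UI layer would issue this request and parse the response into results and navigator counts.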

Searching
• Supports navigators and relevancy customization
• E.g. a “PageRank” style link analysis was performed
• Limits set high: e.g. retrieve 500,000 results in a single list, delivered in just a few seconds
• Very useful for analysis applications
• So, what does it look like?
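
The slides do not describe how the link analysis was implemented. As a generic illustration of a PageRank-style computation, the sketch below runs textbook power iteration over a small link graph; the damping factor and iteration count are conventional defaults, not values from the project.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


scores = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
```

A score like this could be stored as a numeric field on each document and used in a custom ranking expression.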

Screenshots: wikipedia.searchtechnologies.com

Summary & Observations
• A capable and scalable “raw” engine
• XML in, RESTful/XML out
• Easy to set up, much the same as an EC2 instance
• Elastic scalability

Summary & Observations
• Cost effective
• From $75 per month, including management / maintenance
• Extremely convenient
• Switch on / off at leisure
• Promotes experimentation & agility
