These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They provide a high-level picture of where Sphinx is used at craigslist, along with a bit of history, current issues, and future work.
- Craigslist is a classified advertising website serving over 500 cities worldwide, handling over 20 billion pageviews and 50 million users per month. It allows users to post free classified ads for jobs, housing, items for sale, and other services.
- The technical challenges for Craigslist include high ad churn rate, growth in traffic volume, need for data archiving and search capabilities, and maintaining the system with a small team.
- Craigslist uses open source technologies like MySQL, memcached, Apache, and Sphinx to power its infrastructure while keeping it simple, efficient and low cost. It employs techniques like vertical and horizontal data partitioning and incremental indexing to handle its scale.
Living with SQL and NoSQL at craigslist, a Pragmatic Approach (Jeremy Zawodny)
From the 2012 Percona Live MySQL Conference in Santa Clara, CA.
Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.
Lessons Learned Migrating 2+ Billion Documents at Craigslist (Jeremy Zawodny)
Lessons Learned from Migrating 2+ Billion Documents at Craigslist outlines Craigslist's migration from MySQL to MongoDB. Some key lessons include: knowing your hardware limitations, that replica sets provide high availability during reboots, understanding your data types and sizes, and being aware of limitations with sharding and replica set re-sync processes. The migration addressed issues with their archive data storage and provided a more scalable and performant system.
This document discusses Craigslist's migration from older MySQL database servers to new servers equipped with Fusion-io SSDs. It describes Craigslist's high database load of over 100 million postings and 1 billion daily page views. The migration involved replacing 14 older, less performant servers with just 3 new servers using Fusion-io SSDs. This reduced total power usage from 4,500 watts to 570 watts while greatly increasing I/O performance and reducing query response times.
Understanding and tuning WiredTiger, the new high performance database engine... (Ontico)
MongoDB 3.0 introduced the concept of pluggable storage engines. The new engine, known as WiredTiger, introduces document-level MVCC locking, compression, and a choice between B-tree or LSM indexes. In this talk you will learn about the storage engine architecture, specifically WiredTiger, and how to tune and monitor it for best performance.
Frontera: a distributed crawler for large-scale web crawling / Alexander S... (Ontico)
In this talk I am going to share our experience crawling the Spanish internet. We set ourselves the goal of crawling about 600 thousand websites in the .es zone in order to collect statistics about the hosts and their sizes. I will cover the crawler's architecture, the storage, the problems we ran into during the crawl, and how we solved them.
Our solution is available as the open source framework Frontera. The framework lets you build a distributed crawler for downloading pages from the Internet at large scale in real time. It can also be used to build focused crawlers that fetch a subset of websites known in advance.
The framework offers: configurable storage for URLs and documents (RDBMS or key-value), crawl strategy management, a transport layer abstraction, and a download module abstraction.
The talk is structured as an engaging story: a description of the problem, the solution, and the issues that came up while building that solution.
«Scrapy internals» Alexander Sibiryakov, Scrapinghub (it-people)
- Scrapy is a framework for web scraping that allows for extraction of structured data from HTML/XML through selectors like CSS and XPath. It provides features like an interactive shell, feed exports, encoding support, and more.
- Scrapy is built on top of the Twisted asynchronous networking framework, which provides an event loop and deferreds. It handles protocols and transports like TCP, HTTP, and more across platforms.
- Scrapy architecture includes components like the downloader, scraper, and item pipelines that communicate internally. Flow control is needed between these to limit memory usage and scheduling through techniques like concurrent item limits, memory limits, and delays between calls.
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger (MongoDB)
This document summarizes GameChanger's journey from handling 1.48 billion events to scaling MongoDB to support increasing load. It discusses modeling data for MongoDB's flexible schema, scaling to handle more users and load by decreasing latency, growing the database by denormalizing and propagating changes, and extending the database by leveraging MongoDB's features. Key advice includes designing for MongoDB's strengths, using monolithic documents, avoiding live querying, and considering the overall architecture when scaling.
Optimizing MongoDB: Lessons Learned at Localytics (andrew311)
Tips, tricks, and gotchas learned at Localytics for optimizing MongoDB installs. Includes information about document design, indexes, fragmentation, migration, AWS EC2/EBS, and more.
View all the MongoDB World 2016 Poster Sessions slides in one place!
Table of Contents:
1: BigData DB Infrastructure for Modeling the Fly Brain
2: Taming the WiredTiger Cache
3: Sharding with MongoDB 3.2 Kick the tires and pop the hood!
4: Scaling Proactive Anomaly Detection
5: MongoTx: Transactions with Sharding and Queries
6: MongoDB: It’s Not Too Late To Shard
7: DLIFLC usage of MongoDB
MongoDB can be used simply as a log collector using for example a capped collection. Fotopedia has such a system which is used for quick introspection and realtime analysis.
Talk given on the 23rd of March, 2011 at the MongoFR days in Paris (La Cantine) by Pierre Baillet and Mathieu Poumeyrol.
A New MongoDB Sharding Architecture for Higher Availability and Better Resour... (leifwalsh)
Most modern databases concern themselves with their ability to scale a workload beyond the power of one machine. But maintaining a database across multiple machines is inherently more complex than it is on a single machine. As soon as scaling out is required, suddenly a lot of scaling out is required, to deal with new problems like index suitability and load balancing.
Write optimized data structures are well-suited to a sharding architecture that delivers higher efficiency than traditional sharding architectures. This talk describes a new sharding architecture for MongoDB applications that can be achieved with write optimized storage like TokuMX's Fractal Tree indexes.
This document provides an overview and introduction to key MongoDB concepts including:
- Replication which allows for failover, backups, and high availability through asynchronous replication across replica sets.
- Sharding which provides horizontal scalability by automatically distributing and balancing data across multiple shards in a cluster.
- Consistency and durability models including eventual consistency and different write acknowledgement options for ensuring data is safely written.
- Flexibility in data modeling through embedding and linking of related data as well as the use of JSON which maps easily to objects.
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team (Redis Labs)
Redis is an in-memory database that provides fast performance for powering lightning fast apps. It supports many data structures like strings, hashes, lists, sets and sorted sets. Redis is efficient due to its support for many data structures and commands, as well as its complexity-aware design. Redis Labs provides fully-managed cloud services for Redis and Memcached, and helps customers with challenges around scalability, high availability, performance and monitoring for large-scale Redis deployments.
1. The document describes SPEEDA's use of Elasticsearch to improve search performance over their previous MySQL solution.
2. Key points include how Elasticsearch allowed them to handle a large volume of search queries for 1000 companies and 1000 motors with real-time performance.
3. It also discusses their use of Elasticsearch features like phrase prefix searching and analyzer configurations to support searches in both Japanese and English.
Redis is an open source, advanced key-value store that can be used as a data structure server since it supports strings, hashes, lists, sets and sorted sets. It is written in C, works on most POSIX systems, and can be accessed from many programming languages. Redis provides options for data persistence like snapshots and write-ahead logging, and can be replicated for scalability and high availability. It supports master-slave replication, sentinel-based master detection, and sharding via Redis clusters. Redis has been widely adopted by many companies and is used in applications like microblogging services.
Presentation on MongoDB given at the Hadoop DC meetup in October 2009. Some of the slides at the end are extra examples that didn't appear in the talk, but might be of interest.
Back to Basics Webinar 6: Production Deployment (MongoDB)
This is the final webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will guide you through production deployment.
Sharding allows you to distribute load across multiple servers and keep your data balanced across those servers. This session will review MongoDB’s sharding support, including an architectural overview, design principles, and automation.
Recording: https://www.youtube.com/watch?v=qHkXVY2LpwU
External links: https://gist.github.com/itamarhaber/dddc3d4d9c19317b1477
Applications today are required to process massive amounts of data and return responses in real time. Simply storing Big Data is no longer enough; insights must be gleaned and decisions made as soon as data rushes in. In-memory databases like Redis provide the blazing fast speeds required for sub-second application response times. Using a combination of in-memory Redis and disk-based MongoDB can significantly reduce the “digestive” challenge associated with processing high velocity data.
This document provides information about using MongoDB with Ruby. It discusses installing MongoDB on Mac OS X and Linux, running MongoDB, comparing MongoDB and CouchDB, using MongoDB ORMs like MongoMapper in Ruby applications, defining models and relationships, and additional features of MongoDB and MongoMapper. The conclusion recommends considering MongoDB as an alternative to MySQL for some web applications due to its speed, features, and schema-less flexibility.
This document discusses PostgreSQL and its use with Drupal. It provides an overview of PostgreSQL, highlighting its features such as being object-relational, open source, standards compliant, and supporting advanced data types and indexes. It also discusses installing and managing PostgreSQL and Drupal together, and the benefits of using PostgreSQL with Drupal due to its advanced optimizer and support in Drupal through its database abstraction layer. Finally, it provides recommendations for different roles, such as considering PostgreSQL for its growth opportunities, learning new skills, and optimizing queries and caching when using it with Drupal.
Webinar Back to Basics 3: Introduction to Replica Sets (MongoDB)
A replica set in MongoDB is a group of processes that maintain copies of the data on different database servers. Replica sets provide redundancy and high availability and are the foundation of all production MongoDB deployments.
Redis is a key-value store that can be used as a database, cache, and message broker. It supports basic data structures like strings, hashes, lists, sets, sorted sets with operations that are fast thanks to storing the entire dataset in memory. Redis also provides features like replication, transactions, pub/sub messaging and can be used for caching, queueing, statistics and inter-process communication.
From MySQL to MongoDB at Wordnik (Tony Tam), MongoSF
Wordnik migrated their live application from MySQL to MongoDB to address scaling issues. They moved over 5 billion documents totaling over 1.2 TB of data with zero downtime. The migration involved setting up MongoDB infrastructure, designing the data model and software to match their existing object model, migrating the data, and optimizing performance of the new system. They achieved insert rates of over 100,000 documents per second during the migration process and saw read speeds increase to 250,000 documents per second after completing the move to MongoDB.
This document discusses using Redis as a database for the backend of a Facebook game application. It describes the requirements of supporting 1 million daily users with high write throughput needs. A Redis database was chosen because it provides fast in-memory performance suitable for the application's random access workload. Redis was able to meet the throughput requirements of 200,000 requests per minute and support storing 100KB of data per user in memory. The document provides advice to choose the right tool for the job and avoid sharding until necessary to keep the database configuration simple.
Fulltext engine for non-fulltext searches (Adrian Nuta)
Or, better said: when Sphinx can help MySQL with queries that at first glance don't involve any fulltext searching.
Sphinx was built with helping the database on fulltext queries in mind, but it can also help where there is no text search at all: the everyday queries that combine filtering, grouping and sorting, used for analytics, reporting or simply general usage.
In Sphinx, the fulltext query is executed first, creating a result set that is passed to the remaining operations (filters, groups, sorts). By reducing the size of the set that is interrogated, the whole query is not only faster, it also consumes fewer resources.
Because it is designed for speed, Sphinx can group and sort a lot faster, and can easily do segmentation or fetch the top-N best group matches in a single query.
The result is that heavy work can be offloaded from the database nodes to even a single Sphinx server.
Slides were presented at PerconaLive London 2013
Sphinx is a fulltext search engine that provides more advanced indexing and querying capabilities than MySQL fulltext search. It uses an inverted index for fast searching and supports various ranking factors, search operators, and morphology tools. Sphinx can be easily integrated with MySQL for indexing and querying via SphinxQL.
Mwasaha Mwagambo Mwasaha successfully completed the online course "Managing Big Data with MySQL" offered through Coursera and authorized by Duke University. The certificate confirms Mwasaha's identity and participation in the course, as verified by Daniel Egger, Director of the Center for Quantitative Modeling at Pratt School of Engineering, and Jana Schaich Borg, a Post-doctoral Fellow in Psychiatry and Behavioral Sciences.
Sphinx - High performance full-text search for MySQL (Nguyen Van Vuong)
The document discusses Sphinx, an open source full-text search engine. It begins with an overview of full-text search and what Sphinx is - a high performance search engine that integrates well with SQL databases. The document then covers Sphinx's workflow, including indexing data, searching via its API or SphinxQL, and its query syntax. It also discusses how Sphinx scales horizontally across nodes and clusters.
MySQL Indexing - Best practices for MySQL 5.6 (MYXPLAIN)
This document provides an overview of MySQL indexing best practices. It discusses the types of indexes in MySQL, how indexes work, and how to optimize queries through proper index selection and configuration. The presentation emphasizes understanding how MySQL utilizes indexes to speed up queries through techniques like lookups, sorting, avoiding full table scans, and join optimizations. It also covers new capabilities in MySQL 5.6 like index condition pushdown that provide more flexible index usage.
The technology world has almost written MySQL off in favor of fancy new NoSQL databases like MongoDB and Cassandra, or even Hadoop for aggregation. But MySQL has a lot to offer in terms of ACIDity, performance and simplicity. For many use cases MySQL works well. In this week's ShareThis workshop we discuss different tips & techniques to improve performance and extend the lifetime of your MySQL deployment.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
This document provides tips for tuning a MySQL database to optimize performance. It discusses why tuning is important for cost effectiveness, performance, and competitive advantage. It outlines who should be involved in tuning including application designers, developers, DBAs and system administrators. The document covers what can be tuned such as applications, databases structures, and hardware. It provides best practices for when and how much to tune a database. Specific tuning techniques are discussed for various areas including application development, database design, server configuration, and storage engine optimizations.
MySQL users commonly ask: Here's my table, what indexes do I need? Why aren't my indexes helping me? Don't indexes cause overhead? This talk gives you some practical answers, with a step by step method for finding the queries you need to optimize, and choosing the best indexes for them.
10 SQL Tricks that You Didn't Think Were Possible (Lukas Eder)
SQL is the winning language of Big Data. Whether you’re running a classic relational database, a column store (“NewSQL”), or a non-relational storage system (“NoSQL”), a powerful, declarative, SQL-based query language makes the difference. The SQL standard has evolved drastically in the past decades, and so have its commercial and open source implementations.
In this fast-paced talk, we’re going to look at very peculiar and interesting data problems and how we can solve them with SQL. We’ll explore common table expressions, hierarchical SQL, table-valued functions, lateral joins, row value expressions, window functions, and advanced data types, such as XML and JSON. And we’ll look at Oracle’s mysterious MODEL and MATCH_RECOGNIZE clauses, devices whose mystery is only exceeded by their power. Most importantly, however, we’re going to learn that everyone can write advanced SQL. Once you learn the basics in these tricks, you’re going to love SQL even more.
Indexes are references to documents that are efficiently ordered by key and maintained in a tree structure for fast lookup. They improve the speed of document retrieval, range scanning, ordering, and other operations by enabling the use of the index instead of a collection scan. While indexes improve query performance, they can slow down document inserts and updates since the indexes also need to be maintained. The query optimizer aims to select the best index for each query but can sometimes be overridden.
Tomas Doran presented on their implementation of Logstash at TIM Group to process over 55 million messages per day. Their applications are all Java/Scala/Clojure and they developed their own library to send structured log events as JSON to Logstash using ZeroMQ for reliability. They index data in Elasticsearch and use it for metrics, alerts and dashboards but face challenges with data growth.
This document summarizes Jeremy Zawodny's work with MySQL and search at Craigslist. It discusses how Craigslist uses MySQL for its classified listings but encountered scaling issues as traffic grew. To address this, Craigslist implemented the Sphinx search engine, which improved performance and allowed them to reduce their MySQL cluster size. The document also outlines Craigslist's data archiving strategy using eventual consistency and their goals for further optimizing their database and search infrastructure.
Frontera: open source, large scale web crawling framework (Scrapinghub)
This document describes Frontera, an open source framework for large scale web crawling. It discusses the architecture and components of Frontera, which includes Scrapy for network operations, Apache Kafka as a data bus, and Apache HBase for storage. It also outlines some challenges faced during the development of Frontera and solutions implemented, such as handling large websites that flood the queue, optimizing traffic to HBase, and prioritizing URLs. The document provides details on using Frontera to crawl the Spanish (.es) web domain and presents results and future plans.
This document summarizes a keynote speech given by John Adams, an early Twitter engineer, about scaling Twitter operations from 2008-2009. Some key points:
1) Twitter saw exponential growth rates from 2008-2009, processing over 55 million tweets per day and 600 million searches per day.
2) Operations focused on improving performance, reducing errors and outages, and using metrics to identify weaknesses and bottlenecks like network latency and database delays.
3) Technologies like Unicorn, memcached, Flock, Cassandra, and daemons were implemented to improve scalability beyond a traditional RDBMS and handle Twitter's data volumes and real-time needs.
4) Caching,
This document discusses using ZeroMQ and Elasticsearch for log aggregation. It proposes using ZeroMQ to transmit structured log data from application servers to a central Logstash server, which would then insert the logs into Elasticsearch for querying and analysis. This approach aims to provide a lightweight logging solution that doesn't block application servers like traditional logging to databases can. The document also provides background on tools like Logstash, Elasticsearch, and Splunk.
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day (Redis Labs)
LINE uses Redis for caching and primary storage of messaging data. It operates over 60 Redis clusters with over 1,000 machines and 10,000 nodes to handle 25 billion messages per day. LINE developed its own Redis client and monitoring system to support client-side sharding without a proxy, automated failure detection, and scalable cluster monitoring. While the official Redis Cluster was tested, it exhibited some issues around memory usage and maximum node size for LINE's large scale needs.
Messaging, interoperability and log aggregation - a new framework (Tomas Doran)
In this talk, I will cover why log files are horrible, logging structured log lines and performance metrics from large-scale production applications, as well as building reliable, scalable and flexible large-scale software systems in multiple languages.
Why (almost) all log formats are horrible will be explained, and why JSON is a good solution for logging will be discussed, along with a number of message queuing, middleware and network transport technologies, including STOMP, AMQP and ZeroMQ.
The Message::Passing framework will be introduced, along with the logstash.net project which the perl code is interoperable with. These are pluggable frameworks in ruby/java/jruby and perl with pre-written sets of inputs, filters and outputs for many many different systems, message formats and transports.
They were initially designed to be aggregators and filters of data for logging. However they are flexible enough to be used as part of your messaging middleware, or even as a replacement for centralised message queuing systems.
You can have your cake and eat it too - an architecture which is flexible, extensible, scalable and distributed. Build discrete, loosely coupled components which just pass messages to each other easily.
Integrate and interoperate with your existing code and code bases easily, consume from or publish to any existing message queue, logging or performance metrics system you have installed.
Simple examples using common input and output classes will be demonstrated using the framework, as will easily adding your own custom filters. A number of common messaging middleware patterns will be shown to be trivial to implement.
Some higher level use-cases will also be explored, demonstrating log indexing in ElasticSearch and how to build a responsive platform API using webhooks.
Interoperability is also an important goal for messaging middleware. The logstash.net project will be highlighted and we'll discuss crossing the single language barrier, allowing us to have full integration between java, ruby and perl components, and to easily write bindings into libraries we want to reuse in any of those languages.
Elasticsearch is a distributed, RESTful search and analytics engine that can be used for processing big data with Apache Spark. Data is ingested from Spark into Elasticsearch for features generation and predictive modeling. Elasticsearch allows for fast reads and writes of large volumes of time-series and other data through its use of inverted indexes and dynamic mapping. It is deployed on AWS for its elastic scalability, high availability, and integration with Spark via fast queries. Ongoing maintenance includes archiving old data, partitioning indices, and reindexing large datasets.
This document summarizes a lecture on key-value storage systems. It introduces the key-value data model and compares it to relational databases. It then describes Cassandra, a popular open-source key-value store, including how it maps keys to servers, replicates data across multiple servers, and performs reads and writes in a distributed manner while maintaining consistency. The document also discusses Cassandra's use of gossip protocols to manage cluster membership.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... (smallerror)
Twitter's operations team manages software performance, availability, capacity planning, and configuration management for Twitter. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, and optimizing databases to reduce replication delay and locks. The team also created several open source projects like CacheMoney for caching and Kestrel for asynchronous messaging.
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... (xlight)
Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The operations team manages software performance, availability, capacity planning, and configuration management using metrics, logs, and data-driven analysis to find weak points and take corrective action. They use managed services for infrastructure to focus on computer science problems. The document outlines Twitter's rapid growth and challenges in maintaining performance as traffic increases. It provides recommendations around caching, databases, asynchronous processing, and other techniques Twitter uses to optimize performance under heavy load.
Twitter's operations team manages software performance, availability, capacity planning, and configuration management. They use metrics, logs, and analysis to find weak points and take corrective action. Some techniques include caching everything possible, moving operations to asynchronous daemons, optimizing databases, and instrumenting all systems. Their goal is to process requests asynchronously when possible and avoid overloading relational databases.
Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The Twitter operations team focuses on software performance, availability, capacity planning, and configuration management using metrics, logs, and science. They use a dedicated managed services team and run their own servers instead of cloud services. The document outlines Twitter's rapid growth and challenges in maintaining performance. It discusses strategies for monitoring, analyzing metrics to find weak points, deploying changes, and improving processes through configuration management and peer reviews.
Percona Live London 2014: Serve out any page with an HA Sphinx environment (spil-engineering)
Sphinx is a full-text search engine that Spil Games uses to provide fast and complex search across their databases and indexes. Some key ways Spil Games uses Sphinx include searching for games by title or URL, finding friends across their networks, and filtering search results based on browser capabilities. To ensure high availability, Spil Games implements distributed and mirrored Sphinx indexes across multiple nodes and uses load balancers. Benchmarking shows Sphinx significantly outperforms MySQL for certain search queries.
Sharding in MongoDB allows for scaling of data and queries across multiple servers. When determining the number of shards needed, key factors to consider include total storage requirements, latency needs, and throughput requirements. These are used to calculate the necessary disk capacity, disk throughput, and RAM across shards. Different types of sharding include range, tag-aware, and hashed, with range being best for query isolation. Choosing a high cardinality shard key that matches common queries is important for performance and scalability.
This document compares Cassandra and Redis for use as a backend for a Facebook game with 1 million daily users and 10 million total users. Redis was chosen over Cassandra due to its simpler architecture, higher write throughput, and ability to meet the capacity and performance requirements using a single node. The Redis master handled all reads and writes, with a slave for failover. User data was stored in Redis hashes to turn it into a "document DB" and allow for atomic operations on parts of the data.
Elasticsearch is a distributed, RESTful search and analytics engine that can be used for processing big data with Apache Spark. It allows ingesting large volumes of data in near real-time for search, analytics, and machine learning applications like feature generation. Elasticsearch is schema-free, supports dynamic queries, and integrates with Spark, making it a good fit for ingesting streaming data from Spark jobs. It must be deployed with consideration for fast reads, writes, and dynamic querying to support large-scale predictive analytics workloads.
The relational database model was designed to solve the problems of yesterday’s data storage requirements. The massively connected world of today presents different problems and new challenges. We’ll explore the NoSQL philosophy, before comparing and contrasting the strengths and weaknesses of the relational model versus the NoSQL model. While stepping through real-world scenarios, we’ll discuss the reasons for choosing one solution over the other.
To complete this session, let’s demonstrate our findings with an application written with a NoSQL storage layer and explain the advantages that accrue from that decision. By taking a look at the new challenges we face with our data storage needs, we’ll examine why the principles behind NoSQL make it a better candidate as a solution, than yesterday’s relational model.
Speed up your Symfony2 application and build awesome features with Redis (Ricard Clau)
Redis is an extremely fast data structure server that can be easily added to your existing stack and act like a Swiss army knife to help solve many problems that would be extremely difficult to workaround with the traditional RDBMS. In this session we will focus on what Redis is, how it works, what awesome features we can build with it and how we can use it with PHP and integrate it with Symfony2 applications making them blazing fast.
3. CL Sphinx Infrastructure
• Live Sphinx
• ~30 million postings
• end users searching for stuff on craigslist
• Team Sphinx
• ~100 million postings
• additional indexes of postings for internal use
(including non-live postings)
4. CL Sphinx Infrastructure
• Archive Sphinx
• older postings (~3 billion)
• constantly growing in size
• Real-Time Sphinx
• last ~2 days worth of postings
• Forums Sphinx
• ~150 million forum postings
6. Back in 2008
• MySQL FULL TEXT (MyISAM)
• 25 Servers
• Melted Down Frequently
• Desperately Needed a Solution
• This was my first project at craigslist...
• Looked at Solr, Sphinx, Xapian
• Sphinx felt like the right fit
7. Making Sphinx Work
• Benchmarking showed promising results
• Query performance was great
• ~800qps/instance
• back then we only needed 1,200/sec
• Indexing performance too
• Can index documents far faster than I can make the XML for input (from Perl)
• Can’t index and serve at the same time, though...
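The xmlpipe2 feed mentioned above is just a program that prints an XML stream for indexer to consume. Below is a minimal sketch of that idea in Python (the real craigslist pipeline was written in Perl, and the schema and field names here are made up); note the kill-list at the end, which suppresses stale copies of changed postings in older indexes:

```python
# Minimal xmlpipe2 feeder sketch (hypothetical schema; not craigslist's code).
# indexer runs this command and reads the XML stream from stdout.
import sys
from xml.sax.saxutils import escape

POSTS = [  # stand-in for rows pulled from MySQL
    {"id": 101, "title": "red bicycle", "body": "barely used", "posted_at": 1325376000},
    {"id": 102, "title": "studio apartment", "body": "sunny, near park", "posted_at": 1325379600},
]
KILLED = [57, 99]  # ids updated or deleted since the last main index build

def main(out=sys.stdout):
    w = out.write
    w('<?xml version="1.0" encoding="utf-8"?>\n')
    w('<sphinx:docset>\n')
    w('<sphinx:schema>\n')
    w('  <sphinx:field name="title"/>\n')
    w('  <sphinx:field name="body"/>\n')
    w('  <sphinx:attr name="posted_at" type="timestamp"/>\n')
    w('</sphinx:schema>\n')
    for p in POSTS:
        w('<sphinx:document id="%d">\n' % p["id"])
        w('  <title>%s</title>\n' % escape(p["title"]))
        w('  <body>%s</body>\n' % escape(p["body"]))
        w('  <posted_at>%d</posted_at>\n' % p["posted_at"])
        w('</sphinx:document>\n')
    # kill-list: tells Sphinx to ignore these ids in older (main) indexes
    w('<sphinx:killlist>\n')
    for doc_id in KILLED:
        w('  <id>%d</id>\n' % doc_id)
    w('</sphinx:killlist>\n')
    w('</sphinx:docset>\n')

if __name__ == "__main__":
    main()
```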
8. “Live” Sphinx
• One index per city (~700 indexes)
• Main + Delta
• xmlpipe2 input
• Data all fits on a single machine
• 32bit ids
• High churn rate
• Settled on Master/Slave model w/rsync replication
• Deployed in January, 2009
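One common way to implement the master/slave rsync model described above is to push freshly built index files to each slave under the .new names that searchd looks for, then send searchd a SIGHUP so it rotates them in. The sketch below only illustrates that pattern; the hostnames, paths, and index names are hypothetical, not craigslist's actual tooling:

```python
# Sketch of the master/slave "rsync replication" pattern (hypothetical hosts,
# paths, and index names). Freshly built index files are shipped to each slave
# as <index>.new.* so the running searchd keeps serving the old copy, then a
# SIGHUP triggers rotation.
import subprocess

SLAVES = ["sphinx-slave1", "sphinx-slave2"]           # hypothetical hostnames
INDEX_DIR = "/var/data/sphinx/"                       # hypothetical path
INDEXES = ["posts_sfbay_main", "posts_sfbay_delta"]   # hypothetical index names
EXTS = ["spa", "spd", "sph", "spi", "spm", "spp"]     # core index file extensions

def push_index(slave, index):
    for ext in EXTS:
        src = "%s%s.%s" % (INDEX_DIR, index, ext)
        dst = "%s:%s%s.new.%s" % (slave, INDEX_DIR, index, ext)
        subprocess.run(["rsync", "-a", src, dst], check=True)

def rotate(slave):
    # searchd swaps <index>.new.sp* files in when it receives SIGHUP
    subprocess.run(["ssh", slave, "kill -HUP $(cat /var/run/searchd.pid)"],
                   check=True)

for slave in SLAVES:
    for index in INDEXES:
        push_index(slave, index)
    rotate(slave)
```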
10. Main+Delta Indexes
[Diagram: a transient "delta" index is regularly merged into a "today" index; a periodic merge from "today" into the logical (main) index cleans house.]
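The merges in that diagram are ordinary indexer invocations. A housekeeping sketch (hypothetical index names and config path) could look like this: rebuild the small delta frequently, then periodically fold it into the main index with indexer --merge:

```python
# Housekeeping sketch for the main+delta scheme (hypothetical names/paths).
import subprocess

CONF = "/etc/sphinx/sphinx.conf"   # hypothetical config path

def rebuild_delta(delta="posts_sfbay_delta"):
    # re-index only the recent postings; --rotate tells searchd to swap it in
    subprocess.run(["indexer", "--config", CONF, "--rotate", delta], check=True)

def merge_delta_into_main(main="posts_sfbay_main", delta="posts_sfbay_delta"):
    # fold the delta into the main index so the delta stays small
    subprocess.run(["indexer", "--config", CONF, "--merge", main, delta, "--rotate"],
                   check=True)

if __name__ == "__main__":
    rebuild_delta()
    merge_delta_into_main()
```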
11. Early Issues
• Monitoring
• Persistent Connections w/prefork
• hacked up my own initially
• Index merge crashes/bugs
• We’re always running svn snapshots
12. Early Success
• Replaced the 25 MySQL servers
• Used 10 sphinx servers (2 masters, 8 slaves)
• Search traffic continued to increase
• Tons of headroom!
• Typical search is under 5ms
• New Features
• “nearby” search
• sort by: recent, price, best match
13. Early Mistakes
• Stopwords
• Not setting query limits
• Sphinx handled this just fine!
• ASCII-only
• Query mangling
• need to understand how users search and what they expect to find
• UpdateAttributes (no kill lists!)
15. Growth
• Wanted Sphinx for “internal” use
• Created internal “team sphinx” with more indexed data
• includes not visible postings
• includes additional fields
• Space became an issue, so had to build some simple sharding into our code
• 2 clusters: even/odd split for indexes
16. Live Sphinx Today
• 300+ million queries/day
• 5,000 queries/sec peak load
• removed stopwords
• threaded workers
• dict=keywords
• wildcard search enabled
• UTF-8 (mostly) and charset_table
• blend_chars
• kill lists (no searchd on masters)
• sharded (3 masters, 18 slaves) on blades
19. Archive Sphinx
• The Archive Project!
• 2.5 billion postings
• Growing by ~1.6 million daily
• String attributes
• 4 shards, each is a 1 master, 2 slave cluster
• Bucket based on UserID (not city)
• Low query volume
• Need a way to reindex all docs
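As a sketch of the bucketing idea (hostnames are hypothetical, not production names), routing an archive query means picking the shard from the UserID and then one of that shard's slaves:

```python
# Illustrative routing for the archive tier (hypothetical hostnames): postings
# are bucketed by UserID into 4 shards, each shard a 1-master/2-slave cluster;
# the low-volume queries go to one of the slaves.
import random

ARCHIVE_SHARDS = [
    {"master": "arch0-m", "slaves": ["arch0-s1", "arch0-s2"]},
    {"master": "arch1-m", "slaves": ["arch1-s1", "arch1-s2"]},
    {"master": "arch2-m", "slaves": ["arch2-s1", "arch2-s2"]},
    {"master": "arch3-m", "slaves": ["arch3-s1", "arch3-s2"]},
]

def shard_for_user(user_id):
    # bucket on UserID (not city) so all of a user's old postings live together
    return ARCHIVE_SHARDS[user_id % len(ARCHIVE_SHARDS)]

def search_host_for_user(user_id):
    return random.choice(shard_for_user(user_id)["slaves"])

print(search_host_for_user(31337))   # one of the slaves in shard 31337 % 4
```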
20. Real-Time Sphinx
• There’s a delay in indexing data on the master and replicating to the slaves...
• What if we want to offer “real-time search” of your own postings?
21. So I built something...
• Known as rtsd (real-time search daemon)
• Sphinx instance with MySQL Protocol
• Primarily uses in-memory indexes
• Used to bridge the gap between “now” and “archive sphinx”
• Configured as an N day rolling window
• Runs on archive sphinx master hosts
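Because rtsd speaks the MySQL wire protocol, clients can query it with an ordinary MySQL client library using SphinxQL. A sketch in Python (host, port, index name, and attribute names are assumptions, not the production schema):

```python
# Querying a Sphinx/rtsd instance over its MySQL protocol with SphinxQL.
# Hypothetical host, port, index, and attributes; the listener port is
# whatever the searchd config exposes for the MySQL protocol.
import pymysql

conn = pymysql.connect(host="rtsd-host", port=9306, user="", password="")
try:
    with conn.cursor() as cur:
        # find a user's own most recent postings in the rolling RT window
        cur.execute(
            "SELECT id, posted_at FROM rt_posts "
            "WHERE MATCH(%s) AND user_id = %s "
            "ORDER BY posted_at DESC LIMIT 20",
            ("red bicycle", 31337),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```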
22. Sphinx Time Horizons
                Classic     Team    Archive    rtsd
0-20 min        -           -       -          All
20 min-1 day    Visible     All     -          All
1-60 days       Visible     All     All        -
60+ days        -           -       All        -

Note: Visible postings are findable on the site.
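Read as code, the table says which tiers hold a posting of a given age. A small sketch (tier names are shorthand; "classic" only indexes postings that are visible on the site):

```python
# The time-horizon table as a lookup: given a posting's age in days,
# which Sphinx tiers hold it?
def tiers_for_age(age_days):
    if age_days < 20.0 / (60 * 24):          # younger than ~20 minutes
        return ["rtsd"]                      # only the real-time daemon has it yet
    if age_days <= 1:
        return ["classic", "team", "rtsd"]
    if age_days <= 60:
        return ["classic", "team", "archive"]
    return ["archive"]

print(tiers_for_age(0.005))   # ['rtsd']
print(tiers_for_age(0.5))     # ['classic', 'team', 'rtsd']
print(tiers_for_age(30))      # ['classic', 'team', 'archive']
print(tiers_for_age(400))     # ['archive']
```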
28. Future Work
• autonomous nodes (no master/slave)
• many-core blades with SSD storage
• better performance metrics
• we drop a lot of data on the floor
• log mining and analysis
• sphinx for “table of contents” (browsing)
• haproxy in front of sphinx
• generic sharding code
• testing framework
29. Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple servers per index (for failover and load)
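That last wishlist item asks for something searchd-side: one distributed-index agent line naming several mirror servers per index. Until something like that exists, a client-side stand-in (hypothetical hosts and index name) is to try mirrors in order, which is also roughly what putting haproxy in front of Sphinx buys:

```python
# Client-side failover across mirrored searchd hosts (a sketch, not the
# requested searchd feature and not craigslist's code; hosts and index are
# hypothetical).
import pymysql

MIRRORS = ["sphinx-a.example", "sphinx-b.example", "sphinx-c.example"]

def query_with_failover(sql, params=()):
    last_err = None
    for host in MIRRORS:
        try:
            conn = pymysql.connect(host=host, port=9306, user="", password="",
                                   connect_timeout=1)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.MySQLError as err:
            last_err = err            # dead or overloaded mirror: try the next one
    raise last_err

rows = query_with_failover(
    "SELECT id FROM posts WHERE MATCH(%s) LIMIT 10", ("red bicycle",))
```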