Sphinx at Craigslist
      Jeremy Zawodny
          craigslist
Brief Overview
CL Sphinx Infrastructure
• Live Sphinx
 • ~30 million postings
 • end users searching for stuff on craigslist
• Team Sphinx
 • ~100 million postings
 • additional indexes of postings for internal use
    (including non-live postings)
CL Sphinx Infrastructure
• Archive Sphinx
 • older postings (~3 billion)
 • constantly growing in size
• Real-Time Sphinx
 • last ~2 days' worth of postings
• Forums Sphinx
 • ~150 million forum postings
How We Got Here
Back in 2008
• MySQL FULL TEXT (MyISAM)
• 25 Servers
• Melted Down Frequently
• Desperately Needed a Solution
• This was my first project at craigslist...
• Looked at Solr, Sphinx, Xapian
• Sphinx felt like the right fit
Making Sphinx Work
• Benchmarking showed promising results
 • Query performance was great
   • ~800qps/instance
   • back then we only needed 1,200/sec
 • Indexing performance too
   • Can index documents far faster than I can
      make the XML for input (from Perl)
• Can’t index and serve at the same time, though...
“Live” Sphinx
• One index per city (~700 indexes)
 • Main + Delta
 • xmlpipe2 input
• Data all fits on a single machine
• 32bit ids
• High churn rate
• Settled on Master/Slave model w/rsync replication
• Deployed in January, 2009
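
The xmlpipe2 input listed above is an XML stream that the Sphinx indexer reads from a command's standard output. A minimal sketch of producing such a stream, assuming hypothetical field names (title, body, posted) and using Python rather than the Perl code the talk refers to:

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_stream(postings):
    """Yield the lines of an xmlpipe2 document stream, one posting per document."""
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield '<sphinx:docset>'
    # Schema: full-text fields plus one attribute (names are illustrative only).
    yield '<sphinx:schema>'
    yield '<sphinx:field name="title"/>'
    yield '<sphinx:field name="body"/>'
    yield '<sphinx:attr name="posted" type="timestamp"/>'
    yield '</sphinx:schema>'
    for p in postings:
        yield '<sphinx:document id=%s>' % quoteattr(str(p["id"]))
        # escape() protects &, < and > in user-supplied text.
        yield '<title>%s</title>' % escape(p["title"])
        yield '<body>%s</body>' % escape(p["body"])
        yield '<posted>%d</posted>' % p["posted"]
        yield '</sphinx:document>'
    yield '</sphinx:docset>'


sample = [
    {"id": 101, "title": "Couch & chair", "body": "Free <today>", "posted": 1325376000},
]
xml_text = "\n".join(xmlpipe2_stream(sample))
```

In a real setup the script would print this text and the source's xmlpipe_command setting would point at it; the per-city split and the actual craigslist schema are not shown in the slides.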
Master/Slave Clusters
• Number of slaves varies (typically 3-7)
          master                    master
       slave   slave             slave   slave


          master                    master
       slave   slave             slave   slave
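Main+Delta Indexes
[Diagram: a logical index made up of three parts, stacked top to bottom: delta, today, index]
• Regular merge from transient delta (shown between delta and today)
• Periodic merge to clean house (shown between today and index)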
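
A toy model of why a layered main+delta setup needs kill lists, which a later slide names as an early omission. This is only a sketch under assumptions: parts are plain dictionaries, matching is a whole-word test, and the function name is hypothetical. It mimics the Sphinx rule that a newer part's kill list suppresses those ids in the parts searched before it.

```python
def search_logical_index(parts, kill_lists, term):
    """Search parts ordered oldest to newest and return {doc_id: text}.

    parts:      list of {doc_id: text} dicts, e.g. [index, today, delta]
    kill_lists: one set of doc ids per part; ids in a part's kill list
                are removed from the hits found in all older parts
    """
    hits = {}
    for docs, killed in zip(parts, kill_lists):
        # Suppress stale copies found in older parts.
        for doc_id in killed:
            hits.pop(doc_id, None)
        for doc_id, text in docs.items():
            if term in text.lower().split():
                hits[doc_id] = text
    return hits


main = {1: "red bike", 2: "red couch"}
today = {3: "red lamp"}
delta = {1: "blue bike"}           # posting 1 was edited after main was built

no_kills = [set(), set(), set()]
with_kills = [set(), set(), {1}]   # delta kills the old copy of posting 1

stale = search_logical_index([main, today, delta], no_kills, "red")
fresh = search_logical_index([main, today, delta], with_kills, "red")
```

Without the kill list, a search for "red" still returns posting 1 from the main part even though its current text no longer matches; with it, only postings 2 and 3 come back.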
Early Issues
• Monitoring
• Persistent Connections w/prefork
 • hacked up my own initially
• Index merge crashes/bugs
• We’re always running svn snapshots
Early Success
• Replaced the 25 MySQL servers
• Used 10 sphinx servers (2 masters, 8 slaves)
• Search traffic continued to increase
• Tons of headroom!
• Typical search is under 5ms
• New Features
 • “nearby” search
 • sort by: recent, price, best match
Early Mistakes
• Stopwords
• Not setting query limits
 • Sphinx handled this just fine!
• ASCII-only
• Query mangling
 • need to understand how users search and what
    they expect to find
• UpdateAttributes (no kill lists!)
What Then?
Growth
• Wanted Sphinx for “internal” use
• Created internal “team sphinx” with more indexed
  data
 • includes non-visible postings
 • includes additional fields
• Space became an issue, so had to build some simple
  sharding into our code
 • 2 clusters: even/odd split for indexes
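
The even/odd split can be pictured as a one-line routing rule. A minimal sketch, assuming each index has a numeric id and using made-up cluster names; the slides do not say what value the split was actually keyed on:

```python
# Hypothetical cluster names; the real host layout is not given in the talk.
CLUSTERS = {0: "team-sphinx-even", 1: "team-sphinx-odd"}


def cluster_for_index(index_id: int) -> str:
    """Route an index to one of two clusters by the parity of its id."""
    return CLUSTERS[index_id % 2]


even_home = cluster_for_index(42)
odd_home = cluster_for_index(7)
```

The appeal of this kind of rule is that it needs no lookup table, at the cost of being awkward to extend beyond two clusters.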
Live Sphinx Today
•   300+ million queries/day
•   5,000 queries/sec peak load
•   removed stopwords
•   threaded workers
•   dict=keywords
•   wildcard search enabled
•   UTF-8 (mostly) and charset_table
•   blend_chars
•   kill lists (no searchd on masters)
•   sharded (3 masters, 18 slaves) on blades
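
For scale, the daily and peak figures above can be related with a little arithmetic. A quick sketch, treating "300+ million" as exactly 300 million, so the average is a lower bound:

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = 300_000_000   # "300+ million queries/day", taken as a floor
peak_qps = 5_000                # "5,000 queries/sec peak load"

# Average rate if the load were spread evenly over the day.
average_qps = queries_per_day / SECONDS_PER_DAY

# How much higher the peak is than that flat average.
peak_to_average = peak_qps / average_qps
```

This gives an average of roughly 3,472 queries/sec, with the peak about 1.44 times that.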
Sharding
Query Volume
Archive Sphinx
• The Archive Project!
• 2.5 billion postings
• Growing by ~1.6 million daily
• String attributes
• 4 shards, each is a 1 master, 2 slave cluster
• Bucket based on UserID (not city)
• Low query volume
• Need a way to reindex all docs
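
Bucketing by UserID presumably keeps all of one user's postings on a single shard, so a search of your own postings touches only one cluster. A minimal sketch of that routing, assuming a plain modulo over four shards and hypothetical host names (the talk gives neither the hash nor the hosts):

```python
import random

# Four shards, each a 1 master + 2 slave cluster. Host names are made up.
SHARDS = [
    {"master": "archive%d-m" % n,
     "slaves": ["archive%d-s1" % n, "archive%d-s2" % n]}
    for n in range(4)
]


def shard_for_user(user_id: int) -> int:
    """Map a user id to a shard number (assumed rule: id modulo shard count)."""
    return user_id % len(SHARDS)


def pick_search_host(user_id: int, rng=random) -> str:
    """Choose one slave of the user's shard to serve a query."""
    return rng.choice(SHARDS[shard_for_user(user_id)]["slaves"])


host = pick_search_host(10)
```

Indexing work for user 10 would go to the master of shard 2, while queries are spread across that shard's two slaves.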
Real-Time Sphinx
• There’s a delay in indexing data on the master and
  replicating to the slaves...
• What if we want to offer “real-time search” of your
  own postings?
So I built something...
• Known as rtsd (real-time search daemon)
• Sphinx instance with MySQL Protocol
• Primarily uses in-memory indexes
• Used to bridge the gap between “now” and
  “archive sphinx”
• Configured as an N day rolling window
• Runs on archive sphinx master hosts
Sphinx Time Horizons
            Classic    Team    Archive    rtsd
0-20min     -          -       -          All
20m-1day    Visible    All     All        -
1-60 days   Visible    All     All        -
60+ days    -          -       All        -

Note: Visible postings are findable on the site.
rtsd overview
[Diagram, top to bottom: PostingInfo table; rtsd_consumer and redis queue; rtsd_indexer and PostingCache; rtsd_sphinx; three webbie hosts]
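
One plausible reading of the components named in this diagram is a queue-driven pipeline: a consumer notices changed postings and enqueues their ids, and an indexer drains the queue, looks each posting up, and writes it to the real-time index. The sketch below is an assumption-heavy illustration of that reading only: a deque stands in for the redis queue, dictionaries stand in for PostingCache and rtsd_sphinx, and the function names are invented.

```python
from collections import deque

queue = deque()                                   # stand-in for the redis queue
posting_cache = {1: "red bike", 2: "oak table"}   # stand-in for PostingCache
rt_index = {}                                     # stand-in for rtsd_sphinx


def consume(changed_ids):
    """Consumer side: enqueue the ids of postings that changed."""
    for posting_id in changed_ids:
        queue.append(posting_id)


def index_pending():
    """Indexer side: drain the queue and apply each change to the RT index."""
    processed = 0
    while queue:
        posting_id = queue.popleft()
        posting = posting_cache.get(posting_id)
        if posting is None:
            # Posting no longer exists: make sure it is not searchable.
            rt_index.pop(posting_id, None)
        else:
            rt_index[posting_id] = posting
        processed += 1
    return processed


consume([1, 2, 9])
count = index_pending()
```

Decoupling the two sides with a queue lets indexing fall briefly behind without blocking whatever produces the changes; whether rtsd works exactly this way is not stated in the slides.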
Daily Posting Buckets
• 3 indexes
 • yesterday
 • today
 • tomorrow
• (DayofYear(PostedDate)%3) = $index_num
• Nightly cron to “TRUNCATE RTINDEX” on the
  “tomorrow” index
 • sponsored feature!
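
The bucket formula above, DayofYear(PostedDate) % 3, can be sketched as follows. The index name prefix is hypothetical, and the function only builds the TRUNCATE RTINDEX statement text rather than sending it to a server:

```python
from datetime import date, timedelta


def index_num(posted: date) -> int:
    """Bucket number for a posting date: day of year (1-366) modulo 3."""
    return posted.timetuple().tm_yday % 3


def truncate_statement(today: date, prefix: str = "rt_postings_") -> str:
    """Statement a nightly job could run to empty the bucket for tomorrow."""
    tomorrow = today + timedelta(days=1)
    return "TRUNCATE RTINDEX %s%d" % (prefix, index_num(tomorrow))


march_first = date(2012, 3, 1)            # day 61 of a leap year
bucket = index_num(march_first)
statement = truncate_statement(march_first)
```

For March 1, 2012 the three buckets are distinct: yesterday maps to 0, today to 1, and tomorrow to 2, so the nightly job empties bucket 2. Note that a plain modulo on day of year makes December 31 of a non-leap year and the following January 2 share bucket 2, so a real deployment would need to handle the year boundary; the slides do not say how.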
rtsd indexes
rtsd virtual indexes
Future Work
•   autonomous nodes (no master/slave)
    •   many-core blades with SSD storage
•   better performance metrics
    •   we drop a lot of data on the floor
•   log mining and analysis
•   sphinx for “table of contents” (browsing)
•   haproxy in front of sphinx
•   generic sharding code
•   testing framework
Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple
  servers per index (for failover and load):
Craigslist is Hiring!
• Developers
 • Back-end
 • Front-end
• Systems Administrators
• Network Engineers
• Email: z@craiglist.org plain text resume!
Sphinx at Craigslist in 2012

  • 1. Sphinx at Craigslist Jeremy Zawodny craigslist
  • 3. CL Sphinx Infrastructure • Live Sphinx • ~30 million postings • end users searching for stuff on craigslist • Team Sphinx • ~100 million postings • additional indexes of postings for internal use (including non-live postings)
  • 4. CL Sphinx Infrastructure • Archive Sphinx • older postings (~3 billion) • constantly growing in size • Real-Time Sphinx • last ~2 days worth of postings • Forums Sphinx • ~150 million forum postings
  • 5. How We Got Here
  • 6. Back in 2008 • MySQL FULL TEXT (MyISAM) • 25 Servers • Melted Down Frequently • Desperately Needed a Solution • This was my first project at craigslist... • Looked at Solr, Sphinx, Xapian • Sphinx felt like the right fit
  • 7. Making Sphinx Work • Benchmarking showed promising results • Query performance was great • ~800qps/instance • back then we only needed 1,200/sec • Indexing performance too • Can index documents far faster than I can make the XML for input (from Perl) • Can’t index and serve at the same time, though...
  • 8. “Live” Sphinx • One index per city (~700 indexes) • Main + Delta • xmlpipe2 input • Data all fits on a single machine • 32bit ids • High churn rate • Settled on Master/Slave model w/rsync replication • Deployed in January, 2009
  • 9. Master/Slave Clusters • Number of slaves varies (typically 3-7) • [diagram: four clusters, each drawn as one master above two slaves]
  • 10. Main+Delta Indexes • [diagram: delta, today, and index stacked and labeled together as the Logical Index; "Regular Merge from transient delta" between delta and today; "Periodic Merge to clean house" between today and index]
  • 11. Early Issues • Monitoring • Persistent Connections w/prefork • hacked up my own initially • Index merge crashes/bugs • We’re always running svn snapshots
  • 12. Early Success • Replaced the 25 MySQL servers • Used 10 sphinx servers (2 masters, 8 slaves) • Search traffic continued to increase • Tons of headroom! • Typical search is under 5ms • New Features • “nearby” search • sort by: recent, price, best match
  • 13. Early Mistakes • Stopwords • Not setting query limits • Sphinx handled this just fine! • ASCII-only • Query mangling • need to understand how users search and what they expect to find • UpdateAttributes (no kill lists!)
  • 15. Growth • Wanted Sphinx for “internal” use • Created internal “team sphinx” with more indexed data • includes not visible postings • includes additional fields • Space became an issue, so had to build some simple sharding into our code • 2 clusters: even/odd split for indexes
  • 16. Live Sphinx Today • 300+ million queries/day • 5,000 queries/sec peak load • removed stopwords • threaded workers • dict=keywords • wildcard search enabled • UTF-8 (mostly) and charset_table • blend_chars • kill lists (no searchd on masters) • sharded (3 masters, 18 slaves) on blades
  • 19. Archive Sphinx • The Archive Project! • 2.5 billion postings • Growing by ~1.6 million daily • String attributes • 4 shards, each is a 1 master, 2 slave cluster • Bucket based on UserID (not city) • Low query volume • Need a way to reindex all docs
  • 20. Real-Time Sphinx • There’s a delay in indexing data on the master and replicating to the slaves... • What if we want to offer “real-time search” of your own postings?
  • 21. So I built something... • Known as rtsd (real-time search daemon) • Sphinx instance with MySQL Protocol • Primarily uses in-memory indexes • Used to bridge the gap between “now” and “archive sphinx” • Configured as an N day rolling window • Runs on archive sphinx master hosts
  • 22. Sphinx Time Horizons Classic Team Archive rtsd 0-20min All 20m-1day Visible All All 1-60 days Visible All All 60+ days All Note: Visible postings are findable on the site.
  • 23. rtsd overview • [diagram components: PostingInfo table, rtsd_consumer, redis queue, rtsd_indexer, PostingCache, rtsd_sphinx, webbie (x3)]
  • 24. Daily Posting Buckets • 3 indexes • yesterday • today • tomorrow • (DayofYear(PostedDate)%3) = $index_num • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index • sponsored feature!
  • 28. Future Work • autonomous nodes (no master/slave) • many-core blades with SSD storage • better performance metrics • we drop a lot of data on the floor • log mining and analysis • sphinx for “table of contents” (browsing) • haproxy in front of sphinx • generic sharding code • testing framework
  • 29. Sphinx Wishlist • 32 -> 64 bit migration tool • capture stats at daemon shut down • RT optimizations for DELETE (high churn) • distributed search (agent) config with multiple servers per index (for failover and load):
  • 31. Craigslist is Hiring! • Developers • Back-end • Front-end • Systems Administrators • Network Engineers • Email: [email protected] plain text resume!
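
The bucket rule on slide 24, (DayofYear(PostedDate)%3) = $index_num, can be illustrated with a small Python sketch. This is not craigslist's code: the function names index_num and bucket_roles are hypothetical, and the sketch only assumes that the day of year is 1-based, as in MySQL's DAYOFYEAR().

```python
from datetime import date, timedelta

NUM_BUCKETS = 3  # yesterday, today, tomorrow (per the slide)


def index_num(posted_date: date) -> int:
    """Map a posting date to an index number: day-of-year modulo 3.

    tm_yday is 1-based (Jan 1 == 1), matching MySQL's DAYOFYEAR().
    """
    return posted_date.timetuple().tm_yday % NUM_BUCKETS


def bucket_roles(today: date) -> dict:
    """Return which index number plays each role on a given day."""
    one_day = timedelta(days=1)
    return {
        "yesterday": index_num(today - one_day),
        "today": index_num(today),
        "tomorrow": index_num(today + one_day),
    }


if __name__ == "__main__":
    roles = bucket_roles(date(2012, 3, 2))
    print(roles)
    # The nightly cron on the slide would run TRUNCATE RTINDEX
    # against whichever index currently plays the "tomorrow" role.
    print("index to truncate tonight:", roles["tomorrow"])
```

Within a year, three consecutive days always map to three different index numbers, so the index about to become "today" can be emptied without touching current data. Taken literally, the formula does not rotate cleanly across a non-leap year boundary (day 364 and day 1 both map to 1); the slides do not say how that case is handled.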

Editor's Notes