The Search for Better Search
Nick Caldwell, Chris Slowe & Luis Bitencourt-Emilio
What is Reddit?
Nick Caldwell, VP of Engineering
What is Reddit?
Reddit is the frontpage of the internet
A social network where there are tens of thousands of communities
around whatever passions or interests you might have
It’s where people converse about the things that
are most important to them
Reddit by the numbers
Alexa Rank (US/World): 4th/7th
MAU: 320M
Communities: 1.1M
Posts per day: 1M
Comments per day: 5M
Votes per day: 75M
Searches per day: 70M
SCALE
ENDLESS CONTENT
So, what are we doing with all that power?
Cat Walking a Human
Cat Fist Bumping
Wait, it’s not just cat pictures!
Community > Content > Individual
● Authenticity
● Creative freedom
● Empathy @ scale
r/confession
Secrets that if revealed
would change your life
forever?
r/assistance
Empathy and support at
scale
None of that matters if you
can’t FIND the content!
Storytime: History of Search @ Reddit
Chris Slowe, CTO
In the beginning (2005), there was Postgres...
And it was mostly good.
● Tsearch2 is pretty sweet
○ “Oh wow it does stemming!”
● We also really liked TRIGGERs back then (“No, it’s cool. The database does
all the work and it’s guaranteed to be accurate”)
● Eventually we grew to the point where a small minority of traffic (~2%, search)
was bogging down the majority of Postgres queries
2007: Oh! There’s a tool for that. Enter Lucene.
And it was mostly better.
● This was actually implemented (by me!) just over 10 years ago in July 2007.
● The runner up was a Google Search Appliance. Remember those!
○ Would have made a nice addition to our one rack
● Implemented in Python as a hand-rolled RPC server over TCP
○ Lucene Index files all on a single machine
○ “We’ll scale it later.”
● Actually had posts and comments
2008: What’s this “Solr” the kids are on about?
● Continuing the long tradition of “make the new guy fix it” we passed the “fix
search pls” task onto our third hire, David King (u/ketralnis)
● Implemented in early 2008 and pre-Solr 1.3
● Set up an actual search cluster.
● Wrapper/driver code in Python for both indexing and searching.
● Scalable and tunable relevance!
2010: Up and to the right!
● The site continued to grow, and we first cracked a billion pageviews/month
● With an engineering team of four, we put all of our effort into:
○ 503 mitigation
○ continuing to add Postgres read slaves
○ adding more cache
○ adding a very early version of Cassandra (0.6.0)
● Oh. Right. Search. How’s that going?
Oh.
2010: If you love something, you outsource it
Said no one ever.
● For starters, we were out of new guys to fix it.
● Also, no chance to really focus on it full time
○ Only about 2% of traffic
○ Because if you don’t build it, they won’t come…
● Contracted with a company called IndexTank, who provided a nifty drop-in
replacement so we could stop worrying about it.
● “We launched a new search engine yesterday. Calm down. It’s okay. I know.
You’ve been hurt before.” - David King ‘10
2012: start worrying!
● IndexTank got bought by LinkedIn (yay!)
○ ...and shut down their API (boo!)
○ ...with a 6 month grace period (d’oh!)
● AWS CloudSearch to the rescue!
Mostly...
2017: Fix it Fix it Fix it!
● You’ve heard of “five nines”? We had “nine fives”
● Giant CloudSearch Cluster, terrible performance
● Worked closely with AWS on it
● Did the equivalent of “turn it off and on again”
○ Full index rebuild
○ Drop seldom used indices
○ Consider blood sacrifice
○ Success!
What I’m saying is, Search is hard...
“we fixed a bug in the search results ordering” - Steve Huffman ‘06
“We updated the search system this morning to help alleviate some load problems” - Steve ‘06
“Jeremy is working on search! It’s not a complicated fix (basically, the sorting is whacky).” - Steve ‘06
“Search works much better, tagging and user-controlled subreddits are right around the corner” - Steve ‘07
“Search is better, but not quite where we’d like it.” - Steve ‘07
“Stats and search are temporarily disabled, but will be coming back as soon as we can get them repaired.” - Steve ‘07
“we were hoping to include an upgraded search, which, unlike the last version, was actually useful and helped you find what you were looking for.
Unfortunately, the version we settled on didn’t quite load test as nicely” - Steve ‘07
“I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
“[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
“I’ve totally replaced the reddit search function.” - David King ‘08
“We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt before.” - David King ‘10
“The old [cloud]search domain had significant performance issues: roughly 33% of queries took over 5 seconds to complete and would result in
the search error page.” - Brian Simpson ‘17
...Redditors are picky...
...and we have fully internalized it.
Search Today @ Reddit
Luis Bitencourt-Emilio, Sr. Director - Search & Discovery
Search Today: Architecture
Deep Dive: Ingestion… our first attempt
...let’s try that again; keep it simple stupid.
The rollout
The pudding: Bye bye error pages!
The pudding: If it’s not relevant it doesn’t matter
Recap: The numbers
● Over a quarter billion posts indexed
● 1000 index updates per second
● 400 QPS from first party apps
● 75M searches per day (including public API)
● Up next: ingesting many billions of comments, messages, and user profiles.
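To put the recap numbers on a common scale, here is a small back-of-the-envelope conversion between daily totals and per-second rates. It is only arithmetic on the figures from the slide; whether the 400 QPS figure is an average or a peak is not stated, so the "sustained all day" line is an assumption for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def daily_total(rate_per_second: float) -> float:
    """Convert a per-second rate into a daily count, assuming it is sustained."""
    return rate_per_second * SECONDS_PER_DAY


# Figures taken from the recap slide.
total_avg_qps = average_qps(75_000_000)   # all searches, including public API
first_party_daily = daily_total(400)      # ASSUMPTION: 400 QPS sustained all day
index_updates_daily = daily_total(1000)   # ASSUMPTION: 1000 updates/s sustained

print(f"Average overall search rate: {total_avg_qps:.0f} QPS")
print(f"400 QPS sustained for a day: {first_party_daily:,.0f} searches")
print(f"1000 updates/s sustained for a day: {index_updates_daily:,.0f} updates")
```

Read this way, 75M searches per day averages out to roughly twice the quoted first-party rate, which is consistent with the slide's note that the daily total includes public API traffic.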
Show and tell: A better subreddit search
The challenge: Redditors are very creative in their subreddit naming (e.g.
r/superbowl is about superb owl pictures), which, whilst fun, poses a challenge for
discovery.
The answer: faceted search on posts!
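The idea of faceting on posts can be sketched in a few lines: search the posts, then count the matching posts per subreddit and rank subreddits by that count, rather than matching on the subreddit name. This is a toy in-memory illustration, not how the Fusion cluster is queried; the corpus, the `matches` helper, and the naive word matching are all hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: (subreddit, post title).
POSTS = [
    ("superbowl", "Superb owl spotted in my backyard"),
    ("superbowl", "Barn owl in flight"),
    ("nfl", "Super Bowl halftime show discussion"),
    ("nfl", "Super Bowl predictions thread"),
    ("nfl", "Who wins the Super Bowl this year?"),
    ("sports", "Super Bowl ad roundup"),
]


def matches(query: str, title: str) -> bool:
    """Naive match: every query term appears as a whole word in the title."""
    words = set(title.lower().replace("?", "").split())
    return all(term in words for term in query.lower().split())


def subreddit_facets(query: str, posts=POSTS) -> list[tuple[str, int]]:
    """Search posts, then count hits per subreddit (the facet), best first."""
    counts = Counter(sub for sub, title in posts if matches(query, title))
    return counts.most_common()


print(subreddit_facets("super bowl"))  # [('nfl', 3), ('sports', 1)]
print(subreddit_facets("owl"))         # [('superbowl', 2)]
```

Because the ranking comes from what people actually post, a query for "super bowl" surfaces the football communities even though a subreddit literally named r/superbowl exists.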
The pudding: A better subreddit search
The pudding: A better subreddit search
The future: A better subreddit search
Coming soon: normalization!
In a similar vein, once we ingest comments we can leverage a similar strategy
for posts!
This will make a big impact on things like image, video, and link-only posts, which have
little self text to index but lots of comments!
The Future of Search @ Reddit
Nick Caldwell, Reddit VPE
...but that’s not all
● Relevance
○ Relevance model experimentation
○ Query understanding & rewriting
● Smarter search for new content types
○ Comments, user profiles, messages, etc
○ Image and video search
○ Q&A
Search UX Redesigned...
The Future of Reddit
Welcoming
Personalized
Thanks!
Nick Caldwell: nickc@reddit.com
Chris Slowe: chris@reddit.com
Luis Bitencourt-Emilio: luis@reddit.com
PS: We’re hiring!
http://reddit.com/jobs


Editor's Notes

  • #25:
2005 - Steve Huffman, cofounder and now CEO, implements postgres tsearch.
2006 - Chris Slowe, founding engineer and now CTO, implements pylucene.
“we fixed a bug in the search results ordering” - Steve Huffman ‘06
“We updated the search system this morning to help alleviate some load problems” - Steve ‘06
“Jeremy is working on search! It’s not a complicated fix (basically, the sorting is whacky).” - Steve ‘06
“Search works much better, tagging and user-controlled subreddits are right around the corner” - Steve ‘07
“Search is better, but not quite where we’d like it.” - Steve ‘07
“Stats and search are temporarily disabled, but will be coming back as soon as we can get them repaired.” - Steve ‘07
“we were hoping to include an upgraded search, which, unlike the last version, was actually useful and helped you find what you were looking for. Unfortunately, the version we settled on didn’t quite load test as nicely” - Steve ‘07
“I made a quick fix to search that I hope helps until we get a chance to really fix it.” - Steve ‘07
2008 - David King, first employee and now search engineer, implements Solr.
“[David]’s been fixing search and hacking mystery projects in Erlang.” - Alexis Ohanian ‘08
“I’ve totally replaced the reddit search function.” - David King ‘08
2010 - David King replaces Solr with IndexTank.
“We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt before.” - David King ‘10
2012 - u/kemitche implements CloudSearch after LinkedIn shut down IndexTank
“Q: Where do you see reddit in 10 years? A: Reddit search might work by then.” - Steve AMA ‘16
“Today we moved from the old Amazon CloudSearch domain to a new Amazon CloudSearch domain. The old search domain had significant performance issues: roughly 33% of queries took over 5 seconds to complete and would result in the search error page.” - u/bsimpson ‘17
NOW - Lucidworks Fusion!
“As /u/bitofsalt mentioned a few months ago, we’ve been working on some improvements to search. We may even be ahead of spez’s 10 year plan.” - u/starfishjenga ‘17
  • #26:
https://redditblog.com/2006/07/25/searching/
https://redditblog.com/2007/06/08/a-note-on-search-and-what-were-working-on/
https://redditblog.com/2008/04/21/new-search-2/
https://www.reddit.com/r/announcements/comments/cs4ll/new_search/
https://www.reddit.com/r/changelog/comments/694o34/reddit_search_performance_improvements/
https://www.reddit.com/r/changelog/comments/6pi0kk/improving_search/
  • #28:
● It’s so great to be here. When I joined Reddit, the first project Nick had me jump on was Search. I had no idea what I was in for at the time.
● Search is a hard problem (you wouldn’t all be here today otherwise). It is also a huge opportunity from both a business and a product perspective.
● Luckily for us, we had the great support of Lucidworks in rising to this challenge, and I’m proud to be here today to tell you a little about a thing we did.
  • #29:
● I’ve always liked to lead with the punchline: here’s what we’ve built in conjunction with the Lucidworks team.
● We needed a scalable system that could index many billions of pieces of content and stand up to the traffic of the US’s 4th largest website.
● Our Search stack is composed of 3 main pieces: the search microservice, our data pipeline, and our index cluster.
○ We spun out search-specific functionality from our monolith, R2, into a separate autoscaling microservice. This greatly improved our ability to iterate quickly and to closely monitor and scale this service as we rolled out search. Today we generally run 2-3 instances, but we expect a significant increase as we begin to ingest comments and other content types and experiences.
○ We then piped into our existing data pipeline to build canonical views of our datasets to be ingested by our index, and tapped into the live events for streaming updates.
○ Lastly, we stood up a 24-node Fusion cluster to index and serve search results.
● For comparison, our previous CloudSearch iteration was running on around 200 machines (that’s a lot of Reddit Gold), and this new and significantly smaller stack is servicing just as much traffic, at around 400 QPS of first-party app searches, with 1/6th the resources.
● We can now build on top of this solid platform as we ingest orders of magnitude more content and move Search to the front and center of the Reddit user experience.
● Let’s dig in a bit on how this works…
  • #30:
● A critical component of any search stack is the ingestion pipeline, and doing this at Reddit scale is a challenge of its own.
● Our first iteration here was the beast above, which pulls and aggregates data from both Postgres and our events pipeline to create canonical views of Reddit content to be indexed.
● In doing so, we thought we could also tackle a broader challenge at Reddit and provide a canonical and easily consumable view of all our data, not just for Search but also for ML model training and analytics.
● TL;DR: we bit off more than we could chew. This pipeline was too complex and had way too many moving pieces, leading to numerous failures, data quality issues, and low performance, particularly around backfilling.
● So we went back to the drawing board…
  • #31:
● …and this time we got it right… or at least less wrong.
● We retargeted ourselves at solving only the search ingestion challenge, which allowed us to remove a significant amount of complexity from the system, going from over 20 ETLs down to 7 and pulling from only a single data source.
● This change took us from ingesting around 150M posts up to over 200M, and resolved numerous data quality issues we had been seeing in the previous pipeline. It also more than halved our overall run time and significantly decreased Hadoop resource utilization.
● But we’re not done yet… we also built out a streaming ingestion system based on our Kafka clusters to serve intra-day content updates.
● This live ingestion is scaling beyond our expectations, and we’re currently looking at replacing all the ETLs above with this event-based system, which could take an all-time backfill/reindex from around 2 weeks down to 2 days.
● Enough about ingestion though… let’s look at some charts!
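One property an event-based ingestion path generally needs, when a batch backfill and a live stream feed the same index, is that updates can arrive late, duplicated, or out of order without corrupting the index. The notes do not describe how Reddit handles this; the following is a minimal sketch under the assumption that each event carries a monotonically increasing version, with `PostEvent` and `InMemoryIndex` as hypothetical names standing in for the real event schema and index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostEvent:
    post_id: str
    version: int          # ASSUMPTION: increases with every edit to the post
    title: str
    deleted: bool = False


class InMemoryIndex:
    """Stand-in for a search index that keeps the newest version of each post."""

    def __init__(self) -> None:
        self._docs: dict[str, PostEvent] = {}

    def apply(self, event: PostEvent) -> bool:
        """Upsert the event; return False if it is stale or a duplicate."""
        current = self._docs.get(event.post_id)
        if current is not None and current.version >= event.version:
            return False
        # Deletions are kept as tombstones so older events cannot resurrect a post.
        self._docs[event.post_id] = event
        return True

    def live_titles(self) -> dict[str, str]:
        return {pid: e.title for pid, e in self._docs.items() if not e.deleted}


index = InMemoryIndex()
backfill = [PostEvent("post_a", 1, "First title"), PostEvent("post_b", 1, "Another post")]
stream = [
    PostEvent("post_a", 3, "Edited again"),
    PostEvent("post_a", 2, "Edited title"),            # arrives late, ignored
    PostEvent("post_b", 2, "Another post", deleted=True),
]
for event in backfill + stream:
    index.apply(event)

print(index.live_titles())  # {'post_a': 'Edited again'}
```

With version-checked upserts, replaying the whole event log (a full reindex) and consuming live updates can use the same code path, which is one reason an event-based system can replace separate batch ETLs.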
  • #32:
● Dark traffic
● Shunting API for backwards compatibility
● Staged rollout (1, 5, 10, 25, 50, 90) for scale and bugs
● THANK YOU ALL FOR FEEDBACK
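A staged rollout like the one listed above is commonly implemented by hashing each user into a stable bucket and comparing the bucket to the current stage. The sketch below assumes the stage numbers are percentages and that rollout is keyed by user; the salt, the function names, and the choice of SHA-256 are illustrative, not taken from the talk.

```python
import hashlib

ROLLOUT_STAGES = (1, 5, 10, 25, 50, 90)  # stage values from the notes, read as percentages


def bucket(user_id: str, salt: str = "new-search") -> int:
    """Map a user to a stable bucket in [0, 100) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the stage percentage."""
    return bucket(user_id) < percent


# Rough check on a sample of hypothetical user IDs: the observed share
# should land near each stage's percentage.
users = [f"user{i}" for i in range(10_000)]
for pct in ROLLOUT_STAGES:
    share = sum(in_rollout(u, pct) for u in users) / len(users)
    print(f"{pct:>2}% stage -> {share:.1%} of sample users")
```

Because the bucket depends only on the user ID and the salt, anyone who sees the new search at the 5% stage keeps seeing it at every later stage, which makes bug reports from early stages reproducible.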
  • #33:
● Virtually eradicated errors; the main remaining source is in-flight queries in the rare case when a node goes down.
● From 9 x 4’s to 4 x 9’s
● Some performance costs due to being on US-East/West; fixing this next.
  • #34:
● Early results, before we have actually worked through relevance tweaks
● These are promising, especially considering CloudSearch uses Solr at its core
● We can now build on top of this platform with personalization, query intent/synonyms/rewriting, and ML models to tune further
  • #36: “learning to code” “Coming out” “Oil leaking” “Iron infusion”
  • #37: “learning to code” “Coming out”
  • #38: “Oil leaking” “Iron infusion”
  • #39: Normalization Comment indexing for image, video, link, sentiment, etc….
  • #41: Self harm search partnership in particular.