Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
transparency
reproducibility
https://uidemo.commonsearch.org
https://explain.commonsearch.org/?q=python&g=en
Ranking
Disclaimer: IANASRE
(I Am Not A Search Relevance Engineer)
What's in a score
score = fn( doc, query, language, user, time )
What's in a score
score = fn( doc, query )
What's in a score
score = fn( static_score, dynamic_score ( query ))
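A minimal sketch of that last form, assuming a simple linear blend. The function name, the weight and the example documents are illustrative assumptions, not the Common Search implementation.

```python
def final_score(static_score, dynamic_score, static_weight=0.3):
    """Blend a query-independent score with a query-dependent one."""
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score

# Hypothetical documents: the static score is computed offline,
# the dynamic score at query time.
docs = {
    "a.example/python": {"static": 0.9, "dynamic": 0.2},
    "b.example/python-tutorial": {"static": 0.4, "dynamic": 0.8},
}

ranked = sorted(
    docs,
    key=lambda url: final_score(docs[url]["static"], docs[url]["dynamic"]),
    reverse=True,
)
```

The static part can be precomputed per document, so only the dynamic part depends on the query.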
Static score
Static features
• Scopes:
• Page: URL depth, markup stats, ...
• Domain: Age, page count, blacklists, ...
• WebGraph: PageRank, ...
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
Searcher
Ranker
Dynamic score
Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time
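As an illustration of the text-match features, here is a small Okapi BM25 sketch in plain Python. The parameter defaults (k1=1.2, b=0.75) and the non-negative IDF variant are common choices assumed here, and the toy corpus is made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF variant that stays non-negative for very common terms.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by doc length via b.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    "python is a programming language".split(),
    "the python is a large snake".split(),
    "spark runs python and scala jobs".split(),
]
scores = bm25_scores("python language".split(), docs)
```

The first document matches both query terms, so it gets the highest score of the three.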
Scoring function
Users
Database
Elasticsearch
Indexer
Python, Spark
Data sources
Common Crawl, Alexa top 1M, ...
words, static score
query top 10 docs, final scores
Offline
Online
Searcher
Go
https://explain.commonsearch.org/?q=python&g=en
Issues with this architecture
• Static & dynamic scoring are in different codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch
Rescoring
Users
Database
Indexer
words, static score, features
query
Searcher
top 1k docs, features
Rescorer
final 10 docs
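The two-pass flow in this diagram can be sketched as follows. The field names, feature names and weights are hypothetical; a real first pass would be the database query rather than an in-memory list.

```python
import heapq

def search(candidates, k=1000):
    """First pass: keep the k best docs by the cheap score stored in the index."""
    return heapq.nlargest(k, candidates, key=lambda doc: doc["index_score"])

def rescore(docs, weights, n=10):
    """Second pass: recompute a score from the returned features, keep the n best."""
    def score(doc):
        return sum(weights[name] * value for name, value in doc["features"].items())
    return heapq.nlargest(n, docs, key=score)

# Made-up candidates standing in for what the Searcher would return.
candidates = [
    {"url": f"doc{i}", "index_score": i % 7,
     "features": {"bm25": i % 5, "pagerank": i % 3}}
    for i in range(50)
]
weights = {"bm25": 1.0, "pagerank": 2.0}
top = rescore(search(candidates, k=20), weights, n=10)
```

Only documents that survive the first pass can be rescored, which is one reason pagination becomes awkward with this design.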
Issues with rescoring
• Latency
• Pagination
• Harder to explain
Learning to rank
LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!
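For the evaluation bullet, a short NDCG sketch in plain Python. It assumes graded relevance labels and the exponential gain form; the ideal ordering is taken from the same list, which is only exact when that list contains every judged document for the query.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])   # results already in the best order
reversed_ = ndcg([0, 1, 2, 3])  # best result shown last
```

A ranking in ideal order scores 1.0, and any other order of the same labels scores lower.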
The right questions
• What do users expect?
• What features?
• How to evaluate and fine-tune in the real world?
PageRank with Spark
http://commoncrawl.org
https://github.com/commonsearch/cosr-back
Common Search Pipeline
Doc sources
Common Crawl,
WARC files,
URLs ...
Filter
plugins
Document
parsing
Output
plugins
Data output
Database, file,
HDFS, S3, ...
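The stages above can be sketched as a chain of pluggable callables. Every name here (run_pipeline, has_body, parse, the record fields) is a hypothetical stand-in and not the cosr-back plugin API.

```python
def run_pipeline(sources, filters, parse, outputs):
    """Stream raw records through filter, parse and output stages."""
    for source in sources:
        for raw in source:
            # A record is kept only if every filter plugin accepts it.
            if not all(accept(raw) for accept in filters):
                continue
            doc = parse(raw)
            for write in outputs:
                write(doc)

# Hypothetical stand-ins for a document source and the plugins.
crawl = [
    {"url": "http://a.example/", "body": "hello web"},
    {"url": "http://b.example/", "body": ""},
]

def has_body(raw):
    return bool(raw["body"])

def parse(raw):
    return {"url": raw["url"], "words": raw["body"].split()}

collected = []
run_pipeline([crawl], [has_body], parse, [collected.append])
```

Swapping the output callable is what would let the same run write to a database, a file, or another store.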
Most popular Wikipedia pages
Dumping the web graph
Naive pyspark PageRank
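The code from this slide did not survive in the transcript. As a rough stand-in, here is the classic iterative PageRank in plain Python, with comments noting which RDD operation each step would correspond to in a naive Spark version. The damping factor of 0.85 and the toy graph are assumptions, and this simple form lets rank leak from pages without outgoing links.

```python
from collections import defaultdict

def pagerank(edges, iterations=20, damping=0.85):
    """Iterative PageRank over (source, destination) pairs."""
    links = defaultdict(set)  # comparable to groupByKey() on the edge list
    nodes = set()
    for src, dst in edges:
        links[src].add(dst)
        nodes.update((src, dst))

    ranks = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Comparable to join() + flatMap() + reduceByKey(add):
        # each page splits its rank evenly among its outgoing links.
        contribs = defaultdict(float)
        for src, dsts in links.items():
            share = ranks[src] / len(dsts)
            for dst in dsts:
                contribs[dst] += share
        ranks = {
            node: (1.0 - damping) + damping * contribs[node]
            for node in nodes
        }
    return ranks

# Toy graph: a and b both link to c, and c links back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges, iterations=50)
```

On this toy graph the page with the most incoming links, c, converges to the highest rank, and b, which nothing links to, stays at the base value.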
GraphFrames
SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py
https://about.commonsearch.org/developer/get-started
Top 10
Spam
Spamdexing
• Keyword stuffing, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Bombing
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org
