Ranking the Web with Spark

This talk covers ranking web pages with Apache Spark in the context of the Common Search open source project. After introducing the speaker, it breaks a search score into a static part built from query-independent features (URL depth, domain age, PageRank) and a dynamic part built from query-dependent features (text match, usage signals such as clicks and dwell time). It reviews the current Elasticsearch-based architecture and its issues, a rescoring alternative, and learning to rank with Spark. It then walks through computing PageRank on Common Crawl data with Spark (naive PySpark, GraphFrames, SparkSQL) and closes with links for contributing to Common Search.

Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus

/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)

https://uidemo.commonsearch.org

https://explain.commonsearch.org/?q=python&g=en

Disclaimer: IANASRE
(I Am Not A Search Relevance Engineer)

What's in a score
score = fn( doc, query, language, user, time )

What's in a score
score = fn( doc, query )

What's in a score
score = fn( static_score, dynamic_score ( query ))
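
As a rough illustration of this last decomposition, the sketch below combines a query-independent static score with a query-dependent dynamic score. The function names, the term-overlap dynamic score and the multiplicative combination are all assumptions made for the example, not the formula Common Search uses.

```python
def dynamic_score(doc_words, query):
    """Hypothetical dynamic score: fraction of query terms found in the document."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(doc_words)
    return sum(1 for t in terms if t in words) / len(terms)


def score(static_score, doc_words, query):
    """Assumed combination: the static score scales the query-dependent match."""
    return static_score * dynamic_score(doc_words, query)


# Toy corpus: document id -> (static score, words). Values are made up.
docs = {
    "python.org": (0.9, ["python", "programming", "language"]),
    "snakes.example": (0.4, ["python", "snake", "reptile"]),
}
query = "python language"
ranked = sorted(docs, key=lambda d: score(docs[d][0], docs[d][1], query), reverse=True)
```

The static part can be computed offline once per document, while the dynamic part has to be evaluated per query.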

Static features
• Scopes:
• Page: URL depth, markup stats, ...
• Domain: Age, page count, blacklists, ...
• WebGraph: PageRank, ...
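
A minimal sketch of what page-scope static features could look like, using only the Python standard library. The feature names and the link-counting heuristic are hypothetical, chosen to mirror "URL depth" and "markup stats" above.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Number of non-empty path segments; a home page has depth 0."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def page_static_features(url, html):
    """Hypothetical page-scope features computed without any query."""
    return {
        "url_depth": url_depth(url),
        "has_query_string": bool(urlsplit(url).query),
        # Crude markup statistic: count of opening anchor tags.
        "link_count": html.lower().count("<a "),
    }


features = page_static_features(
    "https://example.org/docs/tutorial/index.html?lang=en",
    '<html><body><a href="/">Home</a> <a href="/docs">Docs</a></body></html>',
)
```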

http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
Searcher
Ranker

Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time
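
To make the text-match family concrete, here is a small Okapi BM25 sketch in plain Python. The parameter defaults (k1 = 1.2, b = 0.75) and the smoothed IDF variant are common choices assumed for the example; the toy documents are made up.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores


docs = [
    "python is a programming language".split(),
    "the python is a large snake".split(),
    "spark is a cluster computing engine".split(),
]
scores = bm25_scores(docs, ["python", "language"])
```

The first document matches both query terms and is shorter, so it scores highest; the third matches nothing and scores zero.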

Users
Database: Elasticsearch
Indexer: Python, Spark
Data sources: Common Crawl, Alexa top 1M, ...
words, static score
query
top 10 docs, final scores
Offline
Online
Searcher: Go

Issues with this architecture
• Static & dynamic scoring are in different codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch

Users
Database
Indexer
words, static score, features
query
Searcher
top 1k docs, features
Rescorer
final 10 docs
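
The rescoring step in this diagram can be sketched as follows: the searcher hands over a candidate set with features, and a separate rescorer computes final scores and keeps the best few. The linear model, the weights and the feature names are assumptions for illustration only.

```python
import heapq

# Hypothetical feature weights; in practice these would come from a trained model.
WEIGHTS = {"bm25": 1.0, "static_score": 2.0, "url_depth": -0.1}


def rescore(candidates, weights=WEIGHTS, k=10):
    """Re-rank (doc_id, features) candidates with a linear model and keep the top k."""
    def final_score(item):
        _, features = item
        return sum(w * features.get(name, 0.0) for name, w in weights.items())

    return [doc_id for doc_id, _ in heapq.nlargest(k, candidates, key=final_score)]


# Stand-in for the "top 1k docs, features" returned by the searcher.
candidates = [
    ("a", {"bm25": 3.0, "static_score": 0.1, "url_depth": 4}),
    ("b", {"bm25": 2.5, "static_score": 0.9, "url_depth": 0}),
    ("c", {"bm25": 1.0, "static_score": 0.2, "url_depth": 1}),
]
top = rescore(candidates, k=2)
```

Document "a" has the best text match, but "b" wins after rescoring because of its higher static score.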

Issues with rescoring
• Latency
• Pagination
• Harder to explain

LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!
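
Of the evaluation metrics listed, NDCG is simple enough to sketch in a few lines. This version assumes graded relevance labels and the exponential gain 2^rel - 1; other gain functions are also in use.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the results in the order the ranker returned them.
perfect = ndcg([3, 2, 1, 0], k=4)
worse = ndcg([0, 2, 1, 3], k=4)
```

A ranking that already lists results by decreasing relevance gets an NDCG of 1; pushing the best result to the bottom lowers it.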

The right questions
• What do users expect?
• What features?
• How to evaluate and fine-tune in the real world?

http://commoncrawl.org

https://github.com/commonsearch/cosr-back

Common Search Pipeline
Doc sources: Common Crawl, WARC files, URLs ...
Filter plugins
Document parsing
Output plugins
Data output: Database, file, HDFS, S3, ...
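
The shape of such a pipeline can be sketched in plain Python. Every name below (the filter, the parser, the document fields) is hypothetical and does not reflect the actual cosr-back plugin API.

```python
def skip_non_html(doc):
    """Filter plugin: keep only documents served as HTML."""
    return doc["content_type"].startswith("text/html")


def parse(doc):
    """Document parsing: a crude stand-in that lower-cases and splits the body."""
    return {"url": doc["url"], "words": doc["body"].lower().split()}


def run_pipeline(sources, filters, outputs):
    """Feed every source document through the filters, the parser and the outputs."""
    for doc in sources:
        if all(keep(doc) for keep in filters):
            parsed = parse(doc)
            for output in outputs:
                output(parsed)


collected = []  # Stand-in for a database, file or HDFS output plugin.
run_pipeline(
    sources=[
        {"url": "https://example.org/", "content_type": "text/html", "body": "Hello Web"},
        {"url": "https://example.org/a.png", "content_type": "image/png", "body": ""},
    ],
    filters=[skip_non_html],
    outputs=[collected.append],
)
```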

SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
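
For readers who want the algorithm itself before opening the linked job, here is a plain-Python power-iteration PageRank. It is a stand-in that runs without Spark and is not the code in pagerank.py; the damping factor of 0.85, the fixed iteration count and the uniform handling of dangling pages are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) link pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Tiny made-up web graph: "c" is linked from every other page.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

In a Spark version the inner loops would typically become a join between the ranks and the edge list followed by an aggregation per target page, repeated for each iteration.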

Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

https://about.commonsearch.org/developer/get-started

Spamdexing
• Keyword stuffing, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Bombing

Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org

3. transparency reproducibility

7. Ranking

12. Static score

15. Dynamic score

17. Scoring function

22. Rescoring

25. Learning to rank

28. PageRank with Spark

33. Most popular Wikipedia pages

34. Dumping the web graph

35. Naive pyspark PageRank

36. GraphFrames

37. SparkSQL PageRank

42. Top 10

44. Spam
