Scrapy
Patrick O’Brien | @obdit
DataPhilly | 20131118 | Monetate
Steps of data science
● Obtain
● Scrub
● Explore
● Model
● iNterpret
Steps of data science
● Obtain ← Scrapy
● Scrub
● Explore
● Model
● iNterpret
About Scrapy
● Framework for collecting data
● Open source - 100% Python
● Simple and well documented
● OSX, Linux, Windows, BSD
Some of the features
● Built-in selectors
● Generating feed output (settings sketch below)
  ○ Format: json, csv, xml
  ○ Storage: local, FTP, S3, stdout
● Encoding and autodetection
● Stats collection
● Control via a web service
● Handle cookies, auth, robots.txt, user-agent
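
As a minimal sketch, feed export can be configured entirely in settings.py (the output file name below is hypothetical; FEED_FORMAT and FEED_URI are Scrapy's feed settings of this era), or done ad hoc with scrapy crawl myspider -o items.json:

# settings.py: export scraped items as JSON to a local file
FEED_FORMAT = 'json'     # also: 'csv', 'xml'
FEED_URI = 'items.json'  # local path; ftp:// and s3:// URIs work too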
Scrapy Architecture
[architecture diagram: Engine, Scheduler, Downloader, Spiders, and Item Pipeline, connected through middleware]
Data flow
1. Engine opens the domain, locates the Spider, and schedules the first URL as a Request
2. Scheduler sends a URL to the Engine, which sends it to the Downloader
3. Downloader sends the completed page as a Response through the middleware to the Engine
4. Engine sends the Response to the Spider through middleware
5. Spider sends Items and new Requests to the Engine
6. Engine sends Items to the Item Pipeline and Requests to the Scheduler
7. GOTO 2
Parts of Scrapy
● Items
● Spiders
● Link Extractors
● Selectors
● Requests
● Responses
Items
● Main container of structured information
● dict-like objects

from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field()
    stock = Field()
    last_updated = Field(serializer=str)
Items
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'

>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Spiders
● Define how to move around a site
  ○ which links to follow
  ○ how to extract data
● Cycle
  ○ Initial request and callback
  ○ Store parsed content
  ○ Subsequent requests and callbacks
Generic Spiders
● BaseSpider
● CrawlSpider
● XMLFeedSpider (sketch below)
● CSVFeedSpider
● SitemapSpider
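
A minimal XMLFeedSpider sketch, assuming a hypothetical feed URL and tag names (the import path is the 2013-era scrapy.contrib one; parse_node is called once per matched node):

from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import Product  # the Item defined earlier; project path hypothetical

class FeedSpider(XMLFeedSpider):
    name = 'feedspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']  # hypothetical feed
    itertag = 'item'  # parse_node fires for every <item> element

    def parse_node(self, response, node):
        # node is a selector scoped to one <item>
        return Product(name=node.xpath('title/text()').extract())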
BaseSpider
● Every other spider inherits from BaseSpider
● Two jobs
  ○ Request `start_urls`
  ○ Callback `parse` on resulting response
BaseSpider
...
class MySpider(BaseSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = [

● Send Requests example.
com/[1:3].html

'http://www.example.com/1.html',
'http://www.example.com/2.html',
'http://www.example.com/3.html',
]

● Yield title Item

def parse(self, response):
sel = Selector(response)
for h3 in sel.xpath('//h3').extract():
yield MyItem(title=h3)

for url in sel.xpath('//a/@href').extract():
yield Request(url, callback=self.parse)

● Yield new Request
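
To run the spider from inside a Scrapy project: scrapy crawl example.com (the crawl command looks spiders up by their name attribute).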
CrawlSpider
● Provide a set of rules on what links to follow (full spider sketch below)
  ○ `link_extractor`
  ○ `callback`

rules = (
    # Extract links matching 'category.php' (but not 'subsection.php')
    # and follow links from them (no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    # Extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'),
)
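
Putting the rules in context, a minimal CrawlSpider sketch (domain and extraction are hypothetical; import paths are the 2013-era scrapy.contrib ones):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from myproject.items import Product  # the Item defined earlier; project path hypothetical

class MyCrawlSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Follow category pages; no callback, so follow=True by default
        Rule(SgmlLinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
        # Parse item pages with parse_item
        Rule(SgmlLinkExtractor(allow=('item.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        # Hypothetical extraction: build the Product item from the Items slide
        return Product(name=sel.xpath('//h1/text()').extract())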
Link Extractors
● Extract links from Response objects
● Parameters include (usage sketch below):
  ○ allow | deny
  ○ allow_domains | deny_domains
  ○ deny_extensions
  ○ restrict_xpaths
  ○ tags
  ○ attrs
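
Link extractors can also be used standalone; a sketch, assuming a hypothetical restrict_xpaths container:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

lx = SgmlLinkExtractor(
    allow=('item.php', ),
    restrict_xpaths=('//div[@id="content"]', ),  # hypothetical container
)
for link in lx.extract_links(response):  # returns Link objects
    print link.url, link.text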
Selectors
● Mechanisms for extracting data from HTML
● Built over the lxml library
● Two methods
  ○ XPath: sel.xpath('//a[contains(@href, "image")]/@href').extract()
  ○ CSS: sel.css('a[href*=image]::attr(href)').extract()
● A Response object is passed into a Selector
  ○ sel = Selector(response)
Request
● Generated in the Spider, sent to the Downloader
● Represents an HTTP request
● FormRequest subclass performs HTTP POST (login sketch below)
  ○ useful to simulate a user login
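
A login sketch using FormRequest.from_response, which pre-fills fields from the page's form (field names, credentials, and the success check are hypothetical; these methods live inside a spider):

from scrapy.http import FormRequest

def parse_login_page(self, response):
    # Fill and submit the login form found in this response
    return FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'secret'},  # hypothetical credentials
        callback=self.after_login,
    )

def after_login(self, response):
    if 'Welcome' in response.body:  # hypothetical success marker
        self.log('login succeeded')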
Response
● Comes from the Downloader, sent to the Spider
● Represents an HTTP response
● Subclasses
  ○ TextResponse
  ○ HtmlResponse
  ○ XmlResponse
Advanced Scrapy
● Scrapyd
  ○ application to deploy and run Scrapy spiders
  ○ deploy projects and control them with a JSON API
● Signals
  ○ notifications when events occur
  ○ hook into the Signals API for advanced tuning
● Extensions
  ○ custom functionality loaded at Scrapy startup (sketch below)
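
A sketch of a small extension hooked into the Signals API (the class name and log line are hypothetical; from_crawler and signals.spider_closed are standard Scrapy):

from scrapy import signals

class SpiderClosedLogger(object):
    # Hypothetical extension: log a line whenever a spider closes

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Subscribe to the spider_closed signal
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider.log('spider closed: %s' % spider.name)

Enable it through the EXTENSIONS setting. Scrapyd is driven the same way over HTTP: POST a project and spider name to its schedule.json endpoint to queue a run.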
More information
● http://doc.scrapy.org/
● https://twitter.com/ScrapyProject
● https://github.com/scrapy
● http://scrapyd.readthedocs.org
Demo
