HOW TO SCRAPE
ANY WEBSITE
FOR FUN ;)
by Anton Rifco
anton.rifco@gmail.com
Some pictures taken from the Internet.
This article carries no copyright. Use it for your own purposes.
July 2013
Let’s scrape this :)
Web Scraping,
a process of automatically collecting (stealing?)
information from the Internet
THE TOOL
You need these tools to steal (uupss) those data:
Python (2.6 or 2.7) with some packages*
Scrapy** framework
Google Chrome with an XPath*** review plugin
Computer, of course
and a functional brain
*) http://doc.scrapy.org/en/latest/intro/install.html#requirements
**) refer to http://scrapy.org/ (these slides won’t cover the installation of those things)
***) I use the “XPath Helper” plugin
S C R A P Y
Not Crappy
Scrapy is an application framework for crawling
web sites and extracting structured data which can be used
for a wide range of useful applications, like data mining,
information processing or historical archival.
S C R A P Y
Not Crappy
Scrapy works by creating logical spiders that will crawl any
website you like.
You define the logic of that spider, using Python
Scrapy uses a mechanism based on XPath expressions called
XPath selectors.
XPath is a W3C standard for navigating through XML documents
(and thus HTML as well)
Here, XML documents are treated as trees of nodes. The
topmost element of the tree is called the root element.
For more, refer to: http://www.w3schools.com/xpath/
X P A T H
X P A T H
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Examples of nodes in the XML document above:
<bookstore> (root element node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)
For more, refer to: http://www.w3schools.com/xpath/
Selecting Nodes
XPath uses path expressions to select nodes in an XML
document. A node is selected by following a path or steps
For more, refer to: http://www.w3schools.com/xpath/
Expression Result
nodename Selects all nodes with the name “nodename”
/ Selects from the root node
// Selects matching nodes anywhere in the document, no matter where they are
. Selects the current node
.. Selects the parent of the current node
@attr Selects the attribute named attr of a node
text() Selects the text value of the chosen node
X P A T H
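To make these expressions concrete, here is a minimal sketch of my own (not from the slides) that runs them against the bookstore XML from the previous slide, using lxml’s XPath engine rather than Scrapy:

# Minimal sketch (my addition): the path expressions above applied
# to the sample bookstore document with lxml.
from lxml import etree

doc = etree.XML("""
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>""")

print doc.xpath("//title/text()")                # ['Harry Potter']
print doc.xpath("//title/@lang")                 # ['en']
print doc.xpath("/bookstore/book/price/text()")  # ['29.99']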
X P A T H
Predicate Expressions
Predicates are used to find a specific node or a node that contains a specific value.
Predicates are always embedded in square brackets.
Expression Result
/bookstore/book[1] Selects the first book element that is the child of the bookstore element
/bookstore/book[last()] Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1] Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3] Selects the first two book elements that are children of the bookstore element
//title[@lang] Selects all the title elements that have an attribute named lang
//title[@lang='eng'] Selects all the title elements that have an attribute named lang with a value of 'eng'
/bookstore/book[price>35.00] Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00
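Continuing the same lxml sketch (again mine, not the deck’s), the predicate forms behave like this on the sample document:

# Predicate expressions on the same doc object as above.
print doc.xpath("/bookstore/book[1]/title/text()")           # ['Harry Potter']
print doc.xpath("//title[@lang='en']/text()")                # ['Harry Potter']
print doc.xpath("/bookstore/book[price>20.00]/year/text()")  # ['2005']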
X P A T H  H E L P E R
With the XPath Helper plugin, you can easily get the XPath
expression of a given node in an HTML doc. Toggle it by
pressing <Ctrl>+<Shift>+X in Chrome
R E A L A C T I O N
Create Scrapy Comesg project
> scrapy startproject comesg
Then, it will create the following project directory structure
comesg/                /* This is the project root */
    scrapy.cfg         /* Project config file */
    comesg/
        __init__.py
        items.py       /* Definition of Items to scrape */
        pipelines.py   /* Pipeline config, for advanced use */
        settings.py    /* Advanced settings file */
        spiders/       /* Directory for spider files */
            __init__.py
            ...
R E A L A C T I O N
Define the Information Items that we want to scrape
Clicking any of the places will open its details
So, of all that data, we want to collect:
name of places,
photo,
description,
address (if any), contact number (if any), opening hours (if any),
website (if any), and video (if any)
from scrapy.item import Item, Field
class ComesgItem(Item):
# define the fields for your item here like:
name = Field()
photo = Field()
desc = Field()
address = Field()
contact = Field()
hours = Field()
website = Field()
video = Field()
On items.py, write the following:
I t e m s D e f i n i t i o n
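As a quick aside (my sketch, not part of the slides): a Scrapy Item supports dict-style access, so filling one in looks like this:

# Items behave like dicts, assuming the ComesgItem definition above.
item = ComesgItem()
item["name"] = "Singapore Zoo"
print item["name"]               # Singapore Zoo
print sorted(item.fields.keys()) # all the declared field names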
R E A L A C T I O N
Basically, here is our strategy:
1. Implement a first spider that will get
the URLs of the listed items
2. Crawl those URLs one by one
3. Implement a second spider that will
fetch all the required data
(In the code that follows, both “spiders” are callbacks on one spider class.)
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy import log

from comesg.items import ComesgItem


class AttractionSpider(CrawlSpider):
    name = "get-attraction"
    allowed_domains = ["comesingapore.com"]  ## will never go outside the playground
    start_urls = [  ## starting URL
        "http://comesingapore.com/travel-guide/category/285/attractions"
    ]
    rules = ()

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://comesingapore.com"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360  ## prevent too-early timeouts

    def parse(self, response):
        print "Start scraping Attractions...."
        try:
            hxs = HtmlXPathSelector(response)
            ## XPath expression to get the URLs of the item detail pages
            links = hxs.select("//*[@id='content']//a[@style='color:black']/@href")

            if not links:
                log.msg("No data to scrape")  ## log before returning (the slide had these reversed)
                return

            for link in links:
                v_url = ''.join(link.extract())

                if not v_url:
                    continue
                else:  ## if valid URL, continue crawling it
                    _url = self.base_url + v_url
                    ## real work handled by the second spider (callback)
                    yield Request(url=_url, callback=self.parse_details)
        except Exception:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
F i r s t s p i d e r
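A caveat from me rather than the deck: in Scrapy releases after these 2013 slides, HtmlXPathSelector and scrapy.conf were deprecated. A sketch of the same link-following step against the newer API (assuming the page structure is unchanged):

# Sketch only: the first spider's parse step in post-1.0 Scrapy.
import scrapy

class AttractionSpider(scrapy.Spider):
    name = "get-attraction"
    allowed_domains = ["comesingapore.com"]
    start_urls = ["http://comesingapore.com/travel-guide/category/285/attractions"]
    custom_settings = {"DOWNLOAD_TIMEOUT": 360}  # replaces settings.overrides

    def parse(self, response):
        # response.xpath() replaces HtmlXPathSelector(response).select()
        for href in response.xpath("//*[@id='content']//a[@style='color:black']/@href").extract():
            # response.urljoin replaces manual base_url concatenation;
            # parse_details would follow as on the next slide
            yield scrapy.Request(response.urljoin(href), callback=self.parse_details)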
    ## (continuation of AttractionSpider -- this method lives in the same class)
    def parse_details(self, response):
        print "Start scraping Detailed Info...."
        try:
            hxs = HtmlXPathSelector(response)
            l_venue = ComesgItem()

            v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[3]/h1/text()").extract()
            if not v_name:
                v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[2]/h1/text()").extract()

            l_venue["name"] = v_name[0].strip()

            base = hxs.select("//*[@id='content']/div[7]")
            ## the string quoting below was broken on the slide; the HTML
            ## snippets need single-quoted Python strings
            if base.extract()[0].strip() == '<div style="clear:both"></div>':
                base = hxs.select("//*[@id='content']/div[8]")
            elif base.extract()[0].strip() == ('<div style="padding-top:10px;margin-top:10px;border-top:1px dotted #DDD;">\n'
                                               ' You must be logged in to add a tip\n </div>'):
                base = hxs.select("//*[@id='content']/div[6]")

            x_datas = base.select("div[1]/b").extract()
            v_datas = base.select("div[1]/text()").extract()
            i_d = 0
            if x_datas:
                for x_data in x_datas:
                    print "data is:" + x_data.strip()
                    if x_data.strip() == "<b>Address:</b>":
                        l_venue["address"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Contact:</b>":
                        l_venue["contact"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Operating Hours:</b>":
                        l_venue["hours"] = v_datas[i_d].strip()
                    if x_data.strip() == "<b>Website:</b>":
                        l_venue["website"] = (base.select("div[1]/a/@href").extract())[0].strip()
                    i_d += 1

            v_photo = base.select("img/@src").extract()
            if v_photo:
                l_venue["photo"] = v_photo[0].strip()
            v_desc = base.select("div[3]/text()").extract()
            if v_desc:
                desc = ""
                for dsc in v_desc:
                    desc += dsc
                l_venue["desc"] = desc.strip()
            ## (the slide ends here; the populated l_venue still has to be
            ## yielded -- see the complete project code linked at the end)
        except Exception:
            ## (except clause not shown on the slide; mirroring the first spider's handler)
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
S e c o n d s p i d e r
R E A L A C T I O N
Run the Project
> scrapy crawl get-attraction -t csv -o attr.csv
In the end, it produces the file attr.csv with the scraped data,
like the following:
> head -3 attr.csv
website,name,photo,hours,contact,video,address,desc
http://www.tigerlive.com.sg,TigerLIVE,http://tn.comesingapore.com/img/others/240x240/f/6/0000246.jpg,Daily
from 11am to 8pm (Last admission at 6.30pm).,(+65) 6270 7676,,"St. James Power Station, 3 Sentosa Gateway,
Singapore 098544",
http://www.zoo.com.sg,Singapore Zoo,http://tn.comesingapore.com/img/others/240x240/6/2/0000098.jpg,Daily
from 8.30am - 6pm (Last ticket sale at 5.30pm),(+65) 6269 3411,http://www.youtube.com/embed/p4jgx4yNY9I,"80
Mandai Lake Road, Singapore 729826","See exotic and endangered animals up close in their natural habitats in
the . Voted the best attraction in Singapore on Trip Advisor, and considered one of the best zoos in the
world, this attraction is a must see, housing over 2500 mammals, birds and reptiles.
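If you then want to consume attr.csv from Python, a minimal sketch (my addition, not from the slides) using the standard csv module:

# Reading the scraped CSV back in with the stdlib csv module.
import csv

with open("attr.csv") as f:
    for row in csv.DictReader(f):
        print row["name"], "->", row["website"]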
Get the complete Project code @
https://github.com/antonrifco/comesg
THANK YOU!
- Anton -