Web Scraping with PHP
Matthew Turland, Acadiana Open Source Group, April 30, 2009
What Is It?
Normal Web Browsing
Difference #1: Immediate Audience
Difference #2: Consumption Method
Why Is It Useful?
Data Without Web Services
Integration Testing
Crawlers
"With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal."
3.14 The Power of Plain Text, The Pragmatic Programmer
Disadvantages
Potential Lack of Stability
Reverse Engineering Required
More Requests
No Nice Neat Data Package
Step #1: Retrieval
Speaking the Language
The Web We Weave
    Request:
        GET / HTTP/1.1
        User-Agent: ...
    Response:
        HTTP/1.1 200 OK
        Content-Type: ...
Browsing -> Requests
    Following a link:
        <a href="/index.php?foo=bar">Index</a>
    sends:
        GET /index.php?foo=bar HTTP/1.1
    Submitting a form:
        <form method="post" action="/index.php">
            <input name="foo" value="bar" />
        </form>
    sends:
        POST /index.php HTTP/1.1

        foo=bar
Responses -> Rendered Elements
    Markup:
        <img src="/intl/en_ALL/images/logo.gif" />
    Request:
        GET /intl/en_ALL/images/logo.gif HTTP/1.1
        Host: google.com
    Response:
        HTTP/1.1 200 OK
        Content-Type: image/gif
        Content-Length: 8558
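The request and response messages on these slides are plain text, so they can be taken apart with ordinary string functions. The following is a minimal sketch, not a complete HTTP parser: the function name parseHttpResponse is hypothetical, and repeated headers, folded headers, and chunked bodies are deliberately ignored.

```php
<?php
// Hypothetical helper: splits a raw HTTP response into status code,
// headers, and body. A simplification for illustration only.
function parseHttpResponse(string $raw): array
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $body = isset($parts[1]) ? $parts[1] : '';

    $lines = explode("\r\n", $parts[0]);

    // Status line: protocol version, status code, reason phrase.
    $status = explode(' ', array_shift($lines), 3);

    $headers = array();
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not a "Name: value" pair
        }
        list($name, $value) = explode(':', $line, 2);
        // Header names are case-insensitive, so normalize them.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return array(
        'code' => isset($status[1]) ? (int) $status[1] : 0,
        'headers' => $headers,
        'body' => $body,
    );
}

$raw = "HTTP/1.1 200 OK\r\nContent-Type: image/gif\r\nContent-Length: 8558\r\n\r\n";
$response = parseHttpResponse($raw);
echo $response['code'], ' ', $response['headers']['content-type'], PHP_EOL; // prints "200 image/gif"
```

In practice an HTTP client library does this work; the sketch only shows what the library is reading on the wire.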
Not As Easy As It Looks
Redirections
Referer [sic]
Cookies
User Agent Sniffing
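Redirections, the Referer header, cookies, and the User-Agent header can all be handled through an HTTP stream context. The sketch below is one possible arrangement, assuming the streams wrapper is used for retrieval; the helper names extractCookies and buildBrowserLikeContext and the user agent string are hypothetical.

```php
<?php
// Hypothetical helper: collects name=value pairs from Set-Cookie response
// headers. Attributes such as Path, Expires, and HttpOnly are ignored.
function extractCookies(array $responseHeaders): array
{
    $cookies = array();
    foreach ($responseHeaders as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        $value = substr($line, strlen('Set-Cookie:'));
        $segments = explode(';', $value, 2);          // drop cookie attributes
        $pair = explode('=', trim($segments[0]), 2);  // split name from value
        if (count($pair) === 2) {
            $cookies[$pair[0]] = $pair[1];
        }
    }
    return $cookies;
}

// Hypothetical helper: builds a stream context that follows redirects and
// sends Referer, Cookie, and User-Agent headers as a browser would.
function buildBrowserLikeContext(string $referer, array $cookies = array())
{
    $headers = array('Referer: ' . $referer);
    if ($cookies) {
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return stream_context_create(array(
        'http' => array(
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
            'follow_location' => 1,  // follow Location headers
            'max_redirects' => 5,    // give up after five hops
            'header' => $headers,
        ),
    ));
}
```

A request would then look like file_get_contents($uri, false, $context), after which PHP populates $http_response_header in the calling scope; passing that array to extractCookies yields the cookies to send on the next request.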
robots.txt
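A scraper can check robots.txt before requesting a path. The following is a deliberately simplified sketch: the function name isPathAllowed is hypothetical, only the "User-agent: *" group is honored, and Allow rules, wildcards, Crawl-delay, and groups that list several user agents are not handled.

```php
<?php
// Hypothetical helper: returns false when $path starts with any Disallow
// value in the "User-agent: *" group of the given robots.txt content.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $inWildcardGroup = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip trailing comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Track whether the following rules apply to every agent.
            $inWildcardGroup = ($value === '*');
        } elseif ($field === 'disallow' && $inWildcardGroup && $value !== '') {
            // An empty Disallow means "nothing is disallowed", so it is skipped.
            if (strpos($path, $value) === 0) {
                return false;
            }
        }
    }

    return true;
}

$robots = "User-agent: *\nDisallow: /private/\n";
var_dump(isPathAllowed($robots, '/private/report.html')); // bool(false)
var_dump(isPathAllowed($robots, '/index.php'));           // bool(true)
```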
Caching
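One way to respect caching is a conditional GET: send back the validators from an earlier response and reuse the stored copy when the server answers 304 Not Modified. This is a minimal sketch; the helper names conditionalHeaders and isNotModified are hypothetical, and storing the validators and the cached body is left out.

```php
<?php
// Hypothetical helper: builds conditional request headers from the
// Last-Modified and ETag values saved from a previous response.
function conditionalHeaders(?string $lastModified, ?string $etag): array
{
    $headers = array();
    if ($lastModified !== null) {
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    return $headers;
}

// Hypothetical helper: true when a status line such as
// "HTTP/1.1 304 Not Modified" says the cached copy is still current.
function isNotModified(string $statusLine): bool
{
    return (bool) preg_match('#^HTTP/\d(?:\.\d)? 304\b#', $statusLine);
}

$headers = conditionalHeaders('Thu, 30 Apr 2009 12:00:00 GMT', '"abc123"');
echo implode(PHP_EOL, $headers), PHP_EOL;
```

The returned array can be passed as the 'header' option of an HTTP stream context or to whichever client library is in use.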
HTTP Authentication
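For HTTP Basic authentication, the client sends the user name and password joined by a colon and base64-encoded in an Authorization header. A minimal sketch, with the helper name basicAuthHeader being hypothetical (Digest authentication is more involved and not shown):

```php
<?php
// Hypothetical helper: builds the Authorization header for HTTP Basic auth.
function basicAuthHeader(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

// Attach the header to a stream context for use with file_get_contents().
$context = stream_context_create(array(
    'http' => array(
        'header' => basicAuthHeader('Aladdin', 'open sesame'),
    ),
));

echo basicAuthHeader('Aladdin', 'open sesame'), PHP_EOL;
// prints "Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```

Because base64 is an encoding rather than encryption, Basic credentials are only protected when the request goes over HTTPS.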
PHP: Glue for the Web
HTTP Client Libraries
    PEAR::HTTP_Client
    pecl_http
    Zend_Http_Client
    Streams, cURL
Simple Streams Example
    $uri = 'http://www.example.com/some/resource';

    // GET request
    $get = file_get_contents($uri);

    // POST request with a form-encoded body
    $context = stream_context_create(array(
        'http' => array(
            'method' => 'POST',
            'header' => 'Content-Type: application/x-www-form-urlencoded',
            'content' => http_build_query(array(
                'var1' => 'value1',
                'var2' => 'value2'
            ))
        )
    ));
    $post = file_get_contents($uri, false, $context);
pecl_http Example
    $http = new HttpRequest($uri);
    $http->enableCookies();
    $http->setMethod(HTTP_METH_POST);
    $http->addPostFields(array('var1' => 'value1'));
    // setOptions() takes a single associative array
    $http->setOptions(array(
        'useragent' => 'PHP ' . phpversion(),
        'referer' => 'http://example.com/some/referer'
    ));
    $response = $http->send();
    $headers = $response->getHeaders();
    $body = $response->getBody();
pecl_http Request Pooling
    $pool = new HttpRequestPool;
    foreach ($urls as $url) {
        $request = new HttpRequest($url, HTTP_METH_GET);
        $pool->attach($request);
    }

    // Send all attached requests in parallel
    $pool->send();

    foreach ($pool as $request) {
        echo $request->getUrl(), PHP_EOL;
        echo $request->getResponseBody(), PHP_EOL;
    }
HTTP Resources
    RFC 2616: HyperText Transfer Protocol
    RFC 3986: Uniform Resource Identifiers
    "HTTP: The Definitive Guide" (ISBN 1565925092)
    "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
    "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
    Ben Ramsey's blog series on HTTP
Step #2: Analysis
Tidy Extension
    $config = array('output-xhtml' => true);

    // Parse markup from a string ...
    $tidy = tidy_parse_string($markupString, $config);
    // ... or from a file
    $tidy = tidy_parse_file($markupFilePath, $config);

    $output = tidy_get_output($tidy);
DOM Extension
    $doc = new DOMDocument;

    // Load markup from a string or from a file
    $doc->loadHTML($htmlString);
    $doc->loadHTMLFile($htmlFilePath);

    // Select nodes by tag name ...
    $listItems = $doc->getElementsByTagName('li');

    // ... or with an XPath expression
    $xpath = new DOMXPath($doc);
    $listItems = $xpath->query('//ul/li');

    foreach ($listItems as $listItem) {
        echo $listItem->nodeValue, PHP_EOL;
    }
SimpleXML Extension
    // Load markup from a string or from a file
    // (second argument is the libxml options bitmask, third marks the first as a path)
    $sxe = new SimpleXMLElement($markupString);
    $sxe = new SimpleXMLElement($filePath, 0, true);

    // Access a single element
    echo $sxe->body->ul->li[0], PHP_EOL;

    // Iterate over child elements
    $children = $sxe->body->ul->li;
    $children = $sxe->body->ul->children();
    foreach ($children as $li) {
        echo $li, PHP_EOL;
    }

    // Access one attribute, then iterate over all of them
    echo $sxe->body->ul['id'];
    $attributes = $sxe->body->ul->attributes();
    foreach ($attributes as $name => $value) {
        echo $name, '=', $value, PHP_EOL;
    }
XMLReader Extension
    // Load markup from a string or from a file
    $doc = XMLReader::xml($xmlString);
    $doc = XMLReader::open($filePath);

    // Walk the document one node at a time
    while ($doc->read()) {
        if ($doc->nodeType == XMLReader::ELEMENT) {
            var_dump($doc->localName);
            var_dump($doc->hasValue);
            var_dump($doc->value);
            var_dump($doc->hasAttributes);
            var_dump($doc->getAttribute('id'));
        }
    }
CSS Selector Libraries
    phpQuery
    Simple HTML DOM Parser
    Zend_Dom_Query

    $doc1 = phpQuery::newDocumentFile($markupFilePath);
    $doc2 = phpQuery::newDocument($markupString);
    $listItems = pq('ul > li'); // uses $doc2
    $listItems = pq('ul > li', $doc1);
PCRE Extension
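The PCRE functions can pull values out of markup when the target is small and predictable. A minimal sketch, assuming the list items carry no attributes and are not nested; the function name extractListItems and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the text between <li> and </li> tags.
// Assumes plain, non-nested <li> elements without attributes.
function extractListItems(string $html): array
{
    // Lazy quantifier (.*?) stops at the first closing tag;
    // the "s" modifier lets "." match newlines inside an item.
    preg_match_all('#<li>(.*?)</li>#s', $html, $matches);
    return $matches[1];
}

$html = '<ul><li>First</li><li>Second</li></ul>';
foreach (extractListItems($html) as $item) {
    echo $item, PHP_EOL;
}
```

For anything beyond a fixed, simple pattern, the DOM-based extensions tend to be more robust against markup changes than regular expressions.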
Best Practices
Approximate Human Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Account for Unavailability
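One common way to account for unavailability is to retry a failed request a limited number of times with a growing delay. The sketch below assumes the fetch operation is passed in as a callable that returns false on failure, as file_get_contents() does; the function name fetchWithRetry and its default values are hypothetical.

```php
<?php
// Hypothetical helper: calls $fetch until it returns something other than
// false, waiting longer after each failed attempt (exponential backoff).
function fetchWithRetry(callable $fetch, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Delay doubles each time: base, 2x base, 4x base, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
    return false; // every attempt failed
}
```

A call might look like fetchWithRetry(function () use ($uri) { return @file_get_contents($uri); }), with the caller deciding what to do when false comes back.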
Aim for Parallelism
Validate Data
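Scraped values can be checked before they are stored, so that a change in the source markup shows up as a validation error rather than as bad data. A minimal sketch using the filter extension; the function name validateRecord and the record fields (title, url, price) are hypothetical.

```php
<?php
// Hypothetical helper: returns a list of problems found in one scraped
// record; an empty list means the record passed every check.
function validateRecord(array $record): array
{
    $errors = array();

    if (!isset($record['title']) || trim($record['title']) === '') {
        $errors[] = 'title is empty';
    }
    if (!isset($record['url'])
        || filter_var($record['url'], FILTER_VALIDATE_URL) === false) {
        $errors[] = 'url is not a valid URL';
    }
    if (!isset($record['price'])
        || filter_var($record['price'], FILTER_VALIDATE_FLOAT) === false) {
        $errors[] = 'price is not numeric';
    }

    return $errors;
}

$record = array('title' => 'Widget', 'url' => 'http://example.com/item/1', 'price' => 'N/A');
print_r(validateRecord($record));
```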
Test, Test, Test!
Questions
