Scraping HTML Chapter2

The document discusses XPath navigation techniques in Python for web scraping, including:
- Using slashes and brackets to look forward in the HTML structure and narrow selections.
- Selecting elements by tag name, attribute values, and relative position.
- The wildcard character "*" to select all child elements, and contains() to match partial text in attributes.
- Creating Selector objects in Scrapy to parse HTML content and extract data using XPath queries.

XPath Navigation

WEB SCRAPING IN PYTHON

Thomas Laetsch
Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation

Double forward slash // looks forward to all future generations

Square brackets [] help narrow in on specific elements

To Bracket or not to Bracket
xpath = '/html/body'

xpath = '/html[1]/body[1]'

Both XPaths give the same selection

A Body of P
xpath = '/html/body/p'

The Birds and the Ps
xpath = '/html/body/div/p'

xpath = '/html/body/div/p[2]'

Double Slashing the Brackets
xpath = '//p'

xpath = '//p[1]'

The Wildcard
xpath = '/html/body/*'
The asterisk * is the "wildcard"

Xposé

Off the Beaten XPath

(At)tribute
@ represents "attribute"
@class

@id

@href

Brackets and Attributes
xpath = '//p[@class="class-1"]'

xpath = '//*[@id="uid"]'

xpath = '//div[@id="uid"]/p[2]'

Content with Contains
XPath contains notation:

contains( @attri-name, "string-expr" )

Contain This
xpath = '//*[contains(@class,"class-1")]'

xpath = '//*[@class="class-1"]'

Get Classy
xpath = '/html/body/div/p[2]'

xpath = '/html/body/div/p[2]/@class'

End of the Path

Introduction to the scrapy Selector

Setting up a Selector
from scrapy import Selector

html = '''
<html>
<body>
<div class="hello datacamp">
<p>Hello World!</p>
</div>
<p>Enjoy DataCamp!</p>
</body>
</html>
'''

sel = Selector( text = html )

This creates a scrapy Selector object from a string containing the HTML code

The selector sel now selects the entire HTML document

Selecting Selectors
We can use the xpath call within a Selector to create new Selectors of specific pieces of the HTML code

The return is a SelectorList of Selector objects

sel.xpath("//p")

# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

Extracting Data from a SelectorList
Use the extract() method

>>> sel.xpath("//p")

out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
      <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

>>> sel.xpath("//p").extract()

out: [ '<p>Hello World!</p>',
       '<p>Enjoy DataCamp!</p>' ]

We can use extract_first() to get the first element of the list

>>> sel.xpath("//p").extract_first()

out: '<p>Hello World!</p>'

Extracting Data from a Selector
ps = sel.xpath('//p')

second_p = ps[1]

second_p.extract()

out: '<p>Enjoy DataCamp!</p>'

Select This Course!

"Inspecting the HTML"

Thomas Laetsch, PhD
Data Scientist, NYU
"Source" = HTML Code

Inspecting Elements

HTML text to Selector
from scrapy import Selector

import requests

url = 'https://www.datacamp.com/courses/all'

html = requests.get( url ).content

sel = Selector( text = html )

You Know Our Secrets