Scrapy - Selectors

Last Updated : 24 Jun, 2021

Scrapy selectors, as the name suggests, are used to select parts of a document. CSS has selectors too: there they pick the HTML tags and text that a style rule applies to.

In Scrapy, selectors specify which parts of a web page our spiders should scrape. To scrape the right data from a site, it is therefore important to select the tags that actually contain that data. Several tools are available to help with this.

Types of selectors:

In Scrapy, there are mainly two types of selectors: CSS selectors and XPath selectors. Both do the same job and can select the same text or data; they differ only in the syntax of the expression passed to them.

CSS selectors: CSS is the language used to style HTML documents, so its selector syntax can also be used in Scrapy to select parts of an HTML file.
XPath selectors: XPath is a language for selecting nodes in XML documents. It can be used on HTML files too, since an HTML document is parsed into a similar tree of nodes.

Description:

Consider the HTML file (index.html) given below, which we will scrape to see how selectors work. We will use the Scrapy shell to run the selector commands.

HTML

<html>
 <head>
  <title>Scrapy-Selectors</title>
 </head>
 <body>
  <div id='Selectors'>
    <h1> This is H1 Tag </h1>
    <span class="SPAN1"> This is Class Selectors SPAN tag </span>
  </div>
 </body>
</html>

The Scrapy shell is opened on this file with the following command:

scrapy shell file:///C:/Users/Dipak/Desktop/index.html

The shell loads index.html and makes it available as the response object.

Using Selectors:

Now we will discuss how to use selectors in Scrapy, taking each of the two types in turn:

CSS Selectors:

CSS selectors can be written in different ways depending on what needs to be selected:

The most basic use is to select a tag in the HTML file by its name, such as <html>, <head> or <body>. The basic format for selecting any tag in the HTML file using Scrapy is given below.

Shell Command: response.css('html').get()
 
# Here the response object calls the css() method to
# target the html tag, and the get() method
# returns the first match, tag included, as a string.

Output: The whole content of the HTML file is returned as a string.

Now let us refine the selection. If we want to select only the text inside a tag, or only the value of one of its attributes, we can use the syntax given below:

# To select the text inside a tag,
# excluding the tag itself, append
# the ::text pseudo-element.
response.css('h1::text').get()

# To read an attribute of the first
# matching tag, use the attrib
# property:
response.css('span').attrib['class']

If the HTML file contains several tags of the same type, we can use the .getall() method instead of .get() to select all of them. It returns a list with the data of every matching tag.
If the tag we try to select is not present in the file, .get() returns None and .getall() returns an empty list. We can also pass a default value to .get() to be returned when nothing is found.

XPath Selectors:

XPath selectors work in the same way as CSS selectors; only the syntax differs.

Given below is the XPath syntax for the same kinds of selection that we made earlier.

# This is to select the text part of 
# title tag using XPATH
response.xpath('//title/text()')
response.xpath('//title/text()').get()

# This is how to select attributes
response.xpath('//span/@class').get()

Properties:

1. We can nest selectors within one another. Since our HTML file contains elements inside the div tag, we can nest selectors to pick a particular element from it. To achieve this, we first select the div tag and then run a second selector on that result.

div_tag = response.xpath('//div')
div_tag.getall()

for tags in div_tag:
    tag = tags.xpath('.//h1').get()
    print(tag)

2. Next, we can also use selectors with regular expressions. When the content we want cannot be pinned down by element or attribute names alone, we can match on it with the .re() method.

The .re() method extracts data based on a content match. It applies the regular expression to the selected content and returns a list of the matching strings (or of the captured groups, if the pattern contains any). In the above HTML file, the h1 and span tags inside the div tag both contain text that starts with "This is". To select them with a regex, we use the regular expression given below:

regexp = r'This\sis\s*(.*)', which we pass to our .re() method.

So our code becomes:

response.css('#Selectors *::text').re(r'This\sis\s*(.*)')


3. EXSLT regular expressions are also supported by Scrapy selectors. They let us apply regular expressions directly inside an XPath expression. This extension provides two namespaces to be used in XPath:

re: Used for regular expressions.
set: Used for set manipulation.

We can use these namespaces inside the expression passed to the xpath() method.

An example is given below:

Suppose we add two more h1 tags and give each h1 tag a class, so that our HTML file now looks like this:

HTML

<html>
 <head>
  <title>Scrapy-Selectors</title>
 </head>
 <body>
  <div id='Selectors'>
    <h1 class='FirstH1'> This is H1 Tag </h1>
    <h1 class='FirstH2'> This is Second H1 Tag </h1>
    <h1 class='FirstH'> This is Third H1 Tag </h1>
    <span class="SPAN1"> This is Class Selectors SPAN tag </span>
  </div>
 </body>
</html>

Now suppose we want to select only the first two h1 tags using a regexp. Their class attribute starts with the string FirstH and ends with a digit, and which digit it is does not matter.

So the code for this is:

response.xpath(r'//h1[re:test(@class, "FirstH\d$")]').getall()

Here we use the re:test() function to test our regular expression against the class attribute of each h1 tag. The regexp selects only those h1 tags whose class value contains FirstH followed by a digit at the end, so the third h1 tag (class FirstH) is not selected.

This was an example of using EXSLT in selectors in Scrapy.

4. If we want, we can chain both kinds of selectors together to refine the selection.

response.css('span').xpath('@class').get()
# CSS is used to select tag and XPATH is 
# used to select attribute

Note:

When nesting XPath selectors, we must take care with relative XPaths. Suppose we selected a div tag as given below:

div_tag = response.xpath('//div')

This selects the div tag along with all the elements inside it. Now assume that the div tag contains some <a> tags. To select those <a> tags with a nested selector, we would write:

for a in div_tag.xpath('.//a'):

This is a relative path (note the leading dot), which tells the selector to look for <a> elements only inside the div tag selected above. If instead we write:

for a in div_tag.xpath('//a'):

it will select every <a> tag in the whole HTML document, not just those inside the div tag. So we should take care to use relative paths.

We can use the Google Chrome extension SelectorGadget to simplify the task of writing selectors. The markup of most websites today is long and hard to read and search when inspected, so this extension helps by letting us pick elements directly on the rendered page.