Python
By
Supervisor: Dr Ravindra Kumar
Arnav Lakha 1/20/FET/BCS/112
Shashank Rai 1/20/FET/BCS/106
Shourya Ahuja 1/20/FET/BCS/115
Arun 1/20/FET/BCS/086
11/29/2023 1
Mohit Chaudhary 1/20/FET/BCS/087
Outline
• Introduction to scraping
• What is image scraping?
• Is image scraping legal?
• Introduction to Python scraper
• How to perform image scraping
• Some scraping knowledge
TABLE OF CONTENTS
1) Introduction
2) Problem Statements
3) Objectives
4) Hardware and Software Requirements
5) Literature Review
6) System Design
7) Methodology
8) Expected Outcome of Project / Result
9) Conclusion & Future Scope
10) References
Introduction
• From retail and real estate to tourism and hospitality, images play a
vital role in influencing customer decisions. Hence, it is important for
brands to see what kinds of photos are turning prospects into
customers.
• On the other side, customers go through numerous products and
images before settling on a final choice. Similarly, analysts browse
several pages and analyze hundreds of images to gain any meaningful
insight. In such cases, they have to download these images, which is
extremely error-prone and time-consuming when done manually.
• In these scenarios, we need image scraping to automate the collection.
Introduction to scraping
• There are many different tools for scraping available,
which differ in their functionality and use.
• Tools and frameworks come and go; choose the one
that fits the job.
• Scraping: the actual extraction of data or information
from a web page.
What is image scraping?
Is Image Scraping Legal?
Like general web scraping, image scraping is a method for downloading website
content. It is not inherently illegal, but there are rules and best practices you should
follow. First, avoid scraping a website that explicitly states it does not want to be
scraped. You can check this by looking for a /robots.txt file on the target site.
Most websites allow web crawling because they want search engines to index their
content. You can generally scrape such websites since their images are publicly available.
However, being able to download an image does not mean you can use it as if it were
your own. Most websites license their images to prevent you from republishing or
otherwise reusing them. Always assume that you cannot reuse images unless there is
a specific exemption.
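The robots.txt check described above can be sketched with Python's standard-library urllib.robotparser. This is a minimal sketch, not a legal test: the rules, the user-agent name "my-scraper", and the example.com URLs are all hypothetical, and the rules are passed in as text so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # For a live site you would instead call parser.set_url(".../robots.txt")
    # followed by parser.read(); here the rules are supplied directly.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except the /private/ directory.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(rules, "my-scraper", "https://example.com/images/cat.jpg"))
print(is_allowed(rules, "my-scraper", "https://example.com/private/cat.jpg"))
```

A permissive robots.txt only says that crawling is tolerated; it says nothing about whether the downloaded images may be reused.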
Best practices for image scraping to avoid common challenges
It is essential to scrape image data cautiously and follow best practices in order to avoid
technical and legal issues. Here are some best practices for image scraping:
• Check image formats and sizes: Images come in various formats, such as JPEG, PNG,
and GIF, and in various sizes, from small thumbnails to full-resolution files. Ensure
that your image scraper can handle all of them.
• Follow ethical and legal guidelines: Image scraping may be illegal under certain
conditions, such as when it violates copyright laws. Check the terms of service and the
robots.txt file of the website you intend to scrape to ensure your data collection activity
does not violate any rules or policies. For example, many websites employ rate limits to
manage crawling traffic and prevent the overuse of their APIs. Check for any
rate limits imposed by the website or its API and comply with them to avoid being blocked.
• Respect the website's server and bandwidth: Limit the frequency and volume of
your requests, or add time delays between them. You can also use caching
techniques to avoid requesting the same image data multiple times.
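The last practice, delays plus caching, could look roughly like the sketch below. The class name PoliteFetcher, the one-second default delay, and the in-memory dictionary cache are assumptions made for illustration; a real project might prefer an on-disk cache or a limit published by the site.

```python
import time
from urllib.request import urlopen


class PoliteFetcher:
    """Fetch URLs with a minimum delay between requests and an in-memory cache."""

    def __init__(self, fetch=None, delay_seconds=1.0):
        # `fetch` is any callable that takes a URL and returns bytes.
        self._fetch = fetch or self._default_fetch
        self._delay = delay_seconds
        self._cache = {}
        self._last_request = None

    @staticmethod
    def _default_fetch(url):
        with urlopen(url, timeout=10) as response:
            return response.read()

    def get(self, url):
        # Serve repeated requests from the cache without touching the server.
        if url in self._cache:
            return self._cache[url]
        # Sleep only for whatever is left of the delay window.
        if self._last_request is not None:
            remaining = self._delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        data = self._fetch(url)
        self._last_request = time.monotonic()
        self._cache[url] = data
        return data


# Demonstration with a stand-in fetch function, so nothing is downloaded here.
requested = []


def fake_fetch(url):
    requested.append(url)
    return b"image-bytes"


fetcher = PoliteFetcher(fetch=fake_fetch, delay_seconds=0.1)
fetcher.get("https://example.com/a.jpg")
fetcher.get("https://example.com/a.jpg")  # cache hit, no second request
fetcher.get("https://example.com/b.jpg")  # waits out the rest of the 0.1 s delay
print(requested)
```

Passing the fetch function in as a parameter keeps the throttling logic independent of the HTTP library, so the same wrapper could sit around Requests instead of urllib.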
Image scraping with Python
You can scrape images from a web page using Python by following these steps:
1. Install the necessary libraries: The scraping library you choose will depend on your
specific data collection requirements. Beautiful Soup and Requests are typically the easiest
for basic image scraping tasks. Scrapy provides a fuller crawling framework for larger jobs,
and Pillow handles image processing once the files are downloaded. Selenium is generally
used for scraping dynamic web pages that require user interaction, such as clicking buttons
or navigating menus.
You can install the desired library using pip, the Python package installer. For
example, to install Requests, type the "pip install requests" command into your prompt or
terminal.
2. Identify the image URLs on the web page you wish to scrape: You can inspect the
HTML source code of a page using the developer tools in your browser. Image URLs are
generally found in the src attribute of an <img> tag in the HTML content (Figure 1). Copy
the image URL from the src attribute to use in your Python script.
Introduction to Python scraper
How to perform image scraping?
• bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This
module does not come built in with Python. To install it, type the command below in the
terminal.
• pip install beautifulsoup4
• requests: Requests lets you send HTTP/1.1 requests with very little code. This module also does
not come built in with Python. To install it, type the command below in the terminal.
• pip install requests
• Approach:
• Import the modules
• Make a request by passing the URL to requests.get()
• Pass the response text into the BeautifulSoup() constructor
• Use find_all('img') to collect every <img> tag and read each tag's 'src' attribute
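The parsing part of this approach can be sketched without installing anything. Note the substitution: the example below uses the standard library's html.parser.HTMLParser as a stand-in for Beautiful Soup, where the equivalent call would be BeautifulSoup(html, "html.parser").find_all("img"). The class name ImageSrcCollector and the sample HTML are hypothetical; in a real script the HTML would come from the response of requests.get(url).

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_sources(html):
    """Return the src values of all <img> tags, in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical page content standing in for a downloaded page.
sample_html = """
<html><body>
  <img src="/static/logo.png" alt="Logo">
  <img src="https://example.com/photos/cat.jpg">
  <img alt="no source">
</body></html>
"""

print(find_image_sources(sample_html))
```

Tags without a src attribute are skipped, since there is nothing to download for them.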
3. Request the target web page: Once you have identified the
target URLs, you can send a request to the web page containing
the images you want to scrape. For instance, if you are using the
Requests library to scrape an Amazon product image, you can
use the following code.
import requests
url = 'https://amazon.com/xyz'
response = requests.get(url)
4. Parse the HTML content: You can use a Python library like
Beautiful Soup or lxml to parse the HTML content of the response.
5. Extract the image URLs: For every image tag, read the 'src'
attribute, which holds the URL of the image file that needs to be
downloaded.
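One detail the extraction step tends to run into is that src values are often relative paths rather than full URLs. The sketch below resolves them against the page URL with urllib.parse.urljoin, drops duplicates, and skips inline data: URIs. The function name normalize_image_urls and the example.com addresses are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize_image_urls(page_url, sources):
    """Turn raw src values into unique absolute http(s) URLs, keeping their order."""
    seen = set()
    absolute = []
    for src in sources:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        url = urljoin(page_url, src.strip())
        # Skip inline data: URIs and any other non-web scheme.
        if urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            absolute.append(url)
    return absolute


# Hypothetical page URL and src values as they might appear in the HTML.
page = "https://example.com/products/item.html"
found = [
    "/static/logo.png",
    "thumb.jpg",
    "//cdn.example.com/a.png",
    "data:image/gif;base64,R0lGOD",
    "thumb.jpg",
]

for url in normalize_image_urls(page, found):
    print(url)
```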
6. Download all the images: Once you have the image URLs, you
must download the images from them. Python offers several options
for this, such as the built-in urllib package (urllib2 in Python 2)
and the third-party Requests library.
• urllib: It is part of the Python standard library. You can download an
image using the urlretrieve() function in urllib.request.
• urllib2: It exists only in Python 2, where it provides more advanced features
for sending HTTP requests. In Python 3 the same urlopen() function lives in
urllib.request; use it to open a connection to the image URL and call read()
to get the image data.
• Requests: It is a third-party Python library. You can use the get() function
to send a request to the target URL and the content attribute of the response
to access the image data.
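A minimal sketch of the urlopen() and read() route on Python 3 is shown below. The function name download_image, the ten-second timeout, and the User-Agent string are assumptions for illustration. The demonstration uses a data: URL, which urlopen also accepts, so the example runs without contacting any server.

```python
from urllib.request import Request, urlopen


def download_image(url, timeout=10):
    """Return the raw bytes served at url."""
    # The User-Agent value is a made-up example; some sites reject requests without one.
    request = Request(url, headers={"User-Agent": "image-scraper-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Offline demonstration: a data: URL carrying the base64 text "hello".
demo = download_image("data:text/plain;base64,aGVsbG8=")
print(demo)  # b'hello'

# With the Requests library, the equivalent would be requests.get(url).content.
```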
7. Save the downloaded image data: Finally, save the downloaded
images to your local file system. For example, you can use the "os"
module to build a path inside the directory /path/to/images and the
built-in open() function in binary write mode to write the data to a
file called image.jpg there. You can change the image filename to
suit your needs.
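The save step might be sketched as follows. The function name save_image and the rule of taking the file name from the last segment of the URL path are assumptions; image.jpg is used as the fallback name, as in the text above. The demonstration writes into a temporary directory rather than /path/to/images so it can run anywhere.

```python
import os
import tempfile
from urllib.parse import urlparse


def save_image(data, url, directory):
    """Write image bytes into directory, naming the file after the URL path."""
    os.makedirs(directory, exist_ok=True)
    # Query strings such as ?size=large are not part of the path, so they are dropped.
    name = os.path.basename(urlparse(url).path) or "image.jpg"
    path = os.path.join(directory, name)
    # Images are binary data, so the file must be opened in "wb" mode.
    with open(path, "wb") as image_file:
        image_file.write(data)
    return path


# Demonstration with placeholder bytes and a hypothetical URL.
with tempfile.TemporaryDirectory() as workdir:
    saved = save_image(b"fake-image-bytes", "https://example.com/photos/cat.jpg?size=large",
                       os.path.join(workdir, "images"))
    print(os.path.basename(saved))
```

Two images with the same file name would overwrite each other in this sketch; adding a counter or a hash of the URL to the name is one way around that.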
Some scraping knowledge
• Python: the language used to extract images from the
web page
• HTTP: the communication protocol
• HTML: the language in which web pages are defined
• JS: JavaScript (code executing in the browser)
• CSS: style sheets, how web pages are styled.
Important, but does not contain data.
• JPG, PNG, BMP: images
• CSV / TXT / JSON / XML: data
PROBLEM STATEMENT
Project OBJECTIVES
To identify the gaps in the existing techniques and find the scope of ...
METHODOLOGY
EXPECTED OUTCOME
• This aims to …
REFERENCES
• https://research.aimultiple.com/image-scraping/
Thank You!