Assignment on Crawling a website
The goal of this lab is to learn how to scrape web pages using BeautifulSoup.
Introduction
BeautifulSoup is a Python library that parses HTML files and allows you to extract information
from them. HTML files are the files that are used to represent web pages.
Applying BeautifulSoup to a scraping task involves:
1. Inspecting the source code of the web page (in a text editor or browser) to infer its structure.
2. Using that structural information to write code, with BeautifulSoup, that pulls the data out.
Web crawling is the process of systematically browsing the web to extract data from websites.
BeautifulSoup is a Python library that makes it easy to scrape and parse HTML and XML
documents.
How Web Crawling Works
1. Fetching Web Pages: Use the requests module to download the webpage.
2. Parsing the HTML: Use BeautifulSoup to analyze and navigate the webpage’s
structure.
3. Extracting Data: Identify and extract useful elements (e.g., text, links, images).
4. Following Links (Optional): Move from one page to another (crawling multiple pages).
5. Saving Data: Store the extracted data in files or databases.
Installation
sudo pip3 install --upgrade beautifulsoup4 html5lib
OR
pip3 install --user --upgrade beautifulsoup4 html5lib
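After installation, a quick sanity check (a minimal sketch; the version numbers you see will differ) is to import both packages and print their versions:
import bs4
import html5lib
# Both imports succeeding means the installation worked; versions will vary.
print(bs4.__version__)
print(html5lib.__version__)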
# Setup
import requests
from bs4 import BeautifulSoup

url = "https://www.blogger.com/"

# Get the HTML
r = requests.get(url)
htmlContent = r.content
# print(htmlContent)

# Parse the HTML
soup = BeautifulSoup(htmlContent, 'html.parser')
# print(soup)
# print(soup.prettify())

# HTML tree traversal object types: Tag, NavigableString, BeautifulSoup, Comment
title = soup.title
# print(type(title))          # Tag
# print(type(title.string))   # NavigableString
# print(type(soup))           # BeautifulSoup

paras = soup.find_all('p')
# print(paras)
anchors = soup.find_all('a')
# print(anchors)
# print(soup.find('p'))
# print(soup.find('p').get_text())
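To make the commented-out type checks above concrete, here is a small, self-contained sketch of the four parse-tree object types (BeautifulSoup, Tag, NavigableString, Comment); the HTML snippet in it is made up purely for illustration:
from bs4 import BeautifulSoup, Comment

# A tiny hand-written HTML snippet, used only to show the node types.
doc = "<p>Hello <b>world</b><!-- a comment --></p>"
tree = BeautifulSoup(doc, "html.parser")

print(type(tree))           # <class 'bs4.BeautifulSoup'>
print(type(tree.p))         # <class 'bs4.element.Tag'>
print(type(tree.b.string))  # <class 'bs4.element.NavigableString'>

# Comments are text nodes of type Comment; find one inside the <p> tag.
comment = tree.p.find(string=lambda s: isinstance(s, Comment))
print(type(comment))        # <class 'bs4.element.Comment'>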
Example Code: Scraping Quotes from a Website
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch the webpage content
url = "https://quotes.toscrape.com/"
response = requests.get(url)
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# Step 3: Extract specific data (quotes and authors)
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")
# Step 4: Display the extracted data
for i in range(len(quotes)):
    print(f"Quote: {quotes[i].text}")
    print(f"Author: {authors[i].text}")
    print("-" * 50)
