
Web Scraping For Benchmarking

March 27, 2023

Elaborated by: Nour Sfar


Project Phase: n°4 (Web Scraping and Data Cleaning)
Purpose of this phase: collecting the necessary data for the final project phase.
Chosen Product: Wireless Headphones
Scraped Website: MyteK.tn

1 Importing the necessary modules


[67]: import re                      # regular expressions
import math as mt              # mathematical functions
import pandas as pd            # data manipulation and analysis
import requests                # making HTTP requests
from bs4 import BeautifulSoup  # parsing HTML documents

Now that the necessary modules are imported, we can use them in the following cells for web
scraping, data cleaning, and analysis.

2 Creating a list of URLs to scrape


First, we create a list of URLs for the search term ‘Écouteurs Sans Fil’ by looping through page
numbers 1 to 10 and appending each URL to the list. Next, we do the same for the search term
‘Ecouteurs bluetooth’ over pages 1 to 5. Finally, we add the URLs for the ‘earbuds’ category over
pages 1 to 6.

[4]: # MYTEK
URLS = []  # an empty list to store the URLs
for i in range(1, 11):
    url1 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=%C3%89couteurs+Sans+Fil'
    URLS.append(url1)
for i in range(1, 6):
    url2 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=Ecouteurs+bluetooth'
    URLS.append(url2)
for i in range(1, 7):
    url3 = 'https://www.mytek.tn/image-son/son-numerique/earbuds.html?p=' + str(i)
    URLS.append(url3)

In the next cell, the list of URLs was displayed for testing purposes; it is left commented out here.

[5]: #URLS

Now we have a list of URLs that we can use to scrape data from the MyTek website. We’ll use the
BeautifulSoup library to extract the desired information from the HTML content of each URL.

3 Scraping phase
[31]: # Initializing empty lists to store the extracted data
list_ref = []    # product references
list_price = []  # product prices
list_name = []   # product names
list_avail = []  # product availability status
list_feat = []   # product features

# Looping through the list of URLs and extracting data from each page
for u in URLS:
    response = requests.get(u)  # sending an HTTP request to the URL
    if response.ok:  # ensuring that the link is scrapable (status code 200)
        # parsing the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.find('title')  # the page title (not used in this code)
        # finding all the product items on the page
        lists = soup.find_all('li', class_='item product product-item')
        # extracting the reference number, price, name, availability status,
        # and features of each product
        for i in lists:
            r = i.find("div", class_="skuDesktop").text
            if r not in list_ref:
                # skip duplicates: since we scrape many links, the same
                # product may appear on several pages
                p = i.find("span", class_="price").text           # price
                n = i.find('a', class_='product-item-link').text  # name
                a = i.find("div", class_="card-body").text        # availability
                f = i.find("div", class_="strigDesc").next.text   # features
                # adding the extracted data to their respective lists
                list_ref.append(r)
                list_price.append(p)
                list_name.append(n)
                list_avail.append(a)
                list_feat.append(f)

Now we have extracted the data from all the URLs in the list, and stored it in separate lists. Before
combining the data to create a DataFrame, we will first need to clean our data and extract more
detailed characteristics from the features and product name lists.

4 Data Cleaning Phase


[32]: # Filtering out non-headphone products by removing them from all the lists.
# Popping in descending order so earlier removals don't shift the later
# indexes (assuming 193, 232 and 252 index the original, unmodified lists).
for i in [252, 232, 193]:  # indexes of non-headphone products
    list_ref.pop(i)    # remove the product's reference
    list_price.pop(i)  # remove the product's price
    list_name.pop(i)   # remove the product's name
    list_avail.pop(i)  # remove the product's availability status
    list_feat.pop(i)   # remove the product's features

[33]: # Simplifying the price strings so their type can be converted from str to int/float.
# Initially the prices look like '1\xa0099,000\xa0DT' or '289,000\xa0DT'.
for i in range(len(list_price)):
    list_price[i] = list_price[i].replace('DT', '')
    list_price[i] = list_price[i].replace('\xa0', '')
    list_price[i] = list_price[i].replace(',000', '')
    list_price[i] = list_price[i].replace(',', '.')
# Now the prices look like '1099' or '289'.

[34]: # Converting the price type from str to int.
for i in range(len(list_price)):
    list_price[i] = float(list_price[i])  # convert the string to a float
    list_price[i] = round(list_price[i])  # round to the nearest integer
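The two price-cleaning cells above can be condensed into a single helper; a minimal sketch (the function name `parse_price` is ours, not part of the notebook):

```python
def parse_price(raw):
    """Convert a MyTek price string such as '1\xa0099,000\xa0DT' to an int (in DT)."""
    s = raw.replace('DT', '').replace('\xa0', '')  # drop currency and non-breaking spaces
    s = s.replace(',000', '').replace(',', '.')    # drop millimes / normalize decimals
    return round(float(s))

print(parse_price('1\xa0099,000\xa0DT'))  # 1099
print(parse_price('289,000\xa0DT'))       # 289
```

A single function like this is easier to unit-test than in-place list mutation, and it could be applied with `list_price = [parse_price(p) for p in list_price]`.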

5 Extracting more detailed characteristics from the features and product name lists
[35]: # Classifying each product as either "Casque Bluetooth" or "EarBuds"
# to differentiate between the two types of headphones.
list_type = []  # an empty list to store the results
ln = len(list_name)
for i in range(ln):
    x = list_name[i].split()
    # print(x)  (for testing)
    if x[0] in ["Casque", "Micro"]:
        list_type.append("Casque Bluetooth")
    else:
        list_type.append("EarBuds")

[36]: # Extracting the brand name of each product
list_brand = []  # an empty list to store the results
ln = len(list_name)
for i in range(ln):
    x = list_name[i].split()
    if "Hi-Fi" in x:
        j = x.index("Hi-Fi")
        list_brand.append(x[j+1])
    elif "Headset" in x:
        j = x.index("Headset")
        list_brand.append(x[j+1])
    elif "Fantasy" in x:
        list_brand.append("Fantasy")
    elif "5" in x:
        list_brand.append("5 Plus Pro")
    elif "SAMSUNG" in x:
        j = x.index("SAMSUNG")
        list_brand.append(x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "OPPO" in x:
        j = x.index("OPPO")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "APPLE" in x:
        j = x.index("APPLE")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "XIAOMI" in x:
        j = x.index("XIAOMI")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "NOKIA" in x:
        j = x.index("NOKIA")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2])
    elif "JBL" in x:
        j = x.index("JBL")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2])
    elif "LENOVO" in x:
        j = x.index("LENOVO")
        list_brand.append(x[j] + " " + x[j+1])
    elif "BELKIN" in x:
        j = x.index("BELKIN")
        list_brand.append(x[j] + " " + x[j+1])
    elif "SONY" in x:
        j = x.index("SONY")
        list_brand.append(x[j] + " " + x[j+1])
    elif "HAMA" in x:
        list_brand.append("HAMA")
    elif "Pliable" in x:
        j = x.index("Pliable")
        list_brand.append(x[j+1])
    elif "KSIX" in x:
        list_brand.append("KSIX")
    elif "YOOKIE" in x:
        list_brand.append("YOOKIE")
    elif "HP" in x:
        list_brand.append("HP")
    elif "PHILIPS" in x:
        list_brand.append("PHILIPS")
    elif "REDRAGON" in x:
        list_brand.append("REDRAGON")
    elif "SPIRIT" in x:
        list_brand.append("SPIRIT OF GAMER")
    elif "MI" in x:
        list_brand.append("XIAOMI")
    elif "BC" in x:
        list_brand.append("BC MASTER")
    elif "ENERGY" in x:
        list_brand.append("ENERGY SISTEM")
    elif "Sans" in x:
        j = x.index("Sans")
        list_brand.append(x[j+2])
    elif "sans" in x:
        j = x.index("sans")
        list_brand.append(x[j+2])
    elif "SANS" in x:
        j = x.index("SANS")
        list_brand.append(x[j+2])
    elif "Ultra" in x:
        j = x.index("Ultra")
        list_brand.append(x[j+2])
    elif "Casque" in x:
        j = x.index("Casque")
        list_brand.append(x[j+1])
    elif x[1] == "Bluetooth":
        list_brand.append(x[2])
    else:
        list_brand.append(x[1])
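The keyword-only cases in the long elif chain above could also be driven by a lookup table, which is easier to extend as new brands appear; a minimal sketch (the names `FIXED_BRANDS` and `brand_from_tokens` are ours, and the token-scan order differs slightly from the chain's check order):

```python
# Hypothetical lookup table for the cases where one keyword fixes the brand.
FIXED_BRANDS = {
    "HAMA": "HAMA", "KSIX": "KSIX", "YOOKIE": "YOOKIE", "HP": "HP",
    "PHILIPS": "PHILIPS", "REDRAGON": "REDRAGON",
    "SPIRIT": "SPIRIT OF GAMER", "MI": "XIAOMI",
    "BC": "BC MASTER", "ENERGY": "ENERGY SISTEM",
}

def brand_from_tokens(tokens):
    """Return the brand for a keyword-only case, or None if no keyword matches."""
    for t in tokens:
        if t in FIXED_BRANDS:
            return FIXED_BRANDS[t]
    return None

print(brand_from_tokens("Ecouteurs HAMA Freedom Light".split()))  # HAMA
```

The positional cases (e.g. SAMSUNG followed by three model tokens) would still need per-brand offsets, but those too could be stored as (offset, count) pairs in the same table.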

In the next cell, “list_name” was displayed in order to identify the appropriate code for extracting
the brand names.

[37]: #list_name

To ensure the accuracy of the list of brand names, it was necessary to verify each element one by
one and confirm that every appended brand name was correct, so “list_brand” is displayed in the
next cell.

[38]: #list_brand

[39]: # Correcting some brand names that were inaccurately appended to the list
for i, e in enumerate(list_brand):
    if e in ["Sport", "Earbuds", "Magnétiques", "avec"]:
        list_brand[i] = "sans marque"
    elif e == "Bluetooth":
        list_brand[i] = "TWS"

[57]: # Extracting the color of each product
list_colo = []  # an empty list to store the results
for e in list_name:
    x = e.split(" - ")
    if len(x) == 2:
        list_colo.append(x[1])  # the color follows the " - " separator
    else:
        z = x[0].split()
        list_colo.append(z[-1])  # fall back to the last word of the name
# Some products do not have an indicated color.
for i in range(len(list_colo)):
    if list_colo[i] in ["Lipstick", "HP-03", "mémoire", "Pro(MXY72LL-A)"]:
        list_colo[i] = None
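The color heuristic in the cell above can be wrapped in a small reusable function; a sketch (the name `extract_color` is ours):

```python
def extract_color(name):
    """Color is the part after ' - ' when present, else the last word (a heuristic)."""
    parts = name.split(" - ")
    if len(parts) == 2:
        return parts[1]
    return parts[0].split()[-1]

print(extract_color("Ecouteur JBL T110 - Bleu"))  # Bleu
print(extract_color("Casque SONY Noir"))          # Noir
```

As in the notebook, the fallback can return a non-color word, so known false positives still need to be replaced by None afterwards.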

Here too, in order to identify the appropriate code to extract the colors, it was necessary to review
the list of product names, so “list_name” is displayed in the next cell.

[59]: #list_name

The gamer option is mentioned in the features of each product, so in the next cell we simply loop
through the features list and determine whether each product is intended for gamers.

[42]: list_gamer = []  # an empty list to store the results
for e in list_feat:
    if "Gamer" in e:
        list_gamer.append("YES")
    else:
        list_gamer.append("no")

[93]: # Extracting the Bluetooth generation of each product
list_blu = []  # an empty list to store the results
# Regular expression patterns to extract the Bluetooth generation number
pattern1 = r"Bluetooth\s+(\d(\.\d)?)"
pattern2 = r"Bluetooth®\s+(\d(\.\d)?)"
for e in list_feat:
    match1 = re.search(pattern1, e)
    match2 = re.search(pattern2, e)
    if match1:
        blu_gen = match1.group(1)  # extracting the matched text with 'group()'
        list_blu.append(blu_gen)   # adding the extracted generation to the list
    elif match2:
        blu_gen = match2.group(1)
        list_blu.append(blu_gen)
    else:
        # the Bluetooth generation is not indicated in the feature
        # description, so we append None to the list
        list_blu.append(None)
# NB: the argument 1 of the 'group()' method refers to the index of the
# capturing group, which is '(\d(\.\d)?)' in both patterns.
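The two patterns can also be merged into one by making the ® symbol optional, which avoids the double search; a sketch equivalent to the pattern1/pattern2 cascade above:

```python
import re

# '®?' makes the registered-trademark symbol optional, covering both variants.
pattern = r"Bluetooth®?\s+(\d(?:\.\d)?)"

for feat in ["Connexion Bluetooth 5.0", "Bluetooth® 4.2 intégré", "Filaire"]:
    m = re.search(pattern, feat)
    print(m.group(1) if m else None)  # 5.0, 4.2, None
```

The `(?:...)` non-capturing group keeps `group(1)` pointing at the full version number, just as in the notebook's patterns.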

Also here, in order to identify the appropriate code to extract the Bluetooth generation, it was
necessary to review the list of product features, so “list_feat” is displayed in the next cell.

[ ]: #list_feat

[81]: # Extracting the spread distance (transmission range) of each product
list_sp = []
pattern1 = r"Distance de propagation:\s*([\d\.]+)\s*mètres"
pattern2 = r"Portée de transmission:\s?(\d+)m"
pattern3 = r"Distance de fonctionnement:\s?(\d+)m"
for e in list_feat:
    match1 = re.search(pattern1, e)
    match2 = re.search(pattern2, e)
    match3 = re.search(pattern3, e)
    if match1:
        spread = match1.group(1)
        list_sp.append(spread)
    elif match2:
        spread = match2.group(1)
        list_sp.append(spread)
    elif match3:
        spread = match3.group(1)
        list_sp.append(spread)
    else:
        # the spread distance is not indicated in the feature description,
        # so we append None to the list
        list_sp.append(None)
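The three-pattern cascade above can be driven by a list of patterns, returning the first match; a minimal sketch (the names `PATTERNS` and `extract_range` are ours):

```python
import re

# Same patterns as in the cell above, tried in order.
PATTERNS = [
    r"Distance de propagation:\s*([\d\.]+)\s*mètres",
    r"Portée de transmission:\s?(\d+)m",
    r"Distance de fonctionnement:\s?(\d+)m",
]

def extract_range(features):
    """Return the first range figure matched by any pattern, else None."""
    for p in PATTERNS:
        m = re.search(p, features)
        if m:
            return m.group(1)
    return None

print(extract_range("Portée de transmission: 10m"))  # 10
print(extract_range("Sans indication"))              # None
```

Adding a fourth wording of the range only requires appending one pattern to the list, instead of a fourth match variable and elif branch.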

Finally, in order to identify the appropriate code to extract the spread distance of each product, it
was necessary to review the list of product features, so “list_feat” is displayed in the next cell.

[96]: #list_feat

6 Creating Our DataFrame


[107]: # Create a dictionary whose keys are the column names and whose values
# are the corresponding lists
d = {'Reference': list_ref, 'type': list_type, 'Gamer': list_gamer,
     'Brand': list_brand, 'Bluetooth Gen': list_blu, 'Spread': list_sp,
     'Color': list_colo, 'Availablity': list_avail,
     'Price(DT)': list_price, 'features': list_feat}

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(d)

7 Exporting The DataFrame
[108]: # Save the DataFrame to an Excel file named "Scrap_Headphones.xlsx"
df.to_excel("Scrap_Headphones.xlsx", index=False)
# NB: 'index=False' tells pandas not to include the row index in the output file
