
Web Scraping For Benchmarking

March 27, 2023

Elaborated by: Nour Sfar


Project Phase: n°4 (Web Scraping and Data Cleaning)
Purpose of this phase: collecting the necessary data for the final project phase.
Chosen Product: Wireless Headphones
Scraped Website: MyteK.tn

1 Importing the necessary modules


[67]: import re                      # regular expressions
import math as mt              # mathematical functions
import pandas as pd            # data manipulation and analysis
import requests                # making HTTP requests
from bs4 import BeautifulSoup  # parsing HTML documents

Now that the necessary modules are imported, we can use them in the following cells for web
scraping, data cleaning, and analysis.

2 Creating a list of URLs to scrape


First, we create a list of URLs for the search term ‘Écouteurs Sans Fil’ by looping through page
numbers 1 to 10 and appending each URL to the list. Next, we do the same for the search term
‘Ecouteurs bluetooth’ over pages 1 to 5. Finally, we add the URLs for the ‘earbuds’ category over
pages 1 to 6.

[4]: # MYTEK
URLS = []  # an empty list to store the URLs
for i in range(1, 11):
    url1 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=%C3%89couteurs+Sans+Fil'
    URLS.append(url1)
for i in range(1, 6):
    url2 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=Ecouteurs+bluetooth'
    URLS.append(url2)
for i in range(1, 7):
    url3 = 'https://www.mytek.tn/image-son/son-numerique/earbuds.html?p=' + str(i)
    URLS.append(url3)

In the next cell, the list of URLs was displayed for testing purposes; it is left commented out here.

[5]: #URLS

Now we have a list of URLs that we can use to scrape data from the MyTek website. We’ll use the
BeautifulSoup library to extract the desired information from the HTML content of each URL.

3 Scraping phase
[31]: # Initializing empty lists to store the extracted data
list_ref = []    # product references
list_price = []  # product prices
list_name = []   # product names
list_avail = []  # product availability status
list_feat = []   # product features

# Looping through the list of URLs and extracting data from each page
for u in URLS:
    response = requests.get(u)  # sending an HTTP request to the URL
    if response.ok:  # ensuring that the link is scrapable (status code 200)
        # parsing the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.find('title')  # the page title (not used in this code)
        # finding all the product items on the page
        lists = soup.find_all('li', class_='item product product-item')
        # extracting the reference number, price, name, availability status,
        # and features of each product
        for i in lists:
            r = i.find("div", class_="skuDesktop").text
            if r not in list_ref:
                # skip duplicates: since we scrape many links, the same
                # product may appear on several pages
                p = i.find("span", class_="price").text           # price
                n = i.find('a', class_='product-item-link').text  # name
                a = i.find("div", class_="card-body").text        # availability
                f = i.find("div", class_="strigDesc").next.text   # features
                # adding the extracted data to their respective lists
                list_ref.append(r)
                list_price.append(p)
                list_name.append(n)
                list_avail.append(a)
                list_feat.append(f)

Now we have extracted the data from all the URLs in the list, and stored it in separate lists. Before
combining the data to create a DataFrame, we will first need to clean our data and extract more
detailed characteristics from the features and product name lists.

4 Data Cleaning Phase


[32]: # Filtering out non-headphone products by removing them from all the lists.
# Popping in descending order so earlier removals don't shift the later
# indexes (assuming 193, 232 and 252 index the original, unmodified lists).
for i in [252, 232, 193]:  # indexes of non-headphone products
    list_ref.pop(i)    # remove the product's reference
    list_price.pop(i)  # remove the product's price
    list_name.pop(i)   # remove the product's name
    list_avail.pop(i)  # remove the product's availability status
    list_feat.pop(i)   # remove the product's features

[33]: # Simplifying the price strings so their type can be converted from str to int/float.
# Initially the prices look like '1\xa0099,000\xa0DT' or '289,000\xa0DT'.
for i in range(len(list_price)):
    list_price[i] = list_price[i].replace('DT', '')
    list_price[i] = list_price[i].replace('\xa0', '')
    list_price[i] = list_price[i].replace(',000', '')
    list_price[i] = list_price[i].replace(',', '.')
# Now the prices look like '1099' or '289'.

[34]: # Converting the price type from str to int.
for i in range(len(list_price)):
    list_price[i] = float(list_price[i])  # convert the string to a float
    list_price[i] = round(list_price[i])  # round to the nearest integer
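The two price-cleaning cells above can be condensed into a single helper; a minimal sketch (the function name `parse_price` is ours, not part of the notebook):

```python
def parse_price(raw):
    """Convert a MyTek price string such as '1\xa0099,000\xa0DT' to an int (in DT)."""
    s = raw.replace('DT', '').replace('\xa0', '')  # drop currency and non-breaking spaces
    s = s.replace(',000', '').replace(',', '.')    # drop millimes / normalize decimals
    return round(float(s))

print(parse_price('1\xa0099,000\xa0DT'))  # 1099
print(parse_price('289,000\xa0DT'))       # 289
```

A single function like this is easier to unit-test than in-place list mutation, and it could be applied with `list_price = [parse_price(p) for p in list_price]`.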

5 Extracting more detailed characteristics from the features and product name lists
[35]: # Classifying each product as either "Casque Bluetooth" or "EarBuds"
# to differentiate between the two types of headphones.
list_type = []  # an empty list to store the results
ln = len(list_name)
for i in range(ln):
    x = list_name[i].split()
    # print(x)  (for testing)
    if x[0] in ["Casque", "Micro"]:
        list_type.append("Casque Bluetooth")
    else:
        list_type.append("EarBuds")

[36]: # Extracting the brand name of each product
list_brand = []  # an empty list to store the results
ln = len(list_name)
for i in range(ln):
    x = list_name[i].split()
    if "Hi-Fi" in x:
        j = x.index("Hi-Fi")
        list_brand.append(x[j+1])
    elif "Headset" in x:
        j = x.index("Headset")
        list_brand.append(x[j+1])
    elif "Fantasy" in x:
        list_brand.append("Fantasy")
    elif "5" in x:
        list_brand.append("5 Plus Pro")
    elif "SAMSUNG" in x:
        j = x.index("SAMSUNG")
        list_brand.append(x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "OPPO" in x:
        j = x.index("OPPO")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "APPLE" in x:
        j = x.index("APPLE")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "XIAOMI" in x:
        j = x.index("XIAOMI")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2] + " " + x[j+3])
    elif "NOKIA" in x:
        j = x.index("NOKIA")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2])
    elif "JBL" in x:
        j = x.index("JBL")
        list_brand.append(x[j] + " " + x[j+1] + " " + x[j+2])
    elif "LENOVO" in x:
        j = x.index("LENOVO")
        list_brand.append(x[j] + " " + x[j+1])
    elif "BELKIN" in x:
        j = x.index("BELKIN")
        list_brand.append(x[j] + " " + x[j+1])
    elif "SONY" in x:
        j = x.index("SONY")
        list_brand.append(x[j] + " " + x[j+1])
    elif "HAMA" in x:
        list_brand.append("HAMA")
    elif "Pliable" in x:
        j = x.index("Pliable")
        list_brand.append(x[j+1])
    elif "KSIX" in x:
        list_brand.append("KSIX")
    elif "YOOKIE" in x:
        list_brand.append("YOOKIE")
    elif "HP" in x:
        list_brand.append("HP")
    elif "PHILIPS" in x:
        list_brand.append("PHILIPS")
    elif "REDRAGON" in x:
        list_brand.append("REDRAGON")
    elif "SPIRIT" in x:
        list_brand.append("SPIRIT OF GAMER")
    elif "MI" in x:
        list_brand.append("XIAOMI")
    elif "BC" in x:
        list_brand.append("BC MASTER")
    elif "ENERGY" in x:
        list_brand.append("ENERGY SISTEM")
    elif "Sans" in x:
        j = x.index("Sans")
        list_brand.append(x[j+2])
    elif "sans" in x:
        j = x.index("sans")
        list_brand.append(x[j+2])
    elif "SANS" in x:
        j = x.index("SANS")
        list_brand.append(x[j+2])
    elif "Ultra" in x:
        j = x.index("Ultra")
        list_brand.append(x[j+2])
    elif "Casque" in x:
        j = x.index("Casque")
        list_brand.append(x[j+1])
    elif x[1] == "Bluetooth":
        list_brand.append(x[2])
    else:
        list_brand.append(x[1])
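The keyword-only cases in the long elif chain above could also be driven by a lookup table, which is easier to extend as new brands appear; a minimal sketch (the names `FIXED_BRANDS` and `brand_from_tokens` are ours, and the token-scan order differs slightly from the chain's check order):

```python
# Hypothetical lookup table for the cases where one keyword fixes the brand.
FIXED_BRANDS = {
    "HAMA": "HAMA", "KSIX": "KSIX", "YOOKIE": "YOOKIE", "HP": "HP",
    "PHILIPS": "PHILIPS", "REDRAGON": "REDRAGON",
    "SPIRIT": "SPIRIT OF GAMER", "MI": "XIAOMI",
    "BC": "BC MASTER", "ENERGY": "ENERGY SISTEM",
}

def brand_from_tokens(tokens):
    """Return the brand for a keyword-only case, or None if no keyword matches."""
    for t in tokens:
        if t in FIXED_BRANDS:
            return FIXED_BRANDS[t]
    return None

print(brand_from_tokens("Ecouteurs HAMA Freedom Light".split()))  # HAMA
```

The positional cases (e.g. SAMSUNG followed by three model tokens) would still need per-brand offsets, but those too could be stored as (offset, count) pairs in the same table.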

In the next cell, “list_name” was displayed in order to identify the appropriate code for extracting
the brand names.

[37]: #list_name

To ensure the accuracy of the list of brand names, it was necessary to verify each element one by
one and confirm that every appended brand name was correct, so “list_brand” is displayed in the
next cell.

[38]: #list_brand

[39]: # Correcting some brand names that were inaccurately appended to the list
for i, e in enumerate(list_brand):
    if e in ["Sport", "Earbuds", "Magnétiques", "avec"]:
        list_brand[i] = "sans marque"
    elif e == "Bluetooth":
        list_brand[i] = "TWS"

[57]: # Extracting the color of each product
list_colo = []  # an empty list to store the results
for e in list_name:
    x = e.split(" - ")
    if len(x) == 2:
        list_colo.append(x[1])  # the color follows the " - " separator
    else:
        z = x[0].split()
        list_colo.append(z[-1])  # fall back to the last word of the name
# Some products do not have an indicated color.
for i in range(len(list_colo)):
    if list_colo[i] in ["Lipstick", "HP-03", "mémoire", "Pro(MXY72LL-A)"]:
        list_colo[i] = None
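The color heuristic in the cell above can be wrapped in a small reusable function; a sketch (the name `extract_color` is ours):

```python
def extract_color(name):
    """Color is the part after ' - ' when present, else the last word (a heuristic)."""
    parts = name.split(" - ")
    if len(parts) == 2:
        return parts[1]
    return parts[0].split()[-1]

print(extract_color("Ecouteur JBL T110 - Bleu"))  # Bleu
print(extract_color("Casque SONY Noir"))          # Noir
```

As in the notebook, the fallback can return a non-color word, so known false positives still need to be replaced by None afterwards.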

Here too, in order to identify the appropriate code to extract the colors, it was necessary to review
the list of product names, so “list_name” is displayed in the next cell.

[59]: #list_name

The gamer option is mentioned in the features of each product, so in the next cell we simply loop
through the features list and determine whether each product is intended for gamers.

[42]: list_gamer = []  # an empty list to store the results
for e in list_feat:
    if "Gamer" in e:
        list_gamer.append("YES")
    else:
        list_gamer.append("no")

[93]: # Extracting the Bluetooth generation of each product
list_blu = []  # an empty list to store the results
# Regular expression patterns to extract the Bluetooth generation number
pattern1 = r"Bluetooth\s+(\d(\.\d)?)"
pattern2 = r"Bluetooth®\s+(\d(\.\d)?)"
for e in list_feat:
    match1 = re.search(pattern1, e)
    match2 = re.search(pattern2, e)
    if match1:
        blu_gen = match1.group(1)  # extracting the matched text with 'group()'
        list_blu.append(blu_gen)   # adding the extracted generation to the list
    elif match2:
        blu_gen = match2.group(1)
        list_blu.append(blu_gen)
    else:
        # the Bluetooth generation is not indicated in the feature
        # description, so we append None to the list
        list_blu.append(None)
# NB: the argument 1 of the 'group()' method refers to the index of the
# capturing group, which is '(\d(\.\d)?)' in both patterns.
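The two patterns can also be merged into one by making the ® symbol optional, which avoids the double search; a sketch equivalent to the pattern1/pattern2 cascade above:

```python
import re

# '®?' makes the registered-trademark symbol optional, covering both variants.
pattern = r"Bluetooth®?\s+(\d(?:\.\d)?)"

for feat in ["Connexion Bluetooth 5.0", "Bluetooth® 4.2 intégré", "Filaire"]:
    m = re.search(pattern, feat)
    print(m.group(1) if m else None)  # 5.0, 4.2, None
```

The `(?:...)` non-capturing group keeps `group(1)` pointing at the full version number, just as in the notebook's patterns.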

Also here, in order to identify the appropriate code to extract the Bluetooth generation, it was
necessary to review the list of product features, so “list_feat” is displayed in the next cell.

[ ]: #list_feat

[81]: # Extracting the spread distance (transmission range) of each product
list_sp = []
pattern1 = r"Distance de propagation:\s*([\d\.]+)\s*mètres"
pattern2 = r"Portée de transmission:\s?(\d+)m"
pattern3 = r"Distance de fonctionnement:\s?(\d+)m"
for e in list_feat:
    match1 = re.search(pattern1, e)
    match2 = re.search(pattern2, e)
    match3 = re.search(pattern3, e)
    if match1:
        spread = match1.group(1)
        list_sp.append(spread)
    elif match2:
        spread = match2.group(1)
        list_sp.append(spread)
    elif match3:
        spread = match3.group(1)
        list_sp.append(spread)
    else:
        # the spread distance is not indicated in the feature description,
        # so we append None to the list
        list_sp.append(None)
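The three-pattern cascade above can be driven by a list of patterns, returning the first match; a minimal sketch (the names `PATTERNS` and `extract_range` are ours):

```python
import re

# Same patterns as in the cell above, tried in order.
PATTERNS = [
    r"Distance de propagation:\s*([\d\.]+)\s*mètres",
    r"Portée de transmission:\s?(\d+)m",
    r"Distance de fonctionnement:\s?(\d+)m",
]

def extract_range(features):
    """Return the first range figure matched by any pattern, else None."""
    for p in PATTERNS:
        m = re.search(p, features)
        if m:
            return m.group(1)
    return None

print(extract_range("Portée de transmission: 10m"))  # 10
print(extract_range("Sans indication"))              # None
```

Adding a fourth wording of the range only requires appending one pattern to the list, instead of a fourth match variable and elif branch.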

Finally, in order to identify the appropriate code to extract the spread distance of each product, it
was necessary to review the list of product features, so “list_feat” is displayed in the next cell.

[96]: #list_feat

6 Creating Our DataFrame


[107]: # Create a dictionary whose keys are the column names and whose values
# are the corresponding lists
d = {'Reference': list_ref, 'type': list_type, 'Gamer': list_gamer,
     'Brand': list_brand, 'Bluetooth Gen': list_blu, 'Spread': list_sp,
     'Color': list_colo, 'Availablity': list_avail,
     'Price(DT)': list_price, 'features': list_feat}

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(d)

7 Exporting The DataFrame
[108]: # Save the DataFrame to an Excel file named "Scrap_Headphones.xlsx"
df.to_excel("Scrap_Headphones.xlsx", index=False)
# NB: 'index=False' tells pandas not to include the row index in the output file
