Web Scraping Project Phase 4 1679950739
Now that we have imported the necessary modules, we can start using them in our code to perform
various tasks, such as web scraping, data analysis, and more.
[4]: # MYTEK
     URLS = []  # an empty list to store the URLs
     for i in range(1, 11):
         url1 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=%C3%89couteurs+Sans+Fil'
         URLS.append(url1)
     for i in range(1, 6):
         url2 = 'https://www.mytek.tn/catalogsearch/result/index/?p=' + str(i) + '&q=Ecouteurs+bluetooth'
         URLS.append(url2)
     for i in range(1, 7):
         url3 = 'https://www.mytek.tn/image-son/son-numerique/earbuds.html?p=' + str(i)
         URLS.append(url3)
In the next cell, I displayed the list of URLs for testing purposes.
[5]: #URLS
Now we have a list of URLs that we can use to scrape data from the MyTek website. We’ll use the
BeautifulSoup library to extract the desired information from the HTML content of each URL.
3 Scraping phase
[31]: # Initializing empty lists to store the extracted data
      list_ref = []    # a list to store the product references
      list_price = []  # a list to store the product prices
      list_name = []   # a list to store the product names
      list_avail = []  # a list to store the product availability status
      list_feat = []   # a list to store the product features

      # Looping through the list of URLs and extracting data from each page
      for u in URLS:
          response = requests.get(u)  # sending an HTTP request to the URL
          if response.ok:  # ensuring that the link is scrapable (status code = 200)
              soup = BeautifulSoup(response.text, 'html.parser')
              # Looping over the product cards of the page. The two selectors just below
              # ('product-item' and 'skuDesc') are assumptions about the site's Magento-style
              # markup; the corresponding lines of the original cell are not visible here.
              for i in soup.find_all('li', class_='product-item'):
                  r = i.find('div', class_='skuDesc').text  # product reference (assumed class name)
                  if r not in list_ref:
                      # checking that the product reference is not already in the list,
                      # because the same product can appear in several of the scraped URLs
                      list_ref.append(r)
                      p = i.find('span', class_='price').text
                      # text content of the HTML element with class "price" inside the element i
                      list_price.append(p)
                      n = i.find('a', class_='product-item-link').text
                      # text content of the HTML element with class "product-item-link" inside the element i
                      list_name.append(n)
                      a = i.find('div', class_='card-body').text
                      # text content of the HTML element with class "card-body" inside the element i
                      list_avail.append(a)
                      f = i.find('div', class_='strigDesc').next.text
                      # text of the element following the one with class "strigDesc" inside the element i
                      list_feat.append(f)
Now we have extracted the data from all the URLs in the list, and stored it in separate lists. Before
combining the data to create a DataFrame, we will first need to clean our data and extract more
detailed characteristics from the features and product name lists.
list_ref.pop(i)    # Removing the reference of these products from the list of references.
list_price.pop(i)  # Removing the price of these products from the list of prices.
list_name.pop(i)   # Removing the name of these products from the list of names.
list_feat.pop(i)   # Removing the feature of these products from the list of features.
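The pop() calls above are only a fragment; they need to sit inside a loop that first decides which products to drop. A minimal sketch of what such a loop could look like, assuming products are removed when their price text is empty (the actual criterion used in the original cell is not shown here):

# Sketch only: the emptiness check is an assumed criterion, not the notebook's own.
for i in range(len(list_price) - 1, -1, -1):  # iterate backwards so pop(i) stays valid
    if list_price[i].strip() == "":
        list_ref.pop(i)     # drop the reference of this product
        list_price.pop(i)   # drop its price
        list_name.pop(i)    # drop its name
        list_feat.pop(i)    # drop its features
        list_avail.pop(i)   # keep all five lists the same length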
[33]: # Simplifying the price strings so that they can later be converted from str to int or float.
      for i in range(len(list_price)):
          list_price[i] = list_price[i].replace('DT', '')
          list_price[i] = list_price[i].replace('\xa0', '')
          list_price[i] = list_price[i].replace(',000', '')
          list_price[i] = list_price[i].replace(',', '.')
      # Now the prices look like '1099' or '289'
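Once the strings are normalized, the actual type conversion is a one-liner; a small sketch (the conversion cell itself is not reproduced here):

# Convert the cleaned price strings to floats so they can be used in numerical analysis.
list_price = [float(p) for p in list_price]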
elif "OPPO" in x:
j=x.index("OPPO")
list_brand.append(x[j]+" "+x[j+1]+" "+x[j+2]+" "+x[j+3])
elif "APPLE" in x:
j=x.index("APPLE")
list_brand.append(x[j]+" "+x[j+1]+" "+x[j+2]+" "+x[j+3])
elif "XIAOMI" in x:
j=x.index("XIAOMI")
list_brand.append(x[j]+" "+x[j+1]+" "+x[j+2]+" "+x[j+3])
elif "NOKIA" in x:
j=x.index("NOKIA")
list_brand.append(x[j]+" "+x[j+1]+" "+x[j+2])
elif "JBL" in x:
j=x.index("JBL")
list_brand.append(x[j]+" "+x[j+1]+" "+x[j+2])
elif "LENOVO" in x:
j=x.index("LENOVO")
list_brand.append(x[j]+" "+x[j+1])
elif "BELKIN" in x:
j=x.index("BELKIN")
list_brand.append(x[j]+" "+x[j+1])
elif "SONY" in x:
j=x.index("SONY")
list_brand.append(x[j]+" "+x[j+1])
elif "HAMA" in x:
list_brand.append("HAMA")
elif "Pliable" in x:
j=x.index("Pliable")
list_brand.append(x[j+1])
elif "KSIX" in x:
list_brand.append("KSIX")
elif "YOOKIE" in x:
list_brand.append("YOOKIE")
elif "HP" in x:
list_brand.append("HP")
elif "PHILIPS" in x:
list_brand.append("PHILIPS")
elif "REDRAGON" in x:
list_brand.append("REDRAGON")
elif "SPIRIT" in x:
list_brand.append("SPIRIT OF GAMER")
elif "MI" in x:
list_brand.append("XIAOMI")
elif "BC" in x:
list_brand.append("BC MASTER")
elif "ENERGY" in x:
list_brand.append("ENERGY SISTEM")
5
elif "Sans" in x:
j=x.index("Sans")
list_brand.append(x[j+2])
elif "sans" in x:
j=x.index("sans")
list_brand.append(x[j+2])
elif "SANS" in x:
j=x.index("SANS")
list_brand.append(x[j+2])
elif "Ultra" in x:
j=x.index("Ultra")
list_brand.append(x[j+2])
elif "Casque" in x:
j=x.index("Casque")
list_brand.append(x[j+1])
elif x[1]=="Bluetooth":
list_brand.append(x[2])
else:
list_brand.append(x[1])
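This branch-per-brand chain works, but it has to be extended by hand for every new brand. A more compact, purely illustrative alternative (not the cell used in this project, and simplified to return only the brand token rather than brand plus model) would keep the known brands in a list and scan each tokenized name once:

# Illustrative alternative only: look up each product name against a list of known brands.
KNOWN_BRANDS = ["OPPO", "APPLE", "XIAOMI", "NOKIA", "JBL", "LENOVO", "BELKIN", "SONY",
                "HAMA", "KSIX", "YOOKIE", "HP", "PHILIPS", "REDRAGON"]

def guess_brand(tokens):
    for b in KNOWN_BRANDS:
        if b in tokens:
            return b
    return "sans marque"  # same fallback label used later in the notebook

# e.g. guess_brand("Ecouteurs Sans Fil JBL Tune 230NC".split()) -> "JBL"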
In the next cell, I displayed "list_name" in order to identify the appropriate code for extracting the brand names.
[37]: #list_name
To ensure the accuracy of the list of brand names, it was necessary to verify each element one by one and confirm that every appended brand name was correct, so I displayed "list_brand" in the next cell.
[38]: #list_brand
[39]: # It was necessary to correct some of the brand names that were inaccurately appended to the list
      i = 0
      for e in list_brand:
          i += 1
          if e in ["Sport", "Earbuds", "Magnétiques", "avec"]:
              list_brand[i-1] = "sans marque"
          elif e == "Bluetooth":
              list_brand[i-1] = "TWS"
z = x[0].split()
list_colo.append(z[-1])

# Some products do not have an indicated color.
for i in range(len(list_colo)):
    if list_colo[i] in ["Lipstick", "HP-03", "mémoire", "Pro(MXY72LL-A)"]:
        list_colo[i] = None
Same here: in order to identify the appropriate code for extracting the colors, it was necessary to review the list of product names, so I displayed "list_name" in the next cell.
[59]: #list_name
The gamer option is mentioned in the features of each product, so in the corresponding cell I simply loop through the features list and determine whether each product is intended for gamers or not.
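That cell is not reproduced above; a minimal sketch of the idea, assuming that the words "gamer" or "gaming" in the features text are what mark a gaming product (the exact keyword check used in the original cell may differ):

# Sketch only: flag each product as a gamer product based on an assumed keyword check.
list_gamer = []
for e in list_feat:
    list_gamer.append("gamer" in e.lower() or "gaming" in e.lower())  # True / False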
pattern2 = r"Bluetooth®\s+(\d(\.\d)?)"
for e in list_feat:
    match1 = re.search(pattern1, e)
    match2 = re.search(pattern2, e)
    if match1:
        blu_gen = match1.group(1)  # extracting the matched text using the 'group()' method
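To make the pattern concrete, here is a quick illustrative check of pattern2 on a made-up feature string (not scraped data):

import re

sample = "Connectivité : Bluetooth® 5.0 - Autonomie : 6 heures"  # made-up example string
m = re.search(r"Bluetooth®\s+(\d(\.\d)?)", sample)
print(m.group(1) if m else None)  # -> 5.0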
Here too, in order to identify the appropriate code for extracting the Bluetooth generation, it was necessary to review the list of product features, so I displayed "list_feat" in the next cell.
[ ]: #list_feat
Finally, in order to identify the appropriate code for extracting the spread distance (wireless range) of each product, it was necessary to review the list of product features, so I displayed "list_feat" in the next cell.
[96]: #list_feat
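The extraction cell itself is not reproduced here; a minimal sketch of one way to pull a distance in metres out of the feature strings, where the exact wording used on the site (e.g. "10 m" or "10 mètres") is an assumption:

import re

# Sketch only: capture a value like "10 m" or "10 mètres" in each feature string.
list_dist = []
for e in list_feat:
    m = re.search(r"(\d+)\s*(?:m|mètres)\b", e)
    list_dist.append(int(m.group(1)) if m else None)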
          list_colo, 'Availablity': list_avail, 'Price(DT)': list_price, 'features': list_feat}
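Only the tail of that dictionary is visible above; a sketch of what the full DataFrame construction could look like, where 'Reference', 'Name', 'Brand' and 'Color' are assumed column names for the fields whose keys are not visible in the fragment:

import pandas as pd

# Sketch only: column names other than 'Availablity', 'Price(DT)' and 'features' are assumptions.
data = {'Reference': list_ref,
        'Name': list_name,
        'Brand': list_brand,
        'Color': list_colo,
        'Availablity': list_avail,
        'Price(DT)': list_price,
        'features': list_feat}
df = pd.DataFrame(data)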
7 Exporting the DataFrame
[108]: df.to_excel("Scrap_Headphones.xlsx", index=False)  # Save the DataFrame to an Excel file named "Scrap_Headphones.xlsx"
       # PS: The 'index=False' argument tells pandas not to include the row index in the output file
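Note that writing .xlsx files with to_excel relies on an Excel writer engine (openpyxl or XlsxWriter) being installed in the environment; this is worth checking if the export cell raises an ImportError.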