B_2 CIE Web Scraping

The document provides an overview of web scraping tools, detailing their components, processes, and coding examples for extracting data from websites, specifically eBay. It outlines the steps involved in web scraping, including identifying data, developing scripts, and storing results. Additionally, it includes Python code snippets for scraping product details and analyzing data using pandas.


Essentials of Data and Text Processing

Submitted By,
Jivani Dhairya (202203100110120),
Sanjana Kotadiya (202203100110175),
Tirthkumar Thummar (202203100110190),
Archie Koradia (202203100110197)

Guided By,
Ms. Jenisha Tailor

Uka Tarsadia University


January, 2025
Chapter 1: Web Scraping Tool Introduction

Web scraping tools are software applications designed to extract data from
websites automatically. These tools enable users to retrieve data from web
pages and save it in a usable format for analysis or other purposes.

Chapter 2: Web Scraping System & Data Gathered

The web scraping system typically consists of the following components:


1. Data Sources: Websites from which data will be scraped. These can include e-
commerce sites, social media platforms, news websites, government databases,
and more.
2. Web Scraping Tool: The software application used to automate the data
extraction process. This could be a custom script developed in-house or a third-
party tool like those mentioned in Chapter 1.
3. Data Storage: The destination where scraped data is stored. This could be a
local file, database, cloud storage service, or data warehouse.
4. Data Processing: An optional step in which the scraped data is cleaned and
transformed before analysis.
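The last two components, Data Processing and Data Storage, can be sketched with a short, hypothetical example: a few already-scraped rows are cleaned with pandas and written to a local CSV file. The file name, titles, and prices here are illustrative assumptions, not values from the report's actual dataset.

```python
import pandas as pd

# Hypothetical rows as a scraping tool might return them (raw tool output)
scraped_rows = [
    {'Title': 'iPhone 13', 'Price': '$399.99'},
    {'Title': 'iPhone 12', 'Price': '$299.00'},
    {'Title': 'iPhone SE', 'Price': 'N/A'},
]

# Data Processing: clean the price strings into numbers
df = pd.DataFrame(scraped_rows)
df['Price'] = pd.to_numeric(df['Price'].str.replace('$', '', regex=False),
                            errors='coerce')  # 'N/A' becomes NaN

# Data Storage: persist the cleaned data to a local CSV file
df.to_csv('cleaned_products.csv', index=False)
print(df)
```

The same DataFrame could just as easily be written to a database or cloud store; CSV is used here only because it needs no extra setup.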

Chapter 3: Web Scraping Steps

The web scraping process typically involves the following steps:


1. Identify Data
2. Inspect Web Page
3. Select Scraping Tool
4. Develop Scraping Script
5. Execute Scraping Script
6. Handle Errors
7. Store Scraped Data
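Steps 2 through 6 can be sketched in miniature with the snippet below, which parses a small hard-coded HTML fragment (standing in for a downloaded page, so no network request is needed) and extracts listing titles and prices with graceful handling of missing fields. The class names mirror eBay-style selectors but the fragment itself is an invented example.

```python
from bs4 import BeautifulSoup

# A hard-coded HTML fragment standing in for a fetched search-results page
html = """
<ul>
  <li class="s-item"><h3 class="s-item__title">iPhone 13</h3>
      <span class="s-item__price">$399.99</span></li>
  <li class="s-item"><h3 class="s-item__title">iPhone 12</h3>
      <span class="s-item__price">$299.00</span></li>
</ul>
"""

# Step 2: inspect/parse the page structure
soup = BeautifulSoup(html, 'html.parser')

# Steps 4-5: run the extraction; step 6: handle missing fields
results = []
for item in soup.find_all('li', class_='s-item'):
    try:
        title = item.find('h3', class_='s-item__title').text.strip()
    except AttributeError:
        title = 'N/A'
    try:
        price = item.find('span', class_='s-item__price').text.strip()
    except AttributeError:
        price = 'N/A'
    results.append({'Title': title, 'Price': price})

# Step 7: store the scraped data (here, kept as an in-memory list)
print(results)
```

On a real site, the `html` string would instead come from `requests.get(url).content`, and the selectors would be chosen by inspecting the live page.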
Chapter 4: Codes

Doing Web Scraping

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

# Function to extract product details from an individual product page
def extract_product_details(product_url):
    if product_url == 'N/A':
        return 'N/A', 'N/A', 'N/A'
    response = requests.get(product_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Try to extract specific details (these selectors will need to be
        # updated based on the actual page structure)
        try:
            # Example selector for display size
            display_size = soup.find('li', {'class': 'd-item__attr-value'}).text.strip()
        except AttributeError:
            display_size = 'N/A'

        try:
            # Example selector for battery
            battery_capacity = soup.find('li', {'class': 'd-item__attr-value'}).text.strip()
        except AttributeError:
            battery_capacity = 'N/A'

        try:
            # Example selector for status
            status = soup.find('span', class_='d-item__cond').text.strip()
        except AttributeError:
            status = 'N/A'

        return display_size, battery_capacity, status
    else:
        return 'N/A', 'N/A', 'N/A'

# Define the eBay search URL (modify the search query to suit your needs)
url = "https://www.ebay.com/sch/i.html?_nkw=iphone&_sop=12"  # Example: search for iPhones

# Step 1: Send a request to the eBay search results page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the page content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Step 3: Prepare a list to store product data
    product_data = []

    # Step 4: Find all product listings (based on the structure of the page)
    listings = soup.find_all('li', class_='s-item')  # This class might change; inspect the actual structure

    for item in listings:
        try:
            title = item.find('h3', class_='s-item__title').text.strip()  # Product title
        except AttributeError:
            title = 'N/A'

        try:
            price = item.find('span', class_='s-item__price').text.strip()  # Price
        except AttributeError:
            price = 'N/A'

        try:
            shipping = item.find('span', class_='s-item__shipping').text.strip()  # Shipping info
        except AttributeError:
            shipping = 'N/A'

        try:
            condition = item.find('span', class_='s-item__condition').text.strip()  # Product condition
        except AttributeError:
            condition = 'N/A'

        try:
            link = item.find('a', class_='s-item__link')['href']  # Product URL
        except (AttributeError, TypeError):
            link = 'N/A'

        # Fetch additional details from the product page
        display_size, battery_capacity, status = extract_product_details(link)

        # Add the extracted data to the list
        product_data.append({
            'Title': title,
            'Price': price,
            'Shipping': shipping,
            'Condition': condition,
            'Display Size': display_size,
            'Battery Capacity': battery_capacity,
            'Status': status,
            'Product URL': link
        })

    # Step 5: Save the data to an Excel file on the Desktop
    if product_data:
        # For Windows (change the path if you're on Mac/Linux)
        desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop',
                                    'ebay_iphone_data.xlsx')

        df = pd.DataFrame(product_data)
        df.to_excel(desktop_path, index=False, engine='openpyxl')
        print(f"Data saved to '{desktop_path}'")
    else:
        print("No product data found.")
else:
    print(f"Failed to retrieve the page. Status Code: {response.status_code}")

Importing Data in Python:


import pandas as pd

# Replace with your actual file path
# (if the scraped data was saved as .xlsx, use pd.read_excel instead)
df = pd.read_csv(r'C:\Users\Admin\Downloads\ebay_iphone_data.csv')

# Display all rows of the dataframe
print(df)

# Optional: Check the total number of rows and columns
print(f"Dataset shape: {df.shape}")
Finding Mean, Median, Mode:

import pandas as pd

# Replace with your actual file path
df = pd.read_csv(r'C:\Users\Admin\Downloads\ebay_iphone_data.csv')

# Filter out rows where Shipping is "Free International Shipping"
filtered_df = df[df['Shipping'] != "Free International Shipping"]

# Select the specific columns (.copy() avoids SettingWithCopyWarning)
selected_columns = filtered_df[['Price', 'Shipping']].copy()

# Convert Price and Shipping to numeric, stripping currency symbols and
# other text (e.g., "$10", "+$15.00 shipping") before computing statistics
for col in ['Price', 'Shipping']:
    selected_columns[col] = pd.to_numeric(
        selected_columns[col].astype(str).str.replace(r'[^\d.]', '', regex=True),
        errors='coerce'
    )

# Mean
mean_values = selected_columns.mean()
print("Mean:\n", mean_values)

# Median
median_values = selected_columns.median()
print("\nMedian:\n", median_values)

# Mode
mode_values = selected_columns.mode()
print("\nMode:\n", mode_values)

Chapter 5: Screenshots of Data Scraped


1) Screenshot of the scraped eBay iPhone data (image not reproduced in this text version)

Chapter 6: References
https://www.ebay.com/
