SRM IST, Kattankulathur – 603 203
S.No.  Component  Marks
1      Exercise   5
2      Viva       5
       Total      10
Team Members:
Project Description
The goal of this project is to develop a Python-based web scraper that automates the
extraction of data from websites. This scraper will be designed to handle both static and
dynamic content, using tools like BeautifulSoup, Selenium, and Playwright to retrieve and
process data efficiently. The scraper will collect valuable information from multiple target
websites, such as product listings, prices, reviews, news articles, or market trends,
depending on the organization's specific needs.
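For the static-content path, a minimal sketch using Requests and BeautifulSoup is shown below; the URL and the CSS selectors (.product, .product-name, .product-price) are placeholders, since the actual target sites and page structure depend on the organization's needs.

# Minimal static-scraping sketch using Requests and BeautifulSoup.
# The URL and CSS selectors are placeholders, not a real target site.
import requests
from bs4 import BeautifulSoup

def scrape_product_listings(url: str) -> list[dict]:
    """Fetch a page and extract product name/price pairs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    products = []
    for item in soup.select(".product"):  # assumed container class
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})
    return products

if __name__ == "__main__":
    for product in scrape_product_listings("https://example.com/products"):
        print(product)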
Business Case
The completed one-page business case is incorporated below.
Result
Thus, the project team was formed, the project was described, the business case was
prepared, and the problem statement was arrived at.
ONE PAGE BUSINESS CASE TEMPLATE
DATE
SUBMITTED BY
TITLE / ROLE
THE PROJECT
In bullet points, describe the problem this project aims to solve or the opportunity it aims to develop.
● Gathering data manually from multiple websites or sources can be tedious, repetitive, and error-prone,
especially when it involves large datasets.
● Extracted data from websites is often unstructured or inconsistent, making it difficult to analyze or use
effectively without significant data cleaning.
● Many businesses or researchers require real-time data, but manually monitoring websites for updates can be
inefficient. Automation of this process would allow for timely data collection and updates.
● Collecting large volumes of data from multiple web pages or websites requires significant resources and
effort. A scalable solution that can handle multiple sites and pages automatically is needed.
THE HISTORY
In bullet points, describe the current situation.
● Many businesses, researchers, and analysts still rely on manual methods to collect data from websites, which
is time-consuming, inefficient, and prone to human error.
● Data collected from websites is often unstructured and inconsistent, requiring significant manual effort to
clean, organize, and format it for use.
● Although some tools exist for web scraping, many still require technical expertise to set up and are not
designed to handle large-scale or dynamic content efficiently.
● Websites with content loaded via JavaScript are difficult to scrape with traditional methods (like
BeautifulSoup), requiring advanced techniques or tools like Selenium or Playwright, which are not always easy
to implement (a minimal Playwright sketch follows this list).
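A minimal sketch of that dynamic-content path, using Playwright's synchronous API; the URL and the .headline selector are placeholders for illustration only.

# Minimal dynamic-content sketch using Playwright's sync API.
# The URL and selector are placeholders for illustration only.
from playwright.sync_api import sync_playwright

def scrape_rendered_page(url: str, selector: str) -> list[str]:
    """Load a JavaScript-heavy page in a headless browser and
    return the text of all elements matching the selector."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-driven requests
        texts = page.locator(selector).all_text_contents()
        browser.close()
    return texts

if __name__ == "__main__":
    for headline in scrape_rendered_page("https://example.com/news", ".headline"):
        print(headline)

Selenium's WebDriver API offers an equivalent route; Playwright is shown here only because it bundles headless browser management.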
LIMITATIONS
List what could prevent the success of the project, such as the need for expensive equipment, bad weather,
lack of special training, etc.
● Some websites have restrictions on web scraping in their terms of service, which could result in legal
consequences or blocked access to data (see the polite-scraping sketch after this list).
● Many websites use CAPTCHAs or other anti-bot measures that could hinder or block the scraper's ability
to extract data.
● Websites may frequently change their structure or layout, which can break the scraper and require
constant maintenance to adapt to these changes.
● Websites that rely heavily on JavaScript may be difficult to scrape without additional tools like
Selenium or Playwright, which require advanced handling and could slow down the process.
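The first two risks above are commonly reduced by scraping politely: honoring robots.txt, identifying the client, and pacing requests. A minimal sketch, assuming the target site publishes a standard robots.txt (the user-agent string and URL are placeholders):

# Polite-scraping sketch: honor robots.txt, identify the client,
# and pace requests. The user-agent and URL values are placeholders.
import time
import urllib.parse
import urllib.robotparser

import requests

USER_AGENT = "SEPM-Ex1-Scraper/0.1 (student project)"

def fetch_if_allowed(url: str, delay_seconds: float = 2.0) -> str | None:
    """Fetch a URL only if robots.txt permits it, waiting between requests."""
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed by the site's crawling policy
    time.sleep(delay_seconds)  # simple rate limiting between requests
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()
    return response.text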
APPROACH
List what is needed to complete the project.
● Tools/Libraries: Python, BeautifulSoup, Selenium/Playwright, Requests, Pandas, SQLite/MySQL/MongoDB, etc.
● Environment: Virtual environment, IDE, version control (Git), and scheduling tools.
● Data Strategy: Identify target websites, define the data to scrape, and handle dynamic content.
● Data Storage: Choose CSV, JSON, or a database, and process data with Pandas (a storage and logging sketch follows this list).
● Automation: Set up scripts and automate scheduling with cron or a task scheduler.
● Testing/Debugging: Implement logging, error handling, and manual debugging.
● Deployment (optional): Host on a cloud platform or server if scaling is needed.
● Legal Considerations: Comply with website scraping policies and avoid violating terms of service.
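Tying the Data Storage, Automation, and Testing/Debugging items together, the sketch below cleans scraped records with Pandas, appends them to a CSV, and logs progress; the field names and file path are illustrative assumptions.

# Storage and logging sketch: clean scraped records with Pandas,
# append them to a CSV, and log the outcome. Paths are placeholders.
import logging
import os

import pandas as pd

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def save_records(records: list[dict], path: str = "scraped_data.csv") -> None:
    """Normalize scraped records into a DataFrame and append them to a CSV."""
    if not records:
        logging.warning("No records to save; skipping write.")
        return
    df = pd.DataFrame(records).drop_duplicates()  # basic cleaning step
    # Write the header row only when creating the file for the first time.
    df.to_csv(path, mode="a", index=False, header=not os.path.exists(path))
    logging.info("Saved %d records to %s", len(df), path)

For the Automation item, such a script can then be run on a schedule, for example with a crontab entry like 0 * * * * python3 /path/to/scraper.py for hourly runs (the script path is a placeholder).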
BENEFITS
In bullet points, list the benefits that this project will bring to the organization.
● Automation: Saves time by automating data collection, reducing manual effort.
● Efficiency: Increases productivity by speeding up the data-gathering process.
● Real-Time Access: Provides up-to-date data for timely decision-making.
● Cost Reduction: Lowers labor and third-party data costs.
● Scalability: Can handle large volumes of data and expand to new sources.