
School of Computing
SRM IST, Kattankulathur – 603 203

Course Code: 21CSC303J

Course Name: Software Engineering and Project Management


Experiment No: 1

Title of Experiment: Web Scraper Using Python

Name of the Candidate: Ansh

Team Members: Nagendra Reddy

Register Number: RA2211030010031, RA2211030010030

Date of Experiment: 21/01/2025

Mark Split Up

S.No   Description   Maximum Mark   Mark Obtained
1      Exercise      5
2      Viva          5
       Total         10

Staff Signature with date


Aim:
The aim of this project is to develop a web scraper using Python that will automatically extract and collect relevant data from a specified set of websites. The scraper will be designed to efficiently gather structured data such as product information, news articles, or social media posts, based on the user's requirements.

Team Members:

SN   Name             Reg No            Role
1    Ansh             RA2211030010031   Lead/Rep
2    Nagendra Reddy   RA2211030010030   Member

Project Title: Web Scraper Using Python

Project Description

The goal of this project is to develop a Python-based web scraper that automates the extraction of data from websites. This scraper will be designed to handle both static and dynamic content, using tools like BeautifulSoup, Selenium, and Playwright to retrieve and process data efficiently. The scraper will collect valuable information from multiple target websites, such as product listings, prices, reviews, news articles, or market trends, depending on the organization's specific needs.
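As a concrete illustration of the static-content case, the following is a minimal sketch using Requests and BeautifulSoup. The URL, CSS selectors, and field names are hypothetical placeholders, not part of the actual assignment.

# Minimal static-page scraping sketch. The target URL and the
# CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # hypothetical target page

def scrape_products(url):
    """Fetch a page and extract name/price pairs from product cards."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors

    soup = BeautifulSoup(response.text, "html.parser")
    products = []
    # Assumed markup: each product sits in <div class="product">
    for card in soup.select("div.product"):
        name = card.select_one("h2.title")
        price = card.select_one("span.price")
        if name and price:
            products.append({
                "name": name.get_text(strip=True),
                "price": price.get_text(strip=True),
            })
    return products

if __name__ == "__main__":
    for item in scrape_products(URL):
        print(item)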

Business Case
<Incorporate the Business Case template>

Result
Thus, the project team was formed, the project was described, the business case was prepared, and the problem statement was arrived at.
ONE PAGE BUSINESS CASE TEMPLATE

DATE:
SUBMITTED BY:
TITLE / ROLE:

THE PROJECT
In bullet points, describe the problem this project aims to solve or the opportunity it aims to develop.

● Gathering data manually from multiple websites or sources can be tedious, repetitive, and error-prone, especially when it involves large datasets.
● Extracted data from websites is often unstructured or inconsistent, making it difficult to analyze or use effectively without significant data cleaning.
● Many businesses or researchers require real-time data, but manually monitoring websites for updates is inefficient. Automating this process would allow for timely data collection and updates.
● Collecting large volumes of data from multiple web pages or websites requires significant resources and effort. A scalable solution that can handle multiple sites and pages automatically is needed.

THE HISTORY
In bullet points, describe the current situation.

● Many businesses, researchers, and analysts still rely on manual methods to collect data from websites, which is time-consuming, inefficient, and prone to human error.
● Data collected from websites is often unstructured and inconsistent, requiring significant manual effort to clean, organize, and format it for use.
● Although some tools exist for web scraping, many still require technical expertise to set up and are not designed to handle large-scale or dynamic content efficiently.
● Websites with content loaded via JavaScript are difficult to scrape with traditional methods (like BeautifulSoup), requiring browser-automation tools like Selenium or Playwright, which are not always easy to implement (see the sketch after this list).
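To illustrate the last point, here is a minimal Selenium sketch for a JavaScript-rendered page. The URL and CSS selector are hypothetical, and it assumes the selenium package (version 4+) and a matching Chrome driver are available.

# Sketch of scraping JavaScript-rendered content with Selenium.
# The URL and "article.story" selector are illustrative assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/news")  # hypothetical dynamic page
    # Wait until the JavaScript-rendered articles actually appear
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.story"))
    )
    for article in driver.find_elements(By.CSS_SELECTOR, "article.story"):
        print(article.text)
finally:
    driver.quit()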

LIMITATIONS
List what could prevent the success of the project, such as the need for expensive equipment, bad weather, lack of special training, etc.

● Some websites restrict web scraping in their terms of service, which could result in legal consequences or blocked access to data.
● Many websites use CAPTCHAs or other anti-bot measures that could hinder or block the scraper's ability to extract data (basic mitigations are sketched after this list).
● Websites may frequently change their structure or layout, which can break the scraper and require constant maintenance to adapt to these changes.
● Websites that rely heavily on JavaScript may be difficult to scrape without additional tools like Selenium or Playwright, which require more involved handling and could slow down the process.
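Two of these limitations, terms-of-service restrictions and anti-bot blocking, can be partially mitigated by honoring robots.txt and pacing requests. The sketch below uses the standard library's urllib.robotparser together with Requests; the base URL, user-agent string, and delay value are illustrative assumptions.

# Sketch of two mitigations: checking robots.txt before fetching,
# and rate-limiting requests to reduce the chance of being blocked.
import time
import urllib.robotparser
import requests

BASE = "https://example.com"            # hypothetical target site
USER_AGENT = "course-scraper/0.1"       # hypothetical identifier
DELAY_SECONDS = 2                       # fixed delay between requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

def polite_get(path):
    """Fetch a page only if robots.txt permits it, then pause."""
    url = f"{BASE}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        print(f"robots.txt disallows {url}; skipping")
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(DELAY_SECONDS)  # rate-limit to lighten load on the server
    return response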

APPROACH
List what is needed to complete the project.

● Tools/Libraries: Python, BeautifulSoup, Selenium/Playwright, Requests, Pandas, SQLite/MySQL/MongoDB, etc.
● Environment: Virtual environment, IDE, version control (Git), and scheduling tools.
● Data Strategy: Identify target websites, define the data to scrape, and handle dynamic content.
● Data Storage: Choose CSV, JSON, or a database, and process data with Pandas (see the sketch after this list).
● Automation: Set up scripts and automate scheduling with cron or a task scheduler.
● Testing/Debugging: Implement logging, error handling, and manual debugging.
● Deployment (optional): Host on a cloud service or server if scaling is needed.
● Legal Considerations: Comply with website scraping policies and avoid violating terms of service.
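As a sketch of the Data Storage step, the snippet below flattens scraped records into a Pandas DataFrame and writes them to CSV; the field names and file name are illustrative assumptions. The commented crontab line shows one way the Automation step could schedule such a script hourly.

# Sketch of the storage step: persist scraped records as CSV via Pandas.
import pandas as pd

# Hypothetical records as produced by a scraping function
records = [
    {"name": "Sample product", "price": "199.00", "source": "example.com"},
    {"name": "Another product", "price": "249.50", "source": "example.com"},
]

df = pd.DataFrame(records)
df.to_csv("scraped_data.csv", index=False)  # overwrite; append is also possible
print(df.head())

# Example hourly schedule via cron (Automation step), assuming the
# script lives at /path/to/scraper.py:
#   0 * * * * /usr/bin/python3 /path/to/scraper.py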

BENEFITS
In bullet points, list the benefits that this project will bring to the organization.

● Automation: Saves time by automating data collection, reducing manual effort.
● Efficiency: Increases productivity by speeding up the data-gathering process.
● Real-Time Access: Provides up-to-date data for timely decision-making.
● Cost Reduction: Lowers labor and third-party data costs.
● Scalability: Can handle large volumes of data and expand to new sources.
