DW sem
1.1.a) Define data wrangling and explain its significance in data analysis?
A) Data wrangling, also known as data munging, is the process of cleaning, structuring,
and enriching raw data into a desired format for better decision-making in data analysis.
Key Steps in Data Wrangling:
1. Data Collection – Gathering data from different sources (e.g., CSVs, APIs, databases).
2. Data Cleaning – Fixing or removing incorrect, incomplete, or inconsistent data.
3. Data Transformation – Reshaping data into a usable structure (e.g., pivoting, normalization).
4. Data Enrichment – Enhancing data with additional context or external sources.
5. Data Validation – Ensuring data quality and integrity before analysis.
Significance in Data Analysis:
• Improves Accuracy: Clean data reduces noise and errors, ensuring better model
performance.
• Saves Time: Well-prepared data streamlines analysis and visualization efforts.
• Enables Insightful Analysis: Structured data allows for deeper and more reliable insights.
• Supports Automation: Repeatable wrangling steps make workflows efficient and
reproducible.
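To make the steps above concrete, here is a minimal pandas sketch; the DataFrame contents and column names are made up for illustration, not taken from any real dataset:
import pandas as pd

# 1. Collection: in practice this might be pd.read_csv("sales.csv") or an API call
raw = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", None],
    "amount": ["100", "100", "250", "75"],
})

# 2. Cleaning: drop missing names, standardize casing, remove duplicates
clean = raw.dropna(subset=["name"]).copy()
clean["name"] = clean["name"].str.title()
clean = clean.drop_duplicates()

# 3. Transformation: convert the amount column to a numeric type
clean["amount"] = pd.to_numeric(clean["amount"])

# 4. Enrichment would join in external reference data here (omitted for brevity)

# 5. Validation: simple integrity check before analysis
assert clean["amount"].ge(0).all()
print(clean)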
1.1.b) List any five tasks involved in data wrangling and briefly describe them?
A) Here are five common tasks involved in data wrangling, along with brief descriptions:
1. Data Cleaning
o Involves handling missing values, correcting typos, removing duplicates, and fixing
inconsistencies in the data.
2. Data Transformation
o Changing the structure or format of data, such as normalizing values, converting data
types, or aggregating data to a higher level.
3. Data Merging/Joining
o Combining data from multiple sources or tables based on a common key to create a
unified dataset.
4. Data Filtering
o Selecting only the relevant rows or columns based on conditions, such as removing irrelevant
records or restricting the data to a date range.
5. Data Aggregation
o Summarizing data using functions like sum, average, count, or group by categories to
derive higher-level insights.
These tasks help turn raw, messy data into clean, structured, and analysis-ready datasets.
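The following pandas sketch illustrates three of these tasks (merging, filtering, and aggregation) on tiny made-up tables; the column names are hypothetical:
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2, 3], "amount": [50, 20, 75, 10]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "country": ["USA", "UK", "USA"]})

# Merging/joining on a common key
merged = orders.merge(customers, on="customer_id", how="left")

# Filtering rows by a condition
large = merged[merged["amount"] > 15]

# Aggregation: total amount per country
summary = large.groupby("country")["amount"].sum().reset_index()
print(summary)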
1.2.a) Explain why data wrangling is important for ensuring data quality in a machine learning
pipeline.
A) Data wrangling is crucial for ensuring data quality in a machine learning (ML) pipeline
because high-quality data is the foundation of accurate and reliable models. Here's why it's
important:
1. Improves Data Accuracy
• Raw data often contains typos, missing values, duplicates, and outliers.
• Wrangling helps clean this up, preventing the model from learning from faulty patterns.
2. Ensures Consistency
• Inconsistent formats (e.g., date formats, text casing) can confuse ML algorithms.
3. Improves Training Efficiency
• Clean, structured data leads to faster training times and better convergence during model
fitting.
In short, data wrangling ensures the data feeding into an ML pipeline is accurate, complete, and
reliable, directly influencing the performance and trustworthiness of the final model.
1.2.b) Illustrate the differences between CSV, JSON, and XML data formats with examples?
A) Here's a comparison of CSV, JSON, and XML data formats, highlighting their structure, readability,
and use cases:
• CSV (Comma-Separated Values): flat, tabular text with one record per line and a header row;
compact and easy to load into spreadsheets or pandas, but it cannot represent nested data.
• JSON (JavaScript Object Notation): key-value pairs and arrays that can be nested; human-readable
and the standard interchange format for web APIs and configuration files.
• XML (eXtensible Markup Language): tag-based markup with attributes; more verbose than JSON but
supports schemas, namespaces, and validation, and remains common in enterprise and document systems.
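As a rough illustration, the snippet below writes the same two made-up records in all three formats using only the Python standard library:
import csv, json, io
import xml.etree.ElementTree as ET

people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]

# CSV: a header line followed by one flat row per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(people)
print(buf.getvalue())

# JSON: nested key-value structure, common in web APIs
print(json.dumps(people, indent=2))

# XML: verbose, tag-based markup with attributes
root = ET.Element("people")
for p in people:
    ET.SubElement(root, "person", name=p["name"], age=str(p["age"]))
print(ET.tostring(root, encoding="unicode"))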
2.1.a) Differentiate between relational and NoSQL databases with examples?
A)
Relational Databases
• Definition: Databases that store data in structured tables with rows and columns. Each table
has a schema, and tables can be related using foreign keys.
• Uses: Best for structured data and complex queries involving relationships between entities.
Examples:
1. MySQL
2. PostgreSQL
NoSQL Databases
• Definition: Databases that store data in flexible, non-tabular formats like documents, key-
value pairs, graphs, or wide-columns.
• Uses: Best for unstructured or rapidly changing data and for workloads that need to scale out
horizontally.
Examples:
1. MongoDB (document-based)
2. Cassandra (wide-column)
Summary Comparison:
• Schema: relational databases use a fixed, predefined schema; NoSQL databases are schema-flexible.
• Data model: tables with rows and columns vs. documents, key-value pairs, graphs, or wide columns.
• Querying: SQL with joins and multi-table relationships vs. database-specific query APIs.
• Typical fit: structured, transactional data vs. unstructured or rapidly evolving data at scale.
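As a rough sketch of the two models, the snippet below uses SQLite (standard library) to stand in for a relational database and a plain Python dict to mirror a MongoDB-style document; inserting the document into MongoDB would use pymongo's collection.insert_one, which is omitted so the example runs offline:
import sqlite3, json

# Relational: fixed schema, rows and columns, SQL queries
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Alice", "Delhi"))
print(conn.execute("SELECT name, city FROM users WHERE name = ?", ("Alice",)).fetchone())

# Document (NoSQL) style: flexible, nested structure with no fixed schema;
# in MongoDB this dict would be stored with collection.insert_one(user_doc)
user_doc = {"name": "Alice", "city": "Delhi", "orders": [{"item": "book", "qty": 2}]}
print(json.dumps(user_doc, indent=2))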
2.1.b) Describe the steps to install Python packages for Excel and PDF parsing?
A) Here are the steps to install Python packages for Excel and PDF parsing:
1. Install Excel parsing libraries
You'll need pandas and openpyxl (for .xlsx files) or xlrd (for .xls files):
pip install pandas openpyxl xlrd
2. Install PDF parsing libraries
Common libraries: PyPDF2 (or its successor pypdf) and pdfplumber:
pip install PyPDF2 pdfplumber
3. Verify Installations
Run the imports below; if they complete without errors, the packages are installed correctly:
import pandas as pd
import openpyxl
import xlrd
import PyPDF2
import pdfplumber
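A quick usage check after installation might look like the sketch below; the file names report.xlsx and report.pdf are placeholders for your own files:
import pandas as pd
import pdfplumber

# Excel: pandas reads .xlsx files through openpyxl
df = pd.read_excel("report.xlsx")
print(df.head())

# PDF: extract the text of the first page
with pdfplumber.open("report.pdf") as pdf:
    print(pdf.pages[0].extract_text())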
2.2.a) Explain the differences between MySQL and PostgreSQL as relational databases?
A) Here’s a clear comparison of MySQL and PostgreSQL, two popular relational database
systems:
1. Performance
• MySQL:
o Generally faster for simple, read-heavy workloads and straightforward queries.
• PostgreSQL:
o More efficient for complex queries, large data sets, and write-heavy workloads.
2. SQL Feature Support
• MySQL:
o Covers standard SQL well, but advanced features such as CTEs and window functions only
arrived in MySQL 8.0.
• PostgreSQL:
o Supports advanced features like CTEs, window functions, and custom data types.
3. Extensibility
• MySQL:
o Limited extensibility; customization mainly comes through plugins and alternative storage
engines.
• PostgreSQL:
o Very extensible – you can add custom functions, data types, operators, and even
write stored procedures in multiple languages (e.g., Python, PL/pgSQL).
4. ACID Compliance
• MySQL:
o ACID-compliant when using the InnoDB storage engine; some older engines (e.g., MyISAM)
are not.
• PostgreSQL:
o Fully ACID-compliant by default, known for its robustness and data integrity.
5. Licensing
• MySQL:
o Dual-licensed: GPL for the community edition, with commercial licenses sold by Oracle.
• PostgreSQL:
o Uses a more permissive license (PostgreSQL License), great for commercial use.
2.2.b)Describe the challenges of parsing PDF files programmatically and how Python helps to
overcome them.
A) Parsing PDF files programmatically presents several challenges — but Python offers great
tools to help work around them. Here's a breakdown:
1. No Fixed Text Flow
o Text may not follow a logical reading order in the file's raw structure.
2. Complex Layouts
o Multi-column formats, embedded images, headers/footers, and forms complicate
text extraction.
3. Lack of Semantic Structure
o Unlike HTML or XML, most PDFs lack tags that define document structure (e.g.,
headings, tables).
4. Encoding Issues
o Custom or subsetted fonts can map characters in non-standard ways, so extracted text may
come out garbled or incomplete.
5. Embedded Images
o Data embedded as images (e.g. scanned pages) requires OCR (optical character
recognition).
Python offers powerful libraries tailored to handle different PDF extraction needs:
1. PyPDF2 / pypdf
• Best for basic text extraction, merging, and splitting of PDFs.
from pypdf import PdfReader   # or: from PyPDF2 import PdfReader

reader = PdfReader("file.pdf")
text = reader.pages[0].extract_text()
2. pdfplumber
• Best for extracting tables, columns, and line-by-line text with better layout preservation.
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    table = page.extract_table()
3. pytesseract (OCR)
• Best for scanned PDFs where the text is stored as images: convert pages to images, then run OCR.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("page.png"))
4. pdfminer.six
• Offers fine-grained control for advanced layout-aware text extraction, useful when PyPDF2
or pdfplumber fall short.
Combining Libraries
• Use pdf2image to convert pages to images → then OCR with pytesseract for scanned
content.
• Combine pdfplumber and pandas for table extraction → direct to CSV or Excel.
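A minimal sketch of the pdfplumber + pandas combination above, assuming a file named tables.pdf with a table on its first page:
import pdfplumber
import pandas as pd

with pdfplumber.open("tables.pdf") as pdf:
    # extract_table() returns a list of rows (or None if no table is detected)
    table = pdf.pages[0].extract_table()

df = pd.DataFrame(table[1:], columns=table[0])   # first row used as the header
df.to_csv("tables.csv", index=False)
print(df.head())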
Summary
PDFs are built for presentation rather than data exchange, so extraction is inherently messy; in
practice, pypdf/PyPDF2 handles simple text, pdfplumber handles tables and layout-aware extraction,
pdfminer.six gives fine-grained control, and pytesseract (with pdf2image) covers scanned documents.
3.1.a) Illustrate the process of basic data cleanup with an example?
A) Data cleanup means detecting and fixing problems such as duplicate records, missing values, and
inconsistent formatting so that every record follows the same conventions.
Example:
Raw data:
Name, Age, Country
Alice, 25, USA
bob, , us
Alice, 25, USA
After cleanup:
Name, Age, Country
Alice, 25, USA
Bob, Unknown, USA
Clean data = trustworthy results. A basic cleanup like this takes only a few lines of pandas, as
sketched below.
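A minimal pandas version of the cleanup shown above (the small DataFrame simply recreates the raw data from the example):
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Alice", "bob", "Alice"],
    "Age": [25, None, 25],
    "Country": ["USA", "us", "USA"],
})

clean = raw.drop_duplicates().copy()              # remove the repeated Alice row
clean["Name"] = clean["Name"].str.title()         # bob -> Bob
clean["Country"] = clean["Country"].replace({"us": "USA"})
clean["Age"] = clean["Age"].fillna("Unknown")     # mirror the "Unknown" placeholder above
print(clean)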
3.1.b) List and briefly describe the steps involved in normalizing and standardizing data?
A) Normalization and standardization are two key techniques used to scale numeric features before
analysis or machine learning. The main steps are:
1. Explore the data
• df.describe()
• Look for outliers, skewness, and data ranges to decide between normalization and
standardization.
2. Choose the scaling technique
Formula:
• Normalization (min-max scaling): x_scaled = (x − x_min) / (x_max − x_min), which rescales each
feature to the [0, 1] range.
• Standardization (z-score): x_scaled = (x − mean) / std, which centers each feature at 0 with unit
variance.
3. Apply the scaler (e.g., with scikit-learn)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# df holds the numeric feature columns explored in step 1
# Normalization
normalizer = MinMaxScaler()
normalized_data = normalizer.fit_transform(df)
# Standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
4. Verify the result
• pd.DataFrame(normalized_data).describe()
• pd.DataFrame(standardized_data).describe()
• Confirm that normalized values fall within [0, 1] and standardized columns have mean ≈ 0 and
standard deviation ≈ 1.
Summary Table
Step                    Description
Explore the data        Check ranges, skewness, and outliers with df.describe()
Normalize (min-max)     Rescale each feature to the [0, 1] range
Standardize (z-score)   Center each feature at mean 0 with unit variance
Verify                  Re-run describe() on the scaled data to confirm the result
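For a fully self-contained illustration, the sketch below applies both scalers to a tiny made-up DataFrame:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [22, 35, 58], "income": [20000, 45000, 90000]})

normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(normalized)      # every value now lies in [0, 1]
print(standardized)    # every column now has mean 0 and unit variance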
3.2.a) Explain the difference between finding duplicates and fuzzy matching during data
cleanup?
A)
1. Finding Duplicates
Duplicate detection looks for records that are exactly identical (or identical on selected key
columns).
How it works:
• Compares values for exact equality, typically with a function such as pandas' duplicated().
Example:
import pandas as pd
duplicates = df[df.duplicated()]
2. Fuzzy Matching
Fuzzy matching looks for values that are similar but not identical, such as typos, abbreviations, or
alternate spellings of the same entity.
How it works:
• Uses string similarity algorithms (e.g., Levenshtein distance, Jaccard, cosine similarity).
Summary
• Use duplicate detection when values are the same (e.g., same email twice).
• Use fuzzy matching when values are similar but not identical (e.g., "NYC" vs "New York
City").
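A small side-by-side sketch using only pandas and the standard library's difflib; the sample values are made up:
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"city": ["NYC", "New York City", "NYC", "Boston"]})

# Exact duplicates: only identical values are flagged
print(df[df.duplicated()])        # flags the second "NYC" row only

# Fuzzy matching: a similarity score catches near-identical strings
score = SequenceMatcher(None, "Jon Smith", "John Smith").ratio()
print(round(score, 2))            # about 0.95, close enough to treat as a likely match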
3.2.b) Illustrate the role of regular expressions (RegEx) in identifying patterns for data cleanup?
A) Regular Expressions (RegEx) are a powerful tool for identifying, validating, and
cleaning patterns in messy data during cleanup. Here's how they help, with examples:
What is RegEx?
RegEx is a sequence of characters that defines a search pattern, especially useful for validating
formats (emails, phone numbers, dates), extracting specific pieces of text, and removing or
replacing unwanted characters.
For example, a pattern like [$€₹,] removes currency symbols and commas when cleaning price fields.
Examples (see the sketch below):
1. Clean currency values – strip symbols such as $, €, ₹ and commas with re.sub().
2. Extract Emails – pull email addresses out of free text with re.findall().
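A minimal sketch of both patterns on made-up strings (the email pattern shown is a simple illustration, not a fully RFC-compliant one):
import re

# 1. Clean currency values: strip symbols and thousands separators
price = "₹1,299.00"
print(re.sub(r"[$€₹,]", "", price))       # 1299.00

# 2. Extract email addresses from free text
text = "Contact alice@example.com or bob.smith@test.org for details."
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text))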
Benefit                 Description
Precision               Matches exactly the pattern you describe, nothing more
Reusability             The same pattern works across files, columns, and tools
Automation              Repetitive cleanup (trimming, reformatting, validating) can be scripted
4.1.a) What is data exploration? Describe the common techniques used in it?
A) Data exploration is the initial step in data analysis where you examine datasets to understand
their structure, spot patterns, detect anomalies, and get a sense of potential relationships
between variables.
Common techniques include:
1. Inspecting Structure
o Example: df.info() and df.head() to check columns, data types, and sample rows.
2. Summary Statistics
o Example: df.describe() for counts, means, ranges, and quartiles of numeric columns.
3. Checking Missing Values and Anomalies
o Example: df.isnull().sum() to count missing entries per column.
4. Visual Exploration
o Example: histograms, box plots, and scatter plots to spot skewness, outliers, and relationships
between variables.
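A quick exploration sketch on a small made-up DataFrame:
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, None],
    "salary": [30000, 45000, 120000, 52000],
})

df.info()                     # structure and data types
print(df.describe())          # summary statistics
print(df.isnull().sum())      # missing values per column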
4.1.b) Name three open-source platforms used for presenting and publishing data, and explain
their basic purposes?
A)
1. Jupyter Notebook
o Use Case: Ideal for data analysis, exploratory data science, machine learning
experimentation, and presenting results in a readable format.
2. Apache Superset
o Use Case: Designed for business intelligence and interactive dashboards, Superset
connects to various databases and allows users to create rich visualizations without
coding.
3. Metabase
o Use Case: Suitable for non-technical users to generate dashboards and reports using
a simple interface, often used for internal analytics in organizations.
4.2.a) Explain how identifying correlations and outliers helps in data analysis?
A) Identifying correlations and outliers is crucial in data analysis because they reveal patterns
and issues that impact how we understand and use data:
Correlations
• What it means: Correlation shows how two variables move in relation to each other
(positive, negative, or none).
• Why it helps:
o Spot relationships: Helps identify variables that influence each other (e.g.,
advertising spend and sales).
Outliers
• What it means: Outliers are data points that significantly differ from others.
• Why it helps:
o Detect data quality issues: Could point to entry errors or sensor faults.
o Prevent skewed analysis: Outliers can distort means, variances, or model outcomes.
Example:
• A strong correlation between square footage and price can help predict value.
• An outlier (a small house with an extremely high price) might be an error or a luxury home,
worth investigating.
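The sketch below shows both ideas on made-up housing data: a correlation between square footage and price, and a simple IQR rule that flags the unusually priced small house:
import pandas as pd

df = pd.DataFrame({
    "sqft":  [800, 1200, 1500, 2000, 900],
    "price": [120000, 180000, 225000, 300000, 950000],   # last row: small house, huge price
})

# Correlation between square footage and price
print(df["sqft"].corr(df["price"]))

# Simple IQR rule to flag price outliers
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(outliers)    # the 900 sqft / 950000 row stands out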
4.2.b) Discuss the significance of charts and maps in data visualization with examples?
A) Charts and maps are essential tools in data visualization because they turn raw data into
visual stories that are easy to understand, compare, and act on. Here’s why they matter, with
examples:
Charts
1. Line Chart – shows trends over time (e.g., monthly sales or temperature readings).
2. Bar Chart – compares quantities across categories (e.g., revenue by product).
3. Pie Chart – shows parts of a whole as proportions (e.g., market share by company).
4. Scatter Plot – reveals relationships between two variables (e.g., advertising spend vs. sales).
Maps
1. Choropleth Map – shades regions by value (e.g., population density or election results by state).
2. Symbol Map – places sized or colored symbols at locations (e.g., circles sized by earthquake
magnitude).
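As a small illustration, the matplotlib sketch below draws a line chart and a bar chart from made-up monthly sales figures:
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 140, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(months, sales, marker="o")    # line chart: trend over time
ax1.set_title("Monthly sales (line)")

ax2.bar(months, sales)                 # bar chart: comparison across categories
ax2.set_title("Monthly sales (bar)")

plt.tight_layout()
plt.show()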
5.1.a) Define web scraping and list any five common tasks it can be used for?
A) Web scraping is the automated process of extracting data from websites using tools or code. It
involves fetching a webpage and then parsing its HTML to pull out specific information, such as
text, links, images, or structured data. Five common tasks it can be used for are listed below,
followed by a minimal scraping sketch.
1. Price Monitoring
o Track product prices from e-commerce sites like Amazon or eBay for comparison or
alerting systems.
2. News Aggregation
o Collect headlines, summaries, or full articles from multiple news sources to build a
custom news feed.
3. Job Listings Aggregation
o Collect job postings from multiple career sites into one combined, searchable list.
4. Market Research
o Gather user reviews, competitor data, or social media content to understand trends
and customer sentiment.
5. Research Data Collection
o Pull data from online databases, forums, or digital libraries for research, especially
when APIs are not available.
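A minimal scraping sketch with requests and BeautifulSoup; example.com is a placeholder URL and the page is assumed to contain ordinary anchor tags:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Print the text and target of every link on the page
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))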
5.1.b) Name the tools and libraries commonly used for advanced web scraping, such as
browser-based parsing and spidering?
A) For advanced web scraping, especially when dealing with browser-based parsing or spidering
(crawling websites), several powerful tools and libraries can help manage complex tasks. Here's a
list of commonly used ones:
1. Scrapy
o Purpose: A fast, open-source Python framework for building crawlers ("spiders") that extract
structured data at scale (a minimal spider sketch appears at the end of this list).
o Features: Includes tools for managing crawling, handling requests, parsing data,
storing it in various formats (JSON, CSV, databases), and even handling login
sessions.
o Use Case: Perfect for complex, multi-page scraping or web crawlers that need to
follow links recursively.
2. Selenium
o Purpose: A browser automation tool that can be used for web scraping, particularly
when you need to interact with dynamic, JavaScript-rendered content.
o Features: Automates browser actions like clicking, filling out forms, and waiting for
content to load. Works with Chrome, Firefox, and other browsers.
o Use Case: Best for scraping websites that require JavaScript rendering, login
authentication, or interaction before the data is accessible.
3. Playwright
o Purpose: A modern web automation tool similar to Selenium but more powerful and
faster.
o Features: Automates browser actions, supports multiple browsers (Chromium,
Firefox, WebKit), and allows handling complex interactions like scrolling, file uploads,
and capturing screenshots.
o Use Case: Suitable for modern, dynamic websites with heavy JavaScript content.
4. BeautifulSoup
o Purpose: A Python library for parsing HTML and XML documents, mainly used with
requests to scrape static pages.
o Features: Simple and effective for parsing raw HTML, finding tags, attributes, and
data, and extracting relevant content.
o Use Case: Good for static pages where the data is readily available in the HTML
source without JavaScript rendering.
5. Pyppeteer
o Purpose: An unofficial Python port of Puppeteer that drives a headless Chromium browser.
o Features: Similar to Playwright and Selenium, but focuses on headless (no GUI)
browsing. It allows interaction with modern web pages and dynamic content.
o Use Case: Best for scraping JavaScript-heavy websites and automating actions in the
browser like a real user.
6. Splash
o Purpose: A headless browser designed for web scraping, which can render JavaScript
content.
o Use Case: Often used with Scrapy in cases where websites heavily rely on JavaScript
to load data.
7. Requests-HTML
o Purpose: A Python library for parsing HTML and rendering JavaScript content, built
on top of PyQuery and Pyppeteer.
o Features: Easy-to-use API, can render JavaScript, and works like a lightweight
browser.
o Use Case: Useful for websites with moderate JavaScript rendering, and simpler to
use than Selenium or Playwright.
o Install: pip install requests-html
8. Octoparse (GUI-based)
o Purpose: A no-code, point-and-click web scraping tool with built-in scheduling and cloud-based
extraction.
o Use Case: Ideal for users who need an easy way to scrape data without coding but
still want advanced features.
Quick guidance on which to choose:
• Scrapy: Large-scale crawling and spidering across many pages.
• Selenium / Playwright / Pyppeteer: JavaScript-heavy or interactive sites.
• BeautifulSoup + requests: Simple static pages.
• Splash: For integrating with Scrapy when dynamic content needs rendering.
• Octoparse: Non-technical users or teams who need a GUI tool for scraping.
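As a minimal spidering sketch, the Scrapy spider below crawls quotes.toscrape.com, a public practice site; the CSS selectors are assumptions about that site's markup, so adapt them to your target pages:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link recursively (the "spidering" part)
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Saved as quotes_spider.py, this can be run without a full project using: scrapy runspider quotes_spider.py -o quotes.json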
5.2.a) Discuss the significance of analyzing a web page before initiating a scraping process?
A) Analyzing a web page before starting the scraping process is a critical first step that ensures
efficiency, accuracy, and compliance. Here's why it's so important:
1. Understand the HTML Structure
• Why it matters: Web pages are made of HTML, often with complex nested elements.
• What to look for:
o Classes and IDs that consistently contain the data you want
• Example: Knowing that product names are always in <h2 class="title"> makes your scraper
precise and reliable.
2. Identify Static vs. Dynamic Content
• Why it matters: Static content is directly in the HTML, while dynamic content loads via
JavaScript after the page loads.
• How to tell:
o View source vs. inspect element — if data isn’t in the source, it’s dynamic
• Impact:
o Static pages can be scraped with requests + BeautifulSoup, while dynamic pages usually need a
browser tool such as Selenium or Playwright.
3. Check for Pagination
• Why it matters: Important for scraping multiple pages of data (e.g., all products or articles).
• What to check:
o URL patterns with page numbers, "next" links, or infinite scrolling behavior.
4. Look for Anti-Scraping Measures
• Common signs:
o CAPTCHA challenges
o JavaScript obfuscation
• Solution:
o Add headers, delays, proxies, or use browser automation to mimic human behavior.
5. Review robots.txt and Site Policies
• Where to look:
o https://example.com/robots.txt
Example: a robots.txt entry such as "User-agent: *" followed by "Disallow: /private/" tells crawlers
not to fetch anything under /private/; the sketch below checks such rules programmatically.
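A small sketch of checking robots.txt with the standard library; example.com and the bot name are placeholders:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the given user agent may fetch the URL according to robots.txt
print(rp.can_fetch("MyScraperBot", "https://example.com/products/"))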
5.2.b) Assess the effectiveness of PySpider for large-scale web crawling compared to Scrapy?
A) When comparing PySpider and Scrapy for large-scale web crawling, both have strengths, but
they differ in architecture, scalability, and use cases. Here's a detailed assessment:
1. Architecture and Built-in UI
PySpider
• Ships with a built-in web UI for writing scripts, scheduling tasks, and monitoring crawls.
• Supports task retrying, priority queueing, and JavaScript rendering via PhantomJS (though
PhantomJS is now deprecated).
Scrapy
• No built-in web UI (you need to use tools like ScrapyRT, Frontera, or create your own
dashboard).
Verdict: PySpider has a better out-of-the-box setup for managing distributed crawlers. Scrapy
requires extra setup but is more extensible.
2. Scalability
PySpider
• Supports distributed crawling with a backend scheduler, message queue (e.g., RabbitMQ),
and worker system.
Scrapy
• Designed for high-speed crawling, but scaling to distributed workloads requires integration
with tools like Frontera, Scrapy Cluster, or custom setups with Kafka, Redis, etc.
Verdict: PySpider is easier to scale quickly, but Scrapy has higher ceiling for performance and
flexibility in enterprise-scale projects with custom engineering.
3. Performance
PySpider
• Easier to set up for small to medium jobs, but might be less performant under extremely
high load without customization.
Scrapy
• Async I/O support (Twisted-based), fast when handling thousands of requests per second on
optimized settings.
Verdict: Scrapy wins on raw performance for massive crawls, especially when JS rendering
isn't required.
4. JavaScript Support
PySpider
• Renders JavaScript via PhantomJS out of the box, but PhantomJS is deprecated and no longer
maintained.
Scrapy
• No native JS rendering, but works well with Splash or Selenium for headless browsing.
Verdict: Both need extensions, but Scrapy has more maintained and robust integrations.
5. Community and Maintenance
PySpider
• Fewer third-party plugins and updates (somewhat stale).
• Built-in UI is a plus.
Scrapy
• Actively maintained with a large community and ecosystem (Scrapy Cloud, Crawlera, etc.).
Verdict: Scrapy is more modern, with broader community and better long-term support.
6. Ease of Use
• PySpider: Easier to use for quick jobs with visual control and simple scripting.
• Scrapy: Steeper learning curve, but more powerful for custom logic and complex crawlers.
Final Recommendation
Conclusion: Use PySpider for quick setup of distributed crawls with a UI, especially in research or
prototyping. Choose Scrapy for production-level, scalable, and customizable crawling pipelines.