Lecture Week 6 - Data Scraping and Data Wrangling

Data wrangling, also known as data cleaning, involves transforming raw data into usable formats through various processes such as merging, identifying gaps, and removing outliers. The steps of data wrangling include discovery, structuring, cleaning, enriching, validating, and publishing data, each tailored to the specific project needs. While data cleaning is a critical part of the wrangling process, they are distinct, with wrangling encompassing the overall transformation of data.


LO2: Data Scraping and Data Wrangling
Python - Week 6
Data Wrangling

• Data wrangling (also called data cleaning or data remediation) refers to a variety of processes designed to transform raw data into more readily used formats. The exact methods differ from project to project depending on the data you're leveraging and the goal you're trying to achieve.
• The most commonly used data wrangling processes include (see the short pandas sketch after this list):
  • Merging multiple data sources into a single dataset for analysis
  • Identifying gaps in data (for example, empty cells in a spreadsheet) and either filling or deleting them
  • Deleting data that's either unnecessary or irrelevant to the project you're working on
  • Identifying extreme outliers in data and either explaining the discrepancies or removing them so that analysis can take place
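
A minimal pandas sketch of these four operations. The two small tables, the shared "id" key, and the 3-standard-deviation outlier rule are illustrative assumptions, not part of the lecture:

import pandas as pd

# Two illustrative data sources sharing an "id" key (made-up data).
sales = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10.0, None, 12.5, 900.0]})
regions = pd.DataFrame({"id": [1, 2, 3, 4], "region": ["N", "S", "S", "N"]})

# Merging multiple sources into a single dataset.
df = sales.merge(regions, on="id", how="left")

# Identifying gaps (missing values) and either filling or deleting them.
print(df.isnull().sum())                          # count gaps per column
df["value"] = df["value"].fillna(df["value"].mean())

# Deleting data that's unnecessary or irrelevant to the project.
df = df.drop(columns=["region"])

# Identifying extreme outliers (here: beyond 3 standard deviations) and removing them.
z = (df["value"] - df["value"].mean()) / df["value"].std()
df = df[z.abs() <= 3]
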
Data Wrangling Steps

• Each data project requires a unique approach to ensure its final dataset is reliable and accessible. That being said, several processes typically inform the approach. These are commonly referred to as data wrangling steps or activities:

Source: Data Wrangling: What It Is & Why It's Important (hbs.edu)


1. Discovery: the process of familiarizing yourself with the data so you can conceptualize how you might use it. During discovery, you may identify trends or patterns in the data, along with obvious issues, such as missing or incomplete values that need to be addressed. This is an important step, as it will inform every activity that comes afterward.
   • Useful pandas calls: df.head(), df.columns, df.tail(), df.info(), df.shape, df.isnull()
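
A quick discovery pass using the calls above. In practice the data would come from a file (e.g. pd.read_csv); here a small made-up DataFrame stands in so the sketch runs on its own:

import pandas as pd

# Made-up data standing in for a real file loaded with pd.read_csv(...).
df = pd.DataFrame({
    "year":  [2000, 2001, 2002, 2003],
    "co2":   [365.0, None, 369.4, 371.1],
    "notes": ["ok", "ok", None, "check"],
})

print(df.head())          # first rows
print(df.tail())          # last rows
print(df.columns)         # column names
print(df.shape)           # (rows, columns)
df.info()                 # dtypes and non-null counts
print(df.isnull())        # True wherever a value is missing
print(df.isnull().sum())  # missing values per column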

2. Structuring: raw data is typically unusable in its original state because it's either incomplete or misformatted for its intended application. Data structuring is the process of taking raw data and transforming it so it can be more readily leveraged. The form your data takes will depend on the analytical model you use to interpret it.
   • Quantile-based binning (numeric to categorical): pd.qcut(df['points'], q=[0, 0.16, 0.84, 0.9, 1])
   • Encoding (categorical to numeric): scikit-learn's OneHotEncoder
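
A small sketch of both transformations. The quantile cut points follow the slide, but the 'points' and 'category' columns and their values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one numeric and one categorical column.
df = pd.DataFrame({
    "points": [3, 15, 42, 57, 68, 74, 81, 88, 92, 99],
    "category": ["red", "blue", "red", "green", "blue",
                 "green", "red", "blue", "green", "red"],
})

# Quantile-based binning: numeric -> categorical intervals.
df["points_bin"] = pd.qcut(df["points"], q=[0, 0.16, 0.84, 0.9, 1])

# One-hot encoding: categorical -> numeric indicator columns.
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["category"]])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["category"]))

df = pd.concat([df, encoded_df], axis=1)
print(df.head())
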
3. Cleaning: data cleaning is the process of removing inherent errors in data that might distort your analysis or render it less valuable. Cleaning can come in different forms, including deleting empty cells or rows, removing outliers, and standardizing inputs. The goal of data cleaning is to ensure there are no errors (or as few as possible) that could influence your final analysis. Identifying and removing any bad data greatly benefits the rest of the wrangling process.
   • Useful pandas calls: df.drop_duplicates(inplace=True), df.dropna(inplace=True), df2['co2'].fillna(ave_co2, inplace=True), df2['co2'].interpolate()
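
A minimal cleaning sketch built around the calls listed above, assuming a hypothetical df2 with a numeric 'co2' column that contains duplicates and gaps (the data itself is made up):

import numpy as np
import pandas as pd

# Hypothetical data with a duplicate row and missing values.
df2 = pd.DataFrame({
    "year": [2000, 2001, 2001, 2002, 2003, 2004],
    "co2":  [365.0, 367.2, 367.2, np.nan, 371.1, np.nan],
})

# Remove exact duplicate rows.
df2.drop_duplicates(inplace=True)

# Option 1: drop rows that still contain missing values.
# df2.dropna(inplace=True)

# Option 2: fill gaps with the column mean...
ave_co2 = df2["co2"].mean()
df2["co2"] = df2["co2"].fillna(ave_co2)

# ...or interpolate between neighbouring observations instead.
# df2["co2"] = df2["co2"].interpolate()

print(df2)
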
4. Enriching: once you understand your existing data and have transformed it into a more usable state, you must determine whether you have all of the data necessary for the project at hand. If not, you may choose to enrich or augment your data by incorporating values from other datasets. For this reason, it's important to understand what other data is available for use. Of course, if you decide that enrichment is necessary, you need to repeat the steps above for any new data.
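
Enrichment often amounts to joining in columns from another source. A minimal sketch, assuming a hypothetical lookup table keyed on a shared 'country' column (all names and values are illustrative):

import pandas as pd

# Existing, already-cleaned data (hypothetical).
df = pd.DataFrame({"country": ["JO", "DE", "JP"], "co2": [25.1, 675.0, 1065.0]})

# External dataset used for enrichment (hypothetical population figures).
population = pd.DataFrame({"country": ["JO", "DE", "JP"],
                           "population_m": [11.3, 83.2, 125.7]})

# Augment the original data with the new values via a left join.
df = df.merge(population, on="country", how="left")

# A derived column that only becomes possible after enrichment.
df["co2_per_capita"] = df["co2"] / df["population_m"]
print(df)
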
5. Validating: data validation refers to the process of verifying that your data is both consistent and of a high enough quality. During validation, you may discover issues you need to resolve or conclude that your data is ready to be analyzed. Validation is typically achieved through various automated processes and requires programming (a sketch of such automated checks follows this list). Consistent means the data is represented in a standard way throughout the dataset. For the data to be of high quality it should be:
   • Complete: the dataset contains all required values and fields; nothing important is missing.
   • Unique: the data contains no duplicates or redundant records.
   • Valid: the data conforms to the syntax and structure defined by the business requirements.
   • Timely: the data is sufficiently up to date for its intended use.
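
One simple way to automate such checks is with assertions. The column names ('year', 'co2') and the allowed ranges below are assumptions made for illustration:

import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Raise an AssertionError if the dataset fails a basic quality check."""
    # Complete: no required value is missing.
    assert df[["year", "co2"]].notnull().all().all(), "missing values found"

    # Unique: no duplicated records.
    assert not df.duplicated().any(), "duplicate rows found"

    # Valid: values conform to the expected type and range.
    assert pd.api.types.is_numeric_dtype(df["co2"]), "co2 must be numeric"
    assert df["co2"].between(0, 50_000).all(), "co2 outside plausible range"

    # Timely: data is recent enough for the intended use.
    assert df["year"].max() >= 2000, "data is too old"

# Example usage with made-up data.
df = pd.DataFrame({"year": [2001, 2002, 2003], "co2": [367.2, 369.4, 371.1]})
validate(df)
print("all validation checks passed")
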
6. Publishing: once your data has been validated, you can publish it. This involves making it available to others within your organization for analysis. The format you use to share the information, such as a written report or electronic file, will depend on your data and the organization's goals.
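
If the published artifact is an electronic file, the export itself is a one-liner in pandas. The file names and formats below are illustrative choices, not part of the lecture:

import pandas as pd

df = pd.DataFrame({"year": [2001, 2002, 2003], "co2": [367.2, 369.4, 371.1]})

# Publish the wrangled dataset as a file other teams can consume.
df.to_csv("co2_clean.csv", index=False)           # plain-text, tool-agnostic
# df.to_excel("co2_clean.xlsx", index=False)      # needs the openpyxl package
# df.to_parquet("co2_clean.parquet")              # compact and typed; needs pyarrow
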
Data Wrangling vs. Data Cleaning

• Despite the terms being used interchangeably, data wrangling and data cleaning are two different processes. It's important to make the distinction that data cleaning is a critical step within the data wrangling process, used to remove inaccurate and inconsistent data. Data wrangling, meanwhile, is the overall process of transforming raw data into a more usable form.
Is low or high variability better?

• The choice between low and high variability in data for data analytics hinges on the precise objectives and contextual factors guiding the analysis. Data analytics requires a careful balance between achieving precision and covering the entire spectrum of data variability.
• In circumstances characterized by low data variability, a notable advantage emerges: a smaller dataset is needed to achieve a given level of precision compared to situations with higher variability (see the sketch after this list). However, if the primary aim is to comprehensively cover a broad range of scenarios, embracing high variability is imperative, albeit at the cost of requiring a larger dataset.
• Generally speaking, it is best to consider the specific task and situation in order to determine which variability level is best suited.
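
The precision claim can be made concrete: the standard error of a sample mean is sigma / sqrt(n), so the sample size needed for a target precision grows with the variance. A small numerical sketch; the target precision and the two standard deviations are arbitrary illustrative values:

import math

def n_required(sigma: float, target_se: float) -> int:
    """Sample size needed so that the standard error sigma/sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)

target_se = 0.5            # desired precision of the estimated mean (illustrative)
low_variability = 2.0      # standard deviation of a low-variability measurement
high_variability = 10.0    # standard deviation of a high-variability measurement

print(n_required(low_variability, target_se))   # 16 observations suffice
print(n_required(high_variability, target_se))  # 400 observations needed
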
Imputation

Mean imputation
• Simply calculate the mean of the observed values for that variable across all individuals who are not missing, and use it to fill the gaps.
• It has the advantage of keeping the same mean and the same sample size, but many, many disadvantages. Pretty much every method listed below is better than mean imputation.
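
A minimal mean-imputation sketch with pandas; the 'age' column and its values are hypothetical (scikit-learn's SimpleImputer does the same thing for whole tables):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23.0, np.nan, 31.0, 27.0, np.nan]})

# Mean of the observed (non-missing) values only; NaNs are skipped by default.
mean_age = df["age"].mean()

# Fill every gap with that single value.
df["age_imputed"] = df["age"].fillna(mean_age)
print(df)
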
Substitution
• Impute the value from a new individual who was not selected to be in the sample. In other words, go find a new subject and use their value instead.

Hot deck imputation
• A randomly chosen value from an individual in the sample who has similar values on other variables. In other words, find all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable (see the sketch below).
• One advantage is that you are constrained to only possible values. In other words, if age in your study is restricted to being between 5 and 10, you will always get a value between 5 and 10 this way.
• Another advantage is the random component, which adds in some variability. This is important for accurate standard errors.
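
A simple hot-deck sketch: donors are matched on an observed variable and one donor's value is drawn at random for each missing case. Using a single matching variable ('grade') and these column names is an illustrative simplification:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical sample: "grade" is fully observed, "age" has gaps.
df = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B"],
    "age":   [6.0, 7.0, np.nan, 9.0, np.nan, 10.0],
})

def hot_deck(row):
    if pd.notnull(row["age"]):
        return row["age"]
    # Donors: subjects similar on other variables (here: the same grade).
    donors = df.loc[(df["grade"] == row["grade"]) & df["age"].notnull(), "age"]
    # Randomly choose one donor's observed value.
    return rng.choice(donors.to_numpy())

df["age_imputed"] = df.apply(hot_deck, axis=1)
print(df)
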
Cold deck imputation
• A systematically chosen value from an individual who has similar values on other variables.
• This is similar to hot deck imputation in most ways, but removes the random variation. For example, you may always choose the third individual in the same experimental condition and block.

Regression imputation
• The predicted value obtained by regressing the missing variable on other variables. So instead of just taking the mean, you're taking the predicted value, based on other variables. This preserves relationships among the variables involved in the imputation model (a small sketch follows).
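
A minimal regression-imputation sketch using scikit-learn's LinearRegression: the model is fit on complete cases and its predictions fill the gaps. The 'height' and 'weight' columns and their values are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "weight" is missing for some subjects.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 180, 185],
    "weight": [50.0, 58.0, np.nan, 66.0, np.nan, 80.0],
})

observed = df["weight"].notnull()

# Regress the missing variable on the other variable(s) using complete cases.
model = LinearRegression()
model.fit(df.loc[observed, ["height"]], df.loc[observed, "weight"])

# Replace each gap with the value predicted from the other variables.
df.loc[~observed, "weight"] = model.predict(df.loc[~observed, ["height"]])
print(df)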
