Lecture Week 6 - Data Scraping and Data Wrangling

Data wrangling, also known as data cleaning, involves transforming raw data into usable formats through various processes such as merging, identifying gaps, and removing outliers. The steps of data wrangling include discovery, structuring, cleaning, enriching, validating, and publishing data, each tailored to the specific project needs. While data cleaning is a critical part of the wrangling process, they are distinct, with wrangling encompassing the overall transformation of data.


LO2: Data Scraping and Data Wrangling
Python - Week 6
Data Wrangling

• Data wrangling (also called data cleaning or data remediation) refers to a variety of processes designed to transform raw data into more readily used formats. The exact methods differ from project to project depending on the data you're leveraging and the goal you're trying to achieve.
• The most commonly used data wrangling processes include (see the short pandas sketch after this list):
  • Merging multiple data sources into a single dataset for analysis
  • Identifying gaps in data (for example, empty cells in a spreadsheet) and either filling or deleting them
  • Deleting data that's either unnecessary or irrelevant to the project you're working on
  • Identifying extreme outliers in data and either explaining the discrepancies or removing them so that analysis can take place
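
A minimal pandas sketch of these four operations. The two small tables, the shared "id" key, and the 3-standard-deviation outlier rule are illustrative assumptions, not part of the lecture:

import pandas as pd

# Two illustrative data sources sharing an "id" key (made-up data).
sales = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10.0, None, 12.5, 900.0]})
regions = pd.DataFrame({"id": [1, 2, 3, 4], "region": ["N", "S", "S", "N"]})

# Merging multiple sources into a single dataset.
df = sales.merge(regions, on="id", how="left")

# Identifying gaps (missing values) and either filling or deleting them.
print(df.isnull().sum())                          # count gaps per column
df["value"] = df["value"].fillna(df["value"].mean())

# Deleting data that's unnecessary or irrelevant to the project.
df = df.drop(columns=["region"])

# Identifying extreme outliers (here: beyond 3 standard deviations) and removing them.
z = (df["value"] - df["value"].mean()) / df["value"].std()
df = df[z.abs() <= 3]
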
Data Wrangling Steps

• Each data project requires a unique approach to ensure its final dataset is reliable and accessible. That being said, several processes typically inform the approach. These are commonly referred to as data wrangling steps or activities:

Source: Data Wrangling: What It Is & Why It's Important (hbs.edu)


1. Discovery: the process of familiarizing yourself with the data so you can conceptualize how you might use it. During discovery, you may identify trends or patterns in the data, along with obvious issues, such as missing or incomplete values that need to be addressed. This is an important step, as it will inform every activity that comes afterward.
   • Useful pandas calls: df.head(), df.columns, df.tail(), df.info(), df.shape, df.isnull()
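
A quick discovery pass using the calls above. In practice the data would come from a file (e.g. pd.read_csv); here a small made-up DataFrame stands in so the sketch runs on its own:

import pandas as pd

# Made-up data standing in for a real file loaded with pd.read_csv(...).
df = pd.DataFrame({
    "year":  [2000, 2001, 2002, 2003],
    "co2":   [365.0, None, 369.4, 371.1],
    "notes": ["ok", "ok", None, "check"],
})

print(df.head())          # first rows
print(df.tail())          # last rows
print(df.columns)         # column names
print(df.shape)           # (rows, columns)
df.info()                 # dtypes and non-null counts
print(df.isnull())        # True wherever a value is missing
print(df.isnull().sum())  # missing values per column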

2. Structuring: raw data is typically unusable in its original state because it's either incomplete or misformatted for its intended application. Data structuring is the process of taking raw data and transforming it so it can be more readily leveraged. The form your data takes will depend on the analytical model you use to interpret it.
   • Quantile-based binning (numeric to categorical): pd.qcut(df['points'], q=[0, 0.16, 0.84, 0.9, 1])
   • Encoding (categorical to numeric): scikit-learn's OneHotEncoder
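
A small sketch of both transformations. The quantile cut points follow the slide, but the 'points' and 'category' columns and their values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one numeric and one categorical column.
df = pd.DataFrame({
    "points": [3, 15, 42, 57, 68, 74, 81, 88, 92, 99],
    "category": ["red", "blue", "red", "green", "blue",
                 "green", "red", "blue", "green", "red"],
})

# Quantile-based binning: numeric -> categorical intervals.
df["points_bin"] = pd.qcut(df["points"], q=[0, 0.16, 0.84, 0.9, 1])

# One-hot encoding: categorical -> numeric indicator columns.
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["category"]])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["category"]))

df = pd.concat([df, encoded_df], axis=1)
print(df.head())
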
3. Cleaning: data cleaning is the process of removing inherent errors in data that might distort your analysis or render it less valuable. Cleaning can come in different forms, including deleting empty cells or rows, removing outliers, and standardizing inputs. The goal of data cleaning is to ensure there are no errors (or as few as possible) that could influence your final analysis. Identifying and removing any bad data greatly benefits the rest of the wrangling process.
   • Useful pandas calls: df.drop_duplicates(inplace=True), df.dropna(inplace=True), df2['co2'].fillna(ave_co2, inplace=True), df2['co2'].interpolate()
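
A minimal cleaning sketch built around the calls listed above, assuming a hypothetical df2 with a numeric 'co2' column that contains duplicates and gaps (the data itself is made up):

import numpy as np
import pandas as pd

# Hypothetical data with a duplicate row and missing values.
df2 = pd.DataFrame({
    "year": [2000, 2001, 2001, 2002, 2003, 2004],
    "co2":  [365.0, 367.2, 367.2, np.nan, 371.1, np.nan],
})

# Remove exact duplicate rows.
df2.drop_duplicates(inplace=True)

# Option 1: drop rows that still contain missing values.
# df2.dropna(inplace=True)

# Option 2: fill gaps with the column mean...
ave_co2 = df2["co2"].mean()
df2["co2"] = df2["co2"].fillna(ave_co2)

# ...or interpolate between neighbouring observations instead.
# df2["co2"] = df2["co2"].interpolate()

print(df2)
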
4. Enriching: once you understand your existing data and have transformed it into a more usable state, you must determine whether you have all of the data necessary for the project at hand. If not, you may choose to enrich or augment your data by incorporating values from other datasets. For this reason, it's important to understand what other data is available for use. Of course, if you decide that enrichment is necessary, you need to repeat the steps above for any new data.
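
Enrichment often amounts to joining in columns from another source. A minimal sketch, assuming a hypothetical lookup table keyed on a shared 'country' column (all names and values are illustrative):

import pandas as pd

# Existing, already-cleaned data (hypothetical).
df = pd.DataFrame({"country": ["JO", "DE", "JP"], "co2": [25.1, 675.0, 1065.0]})

# External dataset used for enrichment (hypothetical population figures).
population = pd.DataFrame({"country": ["JO", "DE", "JP"],
                           "population_m": [11.3, 83.2, 125.7]})

# Augment the original data with the new values via a left join.
df = df.merge(population, on="country", how="left")

# A derived column that only becomes possible after enrichment.
df["co2_per_capita"] = df["co2"] / df["population_m"]
print(df)
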
5. Validating: data validation refers to the process of verifying that your data is both consistent and of a high enough quality. During validation, you may discover issues you need to resolve or conclude that your data is ready to be analyzed. Validation is typically achieved through various automated processes and requires programming (a sketch of such automated checks follows this list). Consistent means the data is represented in a standard way throughout the dataset. For the data to be of high quality it should be:
   • Complete: the dataset contains all required values and fields; nothing important is missing.
   • Unique: the data contains no duplicates or redundant records.
   • Valid: the data conforms to the syntax and structure defined by the business requirements.
   • Timely: the data is sufficiently up to date for its intended use.
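
One simple way to automate such checks is with assertions. The column names ('year', 'co2') and the allowed ranges below are assumptions made for illustration:

import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Raise an AssertionError if the dataset fails a basic quality check."""
    # Complete: no required value is missing.
    assert df[["year", "co2"]].notnull().all().all(), "missing values found"

    # Unique: no duplicated records.
    assert not df.duplicated().any(), "duplicate rows found"

    # Valid: values conform to the expected type and range.
    assert pd.api.types.is_numeric_dtype(df["co2"]), "co2 must be numeric"
    assert df["co2"].between(0, 50_000).all(), "co2 outside plausible range"

    # Timely: data is recent enough for the intended use.
    assert df["year"].max() >= 2000, "data is too old"

# Example usage with made-up data.
df = pd.DataFrame({"year": [2001, 2002, 2003], "co2": [367.2, 369.4, 371.1]})
validate(df)
print("all validation checks passed")
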
6. Publishing: once your data has been validated, you can publish it. This involves making it available to others within your organization for analysis. The format you use to share the information, such as a written report or electronic file, will depend on your data and the organization's goals.
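
If the published artifact is an electronic file, the export itself is a one-liner in pandas. The file names and formats below are illustrative choices, not part of the lecture:

import pandas as pd

df = pd.DataFrame({"year": [2001, 2002, 2003], "co2": [367.2, 369.4, 371.1]})

# Publish the wrangled dataset as a file other teams can consume.
df.to_csv("co2_clean.csv", index=False)           # plain-text, tool-agnostic
# df.to_excel("co2_clean.xlsx", index=False)      # needs the openpyxl package
# df.to_parquet("co2_clean.parquet")              # compact and typed; needs pyarrow
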
Data Wrangling vs. Data Cleaning

• Despite the terms being used interchangeably, data wrangling and data cleaning are two different processes. It's important to make the distinction that data cleaning is a critical step within the data wrangling process, used to remove inaccurate and inconsistent data. Data wrangling, meanwhile, is the overall process of transforming raw data into a more usable form.
Is low or high variability better?

• The choice between low and high variability in data for data analytics hinges on the precise objectives and contextual factors guiding the analysis. Data analytics requires a careful balance between achieving precision and covering the entire spectrum of data variability.
• In circumstances characterized by low data variability, a notable advantage emerges: a smaller dataset is needed to achieve a given level of precision compared to situations with higher variability (see the sketch after this list). However, if the primary aim is to comprehensively cover a broad range of scenarios, embracing high variability is imperative, albeit at the cost of requiring a larger dataset.
• Generally speaking, it is best to consider the specific task and situation in order to determine which variability level is best suited.
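
The precision claim can be made concrete: the standard error of a sample mean is sigma / sqrt(n), so the sample size needed for a target precision grows with the variance. A small numerical sketch; the target precision and the two standard deviations are arbitrary illustrative values:

import math

def n_required(sigma: float, target_se: float) -> int:
    """Sample size needed so that the standard error sigma/sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)

target_se = 0.5            # desired precision of the estimated mean (illustrative)
low_variability = 2.0      # standard deviation of a low-variability measurement
high_variability = 10.0    # standard deviation of a high-variability measurement

print(n_required(low_variability, target_se))   # 16 observations suffice
print(n_required(high_variability, target_se))  # 400 observations needed
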
Imputation

Mean imputation
• Simply calculate the mean of the observed values for that variable across all individuals who are not missing, and use it to fill the gaps.
• It has the advantage of keeping the same mean and the same sample size, but many, many disadvantages. Pretty much every method listed below is better than mean imputation.
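
A minimal mean-imputation sketch with pandas; the 'age' column and its values are hypothetical (scikit-learn's SimpleImputer does the same thing for whole tables):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23.0, np.nan, 31.0, 27.0, np.nan]})

# Mean of the observed (non-missing) values only; NaNs are skipped by default.
mean_age = df["age"].mean()

# Fill every gap with that single value.
df["age_imputed"] = df["age"].fillna(mean_age)
print(df)
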
Substitution
• Impute the value from a new individual who was not selected to be in the sample. In other words, go find a new subject and use their value instead.

Hot deck imputation
• A randomly chosen value from an individual in the sample who has similar values on other variables. In other words, find all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable (see the sketch below).
• One advantage is that you are constrained to only possible values. In other words, if age in your study is restricted to being between 5 and 10, you will always get a value between 5 and 10 this way.
• Another advantage is the random component, which adds in some variability. This is important for accurate standard errors.
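
A simple hot-deck sketch: donors are matched on an observed variable and one donor's value is drawn at random for each missing case. Using a single matching variable ('grade') and these column names is an illustrative simplification:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical sample: "grade" is fully observed, "age" has gaps.
df = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B"],
    "age":   [6.0, 7.0, np.nan, 9.0, np.nan, 10.0],
})

def hot_deck(row):
    if pd.notnull(row["age"]):
        return row["age"]
    # Donors: subjects similar on other variables (here: the same grade).
    donors = df.loc[(df["grade"] == row["grade"]) & df["age"].notnull(), "age"]
    # Randomly choose one donor's observed value.
    return rng.choice(donors.to_numpy())

df["age_imputed"] = df.apply(hot_deck, axis=1)
print(df)
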
Cold deck imputation
• A systematically chosen value from an individual who has similar values on other variables.
• This is similar to hot deck imputation in most ways, but removes the random variation. For example, you may always choose the third individual in the same experimental condition and block.

Regression imputation
• The predicted value obtained by regressing the missing variable on other variables. So instead of just taking the mean, you're taking the predicted value, based on other variables. This preserves relationships among the variables involved in the imputation model (a small sketch follows).
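
A minimal regression-imputation sketch using scikit-learn's LinearRegression: the model is fit on complete cases and its predictions fill the gaps. The 'height' and 'weight' columns and their values are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "weight" is missing for some subjects.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 180, 185],
    "weight": [50.0, 58.0, np.nan, 66.0, np.nan, 80.0],
})

observed = df["weight"].notnull()

# Regress the missing variable on the other variable(s) using complete cases.
model = LinearRegression()
model.fit(df.loc[observed, ["height"]], df.loc[observed, "weight"])

# Replace each gap with the value predicted from the other variables.
df.loc[~observed, "weight"] = model.predict(df.loc[~observed, ["height"]])
print(df)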
