Module 2 Data Science New
DATA SCIENCE PROCESS
Step 1: Setting Research Goal
A project starts by understanding the questions.
Data.gov                        The home of the US Government’s open data
https://open-data.europa.eu/    The home of the European Commission’s open data
Freebase.org                    An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org              Open data initiative from the World Bank
Aiddata.org                     Open data for international development
Open.fda.gov                    Open data from the US Food and Drug Administration
Step 3: Data Preparation
Data Cleansing
Removal of interpretation errors
Removal of consistency errors
Combining Data
Joining Tables
Appending Tables
Using Views to simulate Joins and Appends
Transforming Data
Transforming Variables
Reducing the number of variables
Transforming variables into dummies
Data Cleansing
Data cleansing is a subprocess of the data
science process that focuses on removing errors
in your data so your data becomes a true and
consistent representation of the processes it
originates from.
Different units of measurement     Recalculate
Different levels of aggregation    Bring to same level of measurement by aggregation or extrapolation
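The two fixes above can be sketched in Python. This is only an illustration under assumed data: the weights, the sales figures, and the names `to_kg`, `weights_kg`, and `weekly_sales` are hypothetical and do not come from the slides.

```python
# Hypothetical records: the same quantity stored in different units.
weights = [("A", 2.0, "kg"), ("B", 4.4, "lb")]

LB_TO_KG = 0.45359237  # one international pound in kilograms

def to_kg(value, unit):
    """Recalculate a weight so that every record uses kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

weights_kg = {name: to_kg(value, unit) for name, value, unit in weights}

# Hypothetical daily figures that must be compared with weekly ones:
# aggregate the finer level (days) up to the coarser level (weeks).
daily_sales = [("week1", 10), ("week1", 12), ("week2", 7), ("week2", 9)]
weekly_sales = {}
for week, amount in daily_sales:
    weekly_sales[week] = weekly_sales.get(week, 0) + amount
```

Aggregating upward is usually the safer direction, because extrapolating to a finer level has to invent detail that was never measured.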
Interpretation Errors
Data Entry Errors
Data collection and data entry are error-
prone processes.
They often require human intervention,
and because humans are only human,
they make typos or lose their
concentration
for a second and introduce an error into
the chain.
Data collected from machines or
computers isn’t free from errors either.
Some errors arise from human sloppiness,
whereas others are due to machine or
Solution for Data Entry Errors
When you have a variable that can take only
two values, “Good” and “Bad”, you can
create a frequency table and see if those are
truly the only two values present.
The values “Godo” and “Bade” point out that
something went wrong in at least 16 cases.
Most errors of this type are easy to fix with
simple assignment statements and
if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
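The frequency-table check and the correction rules can be combined in one short sketch. The sample `values` list is made up for illustration, and the names `allowed` and `corrections` are hypothetical; only `collections.Counter` from the standard library is used.

```python
from collections import Counter

# Hypothetical observations of a variable that should only be "Good" or "Bad".
values = ["Good", "Bad", "Godo", "Good", "Bade", "Bad"]

# Frequency table: how often each distinct value occurs.
frequency = Counter(values)

# Any value outside the allowed set points to a data entry error.
allowed = {"Good", "Bad"}
unexpected = {v: n for v, n in frequency.items() if v not in allowed}

# Apply the same corrections as the if-then-else rules, via a lookup table.
corrections = {"Godo": "Good", "Bade": "Bad"}
cleaned = [corrections.get(v, v) for v in values]
```

A lookup table keeps all corrections in one place, which may be easier to maintain than a growing chain of if statements.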
Redundant White Spaces
Whitespaces tend to be hard to
detect but cause errors like other
redundant characters would.
Solution for Redundant White Spaces
Many programming languages provide string
functions that will remove the leading and trailing
whitespaces.
For instance, in Python you can use the
strip() function to remove leading and trailing
spaces.
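A minimal sketch of this fix, using made-up sample strings. Note that `str.strip()` called without arguments removes all leading and trailing whitespace (tabs and newlines included), not only space characters.

```python
# Hypothetical raw values with redundant leading or trailing whitespace.
raw = ["  Good", "Bad  ", " Good "]

# "Good " and "Good" are different strings, so lookups and joins would miss.
mismatch = raw[2] != "Good"

# strip() returns a new string without leading and trailing whitespace.
cleaned = [s.strip() for s in raw]
```

If only one side should be trimmed, `lstrip()` and `rstrip()` do the same for the left and right ends respectively.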
Model Fit
Predictor variables have a
coefficient
Predictor significance