
MODULE 2

DATA SCIENCE
PROCESS
Step 1: Setting Research Goal
A project starts by understanding the questions.

•What does the company expect you to do?


•Why does management place such a value on your research?
•Is it part of a bigger strategic picture or a "lone wolf" project originating from an opportunity someone detected?

•The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable, all captured in a project charter.
A project charter requires teamwork, and
covers at least the following:
 A clear research goal
 The project mission and context
 How you’re going to perform your
analysis
 What resources you expect to use
 Proof that it’s an achievable project,
or proof of concepts
 Deliverables and a measure of success
 A timeline
Step 2: Data Retrieval
(Data may be internal or external)
(Retrieval of Internal Data)
Start with data stored within the company.

 Data is stored in the form of:
 Database - storage of data
 Data warehouse - for reading and analysing data
 Data mart - a subset of the data warehouse
 Data lake - data in its natural (raw) format
(Retrieval of External Data)
Don't be afraid to shop around.
Governments, non-governmental organizations, and companies share their data to enrich their services and ecosystems.
Examples: Twitter, LinkedIn, Facebook
Open data sites:

Site - Description
 Data.gov - The home of the US Government's open data
 Open-data.europa.eu - The home of the European Commission's open data
 Freebase.org - An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
 Data.worldbank.org - Open data initiative from the World Bank
 Aiddata.org - Open data for international development
 Open.fda.gov - Open data from the US Food and Drug Administration
Step 3: Data Preparation
Data Cleansing
 Removal of interpretation errors
 Removal of consistency errors

Combining Data
 Joining Tables
 Appending Tables
 Using Views to simulate Joins and Appends

Transforming Data
 Transforming Variables
 Reducing the number of variables
 Transforming variables into dummies
Step 3: Data Preparation
Data Cleansing
 Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.

Two types of errors:
 Interpretation errors, such as when you take a value for granted.
Example: the age of a person recorded as greater than 300 years.
 Consistency errors: inconsistencies between data sources or against standardized company values.
Example: one table uses Pounds and another uses Dollars.
Errors pointing to false values within one data set

Error description - Possible solution
 Mistakes during data entry - Manual overrules
 Redundant white space - Use string functions
 Impossible values - Manual overrules
 Missing values - Remove observation or value
 Outliers - Validate and, if erroneous, treat as missing value (remove or insert)
Errors pointing to inconsistencies between data sets

Error description - Possible solution
 Deviations from a code book - Match on keys, or else use manual overrules
 Different units of measurement - Recalculate
 Different levels of aggregation - Bring to the same level of measurement by aggregation or extrapolation
Interpretation Errors
Data Entry Errors
Data collection and data entry are error-prone processes.
They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain.
Data collected from machines or computers isn't free from errors either.
 Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
Solution for Data Entry Errors
 When you have a variable that can take only two values, "Good" and "Bad", you can create a frequency table and check whether those are truly the only two values present.
 The values "Godo" and "Bade" point out that something went wrong in at least 16 cases.
 Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
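
A minimal sketch of the frequency-table check in Python, assuming the values live in a pandas Series (the column name and sample values here are hypothetical):

import pandas as pd

# Hypothetical column containing data entry errors
quality = pd.Series(["Good", "Bad", "Godo", "Good", "Bade", "Bad"])

# Frequency table: reveals unexpected categories such as "Godo" and "Bade"
print(quality.value_counts())

# Fix the known typos with a simple mapping (equivalent to the if-then-else rules above)
quality = quality.replace({"Godo": "Good", "Bade": "Bad"})
print(quality.value_counts())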
Redundant White Spaces
Whitespaces tend to be hard to
detect but cause errors like other
redundant characters would.
Solution for Redundant White
Spaces
 Many programming languages provide string
functions that will remove the leading and trailing
whitespaces.
 For instance, in Python you can use the
strip() function to remove leading and trailing
spaces.

FIXING CAPITAL LETTER MISMATCHES


Capital letter mismatches are common.
 Most programming languages make a distinction
between “Brazil” and “brazil”.
In this case you can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() should evaluate to True.
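
A small sketch combining both fixes, stripping redundant whitespace and normalizing case (plain Python; the sample values are made up):

countries = ["  Brazil", "brazil ", "BRAZIL"]

# Remove leading/trailing whitespace and normalize case so the values compare equal
cleaned = [c.strip().lower() for c in countries]
print(cleaned)                                # ['brazil', 'brazil', 'brazil']
print("Brazil".lower() == "brazil".lower())   # True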
IMPOSSIBLE VALUES AND SANITY
CHECKS
• Check the value against
physically or theoretically
impossible values such as people
taller than 3 meters or someone
with an age of 299 years.

• Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
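
A sketch of such a rule applied to a whole column, assuming the ages sit in a plain Python list (the values and bounds are illustrative):

ages = [25, 299, 42, -3, 117]

# Flag values outside a physically plausible range
suspect = [a for a in ages if not (0 <= a <= 120)]
print(suspect)  # [299, -3]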
OUTLIERS

An outlier is an observation that


seems to be distant from other
observations or, more specifically, one
observation that follows a different
logic or generative process than the
other observations.

The easiest way to find outliers is to


use a plot or a table with the minimum
and maximum values.
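
A quick sketch of this check with pandas (the values are made up; the plot needs matplotlib):

import pandas as pd
import matplotlib.pyplot as plt

values = pd.Series([5.1, 4.9, 5.3, 5.0, 45.0])  # one suspicious observation

# Minimum, maximum, and quartiles immediately expose the extreme value
print(values.describe())

# A boxplot shows the same information graphically
values.plot(kind="box")
plt.show()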
DEALING WITH MISSING VALUES

Missing values aren’t necessarily


wrong, but you still need to
handle them separately.

 Certain modeling techniques can't handle missing values.
 They might be an indicator that something went wrong in your data collection or that an error occurred in the ETL process.
An overview of techniques to
handle missing data
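
Two of the most common options, omitting or imputing values, sketched with pandas (the DataFrame is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 42], "income": [30000, 45000, np.nan]})

# Option 1: omit observations with missing values
dropped = df.dropna()

# Option 2: impute a static value, for example the column mean
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)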
DEVIATIONS FROM A CODE BOOK
 Detecting errors in larger data sets against a
code book or against standardized values can
be done with the help of set operations.
 A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means.
(For instance, "0" equals "negative" and "5" stands for "very positive".)
 A code book also tells you the type of data you're looking at: is it hierarchical, graph, something else?
 If you have multiple values to check, it's better to put the values from the code book into a table and use a difference operator to check the discrepancy between the two tables.
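
A sketch of this check with Python set operations (the code book values and observed values are hypothetical):

# Values allowed according to the code book
codebook_values = {"0", "1", "2", "3", "4", "5"}

# Values actually observed in the data
observed_values = {"0", "2", "5", "9", "n/a"}

# Set difference: anything observed that the code book does not define
deviations = observed_values - codebook_values
print(deviations)  # {'9', 'n/a'}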
DIFFERENT UNITS OF
MEASUREMENT
When integrating two data sets, you
have to pay attention to their
respective units of measurement.
Example: when you study the prices of gasoline in the world, some data sets contain prices per gallon and others contain prices per liter.

• A simple conversion will do the trick


in this case
DIFFERENT LEVELS OF
AGGREGATION
Having different levels of
aggregation is similar to having
different types of measurement.
Example: A data set containing
data per week versus one
containing data per work week.
This type of error is generally easy
to detect, and summarizing (or
the inverse, expanding) the data
sets will fix it.
Correct errors as early as possible
Data should be cleansed when acquired for many
reasons:
• Not everyone spots the data anomalies.
Decision-makers may make costly mistakes on
information based on incorrect data from
applications that fail to correct for the faulty
data.
• If errors are not corrected early on in the
process, the cleansing will have to be done for
every project that uses that data.
• Data errors may point to a business process
that isn’t working as designed. For instance,
both authors worked at a retailer in the past,
and they designed a couponing system to
attract more people and make a higher profit.
• Data errors may point to defective
equipment, such as broken transmission lines
and defective sensors.
• Data errors can point to bugs in software or in the integration of software that may be critical to the company.
 While doing a small project at a bank, we discovered that two software applications used different locale settings. This caused problems with numbers greater than 1,000: for one application the number 1.000 meant one, and for the other it meant one thousand.
Combining the Data
 Joining Tables
 Appending Tables
 Using Views to simulate Joins and Appends
Joining Tables
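
A compact sketch of joining and appending with pandas (the tables and column names here are hypothetical, not from the original slides):

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "region": ["North", "South"]})
orders = pd.DataFrame({"client_id": [1, 1, 2], "amount": [100, 250, 80]})

# Join: enrich each order with client information via a shared key
joined = orders.merge(clients, on="client_id", how="left")

# Append: stack two tables with the same structure on top of each other
jan = pd.DataFrame({"client_id": [1], "amount": [100]})
feb = pd.DataFrame({"client_id": [2], "amount": [80]})
appended = pd.concat([jan, feb], ignore_index=True)

print(joined)
print(appended)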
ENRICHING AGGREGATED
MEASURES
Relationships between an input
variable and an output variable
aren’t always linear.
Transforming the input variables
greatly simplifies the estimation
problem. Other
times you might want to combine
two variables into a new variable.
Example:
Take, for instance, a relationship of the form y = ae^(bx). Taking the logarithm turns this into the linear relationship log(y) = log(a) + bx, which simplifies the estimation problem dramatically.
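
A sketch of this transform on simulated data, fitting the exponential relationship with a simple linear fit after taking logs (numpy only; the constants a = 2 and b = 0.5 are arbitrary):

import numpy as np

# Simulate y = a * exp(b * x) with a = 2, b = 0.5
x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(0.5 * x)

# After the log transform the relationship is linear: log(y) = log(a) + b * x
b_hat, log_a_hat = np.polyfit(x, np.log(y), 1)
print(b_hat, np.exp(log_a_hat))  # approximately 0.5 and 2.0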
Reducing the number of
variables
Having too many variables in
your model
makes the model difficult to
handle, and certain techniques
don’t perform well when you
overload them with too many
input variables.
 Data scientists use special methods to reduce the number of variables while retaining the maximum amount of information.
Principal Component Analysis (PCA) is a well-known dimension reduction technique.
It transforms the variables into a new set of variables called principal components.
PCA is well suited for multidimensional data.
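
A minimal PCA sketch with scikit-learn, reducing hypothetical data to two principal components:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 observations, 10 variables
X = np.random.rand(100, 10)

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance captured per component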
TURNING VARIABLES INTO
DUMMIES
Dummy variables can only take two values: true (1) or false (0).
They're used to indicate the presence or absence of a categorical effect that may explain the observation.
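
A sketch of turning a categorical variable into dummies with pandas (the weekday column is made up):

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Wed"]})

# Each category becomes its own 0/1 column indicating presence or absence
dummies = pd.get_dummies(df["weekday"], prefix="weekday")
print(dummies)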
Step 4: Data Exploratory Analysis

 Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
 The visualization techniques used in this phase range from simple line graphs and histograms to the combined and interactive plots listed below.
The Exploratory Data Analysis
(EDA) phase is where data
scientists dive deep into the
dataset, unravelling its patterns,
trends, and characteristics. This
phase employs statistical analysis
and visualisation techniques to
gain insights that will inform
subsequent modelling decisions.
Visualization techniques
Simple graphs
Non-graphical techniques
Link and brush
Combined graphs
Bar chart
Line plot
Distribution plot
Multiple Plots can help you
understand the structure of your
data over multiple variables.
Link and brush allows you to select
observations in one plot and highlight
the same observations in the other plots.
Histogram: the number of people in the age groups of 5-year
intervals
Boxplot: each user category has a
distribution of the appreciation each
has for a certain picture on a
photography website.
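
A sketch of these two plots with pandas and matplotlib (the data is simulated and the category names are made up):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Histogram: number of people per 5-year age group
ages = pd.Series(np.random.randint(0, 90, size=500))
ages.plot(kind="hist", bins=range(0, 95, 5), title="Age distribution")
plt.show()

# Boxplot: distribution of appreciation scores per user category
scores = pd.DataFrame({"casual": np.random.rand(100), "pro": np.random.rand(100)})
scores.plot(kind="box", title="Appreciation per user category")
plt.show()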
Step 5: Building the Model
Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine learning
school, and the type of technique you want to use.

Model and Variable selection


Model Execution
Model diagnostics and model
comparison
Model and Variable Selection

 You'll need to select the variables you want to include in your model and a modeling technique.
Choosing the right model for a problem requires thinking about factors such as:
 Must the model be moved to a production
environment and, if so, would it be easy
to implement?
 How difficult is the maintenance on the
model: how long will it remain relevant if
left untouched?
 Does the model need to be easy to
explain?
Picking the correct algorithm is crucial and depends on the problem, the data type, and the goal. Options include classification, regression, clustering, and deep learning.
Model Execution
Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
Coding a model is a nontrivial task
in most cases, so having these
libraries available can speed up
the process.
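
A sketch of model execution with StatsModels on simulated data (the coefficients and noise level are arbitrary); the summary reports the model fit, the coefficients, and the predictor significance discussed below:

import numpy as np
import statsmodels.api as sm

# Simulated data: y depends linearly on two predictors plus noise
np.random.seed(0)
X = np.random.rand(100, 2)
y = 1.0 + 0.75 * X[:, 0] + 0.2 * X[:, 1] + np.random.randn(100) * 0.1

# Add an intercept and fit an ordinary least squares model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

print(model.summary())  # R-squared, coefficients, p-values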

Important factors in Model Execution

Model Fit
Predictor variables have a
coefficient
Predictor significance
Model Fit

For model fit, the R-squared or adjusted R-


squared is used.
This measure is an indication of the amount
of variation in the data that gets captured
by the model.
The difference between the adjusted R-squared and the R-squared is usually minimal, because the adjusted R-squared is the normal R-squared with a penalty for model complexity.
In research, however, very low model fits (even below 0.2) are common.
Predictor variables have a
coefficient
For a linear model this is easy to interpret: in the example, if you add 1 to x1, y changes by 0.7658 (the coefficient of x1).
Detecting influences is more important in
scientific studies than perfectly fitting
models.
If, for instance, you determine that a
certain gene is significant as a cause for
cancer, this is important knowledge, even
if that gene in itself doesn’t determine
whether a person will get cancer.
Predictor significance
We compared the prediction with the
real values, true, but we never
predicted based on fresh data.
The prediction was done using the
same data as the data used to build the
model.
This is all fine and dandy to make
yourself feel good, but it gives you no
indication of whether your model will
work when it encounters truly new
data.
Model diagnostics and model comparison

 Working with a holdout sample helps you pick


the best-performing model.
 A holdout sample is a part of the data you
leave out of the model building so it can be
used to evaluate the model afterward.
 The principle here is simple: the model should
work on unseen data.
 Choose the model with the lowest error.
 Many models make strong assumptions, such
as independence of the inputs, and you have
to verify that these assumptions are indeed met. This is called model diagnostics.
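
A sketch of a holdout evaluation with scikit-learn, comparing two candidate models on unseen data (the data and the model choices are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Simulated data
np.random.seed(1)
X = np.random.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(200) * 0.1

# Keep a holdout sample out of the model building
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit candidate models on the training data only and score them on the holdout
for candidate in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
    candidate.fit(X_train, y_train)
    error = mean_squared_error(y_test, candidate.predict(X_test))
    print(type(candidate).__name__, error)  # choose the model with the lowest error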
Model evaluation is not a one-
time task; it is an iterative
process. If the model falls short of
expectations, data scientists go
back to previous stages, adjust
parameters, or even reconsider
the algorithm choice. This
iterative refinement is crucial for
achieving optimal model
performance.
Step 6: Presentation and
Automation
Sometimes it is sufficient to implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.
The last stage of the data science
process is where your soft skills will
be most useful, and yes, they’re
extremely important.
