
Data Science Project - Flow graph

1. Data preparation
2. Flat-file generation (data enrichment and transformation)
3. Exploratory Data Analysis (EDA)
4. Outlier detection
5. Missing values (imputation and treatment)
6. Variable Selection
7. Train-Dev-Test preparation
8. Model Selection
9. Model Fine-tuning

I. Data Preparation

1. Select and clearly define the outcome. The definition must answer the following
questions:
a. What do you want to predict?
b. Over what time horizon from today?
c. On which population/events will the model be generated?
d. To whom/which events will the model not apply?
2. Find all the data sources that you consider can contribute useful variables for
training the model, and add the respective variables to the dataset.

II. Exploratory Data Analysis (EDA)

1. Statistical Analysis:
● Mean and Standard Deviation
● Median and IQR (25%-75%)
● Minimum, Maximum, Count
2. Table One
3. Graphics (a minimal seaborn sketch follows this list):
● Categorical variables: barplot
● Numeric Variables: histogram (distplot)
● Numeric vs Categorical: boxplot
● Numeric vs Numeric: scatter plot
● Pairs
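A rough illustration of these plots with seaborn (assuming a pandas DataFrame named df; the column names are invented for the example):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data standing in for the project dataset.
df = pd.DataFrame({"cat": ["a", "b", "a", "b", "a"],
                   "x": [1.0, 2.5, 3.1, 0.7, 2.2],
                   "y": [2.0, 3.5, 4.1, 1.7, 3.0]})

sns.countplot(x="cat", data=df)         # categorical: barplot of counts
plt.show()
sns.histplot(df["x"], kde=True)         # numeric: histogram (distplot's successor)
plt.show()
sns.boxplot(x="cat", y="x", data=df)    # numeric vs categorical: boxplot
plt.show()
sns.scatterplot(x="x", y="y", data=df)  # numeric vs numeric: scatter plot
plt.show()
sns.pairplot(df)                        # pairs: all pairwise numeric plots
plt.show()
```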

III. Data cleansing


1. Outlier detection - univariate:
● Normal distribution: Standard Deviation (z-scores)
● Non-normal distribution: IQR
2. Outlier detection - multivariate:
● DBSCAN
3. Outlier treatment:
● Obviously incorrect: convert the value to NA.
○ Example: age 450 years, body temperature 2°C, height -1.2 m
● Changes the distribution but does not change the relationships: convert the value to NA
and report it.


● Changes both the distribution and the relationships: analyze the effect or influence of
the outliers by running the analysis in their presence and absence. Report the findings and
explain your decision to keep the outliers or substitute them with NAs.

● Creates a correlation where there is none: convert the value to NA.

● Multivariate outliers: DBSCAN returns a list of rows containing mv-outliers. Add an
indicator variable for outliers (0/1) and check whether this indicator is related to the
dependent variable (y, the outcome), using correlation analysis or by comparing the
distribution of y between outlier and non-outlier cases. You can drop the indicator if there
is no correlation or if the distributions are similar; otherwise, leave it in the dataset. A
sketch of this check follows.
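A minimal sketch of this check with scikit-learn's DBSCAN, on invented data (the eps and min_samples values are placeholders that have to be tuned for each dataset):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy numeric dataset with a binary outcome y.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = rng.integers(0, 2, size=200)

# DBSCAN labels rows that belong to no cluster as -1: the mv-outliers.
X = StandardScaler().fit_transform(df[["x1", "x2", "x3"]])
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
df["mv_outlier"] = (labels == -1).astype(int)

# Relation between the indicator and y: correlation, and the distribution
# of y among outlier vs non-outlier cases.
print(df["mv_outlier"].corr(df["y"]))
print(df.groupby("mv_outlier")["y"].mean())
```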

4. Missing values: Determine the missingness mechanism.
● Missing Completely at Random (MCAR): missing values are generated randomly and there
is no possible way to explain why those values are missing.
● Missing at Random (MAR): a variable influences the generation of missing values, but
within subgroups of that variable the missingness is completely at random. For example,
women tend to report their age or weight less often than men; however, if we check the
mechanism separately for women and for men, within each group the generative
mechanism is MCAR.
● Missing Not at Random (MNAR): there is a clear cause for the generation of missing
values. An example is visits to gynecologists (women's doctors): no man will have such a
visit in the dataset. Another example is a question asked only of individuals aged 60 and
over: no 55-year-old will have the question answered.

5. Missingness treatment: if the missing mechanism is MCAR or MAR, we can use imputation
techniques; otherwise we have to decide whether to drop the rows or the columns.
● Detecting Missingness Mechanism
○ For each variable (feature) with missing values, generate a dummy variable
indicating for each value whether it is observed (0) or missing (1).
○ Using only the variables that have no missing data, run a logistic regression (in R:
glm; in Python: statsmodels Logit or sklearn LogisticRegression) where X is the dataset
restricted to those variables and the outcome (y) is the missing dummy variable.
○ If all the p-values for the coefficients (betas) are non-significant (p>=0.05), we can
assume that the mechanism is MCAR.
○ If some variable is significant (p-value < 0.05), we have to inspect the
relationship between that variable and the missing indicator. We can check it
using a boxplot and determine a value that divides the dataset into two
subsets. Each subset is then analyzed separately, as before. If in both groups
we can demonstrate that the mechanism is MCAR, we say that for this variable the
original mechanism is MAR.
○ If we can demonstrate neither MCAR nor MAR, we say that the mechanism is
MNAR.
● Imputation techniques:
○ Statistical imputation: using mean, median or mode (depending on the data scale).
Not recommended!
○ Model-based imputation: we can use a predictive model to impute the missing
values
■ KNN: this algorithm is very popular because it imputes data easily and
quickly. We split the rows into complete cases (including the variable that
we want to impute) and incomplete cases, train the model on the complete
subset, and then use the predict function to fill in the incomplete subset.
■ Random Forest (used as explained for KNN)
■ Decision trees
○ Multiple imputation: the most common method is Multiple Imputation by
Chained Equations (MICE), which uses the whole dataset for imputation. It begins with
the variable with the fewest missing values and imputes it, then the variable with the
next fewest, and so on. MICE generates a number of imputed datasets (the default,
and most common choice, is 5) that hold different values for each imputed variable,
while the mean, median, standard deviation and IQR of each variable stay essentially
stable across the imputed datasets. Any further analysis must be made using all the
imputed datasets (5 separate analyses each time); the final result is calculated as
the mean of the outcome (y) over all the models.
○ When you have a lot of data (>20,000 rows), a simple model is recommended; with
smaller datasets, multiple imputation is recommended. A minimal sketch of the
detection-plus-imputation workflow follows.
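A minimal sketch of the detection-plus-imputation workflow on invented data (the column names and the 15% missingness rate are illustrative; statsmodels' Logit plays the role the text assigns to R's glm, and KNNImputer stands in for the KNN approach described above):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.impute import KNNImputer

# Toy data: x2 has missing values; x1 and x3 are fully observed.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
df.loc[rng.random(300) < 0.15, "x2"] = np.nan

# 1) Dummy variable: 1 where x2 is missing, 0 where it is observed.
miss = df["x2"].isna().astype(int)

# 2) Logistic regression of the dummy on the fully observed variables.
X = sm.add_constant(df[["x1", "x3"]])
fit = sm.Logit(miss, X).fit(disp=0)
print(fit.pvalues)  # all p >= 0.05 suggests MCAR for x2

# 3) If MCAR/MAR, impute; KNNImputer fills each NaN from the k nearest rows.
imputed = KNNImputer(n_neighbors=5).fit_transform(df)
df_imputed = pd.DataFrame(imputed, columns=df.columns)
```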

IV. Data enrichment and transformation


1. Combination of two or more variables: sum, difference, multiplication, division
2. Variable modification: polynomial, logarithm, square root, exponential, etc.
3. Transformation of categorical variables: one-hot/dummy encoding, bin-counting
4. Enrichment with cluster analysis
5. Enrichment with external data (a sketch of common transformations follows this list)
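A minimal sketch of two of these transformations on invented columns (a log modification and one-hot encoding with pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000.0, 52_000.0, 81_000.0],
                   "city": ["Haifa", "Tel Aviv", "Haifa"]})

# Variable modification: logarithm (assumes strictly positive values).
df["log_income"] = np.log(df["income"])

# Categorical variable: one 0/1 dummy column per category.
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df.head())
```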

V. Feature Selection


1. Univariate analysis: each variable in the dataset is analyzed by examining its relationship with
the dependent variable (outcome or y). The analysis depends on the independent (x) and
dependent (y) data types:
● If x is nominal and y is nominal or ordinal: use chi-square
● If x is nominal and y is continuous: use a t-test or ANOVA (Spearman correlation if x is
ordinal)
● If x is continuous and y is binomial (0/1, yes/no, true/false): use an independent t-test
● If x is continuous and y is multinomial (more than two categories) or ordinal: use ANOVA
● If both x and y are continuous: use Pearson or Spearman correlation
If y is nominal or ordinal with fewer than 6 categories, you can use tableone (R and Python),
setting y as the grouping category; tableone performs the whole analysis automatically.
You will want to include in your analysis those variables with a significant p-value (< 0.05). A
sketch of some of these tests follows.
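A rough sketch of three of these tests with scipy, on synthetic data (the column names are invented):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x_cat": rng.choice(["a", "b"], size=200),  # nominal feature
    "x_num": rng.normal(size=200),              # continuous feature
    "y_bin": rng.integers(0, 2, size=200),      # binary outcome
    "y_num": rng.normal(size=200),              # continuous outcome
})

# Nominal x vs nominal y: chi-square on the contingency table.
chi2, p, dof, _ = stats.chi2_contingency(pd.crosstab(df["x_cat"], df["y_bin"]))
print("chi-square p:", p)

# Continuous x vs binary y: independent t-test between the two groups.
g0 = df.loc[df["y_bin"] == 0, "x_num"]
g1 = df.loc[df["y_bin"] == 1, "x_num"]
print("t-test p:", stats.ttest_ind(g0, g1).pvalue)

# Continuous x vs continuous y: Pearson correlation.
r, p = stats.pearsonr(df["x_num"], df["y_num"])
print("pearson p:", p)
```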

2. Multivariate analysis: use the whole dataset and run predictive models that can return
a list of recommended features by quantifying their influence in the model. For this step you
do not need to partition the dataset into train, dev and test.
● LASSO (L1 penalization): Lasso penalizes the growth of the coefficient values
(betas) and can shrink them exactly to zero. Variables with a zeroed coefficient are
excluded from the analysis, which gives this algorithm the capacity for feature selection
(see the sketch after this list).
● Random Forest: this algorithm can generate a list of the most influential variables and
their importance.
● Gradient Boosting: same as Random Forest.
● Support Vector Machine (SVM): can be used for feature selection by applying L1
penalization.
● Principal Component Analysis (PCA): can be used to select variables by taking those
with the highest correlation with the principal components that capture most of the
variability (80% cumulative variance).
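As a sketch of the LASSO bullet with scikit-learn's LassoCV (the synthetic dataset and parameter values are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties need comparable scales

# LassoCV picks the penalty strength by cross-validation; coefficients
# driven exactly to zero mark the variables Lasso would exclude.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print("kept features:", np.flatnonzero(lasso.coef_))
```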

3. Selection based on voting: using several of the techniques above (univariate and multivariate),
we build a table with all the variables in the dataset and mark the variables recommended by each
technique; we then set a threshold on the total number of votes and on this basis select the
variables that will be used to train our models.

VI. Dataset Partitioning

The process of data partitioning includes the following steps:


1. Create a test dataset with about 10-20% of the total data. This partition will be set aside and will
not be used until the end of the project.
2. The remaining data can be partitioned using one of the following methods:
a. Divide the dataset into train (60-80%) and development/validation (20-40%). The train
dataset is used to train the models, while the dev dataset is used to assess model
performance (using the selected evaluation metric).
b. Use k-fold cross-validation, where k is the number of partitions made of the remaining
dataset. In this case, each partition is used in different iterations to train or evaluate the
model. The number of partitions depends on the number of rows available in the dataset.
A minimal sketch of both options follows.
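A minimal sketch of both options with scikit-learn (the split fractions are the illustrative values from the text):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Step 1: hold out a test set (here 20%) that stays untouched until the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Option a: split the rest into train (75%) and dev (25%).
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Option b: k-fold cross-validation on the remaining data.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, dev_idx in kf.split(X_rest):
    pass  # train on X_rest[train_idx], evaluate on X_rest[dev_idx]
```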

VII. Evaluation Metric Selection


Regression Models
● Need an absolute measure of how well the model performs ⇨ R2
● Only need to compare between models:
○ Outliers are absent ⇨ MSE / RMSE
○ Outliers are present ⇨ MAE
○ Large difference between values of Y ⇨ RMSLE

Classification Models
● Balanced data ⇨ maximize correct classification: Accuracy / Log-loss
● Unbalanced data
○ Capture all positives (minimize false negatives) ⇨ maximize Recall
○ Capture all negatives (minimize false positives) ⇨ maximize Precision
○ Balance between Precision and Recall
■ No need to select a cutoff ⇨ AUC
■ Need to select a cutoff ⇨ F1-score
A sketch computing these metrics follows.
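A sketch computing these metrics with scikit-learn on invented toy values (hard labels for accuracy/precision/recall/F1, predicted probabilities for AUC and log-loss):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, precision_score,
                             r2_score, recall_score, roc_auc_score)

# Regression metrics.
y_true = np.array([3.0, 5.0, 7.5]); y_pred = np.array([2.8, 5.4, 7.0])
print("R2:", r2_score(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSLE:", np.sqrt(mean_squared_log_error(y_true, y_pred)))

# Classification metrics.
yc = np.array([0, 1, 1, 0]); yp = np.array([0, 1, 0, 0])
prob = np.array([0.2, 0.8, 0.4, 0.1])
print("accuracy:", accuracy_score(yc, yp))
print("recall:", recall_score(yc, yp))
print("precision:", precision_score(yc, yp))
print("F1:", f1_score(yc, yp))
print("AUC:", roc_auc_score(yc, prob))
print("log-loss:", log_loss(yc, prob))
```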

VIII. Model Selection

The model selection phase consists of the following steps:


1. Run many different prediction models appropriate to the outcome type (classification/regression)
without changing their default parameters (the exception may be SVM, where you have to select an
appropriate kernel for your outcome characteristics).
2. Check model performance using the appropriate evaluation metrics.
3. Select the best performing model: the best metric value and the lowest level of under/overfitting.
4. Check the model's predictions on different subgroups of the data. Try to find specific subgroups
where the prediction performs poorly, and check whether any transformations or changes may
correct this behaviour, being careful not to add bias to the data. A minimal sketch of steps 1-3
follows.
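A minimal sketch of steps 1-3 with scikit-learn defaults (the three model families and the AUC metric are illustrative choices; comparing train vs dev scores also exposes over/underfitting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

# Several model families at default parameters; only the SVM gets a kernel choice.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_rbf": SVC(kernel="rbf", probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    dev_auc = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
    print(f"{name}: train AUC={train_auc:.3f}, dev AUC={dev_auc:.3f}")
```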

IX. Model Fine-Tuning

After selecting the best performing model, we try to fine-tune it:
1. Depending on the model, create vectors with a wide range of values for each of the
parameters that may affect the performance of the model.
2. Use random search to apply the parameters, with cross-validation to check the selection
of the best parameter combination.
3. If there seems to be more room for optimization, use grid search with a narrower range of
parameters, based on the results of the previous step.
4. Select the parameters that give the best model performance. A minimal sketch follows.
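A minimal sketch of the random-then-grid search with scikit-learn (the Random Forest and its parameter ranges are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# 1) Wide ranges for the parameters that drive performance.
wide = {"n_estimators": randint(50, 500), "max_depth": randint(2, 20)}

# 2) Random search with cross-validation over the wide ranges.
rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), wide,
                        n_iter=20, cv=5, random_state=0)
rs.fit(X, y)
print("random search best:", rs.best_params_)

# 3) Grid search on a narrower range around the random-search winner.
d = rs.best_params_["max_depth"]
narrow = {"n_estimators": [rs.best_params_["n_estimators"]],
          "max_depth": [max(2, d - 1), d, d + 1]}
gs = GridSearchCV(RandomForestClassifier(random_state=0), narrow, cv=5)
gs.fit(X, y)
print("grid search best:", gs.best_params_)  # 4) best parameter combination
```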

X. Final Model Test

At this point, check the model against the test dataset that we set aside in the dataset partitioning phase.

Dr. Tomas Karpati MD (c) 2019-2021
