Data Science Project - Flow Graph
[Flow-graph steps from the original figure: 6. Variable Selection; 7. Train-Dev-Test preparation]
1. Select and clearly define the Outcome - the definition must answer the following questions:
a. What do you want to predict?
b. Over what time horizon from today?
c. On which population/events will the model be trained?
d. To whom, or to which events, will the model not apply?
2. Find all the data sources that you believe contain variables that can contribute to the training of the model. Add the respective variables to the dataset.
1. Statistical Analysis:
● Mean and Standard Deviation
● Median and IQR (25%-75%)
● Minimum, Maximum, Count
2. Table One
3. Graphics (see the sketch after this list):
● Categorical variables: barplot
● Numeric variables: histogram (seaborn's histplot; the older distplot is deprecated)
● Numeric vs categorical: boxplot
● Numeric vs numeric: scatter plot
● Pairs plot
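A minimal EDA sketch covering the statistics and plots above, assuming pandas/seaborn and a hypothetical DataFrame with placeholder columns "age", "bmi" and "group":

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "age": [34, 51, 29, 62, 45, 38],
        "bmi": [22.1, 27.5, 24.3, 30.2, 26.8, 23.9],
        "group": ["A", "B", "A", "B", "A", "B"],
    })

    # 1. Statistical analysis: count, mean, std, min, 25%/50%/75% (median, IQR), max
    print(df.describe())

    # 3. Graphics
    sns.countplot(data=df, x="group")           # categorical: barplot
    plt.show()
    sns.histplot(data=df, x="age")              # numeric: histogram
    plt.show()
    sns.boxplot(data=df, x="group", y="age")    # numeric vs categorical
    plt.show()
    sns.scatterplot(data=df, x="age", y="bmi")  # numeric vs numeric
    plt.show()
    sns.pairplot(df)                            # pairs plot of the numeric columns
    plt.show()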
● If outliers change both the distribution and the relationships: analyze their effect or influence by repeating the analysis with and without the outliers. Report the findings and explain your decision to keep the outliers or to replace them with NAs.
● Multivariate outliers: DBSCAN returns a list of rows containing multivariate outliers. Add an indicator variable for outliers (0/1) and check whether this indicator is related to the dependent variable (y, the outcome), as sketched below. You can use correlation analysis or compare the distribution of y between outlier and non-outlier cases. You can drop the indicator if there is no correlation or if the distributions are similar; otherwise, leave it in the dataset.
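A minimal sketch of the DBSCAN-based indicator described above; the feature names and the eps/min_samples settings are illustrative assumptions:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    X.iloc[0] = [8.0, 8.0, 8.0]              # plant an obvious multivariate outlier
    y = rng.normal(size=200)                 # placeholder outcome

    # DBSCAN labels noise points -1; use that as the 0/1 outlier indicator
    labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(StandardScaler().fit_transform(X))
    X["mv_outlier"] = (labels == -1).astype(int)

    # Check the indicator against the outcome: correlation and group comparison
    print(np.corrcoef(X["mv_outlier"], y)[0, 1])
    print(pd.Series(y).groupby(X["mv_outlier"]).describe())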
5. Missingness treatment: If the missing mechanism is MCAR or MAR, we can use imputation techniques; otherwise we have to decide whether to drop the rows or the columns.
● Detecting the missingness mechanism (see the sketch after this list)
○ For each variable (feature) with missing values, generate a dummy variable indicating whether each value is present (0) or missing (1).
○ Using only the variables that have no missing data, run a logistic regression (in R: glm with family = binomial; in Python: statsmodels Logit or sklearn LogisticRegression - note that OLS fits a linear, not logistic, model) where X is the dataset restricted to those variables and the outcome (y) is the missingness dummy.
○ If all the p-values for the coefficients (betas) are non-significant (p >= 0.05), we can assume that the mechanism is MCAR.
○ If some variable is significant (p < 0.05), inspect the relationship between that variable and the missingness indicator. We can check it with a boxplot and determine a value that divides the dataset into two subsets. Each subset is then analyzed separately, as before. If we can demonstrate MCAR in both groups, we say that for this variable the original mechanism is MAR.
○ If we cannot demonstrate MCAR or MAR, we say that the mechanism is MNAR.
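A minimal sketch of the MCAR check above, using a statsmodels logistic regression; the column names and the injected missingness are placeholder assumptions:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "age": rng.normal(50, 10, 300),
        "bmi": rng.normal(26, 4, 300),
        "lab": rng.normal(100, 15, 300),
    })
    df.loc[rng.random(300) < 0.15, "lab"] = np.nan  # inject missing values into "lab"

    miss = df["lab"].isna().astype(int)             # dummy: 1 = missing, 0 = present
    X = sm.add_constant(df[["age", "bmi"]])         # only fully observed variables
    fit = sm.Logit(miss, X).fit(disp=0)
    print(fit.pvalues)  # all p >= 0.05 -> consistent with MCAR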
● Imputation techniques:
○ Statistical imputation: using the mean, median or mode (depending on the data scale). Not recommended!
○ Model-based imputation: we can use a predictive model to impute the missing values (see the sketch at the end of this section)
■ KNN: this algorithm is popular because it imputes data easily and quickly. We split the dataset into the complete cases (rows where the variable we want to impute is observed) and train the model on them. Then we use the predict function to fill in the incomplete subset.
■ Random Forest (used the same way as described for KNN)
■ Decision trees
○ Multiple imputation: The most common method is Multiple Imputation by Chained Equations (MICE), which uses the whole dataset for imputation. It begins with the variable with the fewest missing values and imputes it, then takes the variable with the second-fewest missings and imputes it, and so on. MICE generates a number of imputed datasets (the default, and most common choice, is 5) that have different values for each imputed variable, while the mean, median, standard deviation and IQR of each variable should stay roughly stable across the imputed datasets. Any further analysis must be run on all the imputed datasets (5 separate analyses each time); the final result is calculated as the mean of the outcome (y) across all the models.
○ When having a lot of data (>20,000 rows), a simple model is recommended. With smaller datasets, multiple imputation is recommended.
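A minimal imputation sketch: KNNImputer for the model-based approach, and scikit-learn's IterativeImputer as a MICE-style stand-in (it is inspired by MICE but returns one imputed dataset per run, so several differently seeded runs play the role of the multiple imputed datasets):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import KNNImputer, IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

    # Model-based imputation with KNN
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

    # MICE-style: five chained-equations imputations with different seeds
    imputed_sets = [
        IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
        for s in range(5)
    ]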
2. Multivariate analysis: Using the whole dataset, run predictive models that can return a list of recommended features by quantifying their influence on the model. For this step you don't need to partition the dataset into train, dev and test. (A sketch follows this list.)
● LASSO (L1 penalization): LASSO penalizes the growth of the coefficient values (betas) and can shrink them all the way to zero. Variables with a zeroed coefficient are excluded from the analysis, which gives this algorithm its feature-selection capability.
● Random Forest: this algorithm can generate a list of the most influential variables and their importance.
● Gradient Boosting: same as Random Forest. (Gradient descent itself is an optimization routine, not a feature-ranking model.)
● Support Vector Machine (SVM): can be used for feature selection by applying L1 penalization.
● Principal Component Analysis (PCA): can be used for variable selection by comparing the variables' correlations (loadings) with the principal components that capture most of the variability (e.g., 80% cumulative variance).
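A minimal feature-selection sketch for two of the approaches above (LASSO coefficients and random-forest importances) on synthetic data; all settings are illustrative:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    # LASSO: variables whose coefficient was not shrunk to zero are kept
    lasso = LassoCV(cv=5).fit(X, y)
    print("LASSO keeps features:", np.flatnonzero(lasso.coef_))

    # Random Forest: rank variables by their importance
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    print("RF importance ranking:", np.argsort(rf.feature_importances_)[::-1])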
V. Dataset Partitioning
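A minimal sketch of the train/dev/test partition used in the rest of the document, assuming a common 60/20/20 split (the proportions are an assumption, not a prescription):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # First carve out 40%, then split it in half into dev and test
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)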
Classification Models (metric selection; see the sketch after this list)
● Balanced data ⇨ Maximize correct classification: Accuracy (or minimize Log-loss)
● Unbalanced data
○ Capture all positives (minimize false negatives) ⇨ Maximize Recall
○ Capture all negatives (minimize false positives) ⇨ Maximize Precision
○ Balance between Precision and Recall
■ Need to select a cutoff ⇨ F1-score
■ No need to select a cutoff ⇨ AUC
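A minimal sketch computing the metrics above with scikit-learn on toy labels; note that F1 is computed at a chosen cutoff while AUC is cutoff-free:

    from sklearn.metrics import (accuracy_score, log_loss, recall_score,
                                 precision_score, f1_score, roc_auc_score)

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_prob = [0.1, 0.4, 0.8, 0.6, 0.3, 0.2, 0.9, 0.7]   # predicted P(y = 1)
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]     # cutoff at 0.5

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("log-loss :", log_loss(y_true, y_prob))
    print("recall   :", recall_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("F1       :", f1_score(y_true, y_pred))       # needs a cutoff
    print("AUC      :", roc_auc_score(y_true, y_prob))  # cutoff-free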
VI. Model Selection
After selecting the best-performing model, we will try to fine-tune it:
1. Depending on the model, create vectors with a wide range of values for each of the parameters that may affect the model's performance.
2. Use Random-Search to apply the parameters, with cross-validation to evaluate and select the best parameter combination.
3. If there seems to be more room for optimization, use Grid-Search over a narrower range of parameters, based on the results of the previous step.
4. Select the parameters that give the best model performance (see the sketch below).
At this point, check the model with the test dataset (which we set aside in the dataset-partitioning phase).
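A minimal random-search-then-grid-search tuning sketch; the model and parameter ranges are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    # Steps 1-2: wide ranges explored by random search with cross-validation
    wide = {"n_estimators": [50, 100, 200, 400], "max_depth": [2, 4, 8, 16, None]}
    rs = RandomizedSearchCV(RandomForestClassifier(random_state=0), wide,
                            n_iter=10, cv=5, random_state=0).fit(X, y)

    # Steps 3-4: narrower grid around the best random-search result
    narrow = {"n_estimators": [rs.best_params_["n_estimators"]],
              "max_depth": [4, 8, 16]}
    gs = GridSearchCV(RandomForestClassifier(random_state=0), narrow, cv=5).fit(X, y)
    print(gs.best_params_, gs.best_score_)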