ML Concepts Papers
With ordinal data, the encoding should retain information about the order of the
categories. In the example above, the highest degree a person holds carries vital
information about their qualification, and the degree is an important feature in
deciding whether the person is suitable for a post.
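As a minimal sketch of order-preserving encoding (the degree levels and their ordering below are assumptions, not from the original text), each category can be mapped to an integer that respects the ordinal ranking:

```python
import pandas as pd

# Hypothetical education data; the ordering of the levels is an assumption
df = pd.DataFrame({"degree": ["High School", "Bachelors", "Masters", "PhD", "Bachelors"]})

# Ordinal encoding: map each level to an integer that preserves the order
order = {"High School": 0, "Bachelors": 1, "Masters": 2, "PhD": 3}
df["degree_encoded"] = df["degree"].map(order)
print(df["degree_encoded"].tolist())
```

Unlike one-hot encoding, this keeps the "PhD > Masters > Bachelors" relationship available to the model.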
One-hot encoding and dummy encoding are two powerful and effective encoding
schemes, and they are very popular among data scientists. However, they may not
be effective when:
1. A large number of levels are present in the data. If a feature has many
categories, we need a comparable number of dummy variables to encode it. For
example, a column with 30 distinct values will require 30 new variables.
2. The dataset contains multiple categorical features. The same situation occurs
for each of them, and we end up with many binary features, each representing
one category of one feature, e.g. a dataset with 10 or more categorical columns.
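The column explosion described in point 1 can be seen directly with a small sketch (the 30-level column here is synthetic, purely for illustration):

```python
import pandas as pd

# A single categorical column with 30 distinct levels (synthetic codes)
df = pd.DataFrame({"code": [f"C{i:02d}" for i in range(30)]})

# One-hot encoding creates one new binary column per level
encoded = pd.get_dummies(df, columns=["code"])
print(encoded.shape)  # 30 rows, 30 dummy columns
```

With 10 such columns, the encoded feature matrix would grow to hundreds of mostly-zero columns.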
ML-based techniques involve several steps. First, features are extracted by
computing statistics over multiple packets of a flow (such as packet lengths, flow
duration, or inter-packet arrival times) [17]. Then, where possible, the features are
refined by feature-selection algorithms.
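A rough sketch of that first step, using a hypothetical list of packet records (the timestamps and lengths are invented for illustration):

```python
# Hypothetical packet records for one flow: (timestamp in seconds, length in bytes)
packets = [
    {"ts": 0.00, "length": 60},
    {"ts": 0.05, "length": 1500},
    {"ts": 0.12, "length": 1500},
    {"ts": 0.30, "length": 60},
]

lengths = [p["length"] for p in packets]
timestamps = [p["ts"] for p in packets]

# Flow-level features computed over the packets
features = {
    "mean_pkt_len": sum(lengths) / len(lengths),
    "flow_duration": timestamps[-1] - timestamps[0],
    # mean inter-packet arrival time: total gap over number of gaps
    "mean_iat": (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1),
}
print(features)
```

Real systems compute many more such statistics per flow before handing them to a feature-selection stage.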
The dummy variable trap is a scenario in which the dummy variables are highly
correlated (multicollinear) with each other: because the full set of dummies for a
feature always sums to one, each dummy can be predicted from the others.
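A small sketch of the trap and the usual fix (the `color` column is a made-up example); dropping one level removes the perfect redundancy:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full set of dummies: the columns always sum to 1 in every row, so each
# dummy is perfectly predictable from the others -- the dummy variable trap
full = pd.get_dummies(df["color"])
print(full.sum(axis=1).tolist())

# Dropping the first level breaks the linear dependence
safe = pd.get_dummies(df["color"], drop_first=True)
print(safe.columns.tolist())
```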
•A categorical variable has too many levels. This pulls down the performance of
the model. For example, a categorical variable "zip code" would have numerous levels.
•A categorical variable has levels which rarely occur. Many of these levels have
minimal chance of making a real impact on model fit. For example, a variable
‘disease’ might have some levels which would rarely occur.
•There is one level which always occurs, i.e. for most of the observations in the
data set there is only one level. Variables with such levels fail to make a positive
impact on model performance due to very low variation.
•If the categorical variable is masked, it becomes a laborious task to decipher its
meaning. Such situations are commonly found in data science competitions.
•You can’t fit categorical variables into a regression equation in their raw form.
They must be treated.
•Most of the algorithms (or ML libraries) produce better results with numerical
variables. In Python, the library "sklearn" requires features as numerical arrays.
Look at the snapshot below: I applied a random forest from sklearn to the Titanic
data set (with only two features, sex and pclass, taken as independent variables).
It returned an error because the feature "sex" is categorical and had not been
converted to numerical form.
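The fix is to encode the categorical column before fitting. The sketch below uses a tiny synthetic stand-in for the Titanic features (not the real dataset), and assumes a simple male/female mapping:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in for the Titanic features (not the real data set)
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "pclass": [3, 1, 2, 3, 1, 2],
    "survived": [0, 1, 1, 0, 1, 0],
})

# Fitting on the raw string column would raise a ValueError, so the
# "sex" feature is first converted to numeric form
df["sex"] = df["sex"].map({"male": 0, "female": 1})

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(df[["sex", "pclass"]], df["survived"])
print(model.predict(df[["sex", "pclass"]]))
```

With both features numeric, the fit succeeds and predictions can be produced.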