ML Concepts Papers

This document discusses various machine learning concepts including different encoding techniques for categorical variables like one hot encoding and dummy encoding. It notes that these techniques can result in a large number of variables for categories with many levels. Ordinal and nominal data are described, with ordinal data retaining information about category order. Label encoding and one hot encoding are also summarized, along with challenges like dummy variable traps and multicollinearity. Reasons categorical variables may need preprocessing before machine learning algorithms are provided.

Uploaded by Krunal R

Machine Learning Concepts

Encoding Techniques for Categorical Variables

• One-Hot Encoding
• Dummy Encoding
• Effect Encoding
• Binary Encoding
• BaseN Encoding
• Hash Encoding
• Target Encoding

• Ordinal Data: The categories have an inherent order

• Nominal Data: The categories do not have an inherent order

When encoding ordinal data, one should retain the information about the order in which the categories appear. For example, the highest degree a person holds carries vital information about their qualification, and that degree can decide whether the person is suitable for a post or not.

When encoding nominal data, we only consider the presence or absence of a feature; no notion of order is present. Take, for example, the city a person lives in: it is important to retain where the person lives, but there is no order or sequence between the cities. Living in Delhi is neither greater nor less than living in Bangalore.

Drawbacks of One-Hot and Dummy Encoding

One-hot encoding and dummy encoding are two powerful and effective encoding schemes, and they are very popular among data scientists, but they may not be as effective when:
1. A large number of levels is present in the data. Each level needs its own dummy variable, so a column with 30 distinct values requires 30 new variables to encode.
2. The dataset has multiple categorical features, e.g. 10 or more categorical columns. The same blow-up happens for each of them, and we end up with a large number of binary features, each representing one categorical feature and its levels.
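A quick way to see the blow-up, assuming pandas is available (`get_dummies` is its standard dummy-variable helper; the zip-code values are made up):

```python
import pandas as pd

# A single feature with 30 distinct levels...
df = pd.DataFrame({"zip_code": [f"Z{i:02d}" for i in range(30)]})

# ...one-hot encodes into 30 new binary columns.
dummies = pd.get_dummies(df["zip_code"])
print(dummies.shape)  # (30, 30)

# Dummy encoding drops one reference level, leaving 29 columns.
dummies_ref = pd.get_dummies(df["zip_code"], drop_first=True)
print(dummies_ref.shape)  # (30, 29)
```

With 10 such columns, the feature matrix would gain hundreds of binary columns, which motivates the compact encodings listed earlier.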

Machine learning algorithm basics

ML-based techniques involve several steps. First, features are extracted by computing statistics over multiple packets of a flow (such as packet lengths, flow duration, or inter-packet arrival times) [17]. Then, where possible, the features are refined by feature-selection algorithms.
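As an illustrative sketch of that first step (not taken from [17]; the packet values and feature names are invented), the per-flow statistics can be computed from a list of (timestamp, packet-length) pairs:

```python
# Hypothetical feature extraction over the packets of one flow.
# Each packet: (arrival time in seconds, length in bytes).
packets = [(0.00, 60), (0.02, 1500), (0.05, 1500), (0.09, 40)]

times = [t for t, _ in packets]
lengths = [n for _, n in packets]

features = {
    "mean_packet_len": sum(lengths) / len(lengths),
    "flow_duration": times[-1] - times[0],
    # Mean inter-packet arrival time: total span over the number of gaps.
    "mean_iat": (times[-1] - times[0]) / (len(times) - 1),
}
print(features)
```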

Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering. Because of this, there is a very high probability that the model captures a spurious ordinal relationship between the categories, such as India < Japan < the US.

One-Hot Encoding is the process of creating dummy variables.


Challenges of One-Hot Encoding: Dummy Variable Trap

One-Hot Encoding results in the Dummy Variable Trap: the value of any one dummy variable can be predicted exactly from the remaining variables. The Dummy Variable Trap is therefore a scenario in which variables are highly correlated with each other.

The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs when there is a dependency between the independent features; it is a serious issue for machine learning models such as Linear Regression and Logistic Regression.
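The redundancy behind the trap is easy to verify in plain Python: the dummies for each observation sum to 1, so any one column equals 1 minus the sum of the others, and dropping a reference column (dummy encoding) removes that exact linear dependency. An illustrative sketch:

```python
cities = ["Delhi", "Bangalore", "Mumbai", "Delhi"]
levels = sorted(set(cities))

one_hot = [[1 if c == lvl else 0 for lvl in levels] for c in cities]

# Every row sums to 1, so the last column is fully determined by the rest.
assert all(sum(row) == 1 for row in one_hot)
recovered_last = [1 - sum(row[:-1]) for row in one_hot]
assert recovered_last == [row[-1] for row in one_hot]

# Dropping one reference column breaks this perfect collinearity.
dummy = [row[:-1] for row in one_hot]
print(dummy)  # two columns instead of three
```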

Reasons why categorical variables may need treatment before modelling:
• A categorical variable has too many levels, which pulls down the performance of the model. For example, a categorical variable "zip code" can have numerous levels.
• A categorical variable has levels that rarely occur. Many of these levels have minimal chance of making a real impact on model fit. For example, a variable "disease" might have some levels that occur only rarely.
• One level almost always occurs, i.e. for most of the observations in the data set there is only one level. Variables with such levels fail to make a positive impact on model performance due to very low variation.
• If the categorical variable is masked, deciphering its meaning becomes a laborious task. Such situations are commonly found in data science competitions.
• Categorical variables cannot be fitted into a regression equation in their raw form; they must be treated first.
• Most algorithms (or ML libraries) produce better results with numerical variables. In Python, the library "sklearn" requires features as numerical arrays. For example, I applied a random forest from sklearn to the Titanic data set (with only the two features sex and pclass as independent variables), and it returned an error because the feature "sex" is categorical and had not been converted to numerical form.
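A hedged sketch of the fix, assuming scikit-learn is installed (the toy data below merely stands in for the Titanic columns): map "sex" to integers before fitting, and the random forest trains without error.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the Titanic features "sex" and "pclass".
sex = ["male", "female", "female", "male", "female", "male"]
pclass = [3, 1, 2, 1, 3, 2]
survived = [0, 1, 1, 0, 1, 0]

# Fitting on the raw strings would raise an error, so convert "sex" first.
sex_num = [0 if s == "male" else 1 for s in sex]
X = list(zip(sex_num, pclass))

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, survived)
print(model.predict([(1, 1), (0, 3)]))  # predictions for two new passengers
```

For a real pipeline, sklearn's `OneHotEncoder` or pandas `get_dummies` would be the usual way to do this conversion instead of a hand-written mapping.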
