What - Why: Dummy Variables

Dummy variables are used to represent categorical variables in machine learning models that cannot process categorical data directly. There are two main techniques for converting categorical variables to dummy variables: label encoding and one-hot encoding. Label encoding assigns integer values to each category, but this can incorrectly imply an ordering between categories. One-hot encoding creates a new binary feature for each unique category value and avoids any ordering implications. It is preferable when the categorical variable is nominal rather than ordinal.

Uploaded by Naing Naing

Dummy Variables

• What - A dummy variable is a numerical variable that represents a categorical variable.
• Why - Many machine learning algorithms cannot work with categorical variables directly, so the categories must first be converted to numbers.
• How - There are multiple ways of handling categorical variables:
1. Label Encoding
2. One-Hot Encoding

Label Encoding
• Each categorical label is simply assigned a unique integer.

Before encoding:

Country   Age   Salary
India     44    32000
US        34    33400
Japan     43    45000
US        23    23000
Japan     23    67000

After label encoding:

Country   Age   Salary
0         44    32000
2         34    33400
1         43    45000
2         23    23000
1         23    67000

• Label encoding is an effective technique when the categorical data is ordinal.
• Challenge - Country is a nominal variable with no inherent ordering, but label encoding assigns ranks to the countries. For example, here India (0) < Japan (1) < US (2).
• A model may treat this artificial ordering as meaningful, which affects model interpretation.
• We can use one-hot encoding to overcome this.
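
The label-encoding step can be sketched in a few lines of plain Python. This is only an illustration, not a definitive implementation: the helper name `label_encode` is hypothetical, and assigning integers in alphabetical order is an assumption chosen because it reproduces the codes in the table above (India = 0, Japan = 1, US = 2).

```python
# Country column from the example table.
countries = ["India", "US", "Japan", "US", "Japan"]


def label_encode(values):
    """Assign each distinct category an integer code.

    Assumption: codes follow alphabetical order of the category names.
    """
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    codes = [mapping[value] for value in values]
    return codes, mapping


codes, mapping = label_encode(countries)
print(mapping)  # {'India': 0, 'Japan': 1, 'US': 2}
print(codes)    # [0, 2, 1, 2, 1]
```

The integers are arbitrary identifiers, yet a model that treats them as numbers will read US (2) as "greater than" India (0), which is the ranking problem described above.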
One-Hot Encoding
• One-hot encoding represents a categorical variable as a set of binary vectors.
• It creates one additional feature for each unique label in the categorical feature.

Before encoding:

Country   Age   Salary
India     44    32000
US        34    33400
Japan     43    45000
US        23    23000
Japan     23    67000

After one-hot encoding:

Country.India   Country.Japan   Country.US   Age   Salary
1               0               0            44    32000
0               0               1            34    33400
0               1               0            43    45000
0               0               1            23    23000
0               1               0            23    67000

• Three new features are added in place of Country.
• This solves the ranking problem, as each category is represented by its own binary vector.
• Apply this technique when the categorical data is not ordinal.
• Challenge - If the number of categories is high, one-hot encoding can lead to high dimensionality.
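
As a rough sketch of how the one-hot columns can be built, the plain-Python example below creates one 0/1 column per unique country. The helper name `one_hot_encode` is hypothetical, and the alphabetical column order is an assumption; the `Country.<label>` column names follow the table above.

```python
# Country column from the example table.
countries = ["India", "US", "Japan", "US", "Japan"]


def one_hot_encode(values, prefix):
    """Return column names and one binary row per input value."""
    categories = sorted(set(values))  # assumption: alphabetical column order
    columns = [f"{prefix}.{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


columns, rows = one_hot_encode(countries, "Country")
print(columns)  # ['Country.India', 'Country.Japan', 'Country.US']
for country, row in zip(countries, rows):
    print(country, row)
```

Each row contains exactly one 1, so no category is ranked above another. The number of columns equals the number of unique labels, which is why a feature with many categories leads to high dimensionality.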
Note: For One-Hot Encoding

• A regression model does not actually need all the dummy variables.
• It does not need the final dummy variable, because that information can be deduced from the combination of all the other dummy variables: if every other dummy is 0, the observation must belong to the remaining category.
• To avoid multicollinearity, drop one dummy variable (use n-1 of them for model building).
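
The n-1 idea can be illustrated with a short plain-Python sketch. The helper names `dummy_encode` and `recover` are hypothetical, and dropping the last category in alphabetical order is an assumption made to match "the final dummy variable" in the note above; any single category could serve as the dropped baseline.

```python
# Country column from the example table.
countries = ["India", "US", "Japan", "US", "Japan"]


def dummy_encode(values, prefix):
    """One-hot encode, keeping n-1 columns; the last category is the baseline."""
    categories = sorted(set(values))
    kept, baseline = categories[:-1], categories[-1]
    columns = [f"{prefix}.{category}" for category in kept]
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return columns, rows, baseline


def recover(row, columns, baseline, prefix):
    """Deduce the original category from the n-1 dummies."""
    for name, flag in zip(columns, row):
        if flag == 1:
            return name[len(prefix) + 1:]  # strip the "Country." prefix
    return baseline  # all zeros means the dropped category


columns, rows, baseline = dummy_encode(countries, "Country")
print(columns)   # ['Country.India', 'Country.Japan']
print(baseline)  # US
print(rows)      # [[1, 0], [0, 0], [0, 1], [0, 0], [0, 1]]

recovered = [recover(row, columns, baseline, "Country") for row in rows]
print(recovered == countries)  # True
```

Because every original value can be recovered from two columns, the third column carries no extra information. In pandas, `get_dummies` offers a `drop_first` parameter for the same purpose, although it drops the first category rather than the last.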
