Unit 3 Categorical - Data

This document discusses categorical data in pandas. Categorical data takes on a limited, fixed set of values and does not support numerical operations. It is useful for string variables with a small set of values, for variables whose logical order differs from their lexical order, and as a signal to other libraries about the variable type. The document describes how to create categorical data using Series and the Categorical constructor. It also covers describing, renaming, adding and removing categories, and comparing categorical data.

Uploaded by

Vatsal Bhalani

L.J. Institute of Engineering & Technology Semester: V (2022) PDS (3150713)

Categorical data
Categorical variables can take on only a limited, and usually fixed, number of possible values.
Besides having a fixed set of values, categorical data may have an order, but numerical
operations on it are not possible. Categorical is a pandas data type.

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a
categorical variable will save some memory.

The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By
converting to a categorical and specifying an order on the categories, sorting and min/max will
use the logical order instead of the lexical order.

As a signal to other Python libraries that this column should be treated as a categorical variable
(e.g. to use suitable statistical methods or plot types).
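
The first two cases above can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas version; the data and the variable names (sizes, size_dtype and so on) are made up for illustration and do not come from the notes.

```python
import pandas as pd

# Hypothetical data: many rows, but only three distinct strings.
sizes = pd.Series(["small", "medium", "large"] * 1000)

# Case 1: memory. The categorical version stores each distinct string once,
# plus a small integer code per row.
as_object = sizes.astype(object)
as_category = sizes.astype("category")
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Case 2: logical order. Plain strings sort alphabetically, while an ordered
# categorical sorts by the order given in `categories`.
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordered_sizes = sizes.astype(size_dtype)
lexical = sorted(sizes.unique())
logical = list(ordered_sizes.sort_values().unique())
print(lexical)   # ['large', 'medium', 'small']
print(logical)   # ['small', 'medium', 'large']
print(ordered_sizes.min(), ordered_sizes.max())   # small large
```

The exact byte counts depend on the pandas version and platform, so they are printed rather than stated here.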

Object Creation

A categorical object can be created in multiple ways, as described below:

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print(s)
Its output is as follows:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
The number of elements passed to the series object is four, but the categories are only three.
Observe the same in the output Categories.
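
To make the difference between elements and categories concrete, the sketch below inspects both. It uses the categories and codes attributes of the .cat accessor, which are part of pandas but are not used in the example above.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

print(len(s))                    # number of elements: 4
print(list(s.cat.categories))    # distinct categories: ['a', 'b', 'c']
# Each element is stored as an integer code that points into the categories.
print(list(s.cat.codes))         # [0, 1, 2, 0]
```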

pd.Categorical
Using the standard pandas Categorical constructor, we can create a category object.
pandas.Categorical(values, categories, ordered)

Let’s take an example

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])


print(cat)
Its output is as follows:
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]

Let’s have another example,

import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'], categories=['c', 'b', 'a'])

print(cat)
Its output is as follows:
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument, categories, specifies the allowed categories. Any value that is not
present in the categories is treated as NaN.

Now, take a look at the following example,


import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'], categories=['c', 'b', 'a'], ordered=True)

print(cat)
Its output is as follows:

[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]

Logically, the order means that a is greater than b and b is greater than c.
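
The effect of the order can be seen in min/max, comparisons and sorting. This is a minimal sketch that reuses the categories from the example above but leaves out the value 'd', so that no NaN is involved.

```python
import pandas as pd

cat = pd.Categorical(
    ["a", "b", "c", "a", "b", "c"],
    categories=["c", "b", "a"],
    ordered=True,
)

print(cat.min(), cat.max())      # c a
# Ordering comparisons follow the category order, not the alphabet.
print(list(cat > "b"))           # True only where the value is 'a'
print(list(cat.sort_values()))   # ['c', 'c', 'b', 'b', 'a', 'a']
```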

Describe()
Using the .describe() method on categorical data, we get output similar to that of
a Series or DataFrame of type string.
For numeric data, describe() reports basic statistics such as percentiles, mean and std. For
categorical data it reports count, unique, top (the most frequent value) and freq instead.

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])


df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

print(df.describe())
print(df["cat"].describe())

Its output is as follows:


       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Renaming Categories
Renaming categories is done with the series.cat.rename_categories() method, which returns a
new Series. (Assigning directly to the series.cat.categories property is no longer supported
in pandas 2.0 and later.)
import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])
print(s.cat.categories)
Its output is as follows:
Index(['Group a', 'Group b', 'Group c'], dtype='object')
The initial categories [a, b, c] are replaced by the new names passed to rename_categories().
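
The rename_categories method of the .cat accessor accepts a dictionary as well as a list, in which case only the listed categories are renamed. A minimal sketch; the new label is made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Rename only 'a'; 'b' and 'c' keep their names.
renamed = s.cat.rename_categories({"a": "Group a"})
print(list(renamed.cat.categories))   # ['Group a', 'b', 'c']
print(list(renamed))                  # ['Group a', 'b', 'c', 'Group a']
```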

Appending New Categories


Using the Categorical.add_categories() method, new categories can be appended.
import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.add_categories([4])
print(s.cat.categories)
Its output is as follows:

Index(['a', 'b', 'c', 4], dtype='object')
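
One reason to append a category is that a categorical Series rejects values outside its categories. The sketch below assumes that behaviour (recent pandas versions raise TypeError, older ones raised ValueError) and uses a made-up new category 'd'.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Assigning a value that is not yet a category is expected to fail.
try:
    s[0] = "d"
except (TypeError, ValueError) as exc:
    print("rejected:", exc)

# After the category is added, the same assignment succeeds.
s = s.cat.add_categories(["d"])
s[0] = "d"
print(list(s))                  # ['d', 'b', 'c', 'a']
print(list(s.cat.categories))   # ['a', 'b', 'c', 'd']
```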

Removing Categories
Using the Categorical.remove_categories() method, unwanted categories can be removed.
import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print("Original object:")
print(s)

print("After removal:")
print(s.cat.remove_categories("a"))
Its output is as follows:
Original object:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

After removal:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (2, object): [b, c]
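
Two details of the output above can be verified in code: remove_categories returns a new object and leaves the original untouched, and the rows that held the removed category become missing values. A minimal sketch; the name removed is chosen here for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
removed = s.cat.remove_categories("a")

# The original Series still has all three categories.
print(list(s.cat.categories))         # ['a', 'b', 'c']
print(list(removed.cat.categories))   # ['b', 'c']
# Rows that held the removed category are now NaN.
print(int(removed.isna().sum()))      # 2
```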

Comparison of Categorical Data


Comparing categorical data with other objects is possible in three cases:
• comparing equality (== and !=) to a list-like object (list, Series, array, ...) of the same
length as the categorical data.
• all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical
Series, when ordered==True and the categories are the same.
• all comparisons of categorical data to a scalar.

Take a look at the following example

import pandas as pd

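# astype("category", categories=..., ordered=...) is no longer supported;
# build the dtype explicitly with CategoricalDtype instead.
cat_type = pd.CategoricalDtype(categories=[1,2,3], ordered=True)

cat = pd.Series([1,2,3]).astype(cat_type)
cat1 = pd.Series([2,2,2]).astype(cat_type)

print(cat > cat1)

Its output is as follows: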

0    False
1    False
2     True
dtype: bool
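
Each of the three comparison cases listed earlier can be sketched with the same data. This is an illustration rather than an exhaustive test; the dtype name grade_dtype and the list [1, 2, 1] are assumptions made for the example.

```python
import pandas as pd

grade_dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
cat = pd.Series([1, 2, 3]).astype(grade_dtype)
cat1 = pd.Series([2, 2, 2]).astype(grade_dtype)

# Case 1: equality against a list-like object of the same length.
print(list(cat == [1, 2, 1]))   # [True, True, False]
# Case 2: ordering against another categorical with the same categories.
print(list(cat >= cat1))        # [False, True, True]
# Case 3: comparison against a scalar that is one of the categories.
print(list(cat < 3))            # [True, True, False]
```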
