
Introducing Pandas String Operations & Plots

The document introduces Pandas string operations, highlighting the advantages of vectorization for handling string data and missing values compared to Numpy. It details various Pandas string methods that mirror Python's built-in string methods, including regex functionalities, and provides examples of their usage. Additionally, it covers plotting capabilities in Pandas with Matplotlib, showcasing different types of plots such as line, bar, histogram, and scatter plots.

Introducing Pandas String Operations

 Vectorization of operations simplifies the syntax of operating on arrays of data.
 For arrays of strings, NumPy does not provide such simple access, and thus you're stuck
using a more verbose (long-winded) loop syntax.
# In[1]
data=['peter','Paul','MARY','gUIDO']
[s.capitalize() for s in data]
# Out[1]
['Peter', 'Paul', 'Mary', 'Guido']
 This is perhaps sufficient to work with some data, but it will break if there are any
missing values, so this approach requires putting in extra checks.
# In[2]
data=['peter','Paul',None,'MARY','gUIDO']
[s if s is None else s.capitalize() for s in data]
# Out[2]
['Peter', 'Paul', None, 'Mary', 'Guido']
 Pandas includes features to address both this need for vectorized string operations as
well as the need for correctly handling missing data via the str attribute of Pandas
Series and Index objects containing strings.
# In[3]
import pandas as pd
names=pd.Series(data)
names.str.capitalize()
# Out[3]
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
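As a minimal sketch of the two points above (missing values are skipped rather than raising an error, and Index objects expose the same str attribute), the following reuses the sample names. The variable names lengths, idx and upper_idx are illustrative and do not come from the original notes.

```python
import pandas as pd

# Same sample data as above; None marks a missing value.
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The missing entry is propagated as a missing value instead of raising.
lengths = names.str.len()
print(lengths)

# An Index of strings has the same .str accessor as a Series.
idx = pd.Index(['peter', 'Paul', 'MARY'])
upper_idx = idx.str.upper()
print(upper_idx)
```

No explicit None check is needed here, unlike in the list comprehension version.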
Tables of Pandas String Methods
Methods Similar to Python String Methods
 Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized
string method, for example lower(), len(), startswith() and split().
# In[4]
monte=pd.Series(['Graham Chapman','John Cleese','Terry Gilliam',
'Eric Idle','Terry Jones','Michael Palin'])

# In[5]
monte.str.lower()
# Out[5]
0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object

# In[6]
monte.str.len()
# Out[6]
0 14
1 11
2 13
3 9
4 11
5 13
dtype: int64

# In[7]
monte.str.startswith('T')
# Out[7]
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool

# In[8]
monte.str.split()
# Out[8]
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
Methods Using Regular Expressions
 There are several methods that accept regular expressions (regexps) to examine the
content of each string element, and follow some of the API conventions of Python's
built-in re module.
Mapping between Pandas methods and functions in Python's re module
Method Description
match Calls re.match on each element, returning a Boolean
extract Calls re.search on each element, returning matched groups as strings
findall Calls re.findall on each element
replace Replaces occurrences of pattern with some other string
contains Calls re.search on each element, returning a Boolean
count Counts occurrences of pattern
split Equivalent to str.split, but accepts regexps
rsplit Equivalent to str.rsplit, but accepts regexps
 With these, we can do a wide range of operations.
# In[9]
monte.str.extract('([A-Za-z]+)',expand=False)
# Out[9]
0 Graham
1 John
2 Terry
3 Eric
4 Terry
5 Michael
dtype: object

# In[10]
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
# Out[10]
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object
 Here the start-of-string (^) and end-of-string ($) anchors are used so that the pattern
must cover the whole name: it finds names that start and end with a consonant.
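A few of the other regexp methods from the table can be sketched on the same Series. This is only an illustration: the patterns and the variable names has_terry, vowel_counts and initials are assumptions chosen for the example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains: True where the pattern is found in the string.
has_terry = monte.str.contains(r'^Terry')

# count: number of lowercase vowels in each name.
vowel_counts = monte.str.count(r'[aeiou]')

# replace: shorten the first name to an initial. regex=True is passed
# explicitly because the default has differed between pandas versions.
initials = monte.str.replace(r'^(\w)\w*', r'\1.', regex=True)

print(has_terry.tolist())
print(vowel_counts.tolist())
print(initials.tolist())
```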
If you want to know more about Pandas string methods and regular expressions, refer to
these URLs:
1. About string methods
2. About regular expressions
Miscellaneous Methods
Other Pandas string methods
Method Description
get Indexes each element
slice Slices each element
slice_replace Replaces slice in each element with the passed value
cat Concatenates strings
repeat Repeats values
normalize Returns Unicode form of strings
pad Adds whitespace to left, right, or both sides of strings
wrap Splits long strings into lines with length less than a given width
join Joins strings in each element of the Series with the passed separator
get_dummies Extracts dummy variables as a DataFrame
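The following sketch exercises several methods from the table above. The arguments (a width of 16, a '.' fill character, the '_' separator) are arbitrary values picked for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# get: index each element (here, the first character).
first_chars = monte.str.get(0)

# slice_replace: keep the first character and replace the rest with '*'.
masked = monte.str.slice_replace(1, None, '*')

# pad: left-pad every string to a width of 16 with dots.
padded = monte.str.pad(16, side='left', fillchar='.')

# repeat: repeat each string twice.
repeated = pd.Series(['ab', 'c']).str.repeat(2)

# join: join the list held in each element with the given separator.
joined = monte.str.split().str.join('_')

# cat: with only a separator, concatenate the whole Series into one string.
combined = monte.str.cat(sep=', ')

print(first_chars.tolist())
print(joined.tolist())
print(combined)
```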
Vectorized item access and slicing
 The get and slice operations, in particular, enable vectorized element access from each
string in the Series.
 We can get a slice of the first three characters of each string using str.slice(0,3).
 This behavior is also available through Python's normal indexing
syntax; df.str.slice(0,3) is equivalent to df.str[0:3].
# In[11]
monte.str[0:3]
# Out[11]
0 Gra
1 Joh
2 Ter
3 Eri
4 Ter
5 Mic
dtype: object
 Indexing via df.str.get(i) and df.str[i] is likewise equivalent.
 These indexing methods also let you access elements of the lists returned by split:
# In[12]
monte.str.split().str[-1]
# Out[12]
0 Chapman
1 Cleese
2 Gilliam
3 Idle
4 Jones
5 Palin
dtype: object
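The equivalences described above can be checked directly. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# slice(0, 3) and [0:3] give the same first three characters.
by_method = monte.str.slice(0, 3)
by_index = monte.str[0:3]

# get(-1) and [-1] give the same last element of each split list.
last_by_get = monte.str.split().str.get(-1)
last_by_index = monte.str.split().str[-1]

print(by_method.equals(by_index))
print(last_by_get.equals(last_by_index))
```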
Indicator variables
 The get_dummies method is useful when your data has a column containing some sort of
coded indicator.
# In[13]
full_monte=pd.DataFrame({'name':monte,
'info':['B|C|D','B|D','A|C','B|D','B|C','B|C|D']})
full_monte
# Out[13]
name info
0 Graham Chapman B|C|D
1 John Cleese B|D
2 Terry Gilliam A|C
3 Eric Idle B|D
4 Terry Jones B|C
5 Michael Palin B|C|D
 The get_dummies routine lets us split out these indicator variables into a DataFrame.
# In[14]
full_monte['info'].str.get_dummies('|')
# Out[14]
A B C D
0 0 1 1 1
1 0 1 0 1
2 1 0 1 0
3 0 1 0 1
4 0 1 1 0
5 0 1 1 1
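One possible next step, sketched here as an assumption about how the indicator columns might be used, is to join them back onto the original DataFrame and filter on a code. The names dummies, labelled and has_c are illustrative, and the codes are written without spaces around the '|' separator.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# One indicator column per code found in the 'info' strings.
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index).
labelled = full_monte.join(dummies)

# Select the names whose info string contains code 'C'.
has_c = labelled.loc[labelled['C'] == 1, 'name']
print(has_c.tolist())
```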
 With these operations as building blocks, you can construct an endless range of string
processing procedures when cleaning your data.

Plotting with pandas and matplotlib: Line Plots, Bar Plots, Histograms and Density
Plots, Scatter or Point Plots.
The matplotlib library provides many types of plots, so we can choose a graph that suits
the data. With pandas, we can first put the data into a Series or DataFrame and then plot
it directly. Let's discuss the different types of plots available through pandas and
matplotlib.
Use these commands to install matplotlib, pandas and numpy:
pip install matplotlib
pip install pandas
pip install numpy

Types of Plots:
 Basic plotting: Plot randomly generated data as a line graph using a Series and
matplotlib.

# import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# 1000 random values indexed by consecutive dates
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()  # the running total gives a random walk
ts.plot()

plt.show()
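Series.plot() returns a matplotlib Axes object, which can then be used to add a title and axis labels. The sketch below assumes a fixed seed and the non-interactive Agg backend so that it runs without a display; the title and label text are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the plot is reproducible
ts = pd.Series(rng.standard_normal(1000),
               index=pd.date_range('1/1/2000', periods=1000)).cumsum()

# plot() returns the Axes it drew on, so it can be customized afterwards.
ax = ts.plot(title='Cumulative sum of random steps')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
```

To view the figure interactively, drop the matplotlib.use("Agg") line and call plt.show() at the end.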

 Plot of different data: Plot more than one column of data on the same graph.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# date index shared by all four columns
index = pd.date_range('1/1/2000', periods=1000)

df = pd.DataFrame(np.random.randn(1000, 4),
                  index=index, columns=list('ABCD'))

df = df.cumsum()
df.plot()  # one line per column, drawn on a new figure
plt.show()

 Plot on given axis: We can explicitly choose which columns supply the x-axis and y-axis
values and plot the data on that basis.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# two random walks in columns B and C
df3 = pd.DataFrame(np.random.randn(1000, 2),
                   columns=['B', 'C']).cumsum()

# column A holds the row number and is used as the x-axis
df3['A'] = pd.Series(list(range(len(df3))))
df3.plot(x='A', y='B')
plt.show()
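A related technique, sketched here as an addition rather than something the notes themselves show, is to draw onto an existing matplotlib Axes with the ax parameter of DataFrame.plot(). The two-panel layout, the seed and the titles are assumptions for illustration, and the Agg backend is used so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((1000, 2)),
                  columns=['B', 'C']).cumsum()
df['A'] = np.arange(len(df))  # row number used as the x-axis

# Create two side-by-side Axes and tell pandas which one to draw on.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
df.plot(x='A', y='B', ax=ax_left, title='B against A')
df.plot(x='A', y='C', ax=ax_right, title='C against A')
```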

 Bar plot using matplotlib: Bar plots compare values across categories. Here a single
row of the DataFrame is drawn with one bar per column.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# two random walks in columns B and C
df3 = pd.DataFrame(np.random.randn(1000, 2),
                   columns=['B', 'C']).cumsum()

df3['A'] = pd.Series(list(range(len(df3))))
df3.iloc[5].plot.bar()       # bars for the values of row 5
plt.axhline(0, color='k')    # black horizontal line at y = 0

plt.show()
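Other bar plot variants can be sketched in the same way. The example below assumes a small DataFrame of random values and shows grouped, stacked and horizontal bars; the data and variable names are illustrative, and the Agg backend is used so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((5, 3)), columns=['a', 'b', 'c'])

# Grouped bars: one group per row, one bar per column.
ax_grouped = df.plot.bar()

# Stacked bars: the columns of each row are stacked on top of each other.
ax_stacked = df.plot.bar(stacked=True)

# Horizontal stacked bars.
ax_horizontal = df.plot.barh(stacked=True)
```

Each call draws on its own new figure; 5 rows and 3 columns give 15 bars per plot.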

 Histograms: Plot the distribution of the values in each column; alpha makes the
overlapping bars translucent.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# three normal distributions centred on 1, 0 and -1
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1,
                    'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1},
                   columns=['a', 'b', 'c'])

df4.plot.hist(alpha=0.5)
plt.show()

 Box plot using DataFrame and matplotlib: Use plot.box() to draw a box plot for each
column of a DataFrame.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])

df.plot.box()
plt.show()

 Density plot: plot.kde() draws a kernel density estimate of the data. It requires SciPy
(pip install scipy) in addition to the libraries installed above.

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

ser = pd.Series(np.random.randn(1000))

ser.plot.kde()

plt.show()

 Area plot using matplotlib:

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])

df.plot.area()
plt.show()

 Scatter plot:

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(500, 4),
                  columns=['a', 'b', 'c', 'd'])

df.plot.scatter(x='a', y='b')

plt.show()

 Hexagonal Bin Plot:

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])

# np.arange (not np.arrange) adds an increasing offset to column a
df['a'] = df['a'] + np.arange(1000)

df.plot.hexbin(x='a', y='b', gridsize=25)
plt.show()

 Pie plot:

# importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

series = pd.Series(3 * np.random.rand(4),
                   index=['a', 'b', 'c', 'd'], name='series')

series.plot.pie(figsize=(4, 4))

plt.show()

