HKU - 7001 - 3.2 Managing Data II
NumPy defines numerical array and matrix types
[[5.2, 3.0, 4.5],
 [9.1, 0.1, 0.3]]
An array of rank 2, i.e., it has 2 axes: the first of length 2, the second of length 3 (a matrix with 2 rows and 3 columns)
The NumPy Arrays
• A NumPy array is a fast, flexible container for large data sets in
Python
• Before using NumPy, we need to import the numpy module
import numpy as np
Creating Arrays
• The easiest way to create an array is to use the array()
function
• This accepts any sequence-like object (e.g., a list, a tuple, or
nested sequences) and produces a new NumPy array containing the
data passed to it.
print(arr2)
[[1 2 3 4]
 [5 6 7 8]]
arr2.ndim        2 (dimension of the array)
arr1.dtype       dtype('float64')
arr2.dtype       dtype('int32')
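The slides do not show how arr1 and arr2 were built, so the following is only a minimal sketch. The values in arr1 are hypothetical; arr2 is chosen to match the printed output. The integer dtype reported for arr2 depends on the platform, so it may be int64 rather than int32.

```python
import numpy as np

# Hypothetical float data; the slides only show that arr1 has dtype float64.
arr1 = np.array([6.0, 7.5, 8.0, 0.0, 1.0])

# A nested list becomes a two-dimensional array (2 rows, 4 columns).
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr2)
print(arr2.ndim)    # number of axes
print(arr2.shape)   # length of each axis
print(arr1.dtype)   # float64
print(arr2.dtype)   # int32 or int64, depending on the platform
```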
Creating Arrays
• The size attribute returns the total number of items in the
array
• We can read the attribute from an array object, or call the
size function of the numpy module and pass the array as an argument
arr2.size
np.size(arr2)
8
Creating ndarrays
• We can convert an array from one shape to another without
copying any data. To do this, pass a tuple indicating the new
shape to the reshape method.
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
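The call that produced this output is not shown in the slides. A sketch that would give the same result, assuming the source array came from np.arange(8), might look like this:

```python
import numpy as np

arr = np.arange(8)              # assumed starting array: 0 through 7
reshaped = arr.reshape((4, 2))  # pass the new shape as a tuple
print(reshaped)

# For a contiguous array like this one, reshape returns a view,
# so both names refer to the same underlying data.
reshaped[0, 0] = 99
print(arr[0])
```

The new shape must hold the same number of items as the original (here 4 x 2 = 8).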
Creating Special Arrays
• In addition to array, there are a number of other special
functions for creating new arrays.
• As examples, zeros and ones create arrays of 0’s or 1’s,
respectively, with a given length or shape. empty creates an
array without initializing its values to any particular value.
• To create a higher dimensional array with these functions,
pass a tuple for the shape
Creating Special Arrays
np.zeros(10) array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.empty output (tail of a 3-D array; the values are arbitrary, uninitialized memory):
       [[7.17473078e-091, 4.42510289e-062],
        [4.31926418e-038, 4.19564746e+175],
        [6.48224659e+170, 5.82471487e+257]]])
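As a rough illustration of the three functions, the sketch below uses shapes chosen only for the example (they are assumptions, not taken from the slides):

```python
import numpy as np

zeros_1d = np.zeros(10)          # a length creates a one-dimensional array
ones_2d = np.ones((3, 6))        # a tuple creates a higher-dimensional array
empty_3d = np.empty((2, 3, 2))   # allocated but not initialized

print(zeros_1d)
print(ones_2d.shape)
print(empty_3d.shape)   # the contents of empty_3d are arbitrary
```

Because empty skips initialization, its contents should be overwritten before they are used.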
Creating an Array of Fixed Intervals
• arange is an array-valued version of the built-in Python
range function
np.arange(8)
array([0, 1, 2, 3, 4, 5, 6, 7])
np.arange(0,8,2)
array([0, 2, 4, 6])
array([0, 2, 4, 6, 8])
Creating an Array of Fixed Intervals
• arange and linspace are going to be quite useful when
we are making plots.
• They can be used to generate data for the X axis
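A short sketch of the difference between the two functions: arange takes a step and excludes the stop value, while linspace takes the number of points and includes the stop value by default. The y values here are a made-up example of data to plot.

```python
import numpy as np

x_step = np.arange(0, 8, 2)     # start, stop (excluded), step
x_even = np.linspace(0, 8, 5)   # start, stop (included), number of points

print(x_step)   # [0 2 4 6]
print(x_even)   # [0. 2. 4. 6. 8.]

# Hypothetical Y values computed from the X values, ready for plotting.
y = x_even ** 2
print(y)
```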
Creating a Random Array
• We can also use the rand function in the random module to
generate random data, drawn uniformly from [0, 1)
• Pass the dimensions of the array as arguments
np.random.rand(5)
np.random.rand(4, 2)
array([[0.02473339, 0.93360131],
[0.21580826, 0.29531976],
[0.44972526, 0.53165493],
[0.50354605, 0.35704748]])
Creating a Random Array
• We can also use the randn function in the random module to
draw data from a standard normal distribution
• Mean 0, variance 1
np.random.randn(5, 3)    (the arguments give the shape)
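The two random functions can be compared side by side. This sketch fixes a seed so that repeated runs give the same numbers; the seed is an addition for the example and is not part of the slides.

```python
import numpy as np

np.random.seed(0)   # assumed seed, only to make the example repeatable

uniform = np.random.rand(4, 2)   # 4 rows, 2 columns, values in [0, 1)
normal = np.random.randn(5, 3)   # 5 rows, 3 columns, standard normal draws

print(uniform.shape)
print(normal.shape)
print(normal.mean(), normal.std())   # near 0 and 1, but not exact for a small sample
```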
Mathematical and Statistical Methods
• Basic array statistical methods
arr = np.arange(10)
print(arr)          [0 1 2 3 4 5 6 7 8 9]
arr.mean()          np.mean(arr)
arr.sum()           np.sum(arr)
arr.std()           np.std(arr)
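A runnable version of these calls, showing that the method form and the function form agree. Note that std computes the population standard deviation by default (ddof=0).

```python
import numpy as np

arr = np.arange(10)

print(arr.mean(), np.mean(arr))   # 4.5 4.5
print(arr.sum(), np.sum(arr))     # 45 45
print(arr.std(), np.std(arr))     # both about 2.872
```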
Reading from and Writing to Text Files
• The loadtxt function reads text file data into a 2D array
• The savetxt function performs the inverse operation: writing
an array to a delimited text file.
values1 = np.random.random((10, 5))
np.savetxt(r'../data/nparray.txt', values1)
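To see the round trip, the sketch below writes the array and reads it back. It uses a temporary directory in place of the '../data' folder, which is an assumption made so the example runs anywhere.

```python
import os
import tempfile

import numpy as np

values1 = np.random.random((10, 5))

# A temporary directory stands in for the '../data' folder from the slides.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nparray.txt")
    np.savetxt(path, values1)    # one row per line, space-delimited by default
    loaded = np.loadtxt(path)    # read the file back into a 2D array

print(loaded.shape)
print(np.allclose(values1, loaded))
```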
Given: [8 8 3 7 7 0 4 2 5 2 2 2]
[ 2 5 6 5 8 13]
pandas
What is pandas?
• pandas contains high-level data structures and manipulation
tools designed to make data analysis fast and easy in
Python.
• pandas has two workhorse data structures: Series and
DataFrame.
• To use pandas, first import the module
import pandas as pd
Creating a Series
• A Series is a one-dimensional array-like object containing an
array of data (of any NumPy data type) and an associated
array of data labels, called its index.
• The simplest Series is formed from only an array of data
• We can use the Series constructor to create a Series object
0 4
1 7
2 -5
3 3
dtype: int64
Creating a Series
• The string representation of a series displayed interactively
shows the index on the left and the values on the right.
• By default, the index consists of the integers 0 through N – 1
• Often it will be desirable to create a series with an index
identifying each data point
a 4
b 7
c -5
d 3
dtype: int64
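The code behind the two outputs is not shown in the slides. A sketch that would reproduce them, with the variable names obj and obj2 assumed, could be:

```python
import pandas as pd

# Default index: the integers 0 through N - 1.
obj = pd.Series([4, 7, -5, 3])
print(obj)

# Explicit index: one label per data point, passed through index=.
obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
print(obj2)
```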
Creating a Series
• We can also transform a Python dictionary to a Series
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
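A minimal sketch of the dictionary case, assuming a dictionary whose keys and values match the output shown (the names sdata and obj3 are hypothetical):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# The dictionary keys become the index; the values become the data.
obj3 = pd.Series(sdata)
print(obj3)
```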
Selecting Index and Values in a Series
• We can get the array representation and the index object of a
Series via its values and index attributes.
• A series’ index can be altered in place by assignment
obj2.values            array([ 4, 7, -5, 3], dtype=int64)
obj2['a']              4
obj2['d'] = 6
obj2[['c','a','d']]
c   -5
a    4
d    6
dtype: int64
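The selections above can be tried with the sketch below. The construction of obj2 is assumed from the earlier output, and the replacement labels 'w' to 'z' are made up to show index assignment.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

print(obj2.values)   # the underlying NumPy array
print(obj2.index)    # the Index object holding the labels

obj2['d'] = 6                     # change one value by its label
subset = obj2[['c', 'a', 'd']]    # select several labels with a list
print(subset)

# Assigning to .index replaces the labels in place (hypothetical labels).
relabeled = obj2.copy()
relabeled.index = ['w', 'x', 'y', 'z']
print(relabeled)
```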
Series Operations
• Array operations, such as filtering with a Boolean array,
scalar multiplication, or applying math functions, will
preserve the index-value link
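Filtering with a Boolean array (positive values only):
a    4
b    7
d    6
dtype: int64
Scalar multiplication (by 2):
a     8
b    14
c   -10
d    12
dtype: int64
Applying a math function (exponential):
a      54.598150
b    1096.633158
c       0.006738
d     403.428793
dtype: float64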
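The expressions that produced these outputs are not shown. The sketch below is one plausible set, assuming obj2 holds the values 4, 7, -5 and 6 as in the previous slide:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 6], index=['a', 'b', 'c', 'd'])

positive = obj2[obj2 > 0]    # filter with a Boolean array
doubled = obj2 * 2           # scalar multiplication
exponent = np.exp(obj2)      # NumPy math function applied element-wise

print(positive)
print(doubled)
print(exponent)
```

In each result the labels stay attached to their values.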
Replacing Values
• The replace method provides a simple and flexible way to
modify a subset of values
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
import numpy as np
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
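replace also accepts a list of values or a dictionary of old-to-new pairs. The sketch below extends the example beyond what the slide shows; the choice to treat -1000 as a second sentinel is an assumption for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Replace several values with the same replacement.
cleaned = data.replace([-999, -1000], np.nan)
print(cleaned)

# Replace each value with its own replacement using a dictionary.
mapped = data.replace({-999: np.nan, -1000: 0})
print(mapped)
```

replace returns a new Series; the original data is left unchanged.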
DataFrame
• A DataFrame represents a tabular, spreadsheet-like data
structure containing an ordered collection of columns, each
of which can be a different value type (numeric, string,
Boolean, etc.).
• The DataFrame has both a row and column index; it can be
thought of as a dictionary of Series, all sharing the
same index.
Creating a DataFrame
• One of the most common ways to create a DataFrame is from
a dictionary of equal-length lists
• The resulting DataFrame will have its index assigned
automatically as with Series
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame1 = pd.DataFrame(data)
frame1
pd.DataFrame(data, columns =
['year', 'state', 'pop'])
frame1.describe()
Adding a Column to a DataFrame
• If you pass a column that isn’t contained in data, it will
appear with NaN values in the result
frame2 = pd.DataFrame(data,
columns = ['year', 'state', 'pop', 'debt'],
index = ['one', 'two', 'three', 'four', 'five'])
frame2
• We can retrieve a column either by its label, dictionary-style, or
as an attribute
frame2['state'] frame2.state
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
Selecting Index and Values from a DataFrame
frame2.loc['three'] frame2.iloc[2]
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
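Putting the pieces together, a self-contained sketch using the data from the slides shows column selection next to row selection. loc selects a row by its label and iloc by its integer position.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key in data, so the column is filled with NaN.
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2['state']     # column by label
by_attr = frame2.state         # the same column as an attribute

row_by_label = frame2.loc['three']   # row by index label
row_by_pos = frame2.iloc[2]          # row by integer position (third row)
print(row_by_label)
```

Attribute access only works when the column name is a valid Python identifier that does not clash with a DataFrame method, so the label form is the safer default.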
Modify Values in a DataFrame
• Values in columns can be modified by assignment
• For example, the empty 'debt' column could be assigned a
scalar value or an array of values
frame2.debt = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
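The slide shows the scalar case only. The sketch below adds the array case; the values from np.arange are an arbitrary choice for illustration. It uses the bracket form of assignment, which works for new columns as well as existing ones.

```python
import numpy as np
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

frame2['debt'] = 16.5               # a scalar is repeated for every row
scalar_debt = frame2['debt'].tolist()

frame2['debt'] = np.arange(5.0)     # an array must match the number of rows
print(frame2)
```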
advertising.csv (the first line is the header; the remaining lines are data):
,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
Reading from a CSV file
df = pd.read_csv(r'../advertising.csv')
df
type(df) pandas.core.frame.DataFrame
Reading from a CSV file
• If the data in the file already has an index column, you can
specify it through index_col =
df = pd.read_csv(r'../data/advertising.csv',
index_col = 0)
df
Reading from a CSV file
• Use head or tail to narrow down the view of data
• The argument is the number of observations you want to
view. Default is 5
df.head(3)
df.tail(2)
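To try these calls without the file on disk, the sketch below feeds the five sample rows to read_csv through io.StringIO. The in-memory text is a stand-in for advertising.csv, which is an assumption made for the example.

```python
import io

import pandas as pd

# Inline copy of the five sample rows; StringIO makes the string behave like a file.
csv_text = """,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

first_rows = df.head(3)   # first three observations
last_rows = df.tail(2)    # last two observations
print(first_rows)
print(last_rows)
```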
Reading from a CSV file
• By default, read_csv reads the first row as the header
• We can specify that the data has no header with header = None
• We can also assign column names by passing a list to the
argument names =
pd.read_csv(r'../data/advertising.csv',
index_col = 0, header = None).head(3)
pd.read_csv(r'../data/advertising.csv',
index_col = 0, names = ['col1', 'col2',
'col3', 'col4']).head(3)
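The effect of header and names is easiest to see on small in-memory data. In this sketch the headerless text, the 'id' name for the index column, and the use of header=0 to replace an existing header row are assumptions added for illustration; they are not taken from the slides.

```python
import io

import pandas as pd

body = "1,230.1,37.8,69.2,22.1\n2,44.5,39.3,45.1,10.4\n3,17.2,45.9,69.3,9.3\n"
with_header = ",TV,radio,newspaper,sales\n" + body

# No header row in the data: columns get integer labels.
no_header = pd.read_csv(io.StringIO(body), header=None, index_col=0)

# No header row, but names supplied for every column (including the index column).
named = pd.read_csv(io.StringIO(body), index_col=0,
                    names=['id', 'TV', 'radio', 'newspaper', 'sales'])

# The data has a header row that we want to replace: pass header=0 with names.
replaced = pd.read_csv(io.StringIO(with_header), header=0, index_col=0,
                       names=['id', 'col1', 'col2', 'col3', 'col4'])

print(no_header)
print(named)
print(replaced)
```

Without header=0, a header row that is present in the file is read as an ordinary data row when names is given.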
Writing to a CSV file
• Using DataFrame’s to_csv method, we can write the data
out to a CSV file
frame.to_csv(r'../data/states.csv')
,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
(The first column name is empty since it’s the index.)
Writing to a CSV file
• By default, the delimiter is a comma, the first row is the header,
and the index is written as the first column
• We can also specify the delimiter by sep =
• We can specify no header by header = False
• We can also specify no index by index = False
frame.to_csv(r'../data/states1.csv',
sep = '|', index = False, header = False)
Ohio|2000|1.5
Ohio|2001|1.7
Ohio|2002|3.6
Nevada|2001|2.4
Nevada|2002|2.9
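When to_csv is called without a path it returns the text instead of writing a file, which makes the options easy to compare. In this sketch the frame is rebuilt from the state data used earlier, which is assumed to be what the slides call frame.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'year': [2000, 2001, 2002, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

default_text = frame.to_csv()   # comma-delimited, with header and index
pipe_text = frame.to_csv(sep='|', index=False, header=False)

print(default_text)
print(pipe_text)
```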
Summary
• Creating a series
• Creating a dataframe
• Modifying a dataframe
• Removing duplicates
• Reading from and writing to csv files
Exercise 3
• Given a dataframe of three columns, write a program to add
a 4th column which is a lag of the 1st column, and a 5th column
which is a lead of the 2nd column
Exercise 4
• Read “advertising.csv”. Take the first 5 rows, and add a
‘social’ column and populate the column with 5 random
values rounded to one decimal place. Then write the new
data to a csv file named “ads_new.csv”
More Exercises
• NumPy and pandas are a big universe. We can only scratch the
surface to get you started
• Practice is the only way to progress
• Find more exercises and solutions below
• NumPy: https://github.com/rougier/numpy-100
• pandas:
https://www.machinelearningplus.com/python/101-pandas-exercises-python/
Before We Move On
• Managing Data: Text, CSV, JSON; Regular Expression; NumPy; Pandas
• StatsModels
• Data Visualization: Tableau; Matplotlib
Install BeautifulSoup 4
• To prepare for the coming sessions, you need to install a
powerful Python package called BeautifulSoup 4.
• Follow the instructions on the following page to
download and install the package
https://www.crummy.com/software/BeautifulSoup/
• The latest version is 4.9
• To test whether you have successfully installed the package, run
the following code in Python. If it runs without raising an error,
then it’s installed.
from bs4 import BeautifulSoup