HKU - 7001 - 3.2 Managing Data II


Managing Data II

MSBA7001 Business Intelligence and Analytics


HKU Business School
The University of Hong Kong

Instructor: Dr. DING Chao


Agenda
• SciPy
• NumPy
• Pandas
SciPy
What is SciPy?
• SciPy (pronounced /saɪpaɪ/) is a Python-based ecosystem of
open-source software for mathematics, science, and
engineering.
• The SciPy ecosystem includes general and specialized tools
for data management and computation, productive
experimentation and high-performance computing.
• It offers over 1000 modules/packages for Python
The SciPy Ecosystem

[Figure: the SciPy ecosystem, with three callouts: one package defines numerical array and matrix types; one provides high-performance, easy-to-use data structures; one makes Jupyter Notebook possible.]
NumPy
What is the problem with lists?
• Lists are fine for storing small amounts of one-dimensional
data
• But arithmetic operators do not work elementwise on lists:
+ concatenates two lists, * repeats a list, and - and / raise
a TypeError
• We need efficient arrays with elementwise arithmetic and
better multidimensional tools
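A minimal sketch of the difference, using made-up prices and quantities (the variable names are illustrative and not from the course material):

```python
import numpy as np

prices = [1.5, 2.0, 3.25]      # hypothetical data
quantities = [2, 4, 1]

# On lists, + concatenates and * repeats; neither does arithmetic
joined = prices + quantities
repeated = quantities * 2
print(joined)                  # [1.5, 2.0, 3.25, 2, 4, 1]
print(repeated)                # [2, 4, 1, 2, 4, 1]

# Elementwise arithmetic on lists needs an explicit loop
totals_loop = [p * q for p, q in zip(prices, quantities)]

# With arrays the same calculation is a single expression
totals_arr = np.array(prices) * np.array(quantities)
print(totals_arr.tolist())     # [3.0, 8.0, 3.25]
```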
What is NumPy?
NumPy (pronounced /nʌmpaɪ/), short for Numerical Python,
is the fundamental package required for high performance
scientific computing and data analysis.
• It provides:
✓ Arrays, a fast and space-efficient multidimensional array providing
vectorized arithmetic operations and sophisticated broadcasting
capabilities
✓ Standard mathematical functions for fast operations on entire
arrays of data without having to write loops
✓ Tools for reading / writing array data to disk and working with
memory-mapped files
✓ Linear algebra, random number generation, and Fourier transform
capabilities
The NumPy Arrays
• A NumPy array (also called ndarray) is a table of elements
(usually numbers), all of the same type, indexed by a tuple
of non-negative integers. Typical examples of multidimensional
arrays include vectors, matrices, images and spreadsheets.
• Dimensions are usually called axes; the number of axes is the
rank
The NumPy Arrays

[7, 2, 9, 10]
An array of rank 1, i.e., it has 1 axis of length 4

[[5.2, 3.0, 4.5],
 [9.1, 0.1, 0.3]]
An array of rank 2, i.e., it has 2 axes. The first is of length 2,
the second of length 3 (a matrix with 2 rows and 3 columns)
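As a quick check of the two examples, the sketch below builds both arrays and reads off their number of axes and the length of each axis. The names vector and matrix are chosen here for illustration.

```python
import numpy as np

vector = np.array([7, 2, 9, 10])
matrix = np.array([[5.2, 3.0, 4.5],
                   [9.1, 0.1, 0.3]])

print(vector.ndim, vector.shape)   # 1 (4,)
print(matrix.ndim, matrix.shape)   # 2 (2, 3)

# Indices start at 0; a tuple of indices (row, column) selects one element
print(matrix[1, 0])                # 9.1
```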
The NumPy Arrays
• NumPy array is a fast, flexible container for large data sets in
Python
• Before using NumPy, we need to import the numpy module

import numpy as np
Creating Arrays
• The easiest way to create an array is to use the array()
function
• It accepts any sequence-like object (e.g., a list or tuple)
and produces a new NumPy array containing the
data passed to it.

data1 = [6, 7.5, 8, 0, 1]


arr1 = np.array(data1, float)
arr1
array([6. , 7.5, 8. , 0. , 1. ])
Specify data type.
It could also be int
Creating Arrays
• From nested sequences, like a list of lists
data2= [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2= np.array(data2)
arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])

print(arr2)
[[1 2 3 4]
 [5 6 7 8]]

Dimension of the array:
arr2.ndim
2

Structure of the array:
arr2.shape
(2, 4)
Creating Arrays
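• array() tries to infer a good data type for the array that it
creates.
• The data type is stored in a special dtype object
• The default integer type depends on the platform and NumPy
version, so arr2.dtype may show int64 instead of int32

arr1.dtype
dtype('float64')

arr2.dtype
dtype('int32')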
Creating Arrays
• The size attribute returns the total number of items in the
array
• We can read the attribute from an array object, or call the
np.size function and pass the array as an argument

arr2.size
8

np.size(arr2)
8
Creating ndarrays
• We can convert an array from one shape to another without
copying any data. To do this, pass a tuple indicating the new
shape to the reshape method.

arr3 = np.array([0, 1, 2, 3, 4, 5, 6, 7])


arr3.reshape((4, 2))

array([[0, 1],
[2, 3],
[4, 5],
[6, 7]])
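The "without copying" behaviour can be checked directly. The sketch below assumes a simple contiguous array such as arr3; in that case reshape returns a view that shares memory with the original, although NumPy may have to copy for other memory layouts.

```python
import numpy as np

arr3 = np.arange(8)
reshaped = arr3.reshape((4, 2))

# For a contiguous array like this one, reshape returns a view
print(np.shares_memory(arr3, reshaped))   # True

# Writing through the view changes the original array too
reshaped[0, 0] = 99
print(arr3[0])                            # 99

# -1 asks NumPy to infer that dimension from the total size
print(arr3.reshape((2, -1)).shape)        # (2, 4)
```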
Creating Special Arrays
• In addition to array, there are a number of other
functions for creating new arrays.
• For example, zeros and ones create arrays of 0’s or 1’s,
respectively, with a given length or shape. empty creates an
array without initializing its values, so its contents are arbitrary.
• To create a higher dimensional array with these functions,
pass a tuple for the shape
Creating Special Arrays

np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

np.zeros((3,6))
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

np.empty((2,3,2))
array([[[1.05442863e-311, 2.86558075e-322],
        [0.00000000e+000, 0.00000000e+000],
        [7.56599806e-307, 2.92966904e-033]],

       [[7.17473078e-091, 4.42510289e-062],
        [4.31926418e-038, 4.19564746e+175],
        [6.48224659e+170, 5.82471487e+257]]])
Creating an Array of Fixed Intervals
• arange is an array-valued version of the built-in Python
range function
np.arange(8)

array([0, 1, 2, 3, 4, 5, 6, 7])

• arange has three arguments


arange(start, stop, step)

np.arange(0,8,2)
array([0, 2, 4, 6])

• arange(8) is equivalent to arange(0,8,1)


Creating an Array of Fixed Intervals
• linspace returns evenly spaced numbers over a specified
interval. We can specify the number of values to be
generated
• linspace has a number of arguments

np.linspace(0, 10, num = 5, endpoint = True, dtype = float)
array([ 0. , 2.5, 5. , 7.5, 10. ])

Starting value: 0
End value: 10
num: number of values, positive
endpoint: if True, the end value is the last value. Default is True
dtype: the type of the output array

np.linspace(0, 10, num = 5, endpoint = False, dtype = int)
array([0, 2, 4, 6, 8])
Creating an Array of Fixed Intervals
• arange and linspace are going to be quite useful when
we are making plots.
• They can be used to generate data for the X axis
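A minimal sketch of generating X-axis data, assuming we want to evaluate y = x² on a few evenly spaced points (the function and the range are picked only for illustration; plotting itself is left out):

```python
import numpy as np

# Five evenly spaced x values from 0 to 2, both ends included
x = np.linspace(0, 2, num=5)

# y is computed for every x at once, with no loop
y = x ** 2

print(x.tolist())   # [0.0, 0.5, 1.0, 1.5, 2.0]
print(y.tolist())   # [0.0, 0.25, 1.0, 2.25, 4.0]

# arange gives the same kind of grid when the step size is what matters
x_step = np.arange(0, 2, 0.5)
print(x_step.tolist())   # [0.0, 0.5, 1.0, 1.5]
```

linspace fixes the number of points and includes the end value by default, while arange fixes the step and excludes the end value.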

Creating a Random Array
• We can also use the rand function in the random module to
generate random data, drawn uniformly from the interval [0, 1)
• Pass the length of each dimension as arguments
np.random.rand(5)

array([0.47487993, 0.55756924, 0.3188104 , 0.27839417, 0.11052682])

np.random.rand(4, 2)

array([[0.02473339, 0.93360131],
[0.21580826, 0.29531976],
[0.44972526, 0.53165493],
[0.50354605, 0.35704748]])
Creating a Random Array
• We can also use the randn function in the random module to
draw data from a standard normal distribution
• Mean 0, variance 1
• Pass the shape as arguments

np.random.randn(5, 3)

array([[-1.08369819, -0.12103409, -0.98555855],
       [-0.89341613, -0.46729387, -0.36880701],
       [ 2.03070419, -0.23967288, -1.04078775],
       [-0.8740701 , -0.42289868,  0.00337789],
       [-0.98268423, -0.65690555,  0.60583936]])
Creating a Random Array
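• To draw data from any normal distribution N(μ, σ²), use the
normal function

np.random.normal(mu, sigma, 10)

The arguments are the mean, the standard deviation, and the size.
Note that the second argument is the standard deviation σ, not
the variance σ². mu and sigma must be defined beforehand.

array([83.83931654, 80.63749432, 77.88960687, 77.54264136, 79.23475542,
       82.09625153, 75.7728796 , 82.87793414, 77.42665788, 78.72251953])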

• For a full list of available distributions, see
https://numpy.org/doc/stable/reference/random/legacy.html
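The functions above belong to NumPy's legacy random interface. As a hedged side note, newer NumPy versions also offer a Generator object created with np.random.default_rng, which makes results reproducible through a seed. The sketch below shows rough equivalents of the three calls; the seed and the values of mu and sigma are arbitrary choices for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=42)        # seed chosen arbitrarily

uniform = rng.random(5)                     # like np.random.rand(5)
std_normal = rng.standard_normal((5, 3))    # like np.random.randn(5, 3)

mu, sigma = 80, 3                           # hypothetical mean and standard deviation
scores = rng.normal(mu, sigma, 10)          # like np.random.normal(mu, sigma, 10)

# A generator created with the same seed reproduces the same draws
again = np.random.default_rng(seed=42).random(5)
print(np.array_equal(uniform, again))       # True
```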
NumPy Constants
• np.e returns Euler’s number e, which is the base of
natural logarithms
np.e
2.718281828459045

• np.nan returns Not a Number (NaN). This is used when
there is missing data or when a mathematical calculation
is not valid, such as log(-1)
np.nan
nan

• np.pi returns pi
np.pi
3.141592653589793
Array operations
• Arrays are important because they enable you to express
batch operations on data without writing any for loops
• This is usually called vectorization
• Any arithmetic operation between equal-size arrays is
applied elementwise
arr = np.array([[1, 2, 3],[4, 5, 6]])
arr
array([[1, 2, 3],
[4, 5, 6]])

arr * arr
array([[ 1,  4,  9],
       [16, 25, 36]])

1 / arr
array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])
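The 1 / arr example already combines a scalar with an array. A short sketch of this broadcasting behaviour, which NumPy also applies to arrays of compatible (not only equal) shapes; the array row is an illustrative addition:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# A scalar is applied to every element
print((arr + 10).tolist())    # [[11, 12, 13], [14, 15, 16]]

# A length-3 row is stretched across both rows of the (2, 3) array
row = np.array([100, 200, 300])
print((arr + row).tolist())   # [[101, 202, 303], [104, 205, 306]]

# Comparisons are elementwise too and give a Boolean array
print((arr > 3).tolist())     # [[False, False, False], [True, True, True]]
```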
Array operations
• The fill method sets all values in an array to a single value

zero_arr = np.zeros(5, int)
zero_arr
array([0, 0, 0, 0, 0])

zero_arr.fill(4)
zero_arr
array([4, 4, 4, 4, 4])

zero_arr[0] = 5.8
zero_arr
array([5, 4, 4, 4, 4])

5.8 is truncated to 5 since the data type is int
Universal Functions
• A universal function, or ufunc, is a function that performs
elementwise operations on data in ndarrays
• Functions like sqrt or exp take one array, so they are called
unary ufuncs

arr = np.arange(10)
print(arr)
[0 1 2 3 4 5 6 7 8 9]

np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
Universal Functions
• add or maximum take 2 arrays and return a single array as
the result. They are binary ufuncs

x = np.random.randn(8)
print(x)
[ 0.59054432  0.79793215 -0.44441787  0.74250776 -1.30106831  0.01595154
 -0.63769349 -0.2519021 ]

y = np.random.randn(8)
print(y)
[ 0.55356053 -1.54007186 -0.40315681 -2.00758763  1.08729518  0.40433778
  0.1940852  -0.57839798]

np.maximum(x, y)
array([ 0.59054432,  0.79793215, -0.40315681,  0.74250776,  1.08729518,
        0.40433778,  0.1940852 , -0.2519021 ])
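Because the arrays above are random, the result differs on every run. A small deterministic sketch with hand-picked values makes the elementwise pairing easier to verify:

```python
import numpy as np

x = np.array([1.0, 5.0, -2.0])
y = np.array([4.0, 3.0, -7.0])

# Binary ufuncs pair up elements at the same position
print(np.add(x, y).tolist())       # [5.0, 8.0, -9.0]
print(np.maximum(x, y).tolist())   # [4.0, 5.0, -2.0]
print(np.minimum(x, y).tolist())   # [1.0, 3.0, -7.0]

# np.add(x, y) is what the + operator calls for arrays
print(np.array_equal(np.add(x, y), x + y))   # True
```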
Universal Functions
• Some unary ufuncs
Universal Functions
• Some binary ufuncs

• For a full list of ufuncs, go to
https://numpy.org/doc/stable/reference/ufuncs.html
Mathematical and Statistical Methods
• Aggregations (often called reductions) like sum, mean, and
standard deviation std can either be used by calling the
array method or using the top level NumPy function

arr.mean() np.mean(arr)

arr.sum() np.sum(arr)

arr.std() np.std(arr)
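A short sketch of these aggregations on a small 2D array. The axis and ddof arguments go slightly beyond the slide but are standard parameters of these methods.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.sum())                   # 21
print(arr.mean())                  # 3.5

# axis=0 aggregates down the rows, giving one value per column
print(arr.sum(axis=0).tolist())    # [5, 7, 9]

# axis=1 aggregates across the columns, giving one value per row
print(arr.mean(axis=1).tolist())   # [2.0, 5.0]

# std() is the population standard deviation by default (ddof=0);
# ddof=1 gives the sample version, which is larger
print(arr.std() < arr.std(ddof=1))   # True
```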
Mathematical and Statistical Methods
• Basic array statistical methods
Reading from and Writing to Text Files
• The loadtxt function reads text file data into an array (2D for
a file with several rows and columns)
• The savetxt function performs the inverse operation: writing
an array to a delimited text file.
values1 = np.random.random((10, 5))
np.savetxt(r'../data/nparray.txt', values1)

values2 = np.loadtxt(r'../data/nparray.txt', dtype = float)
print(values2)
[[0.54135298 0.92492694 0.9705508 0.39579461 0.79874527]
[0.63508815 0.22996917 0.05120709 0.02846381 0.12284775]
[0.22021252 0.82902275 0.28549183 0.78106408 0.50466581]
……
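A self-contained round trip, sketched under the assumption that we want a comma-delimited file with two decimal places. It writes to a temporary folder so that it does not depend on the ../data directory; the file name and format are illustrative choices.

```python
import os
import tempfile

import numpy as np

values = np.arange(6, dtype=float).reshape((2, 3))

# Write to a temporary folder so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "nparray.csv")
np.savetxt(path, values, delimiter=",", fmt="%.2f")

# Inspect the first line of the file that was written
with open(path) as f:
    first_line = f.readline().strip()
print(first_line)                       # 0.00,1.00,2.00

# loadtxt needs the same delimiter to split the columns again
loaded = np.loadtxt(path, delimiter=",")
print(np.array_equal(values, loaded))   # True
```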
Summary
• Creating NumPy arrays
• Array operations
• Universal functions
• Reading from and writing to text files
Exercise 1
• Given a random array of integers, write a program to
calculate the variance of the array. Do not use built-in
NumPy function np.var()
Exercise 2
• Given a random array, write a program to compute the
moving average at the interval of 3.

Given: [8 8 3 7 7 0 4 2 5 2 2 2]

Output: [6.3, 6. , 5.7, 4.7, 3.7, 2. , 3.7, 3. , 3. , 2. ]

• Hint: numpy.cumsum() returns an array of cumulative sums

arr = np.array([2, 3, 1, -1, 3, 5])
cumsum = np.cumsum(arr)
print(cumsum)
[ 2  5  6  5  8 13]
pandas
What is pandas?
• pandas contains high-level data structures and manipulation
tools designed to make data analysis fast and easy in
Python.
• pandas has two workhorse data structures: Series and
DataFrame.
• To use pandas, first import the module

import pandas as pd
Creating a Series
• A Series is a one-dimensional array-like object containing an
array of data (of any NumPy data type) and an associated
array of data labels, called its index.
• The simplest Series is formed from only an array of data
• We can use the Series constructor to create a Series object

obj1 = pd.Series([4, 7, -5, 3])


obj1

0 4
1 7
2 -5
3 3
dtype: int64
Creating a Series
• The string representation of a Series displayed interactively
shows the index on the left and the values on the right.
• By default, the index consists of the integers 0 through N – 1
• Often it will be desirable to create a Series with an index
identifying each data point

obj2 = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])


obj2

a 4
b 7
c -5
d 3
dtype: int64
Creating a Series
• We can also transform a Python dictionary to a Series

sdata = {'Ohio': 35000, 'Texas': 71000,


'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
Selecting Index and Values in a Series
• We can get the array representation and the index object of a
Series via its values and index attributes. These are attributes,
not methods, so no parentheses are needed.
• A Series’ index can be altered in place by assignment
obj2.values array([ 4, 7, -5, 3], dtype=int64)

obj2.index Index(['a', 'b', 'c', 'd'], dtype='object')

obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']


obj2
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
Selecting Index and Values in a Series
• We can use the index to find values and also change the
values.
• Very much like a dictionary
• The examples below use the original labels 'a' to 'd'

obj2['a']
4

obj2['d'] = 6
obj2[['c','a','d']]
c   -5
a    4
d    6
dtype: int64
Series Operations
• Array operations, such as filtering with a Boolean array,
scalar multiplication, or applying math functions, will
preserve the index-value link

obj2[obj2 > 0]
a    4
b    7
d    6
dtype: int64

obj2 * 2
a     8
b    14
c   -10
d    12
dtype: int64

np.exp(obj2)
a      54.598150
b    1096.633158
c       0.006738
d     403.428793
dtype: float64
Replacing Values
• The replace method provides a simple and flexible way to
modify a subset of values
data = pd.Series([1., -999., 2.,
                  -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

import numpy as np
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
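replace also accepts a list or a dictionary, which is where the "flexible" part comes in. A minimal sketch on the same data; treating -1000 as zero in the second call is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# A list replaces several values with the same substitute
cleaned = data.replace([-999, -1000], np.nan)
print(cleaned.isna().sum())    # 3

# A dictionary maps each value to its own substitute
mapped = data.replace({-999: np.nan, -1000: 0})
print(mapped.tolist())         # [1.0, nan, 2.0, nan, 0.0, 3.0]

# replace returns a new Series; the original is unchanged
print(data[1])                 # -999.0
```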
DataFrame
• A DataFrame represents a tabular, spreadsheet-like data
structure containing an ordered collection of columns, each
of which can be a different value type (numeric, string,
Boolean, etc.).
• The DataFrame has both a row and a column index; it can be
thought of as a dictionary of Series, all sharing the
same index.
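The "dictionary of Series" view can be made concrete by building a DataFrame from two Series that share an index. The column names and values below are illustrative.

```python
import pandas as pd

labels = ['one', 'two', 'three']
year = pd.Series([2000, 2001, 2002], index=labels)
state = pd.Series(['Ohio', 'Ohio', 'Nevada'], index=labels)

# Keys become column names; the shared index becomes the row index
frame = pd.DataFrame({'year': year, 'state': state})
print(frame.shape)                               # (3, 2)

# Each column is a Series, and every column carries the row index
print(type(frame['year']).__name__)              # Series
print(frame['year'].index.equals(frame.index))   # True
```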
Creating a DataFrame
• One of the most common ways to create a DataFrame is from
a dictionary of equal-length lists
• The resulting DataFrame will have its index assigned
automatically, as with Series
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame1 = pd.DataFrame(data)
frame1

state year pop


0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
Creating a DataFrame
• If you specify a sequence of columns, the DataFrame’s
columns will be exactly what you pass

pd.DataFrame(data, columns =
['year', 'state', 'pop'])

year state pop


0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
Creating a DataFrame
• Call the describe method to see the basic statistics of the
numerical values in the dataframe

frame1.describe()
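The output of describe is itself a DataFrame, so individual statistics can be picked out of it. A small sketch on the same data:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame1 = pd.DataFrame(data)
summary = frame1.describe()

# Only the numeric columns are summarized; 'state' is skipped
print(summary.columns.tolist())   # ['year', 'pop']

# One row per statistic
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# A single statistic is selected by row label and column label
print(round(summary.loc['mean', 'pop'], 2))   # 2.42
```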
Adding a Column to a DataFrame
• If you pass a column that isn’t contained in data, it will
appear with NaN values in the result

frame2 = pd.DataFrame(data,
columns = ['year', 'state', 'pop', 'debt'],
index = ['one', 'two', 'three', 'four', 'five'])
frame2

year state pop debt


one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
Selecting Index and Values from a DataFrame

• We can use either dictionary-style indexing with the column
label or attribute access to get the values in a column

frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

frame2['state']
frame2.state
(both return the same Series)

one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
Selecting Index and Values from a DataFrame

• To retrieve values from a row:
• Use loc with the row label
• Or, use iloc with the integer position of the row (counting from 0)

frame2.loc['three']
frame2.iloc[2]
(both return the same row)

year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
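A compact sketch contrasting label-based and position-based selection on a small frame with hypothetical values. The slicing behaviour at the end is standard pandas but is not covered on the slide.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# The same row, selected by label and by position
print(frame.loc['three'].equals(frame.iloc[2]))   # True

# A row label plus a column label returns a single value
print(frame.loc['two', 'pop'])                    # 1.7

# Label slices include the end label; position slices exclude the end
print(frame.loc['one':'two'].index.tolist())      # ['one', 'two']
print(frame.iloc[0:2].index.tolist())             # ['one', 'two']
```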
Modify Values in a DataFrame
• Values in columns can be modified by assignment
• For example, the empty 'debt' column could be assigned a
scalar value or an array of values
frame2.debt = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

frame2['debt'] = np.arange(5)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
Reading from a CSV file
• pandas features a number of functions for reading tabular
data as a DataFrame object
• The most used one is read_csv
• Let’s use advertising.csv

Header row:
,TV,radio,newspaper,sales
Data rows:
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
Reading from a CSV file

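df = pd.read_csv(r'../data/advertising.csv')
df

Our data already has an index in its first column. Read this way,
it shows up as an ordinary column named "Unnamed: 0" next to the
automatically assigned index.

type(df)
pandas.core.frame.DataFrame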
Reading from a CSV file
• If the data in the file already has an index column, you can
specify it through index_col =

df = pd.read_csv(r'../data/advertising.csv',
index_col = 0)
df
Reading from a CSV file
• Use head or tail to narrow down the view of data
• The argument is the number of observations you want to
view. Default is 5

df.head(3)

df.tail(2)
Reading from a CSV file
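• By default, read_csv reads the first row as the header
• If the data has no header, specify header = None
• We can also assign column names by passing a list to the
argument names =. If the file already has a header row, also
pass header = 0 so that the row is replaced instead of being
read as data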

pd.read_csv(r'../data/advertising.csv',
index_col = 0, header = None).head(3)

pd.read_csv(r'../data/advertising.csv',
index_col = 0, names = ['col1', 'col2',
'col3', 'col4']).head(3)
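The effect of these arguments can be tried without the course file. The sketch below assumes a small in-memory copy of the first rows of advertising.csv, read through io.StringIO in place of a file path; the replacement column names are placeholders.

```python
import io

import pandas as pd

csv_text = """,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Default: the first row is the header, the first column is ordinary data
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())
# ['Unnamed: 0', 'TV', 'radio', 'newspaper', 'sales']

# index_col=0 uses the first column as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.index.tolist())      # [1, 2, 3]

# header=0 together with names replaces the existing header row
renamed = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0,
                      names=['id', 'col1', 'col2', 'col3', 'col4'])
print(renamed.columns.tolist())    # ['col1', 'col2', 'col3', 'col4']
print(renamed.shape)               # (3, 4)
```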
Writing to a CSV file
• Using DataFrame’s to_csv method, we can write the data
out to a CSV file

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],


'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

frame.to_csv(r'../data/states.csv')

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9

The first column name is empty since it’s the index
Writing to a CSV file
• By default, the delimiter is a comma, the first row is the header,
and the index is written
• We can also specify the delimiter by sep =
• We can specify no header by header = False
• We can also specify no index by index = False

frame.to_csv(r'../data/states1.csv',
sep = '|', index = False, header = False)

Ohio|2000|1.5
Ohio|2001|1.7
Ohio|2002|3.6
Nevada|2001|2.4
Nevada|2002|2.9
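A quick way to preview what to_csv will produce is to call it without a path, in which case it returns the text as a string. The sketch below uses a shortened, illustrative version of the frame and reads the text back to confirm the round trip.

```python
import io

import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

# With no path, to_csv returns the text instead of writing a file
text = frame.to_csv(sep='|', index=False)
print(text.splitlines()[0])          # state|year|pop
print(text.splitlines()[1])          # Ohio|2000|1.5

# Reading it back needs the same separator
restored = pd.read_csv(io.StringIO(text), sep='|')
print(restored['year'].tolist())     # [2000, 2001]
print(restored.shape == frame.shape) # True
```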
Summary
• Creating a series
• Creating a dataframe
• Modifying a dataframe
• Removing duplicates
• Reading from and writing to csv files
Exercise 3
• Given a dataframe of three columns, write a program to add
a 4th column which is a lag of the 1st column, a 5th column
which is the lead of the 2nd column
Exercise 4
• Read “advertising.csv”. Take the first 5 rows, and add a
‘social’ column and populate the column with 5 random
values rounded to one decimal place. Then write the new
data to a csv file named “ads_new.csv”
More Exercises
• NumPy and pandas are a big universe. We can only scratch the
surface to get you started
• Practice is the only way to progress
• Find more exercises and solutions below

• NumPy: https://github.com/rougier/numpy-100
• pandas: https://www.machinelearningplus.com/python/101-pandas-exercises-python/
Before We Move On
[Figure: course roadmap]
• Managing Data: Text, CSV, JSON; Regular Expression; NumPy; Pandas
• StatsModels
• Web Scraping: Beautiful Soup
• Data Visualization: Tableau; Matplotlib
Install BeautifulSoup 4
• To prepare for the coming sessions, you need to install a
powerful Python package called BeautifulSoup 4.
• Follow the instructions on the following page to
download and install the package
https://www.crummy.com/software/BeautifulSoup/
• The latest version when these slides were prepared was 4.9
• To test if you have successfully installed the package, run
the following code in Python. If it runs without an error
message, the package is installed.
from bs4 import BeautifulSoup
