lecture4
* In case you fail to import the library, you can install it first by executing the following command in a cell:
conda install matplotlib
4.1.1 Frequency plot
For a bar chart, two arrays are required: one for the x-axis and one for the y-axis.
You can use the following syntax to sort a DataFrame by one variable:
df_name = df_name.sort_values("col_name")
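As an illustration, the sketch below sorts a small DataFrame and draws a bar chart from two of its columns. The column names and counts are made up for this example and do not come from the lecture data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: number of students in each grade (made up for illustration)
df = pd.DataFrame({"grade": ["B", "A", "D", "C"],
                   "count": [12, 5, 3, 8]})

# Sort the rows by grade so the bars appear in the order A, B, C, D
df = df.sort_values("grade")

# The first array gives the x-axis categories, the second gives the bar heights
bars = plt.bar(df["grade"], df["count"])
plt.xlabel("grade")
plt.ylabel("count")
```

Sorting before plotting matters because the bars are drawn in the order in which the rows appear.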
For a pie chart, only an array of values is necessary. Labels and
automatic percentages can be added optionally.
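A minimal sketch of a pie chart with the optional labels and automatic percentages is given here. The values and labels are invented for illustration; `labels` and `autopct` are the matplotlib parameters that control the two optional features.

```python
import matplotlib.pyplot as plt

# Hypothetical counts and labels (made up for illustration)
values = [5, 12, 8, 3]
labels = ["A", "B", "C", "D"]

fig, ax = plt.subplots()
# labels names each wedge; autopct formats each share as a percentage
# with one decimal place
ax.pie(values, labels=labels, autopct="%1.1f%%")
```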
4.1.2 Histogram
plt.hist(data_set, bins=number_of_bins)
plt.hist(data_set, bins=[point_0,point_1,...,point_n])
In the example, when no bins parameter is given, the data is automatically
put into 10 bins of equal width between the minimum and maximum values. The
end points of the intervals and the frequency of data in each interval are
shown in the returned arrays.
Suppose we would like to use 20 bins with narrower intervals; simply
set the parameter bins=20.
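The following sketch compares the default binning with 20 bins. The marks are randomly generated stand-ins, not the lecture's data set; the point is only to show the arrays of frequencies and end points that plt.hist (here called through an Axes object) returns.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)           # fixed seed so the cell is repeatable
marks = rng.normal(65, 12, size=200)     # hypothetical marks, not real data

fig, (ax1, ax2) = plt.subplots(1, 2)
# Default: 10 equal-width bins between the smallest and largest value
counts10, edges10, _ = ax1.hist(marks)
# 20 narrower bins over the same range
counts20, edges20, _ = ax2.hist(marks, bins=20)

print(counts10)   # frequency in each of the 10 bins
print(edges10)    # 11 end points of the 10 intervals
```

Note that n bins always come with n + 1 end points.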
Suppose the rubric of ama1234 is shown below. We may form an array
with the threshold marks together with 0 marks and 100 marks to divide
the interval for data binning. Notice that the widths of the intervals are
not equal. This is illustrated by the widths of the rectangles on the histogram.
Rubrics:
A: [85, 100],
B: [70, 85),
C: [60, 70),
D: [50, 60),
F: [0, 50)
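A sketch of binning by the rubric thresholds follows. The marks are invented, and the bin edges assume the grade intervals are contiguous (so grade D runs from 50 up to 60). With plt.hist, every bin includes its left end point and excludes its right one, except the last bin, which includes both.

```python
import matplotlib.pyplot as plt

# Hypothetical marks (made up for illustration)
marks = [35, 48, 52, 58, 61, 66, 69, 72, 78, 84, 85, 91, 100]

# Threshold marks together with 0 and 100: boundaries of F, D, C, B, A
# (assumes contiguous grade intervals)
edges = [0, 50, 60, 70, 85, 100]

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(marks, bins=edges, edgecolor="black")
print(counts)   # number of students with grades F, D, C, B, A
```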
4.1.3 Line graph
Moreover, the colour and style of the line can be adjusted by the
arguments inside the brackets.
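For instance, the sketch below draws a red dashed line with circular markers. The data are made up; `color`, `linestyle` and `marker` are standard keyword arguments of matplotlib's plot function.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly figures (made up for illustration)
months = [1, 2, 3, 4, 5, 6]
sales = [3, 5, 4, 6, 8, 7]

fig, ax = plt.subplots()
# color sets the line colour, linestyle "--" gives a dashed line,
# marker "o" draws a circle at each data point
(line,) = ax.plot(months, sales, color="red", linestyle="--", marker="o")
ax.set_xlabel("month")
ax.set_ylabel("sales")
```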
A dataset might contain more than one variable under the same
index. For example, in a set of health data of a class of students,
height (m) and weight (kg) are two variables. We call such a
dataset bivariate data.
For example, we might want to study the relation between height and
weight. After reading the CSV file "health.csv" into a DataFrame, extract
the columns representing the two variables (height and weight) into two
Series. For each data index, a point with x-coordinate being its height
and y-coordinate being its weight is plotted on the figure. As a result, a
scatter plot should contain as many points as the number of rows of the
original DataFrame.
In case the data points can be classified into different categories (e.g.
male and female), we might assign colours to each point with
c=[colour0,colour1,...].
If the data points are weighted, we might show each point with a
different size using s=[size0,size1,...].
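The sketch below puts these options together on a tiny made-up table (it does not read health.csv). The column names, the colour mapping and the size scaling factor are assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical health data (made up for illustration)
df = pd.DataFrame({
    "height": [1.60, 1.72, 1.65, 1.80, 1.55],
    "weight": [52, 68, 58, 80, 47],
    "gender": ["F", "M", "F", "M", "F"],
    "count":  [1, 3, 2, 1, 2],      # hypothetical weight of each point
})

# One colour per point, chosen by category
colours = df["gender"].map({"F": "red", "M": "blue"}).tolist()
# One size per point, proportional to its weight (40 is an arbitrary scale)
sizes = (df["count"] * 40).tolist()

fig, ax = plt.subplots()
points = ax.scatter(df["height"], df["weight"], c=colours, s=sizes)
ax.set_xlabel("height (m)")
ax.set_ylabel("weight (kg)")
```

The scatter plot contains one point per row of the DataFrame, so the colour and size lists must have the same length as the data.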
Lab 4.2.2
The file ama3456.csv contains the test and exam marks of students in
the course. Plot a scatter plot to figure out the relation between test
and exam marks. What can you conclude?
4.2.3 Covariance and correlation
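$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$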
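As a quick check, the correlation coefficient can be computed from the sums of deviations and compared with the built-in pandas method Series.corr. The two short Series here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations (made up for illustration)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

dx = x - x.mean()     # deviations of x from its mean
dy = y - y.mean()     # deviations of y from its mean

# Sum of products of deviations over the square root of the product
# of the sums of squared deviations
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

r_pandas = x.corr(y)  # Pearson correlation computed by pandas
print(r, r_pandas)
```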
The correlation coefficient r is a value between -1 and 1
inclusive. We can interpret the relationship between the two
variables by the value of r:
To illustrate this idea, compare the three figures below.
[Figure: scatter plot of y against x, r ≈ 0.9] We can easily draw a straight line with positive slope with the
points lying close to it.
[Figure: scatter plot of y against x, r ≈ -0.5] We can also draw a line with negative slope, but the points are
farther apart from it.
[Figure: scatter plot of y against x, r ≈ 0]
Example 4.2.4
The table is symmetric, with the diagonal values equal to 1, since the
correlation coefficient of a variable with itself must be 1.
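The symmetry and the unit diagonal can be checked on any numeric DataFrame. The sketch below uses a small made-up table; DataFrame.cov and DataFrame.corr are the pandas methods for the covariance and correlation tables.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data (made up for illustration)
df = pd.DataFrame({
    "height": [1.60, 1.72, 1.65, 1.80, 1.55],
    "weight": [52, 68, 58, 80, 47],
    "age":    [19, 21, 20, 22, 19],
})

cov = df.cov()     # pairwise covariances
corr = df.corr()   # pairwise correlation coefficients
print(corr)

# The table equals its own transpose, and every diagonal entry is 1
is_symmetric = np.allclose(corr.values, corr.values.T)
unit_diagonal = np.allclose(np.diag(corr.values), 1.0)
```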
Lab 4.2.5
The file ama3456.csv contains the test and exam marks of students in
the course. Evaluate the covariance and coefficient of correlation.
Are these values consistent with the scatter plot?
4.2.6 Linear regression
For simplicity, we are looking for a straight line that best fits the data set
of two variables. Recall the figure in the previous section. We can draw
a straight line graphically to fit the points. In general, we need to know
how to find the equation of such a line, and how to evaluate how well
the line fits the points.
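We define $\hat{y}$ as the predicted value, which
is a linear function of $x$ given by:
$$\hat{y} = \alpha + \beta x$$
Our target is to find the weights $\alpha, \beta$ with the least sum of squared
errors (SSE), where the error of each data point is $E_i = y_i - \hat{y}_i$.
In other words, try to minimize:
$$SSE = \sum_{i=1}^{n} E_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$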
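To make the quantity being minimized concrete, the sketch below evaluates the SSE of two candidate lines on a small made-up data set. The helper name `sse` and the candidate weights are chosen for illustration only.

```python
import numpy as np

# Hypothetical paired observations (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sse(alpha, beta):
    """Sum of squared errors of the line y_hat = alpha + beta * x."""
    y_hat = alpha + beta * x          # predicted values
    return float(np.sum((y - y_hat) ** 2))

sse_good = sse(0.0, 2.0)   # a line that passes close to the points
sse_bad = sse(1.0, 1.0)    # a line that fits poorly
print(sse_good, sse_bad)
```

The line with the smaller SSE fits the points better; linear regression looks for the weights that make this value as small as possible.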
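Given a set of bivariate data $(x_i, y_i)$ with sample size $n$, the values
of the coefficients $a, b$ (the estimates of $\alpha, \beta$ in the linear regression $\hat{y} = \alpha + \beta x$) are given by:
$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \frac{\sum y_i - b\sum x_i}{n}$$
Since $\bar{x}$ and $\bar{y}$ are the mean values of $x_i, y_i$,
$$\bar{x} = \frac{\sum x_i}{n}, \qquad \bar{y} = \frac{\sum y_i}{n}$$
and $\overline{x^2}$ and $\overline{xy}$ are the mean values of $x_i^2, x_i y_i$,
we can convert the formula to:
$$b = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad a = \bar{y} - b\,\bar{x}$$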
Although pandas does not contain a built-in function to evaluate the two
coefficients, we can simply write a function to do so.
Suppose x and y are two Series, or two columns from a DataFrame. Also
assume that x and y have the same sample size and index.
With vectorized calculation we can create two new Series x2 and xy. Their
mean values can be calculated using the built-in mean method. Hence the
coefficients a, b can be evaluated and returned as a tuple.
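def lr(x, y):
    # Return the intercept a and slope b of the least-squares line y = a + b*x
    x2 = x**2   # element-wise squares of x
    xy = x*y    # element-wise products of x and y
    b = (xy.mean() - x.mean()*y.mean()) / (x2.mean() - x.mean()**2)
    a = y.mean() - b*x.mean()
    return a, b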
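One way to gain confidence in such a function is to compare its output with NumPy's polyfit, which also fits a least-squares line. The sketch below repeats the function so that the cell is self-contained; the data are made up for illustration.

```python
import numpy as np
import pandas as pd

def lr(x, y):
    # Intercept a and slope b of the least-squares line y = a + b*x
    x2 = x**2
    xy = x*y
    b = (xy.mean() - x.mean()*y.mean()) / (x2.mean() - x.mean()**2)
    a = y.mean() - b*x.mean()
    return a, b

# Hypothetical paired observations (made up for illustration)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = lr(x, y)

# np.polyfit with degree 1 returns the slope first, then the intercept
slope, intercept = np.polyfit(x.to_numpy(), y.to_numpy(), 1)
print(a, b, intercept, slope)
```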
In order to plot the regression line, we need to import the NumPy library
(numerical Python), which contains useful features for numerical
methods. One of them is called linspace, which stands for linear space.
The syntax is linspace(p,q,n). Such a linear space is a special
type of array of n numbers going from p to q with uniform increment, or
equivalently $t_i = p + i\,\frac{q-p}{n-1}$ for $i = 0, 1, 2, \ldots, n-1$.
For example:
import numpy as np
t = np.linspace(2, 3, 11)
t[i] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
index i 0 1 2 3 4 5 6 7 8 9 10
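A possible way to use linspace for the regression line is sketched below: the line is evaluated at evenly spaced x-values and drawn over the scatter plot. The data are made up, and the coefficients are computed inline from the mean values so that the cell is self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares slope b and intercept a from the mean values
b = ((x*y).mean() - x.mean()*y.mean()) / ((x**2).mean() - x.mean()**2)
a = y.mean() - b*x.mean()

# n evenly spaced points from p to q
p, q, n = x.min(), x.max(), 50
t = np.linspace(p, q, n)

# The same array built directly from the formula for a linear space
t_formula = p + np.arange(n) * (q - p) / (n - 1)

fig, ax = plt.subplots()
ax.scatter(x, y)                             # the data points
(line,) = ax.plot(t, a + b*t, color="red")   # the regression line
```

A straight line needs only two points, but an array from linspace also works unchanged when a curve is plotted instead.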
Example 4.3.1 The file “gpa.csv” contains data of 20 students. They can
be categorized by Gender (M/F), School (FS/FB/FE) and Local (Y/N).
Each student has a GPA representing their academic performance.
import pandas as pd
import seaborn as sns
df = pd.read_csv("gpa.csv")
In a seaborn barplot, the x and y axes can be set to any column heading of the
DataFrame. The mean and confidence interval are evaluated automatically from
the DataFrame, with a confidence level of 95% by default.
sns.barplot(x="School",y="GPA",data=df)
sns.catplot(x="School",y="GPA",hue="Gender",
col="Local",kind="bar",data=df)
A histogram is a kind of bar plot that gives a discretized display of value
frequency. The data points are split into discrete, evenly spaced bins, and
the number of data points in each bin is plotted. Using the plot.hist method
on the Series we can generate a histogram.
df["GPA"].plot.hist(bins=6)
Example 4.3.3 Show the pairs plot of the four corporations based on
their percentage change in stock price.
stock = pd.read_csv('stock_returns.csv')
sns.pairplot(stock, diag_kind='kde',
plot_kws={'alpha': 0.2})
Lab 4.3.4
Redo the scatter plot and regression of salary against experience
of “hr.csv” using seaborn.
4.3.5 (project for self-learning)
For you to practice the techniques in this lecture, here we introduce a
famous dataset. In 1978, David Harrison Jr. and Daniel L. Rubinfeld
published a paper called "Hedonic housing prices and the demand for
clean air". To support their findings, they referred to the data for
census tracts in the Boston Standard Metropolitan Statistical Area
(SMSA) in 1970. This dataset is clean with lots of variables including
the following:
Feature variables (factors):
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
Target variable:
MEDV - median value of owner-occupied homes in $1000's
This dataset has been stored in "boston.csv". It contains 506 rows and
14 columns. We can first read it as a DataFrame.
Given this dataset, how would you study the factors related to the
house price in Boston? Recall the data science process steps.
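One possible first step, sketched under assumptions, is to rank the factors by the strength of their correlation with the target. The helper name `rank_factors` is hypothetical, and the four-row table is made up so that the cell runs without the file; with the real data, the DataFrame would come from pd.read_csv("boston.csv").

```python
import pandas as pd

def rank_factors(df, target="MEDV"):
    """Correlation of every other column with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute value so strong negative relations rank high too
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Tiny made-up table using a few of the column names (not real Boston data)
demo = pd.DataFrame({
    "RM":    [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [30.0, 20.0, 12.0, 5.0],
    "CHAS":  [0, 1, 1, 0],
    "MEDV":  [15.0, 21.0, 30.0, 45.0],
})

ranked = rank_factors(demo)
print(ranked)
```

A ranking like this only suggests which factors deserve a closer look with scatter plots and regression; correlation alone does not show that a factor causes the price to change.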