
The Hong Kong Polytechnic University

Empowering Gifted Minds:


Nurturing the Next Generation of Data Scientist

Topic 1: Introduction to data science

Lecture 4: Data Visualization and Bivariate Analysis

4.1 Data visualization


4.2 Analytics of bivariate data
4.3 Further techniques in data visualization
4.1 Graph of one variable

Data visualization refers to the representation of data through the use of
graphics. Making informative visualizations is an important task in
data analysis, whether as part of the exploratory process or as a
way of generating ideas for models.
Python has many add-on libraries for making static or dynamic
visualizations. One commonly used library is matplotlib, a plotting
library that provides techniques for making various graphs to present
mathematical figures and statistical data.
Another one is called seaborn, a data visualization library for drawing
attractive and informative statistical graphics.

In fact, matplotlib is a desktop plotting package designed for creating


publication quality plots. There are many modules under it including
pyplot. We usually import it by:
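
import matplotlib.pyplot as plt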

*in case you fail to import the library, you may install it first by executing the following code in a cell:
conda install matplotlib
4.1.1 Frequency plot

Consider a set of one-dimensional data. The values can be either
numerical (e.g. marks of students) or non-numerical
(e.g. letter grades of students). A frequency plot is based on the number
of occurrences of each unique value.

Let's use an example to illustrate different plots. The file "ama1234.csv"


contains the marks and letter grades of 100 students. We can first read
the file as a single DataFrame, then extract the two columns as two
Series.
Based on the grades, we can directly plot a histogram using hist() to
show the frequency of students obtaining each grade. The show()
method displays the plot.
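
A minimal sketch of these steps (the column names "Mark" and "Grade" are assumed here, since the actual headers of "ama1234.csv" are not shown):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ama1234.csv")   # read the file as a single DataFrame
mark = df["Mark"]                 # marks as a Series (assumed column name)
grade = df["Grade"]               # letter grades as a Series (assumed column name)

plt.hist(grade)   # frequency of students obtaining each grade
plt.show()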
In the previous chapter, we have introduced the value_counts()
method in Pandas which gives a frequency table of a set of data by
counting the frequency of each unique value. The result is a Series with
the index being each unique value and the values being the frequency
of each unique value. We might first store it as two arrays:
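
For example, continuing with the grade Series assumed above:

freq = grade.value_counts()   # frequency table as a Series
x = freq.index                # unique values (the letter grades)
y = freq.values               # frequency of each unique value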
Using these two arrays x(index) and y(values), we can plot a bar chart
or a pie chart to illustrate frequency of the set of data.

For a bar chart, two arrays are required for the x-axis and y-axis.

You can use the following syntax to sort a DataFrame by one variable:
df_name = df_name.sort_values("col_name")
For a pie chart, only an array of values is necessary. We might also
optionally add the labelling and the auto-percentage format.
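
A sketch using the two arrays x and y from above (the autopct format string is just one possible choice):

plt.bar(x, y)   # bar chart of frequencies against grades
plt.show()

plt.pie(y, labels=x, autopct='%1.1f%%')   # pie chart with labels and percentages
plt.show()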
4.1.2 Histogram

If we look at the Series of marks in the example problem, we can see
that the entries are floating-point numerical values ranging between 0
and 100, given to 1 decimal place. It would be unrealistic to make a
frequency plot of each unique mark. Instead, we might group similar
marks together.

In statistics, data binning is a way to group numbers of more-or-less
continuous values into a smaller number of "bins". After binning, we
might plot a histogram or a density plot based on the frequency of the
binned values.

[Figure: example of binned frequencies over the intervals [0,10), [10,20), [20,30), ...]


In pyplot, when we plot a histogram of an array of values from a
continuous variable, it will be automatically binned. We can also specify
the binning criteria by the number of equal-width intervals, or by the end
points between intervals, with the following syntax respectively:

plt.hist(data_set, bins=number_of_bins)

plt.hist(data_set, bins=[point_0, point_1, ..., point_n])
In the example, without specifying any parameter, the data is automatically
put into 10 bins of equal width along the real number line. The end
points of the intervals and the frequency of data in each interval are
shown in the returned arrays.
Suppose we would like to use 20 bins with narrower intervals; simply
set the parameter bins=20.
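
A sketch with the mark Series assumed earlier (plt.hist also returns the frequencies and the end points of the intervals, which is how the arrays mentioned above can be inspected):

counts, edges, patches = plt.hist(mark)   # default: 10 bins of equal width
plt.show()
print(counts)   # frequency of data in each interval
print(edges)    # end points of the intervals

plt.hist(mark, bins=20)   # 20 narrower bins of equal width
plt.show()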
Suppose the grading rubric of ama1234 is shown below. We may form an array
with the threshold marks, together with 0 and 100, to divide
the interval for data binning. Notice that the width of each interval is not
even. This is illustrated by the width of the rectangles on the histogram.

Rubric:
A: 85-100
B: 70-85
C: 60-70
D: 50-59
F: 0-50
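
A sketch of the corresponding binning (the edge values follow the rubric above; the exact treatment of boundary marks may differ from the course's rounding rules):

edges = [0, 50, 60, 70, 85, 100]   # threshold marks together with 0 and 100
plt.hist(mark, bins=edges)         # bins of uneven width
plt.show()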
4.1.3 Line graph

Some data sets depend on a variable as an ordered series. For
example, a stock price depends on the time variable. To present such
data, we may use a line graph. This is good for showing the trend and
local extreme values.

As an example, we will try to study the stock prices of two companies,
Apple (AAPL) and Microsoft (MSFT). The files "AAPL.csv" and "MSFT.csv"
contain five years of historical data. The column of adjusted close
price is stored as a Series.
In pyplot, we can directly use plot() to make a line graph:
(i) with a Series or array; or
(ii) by selecting two columns from a DataFrame.
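
A minimal sketch (the column names "Date" and "Adj Close" follow the usual Yahoo Finance CSV layout, which is assumed here):

aapl = pd.read_csv("AAPL.csv")
msft = pd.read_csv("MSFT.csv")
aapl_price = aapl["Adj Close"]   # adjusted close price as a Series
msft_price = msft["Adj Close"]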

Moreover, the colour and style of the line can be adjusted by the
commands inside the brackets.

command   colour          command   style
'k'       black           '--'      dashed line
'g'       green           ':'       dotted line
'r'       red             '*'       points with stars
'b'       blue            'o'       points with circles
'y'       yellow          '+'       points with plus signs
Method (i)
Directly apply the plot function with a Series as input.
Method (ii)
Choose the date as x-axis, adjusted price as y-axis.
For better display we can rotate the date values (optional).
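
A sketch of both methods under the same assumptions (plt.xticks(rotation=...) is one way to rotate the date labels):

# Method (i): plot a Series directly
plt.plot(aapl_price)
plt.show()

# Method (ii): choose two columns of the DataFrame as x and y
plt.plot(aapl["Date"], aapl["Adj Close"], 'g')
plt.xticks(rotation=45)   # rotate the date values for better display
plt.show()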
However, plotting each stock separately makes comparison hard. In matplotlib, we can plot several
lines on the same figure. To show information about the figure and distinguish
the lines, we can also add a title, axis labels and a legend. This applies not only to
line graphs but also to the graphs we have introduced before. Upon
executing the show() command, all these graphs and pieces of information
will be displayed on the same figure.
However, it is still hard to compare the performance of the two stocks,
as they have different scales. A solution is to divide the daily adjusted
close price by the first entry.
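
A sketch of the normalized comparison (iloc[0] takes the first entry of each Series):

plt.plot(aapl_price / aapl_price.iloc[0], 'r', label='AAPL')
plt.plot(msft_price / msft_price.iloc[0], 'b', label='MSFT')
plt.title('Relative stock price over 5 years')
plt.xlabel('Trading day')
plt.ylabel('Price relative to first day')
plt.legend()
plt.show()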
In fact, those data can be found and downloaded as csv file from the
internet. For example, Yahoo Finance is a useful source of financial data.
https://finance.yahoo.com/
Lab 4.1.4 Stock price
Choose a stock or index from Yahoo Finance. Download the data for the most
recent year. Plot the adjusted price (USD) against time (date). For example:
4.2 Analytics of bivariate data

A dataset might contain more than one variable under the same
index. For example, in a set of health data of a class of students,
height (m) and weight (kg) are two variables. We can call such a
dataset bivariate data.

In statistics and data science, we are interested in studying whether there is
any relationship between these two variables. The methods
involved in such a process are called bivariate analytics. This includes,
but is not limited to, correlation, linear regression and clustering.
4.2.1 Scatter plot

For bivariate data, the purpose of visualization is to show the
relationship between the two variables. This can be illustrated by a
scatter plot. In a scatter plot, each data point represents a sample; the
x-coordinate and y-coordinate represent the values of the two variables.

For example, we might want to study the relation between height and
weight. After reading the csv file "health.csv" into a DataFrame, extract
the columns representing the two variables (height and weight) into two
Series. For each data index, a point with x-coordinate being its height
and y-coordinate being its weight is plotted on the figure. As a result, a
scatter plot should contain as many points as the number of rows of the
original DataFrame.
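
A minimal sketch (the column names "Height" and "Weight" in "health.csv" are assumed):

health = pd.read_csv("health.csv")
x = health["Height"]   # height (m)
y = health["Weight"]   # weight (kg)

plt.scatter(x, y)
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.show()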
In case the data points can be classified into different categories (e.g.
male and female), we might assign colours to each point with
c=[colour0,colour1,...].

If the data points are weighted, we might show each point with a
different size using s=[size0,size1,...].
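
For instance (the "Sex" column and the colour/size choices below are purely illustrative; both lists must have one entry per data point):

colours = ['b' if s == 'M' else 'r' for s in health["Sex"]]   # assumed "Sex" column
plt.scatter(x, y, c=colours, s=y)   # colour by sex, size each point by its weight
plt.show()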
Lab 4.2.2
The file ama3456.csv contains the test and exam marks of students in
the course. Plot a scatter plot to figure out the relation between test
and exam marks. What can you conclude?
4.2.3 Covariance and correlation

A scatter plot gives a visualization of bivariate data. Considering the
previous example, do you think that weight and height are
uncorrelated? Or do you believe a tall person should be heavier
(positive correlation)? Or do you believe a tall person should be
lighter (negative correlation)? A measure of association gives us
a quantitative tool to evaluate the relationship between two
variables.
In descriptive statistics, we have introduced variance and standard
deviation as measures of dispersion, meaning how far the values
are spread from the mean. For the variables $x$ and $y$, their variances
and standard deviations are $\sigma_x^2$, $\sigma_x$ and $\sigma_y^2$, $\sigma_y$ respectively.

In order to measure their association, we further introduce a
measure called the population covariance:

$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

where $x_i$, $y_i$ are the values of variables $x$, $y$ for the $i$-th sample,
$\bar{x}$, $\bar{y}$ are the arithmetic means of the two variables, and $N$ is the
population size.
That is to say, for each sample we evaluate the product of the differences between
each variable and its mean. Notice that this product can be positive or negative.
The covariance is the average value of these products. Notice that the
covariance is commutative, meaning that $\sigma_{xy} = \sigma_{yx}$. Also, the covariance
of a variable with itself is its variance.

For the sample covariance, we may simply replace $N$ by $n - 1$, where $n$ is
the sample size. In pandas, we can use the function cov() to show the
sample covariance between two variables, or to give a covariance table.
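
A sketch with the height and weight Series assumed earlier (numeric_only=True, available in recent pandas, drops non-numerical columns):

x.cov(y)                        # sample covariance between height and weight
health.cov(numeric_only=True)   # covariance table of all numerical columns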
However, the value of the covariance also depends on the scale of the
variables. For example, if we use centimetres and pounds as the
units for height and weight, the covariance will be different.

In order to study the association between two variables without
considering scale and unit, we can standardize the covariance by
the standard deviations of the two variables. This gives the
correlation coefficient:

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

The complete formula is written as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
The correlation coefficient $r$ is a value between $-1$ and $1$
inclusive. We can interpret the relationship between the two
variables by the value of $r$:

$r = 0$ indicates no linear relationship; in other words, the variables are
uncorrelated.

$r = 1$ indicates a perfect positive linear relationship: as one
variable increases in its values, the other variable also increases in
its values through an exact linear rule.

$r = -1$ indicates a perfect negative linear relationship: as one
variable increases in its values, the other variable decreases in its
values through an exact linear rule.
Other than these special cases, we can also distinguish the linear
relationship between the variables as follows:

             weak               moderate              strong
positive     $0 < r < 0.3$      $0.3 \le r < 0.7$     $0.7 \le r < 1$
negative     $-0.3 < r < 0$     $-0.7 < r \le -0.3$   $-1 < r \le -0.7$


To illustrate this idea, compare the three figures below:

First figure: strong positive correlation

We can easily draw a straight line with positive slope with the
points lying close to it.

[Figure: scatter plot with $r = 0.9$]

Second figure: moderate negative correlation

We can also draw a line with negative slope, but the points are
farther apart from it.

[Figure: scatter plot with $r = -0.5$]

Third figure: zero correlation

We can hardly draw a linear relationship between the points.

[Figure: scatter plot with $r \approx 0$]
Example 4.2.4

We can evaluate the coefficient of correlation using pandas easily.


Let's go back to the example of health data of five students. Read the
csv file as a DataFrame first. We can then regard height and weight as
the two variables. For convenience, we may assign the two columns
as two Series x and y.

Then we can evaluate the correlation coefficient by corr(). Notice
that this operation is commutative, meaning that x.corr(y) and
y.corr(x) give the same result. The coefficient 0.669 indicates that
height and weight are moderately positively correlated.
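
A sketch following these steps (same assumed column names as before):

health = pd.read_csv("health.csv")
x = health["Height"]
y = health["Weight"]
x.corr(y)   # same value as y.corr(x)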
In case there are more than two variables in the dataset, it would be
inconvenient to compare every pair of them. Instead, we might apply
corr() to the whole DataFrame. This will give a table of correlation
coefficients between each pair of columns.
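
For example (numeric_only=True, available in recent pandas, explicitly drops non-numerical columns):

health.corr(numeric_only=True)   # pairwise correlation table of the numerical columns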

Notice that non-numerical variables (e.g. sex) are ignored. Only
variables with numerical values are counted.

The table is symmetric, with the diagonal values equal to 1, since the
correlation coefficient of a variable with itself must be 1.
Lab 4.2.5
The file ama3456.csv contains the test and exam marks of students in
the course. Evaluate the covariance and coefficient of correlation.
Are these values consistent with the scatter plot?
4.2.6 Linear regression

In the previous section, we have introduced the correlation coefficient
for measuring the association between two variables. The value of this
coefficient suggests whether there is a linear relationship between the two
variables. To further analyze the association and make reasonable
predictions, we can use a statistical technique called linear regression.

For simplicity, we are looking for a straight line that best fits the data set
of two variables. Recall the figures in the previous section: we can draw
a straight line graphically to fit the points. In general, we need to know
how to find the equation of such a line, and how to evaluate how well
the line fits the points.
We define $\hat{y}$ as the predicted value, which
is a linear function of $x$ given by:

$$\hat{y} = \alpha + \beta x$$

where the coefficients $\alpha$, $\beta$ represent the
y-intercept and slope of the line.

For the $i$-th sample, we define the error $E_i$
as the difference between the predicted
value $\hat{y}_i$ and the actual value $y_i$.

Our target is to find the coefficients with the least sum of squared errors
(SSE). In other words, we try to minimize:

$$SSE = \sum_{i=1}^{n} E_i^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
Given a set of bivariate data $(x_i, y_i)$ with sample size $n$, the
coefficients $a$, $b$ of the fitted regression line $\hat{y} = a + bx$ are given by:

$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}$$

Since $\bar{x}$ and $\bar{y}$ are the mean values of $x_i$, $y_i$,
and $\overline{x^2}$ and $\overline{xy}$ are the mean values of $x_i^2$, $x_i y_i$,
we can convert the formula to:

$$b = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad a = \bar{y} - b\bar{x}$$
Although Pandas doesn’t contain a built-in function to evaluate the two
coefficients, we can simply write a code to do so.

Suppose x and y are two Series, or two columns from a DataFrame. Also
assume that x and y have the same sample size and index.

With vectorized calculation we can create two new Series x2 and xy. Their
mean values can be calculated using the built-in function. Hence the
coefficients a, b can be evaluated as a tuple.

def lr(x, y):
    # Least-squares fit of y = a + b*x for two Series with the same index.
    x2 = x**2   # Series of x squared
    xy = x*y    # Series of x times y
    b = (xy.mean() - x.mean()*y.mean()) / (x2.mean() - x.mean()**2)
    a = y.mean() - b*x.mean()
    return a, b
In order to plot the regression line, we need to import the numpy library
(numerical python), which contains useful features for numerical
methods. One of them is called linspace, which stands for linear space.
The syntax is linspace(p, q, n). Such a linear space is a special
type of array of n numbers going from p to q with uniform increment, or
equivalently $t_i = p + i\,\frac{q-p}{n-1}$ for $i = 0, 1, 2, \dots, n-1$.

For example:
import numpy as np
t = np.linspace(2, 3, 11)

t[i] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
index i 0 1 2 3 4 5 6 7 8 9 10

Notice that there are 11 points with 10 intervals. Vectorized


calculation also applies to linspace objects.
With a set of points on the x-axis generated by linspace, we can plot
any mathematical function easily. The y-coordinate can be evaluated
by vectorized calculation, or by a for-loop, or by the apply function.

plot(x, y)             plots all y coordinates vs. all x coordinates
xlabel('some text')    labels the x-axis
ylabel('some text')    labels the y-axis
show()                 displays the figure

For example, we have assigned t as a linear space:

t = np.linspace(2, 3, 11)

We can further assign y = t**2 as a square function and then
use plt.plot(t, y) to plot the graph of $y = t^2$ against $t$.
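
A complete sketch of this example:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(2, 3, 11)   # 11 evenly spaced points from 2 to 3
y = t**2                    # vectorized calculation of the square function
plt.plot(t, y)
plt.xlabel('t')
plt.ylabel('y')
plt.show()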
Example 4.2.7 Linear regression of health data
Let’s test it with the health data. The function returns the two coefficients
as a tuple. They can be separated with the following command.
To visualize the data, we can plot both the scatter plot and regression line
on the same graph. Remember to label the axes.
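
A sketch under the same assumptions as before (x is height, y is weight, and lr() is the function defined above):

a, b = lr(x, y)   # separate the returned tuple into the two coefficients

t = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y)           # scatter plot of the data
plt.plot(t, a + b*t, 'r')   # regression line
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.show()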
Lab 4.2.8
The file "hr.csv" contains the human resource record of a company
with 30 staff, showing their experience (years) working in the
company and annual salary (USD).

(a) Read the file as a DataFrame.


(b) Evaluate the correlation between experience and salary.
(c) Find the linear regression model of salary against experience.
(d) Make a scatter plot of these two variables with the regression
line on the same graph.
(e) What can you conclude about this company?
4.3 Further techniques in data visualization

With data that requires aggregation or summarization before making a


plot, using the seaborn package can make things much simpler. Plotting
functions in seaborn take a data argument, which can be a pandas
DataFrame. The other arguments refer to column names.

Example 4.3.1 The file “gpa.csv” contains data of 20 students. They can
be categorized by Gender (M/F), School (FS/FB/FE) and Local (Y/N).
Each student has a GPA representing their academic performance.

import pandas as pd
import seaborn as sns
df = pd.read_csv("gpa.csv")
In a seaborn barplot, the x and y axes can be chosen as any column heading of the
DataFrame. The mean and confidence interval are evaluated from the DataFrame
automatically, with the level of confidence being 95% by default.
sns.barplot(x="School",y="GPA",data=df)

Barplot also has a hue option that enables us to split by an additional
categorical variable.
sns.barplot(x="School",y="GPA",hue="Gender",data=df)
One way to visualize data with many categorical variables is to
use a facet grid. Seaborn has a useful built-in function catplot that
simplifies making many kinds of faceted plots.

sns.catplot(x="School",y="GPA",hue="Gender",
col="Local",kind="bar",data=df)
A histogram is a kind of bar plot that gives a discretized display of value
frequency. The data points are split into discrete, evenly spaced bins, and
the number of data points in each bin is plotted. Using the plot.hist method
on the Series we can generate a histogram.
df["GPA"].plot.hist(bins=6)

A related plot type is a density plot, which is formed by computing an


estimate of a continuous probability distribution that might have generated
the observed data. Using plot.density makes a density plot using the
conventional mixture-of-normals estimate.
df["GPA"].plot.density()
We have learned to use the scatter plot and the regression line to visualize the
relationship between two variables in bivariate data.
Seaborn supports such features in just one simple command.

import seaborn as sns

sns.regplot(x="col_a", y="col_b", data=DataFrame_name)

Example 4.3.2 Stock comparison by percentage return


"stock_returns.csv" contains the daily percentage changes of four IT-related
stocks. We would like to make a scatter plot and draw a
regression line between the percentage changes of Apple and Microsoft.
stock = pd.read_csv('stock_returns.csv')
sns.regplot(x='AAPL', y='MSFT', data=stock)
plt.title('% change in price, Apple vs Microsoft')
To show the inter-relationships between all 4 corporations, we can look at
the pairs plot or scatter plot matrix using pairplot in seaborn. Along the
diagonal, it shows the histogram or density estimate of each variable.
At the off-diagonal positions, it shows the scatter plots between pairs of
variables. This visualization gives richer information than a correlation
table alone. The 'alpha' option changes the opacity of the points.

Example 4.3.3 Show the pairs plot of the four corporations based on
their percentage change in stock price.

stock = pd.read_csv('stock_returns.csv')
sns.pairplot(stock, diag_kind='kde',
plot_kws={'alpha': 0.2})
Lab 4.3.4
Redo the scatter plot and regression of salary against experience
of “hr.csv” using seaborn.
4.3.5 (project for self-learning)
For you to practice the techniques in this lecture, here we introduce a
famous dataset. In 1978, David Harrison Jr. and Daniel L. Rubinfeld
published a paper called "Hedonic housing prices and the demand for
clean air". To support their findings, they referred to the data for
census tracts in the Boston Standard Metropolitan Statistical Area
(SMSA) in 1970. This dataset is clean, with lots of variables, including
the following:

feature variables (factors)
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population

target variable
MEDV - Median value of owner-occupied homes in $1000's
This dataset has been stored in "boston.csv". It contains 506 rows and
14 columns. We can first read it as a DataFrame.
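
For example (the file name is as given above):

boston = pd.read_csv("boston.csv")
boston.head()   # preview the first few rows and the 14 columns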

Given this dataset, how would you study the factors related to the
house price in Boston? Recall the data science process steps.
