AP Shah Ads Notes Pt 1

Subject: Applied Data Science Semester: VIII

Hypothesis testing

Hypothesis testing helps in data analysis by providing a way to make inferences about a
population based on a sample of data. It allows analysts to make decisions about whether
to accept or reject a given assumption or hypothesis about the population based on the
evidence provided by the sample data. For example, hypothesis testing can be used to
determine whether a sample mean is significantly different from a hypothesized population
mean or whether a sample proportion is significantly different from a hypothesized
population proportion. This information can be used to make decisions about whether to
accept or reject a given assumption or hypothesis about the population.

In statistical analysis, hypothesis testing is used to make inferences about a population


based on a sample of data.

In machine learning, hypothesis testing is used to evaluate the performance of a model


and determine the significance of its parameters. For example, a t-test or z-test can be
used to compare the means of two groups of data to determine if there is a significant
difference between them. This information can then be used to improve the model, or
select the best set of features.

Definition of Hypothesis Testing

The hypothesis is a statement, assumption or claim about the value of the parameter
(mean, variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.

For example, if we make the statement that “Dhoni is the best Indian captain ever,” we are making an
assumption based on the team's average wins and losses under his captaincy. We can test this
statement using all the match data.

Null and Alternative Hypothesis Testing

The null hypothesis is the hypothesis to be tested for possible rejection under the
assumption that it is true. The concept of the null hypothesis is similar to "innocent until proven guilty":
we assume innocence until we have enough evidence to prove that a suspect is guilty.


In simple language, we can understand the null hypothesis as already accepted


statements, For example, Sky is blue. We already accept this statement.

It is denoted by H0.
The alternative hypothesis complements the Null hypothesis. It is the opposite of the null
hypothesis such that both Alternate and null hypothesis together cover all the possible
values of the population parameter.

It is denoted by H1.
Let’s understand this with an example:

A soap company claims that its product kills, on average, 99% of germs.

How can the company say so? There has to be a testing technique to verify such a claim, and
hypothesis testing is used to test exactly this kind of claim or assumption.

To test the claim of this company we will formulate the null and alternate hypothesis.

Null Hypothesis(H0): Average =99%

Alternate Hypothesis(H1): Average is not equal to 99%.

Note: When we test a hypothesis, we assume the null hypothesis to be true until there is
sufficient evidence in the sample to prove it false. In that case, we reject the
null hypothesis and support the alternate hypothesis. If the sample fails to provide
sufficient evidence for us to reject the null hypothesis, we cannot say that the null
hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.

Simple and Composite Hypothesis Testing

When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and
if it specifies a range of values then it is called a composite hypothesis.

e.g. A motorcycle company claiming that a certain model gives an average mileage of
100 km per litre is a case of a simple hypothesis.


The average age of students in a class is greater than 20. This statement is a composite
hypothesis.

One-tailed and two-tailed Hypothesis Testing

If the alternate hypothesis gives the alternate in both directions (less than and greater than)
of the value of the parameter specified in the null hypothesis, it is called a Two-tailed test.

If the alternate hypothesis gives the alternate in only one direction (either less than or
greater than) of the value of the parameter specified in the null hypothesis, it is called
a One-tailed test.

e.g. if H0: mean= 100 H1: mean not equal to 100

here according to H1, mean can be greater than or less than 100. This is an example of a
Two-tailed test

Similarly, if H0: mean >= 100, then H1: mean < 100.

Here, the alternative considers only values less than 100, so it is called a One-tailed test.

Critical Region

The critical region is that region in the sample space in which if the calculated value lies
then we reject the null hypothesis.

Let’s understand this with an example:

Suppose you are looking to rent an apartment. You have listed all the available apartments
from different real estate websites. You have a budget of Rs. 15,000/month and cannot
spend more than that. The apartments on your list range in price from Rs. 7,000/month to
Rs. 30,000/month.

You select a random apartment from the list and assume below hypothesis:

H0: You will rent the apartment.

H1: You won’t rent the apartment.


Now, since your budget is 15000, you have to reject all the apartments above that price.

Here all the Prices greater than 15000 become your critical region. If the random
apartment’s price lies in this region, you have to reject your null hypothesis and if the
random apartment’s price doesn’t lie in this region, you do not reject your null hypothesis.

The critical region lies in one tail or two tails of the probability distribution curve according
to the alternative hypothesis. The critical region is a pre-defined area corresponding to a
cut-off value in the probability distribution curve; its size (probability) is denoted by α.

Critical values are values separating the values that support or reject the null hypothesis
and are calculated on the basis of alpha.

We will see more examples later on, and it will become clear how we choose α.

Based on the alternative hypothesis, three cases of critical region arise:

Case 1) Two-tailed test: the critical region lies in both tails of the distribution.

Case 2) Left-tailed test: the critical region lies in the left tail.

Case 3) Right-tailed test: the critical region lies in the right tail.

Type I and Type II Error

Type I and Type II errors are among the most important topics in hypothesis testing. Let's
simplify the topic by breaking it down into smaller parts.

A false positive (type I error) — when you reject a true null hypothesis.

A false negative (type II error) — when you fail to reject (accept) a false null hypothesis.

The probability of committing Type I error (False positive) is equal to the significance level
or size of critical region α.

α= P [rejecting H0 when H0 is true]

The probability of committing a Type II error (False negative) is denoted by β. The quantity
1 − β is called the 'power of the test'.

β = P [not rejecting H0 when H1 is true]


Let’s take another example to understand.

A person is on trial for a criminal offense, and the judge needs to provide a verdict on his
case. Now, there are four possible combinations in such a case:

• First Case: The person is innocent, and the judge identifies the person as innocent
• Second Case: The person is innocent, and the judge identifies the person as guilty
• Third Case: The person is guilty, and the judge identifies the person as innocent
• Fourth Case: The person is guilty, and the judge identifies the person as guilty

Here

H0: Person is innocent

H1: Person is guilty


As you can clearly see, there can be two types of error in the judgment –

Type I error will be if the Jury convicts the person [rejects H0] although the person was
innocent [H0 is true].

Type II error will be the case when the jury releases the person [does not reject H0] although
the person is guilty [H1 is true].

According to the Presumption of Innocence, the person is considered innocent until proven
guilty. We consider the Null Hypothesis to be true until we find strong evidence against
it; only then do we accept the Alternate Hypothesis. That means the judge must find
evidence which convinces him "beyond a reasonable doubt." This standard of
"beyond a reasonable doubt" can be understood as the Significance Level (⍺), i.e.
P(judge decides guilty | person is innocent) should be small. Thus, if ⍺ is smaller, it
will require more evidence to reject the Null Hypothesis.

The basic concepts of Hypothesis Testing are actually quite analogous to this situation.

Steps to Perform Hypothesis Testing


There are four steps to performing Hypothesis Testing:

1. Set the Null and Alternate Hypotheses


2. Set the Significance Level, Criteria for a decision
3. Compute the test statistic
4. Make a decision

It must be noted that z-Test & t-Tests are Parametric Tests, which means that the Null
Hypothesis is about a population parameter, which is less than, greater than, or equal to


some value. Steps 1 to 3 are quite self-explanatory but on what basis can we make a
decision in step 4? What does this p-value indicate?

We can understand this p-value as the measurement of the Defense Attorney’s argument.
If the p-value is less than ⍺ , we reject the Null Hypothesis, and if the p-value is greater
than ⍺, we fail to reject the Null Hypothesis.

Level of significance(α)

The significance level, in the simplest of terms, is the threshold probability of incorrectly
rejecting the null hypothesis when it is in fact true. This is also known as the type I error
rate.

It is the probability of a type 1 error. It is also the size of the critical region.

Generally, strong control of α is desired and in tests it is fixed in advance at very low levels like
0.05 (5%) or 0.01 (1%).

If H0 is not rejected at a significance level of 5%, this does not prove that the null hypothesis
is true; it only means that, at the 5% level, the sample does not provide sufficient evidence to reject it.

The p-value is the smallest level of significance at which a null hypothesis can be
rejected.

Decision making with p-value

We compare p-value to significance level(alpha) for taking a decision on Null Hypothesis.

If p-value is greater than alpha, we do not reject the null hypothesis.

If p-value is smaller than alpha, we reject the null hypothesis.


p-value
To understand this question, we will pick up the normal distribution:

p-value is the cumulative probability (area under the curve) of the values to the right of the red
point in the figure above.
Or,

p-value corresponding to the red point tells us about the ‘total probability’ of getting any value
to the right hand side of the red point, when the values are picked randomly from the population
distribution.
A large p-value implies that sample scores are more aligned or similar to the population score.

Alpha value is nothing but a threshold p-value, which the group conducting the
test/experiment decides upon before conducting a test of similarity or significance ( Z-test or a
T-test).

Consider the above normal distribution again. The red point in this distribution represents the
alpha value or the threshold p-value. Now, let’s say that the green and orange points represent
different sample results obtained after an experiment.


We can see in the plot that the leftmost green point has a p-value greater than the alpha. As a
result, these values can be obtained with fairly high probability and the sample results are
regarded as lucky.

The point on the rightmost side (orange) has a p-value less than the alpha value (red). As a
result, the sample results are a rare outcome and very unlikely to be lucky. Therefore, they
are significantly different from the population.

The alpha value is decided depending on the test being performed. An alpha value of 0.05 is
considered a good convention if we are not sure of what value to consider.

Let’s look at the relationship between the alpha value and the p-value closely.

p-value < alpha


Consider the following population distribution:

Here, the red point represents the alpha value. This is basically the threshold p-value. We can
clearly see that the area under the curve to the right of the threshold is very low.

The orange point represents the p-value using the sample population. In this case, we can
clearly see that the p-value is less than the alpha value (the area to the right of the red point is
larger than the area to the right of the orange point). This can be interpreted as:

The results obtained from the sample is an extremity of the population distribution (an
extremely rare event), and hence there is a good chance it may belong to some other
distribution (as shown below).


Considering our definitions of alpha and the p-value, we consider the sample results obtained
as significantly different. We can clearly see that the p-value is far less than the alpha value.

p-value > alpha:


p-value greater than the alpha means that the results are in favor of the null hypothesis and
therefore we fail to reject it. This result is often against the alternate hypothesis (obtained results
are from another distribution) and the results obtained are not significant and simply a matter
of chance or luck.

Again, consider the same population distribution curve with the red point as alpha and the
orange point as the calculated p-value from the sample:

So, p-value > alpha (considering the area under the curve to the right-hand side of the red and
the orange points) can be interpreted as follows:

The sample results are not a rare event under the population distribution and are very
likely to have been obtained simply by chance.

We can clearly see that the area under the population curve to the right of the orange point is
much larger than the alpha value. This means that the obtained results are more likely to be
part of the same population distribution than being a part of some other distribution.

Example of p-value in Statistics

In the National Academy of Archery, the head coach intends to improve the performance of
the archers ahead of an upcoming competition. What do you think is a good way to improve
the performance of the archers?


He proposed and implemented the idea that breathing exercises and meditation before the
competition could help. The statistics before and after experiments are below:

Interesting. The results favor the assumption that the overall score of the archers improved. But
the coach wants to make sure that these results are because of the improved ability of the
archers and not by luck or chance. So what do you think we should do?

This is a classic example of a similarity test (Z-test in this case) where we want to check
whether the sample is similar to the population or not. In order to solve this, we will follow a
step-by-step approach:

1. Understand the information given and form the alternate and null hypothesis
2. Calculate the Z-score and find the area under the curve
3. Calculate the corresponding p-value
4. Compare the p-value and the alpha value
5. Interpret the final results

Step 1: Understand the given information

• Population Mean = 74
• Population Standard Deviation = 8
• Sample Mean = 78
• Sample Size = 60

We have the population mean and standard deviation with us and the sample size is over
30, which means we will be using the Z-test.

According to the problem above, there can be two possible conditions:


1. The after-experiment results are a matter of luck, i.e. mean before and after experiment
are similar. This will be our “Null Hypothesis”
2. The after-experiment results are indeed very different from the pre-experiment ones.
This will be our “Alternate Hypothesis”

Step 2: Calculating the Z-Score

The test statistic is z = (x̄ − µ) / (σ / √n). On plugging in the corresponding values (x̄ = 78, µ = 74, σ = 8, n = 60), the Z-score comes out to be 3.87.
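As a rough check, the same calculation can be done in Python with scipy; the numbers below are the ones given in this example:

import math
from scipy import stats

pop_mean, pop_sd = 74, 8        # population mean and standard deviation
sample_mean, n = 78, 60         # sample mean and sample size

# z = (sample mean - population mean) / (population sd / sqrt(n))
z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

# one-tailed p-value: area to the right of the z-score
p_value = 1 - stats.norm.cdf(z)

print(round(z, 2), p_value)     # z is approximately 3.87; the p-value is well below 0.05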

Step 3: Referring to the Z-table and finding the p-value:


If we look up the Z-table for 3.87, we get a value of ~0.999. This is the area under the curve
or probability under the population distribution. But this is the probability of what?

The probability that we obtained is to the left of the Z-score (Red Point) which we calculated.
The value 0.999 represents the “total probability” of getting a result “less than the sample
score 78”, with respect to the population.

Here, the red point signifies where the sample mean lies with respect to the population
distribution. But we have studied earlier that p value is to the right-hand side of the red point,
so what do we do?

For this, we will use the fact that the total area under the normal Z distribution is
1. Therefore the area to the right of Z-score (or p-value represented by the unshaded region)
can be calculated as:

p-value = 1 – 0.999= 0.001

0.001 (p-value) is the unshaded area to the right of the red point. The value 0.001 represents
the “total probability” of getting a result “greater than the sample score 78”, with respect to
the population.


Step 4: Comparing p-value and alpha value:

We were not given any value for alpha, therefore we can consider alpha = 0.05. According to
our understanding, if the likeliness of obtaining the sample (p-value) result is less than the alpha
value, we consider the sample results obtained as significantly different.

We can clearly see that the p-value is far less than the alpha value:

0.001 (red region) << 0.05 (orange region)

This says that obtaining a mean of 78 would be a rare event under the population
distribution. Therefore, it is reasonable to say that the increase in the performance of the
archers in the sample is not the result of luck; the sample belongs to some other
(in this case better) distribution.


Box plot

A box and whisker plot—also called a box plot—displays the five-number summary of
a set of data. The five-number summary is the minimum, first quartile, median, third
quartile, and maximum.

In a box plot, we draw a box from the first quartile to the third quartile. A vertical line
goes through the box at the median. The whiskers go from each quartile to the
minimum or maximum.

Minimum: The minimum value in the given dataset


First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given
dataset into two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile
is known as the interquartile range. (i.e.) IQR = Q3-Q1


Outlier: Data points that fall far to the left or right of the ordered data are tested as
outliers. Generally, outliers fall more than a specified distance from the first and third
quartiles,
(i.e.) outliers are values greater than Q3 + (1.5 × IQR) or less than Q1 − (1.5 × IQR).

Applications
It is used to know:

• The outliers and their values


• Symmetry of Data
• Tight grouping of data
• Data skewness – if, in which direction and how.
o Positively Skewed: If the distance from the median to the maximum is
greater than the distance from the median to the minimum, then the box
plot is positively skewed.
o Negatively Skewed: If the distance from the median to minimum is
greater than the distance from the median to the maximum, then the box
plot is negatively skewed.


o Symmetric: The box plot is said to be symmetric if the median is


equidistant from the maximum and minimum values.

Example:
3, 7, 8, 5, 12, 14, 21, 15, 18, 50
Step1: Sort the values
3, 5, 7, 8, 12, 14, 15, 18, 21, 50
Step 2: Find the median.
Q2=13
Step 3: Find the quartiles.
First quartile, Q1 = data value at position (N + 2)/4=12/4=3rd position
Third quartile, Q3 = data value at position (3N + 2)/4=8th position
Q1=7
Q3=18
Step 4: Complete the five-number summary by finding the min and the max.

The min is the smallest data point, which is 3.

The max is the largest data point, which is 50.

The five-number summary is 3,7,13,18,50.

Here IQR=Q3-Q1=18-7=12

Any point above Q3 + 1.5 IQR (18 + 1.5×12 = 18 + 18 = 36) or below Q1 − 1.5 IQR (7 − 1.5×12 = 7 − 18 = −11) is
considered an outlier. Here 50 > 36, so 50 is an outlier.


So the box plot excluding outliers is as follows:

Box plot generated using matplotlib:
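A minimal matplotlib sketch that reproduces this example (using the sorted data above) might look like the following; matplotlib places the whiskers at 1.5×IQR by default, so 50 appears as an outlier point:

import matplotlib.pyplot as plt

data = [3, 5, 7, 8, 12, 14, 15, 18, 21, 50]   # the sorted example data

plt.boxplot(data)            # whiskers at 1.5*IQR by default, so 50 is drawn as an outlier
plt.ylabel("Value")
plt.title("Box plot of the example data")
plt.show()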


Scatter plot
Scatter plots are the graphs that present the relationship between two variables in a
data-set. It represents data points on a two-dimensional plane or on a Cartesian
system. The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis. These plots are often called scatter
graphs or scatter diagrams.
A scatter plot is also called a scatter chart, scattergram, or XY graph. The
scatter diagram graphs numerical data pairs, with one variable on each axis, to show
their relationship.
Scatter plots are used in either of the following situations.

• When we have paired numerical data


• When there are multiple values of the dependent variable for a unique value of
an independent variable
• In determining the relationship between variables in some scenarios, such as
identifying potential root causes of problems, or checking whether two outcomes
that appear to be related both arise from the same cause, and so on.

The line drawn through a scatter plot that lies as close as possible to all the points is
known as the "line of best fit" or "trend line".

Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –

1. Positive Correlation
2. Negative Correlation
3. No Correlation

Positive Correlation
A scatter plot with increasing values of both variables can be said to have a positive
correlation. Now positive correlation can further be classified into three categories:

• Perfect Positive – the points form a perfectly straight line

• High Positive – the points lie close to the trend line

• Low Positive – the points are widely scattered but still trend upward

Negative Correlation
A scatter plot with an increasing value of one variable and a decreasing value for
another variable can be said to have a negative correlation.These are also of three
types:

• Perfect Negative – the points form an almost perfectly straight (downward-sloping) line

• High Negative – the points lie close to the trend line

• Low Negative – the points are widely scattered but still trend downward

No Correlation
A scatter plot with no clear increasing or decreasing trend in the values of the variables
is said to have no correlation


Scatter plot Example


Let us understand how to construct a scatter plot with the help of the below example.
Question:
Draw a scatter plot for the given data that shows the number of games played and
scores obtained in each instance.

No. of games 3 5 2 6 7 1 2 7 1 7

Scores 80 90 75 80 90 50 65 85 40 100

Solution:
X-axis or horizontal axis: Number of games
Y-axis or vertical axis: Scores
Now, the scatter graph will be:
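A short matplotlib sketch of this scatter graph, using the data from the question, could look like this:

import matplotlib.pyplot as plt

games  = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]             # No. of games (X-axis)
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]  # Scores (Y-axis)

plt.scatter(games, scores)
plt.xlabel("Number of games")
plt.ylabel("Scores")
plt.title("Games played vs. scores obtained")
plt.show()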


Note: We can also combine several scatter plots on one sheet to read and understand
the higher-level structure in data sets containing more than two variables.

Scatter plot Matrix


For data variables such as x1, x2, x3, and xn, the scatter plot matrix presents all the
pairwise scatter plots of the variables on a single illustration with various scatterplots
in a matrix format. For the n number of variables, the scatterplot matrix will contain n
rows and n columns. A plot of variables xi vs xj will be located at the ith row and jth
column intersection. We can say that each row and column is one dimension, whereas
each cell plots a scatter plot of two dimensions.
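As a small illustration, a scatter plot matrix can be drawn with pandas; the DataFrame and its columns below are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical DataFrame with a few numeric variables
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 4, 5],
    "x3": [5, 3, 4, 2, 1],
})

# n x n grid: the cell at row i, column j is the scatter plot of xi vs xj
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()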


Z-Test

z tests are a statistical way of testing a Null Hypothesis when either:


• We know the population variance, or
• We do not know the population variance, but our sample size is large n ≥ 30.

If we have a sample size of less than 30 and do not know the population variance, we must use a t-test. This is
how we judge when to use the z-test vs the t-test. Further, it is assumed that the z-statistic follows a standard
normal distribution. In contrast, the t-statistics follows the t-distribution with a degree of freedom equal to n-1,
where n is the sample size.

It must be noted that the samples used for z-test or t-test must be independent sample, and also must have a
distribution identical to the population distribution. This makes sure that the sample is not “biased” to/against
the Null Hypothesis which we want to validate/invalidate.

For the null hypothesis H0: µ = µ0, the alternative hypothesis can be:

H1: µ > µ0 (right-tailed test)

H1: µ < µ0 (left-tailed test)

H1: µ ≠ µ0 (two-tailed test)

One-Sample Z-Test
We perform the One-Sample z-Test when we want to compare a sample mean with the population mean.

Here’s an Example to Understand a One Sample z-Test


Let’s say we need to determine if girls on average score higher than 600 in the exam. We have the information
that the standard deviation for girls’ scores is 100. So, we collect the data of 20 girls by using random samples
and record their marks. Finally, we also set our ⍺ value (significance level) to be 0.05.


In this example:
• Mean Score for Girls is 641
• The number of data points in the sample is 20
• The population mean is 600
• Standard Deviation for Population is 100
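A rough sketch of the calculation, assuming a right-tailed one-sample z-test on the numbers above:

import math
from scipy import stats

pop_mean, pop_sd = 600, 100
sample_mean, n = 641, 20

z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))   # about 1.83
p_value = 1 - stats.norm.cdf(z)                          # about 0.034 for the right tail

print(round(z, 2), round(p_value, 3))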

Since the P-value is less than 0.05, we can reject the null hypothesis and conclude based on our result that
Girls on average scored higher than 600.

Two-Sample Z-Test


We perform a Two Sample z-test when we want to compare the mean of two samples.

Here’s an Example to Understand a Two Sample Z-Test


Here, let’s say we want to know if Girls on an average score 10 marks more than the boys. We have the
information that the standard deviation for girls’ Score is 100 and for boys’ score is 90. Then we collect the data
of 20 girls and 20 boys by using random samples and record their marks. Finally, we also set our ⍺ value
(significance level) to be 0.05.

In this example:
• Mean Score for Girls (Sample Mean) is 641
• Mean Score for Boys (Sample Mean) is 613.3
• Standard Deviation for the Population of Girls’ is 100
• Standard deviation for the Population of Boys’ is 90
• Sample Size is 20 for both Girls and Boys
• Difference between Mean of Population is 10
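A minimal sketch of this two-sample z-test (testing whether the difference in means exceeds 10) might look like:

import math
from scipy import stats

mean_girls, sd_girls, n_girls = 641, 100, 20
mean_boys, sd_boys, n_boys = 613.3, 90, 20
hypothesized_diff = 10

# standard error of the difference between the two sample means
se = math.sqrt(sd_girls**2 / n_girls + sd_boys**2 / n_boys)

z = ((mean_girls - mean_boys) - hypothesized_diff) / se   # about 0.59
p_value = 1 - stats.norm.cdf(z)                           # about 0.28 for the right tail

print(round(z, 2), round(p_value, 2))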


Thus, based on the p-value we fail to reject the Null Hypothesis. We don't have enough
evidence to conclude that girls on average score 10 marks more than the boys. Pretty simple, right?

T-Test

T-tests are a statistical way of testing a hypothesis when:


• We do not know the population variance
• Our sample size is small, n < 30.

A t-test is a type of inferential statistic used to study whether there is a statistically significant difference between two groups.
Mathematically, it sets up the problem by assuming that the means of the two distributions are equal (H₀:
µ₁=µ₂). If the t-test rejects the null hypothesis (H₀: µ₁=µ₂), it indicates that the groups are very likely different.

The statistical test can be one-tailed or two-tailed. The one-tailed test is appropriate when there is a difference
between groups in a specific direction. It is less common than the two-tailed test. When choosing a t test, you
will need to consider two things: whether the groups being compared come from a single population or two
different populations, and whether you want to test the difference in a specific direction.


There are three main types of t-test :

• One Sample t-test : Compares the mean of a single group against a known or hypothesized population mean.
• Two Sample: Paired Sample T Test: Compares means from the same group at different times.
• Two Sample: Independent Sample T Test: Compares means for two different groups.

One Sample t-test:

t = (Sample Mean − Population Mean) / Standard Error

t = (x̄ − µ) / (s / √n)

x̄ : Sample mean
µ : Population mean
s : Sample standard deviation
n : Sample size

The sample standard deviation can be calculated as s = √( Σ(x − x̄)² / (n − 1) ).

Degree of freedom = n − 1

Two-sample - Paired Sample t-test

t = d̄ / (s / √n)

d̄ : Mean of the differences between the paired observations
s : Standard deviation of the differences
n : Sample size (number of pairs)

Degree of freedom = n − 1

The standard deviation of the differences can be calculated as:

s = √( (Σd² − n·(d̄)²) / (n − 1) )

Two Sample: Independent Sample T Test:

t = (x̄1 − x̄2) / ( sp · √(1/n1 + 1/n2) )

sp = √( ((n1 − 1)·s1² + (n2 − 1)·s2²) / (n1 + n2 − 2) )

x̄1 : Mean of the first sample
x̄2 : Mean of the second sample
n1 : Number of items in the first sample
n2 : Number of items in the second sample
s1 : Standard deviation of the first sample
s2 : Standard deviation of the second sample
sp : Pooled (combined) standard deviation

Degree of freedom is n1 + n2 − 2.

• If the calculated t value is greater than critical t value (obtained from a critical value table called the T-
distribution table) then reject the null hypothesis.
• P-value <significance level (𝜶) => Reject your null hypothesis in favor of your alternative
hypothesis. Your result is statistically significant.
• P-value >= significance level (𝜶) => Fail to reject your null hypothesis. Your result is not statistically
significant.
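For reference, all three t-tests can also be run with scipy; a small sketch on made-up numbers (the data values here are only illustrative):

from scipy import stats

before = [72, 75, 78, 71, 74, 77, 73, 76]      # hypothetical paired measurements
after  = [75, 78, 80, 74, 77, 79, 75, 79]
group_b = [65, 70, 68, 72, 66, 69, 71, 67]     # hypothetical independent group

# One-sample t-test: compare the mean of 'before' against a hypothesized mean of 70
print(stats.ttest_1samp(before, popmean=70))

# Paired-sample t-test: the same group measured at two different times
print(stats.ttest_rel(before, after))

# Independent-samples t-test: two different groups
print(stats.ttest_ind(before, group_b))

Each call returns a test statistic and a p-value, which can be compared against α as described above.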


APPLIED DATA SCIENCE


Time series forecasting using linear regression

PROF.RAMYA.R.B , ASSISTANT PROFESSOR , COMPUTER ENGINEERING,APST THANE


OBJECTIVE

“ To demonstrate time series forecasting using linear regression. “


Notes on coding the time variable X in the worked examples:

• Odd number of years: the midpoint is the middle year, e.g. midpoint = 2003, and X = year − midpoint.

• Even number of years: the midpoint falls between the two middle years, e.g. midpoint = (2003 + 2004)/2 = 2003.5, midpoint = (1998 + 1999)/2 = 1998.5, or midpoint = (1964 + 1965)/2 = 1964.5. Either multiply (year − midpoint) by 2 or divide it by 0.5 to get whole-number X values.

A least-squares sketch of this procedure is shown below.
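A minimal sketch of fitting a linear trend y = a + b·X by least squares with the centered X coding; the sales figures here are made up for illustration:

years = [2001, 2002, 2003, 2004, 2005]        # odd number of years, so midpoint = 2003
sales = [40, 44, 47, 52, 55]                  # hypothetical yearly sales

midpoint = 2003
x = [year - midpoint for year in years]       # X = -2, -1, 0, 1, 2 (sums to zero)

n = len(years)
# with sum(x) = 0 the least-squares estimates simplify to:
a = sum(sales) / n                            # intercept = mean of y
b = sum(xi * yi for xi, yi in zip(x, sales)) / sum(xi * xi for xi in x)   # slope

forecast_2006 = a + b * (2006 - midpoint)     # project the trend one year ahead
print(round(a, 2), round(b, 2), round(forecast_2006, 2))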
Thank you
APPLIED DATA SCIENCE
Time Series Forecasting

PROF.RAMYA.R.B , ASSISTANT PROFESSOR , COMPUTER ENGINEERING,APST THANE


OBJECTIVE



To explain the taxonomy of time series forecasting methods
and time series decomposition.
• Time series is a series of observations listed in the order of time.

• The data points in a time series are usually recorded at constant successive time intervals.
• Time series analysis is the process of extracting meaningful non-trivial information and patterns from
time series.
• Time series forecasting is the process of predicting the future value of time series data based on past
observations and other inputs.

• The objective in time series forecasting is slightly different: to use historical information about a
particular quantity to make forecasts about the value of the same quantity in the future.
UTILITY OF TIME SERIES FORECASTING

• It helps understanding past behaviour


• It helps in planning future operations
• It helps in evaluating current accomplishments
• It facilitates comparison

Time series forecasting is useful to the:

• Economist
• Businessman
• Scientist
• Astronomer
• Geologist
• Sociologist
• Research worker
In time series analysis one is concerned with forecasting a specific variable, given that it is known
how this variable has changed over time in the past.

In all other predictive models , the time component of the data was either ignored or was not
available. Such data are known as cross-sectional data.
Taxonomy of Time Series Forecasting

The investigation of time series can also be broadly divided into descriptive modeling, called time series analysis,
and predictive modeling, called time series forecasting.
Time Series forecasting can be further classified into four broad categories of techniques:
• Forecasting based on time series decomposition,
• smoothing based techniques,
• regression based techniques, and
• machine learning-based techniques.
Time series decomposition
• Time series decomposition is the process of deconstructing a time series into the number of constituent
components with each representing an underlying phenomenon.

• Decomposition splits the time series into a trend component, a seasonal component, and a noise
component.

• The trend and seasonality components are predictable (and are called systematic components),
whereas, the noise, by definition, is random (and is called the non-systematic component).

• Before forecasting the time series, it is important to understand and describe the components that
make the time series.
Firstly, the overall drug sales is trending upward and the upward trend accelerates in the 2000s.

Secondly, there is clear seasonality in the time series of drug sales. In particular, it is a yearly seasonality. There is a
spike in drug sales at the start of the year and a dip in every February. This seasonal variation is consistent every
year.

However, even when accounting for the trend and the seasonal variations there is one more phenomenon that
could not be explained. For example, the pattern in 2007 is odd when compared with prior years or 2008.
Components/Elements of a Time Series
Trend: Trend is the long-term tendency of the data. It represents change from one period to next.

Seasonality: Seasonality is the repetitive behavior during a cycle of time. These are repeated patterns appearing over and
over again in the time series. Seasonality can be further split into hourly, daily, weekly, monthly, quarterly, and yearly
seasonality.

Cycle: Cyclic component represents longer-than-a-year patterns where there is no specific time frames between the cycles.
An example here is the economic cycle of booms and crashes. While the booms and crashes exhibit a repeated pattern, the
length of a boom period, the length of a recession, the time between subsequent booms and crashes (and even two
consecutive crashes— double dip) is uncertain and random, unlike the seasonality components.

Noise: In a time series, anything that is not represented by level, trend, seasonality, or cyclic component is the noise in the
time series. The noise component is unpredictable but follows normal distribution in ideal cases.
All the time series datasets will have noise.
These individual components can be better forecasted using regression or similar techniques and
combined together as an aggregated forecasted time series.

This technique is called forecasting with decomposition.


Time series can be thought as past observations informing future predictions. To forecast future
data, one can smooth past observations and project it to the future. Such time series forecasting
methods are called smoothing based forecasting methods.

In smoothing methods, the future value of the time series is the weighted average of past
observations.
Regression based forecasting techniques are similar to conventional supervised predictive models,
which have independent and dependent variables, but with a twist: the independent variable is
now time. The simplest of such methods is of course a linear regression model of the form

yt = a + b·t

where yt is the value of the target variable at time t.

Given a training set, the values of coefficients a and b can be estimated to forecast
future y values.

Regression based techniques can get pretty complicated with the type of function used to model
the relationship between future value and time. Commonly used functions are exponential,
polynomial, and power law functions. Most people are familiar with the trend line function
in spreadsheet programs, which offer several different function choices. The regression based
time series forecast differs from a regular function-fitting predictive model in the choice of the
independent variable.
A more sophisticated technique is based on the concept of autocorrelation.

Autocorrelation refers to the fact that data from adjacent time periods are correlated in a time
series. The most well-known among these techniques is ARIMA, which stands for Auto Regressive
Integrated Moving Average.

Any supervised classification or regression predictive models can be used to forecast the time
series too, if the time series data are transformed to a particular format with a target label and
input variables. This class of techniques are based on supervised machine learning models where
the input variables are derived from the time series using a windowing technique.
• Time series decomposition can be classified into additive decomposition and multiplicative decomposition,
based on the nature of the different components and how they are composed.

• In an additive decomposition, the components are decomposed in such a way that when they are added together,
the original time series can be obtained.

Time series=Trend + Seasonality + Noise

• In the case of multiplicative decomposition, the components are decomposed in such a way that when they are
multiplied together, the original time series can be derived back.

Time series=Trend * Seasonality * Noise

• Both additive and multiplicative time series decomposition can be represented by these equations:

yt = Tt + St + Et (additive)    or    yt = Tt × St × Et (multiplicative)

where Tt, St, and Et are the trend, seasonal, and error components respectively.

• The original time series yt is just an additive or multiplicative combination of components.


• If the magnitude of the seasonal fluctuation or the variation in trend changes with the level of the time series, then
multiplicative time series decomposition is the better model.
Classical Decomposition

• The classical decomposition technique is simple, intuitive, and serves as a baseline of all other advanced
decomposition methods.
• Suppose the time series has yearly seasonality with monthly data, as shown in the figure. Here m represents the seasonal
period, which is 12 for monthly data with yearly seasonality.

• The classical decomposition technique first estimates the trend component by calculating the long term (say
12 month) moving average.
• The trend component is removed from the time series to get remaining seasonal and noise components.

• The seasonal component can be estimated by averaging the Jan, Feb, Mar, . . . values in the remaining (detrended) series.

• Once the trend and seasonal components are removed, what is left is noise.
The algorithm for classic additive decomposition is:

1. Estimate the trend Tt: if m is even, calculate a 2×m moving average (an m-MA followed by a 2-MA); if m is odd, calculate an
m-moving average. A moving average is the average of the last m data points.

2. Calculate the detrended series: calculate yt − Tt for each data point in the series.

3. Estimate the seasonal component St: average (yt − Tt) for each period of the season. For example, calculate the average of all January
values of (yt − Tt) and repeat for all the months. Normalize the seasonal values so that their mean is zero.

4. Calculate the noise component Et: Et = yt − Tt − St for each data point in the series.

Multiplicative decomposition is similar to additive decomposition: replace subtraction with division in the algorithm
described.
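In Python, this classical decomposition is available off the shelf; a minimal sketch with statsmodels, using a made-up monthly series, is:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# hypothetical monthly series with an upward trend and yearly seasonality
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
series = pd.Series([100 + 2 * i + 10 * ((i % 12) in (0, 1)) for i in range(48)], index=idx)

# model="additive" follows the algorithm above; use model="multiplicative"
# when the seasonal swing grows with the level of the series
result = seasonal_decompose(series, model="additive", period=12)

print(result.trend.head(15))     # centered 2x12 moving average
print(result.seasonal.head(12))  # one repeating seasonal cycle
print(result.resid.head(15))     # what is left over: the noise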
Thank you

APPLIED DATA SCIENCE


ARIMA Model

PROF.RAMYA.R.B , ASSISTANT PROFESSOR , COMPUTER ENGINEERING,APST THANE


OBJECTIVE

“ To demonstrate ARIMA model. “


Three factors define ARIMA model, it is defined as ARIMA(p,d,q) where p, d, and q denote
the number of lagged (or past) observations to consider for autoregression, the number of
times the raw observations are differenced, and the size of the moving average window
respectively.

The below equation shows a typical autoregressive model. As the name suggests,
the new values of this model depend purely on a weighted linear combination of its
past values. Given that there are p past values, this is denoted as AR(p) or an
autoregressive model of the order p. Epsilon (ε) indicates the white noise

Next, the moving average (MA) model is defined as follows:

Here, the future value y(t) is computed based on the errors εt made by
the previous model. So, each successive term looks one step further
into the past to incorporate the mistakes made by that model in the
current computation. Based on the window we are willing to look past,
the value of q is set. Thus, the above model can be independently
denoted as a moving average order q or simply MA(q).
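Written out in its standard form (with weights θ1 … θq on the past forecast errors), the MA(q) model is:

y(t) = c + ε(t) + θ1·ε(t−1) + θ2·ε(t−2) + … + θq·ε(t−q)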

Why does ARIMA need Stationary Time-Series Data?

Stationarity
A stationary time series is one whose properties do not depend on the time at which the series is observed. That is why
time series with trends, or with seasonality, are not stationary: the trend and seasonality
affect the value of the time series at different times.

A stationary time series is one whose statistical properties such as mean, variance,
autocorrelation, etc. are all constant over time.

On the other hand, for a stationary series it does not matter when you observe it; it should look
much the same at any point in time. In general, a stationary time series will have no
predictable patterns in the long term.
Why does ARIMA need Stationary Time-Series Data?
Time series data must be made stationary to remove any obvious correlation and collinearity
with the past data.

In stationary time-series data, the properties or value of a sample observation does not depend
on the timestamp at which it is observed. For example, given a hypothetical dataset of the year-
wise population of an area, if one observes that the population increases two-fold each year or
increases by a fixed amount, then this data is non-stationary.

Any given observation is highly dependent on the year since the population value would rely on
how far it is from an arbitrary past year. This dependency can induce incorrect bias while
training a model with time-series data.

To remove this correlation, ARIMA uses differencing to make the data stationary.

Differencing, at its simplest, involves taking the difference of two adjacent data points.
For example, suppose the left graph shows Google's stock price for 200 days, while the
graph on the right is the differenced version of the first graph, meaning that it shows the
day-to-day change in Google's stock price over those 200 days. There is a pattern observable in the first graph, and
these trends are a sign of non-stationary time series data. However, no trend,
seasonality, or increasing variance is observed in the second figure. Thus, we can say that
the differenced version is stationary.
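With pandas, first-order differencing is a one-liner; a small sketch (the prices here are made up) is:

import pandas as pd

prices = pd.Series([100, 102, 105, 103, 108, 112, 111, 115])  # hypothetical closing prices

diffed = prices.diff().dropna()   # y'(t) = y(t) - y(t-1); the first value is NaN and is dropped
print(diffed.tolist())            # [2.0, 3.0, -2.0, 5.0, 4.0, -1.0, 4.0]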

This change can simply be modeled as y′(t) = y(t) − y(t−1) = (1 − B)·y(t), where B denotes the backshift operator defined as B·y(t) = y(t−1).
Combining all of the three types of models above gives the
resulting ARIMA(p,d,q) model.
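In backshift notation this combined model is commonly written as (the exact layout on the original slide may differ slightly):

(1 − φ1·B − … − φp·B^p) · (1 − B)^d · y(t) = c + (1 + θ1·B + … + θq·B^q) · ε(t)

i.e. the AR part acts on the d-times differenced series, while the MA part acts on the error terms.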

In general, it is a good practice to follow the next steps when doing time-series forecasting:

•Step 1 — Check Stationarity: If a time series has a trend or seasonality component, it must be made
stationary.
•Step 2 — Determine the d value: If the time series is not stationary, it needs to be stationarized
through differencing.
•Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term,
MA term, (or) ARMA.
•Step 4 — Build the model

For a stationary time series,
the ACF will drop to zero
relatively quickly, while the
ACF of non-stationary data
decreases slowly.

The right order of differencing is the minimum differencing required to get a near-stationary
series which roams around a defined mean and the ACF plot reaches to zero fairly quick.

If the autocorrelations are positive for a large number of lags (10 or more), then the series
needs further differencing.

On the other hand, if the lag-1 autocorrelation itself is too negative, then the series is probably
over-differenced.

Check if the series is stationary using the Augmented Dickey
Fuller test (adfuller()), from the statsmodels package.

Why?

Because, you need differencing only if the series is non-


stationary. Else, no differencing is needed, that is, d=0.

The null hypothesis of the ADF test is that the time series is
non-stationary. So, if the p-value of the test is less than the
significance level (0.05) then you reject the null hypothesis and
infer that the time series is indeed stationary.

So, in our case, if P Value > 0.05 we go ahead with finding the
order of differencing.
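A minimal sketch of this check with statsmodels, run here on a made-up trending series:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = pd.Series(10 + 0.5 * np.arange(100) + rng.normal(0, 1, 100))  # hypothetical trending data

result = adfuller(series)
adf_stat, p_value = result[0], result[1]

# H0 of the ADF test: the series is non-stationary
if p_value < 0.05:
    print("p =", round(p_value, 4), "-> reject H0: the series is stationary (d = 0)")
else:
    print("p =", round(p_value, 4), "-> fail to reject H0: difference the series and test again")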

Thank you

APPLIED DATA SCIENCE


Moving Average and Exponential Smoothing

PROF.RAMYA.R.B , ASSISTANT PROFESSOR , COMPUTER ENGINEERING,APST THANE


OBJECTIVE



To demonstrate smoothing methods: moving average and exponential smoothing.
The various time series forecasting methods are:

• Simple Average
• Moving Average
• Weighted Moving Average
• Naïve Method
• Exponential smoothing
• Time Series Analysis using Linear Regression(Least Squares
Method)
• ARIMA
Simple Average:

The method is very simple: average the data by months, quarters, or years to get the
average for each period. Then find out what percentage each period's average is of the
grand average.
Moving Average
3-Moving Average

Example 1:
3-Moving Average

Example 2:

4-Moving Average

Example 1

4-Moving Average

Example 2:

4-Moving Average

Example 3:

5-Moving Average

Example 1

Weighted Moving Average

When using a moving average method described before, each of the observations used to compute the
forecasted value is weighted equally. In certain cases, it might be beneficial to put more weight on the
observations that are closer to the time period being forecast. When this is done, this is known as a weighted
moving average technique. The weights in a weighted MA must sum to 1.
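A small sketch of both ideas in plain Python (the demand numbers are made up): a simple 3-period moving average forecast, and a weighted version that puts more weight on the most recent periods.

demand = [42, 45, 49, 47, 52, 55]           # hypothetical observations

# simple 3-period moving average: forecast for the next period
sma_forecast = sum(demand[-3:]) / 3         # (47 + 52 + 55) / 3 = 51.33

# weighted 3-period moving average: the weights must sum to 1,
# with the largest weight on the most recent observation
weights = [0.2, 0.3, 0.5]
wma_forecast = sum(w * y for w, y in zip(weights, demand[-3:]))   # 0.2*47 + 0.3*52 + 0.5*55 = 52.5

print(round(sma_forecast, 2), round(wma_forecast, 2))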
Naïve Method

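In the naïve method, the forecast for the next period is simply the most recent observed value: F(t+1) = y(t).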
Exponential Smoothing

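Simple exponential smoothing forecasts each period as a weighted blend of the latest observation and the previous forecast, F(t+1) = α·y(t) + (1 − α)·F(t), where 0 < α ≤ 1 is the smoothing constant. A tiny sketch (demand values made up, α = 0.3, first forecast seeded with the first observation):

demand = [42, 45, 49, 47, 52, 55]   # hypothetical observations
alpha = 0.3

forecast = demand[0]                # seed the first forecast with the first actual value
for y in demand[1:]:
    forecast = alpha * y + (1 - alpha) * forecast   # F(t+1) = alpha*y(t) + (1-alpha)*F(t)

print(round(forecast, 2))           # forecast for the next (unobserved) period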
Thank you
Data Science Life Cycle for Weather Forecasting

1. Problem statement: Weather prediction is a major concern for the weather department as it
is closely tied to human life and the economy. For instance, excess rainfall is a major
cause of natural disasters such as droughts and floods, which people across the world encounter every
year. A time series machine learning model can be used for forecasting weather
conditions in any state. Time series modeling is currently becoming very popular; one reason for this is
the declining cost of hardware and its increasing processing capability.

An effort has been made to develop a SARIMA (Seasonal Autoregressive Integrated Moving
Average) model for temperature prediction using historical data from Pune, Maharashtra. The
historical dataset from the year 2009 to 2020 has been taken for observation. When a
repeating cycle is present in a time series, instead of decomposing it manually to fit an ARIMA
model, another very popular method is to use the seasonal autoregressive integrated moving
average (SARIMA) model. The seasonal ARIMA model is built by running Python in an
Anaconda Jupyter Notebook and using the matplotlib package for data visualization.

Description of dataset:
Temperature data recorded from 2009 to 2020 were obtained for Pune city, from the meteorology
department at one-hour intervals. The data collected has different parameters, such as date time,
temperature, humidity, moonrise, wind speed, wind direction, pressure.

2. Data Pre-processing

Time series data is normally messy. Forecasting models from simple rolling averages to LSTMs
requires data to be clean. So the techniques used before moving to forecasting are:

Detrending / Stationarity: Before forecasting, we want our time series variables to be mean-variance
stationary. This means that the statistical properties of the series do not vary depending
on when the sample was taken. Models built on stationary data are generally more robust. Stationarity
can be achieved by using differencing.
Anomaly detection: Any outlier present in the data might skew the forecasting results so it’s
often considered a good practice to identify and normalize outliers before moving on to
forecasting.
Check for sampling frequency: This is an important step to check the regularity of sampling.
Irregular data has to be imputed or made uniform before applying any modeling techniques
because irregular sampling leads to broken integrity of the time series and doesn’t fit well with
the models.
Missing data: At times there can be missing data for some datetime values and it needs to be
addressed before modeling.

3. Data visualization
The time series data can be broken down into trend, seasonality and noise components using
multiplicative decomposition and individual plots can be created to visually inspect them.

4. Model Building
The hourly temperature data during 2009–2018 is used as the training set, while that during 2019–
2020 is used as the testing set. To evaluate the forecast accuracy, as well as to compare the results
obtained from different models, the mean square error (MSE) is calculated.

The algorithms used in the case study are:


Auto-Regressive Integrated Moving Average (ARIMA)
Seasonal Auto-Regressive Integrated Moving Average (SARIMA)

Akaike's Information Criterion (AIC) is the most commonly used model selection criterion. AIC
basically deals with the goodness of fit of a model. AIC is calculated as:

AIC = −2 ln(maximum likelihood) + 2p

where p represents the number of independent constraints estimated. Therefore,
when comparing models, the one with the least AIC value is chosen.

Auto-Regressive Integrated Moving Averages: ARIMA is a form of regression analysis that


gauges the strength of one dependent variable relative to other changing variables. The model’s
goal is to predict future securities or financial market moves by examining the differences
between values in the series instead of actual values. Autoregression(AR) refers to a model that
shows a changing variable that regresses on its own lagged, or prior, values. Integrated(I)
represents the differencing of raw observations to allow for the time series to become
stationary. Moving Average (MA) incorporates the dependency between an observation and a
residual error from a moving average model applied to lagged observations.
Each component in ARIMA functions as a parameter with a standard notation that would be
ARIMA with p, d and q where integer values substitute for the parameters to indicate the type of
ARIMA model used.
p – no. of lag observations in the model (known as the lag order)
d – no. of times that the raw observations are differenced (known as the degree of differencing)
q – the size of moving average (order of moving average)
Seasonal Autoregressive Integrated Moving Average: SARIMA model is one step different
from an ARIMA model based on the concept of seasonal trends. It similarly uses past values but
also considers any seasonality patterns. The order in the ARIMA model (p, d, q) is used in this
model, which does consider seasonality. Added to that ‘s’ indicates the seasonal length in the data.

The flow of the proposed system is as follows:


5. Model Evaluation

Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are
normally used as performance evaluation metrics. The predicted temperature values are
compared with actual values for accuracy based on these error metrics.

Let us say we obtained an MAE of 0.60850 and an RMSE of 0.76233 for the SARIMA model, and an MAE of
6.052 and an RMSE of 7.496 for the ARIMA model. If so, we conclude that the SARIMA model's forecasts
yielded the least error in predicting temperature. Thus, the derived model could be used to
forecast weather for the upcoming years.

Owing to the linear nature of both the algorithms, they are quite handy and used in the industry
when it comes to experimentation and understanding the data, creating baseline forecasting scores.
If tuned right with lagged values (p,d,q) they can perform significantly better. The simple and
explainable nature of both the algorithms makes them one of the top picks by analysts and Data
Scientists. There are, however, some pros and cons when working with ARIMA and SARIMA at
scale.
Data Science Life Cycle for Customer Segmentation

1. Problem statement:
Customer segmentation is simply grouping customers with similar characteristics. These
characteristics include geography, demography, behavioural, purchasing power, situational factors,
personality, lifestyle, psychographic, etc. The goals of customer segmentation are customer
acquisition, customer retention, increasing customer profitability, customer satisfaction, resource
allocation by designing marketing measures or programs and improving target marketing measures.

Assume a supermarket mall owner has, through membership cards, some basic data about
customers such as Customer ID, age, gender, annual income, and spending score, and wants to
understand who the target customers are so that this insight can be given to the marketing
team and strategies can be planned accordingly.
2. Description of dataset:

The fictitious dataset includes the following features:


1. Customer ID
2. Customer Gender
3. Customer Age
4. Annual Income of the customer (in Thousand Dollars)
5. Spending score of the customer (based on customer behaviour and spending nature)

3. Descriptive Statistics:
The pandas describe() function can be used to get a descriptive statistics summary of a given
dataframe. This includes the count, mean, standard deviation, percentiles and min-max values of all
the features.
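A brief sketch, assuming the data has been loaded from a hypothetical Mall_Customers.csv file:

import pandas as pd

# Load the customer data (file name is illustrative)
df = pd.read_csv("Mall_Customers.csv")

# Count, mean, standard deviation, min, percentiles and max of every numeric column
print(df.describe())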

4. Data Pre-processing

In this step, the following operations can be performed:

— Looking for null or missing values
— Looking for duplicated values
— Extracting columns of choice
— Feature normalization

We are only interested in Annual Income (k$) and Spending Score (1–100), so these columns can be
extracted from the dataset.
Feature normalization helps to adjust all the data elements to a common scale in order to improve
the performance of the clustering algorithm. For example, in this dataset Annual Income has values
in the thousands while Spending Score has only two digits. Since these variables are on different
scales, it is hard to compare them. With min-max normalization each data point is converted to the
range 0 to 1. Other normalization techniques include decimal scaling and z-score.
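A minimal sketch of these two operations, assuming the column names match the hypothetical CSV above:

from sklearn.preprocessing import MinMaxScaler

# Extract only the two columns of interest (names assumed to match the CSV headers)
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Min-max normalization rescales each feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)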

5. Data visualization
Since we are interested in identifying the relationship between Annual Income (k$) and Spending
Score (1–100), a scatter plot is a natural choice. Bar plots, pair plots, etc. can also be used for visualization.
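A simple matplotlib sketch of the scatter plot, using the columns assumed above:

import matplotlib.pyplot as plt

plt.scatter(df["Annual Income (k$)"], df["Spending Score (1-100)"])
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("Annual Income vs Spending Score")
plt.show()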

6. Model Building
Clustering algorithms include the K-means algorithm, hierarchical clustering and DBSCAN. In this
project, the K-means clustering algorithm is applied for customer segmentation. K-means is
a clustering algorithm based on the principle of partitioning.
The steps of K-means clustering are:
1. Determine the number of clusters (k).
2. Select initial centroids.
3. Map each data point into the nearest cluster (the one whose centroid it is most similar to).
4. Update the mean value (centroid) of each cluster.
5. Repeat steps 3–4 until the centroids no longer change.

Choosing the Optimum Number of Clusters


To find the optimum number of clusters we can use WCSS (Within-Cluster Sum of Squares) together
with an elbow plot. WCSS is defined as the sum of the squared distances between each member of a
cluster and its centroid. WCSS is computed for different values of k, and the k at which the decrease
in WCSS first starts to flatten out is chosen. In the plot of WCSS versus k, this is visible as an elbow.
The steps can be summarized as follows (a short code sketch follows the list):
1. Compute K-means clustering for different values of K, varying K from 1 to 10 clusters.
2. For each K, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS versus the number of clusters K.
4. The location of a bend (knee) in the plot is generally considered an indicator of the
appropriate number of clusters.
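A minimal sketch of the elbow computation with scikit-learn, reusing the scaled feature matrix X_scaled from the pre-processing sketch:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)   # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()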
Let us say that from the elbow graph we observe that between 4 and 6 clusters the decrease in WCSS
flattens out (an elbow); hence, we choose a K value of 5 for our dataset.
Now the model is trained on the dataset with the number of clusters set to 5.
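A hedged sketch of training the final model and plotting the segments; the exact colours and cluster numbering will depend on the data:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)            # cluster label (0-4) for each customer

# 2D scatter of the segments, coloured by cluster label, with centroids marked
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker="x", s=100)
plt.xlabel("Annual Income (scaled)")
plt.ylabel("Spending Score (scaled)")
plt.show()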

The result of the analysis can be visualized using a 2D or 3D plot, and it shows that the retail store
customers can be grouped into 5 clusters or segments for targeted marketing (shown using 5 different
colours).
Cluster 1 (green): These are average income earners with average spending scores. They are
cautious with their spending at the store.
Cluster 2 (yellow): The customers in this group are high income earners with high spending
scores. They bring in profit. Discounts and other offers targeted at this group will increase their
spending score and maximize profit.
Cluster 3 (red): This group of customers have a higher income but do not spend much at the
store. One assumption could be that they are not satisfied with the services rendered at the
store. They are another ideal group to be targeted by the marketing team because they have the
potential to bring in increased profit for the store.
Cluster 4 (purple): Low income earners with low spending scores. We can assume this is
because people with low income tend to purchase fewer items at the store.
Cluster 5 (blue): These are low income customers with high spending scores. We can assume
that this group spends more at the retail store, despite earning less, because they enjoy and are
satisfied with the services rendered at the retail store.

With the help of clustering, we can understand the variables much better, prompting us to make
careful decisions. With the identification of customer segments, companies can release products and
services that target customers based on several parameters like income, age, spending patterns, etc.
Furthermore, more complex signals such as product reviews can be taken into consideration for better
segmentation.
Data Science Life Cycle for House Price Prediction

1. Problem statement:
We can analyse existing real estate prices to predict the price of real estate. This can be very useful
in understanding the valuation of a property or a new development as many people get confused
about the prices while purchasing a property and often end up paying too much for a flat or a
house. The problem statement is to predict the sale price of a house, given the features of the
house. The features are the columns in the dataset, and the target variable is the SalePrice column.
The problem is a regression problem, as the target variable is continuous.

2. Description of dataset: We shall use some sample real estate data to predict real estate
prices. For example, the dataset may have attributes like a category marking under construction or not,
RERA approved or not, number of rooms, type of property (1 RK / 1 BHK / 2 BHK), total area
of the house in square feet, a category marking ready to move or not, a category marking resale or
not, address of the property, longitude of the property, latitude of the property, etc.

3. Descriptive statistics and inferential statistics

It is important for us to get a good understanding of the characteristics of our data. The .info()
method can be used to get a quick snapshot of the current state of the dataset.

Descriptive statistics can be used to find out, more specifically, the min, max, mean,
standard deviation, upper and lower bounds, and count of all the variables in our dataset. This
information is important, firstly, for determining the general characteristics of our data and,
secondly, for identifying the outliers in our data; it also gives us a better idea of how we
need to prepare our data for training later on.

4. Data Pre-processing
There could be some mistakes in the collected entries in the dataset, such as null values, human
errors or impractical values which we call outliers. To overcome these inaccuracies,
we need to pre-process the data and clean out these clutter values. Data pre-processing is
important because only if the data we provide to our model is accurate and faultless
will the model be able to give precise estimates that are very close to the actual
values.

Handling missing values:


If there are missing values (NaN) in the dataset, we can handle them by dropping them or
replacing them with values such as the mean or median. Replacing NaN values with the median is
the best course of action, especially when we have a smaller dataset and do not want to lose any
of our precious data. If we have a very large dataset, it is fine to just drop the NaN values, provided
there are only a small number of them. It is better to replace NaN values with the median rather
than the mean, as the median is less affected when the distribution of our dataset is skewed.
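A short sketch of both options, assuming the housing data is in a pandas dataframe df and that TotalArea is a hypothetical numeric column with missing entries:

# Count missing values per column
print(df.isnull().sum())

# Option 1: drop rows with missing values (acceptable for large datasets with few NaNs)
df_dropped = df.dropna()

# Option 2: replace NaNs in a numeric column with its median (robust to skewed distributions)
df["TotalArea"] = df["TotalArea"].fillna(df["TotalArea"].median())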

Handling outliers:
Outlier values can be removed from the dataset or replaced with less extreme values. There is
also one more way to deal with outliers, called capping, which is often the preferred method.
Here we first identify the upper and lower bounds of our data. This is generally
done by going 1.5*IQR (inter-quartile range) above the third quartile and below the first quartile.
All data points that fall outside these upper and lower bounds are considered outliers and are
replaced by the upper or lower bound respectively.
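A small sketch of IQR-based capping on the SalePrice target column:

q1 = df["SalePrice"].quantile(0.25)
q3 = df["SalePrice"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip (cap) values outside the bounds to the bounds themselves
df["SalePrice"] = df["SalePrice"].clip(lower, upper)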
After the data has been cleaned and freed from outliers, feature engineering and exploratory
data analysis have to be done.

Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling. With
few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales.
There are two common ways to get all attributes to have the same scale: min-max
scaling and standardization.
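A sketch of both approaches with scikit-learn; the list of numeric columns is an assumption for illustration:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = ["TotalArea", "Longitude", "Latitude"]   # hypothetical numeric columns

# Min-max scaling: rescales each attribute to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(df[numeric_cols])

# Standardization: zero mean and unit variance, less affected by extreme values
X_standard = StandardScaler().fit_transform(df[numeric_cols])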

5. Data visualization
It is quite useful to have a quick overview of how the different features are distributed versus the
house price. Some datasets have many features, which can be a deterrent to prediction; by finding out
which features give the best results and dropping those that do not, we can try to achieve better
accuracy. This can be done using scatter plots, a correlation matrix, etc.

6. Model Building

The features and target variable are defined.

Features (X): the columns that are fed into the model and used to make predictions.
Prediction (y): the target variable that will be predicted from the features.

Then the data is split into training and testing sets for the selection of the best-fitting machine
learning model. The standard 80-20 split ratio is used: 80% of the data is treated as the training
set and 20% as the testing set. To implement the models, Scikit-Learn has to be imported; it is a
Python library that provides machine learning algorithms and many other modeling features. We are
performing supervised learning, and to find the best model we implement several regression
algorithms that are likely to give a precise estimate of prices. The model that gives the least error
and the closest predictions will be our final model.
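A minimal sketch of the split, assuming X holds the feature columns and y the SalePrice target:

from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)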

To test the results of different models and compare them, we will provide the same input values to
all the models. Let's take the example of the Kasarvadavli area in Thane and check the price of a
900 sq. ft. house with 2 bedrooms and 2 baths, comparing the prices given by the different algorithms.

Regression analysis is a type of predictive modeling technique that analyses the relationship between
the target or dependent variable and the independent variables in a dataset. It involves determining
the best-fitting line through the data points such that the total distance of the data points from the
line is minimal. For the most accurate predictions we try different regression techniques on the given
problem statement to find the best-fitting model. These include linear regression, the Support Vector
Regressor and decision trees.

1.Multiple Linear Regression:

The main aim of a Linear Regression model is to find the best-fit linear line and the optimal
values of the intercept and coefficients such that the error is minimized. Error is defined as the
difference between the actual value and the predicted value, and the goal is to reduce this error or
difference. Linear Regression is of two types based on the number of independent variables: Simple
and Multiple. Simple Linear Regression contains only one independent variable, and the model
has to find the linear relationship between it and the dependent variable, whereas Multiple
Linear Regression contains more than one independent variable for the model to relate to the
dependent variable.

Equation of Simple Linear Regression is, y=b0+b1*x

Where b0 is the intercept, b1 is coefficient or slope, x is the independent variable and y is the
dependent variable.

Equation of Multiple Linear Regression is, y =b0+b1*x1+b2*x2+b3*x3+….bn*xn

Where b0 is the intercept, b1, b2, b3, …, bn are the coefficients or slopes of the independent
variables x1, x2, x3, …, xn, and y is the dependent variable.

Multiple Linear Regression is an extension of Simple Linear Regression, and here we assume
that there is a linear relationship between the dependent variable Y and the independent variables X.

So now we have training data, test data and labels for both; we fit our training data to a
multiple linear regression model. We use sklearn (a Python library), import LinearRegression
from it and then initialize LinearRegression to a variable reg.

After fitting our data to the model, we can check the score of the model on the test data, i.e. the
prediction accuracy; in this case, say the score is 92%.
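A sketch of this step (the 92% figure above is illustrative, not a reported result):

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

# R-squared score on the held-out test set, e.g. roughly 0.92 in the example above
print(reg.score(X_test, y_test))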

2.Support Vector Regression:

Support Vector Regression (SVR) uses the same method as the Support Vector Machine
(SVM), but for regression problems. In SVR, the line that is fitted to the data is
referred to as a hyperplane. The objective of an SVR algorithm is to find a hyperplane in an n-
dimensional space that best fits the data points. The data points on either side of the hyperplane
that are closest to it are called support vectors.

The best-fit line is the hyperplane that contains the maximum number of points within the threshold.
Unlike other regression models that try to minimize the error between the real and predicted values,
SVR tries to fit the best line within a threshold value. The threshold value is the distance between the
hyperplane and the boundary lines. A drawback of SVR is that it is not suitable for large
datasets.

3.Decision Tree Regression:

A Decision Tree is a tree-structured algorithm with three types of nodes: the root node, interior nodes
and leaf nodes. The interior nodes represent the features of the data set and the branches represent
the decision rules.

The Decision Tree Regressor observes the features of the data and trains a model in the form of a
tree to predict future data and produce meaningful output. Its behaviour is controlled by
hyperparameters such as the maximum and minimum depth of the tree, which determine how finely
the data is analysed.
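For completeness, hedged sketches of the other two regressors on the same train-test split:

from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Support Vector Regression (benefits from feature scaling; slow on large datasets)
svr = SVR(kernel="rbf")
svr.fit(X_train, y_train)
print(svr.score(X_test, y_test))

# Decision Tree Regression (max_depth limits how deep the tree can grow)
tree = DecisionTreeRegressor(max_depth=8, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))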
7. Model Evaluation
Cross-validation of different algorithms has proven to be a suitable method to find an acceptable
best-fitting algorithm for the model. After training the three machine learning models on the
dataset, the outcome is as follows: the linear regression model performed best with a score of
approximately 92%, followed by SVM regression with an approximate score of 91%, and lastly the
decision tree with a score of almost 90%. Also, a comparison of predicted versus actual values shows
that linear regression gives nearly accurate predictions. As the final decision among these machine
learning models, the optimal choice is the linear regression model, and hence it is used in the
proposed solution.
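A sketch of such a cross-validated comparison (5-fold, R-squared scoring), reusing the three models defined above:

from sklearn.model_selection import cross_val_score

for name, model in [("Linear Regression", reg), ("SVR", svr), ("Decision Tree", tree)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, scores.mean())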

The model has also shown that location and square-foot area play an important role in deciding
the price of a property. This is helpful information for sellers and buyers to act on. The
GUI provides ease of access to the model, improving accessibility.

8. Model Deployment

To deploy our machine learning model we use Flask, which is a framework for serving a
functional webpage for the created model. There are other options, but Flask is one of the most
effective and quickest ways of creating a UI for the proposed machine learning model, and it is
easy to integrate with the model. Flask provides the tools, libraries and technologies that allow us
to build a web application.

Once the implementation is done, the model predicts the price of the property (house) in
a particular location. We deploy the model using the Flask framework and create a UI where
the user enters the desired values and our model predicts the output. This is made possible
by Flask, a Python package for creating web APIs. To build the web application
and link the model to it, we first need to export our model to pickle and
JSON files and design the webpage using HTML, CSS and JavaScript. With this, the model is ready to
be displayed and to make predictions in the web application.
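A minimal Flask sketch, assuming the trained model has been saved as house_model.pickle and that the form field names and index.html template are hypothetical:

import pickle
from flask import Flask, request, render_template

app = Flask(__name__)

# Load the trained regression model exported earlier
with open("house_model.pickle", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def home():
    return render_template("index.html")        # HTML form for user input

@app.route("/predict", methods=["POST"])
def predict():
    # Field names are illustrative; they must match the inputs in index.html
    sqft = float(request.form["total_sqft"])
    bedrooms = float(request.form["bedrooms"])
    baths = float(request.form["baths"])
    price = model.predict([[sqft, bedrooms, baths]])[0]
    return render_template("index.html", prediction=round(price, 2))

if __name__ == "__main__":
    app.run(debug=True)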

In the future, the GUI can be made more attractive and interactive. It could also be turned into a
real estate sales website where sellers can list the details of houses for sale and buyers can
contact them according to the details given on the website.

To simplify things for the user, a recommendation system could also be added to recommend real estate
properties to the user based on the predicted price. The current dataset only includes a few
locations in Thane city; expanding it to other cities and states of India is a future goal.

To make the system even more informative and user-friendly, Google Maps could also be included.
This would show neighbourhood amenities such as hospitals and schools within a 1 km radius of the
given location. This information could also be included in making predictions, since the presence
of such factors increases the price of a real estate property.
