AP Shah Ads Notes Pt 1
Hypothesis testing
Hypothesis testing helps in data analysis by providing a way to make inferences about a
population based on a sample of data. It allows analysts to make decisions about whether
to accept or reject a given assumption or hypothesis about the population based on the
evidence provided by the sample data. For example, hypothesis testing can be used to
determine whether a sample mean is significantly different from a hypothesized population
mean, or whether a sample proportion is significantly different from a hypothesized population proportion.
The hypothesis is a statement, assumption or claim about the value of the parameter
(mean, variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.
For example, if we make the statement "Dhoni is the best Indian captain ever," this is an
assumption we are making based on the average wins and losses the team had under his
captaincy. We can test this statement using all the match data.
The null hypothesis is the hypothesis to be tested for possible rejection under the
assumption that it is true. The concept of the null is similar to "innocent until proven guilty":
we assume innocence until we have enough evidence to prove that a suspect is guilty.
It is denoted by H0.
The alternative hypothesis complements the Null hypothesis. It is the opposite of the null
hypothesis such that both Alternate and null hypothesis together cover all the possible
values of the population parameter.
It is denoted by H1.
Let’s understand this with an example:
A soap company claims that its product kills, on average, 99% of germs.
Suppose Lifebuoy claims that it kills 99.9% of germs. How can they say so? There has
to be a testing technique to prove this claim, right? Hypothesis testing is used to prove
(or disprove) such a claim or assumption.
To test the claim of this company we will formulate the null and alternate hypothesis.
Note: When we test a hypothesis, we assume the null hypothesis to be true until there is
sufficient evidence in the sample to prove it false. In that case, we reject the
null hypothesis and support the alternate hypothesis. If the sample fails to provide
sufficient evidence for us to reject the null hypothesis, we cannot say that the null
hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.
When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and
if it specifies a range of values then it is called a composite hypothesis.
e.g., a motorcycle company claiming that a certain model gives an average mileage of
100 km per litre is a case of a simple hypothesis.
The average age of students in a class is greater than 20. This statement is a composite
hypothesis.
If the alternate hypothesis gives the alternate in both directions (less than and greater than)
of the value of the parameter specified in the null hypothesis, it is called a Two-tailed test.
If the alternate hypothesis gives the alternate in only one direction (either less than or
greater than) of the value of the parameter specified in the null hypothesis, it is called
a One-tailed test.
For example, if H0 states that the mean equals 100 and H1 states that the mean is not equal to 100,
then according to H1 the mean can be greater than or less than 100. This is an example of a
two-tailed test.
Critical Region
The critical region is that region in the sample space in which if the calculated value lies
then we reject the null hypothesis.
Suppose you are looking to rent an apartment. You have listed all the available apartments
from different real estate websites. You have a budget of Rs. 15,000/month. You cannot
spend more than that. The list of apartments you have made has a price ranging from
7000/month to 30,000/month.
You select a random apartment from the list and assume the following hypotheses:
H0: the apartment's rent is within your budget (at most Rs. 15,000/month); H1: the rent exceeds Rs. 15,000/month.
Now, since your budget is 15000, you have to reject all the apartments above that price.
Here all the Prices greater than 15000 become your critical region. If the random
apartment’s price lies in this region, you have to reject your null hypothesis and if the
random apartment’s price doesn’t lie in this region, you do not reject your null hypothesis.
The critical region lies in one tail or two tails on the probability distribution curve according
to the alternative hypothesis. The critical region is a pre-defined area corresponding to a
cut-off value in the probability distribution curve. Its size (probability) is denoted by α.
Critical values are values separating the values that support or reject the null hypothesis
and are calculated on the basis of alpha.
We will see more examples later on, and it will become clear how we choose α.
So Type I and type II error is one of the most important topics of hypothesis testing. Let’s
simplify it by breaking down this topic into a smaller portion.
A false positive (Type I error) occurs when you reject a true null hypothesis.
A false negative (Type II error) occurs when you fail to reject a false null hypothesis.
The probability of committing a Type I error (false positive) is equal to the significance level,
or size of the critical region, α.
The probability of committing a Type II error (false negative) is β. The quantity 1 − β is
called the 'power of the test'.
A person is on trial for a criminal offense, and the judge needs to provide a verdict on his
case. Now, there are four possible combinations in such a case:
• First Case: The person is innocent, and the judge identifies the person as innocent
• Second Case: The person is innocent, and the judge identifies the person as guilty
• Third Case: The person is guilty, and the judge identifies the person as innocent
• Fourth Case: The person is guilty, and the judge identifies the person as guilty
As you can clearly see, there can be two types of error in the judgment –
A Type I error occurs if the jury convicts the person [rejects H0] although the person is
innocent [H0 is true].
A Type II error occurs when the jury releases the person [does not reject H0] although
the person is guilty [H1 is true].
According to the Presumption of Innocence, the person is considered innocent until proven
guilty. We consider the Null Hypothesis to be true until we find strong evidence against
it. Then we accept the Alternate Hypothesis. That means the judge must find the
evidence which convinces him “beyond a reasonable doubt.” This phenomenon
of "beyond a reasonable doubt" can be understood as the significance level (⍺), i.e.
P(judge decides guilty | person is innocent) should be small. Thus, if ⍺ is smaller, more
evidence will be required to reject the null hypothesis.
The basic concepts of Hypothesis Testing are actually quite analogous to this situation.
It must be noted that z-Test & t-Tests are Parametric Tests, which means that the Null
Hypothesis is about a population parameter, which is less than, greater than, or equal to
some value. The initial steps of a test (forming the hypotheses and computing the test statistic)
are quite self-explanatory, but on what basis do we make the final decision? What does this p-value indicate?
We can understand this p-value as the measurement of the Defense Attorney’s argument.
If the p-value is less than ⍺ , we reject the Null Hypothesis, and if the p-value is greater
than ⍺, we fail to reject the Null Hypothesis.
Level of significance(α)
The significance level, in the simplest of terms, is the threshold probability of incorrectly
rejecting the null hypothesis when it is in fact true. This is also known as the type I error
rate.
It is the probability of a type 1 error. It is also the size of the critical region.
Generally, strong control of α is desired and in tests, it is prefixed at very low levels like
0.05 (5%) or 0.01 (1%).
If H0 is not rejected at a significance level of 5%, it does not mean the null hypothesis is true
with 95% assurance; it only means that the sample did not provide enough evidence to reject it at that level.
The p-value is the smallest level of significance at which a null hypothesis can be
rejected.
p-value
To understand this question, we will pick up the normal distribution:
p-value is the cumulative probability (area under the curve) of the values to the right of the red
point in the figure above.
Or,
p-value corresponding to the red point tells us about the ‘total probability’ of getting any value
to the right hand side of the red point, when the values are picked randomly from the population
distribution.
A large p-value implies that sample scores are more aligned or similar to the population score.
Alpha value is nothing but a threshold p-value, which the group conducting the
test/experiment decides upon before conducting a test of similarity or significance ( Z-test or a
T-test).
Consider the above normal distribution again. The red point in this distribution represents the
alpha value or the threshold p-value. Now, let’s say that the green and orange points represent
different sample results obtained after an experiment.
We can see in the plot that the leftmost green point has a p-value greater than the alpha. As a
result, these values can be obtained with fairly high probability and the sample results are
regarded as lucky.
The point on the rightmost side (orange) has a p-value less than the alpha value (red). As a
result, the sample results are a rare outcome and very unlikely to be lucky. Therefore, they
are significantly different from the population.
The alpha value is decided depending on the test being performed. An alpha value of 0.05 is
considered a good convention if we are not sure of what value to consider.
Let’s look at the relationship between the alpha value and the p-value closely.
Here, the red point represents the alpha value. This is basically the threshold p-value. We can
clearly see that the area under the curve to the right of the threshold is very low.
The orange point represents the p-value using the sample population. In this case, we can
clearly see that the p-value is less than the alpha value (the area to the right of the red point is
larger than the area to the right of the orange point). This can be interpreted as:
The results obtained from the sample is an extremity of the population distribution (an
extremely rare event), and hence there is a good chance it may belong to some other
distribution (as shown below).
Considering our definitions of alpha and the p-value, we consider the sample results obtained
as significantly different. We can clearly see that the p-value is far less than the alpha value.
Again, consider the same population distribution curve with the red point as alpha and the
orange point as the calculated p-value from the sample:
So, p-value > alpha (considering the area under the curve to the right-hand side of the red and
the orange points) can be interpreted as follows:
The sample results are not an extreme event for the population distribution and could quite
plausibly have been obtained by chance (luck).
We can clearly see that the area under the population curve to the right of the orange point is
much larger than the alpha value. This means that the obtained results are more likely to be
part of the same population distribution than being a part of some other distribution.
In the National Academy of Archery, the head coach intends to improve the performance of
the archers ahead of an upcoming competition. What do you think is a good way to improve
the performance of the archers?
He proposed and implemented the idea that breathing exercises and meditation before the
competition could help. The statistics before and after experiments are below:
Interesting. The results favor the assumption that the overall score of the archers improved. But
the coach wants to make sure that these results are because of the improved ability of the
archers and not by luck or chance. So what do you think we should do?
This is a classic example of a similarity test (Z-test in this case) where we want to check
whether the sample is similar to the population or not. In order to solve this, we will follow a
step-by-step approach:
1. Understand the information given and form the alternate and null hypothesis
2. Calculate the Z-score and find the area under the curve
3. Calculate the corresponding p-value
4. Compare the p-value and the alpha value
5. Interpret the final results
• Population Mean = 74
• Population Standard Deviation = 8
• Sample Mean = 78
• Sample Size = 60
We have the population mean and standard deviation with us and the sample size is over
30, which means we will be using the Z-test.
1. The after-experiment results are a matter of luck, i.e. mean before and after experiment
are similar. This will be our “Null Hypothesis”
2. The after-experiment results are indeed very different from the pre-experiment ones.
This will be our “Alternate Hypothesis”
The probability we obtain is the area to the left of the Z-score (red point), where
Z = (x̄ − µ)/(σ/√n). The value 0.999 represents the "total probability" of getting a result
less than the sample score of 78, with respect to the population.
Here, the red point signifies where the sample mean lies with respect to the population
distribution. But we have studied earlier that p value is to the right-hand side of the red point,
so what do we do?
For this, we will use the fact that the total area under the normal Z distribution is
1. Therefore the area to the right of Z-score (or p-value represented by the unshaded region)
can be calculated as:
0.001 (p-value) is the unshaded area to the right of the red point. The value 0.001 represents
the “total probability” of getting a result “greater than the sample score 78”, with respect to
the population.
We were not given any value for alpha, therefore we can consider alpha = 0.05. According to
our understanding, if the likeliness of obtaining the sample (p-value) result is less than the alpha
value, we consider the sample results obtained as significantly different.
We can clearly see that the p-value is far less than the alpha value:
This says that the likeliness of obtaining the mean as 78 is a rare event with respect to the
population distribution. Therefore, it is convenient to say that the increase in the
performance of the archers in the sample population is not the result of luck. The sample
population belongs to some other (better in this case) distribution of itself.
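As a quick check, here is a minimal Python sketch of this one-sample Z-test (the use of scipy.stats and the variable names are my own illustration, not part of the notes):

```python
from math import sqrt
from scipy.stats import norm

pop_mean, pop_sd = 74, 8        # population mean and standard deviation
sample_mean, n = 78, 60         # sample mean and sample size
alpha = 0.05                    # significance level assumed in the notes

# One-sample Z statistic: how many standard errors the sample mean is from the population mean
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))

# Right-tailed p-value: probability of a sample mean this large (or larger) under H0
p_value = norm.sf(z)

print(f"z = {z:.2f}, p-value = {p_value:.5f}")
if p_value < alpha:
    print("Reject H0: the improvement is unlikely to be due to luck")
else:
    print("Fail to reject H0")
```

This gives z ≈ 3.87 and a p-value well below the 0.05 threshold, consistent with the conclusion above.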
Box plot
A box and whisker plot—also called a box plot—displays the five-number summary of
a set of data. The five-number summary is the minimum, first quartile, median, third
quartile, and maximum.
In a box plot, we draw a box from the first quartile to the third quartile. A vertical line
goes through the box at the median. The whiskers go from each quartile to the
minimum or maximum.
Outlier: Data values that fall on the far left or right side of the ordered data are tested as
outliers. Generally, outliers fall more than a specified distance from the first and third
quartiles.
(i.e.) Outliers are values greater than Q3 + 1.5·IQR or less than Q1 − 1.5·IQR.
Applications
It is used to know the spread of the data, whether the data is symmetric or skewed, and whether any outliers are present.
Example:
3, 7, 8, 5, 12, 14, 21, 15, 18, 50
Step1: Sort the values
3, 5, 7, 8, 12, 14, 15, 18, 21, 50
Step 2: Find the median.
Q2=13
Step 3: Find the quartiles.
First quartile, Q1 = data value at position (N + 2)/4=12/4=3rd position
Third quartile, Q3 = data value at position (3N + 2)/4=8th position
Q1=7
Q3=18
Step 4: Complete the five-number summary by finding the min and the max: Min = 3, Max = 50.
Here IQR = Q3 − Q1 = 18 − 7 = 11.
Any point beyond Q3 + 1.5·IQR (18 + 16.5 = 34.5) or below Q1 − 1.5·IQR (7 − 16.5 = −9.5) is
considered an outlier. Here, 50 exceeds 34.5, so 50 is an outlier.
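Here is a minimal Python sketch of these steps, using the positional quartile rule from the notes (library functions such as numpy.percentile may use a slightly different interpolation and give slightly different quartiles):

```python
data = sorted([3, 7, 8, 5, 12, 14, 21, 15, 18, 50])
n = len(data)

# Quartile positions as used in the notes (1-based): Q1 at (N + 2)/4, Q3 at (3N + 2)/4
q1 = data[(n + 2) // 4 - 1]
q3 = data[(3 * n + 2) // 4 - 1]
median = (data[n // 2 - 1] + data[n // 2]) / 2 if n % 2 == 0 else data[n // 2]

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Five-number summary:", data[0], q1, median, q3, data[-1])
print("IQR:", iqr, "Fences:", lower_fence, upper_fence, "Outliers:", outliers)
```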
Scatter plot
Scatter plots are the graphs that present the relationship between two variables in a
data-set. It represents data points on a two-dimensional plane or on a Cartesian
system. The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis. These plots are often called scatter
graphs or scatter diagrams.
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter diagram
graphs numerical data pairs, with one variable on each axis, to show their relationship.
Scatter plots are used when we have paired numerical data and want to see whether, and how, one variable is related to the other.
The line drawn in a scatter plot, which is near to almost all the points in the plot is
known as “line of best fit” or “trend line“.
Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
1. Positive Correlation
2. Negative Correlation
3. No Correlation
Positive Correlation
A scatter plot with increasing values of both variables can be said to have a positive
correlation. Now positive correlation can further be classified into three categories:
Negative Correlation
A scatter plot with an increasing value of one variable and a decreasing value for
another variable can be said to have a negative correlation. These are also of three
types:
No Correlation
A scatter plot with no clear increasing or decreasing trend in the values of the variables
is said to have no correlation
Example: Draw a scatter plot for the following data.
No. of games: 3, 5, 2, 6, 7, 1, 2, 7, 1, 7
Scores: 80, 90, 75, 80, 90, 50, 65, 85, 40, 100
Solution:
X-axis or horizontal axis: Number of games
Y-axis or vertical axis: Scores
Now, the scatter graph will be:
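Since the plotted figure is not reproduced here, a minimal matplotlib sketch of this scatter graph would be:

```python
import matplotlib.pyplot as plt

games  = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

plt.scatter(games, scores)            # each point is one (games, score) pair
plt.xlabel("Number of games")         # independent variable on the X-axis
plt.ylabel("Scores")                  # dependent variable on the Y-axis
plt.title("Scores vs. number of games played")
plt.show()
```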
Note: We can also combine several scatter plots per sheet to read and understand
higher-level structure in data sets containing more than two variables.
Z-Test
If we have a sample size of less than 30 and do not know the population variance, we must use a t-test. This is
how we judge when to use the z-test vs the t-test. Further, it is assumed that the z-statistic follows a standard
normal distribution. In contrast, the t-statistics follows the t-distribution with a degree of freedom equal to n-1,
where n is the sample size.
It must be noted that the samples used for a z-test or t-test must be independent samples, and must also have a
distribution identical to the population distribution. This makes sure that the sample is not “biased” to/against
the Null Hypothesis which we want to validate/invalidate.
One-Sample Z-Test
We perform the One-Sample z-Test when we want to compare a sample mean with the population mean.
In this example:
• Mean Score for Girls is 641
• The number of data points in the sample is 20
• The population mean is 600
• Standard Deviation for Population is 100
Since the P-value is less than 0.05, we can reject the null hypothesis and conclude based on our result that
Girls on average scored higher than 600.
Two-Sample Z-Test
We perform a Two Sample z-test when we want to compare the mean of two samples.
In this example:
• Mean Score for Girls (Sample Mean) is 641
• Mean Score for Boys (Sample Mean) is 613.3
• Standard Deviation for the Population of Girls’ is 100
• Standard deviation for the Population of Boys’ is 90
• Sample Size is 20 for both Girls and Boys
• Difference between Mean of Population is 10
Thus, we can conclude based on the p-value that we fail to reject the null hypothesis. We don't have enough
evidence to conclude that girls on average score 10 marks more than the boys. Pretty simple, right?
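A minimal Python sketch of this two-sample Z-test (scipy.stats is my own choice here; the notes do not prescribe a library):

```python
from math import sqrt
from scipy.stats import norm

mean_girls, mean_boys = 641, 613.3
sd_girls, sd_boys = 100, 90          # population standard deviations
n_girls = n_boys = 20
hypothesized_diff = 10               # H0: girls score at most 10 marks more than boys

# Standard error of the difference between the two sample means
se = sqrt(sd_girls**2 / n_girls + sd_boys**2 / n_boys)
z = ((mean_girls - mean_boys) - hypothesized_diff) / se

p_value = norm.sf(z)                 # right-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.2f}")
# p-value > 0.05, so we fail to reject H0: not enough evidence that the gap exceeds 10 marks
```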
T-Test
A t-test is a type of inferential statistic used to study whether there is a statistically significant difference between two groups.
Mathematically, it sets up the problem by assuming that the means of the two distributions are equal (H₀:
µ₁ = µ₂). If the t-test rejects the null hypothesis (H₀: µ₁ = µ₂), it indicates that the groups are very likely different.
The statistical test can be one-tailed or two-tailed. The one-tailed test is appropriate when there is a difference
between groups in a specific direction. It is less common than the two-tailed test. When choosing a t test, you
will need to consider two things: whether the groups being compared come from a single population or two
different populations, and whether you want to test the difference in a specific direction.
• One Sample t-test: Compares the mean of a single group against a known/hypothesized population mean.
• Two Sample: Paired Sample T Test: Compares means from the same group at different times.
• Two Sample: Independent Sample T Test: Compares means for two different groups.
For a one-sample t-test, the test statistic is

t = (x̄ − µ) / (s / √n)

where
x̄ = sample mean
µ = population mean
s = sample standard deviation
n = sample size

For a paired-sample t-test on the differences d between the two measurements, the statistic is

t = d̄ / (s_d / √n)

where d̄ is the mean of the differences, and the standard deviation of the differences is

s_d = √[ (Σd² − n·d̄²) / (n − 1) ]
The degrees of freedom are n − 1 for the one-sample and paired tests, and n1 + n2 − 2 for the independent two-sample test.
• If the calculated t value is greater than the critical t value (obtained from a critical value table called the t-distribution table), then reject the null hypothesis.
• P-value <significance level (𝜶) => Reject your null hypothesis in favor of your alternative
hypothesis. Your result is statistically significant.
• P-value >= significance level (𝜶) => Fail to reject your null hypothesis. Your result is not statistically
significant.
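A minimal sketch of the three t-test variants using scipy.stats (the sample arrays below are made-up illustrations, not data from the notes):

```python
from scipy import stats

before  = [72, 75, 70, 78, 74, 69, 77, 73]   # hypothetical paired measurements
after   = [75, 78, 72, 80, 77, 72, 79, 76]
group_a = [68, 72, 75, 70, 74, 71]           # hypothetical independent groups
group_b = [78, 82, 80, 77, 81, 79]

# One-sample t-test: mean of one group vs. a hypothesized population mean
t1, p1 = stats.ttest_1samp(before, popmean=70)

# Paired-sample t-test: the same group measured at two different times
t2, p2 = stats.ttest_rel(before, after)

# Independent two-sample t-test: two different groups
t3, p3 = stats.ttest_ind(group_a, group_b)

for name, t, p in [("one-sample", t1, p1), ("paired", t2, p2), ("independent", t3, p3)]:
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: t = {t:.2f}, p = {p:.3f} -> {decision}")
```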
APPLIED DATA SCIENCE
Time Series Forecasting
To explain the taxonomy of time series forecasting methods
and time series decomposition.
• Time series is a series of observations listed in the order of time.
• The data points in a time series are usually recorded at constant successive time intervals.
• Time series analysis is the process of extracting meaningful non-trivial information and patterns from
time series.
• Time series forecasting is the process of predicting the future value of time series data based on past
observations and other inputs.
• The objective in time series forecasting is slightly different: to use historical information about a
particular quantity to make forecasts about the value of the same quantity in the future.
UTILITY OF TIME SERIES FORECASTING
In all other predictive models , the time component of the data was either ignored or was not
available. Such data are known as cross-sectional data.
Taxonomy of Time Series Forecasting
The investigation of time series can also be broadly divided into descriptive modeling, called time series analysis,
and predictive modeling, called time series forecasting.
Time Series forecasting can be further classified into four broad categories of techniques:
• Forecasting based on time series decomposition,
• smoothing based techniques,
• regression based techniques, and
• machine learning-based techniques.
Time series decomposition
• Time series decomposition is the process of deconstructing a time series into the number of constituent
components with each representing an underlying phenomenon.
• Decomposition splits the time series into a trend component, a seasonal component, and a noise
component.
• The trend and seasonality components are predictable (and are called systematic components),
whereas, the noise, by definition, is random (and is called the non-systematic component).
• Before forecasting the time series, it is important to understand and describe the components that
make the time series.
Consider, as an example, a monthly time series of drug sales. Firstly, overall drug sales are trending upward, and the upward trend accelerates in the 2000s.
Secondly, there is clear seasonality in the time series of drug sales. In particular, it is a yearly seasonality. There is a
spike in drug sales at the start of the year and a dip in every February. This seasonal variation is consistent every
year.
However, even when accounting for the trend and the seasonal variations there is one more phenomenon that
could not be explained. For example, the pattern in 2007 is odd when compared with prior years or 2008.
Components/Elements of a Time Series
Trend: Trend is the long-term tendency of the data. It represents change from one period to next.
Seasonality: Seasonality is the repetitive behavior during a cycle of time. These are repeated patterns appearing over and
over again in the time series. Seasonality can be further split into hourly, daily, weekly, monthly, quarterly, and yearly
seasonality.
Cycle: The cyclic component represents longer-than-a-year patterns where there is no specific time frame between the cycles.
An example here is the economic cycle of booms and crashes. While the booms and crashes exhibit a repeated pattern, the
length of a boom period, the length of a recession, the time between subsequent booms and crashes (and even two
consecutive crashes— double dip) is uncertain and random, unlike the seasonality components.
Noise: In a time series, anything that is not represented by level, trend, seasonality, or cyclic component is the noise in the
time series. The noise component is unpredictable but follows normal distribution in ideal cases.
All the time series datasets will have noise.
These individual components can be better forecasted using regression or similar techniques and
combined together as an aggregated forecasted time series.
In smoothing methods, the future value of the time series is the weighted average of past
observations.
Regression based forecasting techniques are similar to conventional supervised predictive models,
which have independent and dependent variables, but with a twist: the independent variable is
now time. The simplest of such methods is of course a linear regression model of the form:

y_t = a + b·t

Given a training set, the values of the coefficients a and b can be estimated to forecast
future y values.
Regression based techniques can get pretty complicated with the type of function used to model
the relationship between future value and time. Commonly used functions are exponential,
polynomial, and power law functions. Most people are familiar with the trend line function
in spreadsheet programs, which offer several different function choices. The regression based
time series forecast differs from a regular function-fitting predictive model in the choice of the
independent variable.
A more sophisticated technique is based on the concept of autocorrelation.
Autocorrelation refers to the fact that data from adjacent time periods are correlated in a time
series. The most well-known among these techniques is ARIMA, which stands for Auto Regressive
Integrated Moving Average.
Any supervised classification or regression predictive models can be used to forecast the time
series too, if the time series data are transformed to a particular format with a target label and
input variables. This class of techniques are based on supervised machine learning models where
the input variables are derived from the time series using a windowing technique.
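A minimal pandas sketch of such a windowing transformation (the series values and the window size of 3 are made-up illustrations): each row uses the previous 3 observations as input features and the next observation as the target label.

```python
import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])  # made-up values
window = 3

# Build a supervised-learning table: lagged values as inputs, current value as target
frame = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(window, 0, -1)})
frame["target"] = series
frame = frame.dropna()

print(frame)
# Any regression model can now be trained on the lag_* columns to predict 'target'.
```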
• Time series decomposition can be classified into additive decomposition and multiplicative decomposition,
based on the nature of the different components and how they are composed.
• In an additive decomposition, the components are decomposed in such a way that when they are added together,
the original time series can be obtained.
• In the case of multiplicative decomposition, the components are decomposed in such a way that when they are
multiplied together, the original time series can be derived back.
• Both additive and multiplicative time series decomposition can be represented by these equations:

Additive: y_t = T_t + S_t + E_t
Multiplicative: y_t = T_t × S_t × E_t

where T_t, S_t, and E_t are the trend, seasonal, and error components respectively.
• The classical decomposition technique is simple, intuitive, and serves as a baseline of all other advanced
decomposition methods.
• Suppose the time series has yearly seasonality with monthly data. Here m represents the seasonal period, which is 12 for monthly data with yearly seasonality.
• The classical decomposition technique first estimates the trend component by calculating a long-term (say, 12-month) moving average.
• The trend component is removed from the time series to get the remaining seasonal and noise components.
• The seasonal component can be estimated by averaging the January, February, March, ... values of the detrended series.
• Once the trend and seasonal components are removed, what is left is noise.
The algorithm for classic additive decomposition is:
1. Estimate the trend Tt: If m is even, calculate a 2×m moving average (an m-MA followed by a 2-MA); if m is odd, calculate an m-moving average. A moving average is the average of the last m data points.
2. Calculate detrended series: Calculate yt -Tt for each data point in the series.
3. Estimate the seasonal component St: average (yt − Tt) over each seasonal period. For example, calculate the average of all January
values of (yt − Tt) and repeat for all the months. Normalize the seasonal values so that their mean is zero.
4. Calculate the noise component Et: Et=(yt - Tt - St) for each data point in the series.
Multiplicative decomposition is similar to additive decomposition: Replace subtraction with division in the algorithm
described.
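A minimal statsmodels sketch of this decomposition (the file and column names are placeholders for a monthly series such as the drug sales example, not an actual dataset reference):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Monthly series with a DatetimeIndex; the file and column names are placeholders
sales = pd.read_csv("monthly_sales.csv", index_col="month", parse_dates=True)["sales"]

# Classical additive decomposition with yearly seasonality on monthly data (m = 12)
result = seasonal_decompose(sales, model="additive", period=12)
trend, seasonal, noise = result.trend, result.seasonal, result.resid

result.plot()   # panels for the observed series, trend, seasonality, and residual (noise)

# For a multiplicative decomposition, pass model="multiplicative" instead
```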
The equation below shows a typical autoregressive model. As the name suggests,
the new values of this model depend purely on a weighted linear combination of its
past values. Given that there are p past values, this is denoted as AR(p), or an
autoregressive model of order p. Epsilon (ε) indicates the white noise.

y(t) = c + φ1·y(t−1) + φ2·y(t−2) + … + φp·y(t−p) + ε(t)
Next, the moving average model is defined as follows:

y(t) = c + ε(t) + θ1·ε(t−1) + θ2·ε(t−2) + … + θq·ε(t−q)

Here, the future value y(t) is computed based on the errors ε made by the previous model.
Each successive term looks one step further into the past to incorporate the mistakes made
by that model into the current computation. Based on how far back we are willing to look,
the value of q is set. Thus, the above model can be denoted as a moving average of order q,
or simply MA(q).
Stationarity
A stationary time series is one whose properties do not depend on the time at which the
series is observed. That is why time series with trends, or with seasonality, are not stationary:
the trend and seasonality affect the value of the time series at different times.
A stationary time series is one whose statistical properties, such as mean, variance, and
autocorrelation, are constant over time.
In other words, for a stationary series it does not matter when you observe it; it should look
much the same at any point in time. In general, a stationary time series will have no
predictable patterns in the long term.
Why does ARIMA need Stationary Time-Series Data?
Time series data must be made stationary to remove any obvious correlation and collinearity
with the past data.
In stationary time-series data, the properties or value of a sample observation does not depend
on the timestamp at which it is observed. For example, given a hypothetical dataset of the
year-wise population of an area, if one observes that the population increases two-fold each year or
increases by a fixed amount, then this data is non-stationary.
Any given observation is highly dependent on the year since the population value would rely on
how far it is from an arbitrary past year. This dependency can induce incorrect bias while
training a model with time-series data.
To remove this correlation, ARIMA uses differencing to make the data stationary.
Differencing, at its simplest, involves taking the difference of two adjacent data points.
For example, the left graph above shows Google's stock price for 200 days. While the
graph on the right is the differenced version of the first graph – meaning that it shows the
change in Google stock of 200 days. There is a pattern observable in the first graph, and
these trends are a sign of non-stationary time series data. However, no trend or
seasonality, or increasing variance is observed in the second figure. Thus, we can say that
the differenced version is stationary.
This change can simply be modeled by first-order differencing:

y′(t) = y(t) − y(t−1)

The parameter d in ARIMA is the number of times such differencing is applied to make the series stationary.
Combining all of the three types of models above gives the
resulting ARIMA(p,d,q) model.
In general, it is a good practice to follow the next steps when doing time-series forecasting:
•Step 1 — Check Stationarity: If a time series has a trend or seasonality component, it must be made
stationary.
•Step 2 — Determine the d value: If the time series is not stationary, it needs to be stationarized
through differencing.
•Step 3 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term,
MA term, (or) ARMA.
•Step 4 — Build the model
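Putting these steps together, a minimal statsmodels sketch might look like the following (the synthetic series and the order (1, 1, 1) are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# 'series' stands in for the quantity being forecast; here a synthetic trend-plus-noise series
rng = np.random.default_rng(1)
series = pd.Series(np.arange(100) * 0.5 + rng.normal(scale=2, size=100))

# Steps 1-3: inspect the ACF/PACF of the differenced series to choose the AR and MA terms
plot_acf(series.diff().dropna())
plot_pacf(series.diff().dropna())
plt.show()

# Step 4: build the model; the order (p, d, q) = (1, 1, 1) is only an illustrative choice
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.summary())
print(fitted.forecast(steps=12))   # forecast the next 12 periods
```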
For a stationary time series,
the ACF will drop to zero
relatively quickly, while the
ACF of non-stationary data
decreases slowly.
The right order of differencing is the minimum differencing required to get a near-stationary
series which roams around a defined mean and whose ACF plot reaches zero fairly quickly.
If the autocorrelations are positive for many lags (10 or more), then the series needs further
differencing.
On the other hand, if the lag-1 autocorrelation itself is too negative, then the series is probably
over-differenced.
Check if the series is stationary using the Augmented Dickey-Fuller test (adfuller()) from the statsmodels package.
Why?
The null hypothesis of the ADF test is that the time series is
non-stationary. So, if the p-value of the test is less than the
significance level (0.05) then you reject the null hypothesis and
infer that the time series is indeed stationary.
So, in our case, if P Value > 0.05 we go ahead with finding the
order of differencing.
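A minimal sketch of this check with statsmodels (the random-walk series below is a synthetic illustration, which is non-stationary by construction):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic example: a random walk, which is non-stationary by construction
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(size=200).cumsum())

adf_stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(series)
print("ADF statistic:", adf_stat, "p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: the series is stationary")
else:
    print("Fail to reject H0: non-stationary; go ahead and find the order of differencing")
```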
To demonstrate smoothing methods: moving average and exponential smoothing.
The various time series forecasting methods are:
• Simple Average
• Moving Average
• Weighted Moving Average
• Naïve Method
• Exponential smoothing
• Time Series Analysis using Linear Regression(Least Squares
Method)
• ARIMA
Simple Average
3-Moving Average
4-Moving Average
5-Moving Average
Weighted Moving Average
When using a moving average method described before, each of the observations used to compute the
forecasted value is weighted equally. In certain cases, it might be beneficial to put more weight on the
observations that are closer to the time period being forecast. When this is done, this is known as a weighted
moving average technique. The weights in a weighted MA must sum to 1.
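A minimal pandas sketch of a simple and a weighted 3-period moving-average forecast (the demand values and the weights are made-up illustrations; note that the weights sum to 1):

```python
import pandas as pd

demand = pd.Series([20, 21, 29, 31, 28, 30, 34])   # hypothetical weekly demand

# Simple 3-period moving average, shifted so that week t is forecast from weeks t-3..t-1
sma = demand.rolling(window=3).mean().shift(1)

# Weighted 3-period moving average: more weight on the most recent observations
weights = [0.2, 0.3, 0.5]                          # oldest -> newest, sums to 1
wma = demand.rolling(window=3).apply(lambda w: (w * weights).sum()).shift(1)

print(pd.DataFrame({"demand": demand, "3-MA": sma, "weighted 3-MA": wma}))
```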
Naïve Method
In the naïve method, the forecast for the next period is simply equal to the last observed value.
Exponential Smoothing
In simple exponential smoothing, the forecast is a weighted average of past observations in which
the weights decrease exponentially, so the most recent observations carry the most weight:
F(t+1) = α·y(t) + (1 − α)·F(t), where α (between 0 and 1) is the smoothing constant.
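A minimal Python sketch of simple exponential smoothing under this definition (the demand values and α = 0.3 are made-up illustrations):

```python
demand = [20, 21, 29, 31, 28, 30, 34]   # hypothetical weekly demand
alpha = 0.3                              # smoothing constant

forecast = [demand[0]]                   # common convention: first forecast = first observation
for t in range(1, len(demand) + 1):
    # F(t+1) = alpha * y(t) + (1 - alpha) * F(t)
    forecast.append(alpha * demand[t - 1] + (1 - alpha) * forecast[t - 1])

for week, f in enumerate(forecast[1:], start=2):
    print(f"Forecast for week {week}: {f:.2f}")
```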
Data Science Life Cycle for Weather Forecasting
1. Problem statement: Weather prediction is a major issue for the weather department, as it
is associated with human life and the economy. For instance, excess rainfall is a major
cause of natural disasters such as droughts and floods, which people across the world encounter every
year. A time series machine learning model can be used for forecasting weather
conditions in any state. Time series forecasting is currently becoming very popular; one reason for this is
declining hardware costs and growing processing capability.
An effort has been made to develop a SARIMA (Seasonal Autoregressive Integrated Moving
Average) model for temperature prediction using historical data from Pune, Maharashtra. The
historical dataset from the year 2009 to 2020 has been taken for observation. When a
repeating cycle is present in a time series, instead of decomposing it manually to fit an ARIMA
model, another very popular method is to use the seasonal autoregressive integrated moving
average (SARIMA) model. The seasonal ARIMA model is built by running Python in an
Anaconda Jupyter Notebook, using the matplotlib package for data visualization.
Description of dataset:
Temperature data recorded from 2009 to 2020 were obtained for Pune city, from the meteorology
department at one-hour intervals. The data collected has different parameters, such as date time,
temperature, humidity, moonrise, wind speed, wind direction, pressure.
2. Data Pre-processing
Time series data is normally messy. Forecasting models, from simple rolling averages to LSTMs,
require the data to be clean. So the techniques used before moving to forecasting are:
Detrending/Stationarity: Before forecasting, we want our time series variables to be mean-variance
stationary. This means that the statistical properties of the series do not vary depending
on when the sample was taken. Models built on stationary data are generally more robust. This
can be achieved by using differencing.
Anomaly detection: Any outlier present in the data might skew the forecasting results so it’s
often considered a good practice to identify and normalize outliers before moving on to
forecasting.
Check for sampling frequency: This is an important step to check the regularity of sampling.
Irregular data has to be imputed or made uniform before applying any modeling techniques
because irregular sampling leads to broken integrity of the time series and doesn’t fit well with
the models.
Missing data: At times there can be missing data for some datetime values and it needs to be
addressed before modeling.
3. Data visualization
The time series data can be broken down into trend, seasonality and noise components using
multiplicative decomposition and individual plots can be created to visually inspect them.
4. Model Building
The hourly temperature data during 2009–2018 is used as the training set, while that during 2019–
2020 is used as the testing set. To evaluate the forecast accuracy, as well as to compare the results
obtained from different models, the mean squared error (MSE) is calculated.
Akaike's Information Criterion (AIC) is the most commonly used model selection criterion. AIC
basically deals with the goodness of fit of a model and is calculated as:

AIC = -2 ln(maximum likelihood) + 2p

where p represents the number of independent constraints estimated. Therefore,
when comparing models, the one with the least AIC value is chosen.
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) are
normally used as performance evaluation metrics. The predicted temperature values are
compared with actual values for accuracy based on these error metrics.
Let us say we obtained an MAE of 0.60850 and an RMSE of 0.76233 for the SARIMA model, and an MAE of
6.052 and an RMSE of 7.496 for the ARIMA model. If so, we conclude that the SARIMA model's forecasts
yielded the least error in the prediction of temperature. Thus, the derived model could be used to
forecast weather for the upcoming years.
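A minimal sketch of this model-building and evaluation step with statsmodels and scikit-learn metrics (the file name, column name, seasonal order, and split dates are illustrative assumptions, not the notes' actual code):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hourly temperature series with a DatetimeIndex; file and column names are placeholders
data = pd.read_csv("pune_weather.csv", index_col="date_time", parse_dates=True)
series = data["temperature"]

train, test = series[:"2018"], series["2019":]

# The seasonal order (P, D, Q, m) below is an illustrative choice; in practice it is tuned via AIC
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
fitted = model.fit(disp=False)
print("AIC:", fitted.aic)                     # lower AIC indicates a better candidate model

pred = fitted.forecast(steps=len(test))
print("MAE:", mean_absolute_error(test, pred))
print("RMSE:", mean_squared_error(test, pred) ** 0.5)
```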
Owing to the linear nature of both the algorithms, they are quite handy and used in the industry
when it comes to experimentation and understanding the data, creating baseline forecasting scores.
If tuned right with lagged values (p,d,q) they can perform significantly better. The simple and
explainable nature of both the algorithms makes them one of the top picks by analysts and Data
Scientists. There are, however, some pros and cons when working with ARIMA and SARIMA at
scale.
Data Science Life Cycle for Customer Segmentation
1. Problem statement:
Customer segmentation is simply grouping customers with similar characteristics. These
characteristics include geography, demography, behavioural, purchasing power, situational factors,
personality, lifestyle, psychographic, etc. The goals of customer segmentation are customer
acquisition, customer retention, increasing customer profitability, customer satisfaction, resource
allocation by designing marketing measures or programs and improving target marketing measures.
Assume a supermarket mall owner has, through membership cards, some basic data about
customers such as Customer ID, age, gender, annual income, and spending score, and wants to
understand who the target customers are, so that this insight can be given to the
marketing team and strategies can be planned accordingly.
2. Description of dataset:
3. Descriptive Statistics:
The describe() function in pandas can be used to get a descriptive statistics summary of a given
dataframe. This includes the mean, count, standard deviation, percentiles, and min-max values of all the
features.
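For example (the file name below is a placeholder for such a mall-customer dataset, not something specified in these notes):

```python
import pandas as pd

df = pd.read_csv("Mall_Customers.csv")   # placeholder file name
df.info()                                # column types and non-null counts
print(df.describe())                     # count, mean, std, min, quartiles, max per feature
```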
4. Data Pre-processing
We are only interested in the Annual Income (k$) and Spending Score (1–100). So these columns
can be extracted from the dataset.
Feature normalization helps to adjust all the data elements to a common scale in order to improve
the performance of the clustering algorithm. For example, in this data set Annual Income has
values in the thousands while Spending Score has just two digits. Since the data in these variables
are on different scales, it is tough to compare them. Each data point is converted to the
range of 0 to +1. Normalization techniques include min-max scaling, decimal scaling, and z-score standardization.
5. Data visualization
We are interested in identifying the relationship between the Annual Income (k$) and Spending
Score (1–100) we would use the scatterplot. Bar plots, pair plots etc can also be used for visualizing.
6. Model Building
Clustering algorithms include the K-means algorithm, hierarchical clustering, and DBSCAN. In this
project, the K-means clustering algorithm has been applied to customer segmentation. K-means is
a clustering algorithm based on the principle of partitioning.
The steps of K-means clustering are :
1. Determine the number of clusters (k).
2. Select initial centroids.
3. Map each data point into the nearest cluster (most similar to centroid).
4. Update the mean value (centroid) of each cluster.
5. Repeat steps 3–4 until the centroids no longer change.
The result of the analysis can be visualized using a 2D or 3D plot and it shows that the retail store
customers can be grouped into 5 clusters or segments for targeted marketing (shown using 5 different
colours).
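A minimal scikit-learn sketch of this segmentation (the file name and column names are placeholders, and k = 5 mirrors the result described above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")                        # placeholder file name
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]      # placeholder column names

X_scaled = MinMaxScaler().fit_transform(X)                    # normalize both features to [0, 1]

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)     # 5 segments, as in the result above
labels = kmeans.fit_predict(X_scaled)

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)             # colour points by cluster
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("Customer segments")
plt.show()
```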
Cluster 1 (green): These are average income earners with average spending scores. They are
cautious with their spending at the store.
Cluster 2 (yellow): The customers in this group are high income earners and with high spending
scores. They bring in profit. Discounts and other offers targeted at this group will increase their
spending score and maximize profit.
Cluster 3 (red): This group of customers have a higher income but they do not spend more at the
store. One assumption could be that they are not satisfied with the services rendered at the
store. They are another ideal group to be targeted by the marketing team because they have the
potential to bring in increased profit for the store.
Cluster 4 (purple): Low income earners with low spending scores. I can assume that this is so
because people with low income tend to purchase fewer items at the store.
Cluster 5 (blue): These are low-income customers with high spending scores. I can assume
that this group of customers spends more at the retail store despite earning less because they
enjoy and are satisfied with the services rendered at the retail store.
With the help of clustering, we can understand the variables much better, prompting us to take
careful decisions. With the identification of customers, companies can release products and
services that target customers based on several parameters like income, age, spending patterns, etc.
Furthermore, more complex patterns like product reviews can be taken into consideration for better
segmentation.
Data Science Life Cycle for House Price Prediction
1. Problem statement:
We can analyse existing real estate prices to predict the price of real estate. This can be very useful
in understanding the valuation of a property or a new development as many people get confused
about the prices while purchasing a property and often end up paying too much for a flat or a
house. The problem statement is to predict the sale price of a house, given the features of the
house. The features are the columns in the dataset, and the target variable is the SalePrice column.
The problem is a regression problem, as the target variable is continuous.
2. Description of dataset: We shall use some sample real estate data to predict real estate
prices. For example, the dataset may have attributes like a category marking under construction or not,
RERA approved or not, Number of rooms, Type of property i.e.., 1 RK/1 BHK/2 BHK, Total area
of the house in square feet, Category marking ready to move or not, Category marking resale or
not, Address of the property, Longitude of the property, Latitude of the property etc.
3. Descriptive Statistics:
It is important for us to get a good understanding of the characteristics of our data. The .info()
method can be used to get a quick snapshot of the current state of the dataset.
Descriptive statistics can be used to know more specifically what are the min, max, mean,
standard deviation, upper and lower bounds and count of all the variables within our dataset. This
information is important firstly for determining the general characteristics of our data, secondly,
for determining the outliers we have in our data, and it will also give us a better idea as to how we
need to prepare our data for training later on.
4. Data Pre-processing
There could be some mistakes in the collected entries in the dataset, like null values, human
errors, or some impractical values which we call outliers. To overcome these inaccuracies,
we need to pre-process and clean the data of such clutter. There is a high need for data
pre-processing because only if the data that we provide to our model is accurate and faultless
will the model be able to give precise estimations that are very close to the actual
value.
Handling outliers:
Outlier values can be removed from the dataset or replaced with less extreme values. There is
also one more way to deal with outliers, called capping. This is often the preferred method of dealing
with outliers. Here we first identify the upper and lower bounds of our data. This is generally
done by calculating 1.5×IQR (inter-quartile range) above the third quartile and below the first
quartile. All data points that fall outside these upper and lower bounds are considered outliers
and are replaced by the upper or lower bound.
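A minimal pandas sketch of IQR-based capping for one column (the file and column names are placeholders):

```python
import pandas as pd

df = pd.read_csv("housing.csv")              # placeholder file name
q1 = df["price"].quantile(0.25)              # first quartile of the placeholder 'price' column
q3 = df["price"].quantile(0.75)              # third quartile
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Capping: values outside the fences are replaced by the nearest bound
df["price"] = df["price"].clip(lower=lower_bound, upper=upper_bound)
```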
After the data has been cleaned and free from the outliers, Feature Engineering and Exploratory
Data Analysis have to be done.
Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling. With
few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales.
There are two common ways to get all attributes to have the same scale: min-max
scaling and standardization.
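A minimal scikit-learn sketch of both options (the file and feature names are placeholders; in practice you would pick one scaler):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("housing.csv")                        # placeholder file name
features = df[["total_area_sqft", "num_rooms"]]        # placeholder feature columns

# Min-max scaling: squeezes each attribute into the range [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(features)

# Standardization: zero mean and unit variance for each attribute
standardized = StandardScaler().fit_transform(features)
```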
5. Data visualization
It is quite useful to have a quick overview of the distribution of different features vs. house price.
Some datasets have many features that could be a deterrent to prediction; by finding out which
features give the best results and dropping those that don't, we can try to get better accuracy.
This can be done using scatter plots, a correlation matrix, etc.
6. Model Building
Then the data is split into training and testing sets for the selection of the best-fitting machine
learning model. The standard 80-20 split ratio is used, a typical ratio for this purpose: 80% of the
data is used as the training set and 20% as the testing set. To implement the models,
Scikit-Learn has to be imported. It is a Python library which provides machine learning
algorithms and many more features for modelling. We are performing
supervised learning, and to find the best model we will implement some regression
algorithms which are likely to give a precise estimation of prices. The model which gives the least error
and the closest predictions will be our final model.
To test the results of different models and compare them, we will provide the same input values to
all the models. Let's take the example of the Kasarvadavli area in Thane and check the price for a 900 sq. ft.
house having 2 bedrooms and 2 baths, and compare the price given by different algorithms.
Regression analysis is a type of predictive modeling technique that analyses the relation between
the target or dependent variable and independent variables in a dataset. It involves determining
the best fitting line that passes through all the data points in such a way that distance of the line
from each data point is minimal. For most accurate Predictions we are trying different Regression
techniques on given problem statement to find out best fitting model. This includes linear
regression, Support Vector Regressor and Decision Trees.
The main aim of Linear Regression model is to find the best fit linear line and the optimal
values of intercept and coefficients such that the error is minimized. Error is defined as the
difference between the actual value and Predicted value. The goal is to reduce this error or
difference. Linear Regression is of two types based on number of independent variables: Simple
and Multiple. Simple Linear Regression contains only one independent variable and the model
has to find the linear relationship between this and the dependent variable, whereas Multiple
Linear Regression contains more than one independent variable for the model to find the
relationship with the dependent variable.
Simple Linear Regression: y = b0 + b1·x
where b0 is the intercept, b1 is the coefficient or slope, x is the independent variable, and y is the
dependent variable.
Multiple Linear Regression: y = b0 + b1·x1 + b2·x2 + ... + bn·xn
where b0 is the intercept, b1, b2, ..., bn are the coefficients or slopes of the independent
variables x1, x2, ..., xn, and y is the dependent variable.
Multiple Linear Regression is an extension of Simple Linear Regression, and here we assume
that there is a linear relationship between the dependent variable y and the independent variables.
So now we have train data, test data, and labels for both; we fit our training data into a
multiple linear regression model. We use sklearn (a built-in Python library), import
LinearRegression from it, and then initialize a LinearRegression object in a variable reg.
After fitting our data to the model, we can check the score of our predictions; in this case,
say the score is 92%.
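A minimal sketch of these steps (the file, feature, and target column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("housing.csv")                          # placeholder file name
X = df[["total_area_sqft", "num_rooms", "num_baths"]]    # placeholder feature columns
y = df["SalePrice"]

# Standard 80-20 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)             # learn the intercept b0 and coefficients b1..bn

print("R^2 score on test data:", reg.score(X_test, y_test))   # e.g. roughly 0.92 in this narrative
```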
Support Vector Regression (SVR) uses the same method as Support Vector Machine
(SVM) but for regression problems. In SVR, the straight line that is required to fit the data is
referred to as a hyperplane. The objective of an SVR algorithm is to find a hyperplane in an
n-dimensional space that best fits the data points. The data points on either side of the hyperplane
that are closest to the hyperplane are called support vectors.
The best fit line is the hyperplane that has the maximum number of points. Unlike other
Regression models that try to minimize the error between the real and predicted value, the SVR
tries to fit the best line within a threshold value. The threshold value is the distance between the
hyperplane and boundary line. The problem with SVR is that they are not suitable for large
datasets.
Decision Tree is a tree-structured algorithm with three types of nodes; Root Node, Interior node
and Leaf node. The Interior Nodes represent the features of a data set and the branches represent
the decision rules.
The Decision Tree Regressor observes the features of the data and trains a model in the form of a
tree to predict future data and produce meaningful output. Its behaviour is controlled by settings
such as the maximum and minimum depth of the tree, according to which it analyses the data.
7. Model Evaluation
Cross-validation of the different algorithms has proven to be a suitable method to find an acceptable
best-fitting algorithm for the model. After training the three different machine learning models
on the dataset, the outcome is as follows: the linear regression model performed
the best with a score of approximately 92%, followed by the SVM regression with an
approximate score of 91%, and lastly the decision tree scored almost 90% when trained on the
dataset. Also, according to the confusion matrix, linear regression is giving nearly accurate predictions.
As the final decision among these machine learning models, the optimal choice is the
linear regression model, and hence that is the reason for using it in the proposed
solution.
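A minimal sketch of comparing the three models with cross-validation (X and y are assumed to be the feature matrix and target prepared in the earlier linear regression sketch):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# X, y: feature matrix and target from the earlier sketch
models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

# 5-fold cross-validation; the default scoring for regressors is the R^2 score
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```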
The model has also shown that location and square-foot area play an important role in deciding
the price of a property. This is helpful information for sellers and buyers to act on accordingly. The
GUI has provided ease of access to the model, hence improving the quality of accessibility.
8. Model Deployment
To deploy our machine learning model we need Flask, which is a framework for deploying a
functional webpage for our created model. There are other options for this, but Flask is one of the
most effective and quickest ways of creating a UI for the proposed machine learning model. It is also easy
to integrate Flask with the model. Flask allows us to create a UI for our model and provides us
with tools, libraries, and technologies that allow us to build a web application.
Once the implementation is done, the model predicts the price of the property (house) in
that particular location. We will deploy the model using the Flask framework and create a UI where
the user will enter the desired values and our model will predict the output. This is made possible
by using Flask, a Python package for creating an API. For building the web application
and linking the model with it, we first need to export our model into pickle and
JSON files and design the webpage using HTML, CSS, and JavaScript. With this, the model is ready to
be displayed and make predictions on the web application.
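A minimal Flask sketch of such a prediction endpoint (the file name, route, and input fields are illustrative assumptions, not the notes' actual implementation):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained regression model that was exported with pickle
with open("model.pkl", "rb") as f:          # placeholder file name
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()               # e.g. {"total_area_sqft": 900, "num_rooms": 2, "num_baths": 2}
    features = [[data["total_area_sqft"], data["num_rooms"], data["num_baths"]]]
    price = model.predict(features)[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run(debug=True)
```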
In the future, the GUI can be made more attractive and interactive. It can also be turned into any
real estate sale website where sellers can give the details and house for sale and buyers can contact
according to the details given on the website.
To simplify things for the user, there can also be a recommender system to recommend real estate
properties to the user based on the predicted price. The current dataset only includes a few
locations of Thane city, expanding it to other cities and states of India is the future goal.
To make the system even more informative and user-friendly, Google maps can also be included.
This will show neighbourhood amenities such as hospitals and schools within a 1 km radius of the
given location. This can also be factored into the predictions, since the presence of such amenities
increases the price of a real estate property.