Inferential Statistics

This document discusses inferential statistics, emphasizing its role in making predictions about populations based on sample data, and outlines various sampling techniques. It covers key concepts such as hypothesis testing, including Z-Test, T-Test, correlation tests, and Chi-Square tests, explaining their applications in data analysis. The article highlights the importance of inferential statistics for data analysts and mentions that it does not cover all existing techniques and tests.

Uploaded by

aimlhod
Copyright
© All Rights Reserved

Statistics for Data Analysts: Inferential Statistics with Python

Introduction

In data analysis, statistics is important for
understanding data, discovering trends and
analyzing data efficiently, which coincides with the
purpose of data analysis. Statistics is divided into two
broad areas based on purpose: Descriptive Statistics
and Inferential Statistics.

This article is the second in the series Statistics
for Data Analysis, and it covers only Inferential
Statistics using Python. The previous article in the
series covers Descriptive Statistics with Python.

Inferential Statistics

Inferential statistics involves drawing conclusions
and/or making predictions about a population,
usually from a sample. Unlike descriptive
statistics, where known sample or population data is
described, inferential statistics uses sample data to
draw conclusions about the wider population.

Sampling and Sampling Techniques

Gathering information about the total population
can be very difficult and, in some cases, impossible.
Due to this limitation, a smaller fraction of the
population, known as the sample, is analyzed and
inferences are made concerning the population
using the sample data collected. It should be noted
that the sample collected from a population has to
be representative of the population for the
deductions to be correct. Usually, this depends on
factors such as the sample size and the sampling
technique used.
Sampling Techniques
Generally, there are two sampling categories:
Random/Probabilistic sampling and Non-Probability
sampling. For the former, elements are selected at
random, which guards against selection bias. For
non-probability sampling, elements are selected by
deliberate choice. For example, you might want to
select the best students to represent a school in a
competition instead of selecting students at random.
Under these two broad categories lie several
sampling techniques.
Simple Random Sampling is the simplest and most
common technique. Here, every element in the
population has an equal chance of being selected.
Another popular probabilistic sampling procedure is
the stratified sampling technique. In this case, the
population is divided into groups of related elements
called strata. Samples are then collected at random
from each stratum. For example, data might be
collected from each of several age groups in the
population instead of completely at random.
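
The stratified procedure described above can be sketched in Python as
follows. This is a minimal illustration, not part of the original article:
the population, the age groups and the stratified_sample helper are all
hypothetical, and the same fraction is drawn from every stratum
(proportional allocation), which is only one of several possible choices.

```python
import random
from collections import defaultdict

# Hypothetical population of (person_id, age_group) pairs, invented for illustration.
population = (
    [(i, "18-29") for i in range(0, 50)]
    + [(i, "30-49") for i in range(50, 80)]
    + [(i, "50+") for i in range(80, 100)]
)


def stratified_sample(items, key, fraction, seed=None):
    """Draw the same fraction of elements at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)  # group elements by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


sample = stratified_sample(population, key=lambda p: p[1], fraction=0.1, seed=42)

# Count how many sampled elements fall in each age group.
counts = defaultdict(int)
for _, group in sample:
    counts[group] += 1
print(dict(counts))  # {'18-29': 5, '30-49': 3, '50+': 2}
```

Each age group contributes to the sample in proportion to its size, which
simple random sampling does not guarantee for small samples.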

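The random.sample() function from the standard
library is typically used to select a simple random
sample from a population in Python. The population
and the number of elements to select are passed as
arguments, and the elements are drawn without
replacement.
See the Python documentation for the random module
for details.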
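
As a small sketch of simple random sampling with random.sample(), the
example below draws 10 elements from a population of 100. The population of
student IDs is hypothetical, and the seed is fixed only so that the example
is repeatable.

```python
import random

# Hypothetical population of 100 student IDs, invented for illustration.
population = list(range(1, 101))

random.seed(0)  # fixed seed only to make the example repeatable

# Select 10 distinct elements; every element has the same chance of selection.
sample = random.sample(population, k=10)
print(sample)
```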

Hypothesis Testing

Hypothesis testing is a statistical inference
technique used to support or reject statements
made about a population using the sample data
provided. We can think of hypothesis testing as an
experiment: a hypothesis is stated before the
experiment starts, and after experimentation we
check whether the results agree with the statement
or not.

Hypothesis testing is one of the most significant
aspects of inferential statistics. Several tests are
applied in hypothesis testing, and the specific test
to use depends on the data and the purpose of the
test. You will need to be familiar with a number of
them in your journey as a data analyst. This article
covers the following tests:

• Z-Test & T-Test
• Correlation Test
• Chi-Square Tests

Z-Test & T-Test


The Z-Test is a hypothesis test typically used to
determine if the means of two populations are
significantly different, or if the mean of a population
is greater than, less than or equal to a specific
value. This test is used when the variance(s) of the
population(s) is/are known. The test assumes that
the data follows a normal distribution. When the
sample size is large, the sample mean is
approximately normally distributed (by the central
limit theorem), so the test can still be applied.


Using a case study of the performance of students in
two classes, the Z-Test can be used to ascertain if
there is a significant difference in mean scores. In
this scenario, the null hypothesis is that the mean
scores from the two classes are equal. The
hypothesis test enables us to decide whether the
data gives enough evidence to reject this claim.
Usually, for hypothesis tests, a 5% level of
significance is applied and the null hypothesis is
rejected if the p-value produced is less than the
level of significance.
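
A minimal sketch of the two-class scenario follows. It is not taken from
the original article: the scores and the "known" population standard
deviations are invented for illustration, and two_sample_z_test is a
hypothetical helper that computes the two-sided test by hand with the
standard library rather than with a statistics package.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical scores, invented for illustration.
class_a = [72, 85, 78, 90, 66, 81, 77, 88, 74, 79]
class_b = [68, 75, 80, 70, 65, 73, 78, 71, 69, 76]

# Assumption: the population standard deviations are known in advance.
sigma_a = 8.0
sigma_b = 5.0


def two_sample_z_test(x, y, sigma_x, sigma_y):
    """Two-sided z-test of H0: the two population means are equal."""
    standard_error = sqrt(sigma_x**2 / len(x) + sigma_y**2 / len(y))
    z = (mean(x) - mean(y)) / standard_error
    # Probability of a value at least this extreme under the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_sample_z_test(class_a, class_b, sigma_a, sigma_b)
alpha = 0.05  # 5% level of significance

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean scores differ.")
else:
    print("Fail to reject the null hypothesis.")
```

With this invented data the p-value falls below 0.05, so the null
hypothesis of equal means would be rejected.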

The T-Test has a similar purpose to the Z-Test.
However, it is applied when the population standard
deviation is not known and has to be estimated from
the sample, which matters most for small samples
(n < 30).

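Let us paint another scenario of a coach who trains
junior athletes to run a 100-meter race. The coach
believes that the average time of her athletes is 10
seconds. To check this, she selects 10 athletes and
records their times. A one-sample T-Test can then
be used to test whether the mean time differs
significantly from 10 seconds.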
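
The coach's scenario can be sketched with SciPy's ttest_1samp function.
The ten race times below are invented for illustration and are not data
from the original article; the null hypothesis is that the mean time is 10
seconds.

```python
from scipy import stats

# Hypothetical 100-meter times (in seconds) for 10 athletes, invented for illustration.
times = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 9.7, 10.6, 10.0]

# H0: the mean time is 10 seconds. H1: the mean time is not 10 seconds.
result = stats.ttest_1samp(times, popmean=10)
t_statistic, p_value = result.statistic, result.pvalue

alpha = 0.05  # 5% level of significance
print(f"t = {t_statistic:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean time is not 10 seconds.")
else:
    print("Fail to reject the null hypothesis.")
```

With this invented sample the p-value is above 0.05, so the data does not
contradict the coach's belief.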

Correlation Test

Correlation describes the degree of relationship
between two (or more) variables. For example, there
might be a positive relationship between hours of
practice and overall performance: “The more you
practice, the better your results in an examination
will be”.
A correlation test checks whether the relationship
between these variables is statistically significant.
The Pearson Correlation Coefficient is a popular
correlation coefficient that measures the strength
of the linear relationship between two variables.

For instance, the relationship between test scores
and exam scores can be tested using Pearson
correlation. The pearsonr function in SciPy returns
the correlation coefficient together with a p-value
for testing whether the correlation is significant.
The null hypothesis for the correlation test is that
there is no correlation between the variables.
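
A short sketch of the test score and exam score example using
scipy.stats.pearsonr is shown below. The scores are invented for
illustration and are not data from the original article.

```python
from scipy.stats import pearsonr

# Hypothetical scores for 10 students, invented for illustration.
test_scores = [55, 62, 70, 48, 85, 90, 66, 73, 58, 80]
exam_scores = [58, 65, 72, 50, 88, 85, 70, 75, 60, 83]

# H0: there is no linear correlation between test scores and exam scores.
r, p_value = pearsonr(test_scores, exam_scores)

alpha = 0.05  # 5% level of significance
print(f"r = {r:.3f}, p-value = {p_value:.6f}")
if p_value < alpha:
    print("Reject the null hypothesis: the correlation is significant.")
else:
    print("Fail to reject the null hypothesis.")
```

A coefficient close to 1 indicates a strong positive linear relationship,
close to -1 a strong negative one, and close to 0 little or no linear
relationship.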

Chi-Square Tests
There are 3 types of Chi-Square Tests:

• Chi-Square Test of Independence
• Chi-Square Goodness of Fit Test
• Chi-Square Test of Homogeneity

The most popular of these tests are the Chi-Square
Test of Independence and the Goodness of Fit Test.

The Chi-Square Goodness of Fit test is used, mostly,
to ascertain if the sample data is a true
representation of the population, by comparing the
observed frequencies with the frequencies expected
under a hypothesized distribution. On the other
hand, the Chi-Square test of independence is used to
determine if there is a significant association
between two categorical variables. It is different
from the Correlation test because, unlike the
correlation test that focuses on quantitative
variables, this chi-square test deals with categorical
variables.

SciPy's stats.chisquare function is used to
compute the goodness of fit test, while
the stats.chi2_contingency function is used to
compute the chi-square test of independence.
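
Both functions can be sketched as follows. The die-roll counts and the
contingency table are invented for illustration and are not data from the
original article. Note that, by default, chi2_contingency applies Yates'
continuity correction to 2x2 tables.

```python
from scipy.stats import chi2_contingency, chisquare

# Goodness of fit: hypothetical counts from 120 rolls of a die.
# H0: the die is fair, so each face is expected 20 times.
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6
gof = chisquare(f_obs=observed, f_exp=expected)
print(f"Goodness of fit: chi2 = {gof.statistic:.2f}, p-value = {gof.pvalue:.4f}")

# Independence: hypothetical 2x2 table of counts.
# Rows are two groups, columns are two categories (for example, pass/fail).
# H0: the row variable and the column variable are independent.
table = [[30, 10], [20, 40]]
chi2, p_value, dof, expected_counts = chi2_contingency(table)
print(f"Independence: chi2 = {chi2:.2f}, p-value = {p_value:.6f}, dof = {dof}")
print(expected_counts)  # counts expected if the variables were independent
```

With this invented data the goodness of fit test does not reject the claim
of a fair die, while the test of independence rejects independence at the
5% level.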

Inferential Statistics is an extremely valuable tool
for every aspiring data analyst. From applying
sampling techniques in your data collection process
to applying hypothesis tests to draw conclusions
from your data, it is too valuable to dismiss. It is
worth mentioning that this article does not exhaust
all the sampling techniques and hypothesis tests
that exist. However, it covers some important and
widely used ones that you will come across.
