Fundamentals of Data Science
UNIT -I
Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.
UNIT-II
Frequency distributions – outliers – relative frequency distributions – cumulative frequency distributions – frequency distributions for nominal data – interpreting distributions – graphs – averages – mode – median – mean
UNIT-III
Normal distributions – z scores – normal curve problems – finding proportions – finding scores – more about z scores – correlation – scatter plots – correlation coefficient for quantitative data – computational formula for correlation coefficient – averages for qualitative and ranked data.
UNIT-IV
Basics of Numpy arrays, aggregations, computations on arrays, comparisons, structured arrays, Data
manipulation, data indexing and selection, operating on data, missing data, hierarchical indexing,
combining datasets –aggregation and grouping, pivot tables
UNIT-V
Visualization with matplotlib, line plots, scatter plots, visualizing errors, density and contour plots,
histograms, binnings, and density, three dimensional plotting, geographic data
Text Books:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science", Manning Publications, 2016.
2. Robert S. Witte and John S. Witte, "Statistics", Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, "Python Data Science Handbook", O'Reilly, 2016.
References :
1. Allen B. Downey, "Think Stats: Exploratory Data Analysis in Python", Green Tea Press, 2014.
Web Resources
● https://www.w3schools.com/datascience/
● https://www.geeksforgeeks.org/data-science-tutorial/
● https://www.coursera.org/
Mapping with Programme Outcomes:
                                  PSO1  PSO2  PSO3  PSO4  PSO5  PSO6
CO1                                3     3     3     3     3     2
CO2                                3     3     3     2     2     3
CO3                                2     2     2     3     3     3
CO4                                3     3     3     3     3     2
CO5                                3     3     3     3     3     1
Weightage of course
contributed to each PSO           14    14    14    14    14    11
UNIT -I
Need for data science – benefits and uses – facets of data – data science process – setting the research goal – retrieving data – cleansing, integrating and transforming data – exploratory data analysis – build the models – presenting and building applications.
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems
to extract knowledge and insights from structured and unstructured data. In simpler terms, data science
is about obtaining, processing, and analyzing data to gain insights for many purposes.
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.
Despite every data science project being unique—depending on the problem, the industry it's applied in,
and the data involved—most projects follow a similar lifecycle.
This lifecycle provides a structured approach for handling complex data, drawing accurate conclusions,
and making data-driven decisions.
The data science lifecycle
Here are the five main phases that structure the data science lifecycle:
Data collection and storage
This initial phase involves collecting data from various sources, such as databases, Excel files, text files,
APIs, web scraping, or even real-time data streams. The type and volume of data collected largely
depend on the problem you’re addressing.
Once collected, this data is stored in an appropriate format ready for further processing. Storing the data
securely and efficiently is important to allow quick retrieval and processing.
Data preparation
Often considered the most time-consuming phase, data preparation involves cleaning and transforming
raw data into a suitable format for analysis. This phase includes handling missing or inconsistent data,
removing duplicates, normalization, and data type conversions. The objective is to create a clean,
high-quality dataset that can yield accurate and reliable analytical results.
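As an illustration, a minimal pandas sketch of these preparation steps might look like the following; the file name and the column names used here are hypothetical, not taken from any particular project:

import pandas as pd

# Hypothetical raw dataset with a numeric "price" and a text "region" column.
df = pd.read_csv("sales.csv")

df = df.drop_duplicates()                                     # remove duplicate rows
df["price"] = pd.to_numeric(df["price"], errors="coerce")     # enforce a numeric data type
df["price"] = df["price"].fillna(df["price"].median())        # handle missing values
df["region"] = df["region"].str.strip().str.lower()           # normalize inconsistent text labels

print(df.info())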
Data exploration and visualization
Visualization tools, such as charts and graphs, make the data more understandable, enabling stakeholders
to comprehend the data trends and patterns better.
Experimentation and prediction
Data scientists use machine learning algorithms and statistical models to identify patterns, make
predictions, or discover insights in this phase. The goal here is to derive something significant from the
data that aligns with the project's objectives, whether predicting future outcomes, classifying data, or
uncovering hidden patterns.
Communication of results
The final phase involves interpreting and communicating the results derived from the data analysis. It's
not enough to have insights; you must communicate them effectively, using clear, concise language and
compelling visuals. The goal is to convey these findings to non-technical stakeholders in a way that
influences decision-making or drives strategic initiatives.
Understanding and implementing this lifecycle allows for a more systematic and successful approach to
data science projects. Let's now delve into why data science is so important.
Data science has emerged as a revolutionary field that is crucial in generating insights from data and
transforming businesses. It's not an overstatement to say that data science is the backbone of modern
industries. But why has it gained so much significance?
● Data volume. Firstly, the rise of digital technologies has led to an explosion of data. Every
online transaction, social media interaction, and digital process generates data. However, this
data is valuable only if we can extract meaningful insights from it. And that's precisely where
data science comes in.
● Value-creation. Secondly, data science is not just about analyzing data; it's about interpreting
and using this data to make informed business decisions, predict future trends, understand
customer behavior, and drive operational efficiency. This ability to drive decision-making based
on data is what makes data science so essential to organizations.
● Career options. Lastly, the field of data science offers lucrative career opportunities. With the
increasing demand for professionals who can work with data, jobs in data science are among the
highest paying in the industry. As per Glassdoor, the average salary for a data scientist in the
United States is $137,984, making it a rewarding career choice.
● Descriptive analytics. Analyzes past data to understand the current state and identify trends. For
instance, a retail store might use it to analyze last quarter's sales or identify best-selling products.
● Diagnostic analytics. Explores data to understand why certain events occurred, identifying
patterns and anomalies. If a company's sales fall, it would identify whether poor product quality,
increased competition, or other factors caused it.
● Predictive analytics. Uses statistical models to forecast future outcomes based on past data, used
widely in finance, healthcare, and marketing. A credit card company may employ it to predict
customer default risks.
● Prescriptive analytics. Suggests actions based on results from other types of analytics to mitigate
future problems or leverage promising trends. For example, a navigation app advising the fastest
route based on current traffic conditions.
The increasing sophistication from descriptive to diagnostic to predictive to prescriptive analytics can
provide companies with valuable insights to guide decision-making and strategic planning.
Data science can add value to any business which uses its data effectively. From statistics to predictions,
effective data-driven practices can put a company on the fast track to success. Here are some ways in
which data science is used:
Data Science can significantly improve a company's operations in various departments, from logistics
and supply chain to human resources and beyond. It can help in resource allocation, performance
evaluation, and process automation. For example, a logistics company can use data science to optimize
routes, reduce delivery times, save fuel costs, and improve customer satisfaction.
Data Science can uncover hidden patterns and insights that might not be evident at first glance. These
insights can provide companies with a competitive edge and help them understand their business better.
For instance, a company can use customer data to identify trends and preferences, enabling them to tailor
their products or services accordingly.
Companies can use data science to innovate and create new products or services based on customer
needs and preferences. It also allows businesses to predict market trends and stay ahead of the
competition. For example, streaming services like Netflix use data science to understand viewer
preferences and create personalized recommendations, enhancing user experience.
The implications of data science span across all industries, fundamentally changing how organizations
operate and make decisions. While every industry stands to gain from implementing data science, it's
especially influential in data-rich sectors.
Let's delve deeper into how data science is revolutionizing these key industries:
The finance sector has been quick to harness the power of data science. From fraud detection and
algorithmic trading to portfolio management and risk assessment, data science has made complex
financial operations more efficient and precise. For instance, credit card companies utilize data science
techniques to detect and prevent fraudulent transactions, saving billions of dollars annually.
Healthcare is another industry where data science has a profound impact. Applications range from
predicting disease outbreaks and improving patient care quality to enhancing hospital management and
drug discovery. Predictive models help doctors diagnose diseases early, and treatment plans can be
customized according to the patient's specific needs, leading to improved patient outcomes.
Marketing is a field that has been significantly transformed by the advent of data science. The
applications in this industry are diverse, ranging from customer segmentation and targeted advertising to
sales forecasting and sentiment analysis. Data science allows marketers to understand consumer
behavior in unprecedented detail, enabling them to create more effective campaigns. Predictive analytics
can also help businesses identify potential market trends, giving them a competitive edge.
Personalization algorithms can tailor product recommendations to individual customers, thereby
increasing sales and customer satisfaction.
While data science overlaps with many fields that also work with data, it carries a unique blend of
principles, tools, and techniques designed to extract insightful patterns from data.
Distinguishing between data science and these related fields can give a better understanding of the
landscape and help in setting the right career path. Let's demystify these differences.
Data science and data analytics both serve crucial roles in extracting value from data, but their focuses
differ. Data science is an overarching field that uses methods, including machine learning and predictive
analytics, to draw insights from data. In contrast, data analytics concentrates on processing and
performing statistical analysis on existing datasets to answer specific questions.
While business analytics also deals with data analysis, it is more centered on leveraging data for
strategic business decisions. It is generally less technical and more business-focused than data science.
Data science, though it can inform business strategies, often dives deeper into the technical aspects, like
programming and machine learning.
Data engineering focuses on building and maintaining the infrastructure for data collection, storage, and
processing, ensuring data is clean and accessible. Data science, on the other hand, analyzes this data,
using statistical and machine learning models to extract valuable insights that influence business
decisions. In essence, data engineers create the data 'roads', while data scientists 'drive' on them to derive
meaningful insights. Both roles are vital in a data-driven organization.
Machine learning is a subset of data science, concentrating on creating and implementing algorithms
that let machines learn from and make decisions based on data. Data science, however, is broader and
incorporates many techniques, including machine learning, to extract meaningful information from data.
Statistics, a mathematical discipline dealing with data collection, analysis, interpretation, and
organization, is a key component of data science. However, data science integrates statistics with other
methods to extract insights from data, making it a more multidisciplinary field.
                     Industry Focus                                      Technical Emphasis
Data Science         Driving value with data across the 4 levels of     Programming, ML, Statistics
                     analytics
Business Analytics   Leverage data for strategic business decisions     Business strategies, data analysis
Data Engineering     Build and maintain data infrastructure             Data collection, storage, processing
Having understood these distinctions, we can now delve into the key concepts every data scientist needs
to master.
Key Data Science Concepts
A successful data scientist doesn't just need technical skills but also an understanding of core concepts
that form the foundation of the field. Here are some key concepts to grasp:
Statistics and probability
These are the bedrock of data science. Statistics is used to derive meaningful insights from data, while
probability allows us to make predictions about future events based on available data. Understanding
distributions, statistical tests, and probability theories is essential for any data scientist.
Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is a crucial step in the data science
pipeline. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to
improve their quality and reliability. This process ensures that the data used for analysis and modeling is
accurate, complete, and suitable for its intended purpose.
In this section, we'll explore the importance of data cleaning, common issues that data scientists
encounter, and various techniques and best practices for effective data cleaning.
The Importance of Data Cleaning
Data cleaning plays a vital role in the data science process for several reasons:
Data Quality: Clean data leads to more accurate analyses and reliable insights. Poor data quality can
result in flawed conclusions and misguided decisions.
Model Performance: Machine learning models trained on clean data tend to perform better and
generalize more effectively to new, unseen data.
Efficiency: Clean data reduces the time and resources spent on troubleshooting and fixing issues during
later stages of analysis or model development.
Consistency: Data cleaning helps ensure consistency across different data sources and formats, making
it easier to integrate and analyze data from multiple origins.
Compliance: In many industries, clean and accurate data is essential for regulatory compliance and
reporting purposes.
Exploratory data analysis (EDA) is one of the basic and essential steps of a data science project. A data
scientist typically spends close to 70% of project time on EDA of the dataset. In this section, we will
discuss what Exploratory Data Analysis (EDA) is and the steps to perform it.
In today's fast-paced, technology-driven world, data science has become a leading support for decision
making, increased automation, and the provision of insight across industries. In essence, the nuts and
bolts of data science involve handling very large data sets, searching for patterns in the data, predicting
specific outcomes based on the patterns found, and finally acting or making informed decisions based on
those data sets. This is operationalized through data science modeling, which involves designing the
algorithms and statistical models that process and analyze data. The process can be challenging for
learners who are just beginning in the field, but broken into clear steps, even a beginner can follow this
journey and create models effectively.
What is Data Science Modelling
Data science modelling is a sequence of steps that runs from defining the problem to deploying the model
in the real world. The aim of this section is to demystify that process and provide a simple, stepwise guide
that anyone with a basic grasp of data science ideas can follow. Each step is explained in plain language
so that even a beginner can apply these practices in their own projects.
Data Science Modelling Steps
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
These ten steps guide a beginner through the modelling process in data science and are meant to be an
easily readable guide for anyone who wants to build models that can analyze data and give insights. Each
step is crucial and builds upon the previous one, ensuring a comprehensive understanding of the entire
process. Designed for students, professionals who would like to switch their career paths, and curious
minds in pursuit of knowledge, this guide gives a solid foundation for delving deeper into the world of
data science models.
1. Define Your Objective
First, define very clearly what problem you are going to solve. Whether that is a customer churn
prediction, better product recommendations, or patterns in data, you first need to know your direction.
This should bring clarity to the choice of data, algorithms, and evaluation metrics.
2. Collect Data
Gather data relevant to your objective. This can include internal data from your company, publicly
available datasets, or data purchased from external sources. Ensure you have enough data to train your
model effectively.
3. Clean Your Data
Data cleaning is a critical step to prepare your dataset for modeling. It involves handling missing values,
removing duplicates, and correcting errors. Clean data ensures the reliability of your model's predictions.
4. Explore Your Data
Data exploration, or exploratory data analysis (EDA), involves summarizing the main characteristics of
your dataset. Use visualizations and statistics to uncover patterns, anomalies, and relationships between
variables.
5. Split Your Data
Divide your dataset into training and testing sets. The training set is used to train your model, while the
testing set evaluates its performance. A common split ratio is 80% for training and 20% for testing.
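A small sketch of such a split using scikit-learn, with a synthetic dataset standing in for real project data, could look like this:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features, and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# 80% of the rows go to training, 20% are held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (80, 3) (20, 3)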
6. Choose a Model
Select a model that suits your problem type (e.g., regression, classification) and data. Beginners can start
with simpler models like linear regression or decision trees before moving on to more complex models
like neural networks.
7. Train Your Model
Feed your training data into the model. This process involves the model learning from the data, adjusting
its parameters to minimize errors. Training a model can take time, especially with large datasets or
complex models.
8. Evaluate Your Model
After training, assess your model's performance using the testing set. Common evaluation metrics
include accuracy, precision, recall, and F1 score. Evaluation helps you understand how well your model
will perform on unseen data.
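Steps 6 to 8 can be sketched together with scikit-learn; the dataset here is synthetic and the decision tree is just one reasonable starter choice rather than the only option:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic classification data stands in for a real project dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0)   # step 6: choose a simple model
model.fit(X_train, y_train)                                   # step 7: train on the training set

y_pred = model.predict(X_test)                                # step 8: evaluate on unseen data
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))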
9. Improve Your Model
Based on the evaluation, you may need to refine your model. This can involve tuning hyperparameters,
choosing a different model, or going back to data cleaning and preparation for further improvements.
10. Deploy Your Model
Once satisfied with your model's performance, deploy it for real-world use. This could mean integrating
it into an application or using it for decision-making within your organization.
Frequency Distribution is a tool in statistics that helps us organize the data and also helps us reach
meaningful conclusions. It tells us how often any specific values occur in the dataset. A frequency
distribution in a tabular form organizes data by showing the frequencies (the number of times values
occur) within a dataset.
A frequency distribution represents the pattern of how frequently each value of a variable appears in a
dataset. It shows the number of occurrences for each possible value within the dataset.
Let’s learn about Frequency Distribution including its definition, graphs, solved examples, and frequency
distribution table in detail.
What is Outlier?
Outliers, in the context of data analysis, are data points that deviate significantly from the other
observations in a dataset. These anomalies can show up as unusually high or low values, disrupting the
distribution of the data. For instance, in a dataset of monthly sales figures, if the sales for one month are
much higher than the sales for all of the other months, that high sales figure would be considered an
outlier.
Why Removing Outliers is Necessary?
● Impact on Analysis: Outliers can have a disproportionate influence on statistical measures
like the mean, skewing the overall results and leading to misguided conclusions.
Removing outliers can help ensure the analysis is based on a more representative
sample of the data.
● Statistical Significance: Outliers can affect the validity and reliability of statistical
inferences drawn from the data. Removing outliers, when appropriate, can help maintain
the statistical significance of the analysis.
Identifying and correctly handling outliers is critical in data analysis to ensure the integrity and accuracy
of the results.
Types of Outliers
Outliers manifest in different forms, each presenting unique challenges:
● Univariate Outliers: These occur when a value in a single variable substantially deviates
from the rest of the dataset. For example, if you are studying the heights of adults in a
certain region and most fall in the range of 5 feet 5 inches to 6 feet, a person who
measures 7 feet tall would be considered a univariate outlier.
● Multivariate Outliers: In contrast to univariate outliers, multivariate outliers involve
observations that are outliers in multiple variables simultaneously, highlighting
complex relationships in the data. Continuing with our example, consider evaluating
height and weight together and discovering an individual who is exceptionally tall and
exceptionally heavy compared to the rest of the population. This individual would be
considered a multivariate outlier, as their height and weight deviate from the norm
at the same time.
● Point Outliers: These are individual points that lie far away from the rest of the data.
For instance, in a dataset of typical household energy usage, a value that is
exceptionally high or low compared to the rest is a point outlier.
● Contextual Outliers: Sometimes known as conditional outliers, these are data points that
deviate from the norm only in a specific context or condition. For instance, a very low
temperature might be normal in winter but unusual in summer.
● Collective Outliers: These outliers consist of a group of data points that may not be extreme
by themselves but are unusual as a whole. This type of outlier often indicates a change
in data behavior or an emergent phenomenon.
Main Causes of Outliers
Outliers can arise from various sources, making their detection vital:
● Data Entry Errors: Simple human errors in entering data can create extreme values.
● Measurement Error: Faulty instruments or problems with the experimental setup can cause
abnormally high or low readings.
● Experimental Errors: Flaws in experimental design might produce data points that do not
represent what they are supposed to measure.
● Intentional Outliers: In some cases, data might be manipulated deliberately to produce outlier
effects, often seen in fraud cases.
● Data Processing Errors: During the collection and processing stages, technical glitches can
introduce erroneous data.
● Natural Variation: Inherent variability in the underlying data can also lead to outliers.
How Outliers can be Identified?
Identifying outliers is a vital step in data analysis, helping to uncover anomalies, errors, or
valuable insights within datasets. One common approach for identifying outliers is through
visualizations, where data is graphically represented to highlight any points that deviate appreciably
from the overall pattern. Techniques like box plots and scatter plots offer intuitive visual cues for
recognizing outliers based on their position relative to the rest of the data.
Another approach involves statistical methods, such as the Z-score, the DBSCAN algorithm,
or the Isolation Forest algorithm, which quantitatively assess how far data points deviate from the
mean or detect outliers based on their density in the data space.
By combining visual inspection with statistical analysis, analysts can efficiently identify outliers and
gain deeper insight into the underlying characteristics of the data.
1. Outlier Identification Using Visualizations
Visualization offers insight into data distributions and anomalies. Visual tools such as scatter
plots and box plots can effectively highlight data points that deviate notably from the
majority. In a scatter plot, outliers often appear as data points lying far from the main cluster
or displaying unusual patterns compared to the rest. Box plots offer a clear depiction of the data's
central tendency and spread, with outliers represented as individual points beyond the whiskers.
1.1 Identifying outliers with box plots
Box plots are valuable tools in data analysis for visually summarizing the distribution of a dataset.
They are useful in outlier identification because they offer a concise representation of key
statistical measures such as the median, quartiles, and range. A box plot consists of a rectangular "box"
that spans the interquartile range (IQR), with a line indicating the median. "Whiskers" extend from the
box to the minimum and maximum values within a specified range, often set at 1.5 times the IQR. Any data
points beyond those whiskers are considered potential outliers. These outliers, represented as points, can
provide essential insights into the dataset's variability and possible anomalies. Thus, box plots serve as a
visual aid in outlier detection, allowing analysts to pick out data points that deviate notably
from the overall pattern and warrant further investigation.
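The whisker rule described above can be applied directly in code. The sketch below flags values outside 1.5 × IQR using NumPy; the small data list is made up for illustration:

import numpy as np

data = np.array([48, 52, 55, 50, 49, 51, 53, 47, 95])   # 95 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the box-plot whisker limits

print(data[(data < lower) | (data > upper)])    # -> [95]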
1.2 Identifying outliers with Scatter Plots
Scatter plots are essential tools for identifying outliers within datasets, particularly when exploring
relationships between two continuous variables. These visualizations plot individual data points as dots
on a graph, with one variable represented on each axis. Outliers in scatter plots often show up as points
that deviate substantially from the overall pattern or trend observed among the majority of data
points.
They might appear as isolated dots, lying far from the main cluster, or exhibiting unusual patterns
compared to the bulk of the data. By visually inspecting scatter plots, analysts can quickly pinpoint
potential outliers, prompting further investigation into their nature and possible impact on the analysis.
This preliminary identification lays the groundwork for deeper exploration and understanding of the data's
behavior and distribution.
2. Outlier Identification using Statistical Methods
2.1 Identifying outliers with Z-Score
The Z-score, a widely used statistical approach, quantifies how many standard deviations a data point
is from the mean of the dataset. In outlier detection using the Z-score, data points with Z-scores
beyond a certain threshold (usually set at ±3) are considered outliers. A high positive or negative
Z-score indicates that the data point is unusually far from the mean, signaling its potential outlier
status. By calculating the Z-score for each data point, analysts can systematically discover outliers
based on their deviation from the mean, providing a robust quantitative method for outlier detection.
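A minimal Z-score check in Python, assuming a simple NumPy array of values and the usual |z| > 3 threshold, might look like this:

import numpy as np

def zscore_outliers(values, threshold=3.0):
    # Return the values whose z-score magnitude exceeds the threshold.
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

rng = np.random.default_rng(1)
data = np.append(rng.normal(50, 5, size=200), 120.0)   # inject one extreme value
print(zscore_outliers(data))                           # the injected extreme value is flagged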
2.2 Identifying outliers with DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that
identifies outliers based on the density of data points in their neighborhood. Unlike traditional
clustering algorithms that require specifying the number of clusters in advance, DBSCAN automatically
determines clusters based on data density. Data points that fall outside dense clusters or fail to satisfy
the density criteria are labeled as outliers. By analyzing the local density of data points, DBSCAN
effectively identifies outliers in datasets with complex structure and varying densities, making it
especially suitable for outlier detection in spatial data analysis and other applications.
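A short sketch with scikit-learn's DBSCAN on made-up two-dimensional data; the eps and min_samples values are illustrative and would normally be tuned:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))   # one dense cluster
cluster2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))   # a second dense cluster
noise = np.array([[10.0, -10.0], [-8.0, 9.0]])               # two far-away points
X = np.vstack([cluster1, cluster2, noise])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(X[labels == -1])   # points labelled -1 fall outside any dense cluster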
2.3 Identifying outliers with Isolation Forest algorithm
The Isolation Forest algorithm is an anomaly detection method based on the idea of isolating
outliers in a dataset. It constructs a random forest of decision trees and isolates outliers by
recursively partitioning the dataset into subsets. Outliers are identified as instances that require fewer
partitions to isolate them from the rest of the data. Since outliers are usually few in number
and have attributes that differ drastically from ordinary instances, they are more likely to be isolated
early in the tree-building process. The Isolation Forest algorithm offers a scalable and efficient
approach to outlier detection, especially in high-dimensional datasets, and is robust against the
presence of irrelevant features.
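A corresponding sketch with scikit-learn's IsolationForest; the contamination value (the expected share of outliers) is an assumption chosen for this toy data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                       # ordinary observations
X = np.vstack([X, [[8.0, 8.0], [-7.0, 9.0]]])       # append two clear anomalies

iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)                         # -1 = predicted outlier, 1 = inlier
print(X[labels == -1])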
When Should You Remove Outliers?
Deciding when to remove outliers depends on the context of the analysis. Outliers should be removed
when they are due to errors or anomalies that do not represent the true nature of the data. A few
considerations for removing outliers are:
● Impact on Analysis: Removing outliers can affect statistical measures and model
accuracy.
● Statistical Significance: Consider the consequences of outlier removal on the validity of
the analysis.
Frequency Distribution
Frequency Polygon: connects the midpoints of class frequencies using lines, similar to a histogram but
without bars; it is useful for comparing various datasets.
Class Interval   Frequency
0-20             6
20-40            12
40-60            22
60-80            15
80-100           5
In Grouped Frequency Distribution observations are divided between different intervals known as class
intervals and then their frequencies are counted for each class interval. This Frequency Distribution is
used mostly when the data set is very large.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38,
43, 46, 32, 37, 25
Solution: As the observations lie between 10 and 57, we can choose the class intervals 10-20, 20-30,
30-40, 40-50, and 50-60. These class intervals cover all the observations, and we can count the
frequency for each interval.
Thus, the Frequency Distribution Table for the given data is as follows:
Class Interval   Frequency
10 – 20          5
20 – 30          8
30 – 40          12
40 – 50          6
50 – 60          3
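In Python, a grouped frequency table like this can be produced with pandas; the marks below are a hypothetical small dataset, not the data from the example above:

import pandas as pd

marks = [23, 45, 12, 67, 38, 54, 29, 41, 33, 58, 16, 49, 62, 35, 27]

bins = range(10, 80, 10)                        # class intervals 10-20, 20-30, ..., 60-70
groups = pd.cut(marks, bins=bins, right=False)  # right=False keeps intervals like [10, 20)
print(pd.Series(groups).value_counts().sort_index())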
In Ungrouped Frequency Distribution, all distinct observations are mentioned and counted individually.
This Frequency Distribution is often used when the given dataset is small.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25
Solution:
The unique observations in the given data are only 10, 15, 20, 25, and 30, each occurring with a
different frequency.
Thus the Frequency Distribution Table of the given data is as follows:
Value Frequency
10 4
15 3
20 2
25 3
30 2
This distribution displays the proportion or percentage of observations in each interval or class. It is
useful for comparing different data sets or for analyzing the distribution of data within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)
Example: Make the Relative Frequency Distribution Table for the following data:
Frequency   5   10   20   10   5
Solution:
To create the Relative Frequency Distribution table, we calculate the relative frequency for each class
by dividing its frequency by the total of 50. The Relative Frequency Distribution table is as follows:
Frequency   Relative Frequency
5           0.10
10          0.20
20          0.40
10          0.20
5           0.10
Total 50    1.00
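The same relative frequencies can be computed in pandas by dividing each frequency by the total; the class labels below are only placeholders:

import pandas as pd

freq = pd.Series([5, 10, 20, 10, 5],
                 index=["class 1", "class 2", "class 3", "class 4", "class 5"])
rel_freq = freq / freq.sum()      # relative frequency = frequency / total number of events
print(rel_freq)                   # 0.1, 0.2, 0.4, 0.2, 0.1
print(rel_freq.sum())             # 1.0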
Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals up
to the current one. Frequency distributions that are expressed using cumulative frequencies are called
cumulative frequency distributions. There are two types of cumulative frequency distributions:
● Less than Type: We sum all the frequencies before the current interval.
● More than Type: We sum all the frequencies after the current interval.
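Both types can be computed with a running total; the sketch below uses pandas cumsum on the frequencies from the runs example that follows:

import pandas as pd

freq = pd.Series([2, 2, 1, 4, 4, 5, 1, 3, 2, 1],
                 index=["0-10", "10-20", "20-30", "30-40", "40-50",
                        "50-60", "60-70", "70-80", "80-90", "90-100"])

less_than = freq.cumsum()               # "less than" type: running total from the lowest interval
more_than = freq[::-1].cumsum()[::-1]   # "more than" type: running total from the highest interval
print(less_than)
print(more_than)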
Example: The runs scored in a series of cricket matches are given below. Represent the data as
cumulative frequency distributions of less than type and more than type:
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39
Solution:
Since there are a lot of distinct values, we'll express this in the form of a grouped distribution with
intervals like 0-10, 10-20, and so on. First let's represent the data in the form of a grouped frequency
distribution.
Runs     Frequency
0-10     2
10-20    2
20-30    1
30-40    4
40-50    4
50-60    5
60-70    1
70-80    3
80-90    2
90-100   1
Now we will convert this frequency distribution into a cumulative frequency distribution by summing up
the values of the current interval and all the previous intervals.
Runs            Cumulative Frequency
Less than 10    2
Less than 20    4
Less than 30    5
Less than 40    9
Less than 50    13
Less than 60    18
Less than 70    19
Less than 80    22
Less than 90    24
Less than 100   25
This table represents the cumulative frequency distribution of less than type.
Runs           Cumulative Frequency
More than 0    25
More than 10   23
More than 20   21
More than 30   20
More than 40   16
More than 50   12
More than 60   7
More than 70   6
More than 80   3
More than 90   1
This table represents the cumulative frequency distribution of more than type.
We can plot both the type of cumulative frequency distribution to make the Cumulative Frequency
Curve.
Frequency Distribution Curve
A frequency distribution curve, also known as a frequency curve, is a graphical representation of a data
set's frequency distribution. It is used to visualize the distribution and frequency of values or
observations within a dataset. Let's understand its different types based on the shape of the curve.
Two series can be compared using the Coefficient of Variation (CV), where
CV = (σ / x̄) × 100
with x̄ the mean and σ the standard deviation of the series.
Notice that now both series can be compared with the value of standard deviation only. Therefore, we
can say that for two series with the same mean, the series with a larger deviation can be considered more
variable than the other one.
Frequency Distribution Examples
Example 1: Suppose we have a series with a mean of 20 and a variance of 100. Find the Coefficient of
Variation.
Solution:
We know the formula for the Coefficient of Variation:
CV = (σ / x̄) × 100
Given mean x̄ = 20 and variance σ² = 100, so σ = √100 = 10.
Substituting the values in the formula:
CV = (10 / 20) × 100 = 50
Example 2: Given two series with Coefficients of Variation 70 and 80. The means are 20 and 30. Find
the values of standard deviation for both series.
Solution:
In this question we need to apply the formula for CV and substitute the given values.
Standard deviation of the first series:
C.V. = (σ / x̄) × 100
70 = (σ / 20) × 100
1400 = 100σ
σ = 14
Thus, the standard deviation of the first series = 14
Standard deviation of the second series:
C.V. = (σ / x̄) × 100
80 = (σ / 30) × 100
2400 = 100σ
σ = 24
Thus, the standard deviation of the second series = 24
Example 3: Draw the frequency distribution table for the following data:
2, 3, 1, 4, 2, 2, 3, 1, 4, 4, 4, 2, 2, 2
Solution:
Since there are only very few distinct values in the series, we will plot the ungrouped frequency
distribution.
Value Frequency
1 2
2 6
3 2
4 4
Total 14
Example 4: The table below gives the values of temperature recorded in Hyderabad for 25 days in
summer. Represent the data in the form of less-than-type cumulative frequency distribution:
37 34 36 27 22
25 25 24 26 28
30 31 29 28 30
32 31 28 27 30
30 32 35 34 29
Solution:
Since there are so many distinct values here, we will use a grouped frequency distribution. Let's take
the intervals 20-25, 25-30, 30-35 and 35-40. The frequency distribution table is made by counting the
number of values lying in these intervals.
Temperature   Frequency
20-25         2
25-30         10
30-35         10
35-40         3
This is the grouped frequency distribution table. It can be converted into a cumulative frequency
distribution by adding the previous values.
Temperature    Cumulative Frequency
Less than 25   2
Less than 30   12
Less than 35   22
Less than 40   25
Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35, 47, 21, 32,
49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54, 15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in ascending order as
follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and 60-70.
Interval Frequency
10 – 20 7
20 – 30 10
30 – 40 10
40 – 50 10
50 – 60 10
60 – 70 3
From this data, we can plot the Frequency Distribution Curve as follows:
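A frequency curve of this kind can be drawn with matplotlib by plotting the frequencies against the class-interval midpoints from the table above:

import matplotlib.pyplot as plt

midpoints = [15, 25, 35, 45, 55, 65]        # midpoints of 10-20, 20-30, ..., 60-70
frequencies = [7, 10, 10, 10, 10, 3]        # frequencies from the table above

plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Class interval midpoint")
plt.ylabel("Frequency")
plt.title("Frequency distribution curve")
plt.show()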
A cumulative frequency is defined as the total of frequencies that are distributed over different class
intervals. It means that the data and the total are represented in the form of a table in which the
frequencies are distributed according to the class interval. In this article, we are going to discuss in detail
the cumulative frequency distribution, types of cumulative frequencies, and the construction of the
cumulative frequency distribution table with examples in detail.
What is Meant by Cumulative Frequency Distribution?
The cumulative frequency is the total of frequencies, in which the frequency of the first class interval is
added to the frequency of the second class interval and then the sum is added to the frequency of the
third class interval and so on. Hence, the table that represents the cumulative frequencies that are divided
over different classes is called the cumulative frequency table or cumulative frequency distribution.
Generally, the cumulative frequency distribution is used to identify the number of observations that lie
above or below the particular frequency in the provided data set.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types, namely: less than
cumulative frequency (less than ogive) and greater than (more than) cumulative frequency.
Less Than Cumulative Frequency:
The Less than cumulative frequency distribution is obtained by adding successively the frequencies of
all the previous classes along with the class against which it is written. In this type, the cumulate begins
from the lowest to the highest size.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative frequency. Here,
the greater than cumulative frequency distribution is obtained by determining the cumulative total
frequencies starting from the highest class to the lowest class.
Graphical Representation of Less Than and More Than Cumulative Frequency
Representation of cumulative frequency graphically is easy and convenient as compared to representing
it using a table, bar-graph, frequency polygon etc.
The cumulative frequency graph can be plotted in two ways:
1. Cumulative frequency distribution curve(or ogive) of less than type
2. Cumulative frequency distribution curve(or ogive) of more than type
Steps to Construct Less than Cumulative Frequency Curve
The steps to construct the less than cumulative frequency curve are as follows:
1. Mark the upper limit on the horizontal axis or x-axis.
2. Mark the cumulative frequency on the vertical axis or y-axis.
3. Plot the points (x, y) in the coordinate plane where x represents the upper limit value and y
represents the cumulative frequency.
4. Finally, join the points and draw the smooth curve.
5. The curve so obtained gives a cumulative frequency distribution graph of less than type.
To draw a cumulative frequency distribution graph of less than type, consider the following cumulative
frequency distribution table which gives the number of participants in any level of essay writing
competition according to their age:
Table 1 Cumulative Frequency distribution table of less than type
Monday 2 hrs
Tuesday 1 hr
Wednesday 2 hrs
Thursday 3 hrs
Friday 4 hrs
Saturday 2 hrs
Sunday 6 hrs
Solution:
Let the no. of hours be the frequency.
Hence, the cumulative frequency table is calculated as follows:
Day         No. of hours   Cumulative Frequency
Monday      2 hrs          2
Tuesday     1 hr           2 + 1 = 3
Wednesday   2 hrs          3 + 2 = 5
Thursday    3 hrs          5 + 3 = 8
Friday      4 hrs          8 + 4 = 12
Saturday    2 hrs          12 + 2 = 14
Sunday      6 hrs          14 + 6 = 20
Frequency Distributions for Nominal Data
Nominal data classifies observations into named categories that have no inherent order. Its key
characteristics are:
● Categorical Classification:
Nominal data is used to categorize variables into distinct groups based on qualitative attributes,
without any numerical significance or inherent order.
● Mutually Exclusive:
Each data point can belong to only one category, ensuring clear and precise classification without
overlap between groups.
● No Order or Hierarchy:
The categories within nominal data do not have a ranked sequence or hierarchy; all categories
are considered equal but different.
● Identified by Labels:
Categories are often identified using names or labels, which can occasionally include numbers
used as identifiers rather than quantitative values.
● Limited Statistical Analysis:
Analysis of nominal data primarily involves counting frequency, determining mode, and using
chi-square tests, as measures of central tendency like mean or median are not applicable.
● Frequency Distribution:
One of the most common methods of analyzing nominal data is to count the frequency of
occurrences in each category. This helps in understanding the distribution of data across the
different categories. For instance, in a nominal data example like survey responses on preferred
types of cuisine, frequency distribution would reveal how many respondents prefer each type of
cuisine.
● Mode Determination:
The mode, or the most frequently occurring category in the dataset, is a key measure of central
tendency that can be applied to nominal data. It provides insight into the most common or
popular category among the data points. For example, if analyzing nominal data on pet
ownership, the mode would indicate the most common type of pet among participants.
● Cross-tabulation:
Cross-tabulation involves comparing two or more nominal variables to identify relationships
between categories. This analysis can reveal patterns and associations that are not immediately
apparent. For instance, cross-tabulating nominal data on consumers' favorite fast-food chains
with their age groups could uncover preferences trends among different age demographics.
● Chi-square Test:
For more complex analysis involving nominal data, the chi-square test is used to examine the
relationships between two nominal variables. It tests whether the distribution of sample
categorical data matches an expected distribution. As an example, researchers might use a
chi-square test to analyze whether there is a significant association between gender (a nominal
data example) and preference for a particular brand of product.
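A sketch of such a test with SciPy, using a made-up contingency table of gender versus preferred brand (the counts are purely illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: two gender groups; columns: three brands (hypothetical survey counts).
observed = np.array([[30, 20, 10],
                     [25, 25, 10]])

chi2, p, dof, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2), " p-value:", round(p, 3), " dof:", dof)
# A small p-value (e.g. below 0.05) would suggest an association between the two nominal variables.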
Examples
To illustrate the concept of nominal data more concretely, here are some practical examples that
showcase its application across various fields and contexts:
                      Nominal Data                               Ordinal Data
Analysis Techniques   Frequency counts, mode, chi-square tests   Median, percentile, rank correlation,
                                                                  non-parametric tests
Application           Used for categorizing data without any     Used when data classification requires
                      need for ranking.                           a hierarchy or ranking.
Interpreting Distributions
1. Normal Distribution
- Symmetric, bell-shaped
- Mean = Median = Mode
- Characteristics:
- Most data points cluster around mean
- Tails decrease exponentially
- 68% data within 1 standard deviation
- 95% data within 2 standard deviations
- Examples: Height, IQ scores, measurement errors
2. Skewed Distribution
3. Bimodal Distribution
4. Multimodal Distribution
- Multiple peaks
- Characteristics:
- Multiple modes (local maxima)
- Multiple valleys
- Data has multiple distinct groups
- Examples: Gene expression data, text analysis
5. Uniform Distribution
8. Lognormal Distribution
9. Binomial Distribution
If the data set has an even number of values, then the median is found by taking the average of the two
middle values. Consider the 10 (even) values 1, 2, 3, 7, 8, 3, 2, 5, 4, 15. We first sort the values in
ascending order as 1, 2, 2, 3, 3, 4, 5, 7, 8, 15; the median is then (3+4)/2 = 3.5, the average of the two
middle values, i.e. the values located at the 5th and 6th positions in the sequence, each with 4 numbers
on either side.
Mode
It is the most frequent value in the data set. We can easily get the mode by counting the frequency of
occurrence. Consider a data set with the values 1,5,5,6,8,2,6,6. In this data set, we can observe the
following: the value 6 occurs three times, more than any other value, hence the mode of the data set is 6.
We often test our data by plotting the distribution curve; if most of the values are centrally located and
very few values are far from the center, then we say that the data has a normal distribution. In that case,
the values of mean, median, and mode are almost equal.
However, when our data is skewed, for example, as with the right-skewed data set below:
We can say that the mean is being dragged in the direction of the skew. In this skewed distribution, mode
< median < mean. The more skewed the distribution, the greater the difference between the median and
mean, here we consider median for the conclusion. The best example of the right-skewed distribution is
salaries of employees, where higher-earners provide a false representation of the typical income if
expressed as mean salaries and not the median salaries.
For left-skewed distribution mean < median < mode. In such a case also, we emphasize the median
value of the distribution.
df["Watch Time"].mode()
0 1.5
dtype: float64
df["Watch Time"].median()
2.0
If we observe the values, we can conclude that the mean Watch Time is 2.5 hours, which appears
reasonably correct. For the Age of viewers, the following results are obtained:
df["Age"].median()
12.5
df["Age"].mean()
19.9
df["Age"].mode()
0 12
1 15
dtype: int64
The value of the mean Age looks somewhat removed from the actual data: most of the viewers are in the
range of 10 to 15, while the mean comes out to 19.9. This is because of the outliers present in the data
set. We can easily find the outliers using a boxplot.
sns.boxplot(df['Age'], orient='vertical')
If we observe the value of Median Age then the result looks correct. The value of mean is very sensitive
to outliers.
Now for the most popular language, we can not calculate the mean and median since this is nominal
data.
sns.barplot(x="Language",y="Age",data=df)
sns.barplot(x="Language",y="Watch Time",data=df)
If we observe the graphs, the Tamil bar is the largest in both the Language vs Age and the Language vs
Watch Time plots. But this is misleading because there is only one person who watches the shows in
Tamil.
df["Language"].value_counts()
Hindi 4
English 3
Tamil 1
Telgu 1
Marathi 1
Name: Language, dtype: int64
df["Language"].mode()
0 Hindi
dtype: object
Result
From the above result, it is concluded that the most popular language is Hindi. This is observed when we
find the mode of the data set.
Hence from the above observations, it is concluded that in the sample survey the typical viewer is about
12.5 years old (the median age) and watches a show in the Hindi language for about 2.5 hours daily.
We can say there is no single best measure of central tendency because the result always depends on the
type of data. For ordinal, interval, and ratio data (if skewed) we prefer the median. For nominal data,
the mode is preferred, and for interval and ratio data (if not skewed) the mean is preferred.
Measures of Central Tendency and Dispersion
Dispersion measures indicate how data values are spread out. The range, which is the difference between
the highest and lowest values, is a simple measure of dispersion. The standard deviation measures the
expected difference between a data value and the mean.
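For example, with NumPy (using a small made-up set of observations):

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])      # hypothetical observations

data_range = data.max() - data.min()         # range: highest value minus lowest value
std_dev = data.std(ddof=1)                   # sample standard deviation

print("range:", data_range)
print("standard deviation:", round(std_dev, 2))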
________________________________________________________________________________
UNIT -III
Normal distributions
Normal Distribution is the most common or normal form of distribution of Random Variables, hence the
name “normal distribution.” It is also called Gaussian Distribution in Statistics or Probability. We use
this distribution to represent a large number of random variables. It serves as a foundation for statistics
and probability theory.
It also describes many natural phenomena, forms the basis of the Central Limit Theorem, and also
supports numerous statistical methods.
The normal distribution is the most important and most widely used distribution in statistics. It is
sometimes called the “bell curve,” although the tonal qualities of such a bell would be less than pleasing.
It is also called the "Gaussian curve" or Gaussian distribution after the mathematician Karl Friedrich
Gauss.
Strictly speaking, it is not correct to talk about “the normal distribution” since there are many normal
distributions. Normal distributions can differ in their means and in their standard deviations. Figure 4.1
shows three normal distributions. The blue (left-most) distribution has a mean of −3 and a standard
deviation of 0.5, the distribution in red (the middle distribution) has a mean of 0 and a standard deviation
of 1, and the black (right-most) distribution has a mean of 2 and a standard deviation of 3. These as well
as all other normal distributions are symmetric with relatively more values at the center of the
distribution and relatively few in the tails. What is consistent about all normal distributions is the shape
and the proportion of scores within a given distance along the x-axis. We will focus on the standard
normal distribution (also known as the unit normal distribution), which has a mean of 0 and a standard
deviation of 1 (i.e., the red distribution in Figure 4.1).
Figure 4.1. Normal distributions differing in mean and standard deviation. (“Normal Distributions with
Different Means and Standard Deviations” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
5. Normal distributions are defined by two parameters, the mean (μ) and the standard
deviation (σ).
6. 68% of the area of a normal distribution is within one standard deviation of the mean.
7. Approximately 95% of the area of a normal distribution is within two standard deviations
of the mean.
These properties enable us to use the normal distribution to understand how scores relate to one another
within and across a distribution. But first, we need to learn how to calculate the standardized score that
makes up a standard normal distribution.
Z Scores
A z score is a standardized version of a raw score (x) that gives information about the relative location
of that score within its distribution. The formula for converting a raw score into a z score is
z = (x − μ) / σ
or, using sample statistics, z = (x − M) / s.
As you can see, z scores combine information about where the distribution is located (the mean/center)
with how wide the distribution is (the standard deviation/spread) to interpret a raw score (x).
Specifically, z scores will tell us how far the score is away from the mean in units of standard deviations
and in what direction.
The value of a z score has two parts: the sign (positive or negative) and the magnitude (the actual
number). The sign of the z score tells you in which half of the distribution the z score falls: a positive
sign (or no sign) indicates that the score is above the mean and on the right-hand side or upper end of the
distribution, and a negative sign tells you the score is below the mean and on the left-hand side or lower
end of the distribution. The magnitude of the number tells you, in units of standard deviations, how far
away the score is from the center or mean. The magnitude can take on any value between negative and
positive infinity, but for reasons we will see soon, they generally fall between −3 and 3.
Let’s look at some examples. A z score value of −1.0 tells us that this z score is 1 standard deviation
(because of the magnitude 1.0) below (because of the negative sign) the mean. Similarly, a z score value
of 1.0 tells us that this z score is 1 standard deviation above the mean. Thus, these two scores are the
same distance away from the mean but in opposite directions. A z score of −2.5 is two-and-a-half
standard deviations below the mean and is therefore farther from the center than both of the previous
scores, and a z score of 0.25 is closer than all of the ones before. In Unit 2, we will learn to formalize the
distinction between what we consider “close to” the center or “far from” the center. For now, we will use
a rough cut-off of 1.5 standard deviations in either direction as the difference between close scores
(those within 1.5 standard deviations or between z = −1.5 and z = 1.5) and extreme scores (those farther
than 1.5 standard deviations—below z = −1.5 or above z = 1.5).
We can also convert raw scores into z scores to get a better idea of where in the distribution those scores
fall. Let’s say we get a score of 68 on an exam. We may be disappointed to have scored so low, but
perhaps it was just a very hard exam. Having information about the distribution of all scores in the class
would be helpful to put some perspective on ours. We find out that the class got an average score of 54
with a standard deviation of 8. To find out our relative location within this distribution, we simply
convert our test score into a z score: z = (68 − 54) / 8 = 1.75.
We find that we are 1.75 standard deviations above the average, above our rough cut-off for close and
far. Suddenly our 68 is looking pretty good!
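The same standardization is a one-line calculation in Python:

def z_score(x, mean, sd):
    # How many standard deviations the raw score x lies from the mean.
    return (x - mean) / sd

print(z_score(68, mean=54, sd=8))   # 1.75 -> our exam score is 1.75 SDs above the class average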
Figure 4.2 shows both the raw score and the z score on their respective distributions. Notice that the red
line indicating where each score lies is in the same relative spot for both. This is because transforming a
raw score into a z score does not change its relative location, it only makes it easier to know precisely
where it is.
Figure 4.2. Raw and standardized versions of a single score. (“Raw and Standardized Versions of a
Score” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
z Scores are also useful for comparing scores from different distributions. Let’s say we take the SAT and
score 501 on both the math and critical reading sections. Does that mean we did equally well on both?
Scores on the math portion are distributed normally with a mean of 511 and standard deviation of 120,
so our z score on the math section is z_math = (501 − 511) / 120 ≈ −0.08,
which is just slightly below average (note the use of “math” as a subscript; subscripts are used when
presenting multiple versions of the same statistic in order to know which one is which and have no
bearing on the actual calculation). The critical reading section has a mean of 495 and standard deviation
of 116, so z_CR = (501 − 495) / 116 ≈ 0.05.
So even though we were almost exactly average on both tests, we did a little bit better on the critical
reading portion relative to other people.
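The same idea can be scripted to compare one raw score against two different distributions; this short sketch simply re-runs the SAT numbers above:

def z_score(x, mean, sd):
    return (x - mean) / sd

z_math = z_score(501, 511, 120)      # about -0.08
z_reading = z_score(501, 495, 116)   # about  0.05
print(round(z_math, 2), round(z_reading, 2))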
Finally, z scores are incredibly useful if we need to combine information from different measures that
are on different scales. Let’s say we give a set of employees a series of tests on things like job
knowledge, personality, and leadership. We may want to combine these into a single score we can use to
rate employees for development or promotion, but look what happens when we take the average of raw
scores from different scales, as shown in Table 4.1.
Employee | Job Knowledge (0–100) | Personality (1–5) | Leadership (1–5) | Average
Because the job knowledge scores were so large and so similar to one another, they overpowered the
other scores and removed almost all variability in the average. However, if we standardize these scores
into z scores, our averages retain more variability and it is easier to assess differences between
employees, as shown in Table 4.2.
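A hedged sketch of the standardize-then-average idea behind Tables 4.1 and 4.2, using NumPy; the three employees and their ratings here are made-up illustration values, not the numbers from the tables:

import numpy as np

# columns: job knowledge (0-100), personality (1-5), leadership (1-5)
ratings = np.array([
    [88.0, 4.2, 1.5],
    [92.0, 3.0, 4.4],
    [90.0, 2.8, 3.6],
])

raw_average = ratings.mean(axis=1)              # dominated by the 0-100 column
col_means = ratings.mean(axis=0)
col_sds = ratings.std(axis=0, ddof=1)
z = (ratings - col_means) / col_sds             # standardize each measure separately
z_average = z.mean(axis=1)                      # now every measure carries equal weight

print(np.round(raw_average, 2))
print(np.round(z_average, 2))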
A z score can also be converted back into a raw score by rearranging the z-score formula: X = zσ + μ for a population and X = zs + M for a sample. Notice that these are just simple rearrangements of the original formulas for calculating z
from raw scores.
Let’s say we create a new measure of intelligence, and initial calibration finds that our scores have a
mean of 40 and standard deviation of 7. Three people who have scores of 52, 43, and 34 want to know
how well they did on the measure. We can convert their raw scores into z scores:
z = (52 − 40) / 7 = 1.71, z = (43 − 40) / 7 = 0.43, and z = (34 − 40) / 7 = −0.86.
A problem is that these new z scores aren’t exactly intuitive for many people. We can give people
information about their relative location in the distribution (for instance, the first person scored well
above average), or we can translate these z scores into the more familiar metric of IQ scores, which have
a mean of 100 and standard deviation of 16: IQ = z(16) + 100, which gives 1.71(16) + 100 = 127.36, 0.43(16) + 100 = 106.88, and −0.86(16) + 100 = 86.24.
We would also likely round these values to 127, 107, and 86, respectively, for convenience.
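The raw-to-z-to-new-scale conversion can also be scripted directly; this is a minimal sketch using the numbers from the example (the decimals differ slightly from the hand calculation because z is not rounded first):

raw_scores = [52, 43, 34]
for x in raw_scores:
    z = (x - 40) / 7                 # z score on the original measure
    rescaled = z * 16 + 100          # same position on an IQ-like scale
    print(x, round(z, 2), round(rescaled, 1))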
We saw in Chapter 3 that standard deviations can be used to divide the normal distribution: 68% of the
distribution falls within 1 standard deviation of the mean, 95% within (roughly) 2 standard deviations,
and 99.7% within 3 standard deviations. Because z scores are in units of standard deviations, this means
that 68% of scores fall between z = −1.0 and z = 1.0 and so on. We call this 68% (or any percentage we
have based on our z scores) the proportion of the area under the curve. Any area under the curve is
bounded by (defined by, delineated by, etc.) a single z score or pair of z scores.
An important property to point out here is that, by virtue of the fact that the total area under the curve of
a distribution is always equal to 1.0 (see section on Normal Distributions at the beginning of this
chapter), these areas under the curve can be added together or subtracted from 1 to find the proportion in
other areas. For example, we know that the area between z = −1.0 and z = 1.0 (i.e., within one standard
deviation of the mean) contains 68% of the area under the curve, which can be represented in decimal
form as .6800. (To change a percentage to a decimal, simply move the decimal point 2 places to the left.)
Because the total area under the curve is equal to 1.0, that means that the proportion of the area outside z
= −1.0 and z = 1.0 is equal to 1.0 − .6800 = .3200 or 32% (see Figure 4.3). This area is called the area in
the tails of the distribution. Because this area is split between two tails and because the normal
distribution is symmetrical, each tail has exactly one-half, or 16%, of the area under the curve.
Figure 4.3. Shaded areas represent the area under the curve in the tails. (“Area under the Curve in the
Tails” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
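These areas can be checked numerically; the sketch below assumes SciPy is available and uses its standard normal cumulative distribution function:

from scipy.stats import norm

middle = norm.cdf(1.0) - norm.cdf(-1.0)   # ~0.6827, the "68%" between z = -1 and z = +1
tails = 1 - middle                        # ~0.3173, split across the two tails
one_tail = norm.cdf(-1.0)                 # ~0.1587, a single tail (~16%)

print(round(middle, 4), round(tails, 4), round(one_tail, 4))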
We will have much more to say about this concept in the coming chapters. As it turns out, this is a quite
powerful idea that enables us to make statements about how likely an outcome is and what that means
for research questions we would like to answer and hypotheses we would like to test.
All kinds of variables in natural and social sciences are normally or approximately normally
distributed. Height, birth weight, reading ability, job satisfaction, or SAT scores are just a few
examples of such variables. Because normally distributed variables are so common, many statistical tests are designed for
normally distributed populations. Understanding the properties of normal distributions means you can use inferential statistics to
compare different groups and make estimates about populations using samples.
What are the properties of normal distributions?
Normal distributions have key characteristics that are easy to spot in graphs:
● The distribution is symmetric about the mean: half the values fall below the mean and half above the mean.
● The distribution can be described by two values: the mean and the standard deviation.
The mean is the location parameter and the standard deviation is the scale parameter.
The mean determines where the peak of the curve is centered; increasing the mean moves the curve to the right, and decreasing it moves the curve to the left. The standard deviation stretches or squeezes the curve: a small standard deviation gives a narrow curve, while a large standard deviation gives a wide one.
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal distribution:
● Around 68% of values are within 1 standard deviation from the mean.
● Around 95% of values are within 2 standard deviations from the mean.
● Around 99.7% of values are within 3 standard deviations from the mean.
Example: Using the empirical rule in a normal distribution. You collect SAT scores from students in a
new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150. Following the empirical rule:
● Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and below
the mean.
● Around 95% of scores are between 850 and 1,450, 2 standard deviations above and below
the mean.
● Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and below
the mean.
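A short Python sketch that reproduces these intervals, plus an optional simulation check (the simulated shares are random, so they only come close to 68/95/99.7%):

import numpy as np

m, sd = 1150, 150
for k in (1, 2, 3):
    print(f"within {k} SD: {m - k * sd} to {m + k * sd}")

rng = np.random.default_rng(0)
scores = rng.normal(m, sd, size=100_000)
for k in (1, 2, 3):
    print(k, round(np.mean(np.abs(scores - m) <= k * sd), 3))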
The empirical rule is a quick way to get an overview of your data and check for any outliers or
extreme values that don’t follow this pattern.
If data from small samples do not closely follow this pattern, then other distributions like the
t-distribution may be more appropriate. Once you identify the distribution of your variable, you can apply appropriate statistical tests.
The central limit theorem is the basis for how normal distributions work in statistics. In research, to get a good idea of a population mean, ideally you’d collect data from multiple
random samples within the population. A sampling distribution of the mean is the distribution of the means of these different samples. The central limit theorem shows the following:
● Law of Large Numbers: As you increase sample size (or the number of samples), the sample mean will approach the population mean.
● With multiple large samples, the sampling distribution of the mean is normally distributed, even if your original variable is not normally distributed.
Parametric statistical tests typically assume that samples come from normally distributed
populations, but the central limit theorem means that this assumption isn’t necessary to meet when you have a large enough sample.
You can use parametric tests for large samples from populations with any kind of distribution
as long as other important assumptions are met. A sample size of 30 or more is generally
considered large.
For small samples, the assumption of normality is important because the sampling distribution
of the mean isn’t known. For accurate results, you have to be sure that the population is
normally distributed before you can use parametric tests with small samples.
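The central limit theorem claim above can be illustrated with a small NumPy simulation; the exponential population and the sample size of 30 are arbitrary choices made for this sketch:

import numpy as np

rng = np.random.default_rng(42)

# 5,000 samples of size 30 from a skewed exponential population
# (population mean 2.0, population SD 2.0)
sample_means = rng.exponential(scale=2.0, size=(5_000, 30)).mean(axis=1)

print(round(sample_means.mean(), 3))        # close to the population mean, 2.0
print(round(sample_means.std(ddof=1), 3))   # close to 2.0 / sqrt(30), about 0.365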
Once you know the mean and standard deviation of a normal distribution, you can describe it with a probability density function. In a probability density function, the area under the curve tells you probability. The normal distribution is a probability distribution, so the total area under the curve is always 1 or 100%.
The formula for the normal probability density function looks fairly complicated. But to use it,
you only need to know the population mean and standard deviation.
For any value of x, you can plug the mean and standard deviation into the formula to find the
probability density of the variable taking on that value of x:
f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
where
● x = value of the variable
● μ = mean
● σ = standard deviation
● σ² = variance
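As a sanity check, the density formula can be coded directly and compared against SciPy's built-in normal density (a sketch, assuming SciPy is available):

import math
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    # normal probability density at x for mean mu and standard deviation sigma
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(1380, 1150, 150))          # density at x = 1380
print(norm.pdf(1380, loc=1150, scale=150))  # SciPy returns the same value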
Example: Using the probability density function. You want to know the probability that SAT scores in your sample exceed 1380.
On your graph of the probability density function, the probability is the shaded area under the curve
that lies to the right of where your SAT scores equal 1380.
You can find the probability value of this score using the standard normal distribution.
The standard normal distribution, also called the z-distribution, is a normal distribution with a mean of 0 and a standard deviation of 1. Every normal distribution is a version of the standard normal distribution that’s been stretched
or squeezed and moved horizontally right or left.
While individual observations from normal distributions are referred to as x, they are referred to
as z in the z-distribution. Every normal distribution can be converted to the standard normal distribution by turning the individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-score
of a value:
z = (x − μ) / σ
where
● x = individual value
● μ = mean
● σ = standard deviation
We convert normal distributions into the standard normal distribution for several reasons:
● To compare scores on different distributions that have different means and standard deviations.
● To find the probability of observations in a distribution falling above or below a given value.
● To find the probability that a sample mean significantly differs from a known population mean.
Every z-score has an associated p-value that tells you the probability of all values
below that z-score occurring. If you convert an individual value into a z-score, you can then
find the probability of all values up to that value occurring in a normal distribution.
Example: Finding probability using the z-distribution. To find the probability of SAT scores in your sample exceeding 1380, you first find the z-score.
The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you how many standard deviations away from the mean a score of 1380 is:
z = (x − μ) / σ = (1380 − 1150) / 150 = 1.53
For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or less
(93.7%), and it’s the area under the curve left of the shaded area.
To find the shaded area, you subtract 0.937 from 1, the total area under the curve: 1 − 0.937 = 0.063.
That means only about 6.3% of SAT scores in your sample are expected to exceed 1380.
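The whole calculation can also be done in a few lines with SciPy's standard normal functions (a sketch, assuming SciPy is available):

from scipy.stats import norm

mean, sd = 1150, 150
z = (1380 - mean) / sd          # about 1.53
p_below = norm.cdf(z)           # about 0.937, area to the left of z
p_above = 1 - p_below           # about 0.063, area in the upper tail

# norm.sf(1380, loc=mean, scale=sd) returns the same upper-tail area directly.
print(round(z, 2), round(p_below, 3), round(p_above, 3))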