LAS Unit-04 Study Material F.Y.B.tech Sem-II 2022-23
Statistics is a field of mathematics that is related to data analysis. For the last few centuries,
statistics has remained a part of mathematics as the original work was done by
mathematicians like Pascal, James Bernoulli, De Moivre, Laplace, Gauss and others. Until the
early nineteenth century, statistics was mainly concerned with official statistics needed for
the collection of information on revenue, population etc. The science of statistics developed
gradually and its field of application widened day by day. In fact, the term statistics is
generally used to mean numerical facts and figures.
Statistics: the collection, presentation, analysis and interpretation of numerical data.
Statistics is the study of the collection, organization, analysis, interpretation and
presentation of data with the use of quantified models. In short, it is a mathematical tool
that is used to collect and summarize data.
Scope of Statistics:
1. It presents the facts in numerical figures. For example, recording the sales of various
products in a company.
2. It studies the relationship between two or more phenomena. For example, in
medical science, for collection, presentation and analysis of observed facts relating
to causes and incidence of diseases and the result of the application of various
drugs.
3. It helps in the formulation of policies. For example, Economic policy is formulated
by governments by considering and correlating the data regarding profits &
dividends, assets & liabilities, income & expenditures.
4. It presents complex facts in a simplified form. For example, in Astronomy, to find
most probable measurements of the distance, sizes, masses and densities of
heavenly bodies by means of observations.
5. It helps in forecasting. For example, stock market results, sales, GDP etc.
6. It provides techniques for testing of hypotheses. For example, in planning the
marketing strategies.
Data:
Data is defined as "facts and statistics collected together for reference or analysis." In
other words, data is information that has been gathered and analyzed in order to be used
for a specific purpose.
Data is important because it helps us understand the world around us, test hypotheses
and make predictions.
Data collection:
In statistics, data collection is the process of gathering information from all the relevant
sources to find a solution to the research problem. It helps to evaluate the outcome
of the problem. Data collection methods allow a person to reach an answer
to the relevant question. Most organizations use data collection methods to make
assumptions about future probabilities and trends.
Primary Data: Primary data is data collected by the investigator himself for
the purpose of a specific inquiry or study. Such data is original in character and is
generated by surveys conducted by individuals, research institutions or other organisations.
Example: If a researcher is interested in knowing the impact of the noon meal scheme on
school children, he has to undertake a survey and collect data on the opinions of parents and
children by asking relevant questions. Data collected for this purpose is called
primary data.
Sources of Secondary data: In most of the studies the investigator finds it impracticable to
collect first-hand information on all related issues and as such he makes use of the data
collected by others. There is a vast amount of published information from which statistical
studies may be made and fresh statistics are constantly in a state of production.
The sources of secondary data can broadly be classified under two heads:
1. Published sources, and
2. Unpublished sources.
1. Published Sources:
The various sources of published data are: Clinical and other personal records, death
certificates, published mortality statistics, census publications, etc.
Examples include:
1. Official publications of Central Statistical Authority
2. Publication of Ministry of Health and Other Ministries
3. News Papers and Journals.
4. International Publications like Publications by WHO, World Bank, UNICEF
5. Records of hospitals or any Health Institutions.
Note: A lot of secondary data is available on the internet. We can access it at any time for
further study.
2. Unpublished Sources.
All statistical material is not always published. There are various sources of unpublished
data such as records maintained by various Government and private offices, studies made
by research institutions, scholars, etc. Such sources can also be used where necessary.
Precautions in the use of Secondary data
The following are some of the points that are to be considered in the use of secondary data
1. How the data has been collected and processed
2. The accuracy of the data
3. How far the data has been summarized
4. How comparable the data is with other tabulations
5. How to interpret the data, especially when figures collected for one purpose are used for
another.
Generally speaking, with secondary data, people have to compromise between what
they want and what they are able to find.
To explain the types of data, we have categorized the types of statistical data with
examples and detailed insights here.
1. Nominal Data
Nominal data is a type of data that includes names or labels. Examples of nominal data
include gender, Nationality, Religion, etc. In research studies, nominal data is often
used to group participants into different categories. For instance, researchers may
want to study the effects of a new treatment on men and women. In this case, the
nominal data would be used to separate the participants into two groups: men and
women.
Numbers are sometimes used as labels for categories, but this does not always make the
data nominal. For instance, a customer satisfaction survey might ask customers to rate
their experience on a scale from 1 to 5, with 1 being "very unsatisfied" and 5 being
"very satisfied." Here the numerical values represent categories (satisfaction levels)
that have a natural order, so the data is ordinal rather than nominal.
The term ordinal data refers to data with labels that indicate ranking or order.
Examples of ordinal data include social class (upper class, middle class, lower class),
opinions (excellent, good, bad), and satisfaction ratings (Very Satisfied, Satisfied,
Neutral, Unsatisfied, Very Unsatisfied)
Discrete Data
Discrete data can take only distinct, countable values.
For example, if you were tracking the number of students in each grade at a school,
the data would be discrete because there are a finite number of possibilities (ranging
from 0 to the maximum number of students in any given grade).
Continuous Data
Continuous data is a type of data that can take on any value within a certain range.
That is, the data is not divided into distinct values but rather exists as points along a
continuum. Continuous data is often difficult to collect because it requires precise
measurements. It is also more difficult to analyze than discrete data because it often
contains errors. However, continuous data provides more information than discrete
data and can be used to make more accurate predictions. For these reasons,
continuous data is often used in fields such as weather forecasting and medicine.
For example, temperature is continuous data because it can be any number within
a certain range (32 degrees, 32.5 degrees, 33.27 degrees etc.).
Diagrams are an essential tool for the presentation of statistical data. They
are mainly geometrical figures such as lines, circles, bars, etc. Presenting statistics
with the help of diagrams makes the data easier to understand, thereby enhancing the
representation of any type of data.
From the charts and diagrams, we will be able to make an analysis and possibly predict
future outcomes.
Line Diagram:
Line diagram is used to represent specific data across varying parameters. A line represents
the sequence of data connected against a particular variable.
A line chart is also called a line plot or line graph. It is used to show how variables and
information change over time. The information on a line chart is represented with points
and the points are connected with a continuous line.
For example, if you have information on how the price of petrol changed over 5 months,
you can represent that in a line chart so that the trend can be viewed and studied.
A scatter plot displays the relationship between two sets of data. In a scatter plot, dots
are used to represent the values of the data. After collecting data and plotting it, the
pattern of the dots on the plots will tell the relationship between the sets that are being
compared.
For example, if you have information about a person's weight at different ages of his life,
you can represent that in a scatter plot and it will look like the figure below.
Solution:
(i) We represent the above data by a simple bar diagram in the following manner:
Step-1: Years are marked along the X-axis and labeled as ‘Year’.
Step-2: Values of Production Cost are marked along the Y-axis and labeled as ‘Production
Cost (in lakhs of ₹)’.
Step-3: Vertical rectangular bars are erected on the years marked and whose height is
proportional to the magnitude of the respective production cost.
The Pie diagram is a circular diagram. As the diagram looks like a pie, it is given this name. A
circle, which has 360°, is divided into different sectors. The angles of the sectors, subtended at
the center, are proportional to the magnitudes of the frequencies of the components.
Procedure:
The following procedure can be followed to draw a Pie diagram for a given data:
v. Draw the second sector adjacent to the first sector at an angle corresponding to the
second component.
Solution :
The following procedure is followed to draw a Pie diagram for a given data:
ii. Compute angles for each component food, clothing, recreation, education, rent and
miscellaneous using the formula: angle = (class frequency / N) × 360°
iv. Draw the first sector in the anti-clockwise direction at an angle calculated for the first
component food.
v. Draw the second sector adjacent to the first sector at an angle corresponding to the second
component clothing.
vi. This process is continued for all the components namely recreation, education, rent and
miscellaneous.
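The angle computation in step ii can be sketched in Python. The expenditure figures below are hypothetical values invented for illustration (the original data table is not reproduced here), and the function name pie_angles is likewise an assumption.

```python
def pie_angles(components):
    """Return the sector angle in degrees for each component.

    Each angle is (component value / total) x 360, so the angles sum to 360.
    """
    total = sum(components.values())
    return {name: value / total * 360 for name, value in components.items()}


# Hypothetical monthly expenditure, invented for illustration only.
expenditure = {
    "food": 400,
    "clothing": 100,
    "recreation": 50,
    "education": 150,
    "rent": 200,
    "miscellaneous": 100,
}

angles = pie_angles(expenditure)
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

Because every angle is a share of 360 degrees, checking that the angles add up to 360 is a quick way to catch arithmetic slips before drawing the sectors.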
Just like the bar chart, the data in a histogram chart is represented with bars but a
histogram organizes data in ranges. It shows the frequency at which different ranges of
data occur.
e.g. Given the heights of the trees (in inches): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73,
73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can
group the data as follows in a frequency distribution table by setting a range:
Frequency distribution: a mathematical function showing the number of instances in which a
variable takes each of its possible values.
The frequency of a value is the number of times it occurs in a dataset. A frequency
distribution is the pattern of frequencies of a variable. It’s the number of times each
possible value of a variable occurs in a dataset.
The method for making a frequency table differs between the four types of frequency
distributions. You can follow the guides below or use software such as Excel, SPSS, or R to
make a frequency table.
1. Create a table with two columns and as many rows as there are values of the
variable. Label the first column using the variable name and label the second column
“Frequency.” Enter the values in the first column.
Example:
1. Divide the variable into class intervals. Below is one method to divide a variable
into class intervals. Different methods will give different answers, but there’s no
agreement on the best method to calculate class intervals.
o You can round this value to a whole number or a number that’s convenient to
add (such as a multiple of 10).
o Calculate the class intervals. Each interval is defined by a lower limit and
upper limit. Observations in a class interval are greater than or equal to the
lower limit and less than the upper limit:
The lower limit of the first interval is the lowest value in the dataset. Add the
class interval width to find the upper limit of the first interval and the lower
limit of the second interval. Keep adding the interval width to calculate more
class intervals until you exceed the highest value.
2. Create a table with two columns and as many rows as there are class intervals.
Label the first column using the variable name and label the second column
“Frequency.” Enter the class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in each
class interval. You can count by tallying if you find it helpful. Enter the frequencies in
the second column of the table beside their corresponding class intervals.
The class intervals are 19 ≤ a < 29, 29 ≤ a < 39, 39 ≤ a < 49, 49 ≤ a < 59, and 59 ≤ a < 69.
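The counting steps for a grouped frequency table can be sketched in Python using the tree heights listed in the histogram example. The starting value 60 and the class width 5 are assumptions chosen for illustration, not the grouping used in the original table, and the helper name frequency_table is hypothetical.

```python
import math


def frequency_table(values, start, width):
    """Count observations in equal-width intervals [lower, upper).

    Assumes start is not greater than the smallest value.
    """
    n_classes = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_classes
    for value in values:
        # Index of the interval whose lower limit is <= value < upper limit.
        counts[int((value - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]


heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5,
           76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84,
           85, 87]

table = frequency_table(heights, start=60, width=5)
for lower, upper, count in table:
    print(f"{lower} <= h < {upper}: {count}")
```

The frequencies should add up to the number of observations (30 here), which is a useful check after tallying by hand as well.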
Example:
Central Tendency:
The tendency for the values of a random variable to cluster around its mean, mode, or median.
Measures of central tendency are summary statistics that represent the center point or
typical value of a dataset. Examples of these measures include the mean, median, and mode.
These statistics indicate where most values in a distribution fall and are also referred to as
the central location of a distribution. You can think of central tendency as the propensity
for data points to cluster around a middle value.
In statistics, the mean, median, and mode are the three most common measures of central
tendency. Each one calculates the central point using a different method. Choosing the best
measure of central tendency depends on the type of data.
Mean:
The mean is the arithmetic average, and it is probably the measure of central tendency that
you are most familiar with. Calculating the mean is very simple. You just add up all of the values
and divide by the number of observations in your dataset.
Let x1, x2, x3 , . . . , xn be n observations. We can find the arithmetic mean using the mean
formula:
Mean, x̄ = (x1 + x2 + ... + xn)/n
Example: Find the mean height of 5 people whose heights are 142 cm, 150 cm, 149 cm, 156 cm, and 153 cm.
Mean = (142 + 150 + 149 + 156 + 153)/5
= 750/5
= 150
Mean, x̄ = 150 cm
When the data is presented in tabular form, we use the following formula:
Mean, x̄ = ∑ xi fi / ∑ fi
x 4 6 9 10 15
f 5 10 10 7 8
Solution:
xi fi xifi
4 5 20
6 10 60
9 10 90
10 7 70
15 8 120
∑ fi = 40 ∑ xi fi = 360
Mean = ∑ xi fi / ∑ fi = 360/40
= 9
Thus, Mean = 9
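Both mean calculations above can be checked with a short Python sketch. The function names mean and weighted_mean are chosen here for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def weighted_mean(xs, fs):
    """Mean of a frequency table: sum of x*f divided by sum of f."""
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)


heights = [142, 150, 149, 156, 153]
print(mean(heights))  # 150.0

x = [4, 6, 9, 10, 15]
f = [5, 10, 10, 7, 8]
print(weighted_mean(x, f))  # 9.0
```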
Class interval Frequency
0-10 2
10-20 6
20-30 9
30-40 7
40-50 4
50-60 2
Solution:
In this case, we find the class mark (also called the mid-point of a class) for each class.
Class mark (xi) fi xifi
5 2 10
15 6 90
25 9 225
35 7 245
45 4 180
55 2 110
∑ fi = 30 ∑ xi fi = 860
x̄ = ∑ xi fi / ∑ fi = 860/30
= 28.67
x̄ = 28.67
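The class-mark method can be sketched in Python as follows. The function name grouped_mean is an assumption; the classes and frequencies are those of the example above.

```python
def grouped_mean(classes, freqs):
    """Mean of grouped data, using the mid-point of each class as its value."""
    marks = [(lower + upper) / 2 for lower, upper in classes]
    return sum(m * f for m, f in zip(marks, freqs)) / sum(freqs)


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [2, 6, 9, 7, 4, 2]

result = grouped_mean(classes, freqs)
print(round(result, 2))  # 28.67
```

The result is an approximation, since every observation in a class is treated as if it were equal to the class mark.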
Median:
The median is the middle value. It is the value that splits the dataset in half, making it a
natural measure of central tendency.
To find the median, order your data from smallest to largest, and then find the data point
that has an equal number of values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has an even or odd number of
values. If a dataset has an odd number of values, the median is the middle value; if a
dataset contains an even number of values, the median is the mean of the two
middle values.
Solution:
Arranging in ascending order, we get: 23, 34, 43, 54, 56, 67, 78. Here, n (number of
observations) = 7, which is odd.
So, the median is the (7 + 1)/2 = 4th observation.
Median = 54
Example 2: Let's consider the data: 50, 67, 24, 34, 78, 43. What is the median?
Solution:
Arranging in ascending order, we get: 24, 34, 43, 50, 67, 78.
Here, n = 6, which is even, and 6/2 = 3, so the median is the mean of the 3rd and 4th observations.
Median = (43 + 50)/2
Median = 46.5
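The odd and even cases can be handled by one small Python function. This is a minimal sketch; the name median is chosen for illustration, and Python's standard statistics module offers an equivalent statistics.median.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([23, 34, 43, 54, 56, 67, 78]))  # 54
print(median([50, 67, 24, 34, 78, 43]))      # 46.5
```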
When the data is continuous and in the form of a frequency distribution, the median is
found as shown below:
Class 0-10 10-20 20-30 30-40 40-50
Frequency 2 12 22 8 6
Solution:
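Calculation table:
Class Frequency (f) Cumulative frequency (cf)
0-10 2 2
10-20 12 2 + 12 = 14
20-30 22 14 + 22 = 36
30-40 8 36 + 8 = 44
40-50 6 44 + 6 = 50
N = 50
N/2 = 50/2 = 25
The cumulative frequency first reaches 25 in the class 20-30, so this is the median class, with
l = 20, cf = 14 (cumulative frequency of the preceding class), f = 22 and h = 10.
Median = l + ((N/2 - cf)/f) × h
= 20 + ((25 - 14)/22) × 10
= 20 + (11/22) × 10
= 20 + 5 = 25
∴ Median = 25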
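The grouped-median calculation can be sketched in Python. The function name grouped_median is an assumption; it locates the first class whose cumulative frequency reaches N/2 and then interpolates inside that class, using the class limits and frequencies of the example above.

```python
def grouped_median(classes, freqs):
    """Median of grouped data: l + ((N/2 - cf) / f) * h for the median class."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cf + f >= n / 2:
            return lower + (n / 2 - cf) / f * (upper - lower)
        cf += f
    raise ValueError("frequencies must contain at least one positive value")


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [2, 12, 22, 8, 6]

print(grouped_median(classes, freqs))  # 25.0
```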
Mode:
The mode is the value that occurs the most frequently in your data set, making it a different
type of measure of central tendency than the mean or median.
To find the mode, sort the values in your dataset by numeric values or by categories. Then
identify the value that occurs most often.
For example, in the data 6, 8, 9, 3, 4, 6, 7, 6, 3, the value 6 appears most often.
Thus, mode = 6. An easy way to remember mode is: Most Often Data Entered. Note:
A data set may have no mode, 1 mode, or more than 1 mode. Depending upon the number of
modes the data has, it can be called unimodal, bimodal, trimodal, or multimodal.
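Finding the mode of ungrouped data amounts to counting how often each value occurs. The sketch below uses Counter from Python's standard collections module and returns every value tied for the highest count, so bimodal and multimodal data are handled too. The function name modes is chosen for illustration. Note one assumption: when every value occurs equally often, this sketch returns all of them, whereas some textbooks would say such data has no mode.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the value(s) that occur most often."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([6, 8, 9, 3, 4, 6, 7, 6, 3]))  # [6]
print(modes([1, 1, 2, 2, 3]))              # [1, 2]
```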
When the data is continuous, the mode can be found using the following steps:
• Step 1: Find modal class i.e. the class with maximum frequency.
• Step 2: Find mode using the following formula:
Mode = l + ((fm - f1)/(2fm - f1 - f2)) × h
Number of students 5 10 12 6 3
Solution:
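Here the modal class has lower limit l = 40, with fm = 12, f1 = 10 and f2 = 6.
h = class width = 20
Mode = 40 + ((12 - 10)/(2 × 12 - 10 - 6)) × 20
= 40 + (2/8) × 20
= 45
∴ Mode = 45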
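The grouped-mode formula can be sketched in Python. The class limits 0-20, 20-40, 40-60, 60-80 and 80-100 are an assumption inferred from the class width 20 and the lower limit 40 used in the solution; they are not listed in the example itself. The function name grouped_mode is also hypothetical.

```python
def grouped_mode(classes, freqs):
    """Mode of grouped data: l + ((fm - f1) / (2*fm - f1 - f2)) * h."""
    m = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                # frequency of the preceding class
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0   # frequency of the succeeding class
    lower, upper = classes[m]
    return lower + (fm - f1) / (2 * fm - f1 - f2) * (upper - lower)


# Assumed class limits (see the note above); frequencies are from the example.
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
students = [5, 10, 12, 6, 3]

print(grouped_mode(classes, students))  # 45.0
```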
We covered the formulas and methods to find the mean, median, and mode for a grouped
and ungrouped set of data. Let us summarize and recall them using the list of mean,
median, and mode formulas given below,
Take a quick look at the figure below with mean mode median formulas.
The values which divide an array (a set of data arranged in ascending or descending order)
into four equal parts are called Quartiles. The first, second and third quartiles are denoted
by Q1, Q2, Q3 respectively. The first and third quartiles are also called the lower and upper
quartiles respectively. The second quartile represents the median, the middle value.
In order to apply formulae, we need to arrange the above data into ascending order i.e. in
The quartiles may be determined from grouped data in the same way as the median except
that in place of n/2 we will use n/4. For calculating quartiles from grouped data we will
form cumulative frequency column. Quartiles for grouped data will be calculated from the
following formula:
Where,
l = lower class boundary of the class containing Q1 or Q3, i.e. the class corresponding to the
cumulative frequency in which n/4 or 3n/4 lies
The quartile deviation is half the difference between the third quartile and the first quartile.
It can be represented as follows:
Q.D = (1/2)(Q3 - Q1)
Quartile deviation is also known as semi-interquartile range. Here, the difference between
the third and first quartiles is called interquartile range. The interquartile range may be
taken as measure of dispersion (i.e. the extent to which the values are spread out from the
average).
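Quartiles and the quartile deviation for ungrouped data can be sketched in Python. The sketch assumes the (n + 1) position rule, in which the i-th quartile sits at position i × (n + 1)/4 in the sorted data, with linear interpolation when the position is not a whole number. Other conventions exist and can give slightly different values. The function name quantile and the choice of sample data (the seven values from the first median example) are illustrative.

```python
def quantile(values, i, parts):
    """Return the i-th of the values that split sorted data into `parts` equal parts.

    Assumed convention: 1-based position i * (n + 1) / parts, with linear
    interpolation between neighbouring observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / parts
    k = int(position)          # whole part of the position
    fraction = position - k    # fractional part, used for interpolation
    if k < 1:
        return ordered[0]
    if k >= n:
        return ordered[-1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])


data = [23, 34, 43, 54, 56, 67, 78]
q1, q2, q3 = (quantile(data, i, 4) for i in (1, 2, 3))
quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)  # 34.0 54.0 67.0 16.5
```

Calling the same function with parts=10 or parts=100 gives deciles and percentiles under the same assumed convention; Q2 agrees with the median of 54 found earlier.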
The values which divide an array into ten equal parts are called deciles. The first, second, ...,
ninth deciles are denoted by D1, D2, ..., D9 respectively. The fifth decile (D5)
corresponds to the median. The second, fourth, sixth and eighth deciles, which collectively
divide the data into five equal parts, are called quintiles.
Deciles for Ungrouped Data:
Deciles for ungrouped data will be calculated from the following formula:
Di = value of the (i × (N + 1)/10)th observation, i = 1, 2, 3, ..., 9
Deciles for grouped data will be calculated from the following formula:
Di = l + (i × N/10 - cf) × h/f, i = 1, 2, 3, ..., 9
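The grouped formula l + (i × N/parts - cf) × h/f has the same shape for quartiles (parts = 4), deciles (parts = 10) and percentiles (parts = 100), so one Python function can cover all three. This is a sketch under that assumption; the function name grouped_quantile is hypothetical, and the data is the frequency distribution from the grouped median example.

```python
def grouped_quantile(classes, freqs, i, parts):
    """i-th quantile of grouped data: l + (i*N/parts - cf) * h / f."""
    n = sum(freqs)
    target = i * n / parts
    cf = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cf + f >= target:
            return lower + (target - cf) * (upper - lower) / f
        cf += f
    raise ValueError("target position lies beyond the data")


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [2, 12, 22, 8, 6]

d5 = grouped_quantile(classes, freqs, 5, 10)    # fifth decile
q1 = grouped_quantile(classes, freqs, 1, 4)     # first quartile
p75 = grouped_quantile(classes, freqs, 75, 100) # 75th percentile
print(d5, q1, p75)
```

The fifth decile comes out equal to the median (25) computed earlier for the same table, which is a convenient consistency check.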
Percentiles:
The values which divide an array into one hundred equal parts are called percentiles.
The 50th percentile (p50) corresponds to the median. The 25th percentile
(p25) corresponds to the first quartile and the 75th percentile (p75) corresponds to the third
quartile.
Percentiles for Ungrouped Data:
Percentile from ungrouped data could be calculated from the following formula:
Percentiles can also be calculated for grouped data which is done with the help of following
formula:
Measures of dispersion are non-negative real numbers that help to gauge the spread of
data about a central value. These measures help to determine how stretched or squeezed
the given data is. There are five most commonly used measures of dispersion. These are
range, variance, standard deviation, mean deviation, and quartile deviation.
The most important use of measures of dispersion is that they help to get an understanding
of the distribution of data. As the data becomes more diverse, the value of the measure of
dispersion increases.
The measures of dispersion can be classified into two broad categories. These are absolute
measures of dispersion and relative measures of dispersion. Range, variance, standard
deviation and mean deviation fall under the category of absolute measures of dispersion.
These measures are expressed in the units of the data being scrutinized (variance in the
square of those units). Coefficients of dispersion are relative measures of dispersion. Such
dispersion measures are always dimensionless.
Range: Given a data set, the range can be defined as the difference between the maximum
value and the minimum value.
Variance: The average squared deviation from the mean of the given data set is known as
the variance. This measure of dispersion checks the spread of the data about the mean.
Standard Deviation: The square root of the variance gives the standard deviation. Thus,
the standard deviation also measures the variation of the data about the mean.
Mean Deviation: The mean deviation gives the average of the data's absolute deviation
about the central points. These central points could be the mean, median, or mode.
If the data of separate data sets have different units and need to be compared then relative
measures of dispersion are used. The measures are expressed in the form of ratios and
percentages thus, making them unitless. Some of the relative measures of dispersion are
given below:
Coefficient of Range: It is the ratio of the difference between the highest and lowest value
in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is the ratio of the standard deviation to the mean of the data
set. It is expressed in the form of a percentage.
Coefficient of Mean Deviation: This can be defined as the ratio of the mean deviation to
the value of the central point from which it is calculated.
Measures of dispersion are used when we want to find the scattering of data about a
central point such as the mean. The general formulas used to calculate the various
measures of dispersion are given in the tables below:
Absolute Measures of Dispersion: Formulas
Range: H - S, where H is the largest value and S is the smallest value in a data set.
Standard Deviation: S.D = σ = √( ∑(xi - x̄)² / N ) or σ = √( ∑xi²/N - (∑xi/N)² )
For a frequency distribution: S.D = σ = √( ∑ fi (xi - x̄)² / N )
Quartile Deviation: Q.D = (1/2)(Q3 - Q1), where Q3 and Q1 are the third and first
quartiles respectively.
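The absolute measures can be sketched in Python using the five heights from the mean example. The function names are chosen for illustration, the variance divides by N (the population form used in the formulas above), and the mean deviation is taken about the mean.

```python
import math


def value_range(values):
    """Range: largest value minus smallest value (H - S)."""
    return max(values) - min(values)


def variance(values):
    """Average squared deviation from the mean (divides by N)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


def mean_deviation(values):
    """Average absolute deviation, taken here about the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


heights = [142, 150, 149, 156, 153]
print(value_range(heights))         # 14
print(variance(heights))            # 22.0
print(standard_deviation(heights))  # square root of 22
print(mean_deviation(heights))      # about 3.6
```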
Whenever we want to compare the variability of two series which differ widely in their
averages or which are measured in different units, we do not merely calculate the measures
of dispersion but also calculate the coefficients of dispersion (C.D). The coefficients based on
different measures of dispersion are as follows:
Coefficient of Range: (H - S) / (H + S)
Coefficient of Quartile Deviation: (Q3 - Q1) / (Q3 + Q1)
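The relative measures can be sketched in Python as well. The function names are hypothetical; statistics.pstdev and statistics.mean come from Python's standard library, with pstdev dividing by N. The heights are those of the mean example, and the quartile values 34 and 67 are illustrative inputs (the lower and upper quartiles of 23, 34, 43, 54, 56, 67, 78 under the (n + 1) position rule).

```python
import statistics


def coefficient_of_range(values):
    """(H - S) / (H + S) for the largest value H and smallest value S."""
    high, low = max(values), min(values)
    return (high - low) / (high + low)


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)


heights = [142, 150, 149, 156, 153]
print(coefficient_of_range(heights))
print(coefficient_of_variation(heights))
print(coefficient_of_quartile_deviation(34, 67))
```

All three results are ratios, so they carry no units and can be compared across data sets measured on different scales.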
Both measures of dispersion and measures of central tendency are used to describe data.
The table given below outlines the difference between the measures of dispersion and
central tendency.
Measures of Dispersion Central Tendency
• Measures of dispersion are used to determine the spread of data. They are
measured about a central value.
• Measures of dispersion can be classified into two types, i.e., absolute and relative
measures of dispersion.
• Absolute measures of dispersion have the same units as the data and relative
measures are unitless.
• Range, variance, standard deviation, quartile deviation and mean deviation are
absolute measures of dispersion.
• Coefficients of dispersion are relative measures of dispersion.