2348314_BioStats_CIA1
WRITTEN BY
CATHARIN NIVITHA P
Reg. No 2348314
DEPARTMENT OF STATISTICS
CHRIST (DEEMED TO BE UNIVERSITY)
BANGALORE-560029
INDIA
OCTOBER 2024
1. Types of Variables:
A variable is a quantity that can take on different values. Examples of variables include age, sex,
height, weight, body mass index (BMI), blood group, body temperature, blood glucose level, blood
pressure, heart rate, number of teeth, severity of disease (mild, moderate, severe), etc. Variables are
classified into categorical variables and numerical variables.
Example: Some of the examples of categorical variables include gender, skin colour, nationality and
marital status of the patients admitted into the hospital.
Example: Genotype, blood type, zip code, gender, race, eye colour, political party, etc. are some of the
examples of nominal variables. Nominal variables can be encoded with numbers if needed, but the
order is arbitrary and any calculations, such as computing a mean, median, or standard deviation,
would be meaningless.
Example: Pain scale, frequency of occurrence, and severity of symptoms are some of the examples of
ordinal variables. Patients are asked to rate their pain on a scale of 1 to 10, with 10 representing the
most severe pain. It is common to use terms like “Never”, “Rarely”, “Sometimes”, “Often”, and
“Always” to denote the frequency of occurrence, making it an ordinal variable. In a medical survey,
patients could be asked to rate the severity of their symptoms as “Mild”, “Moderate”, or “Severe”.
Example: Haemoglobin level, patient heart rate, blood pressure, etc. are examples of numerical
variables.
Example: In molecular biology, many situations involve counting events: how many codons use a
certain spelling, how many reads of DNA match a reference, how many CG digrams are observed in a
DNA sequence. These counts give us discrete variables, as opposed to quantities such as mass and
intensity that are measured on continuous scales.
Example: Height of the patient, weight of the patient, blood glucose level, etc. are examples of
continuous variables.
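As a rough illustration of the variable types above, the following sketch uses the pandas library with made-up patient values; the column names and numbers are hypothetical, not data from this report.

```python
# A minimal sketch of representing nominal, ordinal, discrete and continuous
# variables in a patient table; all values below are invented for illustration.
import pandas as pd

patients = pd.DataFrame({
    "blood_group": pd.Categorical(["A", "O", "B", "AB"]),      # nominal (no order)
    "pain_scale": pd.Categorical([2, 7, 5, 9], ordered=True),  # ordinal (ordered)
    "num_teeth": [28, 32, 30, 26],                              # discrete count
    "weight_kg": [61.2, 74.5, 58.9, 80.1],                      # continuous measurement
})

print(patients.dtypes)
# Counting categories is meaningful for nominal data; a mean would not be.
print(patients["blood_group"].value_counts())
```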
2. Measurement Levels
Levels of measurement, also called scales of measurement, tell how precisely variables are recorded.
In scientific research, a variable is anything that can take on different values across your data set (e.g.,
height or test scores). There are four levels of measurement; the higher the level, the more complex the
measurement.
2.1 Nominal:
Nominal data is the least precise and complex level. The word nominal means “in
name,” so this kind of data can only be labelled. It does not have a rank order, equal spacing between
values, or a true zero value.
2.2 Ordinal:
Variables that can be measured on an ordinal scale have the following properties:
• They have a natural order. For example, “very satisfied” is better than “satisfied,” which is
better than “neutral,” etc.
• The difference between values can’t be evaluated. For example, we can’t exactly say that the
difference between “very satisfied” and “satisfied” is the same as the difference between
“satisfied” and “neutral.”
• The two measures of central tendency we can calculate for these variables are the mode and
the median. The mode tells us which category had the most counts and the median tells us
the “middle” value.
• Ordinal scale data is often collected through surveys by companies looking for
feedback about their product or service. For example, a grocery store might survey 100 recent
customers and ask them about their overall experience.
2.3 Interval:
It is a scale used to label variables that have a natural order and a quantifiable difference
between values, but no “true zero” value. Temperature in degrees Celsius is a common example: the
difference between 20 °C and 30 °C equals the difference between 30 °C and 40 °C, but 0 °C does not
mean the absence of temperature.
The nice thing about interval scale data is that it can be analysed in more ways than nominal or ordinal
data.
2.4 Ratio:
This scale is used to label variables that have a natural order, a quantifiable difference between values,
and a “true zero” value. Weight and height are examples: 0 kg means no weight at all, so a patient
weighing 80 kg is genuinely twice as heavy as one weighing 40 kg.
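The following is a small sketch, with illustrative values only, of which summary statistics suit each level: the mode for nominal data, the median for ordinal data, and the mean (and ratios) for interval and ratio data.

```python
# Hypothetical survey/clinical values; each list stands for one measurement level.
import statistics

eye_colour = ["brown", "blue", "brown", "green"]   # nominal
satisfaction = [1, 3, 2, 3, 2]                     # ordinal codes (1 = low, 3 = high)
temperature_c = [36.8, 37.2, 38.5, 36.9]           # interval (no true zero)
weight_kg = [61.2, 74.5, 58.9, 80.1]               # ratio (true zero)

print(statistics.mode(eye_colour))       # most frequent category
print(statistics.median(satisfaction))   # middle rank
print(statistics.mean(temperature_c))    # differences are meaningful
print(max(weight_kg) / min(weight_kg))   # ratios only make sense on a ratio scale
```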
3. Visualizations:
Data visualization is the graphical representation of information and data. By using visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data. Data visualization translates complex data sets into visual
formats that are easier for the human brain to comprehend. The primary goal of data visualization is
to make data more accessible and easier to interpret, allowing users to identify patterns, trends, and
outliers quickly. This is particularly important in the context of big data, where the sheer volume of
information can be overwhelming without effective visualization techniques.
The main types of data visualization are:
1. Histogram
2. Bar Diagram or Bar Graph
3. Frequency Polygon
4. Cumulative Frequency Curve or Ogive
Histogram:
A histogram is a chart that plots the distribution of a numeric variable's values as a series of bars. Each
bar typically covers a range of numeric values called a bin or class; a bar's height indicates the
frequency of data points with a value within the corresponding bin.
Figure: Histogram
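A minimal sketch of how such a histogram could be produced with matplotlib; the haemoglobin values and the number of bins are assumptions made only for illustration.

```python
# Plot the distribution of a numeric variable as bars over bins (classes).
import matplotlib.pyplot as plt

haemoglobin = [11.2, 12.5, 13.1, 12.8, 14.0, 13.6, 12.2, 11.8, 13.3, 12.9]

plt.hist(haemoglobin, bins=5, edgecolor="black")  # 5 bins; bar height = frequency
plt.xlabel("Haemoglobin (g/dL)")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()
```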
Bar Graph:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with
heights or lengths proportional to the values that they represent. The bars can be plotted vertically or
horizontally. A vertical bar chart is sometimes called a column chart. A bar graph shows comparisons
among discrete categories. One axis of the chart shows the specific categories being compared, and
the other axis represents a measured value. Some bar graphs present bars clustered in groups of more
than one, showing the values of more than one measured variable.
Figure: Bar Graph (Categories 1-4)
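A similar sketch for a vertical bar chart (column chart); the blood-group counts are made-up illustrative values.

```python
# One axis holds the categories, the other the measured value (counts here).
import matplotlib.pyplot as plt

blood_groups = ["A", "B", "AB", "O"]
counts = [22, 18, 7, 33]

plt.bar(blood_groups, counts)
plt.xlabel("Blood group")
plt.ylabel("Number of patients")
plt.title("Bar Graph")
plt.show()
```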
Frequency Polygon:
A frequency polygon is closely related to a histogram and is used to compare sets of data or to
display a cumulative frequency distribution. It uses a line graph to represent quantitative data.
Statistics deals with the collection of data and information for a particular purpose; for example, the
tabulation of each run for each ball in cricket gives the statistics of the game. Tables, graphs, pie charts,
bar graphs, histograms, polygons, etc. are used to represent statistical data pictorially. Frequency
polygons are a visually substantial method of representing quantitative data and its frequencies. To
draw one, the midpoint of each class is plotted against the class frequency and the points are joined
by straight line segments.
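A minimal sketch of drawing a frequency polygon as described above; the class midpoints and frequencies are assumed for illustration.

```python
# Class midpoints on the x-axis, frequencies on the y-axis, joined by line segments.
import matplotlib.pyplot as plt

midpoints = [5, 15, 25, 35, 45]     # midpoints of classes 0-10, 10-20, ..., 40-50
frequencies = [3, 8, 14, 9, 4]

plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Class midpoint")
plt.ylabel("Frequency")
plt.title("Frequency Polygon")
plt.show()
```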
Cumulative Frequency Curve:
A curve that represents the cumulative frequency distribution of grouped data on a graph is called a
Cumulative Frequency Curve or an Ogive. Plotting cumulative frequencies on a graph is an effective
way to understand the data and derive results.
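A minimal sketch of an ogive of the less-than type; the class boundaries and frequencies are assumed for illustration.

```python
# Upper class boundaries against cumulative frequencies.
import itertools
import matplotlib.pyplot as plt

upper_boundaries = [10, 20, 30, 40, 50]
frequencies = [3, 8, 14, 9, 4]
cumulative = list(itertools.accumulate(frequencies))   # 3, 11, 25, 34, 38

plt.plot(upper_boundaries, cumulative, marker="o")
plt.xlabel("Upper class boundary")
plt.ylabel("Cumulative frequency")
plt.title("Ogive (less-than type)")
plt.show()
```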
4. Descriptive Analysis:
Descriptive statistics describe and summarise the features of a specific data set by giving short
summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of centre: the mean, median, and mode, which are used at almost all levels of
mathematics and statistics to define and describe a data set. The mean, or average,
is calculated by adding all the figures within the data set and then dividing by the number of figures
within the set.
4.1 Central Tendency:
Central tendency is a statistical measure that represents an entire distribution or dataset by a single
value. It aims to provide an accurate description of the entire data in the
distribution. The central tendency of a dataset can be found using three important measures,
namely the mean, median and mode.
MEAN:
The mean represents the average value of the dataset. It is calculated as the sum of all the values
in the dataset divided by the number of values. In general, it is taken to be the arithmetic mean.
Some other measures of mean used to find the central tendency are as follows:
• Geometric Mean
• Harmonic Mean
• Weighted Mean
If all the values in the dataset are the same, then the arithmetic, geometric and harmonic means are
all equal; if there is variability in the data, the three means differ. The arithmetic mean is
straightforward to calculate. The formula is:
$$\text{Mean} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
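A short sketch of the mean formula above, together with the geometric and harmonic means mentioned earlier, using Python's statistics module; the blood-glucose values are illustrative.

```python
import statistics

glucose = [92, 105, 88, 110, 97]

print(statistics.mean(glucose))            # (x1 + ... + xn) / n
print(statistics.geometric_mean(glucose))  # nth root of the product of the values
print(statistics.harmonic_mean(glucose))   # n divided by the sum of reciprocals
```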
MEDIAN:
Median is the middle value of the dataset in which the dataset is arranged in the ascending order or
in descending order. When the dataset contains an even number of values, then the median value of
the dataset can be found by taking the mean of the middle two values.
$$\text{Median} =
\begin{cases}
X\!\left[\dfrac{n+1}{2}\right] & \text{if } n \text{ is odd} \\[8pt]
\dfrac{X\!\left[\dfrac{n}{2}\right] + X\!\left[\dfrac{n}{2}+1\right]}{2} & \text{if } n \text{ is even}
\end{cases}$$
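A direct sketch of the median rule above, using 1-based positions on the sorted data; the helper function and example values are for illustration only.

```python
def median(values):
    """Median of a list using the odd/even rule stated above (1-based positions)."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]              # X[(n+1)/2]
    return (x[n // 2 - 1] + x[n // 2]) / 2      # (X[n/2] + X[n/2 + 1]) / 2

print(median([7, 3, 9, 1, 5]))       # odd n  -> 5
print(median([7, 3, 9, 1, 5, 11]))   # even n -> (5 + 7) / 2 = 6.0
```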
MODE:
The mode represents the most frequently occurring value in the dataset. Sometimes the dataset may
contain multiple modes and in some cases, it does not contain any mode at all. For grouped data, the
mode is estimated from the modal class as:
$$\text{Mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$$
where $L$ is the lower boundary of the modal class, $h$ is the class width, $f_1$ is the frequency of the
modal class, and $f_0$ and $f_2$ are the frequencies of the classes preceding and following it.
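A small sketch of the grouped-data mode formula above; the frequency table used here is assumed for illustration.

```python
def grouped_mode(L, h, f1, f0, f2):
    """Mode of grouped data: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    return L + ((f1 - f0) / (2 * f1 - f0 - f2)) * h

# Classes 0-10, 10-20, 20-30, 30-40 with frequencies 3, 8, 14, 9:
# the modal class is 20-30, so L = 20, h = 10, f1 = 14, f0 = 8, f2 = 9.
print(grouped_mode(L=20, h=10, f1=14, f0=8, f2=9))   # approximately 25.45
```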
4.2 Dispersion:
Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which
numerical data is likely to vary about an average value. In other words, dispersion helps to understand
the distribution of the data. The common measures of dispersion are listed below, with a short
computational sketch after the list.
1. Range: It is simply the difference between the maximum value and the minimum value given
in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
2. Variance: Subtract the mean from each value in the set, square each of these deviations, add
the squares, and finally divide by the total number of values in the data set to get the variance:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
$\text{S.D.} = \sqrt{\sigma^2}$.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is
known as the mean deviation (also called mean absolute deviation).
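A short sketch of the dispersion measures listed above; the data are illustrative, and the variance shown is the population variance (dividing by N), matching the formula above.

```python
import statistics

data = [1, 3, 5, 6, 7]
n = len(data)
mean = statistics.mean(data)

value_range = max(data) - min(data)                    # range = 7 - 1 = 6
variance = sum((x - mean) ** 2 for x in data) / n      # population variance
std_dev = variance ** 0.5                              # square root of the variance
q1, q2, q3 = statistics.quantiles(data, n=4)           # quartiles
quartile_deviation = (q3 - q1) / 2                     # half the interquartile distance
mean_deviation = sum(abs(x - mean) for x in data) / n  # mean absolute deviation

print(value_range, variance, std_dev, quartile_deviation, mean_deviation)
```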
4.3 Correlation:
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It’s a common tool for describing
simple relationships without making a statement about cause and effect. Correlation is measured
with a unit-free measure called the correlation coefficient, which ranges from -1 to +1 and is
denoted by r. Statistical significance is indicated with a p-value.
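A minimal sketch of computing r and its p-value, assuming the scipy library is available; the height and weight pairs are illustrative.

```python
from scipy import stats

height_cm = [150, 160, 165, 170, 175, 180]
weight_kg = [52, 58, 63, 68, 72, 79]

r, p_value = stats.pearsonr(height_cm, weight_kg)      # correlation coefficient and p-value
print(f"r = {r:.3f}, p = {p_value:.4f}")               # r near +1 => strong positive linear relation
```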
4.4 Regression:
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modelling the future
relationship between them. Regression analysis includes several variations, such as linear,
multiple linear, and nonlinear. The most common models are simple linear and multiple
linear. Nonlinear regression analysis is commonly used for more complicated data sets in
which the dependent and independent variables show a nonlinear relationship.
The simplest form is linear regression, where one independent variable (x) is used to predict
a dependent variable (y). The equation is typically:
$$y = B_0 + B_1 X + \epsilon$$
• $B_0$ is the intercept
• $B_1$ is the slope
• $\epsilon$ is the error term
Other types of regression include multiple linear regression, logistic regression, and
polynomial regression.
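A minimal sketch of fitting the simple linear regression above by least squares, assuming the scipy library; the x and y values are illustrative.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]                  # independent variable
y = [2.1, 4.3, 6.2, 8.1, 9.8, 12.2]     # dependent variable

result = stats.linregress(x, y)          # least-squares fit of y = B0 + B1*x
print(f"intercept B0 = {result.intercept:.3f}")
print(f"slope     B1 = {result.slope:.3f}")
print(f"R-squared    = {result.rvalue**2:.3f}")

# Predicted value at x = 7 under the fitted line:
print(result.intercept + result.slope * 7)
```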