2348314_BioStats_CIA1

Uploaded by

nivithacatharin

BIO-STATISTICS

DEPARTMENT OF STATISTICS

WRITTEN BY

CATHARIN NIVITHA P

Reg. No 2348314

DEPARTMENT OF STATISTICS
CHRIST (DEEMED TO BE UNIVERSITY)
BANGALORE-560029
INDIA

OCTOBER 2024
1. Types of Variables:
A variable is a quantity that can take on different values. Examples of variables include age, sex,
height, weight, body mass index (BMI), blood group, body temperature, blood glucose level, blood
pressure, heart rate, number of teeth, severity of disease (mild, moderate, severe) etc. Variables are
classified into categorical variables and numerical variables.

Figure: Classification of variables

Variables
    Categorical: Nominal, Ordinal
    Numerical: Discrete, Continuous

1.1 Categorical Variables:


Categorical variable (also called qualitative variable) refers to a characteristic that is not quantifiable.
Categorical measurements are expressed in terms of natural language descriptions, but not in terms
of numbers. Sometimes categorical data can take numerical values, but those numbers do not have
mathematical meaning. Categorical variables are further classified into nominal and ordinal.

Example: Some of the examples of categorical variables include gender, skin colour, nationality and
marital status of the patients admitted into the hospital.

1.1.1 Nominal Variables:


A nominal variable is a type of categorical variable that can have two or more categories. However,
there is no ordering within these categories. A nominal variable does not have any numerical
characteristics and is qualitative in nature.

Example: Genotype, blood type, zip code, gender, race, eye colour and political party are some
examples of nominal variables. Nominal variables can be encoded with numbers if needed, but the
order is arbitrary and any calculations, such as computing a mean, median, or standard deviation,
would be meaningless.

1.1.2 Ordinal Variables:


An ordinal variable is a type of categorical variable that has a clear ordering or ranking of the
categories. This means that while the categories have a meaningful sequence, the intervals between
the categories are not necessarily equal or known. Ordinal variables are commonly found in surveys
and questionnaires, where responses can be ranked but the exact differences between them are not
quantifiable.

Example: Pain scale, frequency of occurrence and severity of symptoms are some examples of
ordinal variables. Patients may be asked to rate their pain on a scale of 1 to 10, with 10 representing
the most severe pain. It is common to use terms like “Never”, “Rarely”, “Sometimes”, “Often”, and
“Always” to denote the frequency of occurrence, making it an ordinal variable. In a medical survey,
patients could be asked to rate the severity of their symptoms as “Mild”, “Moderate”, or “Severe”.

1.2 Numerical Variables:


A numeric variable (also called quantitative variable) is a quantifiable characteristic whose values are
numbers (except numbers that are codes standing in for categories). Numeric variables may be
either continuous or discrete.

Example: Haemoglobin level, patient heart rate and blood pressure are examples of numerical
variables.

1.2.1 Discrete Variables:


A discrete variable is a type of numerical variable that can assume only a finite number of real values
within a given interval.

Example: In molecular biology, many situations involve counting events: how many codons use a
certain spelling, how many reads of DNA match a reference, how many CG digrams are observed in a
DNA sequence. These counts give us discrete variables, as opposed to quantities such as mass and
intensity that are measured on continuous scales.

1.2.2 Continuous Variables:


Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous data have an infinite
number of potential values between any two points.

Example: Height of the patient, weight of the patient and blood glucose level are continuous
variables.

2. Measurement Levels
Levels of measurement, also called scales of measurement, tell how precisely variables are recorded.
In scientific research, a variable is anything that can take on different values across your data set (e.g.,
height or test scores). There are 4 levels of measurement. The higher the level, the more complex the
measurement. Nominal data is the least precise and complex level. The word nominal means “in
name,” so this kind of data can only be labelled. It does not have a rank order, equal spacing between
values, or a true zero value.

Property                                 Nominal   Ordinal   Interval   Ratio

Named                                    Yes       Yes       Yes        Yes
Natural order                            No        Yes       Yes        Yes
Equal distance between values            No        No        Yes        Yes
Has a 'true zero' value, thus ratio      No        No        No         Yes
between values can be calculated
2.1 Nominal:
The nominal scale is used to label variables that have no quantitative value. Variables that can be
measured on a nominal scale have the following properties:
• They have no natural order. For example, we can’t arrange eye colours in order of
worst to best or lowest to highest.
• Categories are mutually exclusive. For example, an individual can’t have both blue
and brown eyes. Similarly, an individual can’t live both in the city and in a rural area.
• The only numbers we can calculate for these variables are counts. For example, we
can count how many individuals have blonde hair, how many have black hair, how many
have brown hair, etc.
• The only measure of central tendency we can calculate for these variables is the
mode. The mode tells us which category had the most counts. For example, we could find
which eye color occurred most frequently.
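
The counting and mode properties listed above can be sketched in a few lines of Python. The eye-colour observations below are made-up illustrative data, not taken from any real study.

```python
from collections import Counter

# Hypothetical eye-colour observations (illustrative data only).
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]

# Counting is the only arithmetic that makes sense on a nominal scale.
counts = Counter(eye_colours)

# The mode is the category with the highest count.
mode_colour, mode_count = counts.most_common(1)[0]

print(dict(counts))
print("Mode:", mode_colour, "with", mode_count, "observations")
```

Note that no mean or median is computed here: the categories have no order, so neither quantity would be meaningful.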

2.2 Ordinal:
Variables that can be measured on an ordinal scale have the following properties:

• They have a natural order. For example, “very satisfied” is better than “satisfied,” which is
better than “neutral,” etc.
• The difference between values can’t be evaluated. For example, we can’t exactly say that the
difference between “very satisfied” and “satisfied” is the same as the difference between
“satisfied” and “neutral.”
• The two measures of central tendency we can calculate for these variables are the mode and
the median. The mode tells us which category had the most counts and the median tells us
the “middle” value.
• Ordinal scale data is often collected through surveys by companies that are looking for
feedback about their product or service. For example, a grocery store might survey 100 recent
customers and ask them about their overall experience.
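
As a rough illustration of the mode and median on an ordinal scale, the sketch below ranks a set of survey answers and picks the middle one. The five-point satisfaction scale and the seven responses are hypothetical.

```python
from collections import Counter

# Ordered categories, lowest to highest (hypothetical survey scale).
levels = ["very dissatisfied", "dissatisfied", "neutral",
          "satisfied", "very satisfied"]
rank = {label: position for position, label in enumerate(levels)}

# Hypothetical responses from seven customers.
responses = ["satisfied", "neutral", "very satisfied", "satisfied",
             "dissatisfied", "satisfied", "neutral"]

# Sort by rank and take the middle response.
# With an odd number of responses there is exactly one middle value.
ordered = sorted(responses, key=lambda label: rank[label])
median_response = ordered[len(ordered) // 2]

# The mode is simply the most frequent category.
mode_response = Counter(responses).most_common(1)[0][0]

print("Median:", median_response)
print("Mode:", mode_response)
```

The ranks are used only for sorting. They are never added or averaged, because the distances between categories are not known.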

2.3 Interval:
It is a scale used to label variables that have a natural order and a quantifiable difference
between values, but no “true zero” value.

• These variables have a natural order.


• We can measure the mean, median, mode, and standard deviation of these variables.
• These variables have an exact difference between values.
• These variables have no “true zero” value.

The nice thing about interval scale data is that it can be analysed in more ways than nominal or ordinal
data.
2.4 Ratio:
This scale is used to label variables that have a natural order, a quantifiable difference between values,
and a “true zero” value.

• These variables have a natural order.


• We can calculate the mean, median, mode, standard deviation, and a variety of other
descriptive statistics for these variables.
• These variables have an exact difference between values.
• These variables have a “true zero” value.
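
The practical difference between an interval and a ratio scale can be seen with temperature, used here as an assumed example: degrees Celsius form an interval scale (0 °C is not "no temperature"), while kelvin form a ratio scale with a true zero. The sketch below shows that differences agree on both scales but ratios do not.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return celsius + 273.15

# Two hypothetical temperature readings in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on the ratio scale (kelvin).
ratio_c = t2_c / t1_c  # numerically 2.0, but 20 °C is not "twice as hot"
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print("Difference (C):", diff_c, " Difference (K):", round(diff_k, 2))
print("Ratio (C):", ratio_c, " Ratio (K):", round(ratio_k, 4))
```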

3. Visualizations:
Data visualization is the graphical representation of information and data. By using visual elements
like charts, graphs, and maps, data visualization tools translate complex data sets into visual
formats that are easier for the human brain to comprehend. The primary goal of data visualization is
to make data more accessible and easier to interpret, allowing users to identify patterns, trends, and
outliers quickly. This is particularly important in the context of big data, where the sheer volume of
information can be overwhelming without effective visualization techniques.

Figure: Data Visualization is divided into Grouping and Tabulation, and Graphical Visualization.

3.1 Grouping and Tabulation:


It is cumbersome to study or interpret large data without grouping it, even if it is arranged
sequentially. For this, the data are usually organised into groups called classes and presented in a table
which gives the frequency in each group. Such a frequency table gives a better overall view of the
distribution of data and enables a person to rapidly comprehend important characteristics of the data.
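
A minimal sketch of grouping raw values into classes and building a frequency table is given below. The patient ages, the class width of 10 and the starting point of 20 are all assumptions chosen for illustration.

```python
# Hypothetical ages of 12 patients (illustrative data only).
ages = [23, 35, 41, 29, 52, 47, 38, 31, 26, 44, 58, 33]

class_width = 10   # assumed width of each class
lower_start = 20   # assumed lower limit of the first class

# Assign each age to its class and count the members of each class.
freq = {}
for age in ages:
    lower = lower_start + ((age - lower_start) // class_width) * class_width
    label = f"{lower}-{lower + class_width - 1}"
    freq[label] = freq.get(label, 0) + 1

# Print the frequency table in class order.
print("Class   Frequency")
for label in sorted(freq):
    print(f"{label:<7} {freq[label]}")
```

Twelve individual values are reduced to four classes, which makes the shape of the distribution much easier to see at a glance.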

3.2 Graphical Visualization:


Data that have been presented in tabular form may also be displayed in pictorial form by using a
graph. A well-constructed graphical presentation is often the easiest way to depict a given set of
data. Here are a few of the standard graphic forms of representing the data:

1. Histogram
2. Bar Diagram or Bar Graph
3. Frequency Polygon
4. Cumulative Frequency Curve or Ogive
Histogram:
A histogram is a chart that plots the distribution of a numeric variable's values as a series of bars. Each
bar typically covers a range of numeric values called a bin or class; a bar's height indicates the
frequency of data points with a value within the corresponding bin.

Figure: Histogram

Bar Graph:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with
heights or lengths proportional to the values that they represent. The bars can be plotted vertically or
horizontally. A vertical bar chart is sometimes called a column chart. A bar graph shows comparisons
among discrete categories. One axis of the chart shows the specific categories being compared, and
the other axis represents a measured value. Some bar graphs present bars clustered in groups of more
than one, showing the values of more than one measured variable.

Figure: Bar Graph (horizontal bars for Category 1 to Category 4, value axis from 0 to 6, with
Series 1, Series 2 and Series 3 shown for each category)

Frequency Polygon:
A frequency polygon is closely related to a histogram and can be used to compare sets of data or to
display a frequency distribution. It uses a line graph to represent quantitative data.
Statistics deals with the collection of data and information for a particular purpose. For example,
the tabulation of the runs scored off each ball in cricket gives the statistics of the game. Tables,
graphs, pie-charts, bar graphs, histograms, polygons etc. are used to represent statistical data
pictorially. Frequency polygons are a visually effective method of representing quantitative data and
its frequencies. A frequency polygon is drawn by plotting the frequency of each class against the
midpoint of that class and joining the points with straight lines.
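
As a small sketch of how the points of a frequency polygon can be obtained, the code below computes class midpoints and pairs them with class frequencies. The class limits and frequencies are hypothetical, and closing the polygon with a zero-frequency class at each end is a common convention assumed here rather than something stated above.

```python
# Hypothetical grouped data: (lower limit, upper limit) of each class.
class_limits = [(20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 4, 3, 2]
width = 10  # common class width

# Each class is represented by its midpoint on the horizontal axis.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Points of the polygon: (midpoint, frequency).
points = list(zip(midpoints, frequencies))

# Close the polygon on the axis with an empty class on each side.
points = [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon.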
Cumulative Frequency Curve:
A curve that represents the cumulative frequency distribution of grouped data on a graph is called a
Cumulative Frequency Curve or an Ogive. Representing cumulative frequency data on a graph makes
it easy to read off how many observations lie below a given value, and hence to estimate quantities
such as the median and quartiles.
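
The cumulative frequencies plotted in a "less than" ogive are running totals of the class frequencies. The sketch below computes them for hypothetical grouped data; the class boundaries and frequencies are assumptions for illustration.

```python
from itertools import accumulate

# Hypothetical grouped data: upper boundary and frequency of each class.
upper_bounds = [30, 40, 50, 60]
frequencies = [3, 4, 3, 2]

# Running total of the frequencies gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

# Points of a "less than" ogive: (upper boundary, cumulative frequency).
ogive_points = list(zip(upper_bounds, cumulative))

for bound, total in ogive_points:
    print(f"less than {bound}: {total}")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.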

4. Descriptive Analysis:
Descriptive statistics help describe and explain the features of a specific data set by giving short
summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of centre. For example, the mean, median, and mode, which are used at almost
all levels of math and statistics, are used to define and describe a data set. The mean, or the average,
is calculated by adding all the figures within the data set and then dividing by the number of figures
within the set.

4.1 Central Tendency:


Central tendency is a statistical measure that represents an entire distribution or dataset by a
single value. It aims to provide an accurate description of the entire data in the
distribution. The central tendency of the dataset can be found using three important measures,
namely mean, median and mode.

MEAN:

The mean represents the average value of the dataset. It can be calculated as the sum of all the values
in the dataset divided by the number of values. In general, it is considered as the arithmetic mean.
Some other measures of mean used to find the central tendency are as follows:

• Geometric Mean
• Harmonic Mean
• Weighted Mean

It is observed that if all the values in the dataset are the same, then the geometric, arithmetic and
harmonic means are all the same. If there is variability in the data, then the three means differ.
The formula to calculate the (arithmetic) mean is given by:

$$\text{Mean} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
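
The arithmetic mean and the other means listed above can be compared with Python's standard `statistics` module. The values and weights below are hypothetical; the sketch is only meant to show that the means agree when all values are equal and differ otherwise.

```python
import statistics

# Hypothetical positive measurements (illustrative data only).
values = [2.0, 4.0, 8.0]

arithmetic = sum(values) / len(values)           # (x1 + ... + xn) / n
geometric = statistics.geometric_mean(values)    # n-th root of the product
harmonic = statistics.harmonic_mean(values)      # n / sum of reciprocals

# Weighted mean with assumed weights for the three values.
weights = [1.0, 2.0, 1.0]
weighted = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(round(arithmetic, 4), round(geometric, 4), round(harmonic, 4), weighted)

# When every value is the same, all three means coincide.
equal = [5.0, 5.0, 5.0]
equal_means = (sum(equal) / len(equal),
               statistics.geometric_mean(equal),
               statistics.harmonic_mean(equal))
print(equal_means)
```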

MEDIAN:

The median is the middle value of the dataset when the values are arranged in ascending or
descending order. When the dataset contains an even number of values, the median
is found by taking the mean of the middle two values.

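$$
\text{Median} =
\begin{cases}
X_{\left[\frac{n+1}{2}\right]} & \text{if } n \text{ is odd} \\[2ex]
\dfrac{X_{\left[\frac{n}{2}\right]} + X_{\left[\frac{n}{2}+1\right]}}{2} & \text{if } n \text{ is even}
\end{cases}
$$

Here $X_{[k]}$ denotes the $k$-th value of the ordered dataset.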
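
The rule for the median can be written as a short function: sort the values, take the single middle value when n is odd, and take the mean of the two middle values when n is even. The function name and the sample data are illustrative assumptions.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: mean of the values at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5, 3, 6]))  # odd number of values
print(median([1, 3, 5, 6]))     # even number of values
```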

MODE:

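The mode is the most frequently occurring value in the dataset. A dataset may
contain more than one mode, and in some cases it does not contain any mode at all. For grouped
data, the mode is estimated from the modal class (the class with the highest frequency) as:

$$\text{Mode} = L + \left\{ \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right\} \times h$$

where $L$ is the lower boundary of the modal class, $f_1$ is the frequency of the modal class, $f_0$
and $f_2$ are the frequencies of the classes immediately before and after it, and $h$ is the class width.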
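
A worked sketch of the grouped-data mode formula follows, assuming the usual reading of its symbols: L is the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies of the classes just before and after it, and h the class width. The function name and the grouped data are hypothetical.

```python
def grouped_mode(lower_bounds, frequencies, width):
    """Estimate the mode of grouped data with equal class widths."""
    # The modal class is the class with the highest frequency.
    i = max(range(len(frequencies)), key=lambda k: frequencies[k])
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                     # class before
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # class after
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f1 - f0 - f2 is zero")
    return lower_bounds[i] + (f1 - f0) / denominator * width

# Hypothetical classes 20-30, 30-40, 40-50, 50-60 with these frequencies.
lower_bounds = [20, 30, 40, 50]
frequencies = [3, 7, 4, 2]

mode_estimate = grouped_mode(lower_bounds, frequencies, width=10)
print(round(mode_estimate, 2))
```

Here the modal class is 30-40, so the estimate is 30 + (7 - 3) / (14 - 3 - 4) × 10, which lies inside that class as expected.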

4.2 Dispersion:
Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which
numerical data is likely to vary about an average value. In other words, dispersion helps to understand
the distribution of the data.

1. Range: It is simply the difference between the maximum value and the minimum value given
in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6

2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares and finally divide the sum by the total number of values in the data set to get the variance.
Variance: $\sigma^2 = \sum (X - \mu)^2 / N$

3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
S.D. $= \sigma = \sqrt{\sigma^2}$.

4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.

5. Mean and Mean Deviation: The average of numbers is known as the mean and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is
known as the mean deviation (also called mean absolute deviation).
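
The measures of dispersion listed above can be computed for the small data set 1, 3, 5, 6, 7 used in the range example. This is a sketch under two assumptions: the variance divides by N (population form), and the quartiles come from `statistics.quantiles` with its default method, which is only one of several quartile conventions in use.

```python
import math
import statistics

data = [1, 3, 5, 6, 7]
n = len(data)

# 1. Range: maximum minus minimum.
data_range = max(data) - min(data)

# 2. Variance: mean of the squared deviations from the mean (divide by N).
mu = sum(data) / n
variance = sum((x - mu) ** 2 for x in data) / n

# 3. Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# 4. Quartile deviation: half the distance between Q3 and Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
quartile_deviation = (q3 - q1) / 2

# 5. Mean deviation: mean of the absolute deviations from the mean.
mean_deviation = sum(abs(x - mu) for x in data) / n

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 3))
print("Quartile deviation:", quartile_deviation)
print("Mean deviation:", round(mean_deviation, 2))
```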

4.3 Correlation:
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It’s a common tool for describing
simple relationships without making a statement about cause and effect. Correlation is measured
with a unit-free measure called the correlation coefficient, which ranges from -1 to +1 and is
denoted by r. Statistical significance is indicated with a p-value.

• The closer r is to zero, the weaker the linear relationship.


• Positive r values indicate a positive correlation, where the values of both variables tend to
increase together.
• Negative r values indicate a negative correlation, where the values of one variable tend to
increase when the values of the other variable decrease.
• A small p-value gives us evidence that the population correlation coefficient is different
from zero, based on what we observe from the sample.
• Unit-free measure means that correlations exist on their own scale: for example, if the two
variables are elevation and temperature, the number given for r is not on the same scale as
either elevation or temperature. This is different from other summary statistics. For instance,
the mean of the elevation measurements is on the same scale as its variable.
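
A minimal sketch of the Pearson correlation coefficient r is given below, computed directly from its definition. The elevation and temperature readings are made-up values, and the function name is an illustrative choice. The p-value is not computed here, since that needs a statistical distribution beyond this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings: elevation in metres, temperature in degrees Celsius.
elevation = [100, 500, 1000, 1500, 2000]
temperature_c = [25.0, 22.5, 19.0, 16.0, 13.0]

r = pearson_r(elevation, temperature_c)

# r is unit-free: converting to Fahrenheit leaves it unchanged.
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]
r_fahrenheit = pearson_r(elevation, temperature_f)

print(round(r, 4), round(r_fahrenheit, 4))
```

The value is close to -1 because temperature falls almost linearly as elevation rises in this made-up data, and changing the temperature unit does not alter r.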

4.4 Regression:
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modelling the future
relationship between them. Regression analysis includes several variations, such as linear,
multiple linear, and nonlinear. The most common models are simple linear and multiple
linear. Nonlinear regression analysis is commonly used for more complicated data sets in
which the dependent and independent variables show a nonlinear relationship.

The simplest form is linear regression, where one independent variable (x) is used to predict
a dependent variable (y). The equation is typically:
$$y = B_0 + B_1 x + \epsilon$$
• $B_0$ is the intercept
• $B_1$ is the slope
• $\epsilon$ is the error term
Other types of regression include multiple linear regression, logistic regression, and
polynomial regression.
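
For simple linear regression, the intercept and slope are commonly estimated by ordinary least squares. The sketch below assumes that method (the text above does not name an estimation method) and uses made-up data; `b0` and `b1` stand for the estimates of B0 and B1.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical observations (illustrative data only).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

b0, b1 = fit_simple_linear(x, y)

# Residuals are the estimated error terms: observed minus fitted values.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("Intercept:", round(b0, 2), "Slope:", round(b1, 2))
print("Prediction at x = 6:", round(b0 + b1 * 6, 2))
```

With an intercept in the model, the least squares residuals sum to zero up to rounding, which is a useful sanity check on the fit.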
