
Unit 4

Processing and Analysis of data


Contents
• Meaning, importance and steps involved in processing of data
• Statistical tools and techniques for analysis of data
• Basic data analysis – Frequency distribution
• Analysis and interpretation of data – Interpretation of results
• Diagrammatic and graphic representation
• Concept of univariate, bivariate and multivariate analysis
Meaning, importance and steps involved in processing of data
Data processing: Data processing in research is the collection and translation of a data set into valuable, usable
information. Through this process, a researcher, data engineer or data scientist takes raw data and converts it
into a more readable format, such as a graph, report or chart, either manually or through an automated tool.
Collection: The first and most important step of data processing. It is important to use verified and trustworthy
sources for gathering data.
Preparation: Once the data is collected, it then enters the data preparation stage. Data preparation, often referred
to as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following stage of data
processing. During preparation, raw data is diligently checked for any errors or missing entries. The purpose of
this step is to eliminate bad data and to begin building high-quality data for the best business intelligence. Coding
of data is also done at this stage.
Data input: In this step, the raw data is converted into machine readable form and fed into the processing unit.
This can be in the form of data entry through a keyboard, scanner or any other input source.
Information output: The output/interpretation stage is the stage at which data is finally usable to non-data
scientists. Decoding of data is done at this stage. The data is translated, readable, and often presented in the form
of graphs, plain text, etc. Members of the company or institution can now begin to self-serve the data for their
own data analytics projects.
Data storage: After all of the data is processed, it is then stored for future use. While some information may be
put to use immediately, much of it will serve a purpose later on. When data is properly stored, it can be quickly
and easily accessed by members of the organization when needed.
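As a rough illustration of these steps, the sketch below uses Python with pandas. The file names and column names (survey_raw.csv, age, gender, income) are assumptions made only for this example, not part of the unit.

```python
# Minimal data-processing sketch (input, preparation, output, storage).
# File and column names are hypothetical.
import pandas as pd

# Input: read the raw, collected data into a machine-readable form.
raw = pd.read_csv("survey_raw.csv")

# Preparation: remove duplicates and rows with missing entries, then code a
# categorical variable numerically (the "coding of data" mentioned above).
clean = raw.drop_duplicates().dropna(subset=["age", "gender", "income"])
clean["gender_code"] = clean["gender"].map({"Male": 1, "Female": 2})

# Output: a readable summary that non-specialists can interpret.
print(clean.describe())

# Storage: save the processed data for later use.
clean.to_csv("survey_clean.csv", index=False)
```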
Analysis of different types of data
• Univariate Analysis of data: Univariate analysis is a statistical technique that involves the analysis
of one variable at a time. It is used to describe the data and to find patterns that exist within the data
set. It is also used to identify outliers and to test for normality. Because only one variable is examined at a
time, univariate analysis does not describe relationships between variables; its purpose is to summarize the
distribution and characteristics of that single variable.
• Descriptive analysis of univariate data is a method of summarizing and describing the data using
a variety of techniques such as frequency distributions, measures of central tendency (mean,
median, and mode), measures of dispersion (range, variance, and standard deviation), and graphical
techniques (histograms, box plots, and scatterplots). This type of analysis is used to identify
patterns, trends, and relationships in the data. It is also used to describe the characteristics of the
data, such as the shape of the distribution, the spread of the data, and outliers. Descriptive analysis
of univariate data is an important step in the data analysis process as it provides a basis for further
analysis.
For example, if we were analyzing the heights of students in a class, we would use univariate analysis
to look at the distribution of heights. We could look at the mean, median, mode, and range of the data.
We could also look at the frequency of different heights, and any outliers that exist. This would give
us an overall picture of the data.
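A minimal sketch of this kind of univariate summary, using Python's standard statistics module with invented height values, might look like this:

```python
# Univariate summary of a small, invented sample of student heights (in cm).
import statistics

heights = [150, 152, 155, 155, 158, 160, 162, 165, 170, 190]  # 190 may be an outlier

print("Mean:   ", statistics.mean(heights))
print("Median: ", statistics.median(heights))
print("Mode:   ", statistics.mode(heights))
print("Range:  ", max(heights) - min(heights))
print("Std dev:", round(statistics.stdev(heights), 2))
```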
• Bivariate Analysis of data: Bivariate analysis is a statistical method used to analyze the
relationship between two variables. It is used to determine if there is a correlation between the two
variables and to identify the strength of that correlation. It can also be used to identify any outliers
or trends in the data. It can be used to identify relationships between two variables, such as age and
income, or to compare two groups, such as men and women. Relationships involving three or more
variables at once, such as age, income, and education together, are the subject of multivariate analysis.
Example : you have a dataset that includes information about people's income and education level.
You can use bivariate analysis to explore the relationship between these two variables.
• For example, you might use a scatter plot to visualize the data. The scatter plot would show the
relationship between income and education level. You might find that people with higher levels of
education tend to have higher incomes. This would suggest that there is a positive correlation
between education level and income.
• You could also use bivariate analysis to look at the relationship between income and other
variables, such as age, gender, or location. This could help you understand how these other factors
may influence income.
• Bivariate analysis can be used to explore relationships between any two variables. It is a powerful
tool for understanding data and can help you uncover insights that may not be obvious from
looking at the data alone.
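As a hedged sketch of such a bivariate analysis, the code below computes a Pearson correlation between invented education and income values using scipy; the numbers are purely illustrative.

```python
# Bivariate analysis sketch: correlation between years of education and income.
from scipy import stats

education_years = [10, 12, 12, 14, 16, 16, 18, 20]
income_thousands = [25, 30, 32, 40, 48, 52, 60, 75]

r, p_value = stats.pearsonr(education_years, income_thousands)
# A positive r suggests that higher education tends to go with higher income.
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")
```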
• Multivariate Analysis of data: It involves the examination of more than two
variables at a time. It is used to analyze the relationships between multiple
variables and to explore the underlying structure of the data. It can be used to
identify patterns and trends in the data, to make predictions, and to identify
relationships between variables. It can be used to identify the most important
factors that influence a particular outcome, or to identify clusters of similar
individuals or objects. Multivariate analysis can also be used to test hypotheses
and to determine the strength of the relationships between variables.
For example, a researcher may want to examine the relationship between age,
gender, and income. To do this, they would collect data on these three variables
from a sample of individuals. They could then use multivariate analysis to look
for patterns in the data. They might find that older individuals tend to have
higher incomes, or that men tend to earn more than women. These patterns can
then be used to make predictions about the population as a whole.
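One possible sketch of such a multivariate analysis is a multiple regression of income on age and gender, fitted here with ordinary least squares via numpy; the values and the dummy coding are invented for illustration only.

```python
# Multivariate sketch: predict income from age and gender together (invented data).
import numpy as np

age    = np.array([25.0, 32, 40, 28, 50, 45, 36, 60])
gender = np.array([0.0, 1, 0, 1, 0, 1, 0, 1])          # dummy coding: 0 = male, 1 = female
income = np.array([30.0, 28, 45, 33, 55, 48, 42, 62])  # in thousands

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(age), age, gender])
coefficients, *_ = np.linalg.lstsq(X, income, rcond=None)
print("intercept, age effect, gender effect:", coefficients.round(2))
```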
Measurement is the quantification of attributes of an object or event, which
can be used to compare with other objects or events. In other words,
measurement is a process of determining how large or small a physical
quantity is as compared to a basic reference quantity of the same kind.
Scale: A scale is a device or an object used to measure or quantify an event or object.
Levels of Measurements: There are four different scales of measurement.
The data can be defined as being one of the four scales. The four types of
scales are:
1-Nominal Scale
2-Ordinal Scale
3-Interval Scale
4-Ratio Scale
1-Nominal Scale: A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or “labels” to
classify or identify the objects. This is the lowest level of measurement where data is categorized into distinct groups or categories. The
categories have no inherent order or numerical value. Examples include gender (male/female), eye color (blue/brown/green), or types of
cars (sedan/SUV/hatchback). In nominal scales, you can only determine equality or inequality between categories, but you cannot
perform mathematical operations or quantify the differences between them.
Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this measurement mechanism, the
answer should fall into either of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of numbers in the nominal
scale is “counting.”
Example: An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
• Here, the variables are used as tags, and the answer to this question should be either M or F.
• Which test is useful for measuring data on a nominal scale?
• Appropriate statistical tests: For central tendency, only the mode is meaningful, and frequency counts are a good way
to summarize the classification. Another analysis that can be performed is the chi-square test: when we want to
compare two or more categorical (nominal) variables, we can use the chi-square test.
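A short sketch of a chi-square test on nominal data, using scipy with an invented gender-by-car-type contingency table:

```python
# Chi-square test of independence for two nominal variables (invented counts).
from scipy.stats import chi2_contingency

observed = [[20, 30, 10],   # male:   sedan, SUV, hatchback
            [25, 15, 20]]   # female: sedan, SUV, hatchback

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```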
2-Ordinal Scale: The ordinal scale is the 2nd level of measurement. In an ordinal scale, data can be categorized into distinct
groups or categories, like in a nominal scale. However, the categories in an ordinal scale also have a natural order or ranking.
The differences between the categories are not necessarily equal or quantifiable. Examples of ordinal scales include survey
ratings (e.g., strongly agree, agree, neutral, disagree, strongly disagree) or educational levels (e.g., elementary, middle school,
high school, college). With ordinal data, you can determine the relative order or rank of categories, but you cannot establish
the magnitude of differences between them.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the relative magnitude (rank) of a variable, but not the size of the differences
• Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables
• The interval properties (equal distances between values) are not known
• Surveyors can quickly analyze the degree of agreement concerning the identified order of variables
Example:
• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ratings in restaurants
• Evaluating the frequency of occurrences
Very often, Often, Not often, Not at all
• Assessing the degree of agreement
Totally agree, Agree, Neutral, Disagree, Totally disagree

The most appropriate statistical tests: non-parametric tests such as the Mann-Whitney U test and the Kruskal-Wallis H test.
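A minimal sketch of these non-parametric tests with scipy, using invented 1-5 agreement ratings for two groups:

```python
# Non-parametric tests on ordinal ratings (invented data).
from scipy.stats import mannwhitneyu, kruskal

group_a = [1, 2, 2, 3, 3, 4, 5]
group_b = [2, 3, 4, 4, 5, 5, 5]

u_stat, p_u = mannwhitneyu(group_a, group_b)
h_stat, p_h = kruskal(group_a, group_b)
print(f"Mann-Whitney U = {u_stat}, p = {p_u:.4f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_h:.4f}")
```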
3-Interval Scale: The interval scale is the 3rd level of measurement scale. The interval scale possesses
the properties of both nominal and ordinal scales, but it also has equal intervals between the values.
In an interval scale, the numerical values or measurements have a consistent and meaningful distance
between them, but there is no true zero point. With interval data, you can determine the order,
measure the differences, and perform operations like addition and subtraction, but multiplication and
division are not meaningful.
Characteristics of Interval Scale:
• The interval scale is quantitative as it can quantify the difference between the values
• It allows calculating the mean and median of the variables
• To understand the difference between the variables, you can subtract the values between the variables
• The interval scale is a preferred scale in Statistics as it allows numerical values to be assigned to otherwise
arbitrary assessments such as feelings, calendar dates, etc.
Example: Examples include temperature measured in Celsius or Fahrenheit, where the difference between
10 and 20 degrees is the same as the difference between 30 and 40 degrees. However, zero degrees does not
represent an absence of temperature.
The most appropriate statistical tests : With a normal distribution of interval data, both parametric and non-
parametric tests are possible. Parametric tests are more statistically powerful than non-parametric tests and
let you make stronger conclusions regarding your data
4-Ratio Scale: The ratio scale is the 4th level of measurement scale, The ratio scale is the highest level of measurement,
possessing all the characteristics of the previous three scales. It has categories with natural order, equal intervals, and a
meaningful zero point. In a ratio scale, zero represents the absence of the measured attribute. Examples of ratio scales include
height, weight, time, and counts. With ratio data, you can determine the order, measure the differences, perform arithmetic
operations, and calculate ratios or proportions.
Characteristics of Ratio Scale:
• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The variables can be meaningfully added, subtracted, multiplied and divided.
Mean, median, and mode can all be calculated using the ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it allows unit conversions, for example between
kilograms and grams or between calories and kilocalories.
Example:
• What is your weight in Kgs?
• Less than 55 kgs
• 55 – 75 kgs
• 76 – 85 kgs
• 86 – 95 kgs
• More than 95 kgs
The most appropriate statistical tests: This type of measurement scale is the most versatile, as all statistical tools can be
applied to it. The most common of these are parametric tests such as regression, t-tests, ANCOVA, MANOVA and Pearson
correlation.
Type of measurement and the corresponding type of descriptive analysis:
• Nominal: Frequency table, proportion/percentages, Mode
• Ordinal: Median, Quartiles, Percentiles, Rank-order correlation
• Interval: Arithmetic mean, Correlation coefficient
• Ratio: Index numbers, Geometric mean, Harmonic mean


STATISTICAL TOOLS USED IN RESEARCH
1-Measures of central tendency
2-Measures of dispersion
3-Measures of Relationship
In case of bivariate population: a-Cross tabulation, b-Charles Spearman’s coefficient of correlation, c-Karl Pearson’s
coefficient of correlation
In case of multivariate population: a-Coefficient of multiple correlation, b-Coefficient of partial correlation
4-REGRESSION ANALYSIS
5-Parametric tests
6-Non-parametric tests
Measures of central tendency: These tell us the point about which items have a tendency to cluster. The
mean, median and mode are the three commonly used measures of central tendency.
• Mean is one of the measures of central tendency, apart from the mode and median. Mean is nothing
but the average of the given set of values. It denotes the equal distribution of values for a given data
set. To calculate the mean, we add all of the values in the data set and divide the sum
by the total number of values.
Example: What is the mean of 2, 4, 6, 8 and 10?
Solution: First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6
• Median is the middle value of a given data set when all the values are arranged in ascending order; it is
the central value of the ordered data set.
For example, the median of 3, 7, 1, 4, 8, 10, 2.
Arrange the data set in ascending order: 1,2,3,4,7,8,10
Median = middle value = 4
• Mode is the number in the list, which is repeated a maximum number of times.
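The worked examples above can be reproduced with Python's statistics module; the mode example below uses an invented list, since none is given in the text.

```python
# Reproducing the mean and median examples above; the mode data are invented.
import statistics

print(statistics.mean([2, 4, 6, 8, 10]))          # 6 (30 divided by 5)
print(statistics.median([3, 7, 1, 4, 8, 10, 2]))  # 4 (values are sorted internally)
print(statistics.mode([2, 3, 3, 5, 3, 7]))        # 3, the most frequently repeated value
```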
Measures of dispersion: Two data sets can have the same mean but they can be
entirely different. Thus to describe data, one needs to know the extent of variability. This is
given by the measures of dispersion. Range, mean deviation, variance and standard deviation are
commonly used measures of dispersion.
Range: (highest value of an item in a series) minus (lowest value)
Mean deviation : Mean deviation is a statistical measure that calculates the average
deviation from the mean value of a given data set.
Standard deviation : In statistics, standard deviation is a measure of how much a random
variable varies from its mean. It is calculated as the square root of the variance. A low
standard deviation indicates that the values are close to the mean, while a high standard
deviation indicates that the values are spread out over a wider range.
Variance: Variance measures how far each number in the set is from the mean (average),
and thus from every other number in the set.
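A brief sketch computing these dispersion measures with Python's statistics module on an invented data set:

```python
# Range, mean deviation, variance and standard deviation for invented data.
import statistics

data = [4, 8, 6, 5, 9, 12, 7]
mean = statistics.mean(data)

data_range = max(data) - min(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
variance = statistics.variance(data)   # sample variance
std_dev = statistics.stdev(data)       # sample standard deviation (square root of the variance)

print(data_range, round(mean_deviation, 2), round(variance, 2), round(std_dev, 2))
```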
Measures of Relationship:
For bivariate or multivariate population , we have to answer two types of questions :
Q1: Does there exist an association or correlation between two or more variables, and if yes, of what
degree?
Q2: Is there any cause-and-effect relationship between the two variables, or between one variable and
others?
The first question is answered by the use of correlation techniques, and the second question is answered
by the technique of regression.
In case of bivariate population: a-Cross tabulation, b-Charles Spearman’s coefficient of
correlation, c-Karl Pearson’s coefficient of correlation
In case of multivariate population: a-Coefficient of multiple correlation, b-Coefficient of
partial correlation
REGRESSION ANALYSIS: Regression analysis is a set of statistical methods used to
estimate relationships between a dependent variable and one or more independent
variables. It can be used to assess the strength of the relationship between variables and for
modeling the future relationship between them.
Simple linear regression is used to model the relationship between two continuous variables.
Often, the objective is to predict the value of an output variable (or response) based on the
value of an input (or predictor) variable.
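A minimal simple-linear-regression sketch using scipy's linregress; the predictor and response values (advertising spend and sales) are invented for illustration.

```python
# Simple linear regression: predict a response from one input variable (invented data).
from scipy.stats import linregress

advertising = [1, 2, 3, 4, 5, 6]        # hypothetical spend
sales       = [12, 15, 19, 22, 26, 30]  # hypothetical units sold

result = linregress(advertising, sales)
print(f"sales = {result.intercept:.2f} + {result.slope:.2f} * advertising  (r = {result.rvalue:.2f})")

# Use the fitted line to predict the response for a new input value.
print("Predicted sales at spend 7:", round(result.intercept + result.slope * 7, 1))
```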
Parametric tests: are those that make assumptions about the parameters of the population
distribution from which the sample is drawn. This is often the assumption that the
population data are normally distributed and the variances of the groups being compared are
equal. Eg-
1-t-test: is an inferential statistic used to determine if there is a significant difference between the means of two
groups and how they are related. T-tests are used when the data sets are assumed to follow a normal distribution
and have unknown population variances.
2-ANOVA (Analysis of Variance): a parametric test and statistical method that analyzes the differences between
the means of two or more groups or treatments.
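A short sketch of these two parametric tests with scipy, on invented samples assumed to be roughly normal:

```python
# Two-sample t-test and one-way ANOVA on invented, roughly normal samples.
from scipy.stats import ttest_ind, f_oneway

group_1 = [23, 25, 27, 30, 28, 26]
group_2 = [31, 29, 33, 35, 30, 32]
group_3 = [40, 38, 37, 41, 39, 42]

t_stat, p_t = ttest_ind(group_1, group_2)          # compares two group means
f_stat, p_f = f_oneway(group_1, group_2, group_3)  # compares three group means
print(f"t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"F = {f_stat:.2f}, p = {p_f:.4f}")
```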
Non-parametric tests: are “distribution-free” and can be used for non-normal variables.
Nonparametric tests do not make any assumptions about the distribution of the data or the
equality of variances. Eg-
1-The Mann-Whitney test is another powerful nonparametric test. It is similar to the t-test in that it is
designed to test differences between groups, but it is used with data that are ordinal.
2-The chi-square test (chi2) is used when the data are nominal and when computation of a mean is not
possible. This test is a statistical procedure that uses proportions and percentages to evaluate group
differences.
Diagrammatic and Graphic representation
Data can be represented diagrammatically in various ways, such as bar
graphs, line graphs, pie charts, scatter plots, and histograms. These
diagrams can help to visualize data in a more meaningful way, making it
easier to identify patterns and trends.
For example, a bar graph can be used to compare the values of different
categories, while a line graph can be used to show the changes in a value
over time. A pie chart can be used to show the proportions of different
values, and a scatter plot can be used to show the relationship between two
variables. Finally, a histogram can be used to show the distribution of a
single variable.
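A minimal matplotlib sketch of two of these diagrams (a bar graph and a histogram), with invented data:

```python
# Bar graph and histogram drawn with matplotlib (invented data).
import matplotlib.pyplot as plt

# Bar graph: compare values across categories.
categories = ["A", "B", "C", "D"]
values = [12, 30, 22, 8]
plt.bar(categories, values)
plt.title("Bar graph: values by category")
plt.show()

# Histogram: distribution of a single variable.
marks = [45, 52, 58, 60, 61, 63, 65, 67, 70, 72, 75, 80, 85]
plt.hist(marks, bins=5)
plt.title("Histogram: distribution of marks")
plt.show()
```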
Frequency distribution: is a way of organizing data into categories or groups and counting
the number of observations that fall into each category. It is a way of summarizing a set of
data. Frequency distributions are used to show the number of occurrences of different values
in a dataset. Frequency distributions can be used to identify patterns in the data, such as the
most common value or values, or the shape of the data. Example:
Age Group Frequency
0-10 10
11-20 20
21-30 30
31-40 40
41-50 25
51-60 15
61-70 5
71-80 2
81-90 1
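A frequency distribution like the table above can be built with pandas; the individual ages below are invented, so the resulting counts will differ from the table.

```python
# Grouping invented ages into class intervals and counting observations per group.
import pandas as pd

ages = pd.Series([5, 12, 18, 25, 27, 33, 38, 41, 47, 55, 62, 73, 84, 29, 35, 22, 16, 44])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

frequency = pd.cut(ages, bins=bins).value_counts().sort_index()
print(frequency)  # number of observations falling in each age group
```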
THANKS
