Data Analysis
Graphical Presentation
Line graph
A line graph is a type of graph in which the data are plotted as
points, and neighbouring points are then joined to each other by
straight lines.
The line graph is normally used to represent the data that changes
over time.
Bar graph
A bar graph or bar chart represents data with
rectangular columns or bars. The height or length
of each bar is proportional to the value it represents.
Histogram
According to the above-mentioned pictograph, the number of apples sold on Monday is 6 × 2 = 12.
Scatter diagrams
A scatter diagram, or scatter plot, is a graphical representation of two
variables. The plot shows the relationship between the two variables. Below is a data
table as well as a scattergram for the given data.
Car Color    Frequency of Total-Loss Collisions
Blue          25
Green         52
Red           41
White         36
Black         39
Grey          23
Total        216
Car Color    Frequency of Total-Loss Collisions    Percentage of Loss Collisions
Blue          25                                    11
Green         52                                    24
Red           41                                    19
White         36                                    17
Black         39                                    18
Grey          23                                    11
Total        216                                   100
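The percentage column can be derived from the frequency column by dividing each count by the total. The sketch below is only an illustration of that arithmetic; the variable names are not from the source, and rounding each value independently is an assumption. With that rounding, Blue comes out as 12 rather than the 11 listed, and the rounded column sums to 101, which shows that independently rounded percentages need not add up to exactly 100.

```python
# Frequencies of total-loss collisions by car color, taken from the table.
frequencies = {
    "Blue": 25, "Green": 52, "Red": 41,
    "White": 36, "Black": 39, "Grey": 23,
}

total = sum(frequencies.values())

# Relative frequency as a whole-number percentage (each value rounded on its own).
percentages = {
    color: round(100 * count / total)
    for color, count in frequencies.items()
}

for color, count in frequencies.items():
    print(f"{color:<6}{count:>5}{percentages[color]:>5}%")
print(f"{'Total':<6}{total:>5}{sum(percentages.values()):>5}%")
```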
Pie Chart
Bar chart
2 Types of Statistics
Descriptive stats are used to define statistical properties of the data like the mean,
variance, skewness, etc.
Statistics used only to describe the sample or summarize information about
the sample (methods of organizing, summarizing and presenting data in an
informative way)
Inferential stats are used to determine if there are statistically significant relationships
between sets of data.
Statistics used to make inferences or generalizations about the broader population
(methods used to find out something about a population based on a sample)
Differences between Descriptive statistics and inferential statistics
Exercises
1. The line graph is normally used to represent the data that changes over time.
2. The height or length of each bar is proportional to the value it represents.
3. Similar in appearance to a bar graph, the histogram condenses a data series into an easily
interpreted visual by taking many data points and grouping them into logical ranges or bins.
4. The stem and leaf plot is a way to represent quantitative data according to frequency ranges or
frequency distribution.
5. A pie chart is a circular chart in which numerical information is represented as slices, in fractional or
percentage form, where the whole circle is 100%.
6. In the stem and leaf plot, each data value is split into a stem and a leaf; for example, 32 is split into
stem 3 and leaf 2.
7. Scatter diagram or scatter plot is a way of graphical representation by using two variables.
8. It can be used in almost all fields from mathematics to physics to psychology and so on.
9. Descriptive stats are used to define statistical properties of the data like the mean, variance,
skewness, etc.
10. Inferential stats are used to determine if there are statistically significant relationships between sets of
data.
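The stem and leaf split described in item 6 can be sketched in a few lines. This is a minimal illustration that assumes non-negative integers, with the last digit as the leaf and the remaining digits as the stem; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 32 -> stem 3, leaf 2
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical data set, used only to illustrate the layout.
data = [32, 25, 47, 31, 28, 44, 32, 50]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```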
Data description
Mean – the sum of the values, divided by the total number of values.
Median – the midpoint of the data array.
Mode – the value that occurs most often in a data set.
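As a quick illustration of the three definitions, the sketch below uses Python's standard `statistics` module on a small made-up data set (the values are hypothetical, not from the source).

```python
import statistics

# Hypothetical data set; already sorted for readability.
values = [2, 3, 3, 5, 7]

mean = statistics.mean(values)      # sum of the values divided by their count: 20 / 5
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # value that occurs most often

print(mean, median, mode)
```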
Properties and Uses of Central Tendency
The Mean
• The mean is found by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the
same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as the variance.
• The mean for the data set is unique and not necessarily one of the data values.
• The mean cannot be computed for the data in a grouped frequency distribution that has
an open-ended class.
• The mean is affected by extremely high or low values, called outliers, and may not be
the appropriate average to use in these situations.
The Median
• The median is used to find the center or middle value of a data set.
• The median is used when it is necessary to find out whether the data values fall
into the upper half or lower half of the distribution.
• The median is used for an open-ended distribution.
• The median is affected less than the mean by extremely high or extremely low
values.
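The different sensitivity of the mean and the median to outliers can be checked numerically. The sketch below uses a hypothetical list of salaries (in thousands) and adds one extreme value; the data and names are illustrative only.

```python
import statistics

# Hypothetical salaries in thousands; not taken from the source.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # add one extremely high value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps sharply, while the median barely moves.
print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```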
The Mode
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal or categorical, such as
religious preference or gender.
Exercises
1. Which measure of central tendency includes the magnitude of scores?
(a) Mean (b) Mode (c) Median (d) Range
2. To calculate the median, all the items of a series have to be
arranged in a/an ___(c)_____.
(a) Descending order (b) Ascending order (c) Ascending or
descending order (d) None of the above.
Coefficient of Variation
• The standard deviation divided by the mean.
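Coefficient of variation: $CV = \frac{s}{\bar{x}}$ for a sample, or $CV = \frac{\sigma}{\mu}$ for a population. It is commonly multiplied by 100 and reported as a percentage.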
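A minimal sketch of the calculation, assuming the CV is reported as a percentage and using a hypothetical data set. The function name is illustrative; `statistics.stdev` gives the sample standard deviation and `statistics.pstdev` the population one.

```python
import statistics


def coefficient_of_variation(values, population=False):
    """Standard deviation divided by the mean, expressed as a percentage."""
    spread = statistics.pstdev(values) if population else statistics.stdev(values)
    return spread / statistics.mean(values) * 100


# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

cv_sample = coefficient_of_variation(data)
cv_population = coefficient_of_variation(data, population=True)

print(f"sample CV:     {cv_sample:.2f}%")
print(f"population CV: {cv_population:.2f}%")
```

Because the CV is unit-free, it can be used to compare the relative spread of data sets measured in different units.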
Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns in the
data. These patterns include outliers and features of the data that might be unexpected. EDA is
an important first step in any data analysis.
• In exploratory data analysis (EDA), data can be organized using a stem and leaf plot.
• The measure of central tendency used in EDA is the median.
• The measure of variation used in EDA is the interquartile range.
Interquartile range: $IQR = Q_3 - Q_1$
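The interquartile range can be computed as follows. This is a sketch on hypothetical data using `statistics.quantiles` with its default "exclusive" method; other quartile conventions exist and can give slightly different values for the same data.

```python
import statistics

# Hypothetical data set with nine values.
data = [5, 7, 8, 10, 12, 15, 18, 21, 25]

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```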
Challenge yourself with these true/false questions.
1. The purpose of inferential statistics is to simplify and organize the data from a study. (True/False)
2. Frequency distributions are a subset of inferential statistics. (True/False)
3. Summary statistics are a subset of descriptive statistics. (True/False)
4. The range is a frequently used measure of central tendency. (True/False)
5. The mode is a frequently used measure of central tendency. (True/False)
6. The mean is the score at the 50th percentile. (True/False)
7. A sample is a subset of the people or objects in a population. (True/False)
8. If all of the scores in a distribution are increased by exactly five
points, the range will increase by five points. (True/False)
9. The variance, unlike the range, uses all the scores in its
computation. (True/False)
10. In descriptive statistics, final results are shown in the form of charts,
tables and graphs. (True/False)
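2. Analytic statistics
- looking at association among 2 or more variables
Regression line: $\hat{y} = a + bx$
Where:
y = the dependent variable that you are trying to predict or explain ($\hat{y}$ is its predicted value)
x = the independent variable
a = the intercept
b = the slope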
E.g. A school administrator wondered whether class size and grade achievement (in percent) were
related. A random sample of classes revealed the following data. Find ŷ when x = 12.
No. of students    15   10    8   20   18    6
Avg. grade (%)     85   90   82   80   84   92

No. of students (x)    Average grade (%) (y)      xy      x²
15                     85                        1275     225
10                     90                         900     100
 8                     82                         656      64
20                     80                        1600     400
18                     84                        1512     324
 6                     92                         552      36
Total  77              513                       6495    1149
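$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{6(6495) - (77)(513)}{6(1149) - (77)^2} = \frac{-531}{965} \approx -0.55$
$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} = \frac{(513)(1149) - (77)(6495)}{6(1149) - (77)^2} = \frac{89322}{965} \approx 92.56$
If x = 12, $\hat{y} = a + bx = 92.56 - 0.55(12) \approx 85.96$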
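The worked example can be checked with a short script. This sketch applies the standard least-squares formulas for the slope b and intercept a to the class-size data from the table; the variable and function names are illustrative.

```python
# Data from the example: class size (x) and average grade in percent (y).
class_sizes = [15, 10, 8, 20, 18, 6]
avg_grades = [85, 90, 82, 80, 84, 92]

n = len(class_sizes)
sum_x = sum(class_sizes)
sum_y = sum(avg_grades)
sum_xy = sum(x * y for x, y in zip(class_sizes, avg_grades))
sum_x2 = sum(x * x for x in class_sizes)

# Least-squares slope and intercept of the line y_hat = a + b * x.
denominator = n * sum_x2 - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / denominator
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator


def predict(x):
    """Predicted average grade for a class of size x."""
    return a + b * x


print(f"b = {b:.2f}, a = {a:.2f}, y_hat(12) = {predict(12):.2f}")
```

The negative slope indicates that, in this sample, larger classes tend to have slightly lower average grades.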
Correlation
Correlation is used to describe the linear relationship between two continuous
variables (e.g., height and weight). It measures the strength (qualitatively) and direction
of the linear relationship between two or more variables.
Correlation coefficient
The degree of association is measured by a correlation coefficient,
denoted by r. It is sometimes called Pearson’s correlation
coefficient after its originator and is a measure of linear association.
If a curved line is needed to express the relationship, other and
more complicated measures of the correlation must be used.
The correlation coefficient is measured on a scale that varies from +1 through 0 to -1.
Compute the value of the linear correlation coefficient for the data obtained in the study of the number
of absences and the final grade of the seven students in the statistics class.
Student    Number of absences (x)    Final grade (%) (y)      xy      x²       y²
A           6                        82                      492      36     6724
B           2                        86                      172       4     7396
C          15                        43                      645     225     1849
D           9                        74                      666      81     5476
E          12                        58                      696     144     3364
F           5                        90                      450      25     8100
G           8                        78                      624      64     6084
Total      57                        511                    3745     579    38993
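$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{7(3745) - (57)(511)}{\sqrt{[7(579) - (57)^2][7(38993) - (511)^2]}} = \frac{-2912}{\sqrt{(804)(11830)}} = -0.94422$
Correlation coefficient (r) = -0.94422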
Therefore, the number of absences and the final grade
are strongly and negatively correlated.
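The value of r can be reproduced from the table with the computational formula for Pearson's correlation coefficient. The sketch below is illustrative; the variable names are not from the source.

```python
import math

# Data from the table: number of absences (x) and final grade in percent (y).
absences = [6, 2, 15, 9, 12, 5, 8]
final_grades = [82, 86, 43, 74, 58, 90, 78]

n = len(absences)
sum_x = sum(absences)
sum_y = sum(final_grades)
sum_xy = sum(x * y for x, y in zip(absences, final_grades))
sum_x2 = sum(x * x for x in absences)
sum_y2 = sum(y * y for y in final_grades)

# Pearson's r: covariance term divided by the product of the two spread terms.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"r = {r:.5f}")
```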
ANOVA
Analysis of Variance (ANOVA) is a statistical method
that compares the variance between group means with the
variance within the groups. It is used in a wide range of
scenarios to determine if there is any difference between
the means of different groups.
The test was developed by Ronald Fisher in 1918 and has been
in use ever since. Put simply, ANOVA tells you if there are
any statistically significant differences between the means of
three or more independent groups. One-way ANOVA is
the most basic form.
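To make the idea concrete, the sketch below computes the one-way ANOVA F statistic (between-group mean square divided by within-group mean square) for three small hypothetical groups. The function name and data are illustrative; deciding significance would additionally require comparing F with a critical value from the F distribution, which is not done here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical scores for three independent groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_statistic = one_way_anova_f(groups)
print(f"F = {f_statistic:.2f}")
```

A large F means the group means differ by more than the spread within the groups would suggest.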