Group-4-Data-Management-Notes
Group 4:
DATA MANAGEMENT
List of Contents:
• Measures of Dispersion
• Measures of Position
• Chi-square
Data management is a process by which information is acquired and processed to ensure the
accessibility and reliability of the data for its users.
One of the most important tools in processing and managing such information is statistics.
Statistics is a science which deals with the collection, organization, presentation, analysis, and
interpretation of data so as to give more meaningful information.
Descriptive statistics refers to the collection, organization, summary, and presentation of data, while
inferential statistics deals with the interpretation and analysis of data, where a conclusion about the
population is drawn from a subset (sample) of that population.
Secondary data – information obtained from published materials or data gathered by other individuals or
agencies. These are data which are transcribed from original sources.
Variable – a characteristic of interest that has been observed or measured on every member of the
population or sample. A variable may be quantitative or qualitative; a quantitative variable is further
classified as discrete or continuous.
Discrete – takes on a countable number of values (usually expressed as whole numbers). Example: number of
books owned by a student
Continuous – measured on a continuous scale (it takes any value within a range or interval). Example:
height of the students (in feet)
Examples of qualitative variables: gender (male or female), hair color (black, brown, blonde), level of
satisfaction of a student with his grade (highly satisfied, satisfied, not satisfied)
Measure of central tendency – a single value that is used to identify the center of the data set or set of
observations.
Mean (arithmetic average) – the sum of all observed values divided by the number of observations in the
data set.
NOTES (DATA MANAGEMENT)
For example:
The scores of five students who are randomly selected in a class of Math 01 are as follows: 44,
37, 41, 35, and 32. Find their average score.
Given: n=5
Formula: x̅ = Σx/n
Solution: x̅ = Σx/n
x̅ = (44 + 37 + 41 + 35 + 32)/5 = 189/5
x̅ = 37.8
Median – a single value which divides an array of observations into two equal parts.
- it is the middle value in a set of numbers, where half of the values are less than the median
and half are greater.
Note: The observations must be arranged first (lowest to highest) before getting the median
value.
For example:
Odd: The numbers of books owned by eleven children are as follows: 5, 2, 4, 6, 5, 10, 7, 6, 9,
8, 6. What is the median?
Arrangement: 2, 4, 5, 5, 6, 6, 6, 7, 8, 9, 10
Median: 6
Even: Compute the median of the data set: 2.5, 4.0, 5.8, 3.5, 2.5, 8.2, 7.1, 3.7.
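Both cases can be checked with a short Python sketch. The function name median and the variable names are our own choices, not part of the notes. It sorts the data first, then takes the middle value (odd n) or the average of the two middle values (even n).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)          # observations must be arranged first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: exactly one middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

books = [5, 2, 4, 6, 5, 10, 7, 6, 9, 8, 6]             # odd example (n = 11)
data_even = [2.5, 4.0, 5.8, 3.5, 2.5, 8.2, 7.1, 3.7]   # even example (n = 8)

print(median(books))                  # 6
print(round(median(data_even), 2))    # 3.85
```

For the even data set the arranged values are 2.5, 2.5, 3.5, 3.7, 4.0, 5.8, 7.1, 8.2, so the median is (3.7 + 4.0)/2 = 3.85.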
Mode – an observation that occurs most frequently in the given data set.
For example:
Find the mode in this data set: 36, 36, 12, 29, 35, 45, 50, 45, 45, 5
Answer: 45 – Unimodal
Find the mode in this data set: 39, 23, 25, 25, 63, 37, 45, 37, 48, 51, 28, 45, 50
Answer: 25, 37, & 45 – Trimodal
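The two answers above can be reproduced with a small Python sketch that counts how often each value occurs. The helper name modes is hypothetical. Note that when every value occurs only once, this sketch returns all of the values, although such a data set is usually described as having no mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([36, 36, 12, 29, 35, 45, 50, 45, 45, 5]))                # [45]
print(modes([39, 23, 25, 25, 63, 37, 45, 37, 48, 51, 28, 45, 50]))   # [25, 37, 45]
```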
Measures of Dispersion
The most commonly used measures of dispersion are the range, variance, and standard deviation.
The range, R, is the difference between the highest value (H) and lowest value (L) in the data set. That is,
R = H – L.
In terms of the measure of central tendency, the students perform equally, since they have the same
average rating of 80%. However, looking at the variability of their ratings, Student A has the highest
range of the three. This shows that the ratings of Student A are more dispersed than those of the others.
The ratings of Student A fluctuate, while those of Student B are uniformly distributed. On the other hand,
Student C has a range equal to zero, so his ratings are all concentrated at the mean, indicating that the
distribution has no spread.
• The larger the value of the range, the more dispersed the observations are.
• The range considers only the extreme values or observations in the data set.
The standard deviation is the positive square root of the variance. The variance is the average of the
squared deviations of every observation from the mean.
The unit of the variance is the square of the unit of the data, while that of the standard deviation is
the same as the unit of the data set. The following symbols are used to designate these measures for a
population and a sample.
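The table of symbols from the original notes is not reproduced here. As a stand-in, these are the standard textbook definitions, written in the same Σ notation as the mean formula. Here N and μ denote the population size and mean, and n and x̅ the sample size and mean. The n − 1 divisor for the sample is the usual convention and is an assumption about what the notes intended.

```latex
% Population variance and standard deviation
\sigma^2 = \frac{\sum (x - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}
```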
Example 1:
The following are the scores of a student in all her long exams in Calculus: 83, 80, 89, 78, and 70.
Calculate the standard deviation.
(Using the formula of population variance, the Variance is 38.8, Standard Deviation is 6.23)
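The stated result can be verified step by step with a minimal Python sketch of the population formula (dividing by n, as in the example). The variable names are our own.

```python
import math

scores = [83, 80, 89, 78, 70]
n = len(scores)

mean = sum(scores) / n                                   # 400 / 5 = 80
squared_deviations = [(x - mean) ** 2 for x in scores]   # 9, 0, 81, 4, 100
variance = sum(squared_deviations) / n                   # 194 / 5
std_dev = math.sqrt(variance)                            # positive square root of the variance

print(round(variance, 1), round(std_dev, 2))             # 38.8 6.23
```

Python's standard library gives the same values through statistics.pvariance(scores) and statistics.pstdev(scores).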
Example 2:
The table gives the frequency distribution of the daily commuting times (in minutes) from home to work
for all 25 employees of a company. Calculate the variance and standard deviation.
solution:
Measures of Position
A measure of position, or quantile, is a statistical measure that provides the specific location of an
observation relative to the other values when the data are in ranked order.
Percentiles, deciles, and quartiles are among the most commonly used measures of position.
Quartiles are the 3 score points which divide the data or distribution into four equal parts. These are the
first quartile (Q1), the second quartile (Q2) and the third quartile (Q3).
The difference between the upper quartile (Q3) and the lower quartile (Q1) is called Interquartile Range
(IQR).
Deciles are the 9 score points which divide the data into ten equal parts. These are the first decile (D1),
second decile (D2), third decile (D3), up to the ninth decile (D9).
Percentiles are the 99 score points which divide the data into 100 equal parts. They are used to
characterize values according to the percentage of observations below them.
a) Lower Quartile (L) = Position of Q1 = 1/4 (n + 1), where n is the number of elements in the data set.
b) Upper Quartile (U) = Position of Q3 = 3/4 (n + 1), where n is the number of elements in the data set.
c) Median is the middle number after the data elements are arranged in increasing or decreasing
order. When there are two middle numbers, get their average by adding them and dividing the sum by 2.
Remember:
a) If the number of elements (n) in a data is odd, there is only one middle number and this is the
median.
b) If the number of elements (n) in a data is even, there are two middle numbers and you have to
compute their average to get the median.
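The position formulas can be sketched in Python as follows. The notes do not say what to do when a position such as 1/4 (n + 1) is not a whole number, so this sketch assumes linear interpolation between the two neighbouring values. That choice, the function names, and the requirement of at least 3 observations are all assumptions.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data.

    Fractional positions are linearly interpolated (an assumption, not from the notes).
    """
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Return (Q1, median, Q3) using the (n + 1) position formulas; needs n >= 3."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3

books = [5, 2, 4, 6, 5, 10, 7, 6, 9, 8, 6]   # n = 11, positions 3, 6 and 9
q1, q2, q3 = quartiles(books)
print(q1, q2, q3, q3 - q1)                   # 5 6 8 3  (last value is the IQR)
```

For an even n the median position ends in .5, so the interpolation reduces to averaging the two middle numbers, which agrees with rule (b) above.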
A diagram showing a 5-point summary of a data set, specified by the lowest and the highest values,
the values corresponding to Q1 and Q3, and the median, is called a box-and-whisker plot, also known
as a box plot.
3. Locate the five numbers (lowest and highest values, Q1, median, and Q3) on the number line
and draw a rectangle (box) above the scale covering Q1, the median, and Q3, then draw a line segment
across the box passing through the median.
4. Connect the box to the extreme values by line segments (known as whiskers).
Stem-and-leaf display
A stem-and-leaf display is an organized diagram showing the relative position of every element in the
data set such that the leading digit(s) become the stem and the trailing digit(s) become the leaf.
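As an illustration, the sketch below builds a stem-and-leaf display for the five Math 01 scores used in the mean example. It assumes non-negative whole numbers, with the tens digit(s) as the stem and the units digit as the leaf. The function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last) with the last digit as leaf."""
    display = defaultdict(list)
    for v in sorted(values):              # sorting keeps stems and leaves in ascending order
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

scores = [44, 37, 41, 35, 32]
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 3 | 2 5 7
# 4 | 1 4
```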
When most of the observations are near the “center” and the distribution of data is nearly the same on
both sides, the distribution is said to follow a normal distribution.
A normal distribution, also known as the Gaussian distribution, is a continuous probability distribution
which is drawn graphically as a smooth bell-shaped curve called the normal curve. The area under the
curve is equal to one.
2. The three measures of central tendency given by the mean, median and mode are all equal.
4. The curve is asymptotic with respect to the horizontal axis in both directions.
The proportion of values in a given data set which is normally distributed is based on the mean and the
standard deviation of the data set.
That is, about 68% of the observations fall within 1 standard deviation of the mean; about 95%
of the observations fall within 2 standard deviations of the mean; and about 99.7% of the
observations fall within 3 standard deviations of the mean.
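These three percentages can be checked numerically. The sketch below uses the known identity P(|Z| < k) = erf(k/√2) for a normally distributed variable, with math.erf from Python's standard library. The function name is our own.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```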
A standard normal distribution is a distribution of a random variable with mean zero and standard
deviation equal to one. That is, Z ~ N(0, 1).
Correlation and regression are two related statistical tools. Correlation is used to find out if there is a
relationship between two variables while regression is a means to predict or forecast the value of one
variable in terms of the other.
CORRELATION ANALYSIS
Correlation analysis is a method used to measure the degree of relationship or association between two
or more variables.
SCATTER DIAGRAM
A scatter diagram, also known as a scatter plot, is a pictorial presentation showing the relationship
between two variables. It shows the direction and shape of the association being conveyed. This is done
by plotting the points corresponding to the observations/data in the first quadrant of a rectangular
coordinate system.
TYPES OF CORRELATION
1. Positive correlation – a direct relationship between two variables exists. That is, as one variable
increases (decreases), the other also increases (decreases).
2. Negative correlation – an inverse relationship exists between the variables. Here, one variable
increases as the other decreases, or vice versa.
3. Zero correlation – exists when high or low scores in one variable are associated with neither
systematically high nor systematically low scores in the other variable. It indicates that there is no
correlation between the variables. The points in the scatter diagram are scattered in a random manner.
REMARKS:
The relationship between two variables may be described by its magnitude or strength. In terms of
strength, the correlation may be perfect, high, moderate, or low. In a perfect correlation, all points in
the scatter diagram lie on a straight line.
The degree or strength of relationship between two variables may also be described by computing a
single number called the correlation coefficient.
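The notes do not give the formula for the coefficient. A common choice is the Pearson product-moment coefficient, so the sketch below assumes that is the one intended. The data are hypothetical and were chosen so that the points lie exactly on a rising or falling straight line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r (assumed here; the notes do not name a formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, not taken from the notes
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises with x: perfect positive correlation
y_down = [10, 8, 6, 4, 2]    # falls as x rises: perfect negative correlation

print(pearson_r(x, y_up))    # 1.0
print(pearson_r(x, y_down))  # -1.0
```

The value of r always lies between −1 and 1. The sign gives the direction of the relationship and the absolute value gives its strength.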
The correlation coefficient (r) may be interpreted using the correlation scale shown below.
REGRESSION
Regression describes the process of estimating the relationship between two variables. The
relationship is estimated by fitting a straight line through the given data. The least squares method is
used to determine the equation of the line that best fits the data.
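A minimal sketch of the least squares method for a line y = a + bx is given below. It uses the standard formulas b = Σ(x − x̅)(y − ȳ) / Σ(x − x̅)² and a = ȳ − b·x̅. The symbols a and b, the function name, and the data are our own choices and are not taken from the notes.

```python
def fit_line(xs, ys):
    """Least squares estimates (intercept a, slope b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

a, b = fit_line(x, y)
print(a, b)            # 1.0 2.0
print(a + b * 6)       # predicted y at x = 6: 13.0
```

Once a and b are known, the value of one variable can be predicted from the other by substituting into the fitted equation, as in the last line.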
Chi-square
Chi-square (χ²) – a statistical method used to determine if there is a significant association between two
categorical variables or if a sample matches the population distribution.
Chi-Square Test for Independence: This test checks if two categorical variables are related or
independent.
Chi-Square Goodness-of-Fit Test: This test determines if the observed sample distribution fits an
expected probability distribution.
Example: Do dice rolls match the expected uniform distribution of a fair die?
df = number of categories - 1
Find the critical value in a Chi-square table using df and significance level (usually 0.05).
Step 4: Conclusion
For example:
Scenario: A researcher wants to know if there is an association between gender and preference for a
particular type of book (fiction vs. non-fiction). She collects data from a random sample of 100 people
and organizes it into a table:
Research Question:
Is there a statistically significant association between gender and book preference?
Step-by-Step Solution:
o Null Hypothesis (H₀): There is no association between gender and book preference (they
are independent).
For example, the expected count for males who prefer fiction:
Using this formula, we fill in the expected counts for each cell:
df = (2−1) × (2−1) = 1
At a 5% significance level (α = 0.05) and 1 degree of freedom, the critical value for
chi-square is 3.841. Since our calculated chi-square statistic (1.01) is less than 3.841, we fail to
reject the null hypothesis.
Conclusion:
There is no significant association between gender and book preference in this sample;
any difference observed is likely due to chance.
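The steps of the test for independence can be sketched in Python. The observed table from the notes is not reproduced here, so the counts below are hypothetical and will not give the statistic of 1.01 reported above. Each expected count is (row total × column total) / grand total, and the statistic is the sum of (observed − expected)² / expected over all cells.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts for 100 people (NOT the table from the notes).
# Rows: male, female. Columns: fiction, non-fiction.
observed_counts = [[27, 23],
                   [23, 27]]

statistic, df = chi_square_independence(observed_counts)
CRITICAL_VALUE = 3.841   # alpha = 0.05, df = 1, as quoted in the notes

print(round(statistic, 2), df)        # 0.64 1
print(statistic > CRITICAL_VALUE)     # False, so we fail to reject the null hypothesis
```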
Scenario: A candy company claims that the colors of their candies are distributed as follows:
• Red: 30%
• Blue: 20%
• Green: 25%
• Yellow: 25%
A researcher suspects that the actual distribution might differ, so they randomly select 200
candies and record the following counts:
Research Question:
Is there a significant difference between the observed distribution of candy colors and the
company’s claimed distribution?
Step-by-Step Solution:
o Null Hypothesis (H₀): The observed distribution matches the claimed distribution by the
company.
o Alternative Hypothesis (H₁): The observed distribution does not match the claimed
distribution.
Therefore,
df = number of categories – 1
= 4 – 1 = 3
At a 5% significance level (α = 0.05) and 3 degrees of freedom, the critical value for
chi-square is approximately 7.815. Since our calculated chi-square statistic (5.42) is less than 7.815,
we fail to reject the null hypothesis.
Conclusion:
There is no significant difference between the observed distribution of candy colors and
the company’s claimed distribution. This means that the observed color distribution is consistent
with the company’s claim.
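The goodness-of-fit calculation follows the same pattern. In the sketch below the claimed percentages and the sample size of 200 come from the scenario. The observed counts are hypothetical, because the recorded counts are not reproduced in these notes, so the result differs from the 5.42 reported above.

```python
claimed_percent = {"Red": 30, "Blue": 20, "Green": 25, "Yellow": 25}
sample_size = 200

# Expected count = claimed proportion x sample size: 60, 40, 50, 50
expected = {color: pct * sample_size / 100 for color, pct in claimed_percent.items()}

# Hypothetical observed counts (NOT the counts recorded in the notes); they sum to 200
observed = {"Red": 66, "Blue": 36, "Green": 52, "Yellow": 46}

statistic = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)
df = len(expected) - 1
CRITICAL_VALUE = 7.815   # alpha = 0.05, df = 3, as quoted in the notes

print(round(statistic, 2), df)        # 1.4 3
print(statistic > CRITICAL_VALUE)     # False, so we fail to reject the null hypothesis
```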