Group-4-Data-Management-Notes

The document provides an overview of data management and statistical concepts, including measures of central tendency, dispersion, position, probability, correlation, and chi-square tests. It explains key terms such as population, sample, mean, median, mode, variance, and standard deviation, along with methods for analyzing and interpreting data. Additionally, it covers the construction of box-and-whisker plots and the significance of normal distribution in statistics.

Uploaded by

talisicgwen
Copyright
© All Rights Reserved

NOTES (DATA MANAGEMENT)

Group 4:

Arecayos, Jasline Kieth

Cinco, Laurie Jane

Comendador, Trizsha Mae

Lawas, Janice Mae

Montes, France Trisha

Nunez, John Christopher

DATA MANAGEMENT

List of Contents:

• Basic Statistical Concepts

• Measures of Central Tendency

• Measures of Dispersion

• Measures of Position

• Probability and the Normal Distribution

• Correlation and Linear Regression

• Chi-square

Basic Statistical Concepts

Data management is a process by which information is acquired and processed to ensure the
accessibility and reliability of the data for its users.

One of the most important tools in processing and managing such information is statistics.

Statistics is a science which deals with the collection, organization, presentation, analysis, and
interpretation of data so as to give more meaningful information.

Descriptive statistics refers to the collection, organization, summary, and presentation of data, while
inferential statistics deals with the interpretation and analysis of data, where conclusions are drawn
from a subset of the population.

Population - a collection or set of things or objects under consideration

Sample - a subset or representative group of the population



Statistic - a value which is computed from a sample

Array - a listing of observations arranged in increasing or decreasing order of magnitude

Parameter - a value which is computed from a population

Data - the information gathered in a research study

Primary data - information gathered from respondents by the researcher himself or herself

Secondary data - information obtained from published materials or data gathered by other individuals or
agencies. These are data transcribed from original sources.

Variable - a characteristic of interest that has been observed or measured on every member of the
population or sample. A variable may be quantitative or qualitative, and a quantitative variable is further
classified as discrete or continuous.

Quantitative/Numerical variable - describes the amount or number of an element of a sample or
population

Discrete - takes on a countable number of values (usually expressed as whole numbers). Example: number of
books owned by a student

Continuous - measured on a continuous scale (it can take any value within a range or interval). Example:
height of the students (in feet)

Qualitative/Categorical variable - describes the quality, category, or character of an element of a
population or sample

Examples: gender (male or female), hair color (black, brown, blonde), level of satisfaction of a student with
his or her grade (highly satisfied, satisfied, not satisfied)

Measures of Central Tendency

Measures of Central Tendency – also called measures of central location

- A single value that is used to identify the center of the data set or set of
observations.

Three measures of central tendency – MEAN, MEDIAN, and MODE

Mean (arithmetic average) – is the sum of all observed values divided by the number of observations in the data set.

For example:

The scores of five students who are randomly selected in a class of Math 01 are as follows: 44,
37, 41, 35, and 32. Find their average score.

Given: n=5

Required: average score (mean)

Formula: x̅ = Σx/n

Solution: x̅ = Σx/n

x̅ = (44 + 37 + 41 + 35 + 32)/5 = 189/5

x̅ = 37.8

Answer: The average score of the five students is 37.8
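
The computation above can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and are not part of the original notes.

```python
from statistics import mean

# Scores of the five Math 01 students from the example above
scores = [44, 37, 41, 35, 32]

total = sum(scores)      # Σx, the sum of all observed values
n = len(scores)          # number of observations
average = total / n      # x̅ = Σx / n

print(total, n, average)

# The standard library applies the same definition
print(mean(scores))
```

Dividing the sum by the count is all the formula asks for; `statistics.mean` is shown only as a cross-check.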

Median – a single value which divides an array of observations into two equal parts.

- It is the middle value in a set of numbers, where half of the values are less than the median
and half are greater.

ODD: Middlemost value in the list

EVEN: Average of the two middlemost values, m1 and m2: Median = (m1 + m2)/2

Note: The observations must first be arranged (lowest to highest) before getting the median
value.

For example:
Odd: The numbers of books owned by eleven children are as follows: 5, 2, 4, 6, 5, 10, 7, 6, 9,
8, 6. What is the median?

Arrangement: 2, 4, 5, 5, 6, 6, 6, 7, 8, 9, 10
Median: 6 (the 6th of the 11 ordered values)

Even: Compute the median of the data set: 2.5, 4.0, 5.8, 3.5, 2.5, 8.2, 7.1, 3.7.

Arrangement: 2.5, 2.5, 3.5, 3.7, 4.0, 5.8, 7.1, 8.2
Given: m1 (lower middle value) = 3.7
m2 (upper middle value) = 4.0
Solution:
Median = (m1 + m2)/2
= (3.7 + 4.0)/2 = 7.7/2
Median = 3.85
Answer: The median is 3.85
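
The odd and even cases can be sketched in Python as follows. The function name `median_by_hand` is hypothetical and used only for illustration; `statistics.median` from the standard library follows the same rule.

```python
from statistics import median

def median_by_hand(values):
    """Sort the values, then take the middle one (odd n)
    or the average of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Data sets from the two examples above
books = [5, 2, 4, 6, 5, 10, 7, 6, 9, 8, 6]
data = [2.5, 4.0, 5.8, 3.5, 2.5, 8.2, 7.1, 3.7]

print(median_by_hand(books))
print(round(median_by_hand(data), 2))

# Cross-check with the standard library
print(median(books), round(median(data), 2))
```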

Mode – an observation that occurs most frequently in the given data set.

Types of mode: UNIMODAL, BIMODAL, TRIMODAL, and MULTIMODAL

Unimodal: only one mode
Bimodal: two modes
Trimodal: three modes
Multimodal: more than three modes

For example:
Find the mode in this data set: 36, 36, 12, 29, 35, 45, 50, 45, 45, 5
Answer: 45 – Unimodal

Find the mode in this data set: 8, 7, 6, 5, 6, 9, 2, 3, 11, 11, 43, 10


Answer: 6 & 11 – Bimodal

Find the mode in this data set: 39, 23, 25, 25, 63, 37, 45, 37, 48, 51, 28, 45, 50
Answer: 25, 37, & 45 – Trimodal
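
The three mode examples can be reproduced with a short Python sketch. The helper names `modes` and `classify` are hypothetical; the labels follow the types listed in these notes.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

LABELS = {1: "Unimodal", 2: "Bimodal", 3: "Trimodal"}

def classify(values):
    """Pair the modes with a label; more than three modes is Multimodal.
    Note: if every value occurs once, this simple sketch reports all of them."""
    found = modes(values)
    return found, LABELS.get(len(found), "Multimodal")

set_a = [36, 36, 12, 29, 35, 45, 50, 45, 45, 5]
set_b = [8, 7, 6, 5, 6, 9, 2, 3, 11, 11, 43, 10]
set_c = [39, 23, 25, 25, 63, 37, 45, 37, 48, 51, 28, 45, 50]

for data in (set_a, set_b, set_c):
    print(classify(data))
```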

Measures of Dispersion

A measure of dispersion/measure of variation is a quantity that measures the spread or variability of
the values in a given set of data.

The most commonly used measures of dispersion are the range, variance, and standard deviation.

The range, R, is the difference between the highest value (H) and lowest value (L) in the data set. That is,
R = H – L.

FIND THE RANGE:

In terms of the measure of central tendency, the students perform equally since they have the same average
rating of 80%. However, looking at the variability of their ratings, Student A has the highest range
compared to the other students. This shows that the scores of Student A are more dispersed than the others.
The ratings of Student A fluctuate, while those of Student B are more uniform. On the other hand,
Student C has a range equal to zero, so his ratings are all concentrated at the mean, indicating that the
distribution has no spread.

• The larger the value of the range, the more dispersed the observations are.

• The range considers only the extreme values or observations in the data set.
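
As a small illustration of R = H – L, the sketch below uses the five Math 01 scores from the mean example earlier in these notes. The variable names are illustrative only.

```python
# Scores from the Math 01 mean example earlier in the notes
scores = [44, 37, 41, 35, 32]

highest = max(scores)          # H
lowest = min(scores)           # L
data_range = highest - lowest  # R = H - L

print(highest, lowest, data_range)
```

Only the two extreme values enter the calculation, which is why the range ignores how the remaining observations are spread.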

STANDARD DEVIATION & VARIANCE

The standard deviation is the positive square root of the variance. The variance is the average of the
squared deviations of every observation from the mean.

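The unit of the variance is a squared unit, while that of the standard deviation is the same as the unit of
the data set. The following symbols are used to designate these measures for a population and a sample:

Population: variance σ², standard deviation σ
Sample: variance s², standard deviation s

VARIANCE AND STANDARD DEVIATION FOR UNGROUPED DATA:

Population variance: σ² = Σ(x - μ)² / N
Sample variance: s² = Σ(x - x̅)² / (n - 1)
Standard deviation: the positive square root of the variance (σ or s)
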

Example 1:

The following are the scores of a student in all her long exams in Calculus: 83, 80, 89, 78, and 70.
Calculate the standard deviation.

(Using the formula of population variance, the Variance is 38.8, Standard Deviation is 6.23)
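
The stated result can be traced step by step. This is a minimal sketch that treats the five scores as a population (dividing by N), as the parenthetical note above indicates; the variable names are illustrative.

```python
from math import sqrt

# Long exam scores in Calculus from Example 1
scores = [83, 80, 89, 78, 70]

n = len(scores)
mean = sum(scores) / n

# Squared deviation of every observation from the mean
squared_deviations = [(x - mean) ** 2 for x in scores]

# Population variance: average of the squared deviations
variance = sum(squared_deviations) / n
std_dev = sqrt(variance)

print(mean, squared_deviations, variance, round(std_dev, 2))
```

If the scores were treated as a sample instead, the sum of squared deviations would be divided by n - 1 rather than n, giving a slightly larger variance.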

VARIANCE AND STANDARD DEVIATION FOR GROUPED DATA

Example 2:

The table gives the frequency distribution of the daily commuting times (in minutes) from home to work
for all 25 employees of a company. Calculate the variance and standard deviation.

solution:

Measures of Position

A measure of position, or quantile, is a statistical measure that provides the specific location of an
observation relative to the other values when the data are in ranked order.

Percentiles, deciles, and quartiles are among the most commonly used measures of position.

Quartiles are the 3 score points which divide the data or distribution into four equal parts. These are the
first quartile (Q1), the second quartile (Q2) and the third quartile (Q3).

a) 25% of the data has a value ≤ Q1 or Lower Quartile

b) 50% of the data has a value ≤ Q2 or Median

c) 75% of the data has a value ≤ Q3 or Upper Quartile

The difference between the upper quartile (Q3) and the lower quartile (Q1) is called Interquartile Range
(IQR).

Formula: IQR = Q3 - Q1



Deciles are the 9 score points which divide the data into ten equal parts. These are the first decile (D1),
second decile (D2), third decile (D3), up to the ninth decile (D9).

Percentiles are the 99 score points which divide the data into 100 equal parts. They are used to characterize
values according to the percentage of observations below them.

Measures of Position of Ungrouped Data

FINDING THE VALUE OF QUARTILES OF UNGROUPED DATA USING THE MENDENHALL AND SINCICH METHOD

a) Lower Quartile (L) = Position of Q1 = 1/4 (n + 1), where n is the number of elements in the data set.

b) Upper Quartile (U) = Position of Q3 = 3/4 (n + 1), where n is the number of elements in the data set.

c) Median is the middle number after the data elements are arranged in increasing or decreasing
order. When there are two middle numbers, get their average by adding them and dividing the sum by 2.

Remember:

a) If the number of elements (n) in the data set is odd, there is only one middle number and this is the
median.

b) If the number of elements (n) in the data set is even, there are two middle numbers and you have to
compute their average to get the median.
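
The position formulas can be sketched in Python as below. The notes give only the formulas for L and U; the rounding rule used here (round to the nearest whole position, with a halfway value rounded up for Q1 and down for Q3) is the convention usually attached to this method and should be treated as an assumption. The function names are hypothetical, and the data are the book counts from the median example earlier in the notes.

```python
import math

def quartile_positions(n):
    """1-based positions of Q1 and Q3 among n ranked observations."""
    lower = (n + 1) / 4        # L = 1/4 (n + 1)
    upper = 3 * (n + 1) / 4    # U = 3/4 (n + 1)
    # Assumed rounding rule (not stated in the notes):
    # nearest integer, halfway rounds up for L and down for U
    q1_pos = math.floor(lower + 0.5)
    q3_pos = math.ceil(upper - 0.5)
    return q1_pos, q3_pos

def quartiles(values):
    """Return (Q1, Q3) read off the ranked data at those positions."""
    ordered = sorted(values)
    q1_pos, q3_pos = quartile_positions(len(ordered))
    return ordered[q1_pos - 1], ordered[q3_pos - 1]

books = [5, 2, 4, 6, 5, 10, 7, 6, 9, 8, 6]
q1, q3 = quartiles(books)
print(quartile_positions(len(books)), q1, q3, q3 - q1)
```

With n = 11 the positions are whole numbers (3 and 9), so no rounding is needed for this data set; the last printed value is the interquartile range Q3 - Q1.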

FINDING THE VALUE OF DECILE AND PERCENTILE OF UNGROUPED DATA



Measures of Position of Ungrouped Data



Box-and-Whisker Plot

A box-and-whisker plot, also known as a box plot, is a diagram representing the 5-point summary of a data set:
the lowest and the highest values, the values corresponding to Q1 and Q3, and the median.

Steps in the Construction of a Box-and-Whisker Plot

1. Arrange the values in increasing order.

2. Compute Q1, the median, and Q3.

3. Locate the five numbers (lowest and highest values, Q1, median, and Q3) on the number line
and draw a rectangle (box) above the scale covering Q1, the median, and Q3, then draw a line segment
across the box passing through the median.

4. Connect the box to the extreme values by a line segment (known as a whisker).
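
Steps 1 and 2 amount to computing the five numbers that the plot displays. The sketch below does this with Python's standard library. Note that `statistics.quantiles` with `method="inclusive"` interpolates between ranked values, which is a different quartile convention from the position formulas given earlier, so Q1 and Q3 may differ slightly between the two. The function name and the reuse of the book-count data are illustrative assumptions.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest) for a data set."""
    ordered = sorted(values)                      # Step 1: arrange the values
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")  # Step 2
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Book counts from the median example earlier in the notes
books = [5, 2, 4, 6, 5, 10, 7, 6, 9, 8, 6]
summary = five_number_summary(books)
print(summary)
```

The box spans the second and fourth numbers, the line inside the box sits at the third, and the whiskers reach out to the first and last.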

Stem-and-leaf display

A stem-and-leaf display is an organized diagram showing the relative position of every element in the
data set such that the leading digit(s) become the stem and the trailing digit(s) become the leaf.

Probability and the Normal Distribution

When most of the observations are near the “center” and the distribution of data is nearly similar on
both sides then the distribution is said to follow a normal distribution.

A normal distribution, also known as the Gaussian distribution, is a continuous probability distribution which
is drawn graphically as a smooth bell-shaped curve, called the normal curve, with a total area under it
equal to one.

Properties of normal distribution

1. The total area under the normal curve is one.

2. The three measures of central tendency given by the mean, median and mode are all equal.

3. It is symmetric with respect to the vertical line through the mean (x = μ).

4. The curve is asymptotic to the horizontal axis in both directions.

The proportion of values in a given data set which is normally distributed is based on the mean and the
standard deviation of the data set.

That is, about 68% of the observations fall within 1 standard deviation of the mean; about 95%
of the observations fall within 2 standard deviations of the mean; and about 99.7% of the
observations fall within 3 standard deviations of the mean.
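
These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

# Normal curve with mean 0 and standard deviation 1
z = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

Because the proportions depend only on the distance from the mean measured in standard deviations, the same areas hold for any mean and standard deviation.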

Standard normal distribution

A standard normal distribution is a distribution of a random variable with mean zero and standard
deviation equal to one. That is, Z ~ N(0, 1).

Rules in Finding the Areas Under the Normal Curve

Correlation and Linear Regression

CORRELATION AND LINEAR REGRESSION

Correlation and regression are two related statistical tools. Correlation is used to find out if there is a
relationship between two variables while regression is a means to predict or forecast the value of one
variable in terms of the other.

CORRELATION ANALYSIS

Correlation analysis is a method used to measure the degree of relationship or association between two or
more variables.

SCATTER DIAGRAM

A scatter diagram, also known as a scatter plot, is a pictorial presentation showing the relationship between
two variables. It shows the direction and shape of the association being conveyed. This is done by
plotting the points corresponding to the observations/data on the first quadrant of a rectangular
coordinate system.

TYPES OF CORRELATION

1. Positive correlation - a direct relationship exists between two variables. That is, as one variable
increases (decreases), the other also increases (decreases).

2. Negative correlation - an inverse relationship exists between the variables. Here, one variable
increases as the other decreases, or vice versa.

3. Zero correlation - exists when high or low scores in one variable are not systematically associated with
high or low scores in the other variable. It indicates that there is no correlation between the
variables. The points in the scatter diagram are scattered in a random manner.

REMARKS:

The relationship between two variables may be described by its magnitude or its strength. In terms of
strength, the correlation may be perfect, high, moderate, or low. In a perfect correlation, all points in the
scatter diagram lie on a straight line.

The degree or strength of relationship between two variables may also be described by computing a
single number called the correlation coefficient.

THE PEARSON CORRELATION COEFFICIENT (r)

The correlation coefficient (r) may be interpreted using the correlation scale shown below
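
The notes name the Pearson correlation coefficient without showing how it is computed. The sketch below uses the standard product-moment formula, r = Σ(x - x̅)(y - y̅) / √(Σ(x - x̅)² · Σ(y - y̅)²). The function name and the data values are hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations (not from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))
```

The value of r always lies between -1 and 1; the sign gives the direction of the relationship and the magnitude gives its strength.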

REGRESSION

Regression describes the process of estimating the relationship between two variables. The
relationship is estimated by fitting a straight line through the given data. The least squares method is
useful in determining the equation of the line that best fits the data.
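
A minimal sketch of the least squares method for a line y = a + bx is shown below. It uses the standard formulas b = Σ(x - x̅)(y - y̅) / Σ(x - x̅)² and a = y̅ - b·x̅, which are not written out in the notes. The function name and the data values are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical paired observations (not from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))

# Predict y for a new x value using the fitted line
predicted = a + b * 6
print(round(predicted, 2))
```

Once a and b are known, forecasting the value of one variable from the other is a matter of substituting into the fitted equation.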

Chi-square

CHI-SQUARE

Chi-square (χ²) – a statistical method used to determine if there is a significant association between two
categorical variables or if a sample matches the population distribution.

TWO TYPES OF CHI-SQUARE TESTS

Chi-Square Test for Independence: This test checks if two categorical variables are related or
independent.

Example: Is gender related to voting preference?

Chi-Square Goodness-of-Fit Test: This test determines if the observed sample distribution fits an
expected probability distribution.

Example: Do dice rolls match the expected uniform distribution of a fair die?

STEPS IN CALCULATING CHI-SQUARE STATISTIC

Step 1: State the null and alternative hypotheses.

Step 2: Calculate the Chi-square test statistic.

for Test for Independence: (1) Calculate Expected Frequencies

Expected Frequency = (Row Total × Column Total) / Grand Total

(2) Chi-square Statistic

χ² = Σ((Observed - Expected)² / Expected)

(3) Degrees of Freedom (df)

df = (Number of Rows - 1) × (Number of Columns - 1)

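for Goodness-of-Fit Test: (1) Calculate Expected Frequencies

Expected Frequency = Total Sample Size × Expected Proportion of the category
(when all categories are equally likely, this is Total Sample Size / number of categories)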



(2) Chi-square Statistic

χ² = Σ((Observed - Expected)² / Expected)

(3) Degrees of Freedom (df)

df = number of categories - 1

Step 3: Find the critical value

Find the critical value in a Chi-square table using df and significance level (usually 0.05).

Step 4: Conclusion

If χ² > critical value, reject the null hypothesis.

If χ² ≤ critical value, fail to reject the null hypothesis.

For example:

Chi-Square Test for Independence

Scenario: A researcher wants to know if there is an association between gender and preference for a
particular type of book (fiction vs. non-fiction). She collects data from a random sample of 100 people
and organizes it into a table:

            Fiction   Non-Fiction   Total
Male           20          30         50
Female         25          25         50
Total          45          55        100

Research Question:
Is there a statistically significant association between gender and book preference?

Step-by-Step Solution:

State the Hypotheses:

o Null Hypothesis (H₀): There is no association between gender and book preference (they
are independent).

o Alternative Hypothesis (H₁): There is an association between gender and book
preference.

Calculate Expected Frequencies

Expected Frequency = (Row Total × Column Total) / Grand Total



For example, the expected count for males who prefer fiction:

Expected Frequency = (50 × 45) / 100 = 22.5

Using this formula, we fill in the expected counts for each cell:

            Fiction   Non-Fiction
Male          22.5        27.5
Female        22.5        27.5

Compute the Chi-Square Statistic:

Calculating for each cell: χ² = Σ((Observed - Expected)² / Expected)

Male, Fiction: ((20−22.5)² /22.5) = 0.278

Male, Non-Fiction: ((30−27.5)² / 27.5) = 0.227

Female, Fiction: ((25−22.5)² / 22.5) = 0.278

Female, Non-Fiction: ((25−27.5)² / 27.5) = 0.227

Summing these values gives the chi-square statistic:

χ² = 0.278 + 0.227 + 0.278 + 0.227 = 1.01

Degrees of Freedom (df): df = (Number of Rows - 1) × (Number of Columns - 1)

df = (2−1) × (2−1) = 1

Compare to the Critical Value:

At a 5% significance level (α = 0.05) and 1 degree of freedom, the critical value for chi-
square is 3.841. Since our calculated chi-square statistic (1.01) is less than 3.841, we fail to reject
the null hypothesis.

Conclusion:

There is no significant association between gender and book preference in this sample;
any difference observed is likely due to chance.
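
The whole test for independence can be sketched in plain Python. The observed counts are the ones used in the cell-by-cell calculation above, and the critical value 3.841 is taken from the notes rather than computed; the variable names are illustrative.

```python
# Observed counts: rows = Male, Female; columns = Fiction, Non-Fiction
observed = [[20, 30],
            [25, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected Frequency = (Row Total × Column Total) / Grand Total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# χ² = Σ((Observed - Expected)² / Expected), summed over every cell
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# df = (Number of Rows - 1) × (Number of Columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 3.841  # from a chi-square table, α = 0.05, df = 1
reject_null = chi_square > critical_value

print(expected, round(chi_square, 2), df, reject_null)
```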


Chi-Square Goodness-of-Fit Test

Scenario: A candy company claims that the colors of their candies are distributed as follows:

• Red: 30%

• Blue: 20%

• Green: 25%

• Yellow: 25%

A researcher suspects that the actual distribution might differ, so they randomly select 200
candies and record the following counts:

Color     Observed Count
Red             55
Blue            30
Green           60
Yellow          55
Total          200

Research Question:
Is there a significant difference between the observed distribution of candy colors and the
company’s claimed distribution?

Step-by-Step Solution:

State the Hypotheses:

o Null Hypothesis (H₀): The observed distribution matches the claimed distribution by the
company.

o Alternative Hypothesis (H₁): The observed distribution does not match the claimed
distribution.

Calculate Expected Frequencies


Expected Frequency = Total Sample Size × Claimed Proportion

Since the total sample size is 200 candies, we calculate the expected counts for each
color based on the claimed percentages:

• Expected count for Red: 200 × 0.30 = 60


• Expected count for Blue: 200 × 0.20 = 40
• Expected count for Green: 200 × 0.25 = 50
• Expected count for Yellow: 200 × 0.25 = 50

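Therefore, the observed and expected counts are:

Color     Observed   Expected
Red          55         60
Blue         30         40
Green        60         50
Yellow       55         50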

Compute Chi-square Statistic

χ² = Σ((Observed - Expected)² / Expected)

Calculating for each color:

Red: ((55−60)² / 60) = 0.42

Blue: ((30−40)² / 40) = 2.5

Green: ((60−50)² / 50) = 2.0

Yellow: ((55−50)² / 50) = 0.5

χ² = 0.42 + 2.5 + 2.0 + 0.5 = 5.42

Degrees of Freedom (df):

df = number of categories – 1

= 4 -1 = 3

Compare to the Critical Value:

At a 5% significance level (α = 0.05) and 3 degrees of freedom, the critical value for chi-
square is approximately 7.815. Since our calculated chi-square statistic (5.42) is less than 7.815,
we fail to reject the null hypothesis.

Conclusion:
There is no significant difference between the observed distribution of candy colors and
the company’s claimed distribution. This means that the observed color distribution is consistent
with the company’s claim.
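
The goodness-of-fit calculation can be sketched the same way. The claimed proportions and observed counts come from the example above, and the critical value 7.815 is taken from the notes rather than computed; the variable names are illustrative.

```python
# Claimed proportions and observed counts from the candy example
claimed = {"Red": 0.30, "Blue": 0.20, "Green": 0.25, "Yellow": 0.25}
observed = {"Red": 55, "Blue": 30, "Green": 60, "Yellow": 55}

total = sum(observed.values())

# Expected count = total sample size × claimed proportion
expected = {color: total * p for color, p in claimed.items()}

# χ² = Σ((Observed - Expected)² / Expected), summed over the colors
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

df = len(observed) - 1      # number of categories - 1

critical_value = 7.815      # from a chi-square table, α = 0.05, df = 3
reject_null = chi_square > critical_value

print(expected, round(chi_square, 2), df, reject_null)
```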