Deck 1- Data Types, Data Display, and Summary 2024F

The document outlines various types of data, including univariate vs. multivariate and quantitative vs. qualitative distinctions. It discusses the importance of understanding populations and samples, as well as different statistical measures such as mean, median, and variability. Additionally, it covers data display methods like bar charts and histograms, and introduces descriptive statistics for summarizing data.

Uploaded by

nithila4140
Copyright
© © All Rights Reserved

Data Types, Data Display, and Summary Statistics

Professor Turetsky
Deck 1
Types of Data

2
Types of Data

• Univariate or Multivariate?
– Univariate: one fact for each object in a dataset (“one column in a spreadsheet”)
– Multivariate: two or more facts for each object in a dataset (“many columns in a spreadsheet”)

3
Types of Data

• Quantitative or Qualitative?
– Quantitative: presented as numbers permitting arithmetic
• Interest rate
• Temperature
– Qualitative (categorical): everything else
• Country of birth
• Supplier/Vendor

4
Types of Data (Quantitative)

• Discrete or Continuous?
– Discrete: counted
• Cars sold
• Number of children
– Continuous: measured (always allow “in-between” values)
• Gallons of oil sold
• Temperature

– What about age? Money?

5
Types of Data (Qualitative)

• Nominal Data
– Definition: “Qualitative data that has no ordering”
– Example:

6
Types of Data (Qualitative)

• Ordinal Data
– Definition: “Qualitative data that has an ordering”
– Example – Likert Scale:

disagree strongly  disagree  neutral  agree  agree strongly


– Often “measure” with numbers:
1 = disagree strongly
2 = disagree

5 = agree strongly
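
As a rough illustration of coding an ordinal scale with numbers, the sketch below maps the five Likert labels to 1 through 5. The middle codes (3 = neutral, 4 = agree) are inferred from the order of the scale shown above, and the `responses` list is made up for the example.

```python
from statistics import median

# Ordered labels from the slide; codes 3 and 4 are inferred from the ordering.
LIKERT = ["disagree strongly", "disagree", "neutral", "agree", "agree strongly"]
CODE = {label: i + 1 for i, label in enumerate(LIKERT)}

# Hypothetical survey responses, invented for illustration.
responses = ["agree", "neutral", "agree strongly", "disagree", "agree"]
codes = [CODE[r] for r in responses]

# The codes preserve the ordering, so order-based summaries such as the
# median are meaningful for ordinal data.
median_code = median(codes)
median_label = LIKERT[median_code - 1]
print(codes, median_label)
```

The gaps between codes are not necessarily equal in meaning, so arithmetic on the codes (for example, averaging them) should be interpreted with care.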

7
Types of Data (Time Series)

• Time Series or Cross-Sectional?


– Time series: when time sequencing is important
• US historical inflation rates
• A baby’s weight
– Cross-sectional: data are contemporaneous, or time doesn’t matter
• Inflation rates for several countries
• Weight at birth

8
Population vs. Sample

• Population
– Entire Set of Observations
• The entire population is assumed to be fixed.
• Often hypothetical since we generally cannot collect the entire population
• However, must be defined
• Sample
– A subset of observations from the population
• Easier to collect
• Sample needs to be representative of the population.
• Ultimately, we want to infer something about the population.

9
Random Sample

• Random Sample:
– All elements of the population have the same chance of being chosen in the sample.
• Hence, our sample will tend to be representative of the population (no systematic bias in who is chosen).
• There are various types of random sampling procedures.

10
Population vs Sample

• Population is Fixed
• Sample is Random
• Parameter: Numerical Summary Measure of Population
• Statistic: Numerical Summary Measure of Sample
• Parameters are fixed while Statistics are RANDOM.
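
A small simulation can make "parameters are fixed, statistics are random" concrete. This is only a sketch: the population below is invented, and the sample size of 30 and the seed are arbitrary choices.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 1,000 values (invented for illustration).
population = [random.uniform(50, 150) for _ in range(1000)]
mu = mean(population)  # parameter: computed once from the whole population

# Each random sample (drawn without replacement) yields its own statistic.
sample_means = [mean(random.sample(population, 30)) for _ in range(5)]

print(f"population mean (parameter): {mu:.2f}")
for k, x_bar in enumerate(sample_means, start=1):
    print(f"sample {k} mean (statistic):  {x_bar:.2f}")
```

The population mean prints once and never changes, while the five sample means differ from one another because each depends on which observations happened to be drawn.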

11
Estimator vs. Estimate

• Estimator:
– A statistical method or formula used to calculate an estimate for a population parameter
• Estimate:
– The specific value that is calculated using an estimator on the sample data
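
One way to keep the two terms apart is to think of the estimator as a function and the estimate as its return value. A minimal sketch, using the sample mean as the estimator and a made-up three-value sample:

```python
def sample_mean(values):
    """Estimator: a rule that turns any sample into a number."""
    return sum(values) / len(values)

# Hypothetical sample, invented for illustration.
sample = [4.0, 7.0, 10.0]

# Estimate: the specific value the estimator produces for this sample.
estimate = sample_mean(sample)
print(estimate)  # 7.0
```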

12
Descriptive Statistics:
Graphical Summaries

13
Data Display

• In an MBA class, we know about each student:


– First name
– Last name
– Age
– Undergraduate major
– Work experience (in years)
– Annual Compensation

14
Dataset

15
The Bar Chart

Horizontal axis is qualitative
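
The bar heights in a bar chart are simply category counts. The sketch below tallies a qualitative variable and prints text bars; the list of majors is invented, since the class dataset is not reproduced here.

```python
from collections import Counter

# Hypothetical undergraduate majors, invented for illustration.
majors = ["Finance", "Engineering", "Finance", "Economics", "Engineering", "Finance"]

counts = Counter(majors)  # category -> frequency (the bar heights)

# Text rendering of the bars, most common category first.
for major, count in counts.most_common():
    print(f"{major:<12} {'#' * count} ({count})")
```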

16
The Stacked Bar Chart

17
The Histogram

• Horizontal axis is quantitative


• No space between columns
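
A histogram's column heights are counts of observations per interval of the quantitative axis. The sketch below bins the nine compensation figures from the sample mean slide; the bin width of 5 and the starting point of 85 are assumptions made for this example.

```python
# Compensation figures (category "Other") from the sample mean slide.
data = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

# Assumed bins: width 5, starting at 85, each of the form [left, left + 5).
left_edges = [85, 90, 95]
width = 5

counts = [sum(1 for x in data if left <= x < left + width) for left in left_edges]

# Adjacent intervals share an edge, which is why the columns touch.
for left, count in zip(left_edges, counts):
    print(f"[{left}, {left + width}) {'#' * count}")
```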
18
The Histogram
• With a histogram, can detect
– Skewness:

– Outliers:

– Bimodal distribution:

19
A Bit More on Skewness

20
The Scatter Plot

21
The Scatter Plot
• Positive (linear) relation:

• Negative (linear) relation:

• No relation:

22
Time Series

23
Descriptive Statistics:
Numerical Summaries

24
The Sample Mean
• Compensation Data from MBA students: Category - Other
– n = 9 numbers:
86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18

x_1, x_2, ..., x_n (here n = 9)

• Sample Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 93.58$

25
Population Mean

26
The Median

• Median = the middle value in a sorted dataset

86.24, 88.77, 88.95, 94.62, 95.42, 95.60, 95.64, 97.18, 99.80

• When n is odd, take “true” middle value


• When n is even, take average of two middle values
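
The odd/even rule can be sketched as a short function. The odd case uses the nine values above; the even case is produced by dropping the last listed value (97.18), which is done purely for illustration.

```python
from statistics import median

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

def middle_value(values):
    """Median: sort, then take the middle value (or average the two middle values)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2  # even n: average of the two middle values

odd_case = middle_value(x)        # n = 9, middle value is 95.42
even_case = middle_value(x[:-1])  # n = 8, average of 94.62 and 95.42
print(odd_case, round(even_case, 2))
```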

27
Mean vs. Median

• Mean is
– Sensitive to outliers
– Includes “more information”
• Median is
– Useful when ranking is important
– Important in demographics
• Other “typical value” measures
– Mode = the most common value
– Trimmed mean (ignore upper and lower x% of data)
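
To see the mean's sensitivity to outliers, the sketch below adds one extreme value to the compensation data and compares the mean, the median, and a trimmed mean. The value 250 is invented, and trimming one observation from each end (rather than a percentage) is an assumption made to keep the example short.

```python
from statistics import mean, median

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
x_out = x + [250.0]  # hypothetical outlier, invented for illustration

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    s = sorted(values)
    return mean(s[k:len(s) - k])

print(f"mean:    {mean(x):.2f} -> {mean(x_out):.2f}")
print(f"median:  {median(x):.2f} -> {median(x_out):.2f}")
print(f"trimmed: {trimmed_mean(x, 1):.2f} -> {trimmed_mean(x_out, 1):.2f}")
```

The mean moves by more than 15 units, while the median and the trimmed mean barely change, because neither of them uses the size of the extreme value.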

28
Variability: Is it Good or Bad?

[Figure: Histogram of Price, paneled by Before/After (panels 0 and 1). Horizontal axis: Price (20 to 70); vertical axis: Frequency (0 to 40).]

29
Measuring Variability

30
Population Variance and Standard Deviation

31
Sample Variance: Calculating $s^2$

Columns: $x_i$, $(x_i - \bar{x})$

$$s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$$

$$s^2 = \frac{1}{n-1} \left[ \sum_i x_i^2 - n\bar{x}^2 \right]$$

32
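
The two ways of calculating the sample variance (squared deviations from the mean, and the sum-of-squares shortcut) should agree. The sketch below checks this on the compensation data from the sample mean slide and compares both with the standard library's `statistics.variance`, which also divides by n - 1.

```python
import math
from statistics import mean, variance

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
n = len(x)
x_bar = mean(x)

# Definitional form: squared deviations from the mean, divided by n - 1.
s2_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut form: (sum of squares - n * x_bar^2), divided by n - 1.
s2_short = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2_def)  # sample standard deviation

print(round(s2_def, 3), round(s2_short, 3), round(variance(x), 3))
print(round(s, 3))
```

The shortcut form subtracts two large, nearly equal numbers, so in floating-point arithmetic it can lose precision on data with a large mean and a small spread; the definitional form is usually the safer one to code.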
Sample Variance vs Population Variance

$$s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$$

33
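
A commonly given reason for dividing by n - 1 in the sample variance is that, on average over many random samples, it matches the population variance, whereas dividing by n tends to come out too small. The simulation below is a sketch of that idea under invented conditions: the population (the integers 1 to 100), the sample size of 5, and the number of repetitions are all arbitrary choices.

```python
import random
from statistics import pvariance

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population, invented for illustration.
population = list(range(1, 101))
sigma2 = pvariance(population)  # population variance (divides by N)

n, reps = 5, 20000
avg_div_n_minus_1 = 0.0
avg_div_n = 0.0
for _ in range(reps):
    sample = random.choices(population, k=n)  # sampling with replacement
    x_bar = sum(sample) / n
    ss = sum((xi - x_bar) ** 2 for xi in sample)  # sum of squared deviations
    avg_div_n_minus_1 += ss / (n - 1) / reps
    avg_div_n += ss / n / reps

print(f"population variance:        {sigma2:.1f}")
print(f"average of ss / (n - 1):    {avg_div_n_minus_1:.1f}")
print(f"average of ss / n:          {avg_div_n:.1f}")
```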
Using Mean and Standard Deviation

• Empirical Rule (assuming distribution is somewhat bell shaped and symmetric)


• Approximately 68% of observations are within 1 standard deviation of the mean
• Approximately 95% of observations are within 2 standard deviations of the mean
• Approximately 99.7% of observations are within 3 standard deviations of the mean
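
The empirical rule can be checked by simulation. The sketch below draws bell-shaped data from a normal distribution and measures the share of observations within 1, 2, and 3 standard deviations of the mean; the mean of 100, standard deviation of 15, and sample size are arbitrary choices for the example.

```python
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical bell-shaped data, invented for illustration.
data = [random.gauss(100, 15) for _ in range(100_000)]
m, s = mean(data), stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.3f}")
```

For skewed or heavy-tailed data the shares can differ noticeably from these figures, which is why the rule carries the bell-shape assumption.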

34
Z-transformation
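
The formula on this slide is not reproduced in the text. A common definition, assumed here, is z_i = (x_i - x̄) / s: subtract the sample mean and divide by the sample standard deviation. A minimal sketch on the compensation data from the sample mean slide:

```python
from statistics import mean, stdev

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
x_bar, s = mean(x), stdev(x)

# Assumed definition: z_i = (x_i - x_bar) / s
z = [(xi - x_bar) / s for xi in x]

print([round(zi, 2) for zi in z])
```

Under this definition the transformed values have mean 0 and standard deviation 1, so each z-score reads as "how many standard deviations above or below the mean".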

35
Quartiles

• First quartile = the value in position (n+1)/4 in a sorted dataset


– Also called QL, 25th percentile, Q1
• Median is the second quartile
• Third quartile = the value in position 3(n+1)/4 in a sorted dataset
– Also called QU, 75th percentile, Q3
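
For the nine compensation figures, the positions (n+1)/4 = 2.5 and 3(n+1)/4 = 7.5 are not whole numbers. The slide does not say how to handle that case; the sketch below assumes linear interpolation between the two neighboring sorted values, which is one common convention.

```python
import math
from statistics import quantiles

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

def value_at_position(sorted_values, pos):
    """Value at a 1-based position; fractional positions are interpolated (an assumption)."""
    lo = math.floor(pos)
    frac = pos - lo
    if frac == 0:
        return sorted_values[lo - 1]
    return sorted_values[lo - 1] + frac * (sorted_values[lo] - sorted_values[lo - 1])

s = sorted(x)
n = len(s)
q_l = value_at_position(s, (n + 1) / 4)      # position 2.5
q_2 = value_at_position(s, (n + 1) / 2)      # position 5, the median
q_u = value_at_position(s, 3 * (n + 1) / 4)  # position 7.5

print(round(q_l, 2), q_2, round(q_u, 2))  # 88.86 95.42 96.41
```

For this dataset, `statistics.quantiles(x, n=4)` with its default `method="exclusive"` gives the same three values. Other software may use different interpolation rules and report slightly different quartiles.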

36
Measures of Variability

• Range = max – min (very sensitive to outliers)


• σ ≡ population standard deviation and s ≡ sample standard deviation (sensitive to outliers, why?)
• IQR = QU – QL (not sensitive to outliers)
– Can be used to detect outliers.
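
The different sensitivities can be seen by adding one extreme value to the compensation data from the sample mean slide and recomputing each measure. The value 250 is invented for illustration, and the quartiles come from `statistics.quantiles` with its default method, which is an assumption about how QL and QU are computed.

```python
from statistics import quantiles, stdev

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
x_out = x + [250.0]  # hypothetical outlier, invented for illustration

def spread(values):
    """Range, sample standard deviation, and IQR of a dataset."""
    q_l, _, q_u = quantiles(values, n=4)
    return {"range": max(values) - min(values), "s": stdev(values), "IQR": q_u - q_l}

before, after = spread(x), spread(x_out)
for name in before:
    print(f"{name:>5}: {before[name]:7.2f} -> {after[name]:7.2f}")
```

The range jumps because it depends only on the two most extreme values. The standard deviation jumps because it is built from squared deviations from the mean, so a single distant point contributes a very large term. The IQR changes little because it depends only on values near the middle of the sorted data.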

37
The Box Plot

• Inter-Quartile Range (IQR) = QU – QL


• Inner fences:
– Lower inner fence = QL – 1.5 IQR
– Upper inner fence = QU + 1.5 IQR
• Outer fences:
– Lower outer fence = QL – 3.0 IQR
– Upper outer fence = QU + 3.0 IQR
• Data outside inner fence = outlier
• Data outside outer fence = serious outlier
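
The fence rules translate directly into a small classifier. The sketch below computes the fences for the compensation data from the sample mean slide, using `statistics.quantiles` for QL and QU (an assumed quartile convention), and then tests two made-up values, 110 and 125.

```python
from statistics import quantiles

x = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

q_l, _, q_u = quantiles(x, n=4)
iqr = q_u - q_l

inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)  # (lower, upper) inner fences
outer = (q_l - 3.0 * iqr, q_u + 3.0 * iqr)  # (lower, upper) outer fences

def classify(value):
    """Label a value using the inner and outer fences."""
    if value < outer[0] or value > outer[1]:
        return "serious outlier"
    if value < inner[0] or value > inner[1]:
        return "outlier"
    return "not an outlier"

print("inner fences:", [round(f, 3) for f in inner])
print("outer fences:", [round(f, 3) for f in outer])
print([classify(v) for v in x])
# Hypothetical extra values, invented for illustration.
print(classify(110.0), "|", classify(125.0))
```

None of the nine observed values falls outside the inner fences, so this dataset has no outliers by this rule; the two invented values land outside the inner and outer upper fences respectively.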

38
The Box Plot
• Box edges (“hinges”) at QU and QL; mark median with bar
• Draw “whiskers” from edges to smallest and largest non-outliers

39
The Box Plot

40
The Box Plot

41
Transformations

• Often we want to convert a skewed dataset to a more symmetric one

• Can use the Log transformation
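
The effect of the log transformation can be sketched on simulated right-skewed data. Everything below is an assumption for illustration: the data are drawn from a lognormal distribution, and skewness is measured with the moment coefficient (third central moment divided by the 1.5 power of the second), which the slides do not define.

```python
import math
import random
from statistics import mean

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical right-skewed, strictly positive data, invented for illustration.
raw = [random.lognormvariate(0, 1) for _ in range(10_000)]
logged = [math.log(v) for v in raw]  # log requires values > 0

def skewness(values):
    """Moment coefficient of skewness: m3 / m2^1.5 (0 for a symmetric dataset)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

print(f"skewness before log: {skewness(raw):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

The log compresses large values much more than small ones, which pulls in a long right tail. It only applies to strictly positive data.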

42
