Deck 1- Data Types, Data Display, and Summary 2024F
Professor Turetsky
Deck 1
Types of Data
Types of Data
• Univariate or Multivariate?
– Univariate: one fact for each object in a dataset (“one column in a spreadsheet”)
– Multivariate: two or more facts for each object in a dataset (“many columns in a spreadsheet”)
Types of Data
• Quantitative or Qualitative?
– Quantitative: presented as numbers permitting arithmetic
• Interest rate
• Temperature
– Qualitative (categorical): everything else
• Country of birth
• Supplier/Vendor
Types of Data (Quantitative)
• Discrete or Continuous?
– Discrete: counted
• Cars sold
• Number of children
– Continuous: measured (always allow “in-between” values)
• Gallons of oil sold
• Temperature
Types of Data (Qualitative)
• Nominal Data
– Definition: “Qualitative data that has no ordering”
– Example:
Types of Data (Qualitative)
• Ordinal Data
– Definition: “Qualitative data that has an ordering”
– Example – Likert Scale:
Types of Data (Time Series)
Population vs. Sample
• Population
– Entire Set of Observations
• The entire population is assumed to be fixed.
• Often hypothetical since we generally cannot collect the entire population
• However, must be defined
• Sample
– A subset of observations from the population
• Easier to collect
• Sample needs to be representative of the population.
• Ultimately, we want to infer something about the population.
Random Sample
• Random Sample:
– All elements of the population have the same chance of being chosen in the sample.
• Hence, no element is systematically favored, and the sample tends to be representative of the population.
• There are various types of random sampling procedures.
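A minimal sketch of one such procedure, a simple random sample without replacement, using Python's standard `random` module. The population of 1,000 ID numbers, the sample size of 50, and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population: ID numbers for 1,000 objects (illustrative only).
population = list(range(1, 1001))

# Fixed seed so the draw can be reproduced; any seed would do.
rng = random.Random(2024)

# Simple random sample without replacement: every element has the same
# chance (50 / 1000) of ending up in the sample.
sample = rng.sample(population, k=50)

print(len(sample), len(set(sample)))  # sample size and number of distinct IDs
```

Other designs (stratified, cluster, systematic) change how elements are selected and are not covered by this sketch.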
Population vs Sample
• Population is Fixed
• Sample is Random
• Parameter: Numerical Summary Measure of Population
• Statistic: Numerical Summary Measure of Sample
• Parameters are fixed while Statistics are RANDOM.
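The contrast can be illustrated with a small simulation. This is only a sketch: the population below is generated artificially (10,000 draws from a normal distribution with mean 50 and standard deviation 10), and the sample size of 30 is an arbitrary assumption.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population of 10,000 values (made up for illustration).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Parameter: computed once from the whole population; it does not change.
mu = statistics.fmean(population)

# Statistic: recomputed on each new random sample; it changes from draw to draw.
sample_means = [
    statistics.fmean(rng.sample(population, k=30)) for _ in range(5)
]

print("population mean:", mu)
print("sample means:   ", sample_means)
```

Each of the five sample means is a different number, while the population mean stays the same no matter how many samples are drawn.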
Estimator vs. Estimate
• Estimator:
– A statistical method or formula used to calculate an estimate for a population parameter
• Estimate:
– The specific value that is calculated using an estimator on the sample data
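In code terms, one way to picture the distinction is that the estimator is a function and the estimate is the value it returns for a particular sample. The function name `sample_mean` and the three data values are hypothetical, chosen only for illustration.

```python
def sample_mean(values):
    """Estimator: a rule that can be applied to any sample."""
    return sum(values) / len(values)


# Hypothetical sample (numbers invented for illustration).
sample = [4.0, 7.0, 10.0]

# Estimate: the specific number the estimator produces for this sample.
estimate = sample_mean(sample)
print(estimate)  # 7.0
```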
Descriptive Statistics:
Graphical Summaries
Data Display
Dataset
The Bar Chart
The Stacked Bar Chart
The Histogram
– Outliers:
– Bimodal distribution:
A Bit More on Skewness
The Scatter Plot
The Scatter Plot
• Positive (linear) relation:
• No relation:
Time Series
Descriptive Statistics:
Numerical Summaries
The Sample Mean
• Compensation Data from MBA students: Category - Other
– n = 9 numbers:
x1 = 86.24, x2 = 94.62, x3 = 95.64, x4 = 95.60, x5 = 95.42, x6 = 88.77, x7 = 88.95, x8 = 99.80, xn = x9 = 97.18
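As a worked example, the sample mean of these nine values can be computed directly, assuming the usual definition $\bar{x} = \frac{1}{n}\sum_{i} x_i$. The variable names are illustrative.

```python
# The nine compensation values listed above.
compensation = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

n = len(compensation)
x_bar = sum(compensation) / n  # add the values, then divide by how many there are

print(n, round(x_bar, 2))  # 9 93.58
```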
Population Mean
The Median
Mean vs. Median
• Mean is
– Sensitive to outliers
– Includes “more information”
• Median is
– Useful when ranking is important
– Important in demographics
• Other “typical value” measures
– Mode = the most common value
– Trimmed mean (ignore upper and lower x% of data)
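A short sketch of these measures on the nine compensation values from the sample-mean slide. The extra value 500.0 is a hypothetical outlier added only to show how each measure reacts, and `trimmed_mean` is an illustrative helper rather than a standard-library function.

```python
import statistics

data = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
with_outlier = data + [500.0]  # hypothetical extreme value, for illustration


def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the values.

    `fraction` should be below 0.5 so that some data remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points dropped at each end
    return statistics.fmean(ordered[k:len(ordered) - k])


mean_before, median_before = statistics.fmean(data), statistics.median(data)
mean_after, median_after = (statistics.fmean(with_outlier),
                            statistics.median(with_outlier))
trimmed_after = trimmed_mean(with_outlier, 0.10)

print(mean_before, median_before)   # mean and median without the outlier
print(mean_after, median_after)     # the mean moves a lot, the median barely
print(trimmed_after)                # trimming discards the extreme value
print(statistics.mode([2, 3, 3, 5]))  # 3, the most common value
```

With the outlier included, the mean rises from about 93.6 to about 134.2, while the median only moves from 95.42 to 95.51.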
Variability: Is it Good or Bad?
[Figure: Histogram of Price in two panels (panel variable: Before/After, levels 0 and 1); x-axis: Price, y-axis: Frequency]
Measuring Variability
Population Variance and Standard Deviation
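Sample Variance: Calculating $s^2$
• Deviation of each observation from the sample mean: $(x_i - \bar{x})$; squared deviation: $(x_i - \bar{x})^2$
• Definition: $s^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2$
• Shortcut form: $s^2 = \frac{1}{n-1} \left( \sum_{i} x_i^2 - n\bar{x}^2 \right)$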
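A sketch that computes the sample variance two ways: from the squared deviations about the sample mean divided by n - 1, and from the shortcut that subtracts n times the squared mean from the sum of squared values before dividing by n - 1. The data are the nine compensation values from the sample-mean slide; the variable names are illustrative.

```python
import math
import statistics

data = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]
n = len(data)
x_bar = sum(data) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Shortcut: (sum of squares - n * mean^2), divided by n - 1.
s2_shortcut = (sum(x * x for x in data) - n * x_bar ** 2) / (n - 1)

# The sample standard deviation is the square root of the sample variance.
s = math.sqrt(s2_definition)

print(s2_definition, s2_shortcut, s)
print(statistics.variance(data))  # the standard library also divides by n - 1
```

The two forms agree up to floating-point rounding. The shortcut subtracts two large numbers, so the definition form is usually the safer one to use in software.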
Sample Variance vs Population Variance
$s^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2$
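One common motivation for dividing by n - 1 in the sample variance can be checked by brute force. The sketch below assumes a hypothetical four-element population and enumerates every ordered sample of size 2 drawn with replacement; it then compares the average of the divide-by-(n - 1) and divide-by-n versions with the population variance.

```python
from itertools import product
import statistics

population = [1, 2, 3, 4]  # hypothetical tiny population, for illustration

# Population variance: squared deviations from the population mean, divided by N.
sigma2 = statistics.pvariance(population)

# Every possible ordered sample of size 2, drawn with replacement (16 in total).
samples = list(product(population, repeat=2))

# Average over all samples of the variance computed with divisor n - 1 ...
avg_div_n_minus_1 = statistics.fmean([statistics.variance(s) for s in samples])
# ... and with divisor n (pvariance applied to the sample).
avg_div_n = statistics.fmean([statistics.pvariance(s) for s in samples])

print(sigma2, avg_div_n_minus_1, avg_div_n)  # 1.25 1.25 0.625
```

Averaged over all possible samples, the n - 1 version matches the population variance, while the divide-by-n version comes out too small.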
Using Mean and Standard Deviation
Z-transformation
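A minimal sketch, assuming the usual definition of the z-transformation for sample data, $z_i = (x_i - \bar{x}) / s$: subtract the sample mean and divide by the sample standard deviation. The data are the nine compensation values from the sample-mean slide.

```python
import statistics

data = [86.24, 94.62, 95.64, 95.60, 95.42, 88.77, 88.95, 99.80, 97.18]

x_bar = statistics.fmean(data)
s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = [(x - x_bar) / s for x in data]

print([round(z, 2) for z in z_scores])
```

After the transformation the values have mean 0 and standard deviation 1, which makes observations measured on different scales comparable.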
Quartiles
Measures of Variability
The Box Plot
The Box Plot
• Box edges (“hinges”) at QU and QL; mark median with bar
• Draw “whiskers” from edges to smallest and largest non-outliers
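The ingredients of a box plot can be sketched numerically. Two assumptions are made here that the slide does not state: outliers are flagged with the common 1.5 × IQR rule, and quartiles use the default ("exclusive") method of `statistics.quantiles`, which may differ slightly from the deck's or a given software package's convention. The data are hypothetical.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 5, 6, 7, 8, 9, 11, 30]

# Lower quartile, median, upper quartile (box edges and the bar inside the box).
q_lower, median, q_upper = statistics.quantiles(data, n=4)

# Assumed outlier rule: more than 1.5 * IQR beyond either box edge.
iqr = q_upper - q_lower
low_fence = q_lower - 1.5 * iqr
high_fence = q_upper + 1.5 * iqr

outliers = [x for x in data if x < low_fence or x > high_fence]
non_outliers = [x for x in data if low_fence <= x <= high_fence]

# Whiskers run from the box edges to the smallest and largest non-outliers.
whisker_low, whisker_high = min(non_outliers), max(non_outliers)

print(q_lower, median, q_upper)   # 4.5 7.0 10.0
print(whisker_low, whisker_high)  # 2 11
print(outliers)                   # [30]
```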
The Box Plot
The Box Plot
Transformations