Module 1 - Descriptive Stats

Uploaded by

jennylehuynh29
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Module 1: Descriptive Statistics

Data Types
Qualitative/categorical
● Mutually exclusive labels (one label cannot mean two things)
● Usually not numbers; if numbers are used, they have no mathematical meaning
- Nominal: ordering/ranking makes no sense; numerical labels are arbitrary
- Ordinal: ordering/ranking has meaning and can be interpreted; numerical labels respect the ordering
Quantitative/numerical
● Numbers used to record certain events; the numbers have mathematical meaning
- Interval: the difference between two quantities is meaningful, but their ratio is not; zero has no natural meaning (eg. temperature in °C)
- Ratio: both the difference and the ratio of two quantities are meaningful; zero is meaningful (eg. height, income)

Using categorical/qualitative data


Frequency distribution
● Frequency: the total number of occurrences for each category
● Relative frequency: the fraction of the total number of items belonging to each category (eg. 102 ÷ 808 ≈ 0.1262)
● Percent frequency: relative frequency × 100%
Histograms
● Categories on x-axis
● Frequency, relative frequency or percent frequency on y-axis
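
The three columns of a frequency distribution can be sketched in a few lines of Python. The category labels and counts below are hypothetical and only illustrate the definitions above.

```python
from collections import Counter

# Hypothetical categorical observations (not taken from the notes)
plans = ["basic", "premium", "basic", "standard",
         "basic", "premium", "standard", "basic"]

frequency = Counter(plans)           # number of occurrences per category
total = sum(frequency.values())      # total number of items

# Relative frequency = frequency / total; percent frequency = relative x 100
relative = {cat: count / total for cat, count in frequency.items()}
percent = {cat: 100 * rel for cat, rel in relative.items()}

for cat in frequency:
    print(f"{cat:10s} {frequency[cat]:3d} {relative[cat]:.4f} {percent[cat]:.2f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100%.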

Using numerical/quantitative data


Frequency distributions and histograms
● Values on x-axis are grouped into classes (eg. 0-5, 5-10, 10-15)
● Density frequency: relative frequency divided by the class width, so that the areas of the bars sum to 1
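
A minimal sketch of a grouped frequency distribution, assuming that "density frequency" means relative frequency divided by class width. The data, the class boundaries and the rule that a class includes its lower bound but not its upper bound are all assumptions made for illustration.

```python
# Hypothetical numerical observations and class boundaries
values = [1, 2, 4, 6, 7, 7, 8, 11, 12, 14]
edges = [0, 5, 10, 15]               # classes 0-5, 5-10, 10-15

n = len(values)
rows = []
for lo, hi in zip(edges[:-1], edges[1:]):
    # Assumed convention: lower bound included, upper bound excluded
    freq = sum(1 for v in values if lo <= v < hi)
    rel = freq / n
    density = rel / (hi - lo)        # relative frequency per unit of class width
    rows.append((lo, hi, freq, rel, density))
    print(f"{lo}-{hi}: frequency={freq}, relative={rel:.2f}, density={density:.3f}")

# The bar areas (density x class width) add up to 1
total_area = sum(density * (hi - lo) for lo, hi, _, _, density in rows)
```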

Probability theory
● Random variable (r.v.) - a variable whose value is determined by chance
● Population - the complete pool of all possible observations of a certain random variable
● Sample - a random collection of a certain size drawn from the population

Probability distribution
● Probability distribution - describes how probability is spread over the values that a random variable may take

Notation
● Random variables are denoted by X, Y (capital letters)
- Eg. X: number of children in household
- Eg. Y: amount of time spent by husband on housework per day
● Realisations/observations of a random variable are denoted by xᵢ, yᵢ (lowercase letters with subscript)
- Eg. x₁: number of children in household is 1
- Eg. y₁₃₇: amount of time spent by husband on housework per day is 137
● N and n denote the size or number of observations
- N denotes the population size
- n denotes the sample size

Descriptive Statistics
Central tendency
● A measure of central tendency yields info about the centre of a set of numbers (the distribution of a r.v.); it does not focus on the span of the dataset or how far values are from the middle numbers
● Gives an idea of the typical, middle, or average value that a r.v. can take
● Sometimes called measures of location

Three measures of central tendency

Mode
● Most frequently occurring value in a set of data
● If there are 2 modes, both are listed and the data is said to be bimodal
● Datasets with 3 or more modes are referred to as multimodal
● The concept of mode is often used in determining sizes (eg. which shoe or clothing sizes to produce)
● Appropriate descriptive summary measure for categorical data

Median
● Middle value in an ordered array of numbers
● Locate the median by finding the (n+1)/2 th term in the ordered array; if n is even, take the average of the two middle terms
● Large and small values do not inordinately influence the median; hence it is the best measure of location to use in the analysis of variables in which extreme but acceptable values can occur at just one end of the data
● Not all info from the dataset is used
● Data must be quantitative or be able to be ranked

Mean
● Average of a set of numbers: X̄ = (x₁ + x₂ + … + xₙ) / n
● Sample mean is represented by X̄
● Population mean is represented by μ
● Data should be quantitative, as it needs to be summed
● Affected by all values: an advantage because it reflects all the data, but a disadvantage because extreme values pull the mean towards the extremes
● To calculate the mean forecast value (expected value), multiply each possible value by its probability and sum up the products

- If we denote the r.v. by X: E(X) = Σ xᵢ × P(X = xᵢ)
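
The three measures of central tendency, and the probability-weighted mean, can be checked with Python's standard `statistics` module. The sample and the probability distribution below are hypothetical.

```python
import statistics

# Hypothetical sample (n = 7, already in ascending order)
sample = [2, 3, 3, 5, 7, 8, 14]

modes = statistics.multimode(sample)   # list of all most frequent values
median = statistics.median(sample)     # the (n+1)/2 = 4th term of the ordered data
mean = statistics.mean(sample)         # sum of the values divided by n

# Expected value of a discrete r.v. X: sum of value x probability.
# Hypothetical distribution; the probabilities sum to 1.
distribution = {0: 0.2, 1: 0.5, 2: 0.3}
expected_value = sum(x * p for x, p in distribution.items())

print(modes, median, mean, expected_value)
```

Note how the single large value 14 pulls the mean (6) above the median (5).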


Variability
● Measures of variability yield info about how likely a realisation of the r.v. is to be far from the centre of its distribution; they describe the spread/dispersion of a dataset
● Give an idea of fluctuation and volatility across realisations of the r.v.
● The more variability in a dataset, the less typical the central values are of the whole set
● Using measures of variability in conjunction with measures of central tendency makes possible a more complete numerical description of the data (a measure of variability is necessary to complement the mean value when describing data)
● The more spread out the r.v. is, the larger the variability (risk/dispersion)
● Also called measures of scale, spread, dispersion or risk
● Measures of variability
- Variance (Var): average of squared distances from the mean
- Standard deviation (std): square root of the variance
- Coefficient of variation (CV): standard deviation / mean × 100%

Variability formulas
Variance
● It computes the average squared distance between data points and their mean; the formula depends on whether the data is a sample or a population
● Population variance
- Finite population
- Denoted by σ² (sigma squared) or Var(X) (variance of X)
- σ² = Σ (xᵢ − μ)² / N
● Sample variance
- Denoted by s²
- s² = Σ (xᵢ − X̄)² / (n − 1), using the usual n − 1 divisor
Standard deviation
● Standard deviation solves the problem of squared units: it has the same unit as the original data
● Population standard deviation
- Denoted by σ (sigma) or std(X)
● Sample standard deviation
- Denoted by s
Coefficient of variation
● Measures standard deviation per unit of mean
● In finance, when the r.v. X denotes asset returns, CV measures risk per unit of expected return
● It is unit free, because the numerator and denominator have the same unit as the original data and the units cancel each other
● Population CV
- CV = σ / μ × 100%
- When σ increases, CV increases
- When μ increases, CV decreases
- Ratio between risk and expected return
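
A short sketch of the variability measures using Python's `statistics` module. The data are hypothetical returns in percent; `pvariance`/`pstdev` divide by N (population), while `variance`/`stdev` divide by n − 1 (sample).

```python
import statistics

# Hypothetical returns in percent
data = [4, 8, 6, 2, 10]

mean = statistics.mean(data)

pop_var = statistics.pvariance(data)     # population variance, divides by N
pop_std = statistics.pstdev(data)        # square root of the population variance
sample_var = statistics.variance(data)   # sample variance, divides by n - 1
sample_std = statistics.stdev(data)      # square root of the sample variance

# Coefficient of variation in percent: standard deviation per unit of mean
cv = pop_std / mean * 100

print(mean, pop_var, sample_var, round(pop_std, 4), round(cv, 2))
```

The sample variance is always larger than the population variance computed from the same numbers, because it divides the same sum of squares by n − 1 instead of N.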
Skewness
Shape
● Central tendency and variability are useful to describe and summarise data or the
distribution of r.v.’s
● Skewness - measure of asymmetry
● Mode: value on the horizontal axis where the high point of the curve occurs
● Mean: towards the tail of the distribution (drawn towards the extreme values)


● Median: generally located somewhere between the mode and the mean
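
The ordering of mode, median and mean in a skewed distribution can be seen on a small hypothetical dataset with a long right tail. This is only an illustration; the ordering is a tendency, not a guarantee for every dataset.

```python
import statistics

# Hypothetical right-skewed data: a few large values form a long right tail
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode = statistics.mode(data)       # high point of the distribution
median = statistics.median(data)   # middle of the ordered data
mean = statistics.mean(data)       # pulled towards the extreme values

print(mode, median, mean)
```

Here the mode is the smallest of the three, the mean is the largest (pulled towards the tail) and the median sits between them.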

Probability theory
● Multi-dimensional data
● Experiment: a random process that creates outcomes (eg. the data collection
procedure)
● Sample space: the set of all possible outcomes
● Event: a set of outcomes (can contain no outcome, single outcome or multiple
outcomes) of an experiment to which probability is assigned. So an event is a subset
of the sample space
● Relative frequency: outcomes receive probability corresponding to their number of occurrences → P(outcomeᵢ) = number of occurrences of outcomeᵢ ÷ total number of occurrences of all outcomes

Law of addition
Joint vs marginal probabilities
● Joint and marginal probabilities are distinguished when outcomes are multidimensional
● Joint probability: the relative frequency when asking about all dimensions
- Eg. the relative frequency that a customer bought a $49 plan on a weekday
● Marginal probability: the relative frequency when asking about only a single dimension
- Eg. the relative frequency that a customer bought a $49 plan
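
The difference between the two can be sketched with a small table of counts. All counts below, and the $79 plan, are hypothetical; only the $49 plan and the weekday dimension come from the example above.

```python
# Hypothetical purchase counts by plan price and type of day
counts = {
    ("$49", "weekday"): 30, ("$49", "weekend"): 10,
    ("$79", "weekday"): 40, ("$79", "weekend"): 20,
}
total = sum(counts.values())

# Joint probability: both dimensions are specified
p_49_and_weekday = counts[("$49", "weekday")] / total

# Marginal probability: sum over the dimension that is not asked about
p_49 = sum(c for (plan, _), c in counts.items() if plan == "$49") / total

print(p_49_and_weekday, p_49)
```

The marginal probability of a $49 plan is the sum of the joint probabilities of "$49 on a weekday" and "$49 on a weekend".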

The complement of the event A is denoted as A’ (pronounced “A prime”), meaning “not A”. A dash at the top of an event means the outcome does not occur.

When referring to joint probability, we use the intersection symbol “∩”. The event A∩B (read: the intersection of A and B, or A intersection B) is the event where both A and B are true, or both A and B occur

Venn diagram: visualisation of probability


● Venn diagram shows logic relations across sets
● The external rectangle indicates the whole sample space
● The internal circle indicates some event A
Joint events
● A joint event such as A∩B is the intersection (∩) of A and B
Union of events
● Indicates that event A or event B (or both) happens
● This is denoted by A∪B, pronounced as the union of A and B, or A union B. So P(A∪B) indicates the probability that A or B is true, or that A or B occurs

Mutually exclusive events


● If event A can occur only when event B does not occur (they cannot occur at the same time), we say A and B are mutually exclusive (events)
● Any event and its complement are mutually exclusive: either “A occurs” or “A does not occur”
● P(A∩A’) = 0

Collectively exhaustive events


● If the occurrence of events A and B covers the whole sample space, we say A and B are collectively exhaustive (events)
● Any event and its complement are collectively exhaustive. “A occurs” and “A does not occur” make up all possible outcomes
● P(A∪A’) = 1
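
Events can be modelled as Python sets, which makes intersection, union and complement easy to check. The die experiment and the events A and B are hypothetical. The sketch also checks the standard addition law P(A∪B) = P(A) + P(B) − P(A∩B), which is assumed to be the rule the "Law of addition" heading refers to.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # event: the roll is even
B = {5, 6}             # event: the roll is greater than 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

A_complement = sample_space - A               # A' = "not A"

p_union = prob(A | B)                         # P(A∪B), counted directly
p_union_by_law = prob(A) + prob(B) - prob(A & B)

p_A_and_not_A = prob(A & A_complement)        # mutually exclusive
p_A_or_not_A = prob(A | A_complement)         # collectively exhaustive

print(p_union, p_union_by_law, p_A_and_not_A, p_A_or_not_A)
```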

Conditional probabilities and independence


Conditional probabilities
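● P(A|B) denotes the probability that event A occurs, given that B occurs
● P(X=x|Y=y) denotes the probability of r.v. X taking value x, conditional on the r.v. Y taking value y
● Formula: P(A|B) = P(A∩B) / P(B), provided P(B) > 0

● Bayes rule: P(A|B) = P(B|A) × P(A) / P(B)

Law of total probability


● Joint probability = conditional probability multiplied by marginal probability: P(A∩B) = P(A|B) × P(B)
● Summing the joint probabilities over B and its complement gives the marginal probability: P(A) = P(A|B) × P(B) + P(A|B’) × P(B’)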
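
A numerical sketch of conditional probability, Bayes rule and the law of total probability, written in their standard textbook forms. The four joint probabilities are hypothetical and sum to 1.

```python
# Hypothetical joint probabilities for events A and B
p_A_and_B = 0.2
p_A_and_notB = 0.2
p_notA_and_B = 0.1
p_notA_and_notB = 0.5

# Marginal probabilities
p_A = p_A_and_B + p_A_and_notB
p_B = p_A_and_B + p_notA_and_B
p_notB = 1 - p_B

# Conditional probability: joint divided by the marginal of the condition
p_A_given_B = p_A_and_B / p_B
p_B_given_A = p_A_and_B / p_A
p_A_given_notB = p_A_and_notB / p_notB

# Bayes rule recovers P(A|B) from P(B|A)
p_A_given_B_bayes = p_B_given_A * p_A / p_B

# Law of total probability recovers the marginal P(A)
p_A_total = p_A_given_B * p_B + p_A_given_notB * p_notB

print(round(p_A_given_B, 4), round(p_A_given_B_bayes, 4), round(p_A_total, 4))
```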

Independent events: formula


● If A and B are independent events, whether or not B occurs should not affect the probability that A occurs; also, whether or not A occurs should not affect the probability that B occurs
● Formula: P(A∩B) = P(A) × P(B)

● Bayes rule: P(A|B) = P(A) × P(B) / P(B) = P(A)
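
Independence can be checked numerically by comparing the joint probability with the product of the marginals. The joint table below is hypothetical and was chosen so that A and B are independent.

```python
import math

# Hypothetical joint probability table for two events A and B
joint = {
    ("A", "B"): 0.12, ("A", "not B"): 0.28,
    ("not A", "B"): 0.18, ("not A", "not B"): 0.42,
}

p_A = joint[("A", "B")] + joint[("A", "not B")]
p_B = joint[("A", "B")] + joint[("not A", "B")]
p_A_given_B = joint[("A", "B")] / p_B

# Independence: P(A∩B) = P(A) x P(B), equivalently P(A|B) = P(A).
# math.isclose is used because floating-point sums are not exact.
independent = math.isclose(joint[("A", "B")], p_A * p_B)

print(independent, round(p_A, 4), round(p_A_given_B, 4))
```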

Implications of formulas

Binomial experiments
● Eg. toss a coin 3 times in a row; you are interested in how likely it is that you get exactly two heads
● A binomial experiment assesses the number of times a certain outcome occurs in repeated independent trials
● Each trial has two possible outcomes (eg. heads or tails, success or failure)

Binomial tree
● When two outcomes are independent, P(A|B) = P(A)
● Suppose we have three products; each can be defective (D) with probability p or functional (F) with probability q = 1 − p
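
The coin example can be worked by walking every branch of the binomial tree and multiplying the probabilities along each path, which is valid because the trials are independent. A fair coin (p = 0.5) is assumed here. The closed-form binomial formula used for comparison is the standard one and is not stated in these notes.

```python
from itertools import product
from math import comb

p = 0.5            # probability of heads on one toss (fair coin assumed)
q = 1 - p          # probability of tails
n, k = 3, 2        # 3 tosses, exactly 2 heads

# Walk every branch of the tree: HHH, HHT, HTH, ...
prob_tree = 0.0
for path in product("HT", repeat=n):
    if path.count("H") == k:
        # Independent trials: multiply the probabilities along the path
        prob_tree += p ** path.count("H") * q ** path.count("T")

# Standard binomial formula: (number of paths) x p^k x q^(n-k)
prob_formula = comb(n, k) * p ** k * q ** (n - k)

print(prob_tree, prob_formula)
```

Three paths (HHT, HTH, THH) contain exactly two heads, each with probability 0.125. The same code covers the defective/functional product example by reading "H" as D and "T" as F.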

Continuous probability distributions


● Discrete probability distribution: the distribution of a discrete random variable
● Discrete random variable: a r.v. that takes discrete values. A discrete r.v. typically counts something
- Eg. number of kids in a household, number of successes in n trials
● Continuous random variable: a r.v. that takes values on (part of) the real line. A continuous r.v. typically measures something
- Eg. waiting time in a queue, height of soldiers, inflation rates

2 different types of probability distribution function: discrete and continuous

● Discrete: the probabilities (scores) add up to 1

Probability density function


● A continuous probability distribution for X is defined by means of a probability density function (pdf), which assigns a non-negative value to each possible outcome of X such that the density integrates to 1 (this means that the area under the curve is 1). The probability that X lies between two numbers is the area under the pdf between those numbers
Discrete random variable
● Unlike for a discrete r.v., it is not meaningful to ask for P(X=x) of a continuous r.v., where x is some specific value, because P(X=x) = 0 always
● A continuous r.v. has infinitely many outcomes. If each single outcome had positive probability, the probabilities would add up to infinity and not 1
● Eg. What is the probability that a random person waits exactly 2.71285748634050284… minutes?
- The probability is 0. However, the probability that a person waits between 2.71…84 and 2.71…9 minutes is strictly positive
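Implications for Inequalities
● Because P(X=x) = 0 for a continuous r.v., strict and weak inequalities give the same probability: P(X ≤ x) = P(X < x)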

Cumulative distribution function for a continuous pdf


● P(X<x) for a continuous r.v. defines the cumulative distribution function (CDF)

Conditions for a pdf f_X(x):


1. Total area under the pdf equals 1: P(−∞ < X < ∞) = 1
2. Because probabilities are areas under the pdf, the pdf can never be negative: a negative value would imply negative probabilities over some range

Continuous uniform distribution


● A r.v. X taking any value within [a,b] is said to follow the continuous uniform distribution, X ~ Unif(a,b), if all potential outcomes (realisations) between a and b are equally likely
● There are two parameters:
- a: the minimum value that X can assume
- b: the maximum value that X can assume
- E(X) = (a + b) / 2, Var(X) = (b − a)² / 12

● For any continuous r.v., P(x₁ < X < x₂) = P(X < x₂) − P(X < x₁): the area under the pdf from x₁ to x₂ is the difference between the values of the cdf at x₂ and x₁
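
A minimal sketch of the continuous uniform distribution, with hypothetical parameters a = 2 and b = 10. The `pdf` and `cdf` helper functions are written out by hand for illustration rather than taken from a library.

```python
# Hypothetical parameters: X ~ Unif(a, b)
a, b = 2.0, 10.0

def pdf(x):
    """Height of the uniform density: constant 1/(b - a) on [a, b], 0 elsewhere."""
    return 1 / (b - a) if a <= x <= b else 0.0

def cdf(x):
    """P(X < x): area under the pdf to the left of x."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

mean = (a + b) / 2                 # E(X)
variance = (b - a) ** 2 / 12       # Var(X)

# P(x1 < X < x2) as a difference of two cdf values
x1, x2 = 4.0, 7.0
p_between = cdf(x2) - cdf(x1)

print(mean, round(variance, 4), p_between)
```

Geometrically, `p_between` is the area of a rectangle of width x₂ − x₁ = 3 and height 1/(b − a) = 0.125.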
