
Introduction To Statistics Using R - Session 2

Slides from session 2 of the seminar series Intro to statistics using R by Chloé Warret Rodrigues, PhD, DVM, MSc. In this second session, we will review the concept of probability distribution (not to be confused with frequency distribution) and become familiar with the most common distribution types. We will, for example, see which distribution usually suits which kind of data (that part will come in handy when we start analyzing data!).
Introduction to statistics using R
Seminar series - session 2
Session 2 – Learning objectives
• Understand the concept of probability distribution
• A very short intro to R and RStudio
But first,
Distribution
The concept of distribution is the soul of data analysis.
If you understand distribution, then you’ll understand stats.
Statistical tests assume data follow specific distributions.
What is distribution?
• Probability distribution
• Mathematical function defining all the possible values of a random variable, and how often they occur
• Within a given range
• Describes the long-run behavior of random variables
• How the possible values are plotted depends on:
  • Central tendency
  • Spread
  • Skewness
  • Kurtosis (among others)
Do not confuse frequency and probability distributions
Frequency and probability distribution
Frequency:
• Empirical
• Summarizes observed data

Probability:
• Theoretical
• Gives the probability of each possible observation
An example
• We captured 5 sandgrouse, 7 ravens, and 3 falcons
• Frequency distribution of species is 5, 7 and 3
• Probability distribution of species is 5/15, 7/15 and 3/15
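The capture example can be reproduced in R. This is only a sketch: the object names `captures` and `species_prob` are made up for illustration, and the probabilities are simply the observed proportions, as on the slide.

```r
# Counts from the capture example (object names are hypothetical)
captures <- c(sandgrouse = 5, raven = 7, falcon = 3)

# Frequency distribution: the observed counts
captures

# Proportion of each species, used on the slide as its probability
species_prob <- captures / sum(captures)
species_prob

# The probabilities of all possible values sum to 1
sum(species_prob)
```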
Why use probability?
• Data come from experiments or complex systems
• Variability and randomness
• Measure uncertainty in the data
• Probability measures uncertainty
Some definitions
• Random variable: variable whose outcome varies from measurement to measurement
• Probability distribution: possible values a random variable can take, and how likely they are
• Behavior of random variables must be defined using probability.
Do not confuse random
variable and variable

A random variable is always numerical and depends on the outcome of a chance experiment.

A variable is any property that you can measure or control.
Unfortunate choice of words
• Random variables: not random and not variables. They are functions mapping the possible outcomes of a sample space to a measurable space
• Categorical variables can also describe random outcomes
Some more definitions
• Chance experiment: uncertain situation with 2 or more possible outcomes
Your dragonborn barbarian wants to open a closed door in the dungeon using her Sword of Vengeance. When you roll your d20 to perform the deed, you’ll have 20 possible outcomes from nat 1 to nat 20.
• Outcome: result of a single trial
You rolled your d20 and got a nat 1.
• Sample space: collection of all possible outcomes from a chance experiment
The sample space when rolling your d20 is all integers from 1 to 20.
• Event: set of outcomes from a chance experiment
Event “nat 20 or nat 1”: all dice rolls that are either 20 or 1.
• Probability: measure of the likelihood that an event occurs
The dragonborn barbarian rolls nat 1 three times in a row, almost killing off the party. What was the probability?
1/20 × 1/20 × 1/20 = 1/8000 = 0.000125
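The d20 definitions translate directly into a few lines of R. This is a minimal sketch that assumes a fair die and independent rolls; the object names are invented for illustration.

```r
# Sample space of a d20: all integers from 1 to 20
sample_space <- 1:20

# Assumption: a fair die, so every face is equally likely
p_face <- 1 / length(sample_space)

# Event "nat 20 or nat 1": two outcomes of the sample space
p_event <- sum(sample_space %in% c(1, 20)) * p_face

# Three nat 1 in a row: rolls are independent, so probabilities multiply
p_three_nat1 <- p_face^3

p_event        # 0.1
p_three_nat1   # 0.000125

# One trial of the chance experiment; the outcome changes from run to run
sample(sample_space, size = 1)
```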
Probability distribution
• Continuous variables: continuum of values within a range
• Discrete variables: only take distinct values
Continuous variables: most common distributions
• Normal (a.k.a. Gaussian)
• Uniform (a.k.a. rectangular)
• Beta
• Gamma
Gaussian or normal distribution
• Symmetrical
• Defined by μ (mean) and σ (standard deviation)
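In R, the normal distribution is available through `dnorm()` (density), `pnorm()` (cumulative probability) and `rnorm()` (random draws). The sketch below uses arbitrary values for μ and σ, chosen only for illustration.

```r
mu    <- 10   # hypothetical mean
sigma <- 2    # hypothetical standard deviation

# Symmetry: the density is the same at equal distances on either side of the mean
left  <- dnorm(mu - 1, mean = mu, sd = sigma)
right <- dnorm(mu + 1, mean = mu, sd = sigma)
all.equal(left, right)

# Half of the probability lies below the mean
below_mean <- pnorm(mu, mean = mu, sd = sigma)

# Probability of falling within one standard deviation of the mean (about 0.68)
within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)

# 1000 random values from this distribution
x <- rnorm(1000, mean = mu, sd = sigma)
```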
Uniform or rectangular distribution
• Defined by a (minimum value) and b (maximum value)
• Constant probability density over the interval [a, b]
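A quick way to see the constant density is `dunif()`. The values of a and b below are arbitrary and chosen only for illustration.

```r
a <- 2   # hypothetical minimum
b <- 6   # hypothetical maximum

# Inside [a, b] the density is the same everywhere: 1 / (b - a)
dens_inside <- dunif(c(2.5, 4, 5.9), min = a, max = b)
dens_inside

# Outside the interval the density is zero
dens_outside <- dunif(7, min = a, max = b)

# Probability of a value below the midpoint of the interval
p_below_mid <- punif((a + b) / 2, min = a, max = b)
```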


Gamma distribution
• Positive and right-skewed
• Defined by parameters k (shape) and θ (scale)
• Suits continuous, positive, right-skewed data with constant variance on the log scale
• Survival data, rainfall, …
• χ² distribution: special case of the gamma
• Used in χ² goodness-of-fit tests
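The gamma functions in R take the shape and scale directly. The sketch below uses arbitrary parameter values and also checks the standard fact that a χ² distribution with df degrees of freedom is a gamma distribution with shape df/2 and scale 2.

```r
k     <- 2   # hypothetical shape
theta <- 3   # hypothetical scale

set.seed(1)
x <- rgamma(10000, shape = k, scale = theta)

min(x)    # simulated values are all positive
mean(x)   # close to k * theta = 6

# Chi-square as a special case of the gamma: shape = df / 2, scale = 2
df   <- 4
grid <- c(0.5, 1, 3, 8)
chisq_dens <- dchisq(grid, df = df)
gamma_dens <- dgamma(grid, shape = df / 2, scale = 2)
all.equal(chisq_dens, gamma_dens)
```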
Beta distribution
• Defined on [0, 1]
• Defined by shape parameters α and β, both > 0
• Derived variables, notably proportions
• Doesn’t deal with exact 0s and 1s: the log-likelihood function contains ln(x) and ln(1 − x), which are unbounded at those values since ln(0) = −∞
When 0 and 1 occur:
• Possible transformation: (y × (n − 1) + 0.5) / n, where n = sample size
• Zero-one inflated beta models
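The transformation for exact 0s and 1s is a one-liner in R. The proportions in `y` and the shape parameters used for the check are invented for illustration only.

```r
# Hypothetical proportions, including an exact 0 and an exact 1
y <- c(0, 0.12, 0.5, 0.87, 1)
n <- length(y)   # sample size

# Transformation from the slide: (y * (n - 1) + 0.5) / n
y_adj <- (y * (n - 1) + 0.5) / n
y_adj   # 0.100 0.196 0.500 0.796 0.900

# With arbitrary shapes alpha = 2 and beta = 3, the log-density is -Inf
# at exactly 0 and 1, but finite for every transformed value
log_dens_bounds <- dbeta(c(0, 1), shape1 = 2, shape2 = 3, log = TRUE)
log_dens_adj    <- dbeta(y_adj, shape1 = 2, shape2 = 3, log = TRUE)
```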
Discrete variables: most common distributions
• Uniform (Yes, again!)
• Poisson
• Negative binomial
• Binomial
Discrete uniform distribution
• All outcomes equally likely
• Defined by a (min) and b (max), like the continuous uniform
• Probability of each outcome: 1/n, where n is the number of possible values
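Base R has no dedicated function for the discrete uniform distribution, but `sample()` draws from one. The sketch below assumes a six-sided die (a = 1, b = 6) purely as an example.

```r
a <- 1   # hypothetical minimum
b <- 6   # hypothetical maximum

values <- a:b
n <- length(values)   # number of possible values

# Every value has the same probability, 1/n
p <- rep(1 / n, n)
sum(p)   # 1

# Simulated draws: each observed proportion should be close to 1/n
set.seed(42)
draws <- sample(values, size = 60000, replace = TRUE)
obs_prop <- as.numeric(table(draws)) / length(draws)
round(obs_prop, 3)
```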
The Poisson distribution
• Count distribution
• Defined by 1 parameter, λ: the mean number of events
• λ = μ = σ² (lambda = mean = variance)
• As the mean increases, Poisson distributions approximate the normal distribution
• Count data with a large mean can often be modelled as continuous
[Figure: expected relative frequency against k (number of times the event occurs)]
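Both Poisson properties can be checked by simulation. This is a sketch with arbitrary values of λ; the continuity correction (adding 0.5) in the normal approximation is a standard device that is not on the slide.

```r
lambda <- 4   # hypothetical mean number of events

set.seed(123)
counts <- rpois(100000, lambda = lambda)
mean(counts)   # close to lambda
var(counts)    # also close to lambda

# Probability of observing exactly k = 0, 1, 2, 3 events
dpois(0:3, lambda = lambda)

# With a large mean, the normal distribution becomes a good approximation
big_lambda <- 100
p_pois <- ppois(110, lambda = big_lambda)
p_norm <- pnorm(110.5, mean = big_lambda, sd = sqrt(big_lambda))
c(poisson = p_pois, normal = p_norm)
```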


Negative binomial distribution
• In most cases μ = σ² will not hold
• Overdispersion: μ < σ²
• 1 more parameter, θ
• λ = μ and σ² = μ + μ²/θ
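The mean-variance relationship can be checked with `rnbinom()`, where the `size` argument plays the role of θ and the variance is μ + μ²/θ. The values of μ and θ below are arbitrary.

```r
mu    <- 4   # hypothetical mean
theta <- 2   # hypothetical dispersion parameter (argument `size` in R)

set.seed(123)
counts <- rnbinom(100000, size = theta, mu = mu)

mean(counts)   # close to mu = 4
var(counts)    # close to mu + mu^2 / theta = 12, well above the mean

expected_var <- mu + mu^2 / theta
```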
Binomial distribution
• Series of Bernoulli trials
• Bernoulli trial: only 2 outcomes
• Binary variable: 1/0
• Number of 1s in n independent Bernoulli trials
• Defined by n (number of trials) and p (probability of success in 1 trial)
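In R, a Bernoulli trial is a binomial draw with `size = 1`, and summing n such trials gives one binomial observation. The values of n and p below are arbitrary.

```r
n <- 10    # hypothetical number of trials
p <- 0.3   # hypothetical probability of success in one trial

# n Bernoulli trials: each result is 0 or 1
set.seed(1)
trials <- rbinom(n, size = 1, prob = p)
trials

# Number of 1s in the n trials: one draw from a binomial(n, p)
sum(trials)

# Probability of exactly 3 successes in 10 trials (about 0.267)
p_three <- dbinom(3, size = n, prob = p)

# Probabilities of all possible counts (0 to n) sum to 1
probs <- dbinom(0:n, size = n, prob = p)
sum(probs)
```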
Categorical distributions
Which morphs are associated with which habitat?
• Coat color: response variable with K categories
• Multinomial distribution: possible results of a random variable that can take on one of K categories, given the specified probability of each category
• Generalizes the binomial distribution
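The multinomial distribution is available in R through `rmultinom()` and `dmultinom()`. The coat colors and their probabilities below are invented for illustration; the last lines check that with K = 2 categories the multinomial gives the same probability as the binomial.

```r
# Hypothetical probabilities for K = 3 coat colors
morph_prob <- c(white = 0.5, blue = 0.3, brown = 0.2)

# Counts per category for 20 simulated animals
set.seed(1)
sim_counts <- rmultinom(1, size = 20, prob = morph_prob)
sim_counts

# Probability of observing exactly 10 white, 6 blue and 4 brown
dmultinom(c(10, 6, 4), prob = morph_prob)

# With K = 2 the multinomial reduces to the binomial
p_multi <- dmultinom(c(3, 7), prob = c(0.3, 0.7))
p_binom <- dbinom(3, size = 10, prob = 0.3)
all.equal(p_multi, p_binom)
```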
Intro to R and RStudio
What is R?
• A programming language
• A calculator
• An environment: integrated suite of software
• Open source
• Get under the hood, unlike most software: you know what you’re doing and what’s going on
Why R?
• Data handling
• Computation
• Graphics
• Wide variety of:
  • Statistical techniques
  • Graphical techniques
  • Data types it can handle
What is RStudio?
• Integrated development environment for R
• A tool to make your life easier with R
Why RStudio?
Tools for:
• plotting
• history
• debugging
• workspace management
R resources
• https://r4ds.had.co.nz/ for data wrangling and graphs
• List of sites I provided during the last seminar
• I’m happy to help you troubleshoot your code, but from now on, I’m expecting you to first try these resources

Copy and paste


errors in
Panes in RStudio

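[Figure: RStudio window with four numbered panes: 1 source editor, 2 R console, 3 environment pane, 4 output pane]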
1 Source editor
This is where you type your code
# prevents whatever comments you add from being read as code
# you need to add the # symbol at the beginning of each line
# that you don’t want to run
You can also add # at the end of a bit of code:
The first part of the line will be seen as code # the second will not
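A short script shows both uses of the # symbol. The object name `x` is arbitrary.

```r
# This whole line is a comment: R ignores it
x <- 3 + 2   # the assignment runs, this trailing note does not

# Code placed after a # is skipped too:
# x <- 100

x   # still 5
```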
1 Source editor
Source editor color convention (default theme):
• Black: objects and anything you define or call
• Blue: commands
• Green: “characters” and # comments, which are not considered code and thus won’t run
But you can change the theme!
2 R Console
> This is actual R
> Your code runs here
> All commands are blue, preceded by >
Output is black
Warnings and errors are red
> If you type code in the console
> It will be executed directly. Ex.:
> 3+2
[1] 5
3 Environment pane
• Very useful: shows all your objects
• Data frames, matrices, arrays and lists are under “Data”
• Vectors are under “Values” (functions are listed in their own “Functions” section)
• The pane indicates how many columns you have in your data frame, or the number of elements in a list
• You can also see how many observations you have in your columns
• When you click on the arrow, you can see the columns of a data frame or the elements of a list
• R also indicates which operator you must use to call a particular element. For example, butterfly$site calls the column “site” of the data frame “butterfly”
• You can also check the type of data (character, numeric…) stored in the column or list
• Same for vectors: you can check the type of data and how many observations you have
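What the Environment pane displays can also be queried from the console. In this sketch only the names `butterfly` and `site` come from the slides; the data values, the `count` column and the `wing_length` vector are invented for illustration.

```r
# Hypothetical data frame with 2 columns and 3 observations
butterfly <- data.frame(
  site  = c("meadow", "forest", "meadow"),
  count = c(12, 5, 9),
  stringsAsFactors = FALSE
)

# The $ operator calls one column of the data frame
butterfly$site

# Number of columns and of observations
ncol(butterfly)
nrow(butterfly)

# Type of data stored in each column
class(butterfly$site)    # "character"
class(butterfly$count)   # "numeric"

# str() prints a summary similar to the expanded view in the pane
str(butterfly)

# Same checks for a vector
wing_length <- c(4.1, 3.8, 4.4)
class(wing_length)
length(wing_length)
```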
4 Output pane
4 tabs of interest:
• The Files tab is a navigation pane where you can manage files like in your operating system
• The Plots tab is where you can see your plots
• The Packages tab shows the packages stored on your computer. It also has the Install button, from which you can directly install packages, and an Update button for already installed packages
• The Help tab provides access to the R documentation for all functions
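Most of what the Packages and Help tabs offer is also available from the console. In the sketch below, the lines that download from the internet or open the Help tab are left commented out, and ggplot2 is used only as an example of a package name.

```r
# Packages currently installed on this computer (what the Packages tab lists)
installed <- rownames(installed.packages())
head(installed)

# Console equivalents of the Install and Update buttons
# (commented out because they download from CRAN)
# install.packages("ggplot2")
# update.packages()

# Load an installed package so its functions can be used
library(stats)

# Typed in the console, these open the documentation in the Help tab
# ?mean
# help("mean")
```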