Going Beyond Linear Regression: Ita Cirovic Donev

This document provides an overview of generalized linear models (GLMs) in Python. It discusses how GLMs extend linear regression to handle non-normal distributions and non-constant variance. The key components of a GLM, including the data type, domain, link function, and family, are explained for linear regression, logistic regression, and Poisson regression examples. The process of fitting a GLM in Python using statsmodels, including describing the model formula, fitting the model, and summarizing results, is also outlined.

Going beyond linear regression
GENERALIZED LINEAR MODELS IN PYTHON

Ita Cirovic Donev


Data Science Consultant
Course objectives
Learn building blocks of GLMs
Train GLMs
Interpret model results
Assess model performance
Compute predictions

Chapter 1: How are GLMs an extension of linear models
Chapter 2: Binomial (logistic) regression
Chapter 3: Poisson regression
Chapter 4: Multivariate logistic regression

Review of linear models
salary ∼ experience

salary = β0 + β1 × experience + ϵ

y = β0 + β1 x1 + ϵ

where:
y - response variable (output)
x - explanatory variable (input)
β - model parameters
β0 - intercept
β1 - slope
ϵ - random error



LINEAR MODEL - ols()

from statsmodels.formula.api import ols

model = ols(formula = 'y ~ X',
            data = my_data).fit()

GENERALIZED LINEAR MODEL - glm()

import statsmodels.api as sm
from statsmodels.formula.api import glm

model = glm(formula = 'y ~ X',
            data = my_data,
            family = sm.families.____).fit()



Assumptions of linear models
Regression function

E[y] = μ = β0 + β1 x1

Assumptions

Linear in parameters

Errors are independent and normally distributed

Constant variance

salary = 25790 + 9449 × experience
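
The fitted line above can be evaluated by hand. A small sketch, assuming experience is measured in years (the slide does not state the unit) and using a hypothetical helper name:

```python
def predicted_salary(experience):
    """Fitted line from the slide: salary = 25790 + 9449 * experience."""
    return 25790 + 9449 * experience

# The intercept is the prediction at zero experience;
# each extra unit of experience adds the slope, 9449.
for experience in (0, 1, 5):
    print(experience, predicted_salary(experience))
```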



What if ... ?
The response is binary or count → NOT continuous

The variance of y is not constant → depends on the mean
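
To illustrate a variance that depends on the mean, the sketch below simulates count data from a Poisson distribution, where the variance equals the mean. The means and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# For Poisson counts the variance grows with the mean,
# so the constant-variance assumption cannot hold.
results = []
for mean in (1.0, 5.0, 20.0):
    counts = rng.poisson(lam=mean, size=100_000)
    results.append((mean, counts.mean(), counts.var()))
    print(f"mean = {mean:5.1f}  sample mean = {counts.mean():6.3f}  "
          f"sample variance = {counts.var():6.3f}")
```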



Dataset - nesting of horseshoe crabs
Variable name   Description
sat             Number of satellites residing in the nest
y               There is at least one satellite residing in the nest; 0/1
weight          Weight of the female crab in kg
width           Width of the female crab in cm
color           1 - light medium, 2 - medium, 3 - dark medium, 4 - dark
spine           1 - both good, 2 - one worn or broken, 3 - both worn or broken

Source: A. Agresti, An Introduction to Categorical Data Analysis, 2007.



Linear model and binary response

satellite crab ∼ female crab weight

y ~ weight

P(satellite crab is present) = P(y = 1)


From probabilities to classes
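
One common way to turn predicted probabilities into classes is to apply a cut-off. A minimal sketch, assuming a cut-off of 0.5 (the slide's own threshold did not survive extraction) and using the probabilities printed on the Predictions slide of this deck:

```python
# Probabilities taken from the Predictions slide of this deck
probabilities = [0.029309, 0.470299, 0.834983, 0.972363, 0.987941]

cut_off = 0.5  # assumed threshold; other values may suit other problems

# Class 1 when the probability exceeds the cut-off, otherwise class 0
classes = [int(p > cut_off) for p in probabilities]
print(classes)  # [0, 0, 1, 1, 1]
```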



Let's practice!
How to build a GLM?

Components of the GLM



Continuous → Linear Regression
Data type: continuous
Domain: (−∞, ∞)
Examples: house price, salary, person's height

Family: Gaussian()
Link: identity
g(μ) = μ = E(y)

Model = Linear regression



Binary → Logistic regression
Data type: binary
Domain: 0, 1
Examples: True/False

Family: Binomial()
Link: logit

Model = Logistic regression



Count → Poisson regression
Data type: count
Domain: 0, 1, 2, ..., ∞
Examples: number of votes, number of hurricanes

Family: Poisson()
Link: logarithm

Model = Poisson regression



Link functions
Density            Link: η = g(μ)        Default link      glm(family=...)
Normal             η = μ                 identity          Gaussian()
Poisson            η = log(μ)            logarithm         Poisson()
Binomial           η = log[p/(1 − p)]    logit             Binomial()
Gamma              η = 1/μ               inverse           Gamma()
Inverse Gaussian   η = 1/μ²              inverse squared   InverseGaussian()
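
The link functions in the table can be written out directly. The sketch below pairs each link g with its inverse and checks that applying both returns the original mean; the dictionary and the test value 0.25 are illustrative choices, not part of statsmodels.

```python
import math

# Each entry: (link g, inverse link g^-1), following the table above
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logarithm": (math.log, math.exp),
    "logit": (lambda p: math.log(p / (1 - p)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
    "inverse squared": (lambda mu: 1 / mu ** 2,
                        lambda eta: 1 / math.sqrt(eta)),
}

mu = 0.25  # a mean that is valid for every link in the table
round_trip = {}
for name, (g, g_inverse) in links.items():
    eta = g(mu)                       # linear predictor scale
    round_trip[name] = g_inverse(eta)  # back to the mean scale
    print(f"{name:16s} eta = {eta:8.4f}  mu = {round_trip[name]:.4f}")
```

The model is linear on the η scale; the inverse link maps the linear predictor back to the scale of the response mean.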



Benefits of GLMs
A unified framework for many different data distributions
Exponential family of distributions

Link function
Transforms the expected value of y

Enables linear combinations

Many techniques from linear models apply to GLMs as well



Let's practice
How to fit a GLM in Python?

statsmodels
Importing statsmodels

import statsmodels.api as sm

Support for formulas

import statsmodels.formula.api as smf

Use glm() directly

from statsmodels.formula.api import glm



Process of model fit
1. Describe the model → glm()
2. Fit the model → .fit()
3. Summarize the model → .summary()
4. Make model predictions → .predict()



Describing the model
FORMULA based

from statsmodels.formula.api import glm

model = glm(formula, data, family)

ARRAY based

import statsmodels.api as sm

X = sm.add_constant(X)
model = sm.GLM(y, X, family)



Formula Argument
response ∼ explanatory variable(s)

output ∼ input(s)

formula = 'y ~ x1 + x2'

C(x1) : treat x1 as categorical variable

-1 : remove intercept
x1:x2 : an interaction term between x1 and x2

x1*x2 : the individual variables x1 and x2 plus their interaction (shorthand for x1 + x2 + x1:x2)

np.log(x1) : apply vectorized functions to model variables
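
To see what each formula term does, the sketch below builds (without fitting) a model for several formulas and lists the resulting design-matrix columns. The data frame and its column names are made up for illustration, and the printed column labels depend on the formula engine statsmodels uses.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import glm

# Hypothetical data: two numeric inputs and a three-level group code
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "x1": rng.uniform(1, 10, size=60),   # positive, so log is defined
    "x2": rng.normal(size=60),
    "group": np.tile([1, 2, 3], 20),
})
df["y"] = rng.normal(size=60)

formulas = [
    "y ~ x1 + x2",       # intercept + two main effects
    "y ~ x1 + x2 - 1",   # same, intercept removed
    "y ~ C(group)",      # intercept + dummy columns for group
    "y ~ x1:x2",         # intercept + interaction only
    "y ~ x1*x2",         # intercept + x1 + x2 + x1:x2
    "y ~ np.log(x1)",    # intercept + transformed variable
]

n_columns = {}
for formula in formulas:
    model = glm(formula, data=df)  # default family; model is not fitted
    n_columns[formula] = len(model.exog_names)
    print(f"{formula:18s} -> {model.exog_names}")
```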



Family Argument
family = sm.families.____()

The family functions:

Gaussian(link = sm.families.links.identity) → the default family

Binomial(link = sm.families.links.logit)
    other available links: probit, cauchy, log, and cloglog

Poisson(link = sm.families.links.log)
    other available links: identity and sqrt

Other distribution families are documented on the statsmodels website.



Summarizing the model
print(model_GLM.summary())



                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  173
Model:                            GLM   Df Residuals:                      171
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -97.226
Date:                Mon, 21 Jan 2019   Deviance:                       194.45
Time:                        11:30:01   Pearson chi2:                     165.
No. Iterations:                     4   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -12.3508      2.629     -4.698      0.000     -17.503      -7.199
width          0.4972      0.102      4.887      0.000       0.298       0.697
==============================================================================



Regression coefficients
.params prints regression coefficients

model_GLM.params

Intercept   -12.350818
width         0.497231
dtype: float64

.conf_int(alpha=0.05, cols=None) prints confidence intervals

model_GLM.conf_int()

                   0         1
Intercept -17.503010 -7.198625
width       0.297833  0.696629
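
The reported intervals appear consistent with the usual normal-approximation form, estimate ± 1.96 × standard error. A small check using the rounded values from the summary table; the 1.96 critical value and the interval form are assumptions about how the numbers were produced.

```python
# (estimate, standard error) as printed in the summary table
coefficients = {
    "Intercept": (-12.3508, 2.629),
    "width": (0.4972, 0.102),
}

z_critical = 1.96  # approximate 97.5% quantile of the standard normal

intervals = {}
for name, (estimate, std_err) in coefficients.items():
    lower = estimate - z_critical * std_err
    upper = estimate + z_critical * std_err
    intervals[name] = (lower, upper)
    print(f"{name:10s} [{lower:.3f}, {upper:.3f}]")
```

Small differences from the conf_int() output come from the rounding of the standard errors in the table.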



Predictions
Specify all the model variables in test data

.predict(test_data) computes predictions


model_GLM.predict(test_data)

0 0.029309
1 0.470299
2 0.834983
3 0.972363
4 0.987941
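
For a logit-link model the predictions are probabilities, obtained by passing the linear predictor through the inverse logit. A sketch using the coefficients printed earlier in this deck; the widths are hypothetical, since the contents of test_data are not shown.

```python
import math

# Coefficients as printed by model_GLM.params
intercept, slope = -12.350818, 0.497231


def predicted_probability(width_cm):
    """Inverse logit of the linear predictor intercept + slope * width."""
    eta = intercept + slope * width_cm
    return 1 / (1 + math.exp(-eta))


# Hypothetical crab widths in cm (not the slide's test_data)
for width_cm in (20.0, 25.0, 30.0):
    print(width_cm, round(predicted_probability(width_cm), 3))
```

Wider crabs get a higher predicted probability of having a satellite, in line with the positive width coefficient.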



Let's practice!