Geometric and Topological Extensions of Regression Models
Colleen M. Farrelly
Background
Introduction
• Real data is messy.
  ◦ Large volumes
  ◦ Small volumes
  ◦ More predictors than individuals
  ◦ Missing data
  ◦ Correlated predictors
• The messiness of data can create computational issues for algorithms based on linear algebra solvers.
  ◦ Least squares algorithm
  ◦ Principal components algorithm
• Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
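A minimal sketch of why "more predictors than individuals" troubles a least squares solver, using simulated data (the sizes n = 5 and p = 8 and all variable names are illustrative assumptions, not part of the original example):

```r
# Simulated data with more predictors (p = 8) than observations (n = 5)
set.seed(1)
n <- 5
p <- 8
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# The normal-equations matrix X'X is p x p, but its rank cannot exceed n,
# so it is singular and cannot be inverted
XtX <- crossprod(X)
rank_XtX <- qr(XtX)$rank

# lm() cannot identify every coefficient and reports NA for the rest
fit <- lm(y ~ X)
n_undetermined <- sum(is.na(coef(fit)))
```

Inspecting `rank_XtX` and `coef(fit)` shows the rank deficiency and the coefficients that ordinary least squares leaves undetermined.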
Generalized Linear Models
• Flexible extensions of multiple regression (Gaussian distribution) common in data science today:
  ◦ Yes/no outcomes (binomial distribution)
  ◦ Count outcomes (Poisson distribution)
  ◦ Survival models (Weibull distribution)
• Transform the regression equation, through a link function, to fit the outcome distribution
  ◦ Sort of like silly putty stretching the outcome variable in the data space
• Suffer the same drawbacks as multiple regression:
  ◦ p > n (more predictors than observations)
  ◦ Correlations between predictors
  ◦ Local optima
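As a rough illustration of the outcome types listed above, base R's `glm()` fits the same linear predictor under different distributions by changing the `family` argument. The data here are simulated, and the coefficient values 0.5, 0.8, -0.3, and 1.2 are arbitrary assumptions chosen for the sketch:

```r
set.seed(42)
x <- rnorm(200)

# Count outcome: Poisson distribution with a log link
counts <- rpois(200, lambda = exp(0.5 + 0.8 * x))
pois_fit <- glm(counts ~ x, family = poisson(link = "log"))

# Yes/no outcome: binomial distribution with a logit link
yes_no <- rbinom(200, size = 1, prob = plogis(-0.3 + 1.2 * x))
logit_fit <- glm(yes_no ~ x, family = binomial(link = "logit"))

# Coefficients are on the link scale; fitted values are on the outcome scale
coef(pois_fit)
range(fitted(logit_fit))
```

The link function is what keeps fitted counts positive and fitted probabilities between 0 and 1.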
Penalized Regression Models
• Impose penalties on the generalized linear model framework:
  ◦ Sparsity (set most estimates to 0 to reduce model size and complexity)
  ◦ Robustness (generalizability of the results under noise)
• Reduce the number of predictors
  ◦ Shrink some predictor estimates to 0
  ◦ Examine sets of similar predictors
• Similar to a cowboy at the origin roping coefficients that get too close
• Includes LASSO, LARS, elastic net, and ridge regression, among others
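The "roping" picture can be sketched in the simplest textbook setting. Assuming an orthonormal design, the LASSO solution for the objective (1/2)||y - Xb||^2 + lambda * ||b||_1 is the soft-thresholded least squares estimate, while ridge regression with the objective ||y - Xb||^2 + lambda * ||b||^2 rescales every estimate by 1 / (1 + lambda). The starting estimates and lambda = 0.5 below are made-up values for illustration:

```r
# Soft-thresholding: move each estimate toward 0 by lambda, and set it to
# exactly 0 if it started within lambda of the origin
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

ols_estimates <- c(2.5, -0.3, 0.1, -1.2, 0.05)  # hypothetical unpenalized estimates
lambda <- 0.5

lasso_estimates <- soft_threshold(ols_estimates, lambda)  # sparse: small effects become 0
ridge_estimates <- ols_estimates / (1 + lambda)           # shrunk, but never exactly 0

lasso_estimates
ridge_estimates
```

Coefficients that start close to the origin are pulled to exactly zero by the LASSO penalty, which is how the penalty reduces the number of predictors.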
Homotopy-Based LASSO (lasso2)
• Homotopy arrow example
  ◦ Red and blue arrows
    ▪ Anchor start and finish points
    ▪ Wiggle middle parts of the line until arrows overlap
  ◦ Yellow arrow
    ▪ Hole presents issues
    ▪ Can’t wiggle into blue or red arrow without breaking the yellow arrow
• Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
  ◦ Avoids local optima
    ▪ Peaks
    ▪ Valleys
    ▪ Saddles
• R package lasso2 implements this for a variety of outcome types
• Homotopy as path equivalence
  ◦ Intrinsic property of topological spaces
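The arrow picture can be sketched numerically. The example below uses a straight-line homotopy H(s, t) = (1 - s) f(t) + s g(t) between two made-up paths in the plane that share their start and finish points; the path shapes and names are assumptions for illustration and are not the deformation used inside lasso2:

```r
t_grid <- seq(0, 1, length.out = 101)

# Two hypothetical paths from (0, 0) to (1, 0)
red_path  <- function(t) cbind(t, sin(pi * t))
blue_path <- function(t) cbind(t, -0.5 * sin(pi * t))

# Straight-line homotopy: s = 0 gives the red path, s = 1 gives the blue path
homotopy <- function(s, t) (1 - s) * red_path(t) + s * blue_path(t)

# Intermediate stages of the "wiggle"
stages <- lapply(c(0, 0.25, 0.5, 0.75, 1), homotopy, t = t_grid)

# Start and finish points stay anchored at every stage
starts   <- t(sapply(stages, function(path) path[1, ]))
finishes <- t(sapply(stages, function(path) path[nrow(path), ]))
```

In a plane with no hole this deformation always works. If a hole sat between the two paths, some intermediate stage would have to pass through it, which is the situation of the yellow arrow.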
Differential Geometry and Regression (dglars)
• Instead of fitting model to data, fit model to tangent space (what isn’t the data)
• Deals with collinearity, as parallel vectors share the same tangent space
• LARS/LASSO extensions
  ◦ Partition model into sets of predictors based on tangent space
  ◦ Fit sets that correspond well to an outcome
• Rao scoring for selection
  ◦ Effect estimates (angles)
  ◦ Model selection criteria
    ▪ Information criteria
    ▪ Deviance scoring
• New extensions of R package dglars
  ◦ Most exponential family distributions
    ▪ Binomial
    ▪ Poisson
    ▪ Gaussian
    ▪ Gamma
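A rough sketch of the angle idea in the Gaussian case only, where it reduces to the classical LARS picture: the predictor forming the smallest angle with the (centered) outcome is the strongest candidate to enter the model, and a perfectly collinear predictor forms the same angle. The simulated data and the helper `angle_with()` are assumptions for illustration; this is not the Rao score computation performed by dglars:

```r
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- 2 * x1                  # parallel to x1 (perfectly collinear)
x3 <- rnorm(n)
y  <- 1.5 * x1 + 0.2 * x3 + rnorm(n)

# Angle (in radians) between a centered predictor and a centered response
angle_with <- function(x, r) {
  x <- x - mean(x)
  r <- r - mean(r)
  acos(sum(x * r) / sqrt(sum(x^2) * sum(r^2)))
}

angles <- sapply(list(x1 = x1, x2 = x2, x3 = x3), angle_with, r = y)
angles
```

Here `x1` and `x2` give the same angle, so a geometric criterion treats them as one direction instead of two competing predictors.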
Applications in R
Example Dataset (Open-Source)
• Link to code and data:
  ◦ https://www.researchgate.net/project/Miami-Data-Science-Meetup
  ◦ https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
• Code:
# load data
mydata <- read.csv("MathScores.csv")
# keep only the first-term scores (drop columns 32 and 33)
mydata <- mydata[, -c(32:33)]
# fix the random seed so the split is reproducible (the seed value is arbitrary)
set.seed(1)
# split into a 70% training set and a 30% test set
s <- sample(seq_len(nrow(mydata)), floor(0.7 * nrow(mydata)))
train <- mydata[s, ]
test <- mydata[-s, ]
lasso2 Package
• R package implementing homotopy-based LASSO model
• Example code for Gaussian (linear) regression on the first-term score G1:
library(lasso2)
# defined before fitting, as in the original code (gl1ce() may otherwise fail to find etastart)
etastart <- NULL
# run the model; several bounds can be tried and their fits compared
las <- gl1ce(G1 ~ ., train, family = gaussian(link = identity), bound = 5, standardize = FALSE)
# predict scores of the test group
lpred <- predict(las, test, type = "response")
# test-set mean squared error (MSE)
mean((lpred - test$G1)^2)
# compare to the MSE of the mean model
mean((mean(test$G1) - test$G1)^2)
# obtain coefficients
coef(las)
# obtain deviance estimate (model fit; can be used to derive AIC/BIC)
deviance(las)
• Try it out on your dataset!
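The comment about deriving AIC/BIC from the deviance can be sketched for the Gaussian case with base R only. The example uses the built-in `mtcars` data and an ordinary `lm()` fit as a stand-in (an assumption, so that the result can be checked against `AIC()` and `BIC()`); for a penalized fit the parameter count `k` would need an estimate of the effective degrees of freedom, for example the number of nonzero coefficients:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- deviance(fit)             # Gaussian deviance = residual sum of squares
k   <- length(coef(fit)) + 1     # regression coefficients plus the error variance

# -2 * maximized Gaussian log-likelihood, written in terms of the deviance
minus2_loglik <- n * (log(2 * pi) + log(rss / n) + 1)

aic_manual <- minus2_loglik + 2 * k
bic_manual <- minus2_loglik + log(n) * k

c(manual = aic_manual, builtin = AIC(fit))
c(manual = bic_manual, builtin = BIC(fit))
```

The manual and built-in values agree, which confirms that the deviance together with a parameter count is enough to rank Gaussian models by information criteria.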
dglars Package
• R package implementing differential-geometry-based LARS algorithm
• Example code for Gaussian (linear) regression on the first-term score G1:
library(dglars)
dg <- dglars(G1 ~ ., family = "gaussian", data = train)
# can also use cross-validation (cvdglars() function)
dg2 <- cvdglars(G1 ~ ., family = "gaussian", data = train)
# summary of the model
summary(dg)
# extract coefficients from matrix of coefficients at each step
coef(dg)
# obtain model fit statistics; can also use logLik(dg)
AIC(dg)
AIC(dg2)
# plot path of LARS algorithm or model fit for cross-validated model
plot(dg)
plot(dg2)
• Try it out on your dataset!
Compare with multiple linear regression
# compare DGLARS with multiple linear regression
gl <- lm(G1 ~ ., data = train)
AIC(gl)   # 1418 in the original run
AIC(dg)   # 1402 in the original run
AIC(dg2)  # 1403 in the original run
# obtain coefficients to compare with both penalized models
summary(gl)
# compare prediction accuracy (test-set MSE)
pred <- predict(gl, newdata = test)
mean((pred - test$G1)^2)            # multiple linear regression
mean((lpred - test$G1)^2)           # homotopy LASSO (lasso2)
mean((mean(test$G1) - test$G1)^2)   # mean model
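The same comparison pattern can be tried without the course data. Below is a self-contained sketch on the built-in `mtcars` data; the dataset, the formula `mpg ~ wt + hp`, the seed, and the helper `mse()` are all assumptions chosen for illustration:

```r
# Mean squared error between predictions and observed outcomes
mse <- function(predicted, observed) mean((predicted - observed)^2)

set.seed(123)
n <- nrow(mtcars)
train_idx  <- sample(seq_len(n), floor(0.7 * n))
train_demo <- mtcars[train_idx, ]
test_demo  <- mtcars[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows
lm_demo <- lm(mpg ~ wt + hp, data = train_demo)
mse_model    <- mse(predict(lm_demo, newdata = test_demo), test_demo$mpg)
mse_baseline <- mse(mean(train_demo$mpg), test_demo$mpg)  # mean model

c(model = mse_model, baseline = mse_baseline)
```

This sketch takes the baseline mean from the training rows, so that no test-set information enters the baseline. A model is only worth keeping if its test MSE is clearly below that baseline.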
Conclusions and References
Summary
• Geometry and topology can be leveraged to improve generalized linear regression and penalized regression model performance, particularly when data suffers from general “messiness.”
• Multiple R packages exist to implement these algorithms, and algorithms are built to accommodate many common exponential family distributions of outcomes.
• Packages provide interpretable models similar to generalized linear regression, model fit statistics, and prediction capabilities.
• Many more extensions of regression are possible, and there is work being done to modify other algorithms based on topology and differential geometry.
Open-Source References
• Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva, Isabella Morlini, Francesco Palumbo.
• Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic Regression.
• Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package ‘lasso2’.
• Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389-403.
• R package tutorials:
  ◦ https://cran.r-project.org/web/packages/dglars/dglars.pdf
  ◦ https://cran.r-project.org/web/packages/lasso2/lasso2.pdf
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 

Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models

  • 1. Geometric and Topological Extensions of Regression Models, by Colleen M. Farrelly
  • 3. Introduction
      ◦ Real data is messy:
          ◦ Large volumes
          ◦ Small volumes
          ◦ More predictors than individuals
          ◦ Missing data
          ◦ Correlated predictors
      ◦ The messiness of data can create computational issues for algorithms based on linear algebra solvers:
          ◦ Least squares algorithm
          ◦ Principal components algorithm
      ◦ Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
  • 4. Generalized Linear Models
      ◦ Flexible extensions of multiple regression (Gaussian distribution) common in data science today:
          ◦ Yes/no outcomes (binomial distribution)
          ◦ Count outcomes (Poisson distribution)
          ◦ Survival models (Weibull distribution)
      ◦ A link function transforms the regression equation to fit the outcome distribution
          ◦ Sort of like silly putty stretching the outcome variable in the data space
      ◦ Suffer the same drawbacks as multiple regression:
          ◦ More predictors than observations (p > n)
          ◦ Correlations between predictors
          ◦ Local optima
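As a minimal sketch of how a link function adapts the regression equation to the outcome distribution, the base R `glm()` call below fits a Poisson model to simulated counts. The data, seed, and variable names are made up for illustration and are not from the talk's dataset.

```r
# Simulated count data (illustrative only; not the talk's dataset)
set.seed(42)
n <- 200
x <- rnorm(n)
# True model on the log scale: log(E[y]) = 0.5 + 0.8 * x
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))

# Poisson GLM with the canonical log link
fit <- glm(y ~ x, family = poisson(link = "log"))

# Coefficients are reported on the link (log) scale
print(coef(fit))

# Predictions on the response scale are expected counts, so always positive
mu_hat <- predict(fit, type = "response")
```

Swapping `family = poisson(link = "log")` for `binomial(link = "logit")` gives the yes/no case; the linear predictor stays the same and only the link and distribution change.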
  • 5. Penalized Regression Models
      ◦ Impose penalties on the generalized linear model frameworks:
          ◦ Sparsity (set most estimates to 0 to reduce model size and complexity)
          ◦ Robustness (generalizability of the results under noise)
      ◦ Reduce the number of predictors
          ◦ Shrink some predictor estimates to 0
          ◦ Examine sets of similar predictors
      ◦ Similar to a cowboy at the origin roping coefficients that get too close
      ◦ Includes LASSO, LARS, elastic net, and ridge regression, among others
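In symbols, these penalties are commonly written as follows. This is a sketch in standard notation rather than notation from the slides: $\ell(\beta)$ is the model log-likelihood, and $\lambda \ge 0$ and $\alpha \in [0,1]$ are tuning parameters introduced here for illustration.

```latex
\hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda\, P(\beta) \right\},
\qquad
P(\beta) =
\begin{cases}
\lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert & \text{(LASSO)} \\[4pt]
\lVert \beta \rVert_2^2 = \sum_j \beta_j^2 & \text{(ridge)} \\[4pt]
\alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 & \text{(elastic net)}
\end{cases}
```

The $\lVert \beta \rVert_1$ term is the one that can set coefficients exactly to 0; the squared term shrinks coefficients toward 0 without removing them. The LASSO can equivalently be stated as minimizing $-\ell(\beta)$ subject to $\lVert \beta \rVert_1 \le t$ for some bound $t$.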
  • 6. Homotopy-Based LASSO (lasso2)
      ◦ Homotopy arrow example
          ◦ Red and blue arrows
              ◦ Anchor start and finish points
              ◦ Wiggle middle parts of the line until arrows overlap
          ◦ Yellow arrow
              ◦ Hole presents issues
              ◦ Can’t wiggle into blue or red arrow without breaking the yellow arrow
      ◦ Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
          ◦ Avoids local optima
              ◦ Peaks
              ◦ Valleys
              ◦ Saddles
      ◦ The R package lasso2 implements this for a variety of outcome types
      ◦ Homotopy as path equivalence
          ◦ Intrinsic property of topological spaces
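The arrow picture corresponds to the standard definition of a homotopy between paths with fixed endpoints. The symbols below ($X$, $f$, $g$, $H$, $x_0$, $x_1$) are standard textbook notation, not taken from the slides.

```latex
\text{Let } f, g : [0,1] \to X \text{ be paths with } f(0) = g(0) = x_0 \text{ and } f(1) = g(1) = x_1.
\\[4pt]
f \simeq g \iff \exists\, H : [0,1] \times [0,1] \to X \text{ continuous such that }
\\[4pt]
H(s, 0) = f(s), \quad H(s, 1) = g(s), \quad H(0, t) = x_0, \quad H(1, t) = x_1 .
```

Here $s$ moves along the arrow and $t$ is the "wiggle" parameter: the last two conditions anchor the start and finish points, and continuity of $H$ is what fails when a hole sits between the yellow arrow and the other two.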
  • 7. Differential Geometry and Regression (dglars)
      ◦ Instead of fitting model to data, fit model to tangent space (what isn’t the data)
          ◦ Deals with collinearity, as parallel vectors share the same tangent space
      ◦ LARS/LASSO extensions
          ◦ Partition model into sets of predictors based on tangent space
          ◦ Fit sets that correspond well to an outcome
      ◦ Rao scoring for selection
          ◦ Effect estimates (angles)
          ◦ Model selection criteria
              ◦ Information criteria
              ◦ Deviance scoring
      ◦ New extensions of R package dglars
          ◦ Most exponential family distributions
              ◦ Binomial
              ◦ Poisson
              ◦ Gaussian
              ◦ Gamma
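For orientation, the textbook form of a Rao score statistic for a single coefficient is sketched below. The notation ($r_m$, $I_m$) is an assumption made for illustration; the exact quantities that dglars computes are defined in Augugliaro, Mineo & Wit (2013).

```latex
r_m(\beta) = \frac{\partial_m \ell(\beta)}{\sqrt{I_m(\beta)}},
\qquad
\partial_m \ell(\beta) = \frac{\partial \ell(\beta)}{\partial \beta_m},
\qquad
I_m(\beta) = \mathbb{E}\!\left[ \left( \partial_m \ell(\beta) \right)^2 \right].
```

For a Gaussian linear model with error standard deviation $\sigma$ this reduces to $r_m(\beta) = x_m^{\top}(y - X\beta) / (\sigma \lVert x_m \rVert)$, which is proportional to the correlation between predictor $m$ and the current residual, the quantity that ordinary LARS tracks.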
  • 9. Example Dataset (Open-Source)
      ◦ Link to code and data:
          ◦ https://www.researchgate.net/project/Miami-Data-Science-Meetup
          ◦ https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
      ◦ Code:
          # load data
          mydata <- read.csv("MathScores.csv")
          # retrieve only first-term scores (drop columns 32 and 33)
          mydata <- mydata[, -c(32:33)]
          # split into training (70%) and test sets
          s <- sample(1:395, 0.7 * 395)
          train <- mydata[s, ]
          test <- mydata[-s, ]
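The split above is random, so results change from run to run. A minimal sketch of a reproducible version follows. It uses a simulated stand-in data frame with 395 rows rather than the real MathScores.csv, and the seed value and the `age` column are arbitrary assumptions.

```r
# Stand-in data with 395 rows (simulated; not the real MathScores.csv)
set.seed(2024)  # fixing the seed makes the random split repeatable
mydata <- data.frame(G1  = rnorm(395, mean = 11, sd = 3),
                     age = sample(15:22, 395, replace = TRUE))

n <- nrow(mydata)
n_train <- floor(0.7 * n)          # 276 training rows out of 395
s <- sample(seq_len(n), n_train)   # row indices for the training set
train <- mydata[s, ]
test  <- mydata[-s, ]              # the remaining 119 rows

c(train = nrow(train), test = nrow(test))
```

Using `nrow()` instead of the hard-coded 395 keeps the split correct if the dataset changes size, and it explains where the divisor 119 in the later error calculations comes from.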
  • 10. lasso2 Package
      ◦ R package implementing homotopy-based LASSO model
      ◦ Example code (Gaussian family with identity link, i.e. linear regression on G1):
          library(lasso2)
          # run the model; can use multiple bounds and compare fit
          etastart <- NULL
          las <- gl1ce(G1 ~ ., train, family = gaussian(link = identity), bound = 5, standardize = FALSE)
          # predict scores of test group
          lpred <- predict(las, test, type = "response")
          # test-set mean squared error (119 test rows)
          sum((lpred - test$G1)^2) / 119
          # compare to MSE of mean model
          sum((mean(test$G1) - test$G1)^2) / 119
          # obtain coefficients
          coef(las)
          # obtain deviance estimate (model fit; can be used to derive AIC/BIC)
          deviance(las)
      ◦ Try it out on your dataset!
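To see why an L1 penalty produces exact zeros, the base R sketch below applies soft-thresholding, which is the closed-form LASSO solution in the special case of orthonormal predictors. This is not the homotopy algorithm that lasso2 uses; the function name `soft_threshold`, the example coefficients, and the penalty value are all hypothetical.

```r
# Soft-thresholding: shrink each value toward 0 by lambda, and set it to
# exactly 0 when its absolute value is below lambda
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical least-squares coefficients from an orthonormal design
ols <- c(3.0, -0.4, 0.1, -2.2)

# LASSO estimates for penalty lambda = 0.5 under that assumption
lasso_est <- soft_threshold(ols, lambda = 0.5)
print(lasso_est)
```

The two small coefficients are set exactly to 0 while the two large ones are pulled toward 0 by the penalty, which is the "roping" behavior described for penalized models.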
  • 11. dglars Package
      ◦ R package implementing differential-geometry-based LARS algorithm
      ◦ Example code (Gaussian family, i.e. linear regression on G1):
          library(dglars)
          dg <- dglars(G1 ~ ., family = "gaussian", data = train)
          # can also use cross-validation (cvdglars() function)
          dg2 <- cvdglars(G1 ~ ., family = "gaussian", data = train)
          # summary of the model
          summary(dg)
          # extract coefficients from matrix of coefficients at each step
          coef(dg)
          # obtain model fit statistics, can also use logLik(dg)
          AIC(dg)
          AIC(dg2)
          # plot path of LARS algorithm or model fit for cross-validated model
          plot(dg)
          plot(dg2)
      ◦ Try it out on your dataset!
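The fit statistics mentioned above follow the usual definitions, AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n). The sketch below checks those formulas against base R on a simulated `lm()` fit. The data are made up, and penalized packages such as dglars may count the degrees of freedom k differently, so this illustrates the formula rather than reproducing the package's output.

```r
# Simulated data (illustrative only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

ll <- logLik(fit)
k  <- attr(ll, "df")  # estimated parameters: 3 coefficients + error variance

# Information criteria computed by hand from the log-likelihood
aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

# Both comparisons should report TRUE
print(all.equal(aic_manual, AIC(fit)))
print(all.equal(bic_manual, BIC(fit)))
```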
  • 12. Compare with multiple linear regression
          # compare DGLARS with multiple linear regression
          gl <- lm(G1 ~ ., data = train)
          AIC(gl)   # 1418
          AIC(dg)   # 1402
          AIC(dg2)  # 1403
          # obtain coefficients to compare with both penalized models
          summary(gl)
          # compare prediction accuracy (test-set mean squared error, 119 test rows)
          pred <- predict(gl, test)
          sum((pred - test$G1)^2) / 119             # multiple linear regression
          sum((lpred - test$G1)^2) / 119            # lasso2 model
          sum((mean(test$G1) - test$G1)^2) / 119    # mean model
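The repeated `sum(...)/119` expressions can be wrapped in a small helper so the divisor always matches the test set. The sketch below uses simulated data and a hypothetical helper name `mse`. It also uses the training mean as the baseline, a slight departure from the slide's `mean(test$G1)`, so that no test-set information leaks into the baseline.

```r
# Mean squared error between predictions and observed values
mse <- function(pred, obs) mean((pred - obs)^2)

# Simulated data and a 70/30 split (illustrative only)
set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 3 + 1.5 * d$x + rnorm(n)
idx   <- sample(seq_len(n), floor(0.7 * n))
train <- d[idx, ]
test  <- d[-idx, ]

fit <- lm(y ~ x, data = train)
mse_model <- mse(predict(fit, newdata = test), test$y)
# Baseline "mean model": predict the training mean for every test row
mse_mean <- mse(mean(train$y), test$y)

print(c(model = mse_model, mean_model = mse_mean))
```

A fitted model is only worth keeping if its test error is clearly below that of the mean model.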
  • 14. Summary
      ◦ Geometry and topology can be leveraged to improve generalized linear regression and penalized regression model performance, particularly when data suffers from general “messiness.”
      ◦ Multiple R packages exist to implement these algorithms, and algorithms are built to accommodate many common exponential family distributions of outcomes.
      ◦ Packages provide interpretable models similar to generalized linear regression, model fit statistics, and prediction capabilities.
      ◦ Many more extensions of regression are possible, and there is work being done to modify other algorithms based on topology and differential geometry.
  • 15. Open-Source References
      ◦ Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva, Isabella Morlini, Francesco Palumbo.
      ◦ Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic Regression.
      ◦ Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package ‘lasso2’.
      ◦ Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389-403.
      ◦ R package tutorials:
          ◦ https://cran.r-project.org/web/packages/dglars/dglars.pdf
          ◦ https://cran.r-project.org/web/packages/lasso2/lasso2.pdf

Editor's Notes

  • #5: Same assumptions as multiple regression, minus outcome’s normal distribution (link function extends to non-normal distributions). McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3), 285-292.
  • #6: Relaxes predictor independence requirement and adds penalty term. Adds a penalty to reduce generalized linear model’s model size. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
  • #8: Exists for several types of models, including survival, binomial, and Poisson regression models. Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498. Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.