Distributed Logistic Model Trees


The document discusses the use of interpretable algorithms in data science instead of “black box” methods, focusing on logistic model trees, which combine two interpretable models: logistic regression and decision trees. It reviews the metrics used to evaluate model performance (accuracy, precision, recall, AUROC and AUPRC) and describes a distributed implementation suited to big data environments. The authors show an example on a synthetic dataset and outline the advantages of using Spark to distribute tree building and the fitting of logistic regressions.

Data & Analytics

Distributed Logistic Model Trees, Stratio Intelligence
Mateo Álvarez and Antonio Soriano
@StratioBD

Aerospace Engineer, MSc in Propulsion Systems (UPM), Master in Data Science (URJC). Working as data scientist and Big Data developer at Stratio Big Data in the data science department.
mateo-alvarez

Ph.D. in Telecommunications, MSc in Electronic Systems Engineering and Telecommunication Technologies, Systems and Networks (UPV), and MSc “Big Data Expert” (UTAD). Working as data scientist and Big Data developer at Stratio Big Data in the data science department.
@Phd_A_Soriano

Why use interpretable algorithms instead of “black boxes”
Logistic Regression
Decision Trees
Variance-Bias tradeoff
Metrics
Demo
Logistic Model Trees
Distributed implementation
Cost function & configuration params
Demo

• Why use interpretable algorithms instead of “black boxes”
• Logistic Regression
• Decision Trees
• Variance-Bias tradeoff

Accuracy vs. explainability
Example domains: medical studies, power management, financial environment, criminal activity

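[Figure: logistic regression. Probability (y-axis) against a feature (x-axis), with a decision threshold on the probability; a region of poor local fit is marked.]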
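The role of the threshold in the logistic regression figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the coefficients `WEIGHT` and `BIAS` are assumptions chosen only so that the example is easy to follow.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: float, weight: float, bias: float) -> float:
    """Probability of the positive class for a single feature value."""
    return sigmoid(weight * x + bias)


def predict_label(x: float, weight: float, bias: float,
                  threshold: float = 0.5) -> int:
    """Return 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0


# Hypothetical coefficients, chosen only for illustration.
WEIGHT, BIAS = 2.0, -1.0

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        p = predict_proba(x, WEIGHT, BIAS)
        print(x, round(p, 3), predict_label(x, WEIGHT, BIAS))
```

Lowering the threshold turns more borderline cases into positive predictions, which is why the threshold is worth tuning rather than leaving at 0.5.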

[Figure: decision tree. The root node splits on Feature 1, each branch splits again on Feature 2, and the tree ends in four leaf nodes; a region of poor local fit is marked.]
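The tree in the figure (a root split on Feature 1, a second split on Feature 2, four leaves) can be written down as a small data structure. The sketch below is an assumption-laden illustration: the class names, the split thresholds and the leaf labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: int  # class predicted for every sample that reaches this leaf


@dataclass
class Split:
    feature: str      # name of the feature tested at this node
    threshold: float  # samples with value <= threshold go left
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict(node: Node, sample: Dict[str, float]) -> int:
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, Split):
        go_left = sample[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.label


# Hypothetical tree with the same shape as the figure.
TREE = Split(
    "feature_1", 0.5,
    left=Split("feature_2", 0.3, Leaf(0), Leaf(1)),
    right=Split("feature_2", 0.7, Leaf(1), Leaf(0)),
)

if __name__ == "__main__":
    print(predict(TREE, {"feature_1": 0.2, "feature_2": 0.9}))
```

Every sample in a leaf receives the same constant prediction, which is one way to read the "poor local fit" annotation.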

[Figure: bias-variance tradeoff. Mean error against model complexity, showing training error and test error; total error decomposed into bias² and variance; underfitting at low complexity, overfitting at high complexity.]

Bias: important variables needed to make the predictions are missing.
Variance: overfitting to the sample/training data.
Irreducible error: error on the prediction that cannot be removed.
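For squared-error loss, the three error sources above add up in the standard bias-variance decomposition. The notation is an assumption, not taken from the slides: the data follow y = f(x) + ε with noise variance σ², and \hat{f} is the fitted model.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

Increasing model complexity typically lowers the bias term and raises the variance term, while σ² stays fixed whatever the model.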

• Logistic Model Trees
• Distributed implementation
• Cost function & configuration parameters
• Demo

[Figure: logistic model tree. A tree rooted at a split on Feature 1, with performance metrics attached to the root node and to each child node as the tree grows.]

DISTRIBUTED IMPLEMENTATION
• Spark’s Decision Tree (distributed implementation of random forests)
• Spark’s Logistic Regression / Weka’s Logistic Regression on the nodes
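How the two components fit together at prediction time can be sketched without Spark or Weka. The plain-Python example below is only an illustration of the idea: it assumes the logistic models sit on the leaves, and every class name, coefficient and threshold is hypothetical rather than part of the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class LogisticLeaf:
    weights: Dict[str, float]  # one coefficient per feature used by this leaf
    intercept: float
    threshold: float = 0.5     # decision threshold local to this leaf

    def proba(self, sample: Dict[str, float]) -> float:
        """Probability of the positive class under this leaf's model."""
        z = self.intercept + sum(
            w * sample[name] for name, w in self.weights.items()
        )
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class SplitNode:
    feature: str
    cut: float  # samples with value <= cut go left
    left: "LmtNode"
    right: "LmtNode"


LmtNode = Union[LogisticLeaf, SplitNode]


def lmt_predict(node: LmtNode, sample: Dict[str, float]) -> Tuple[float, int]:
    """Route the sample down the tree, then apply the leaf's logistic model."""
    while isinstance(node, SplitNode):
        node = node.left if sample[node.feature] <= node.cut else node.right
    p = node.proba(sample)
    return p, int(p >= node.threshold)


# Hypothetical two-leaf model: each region gets its own regression.
LMT = SplitNode(
    "feature_1", 0.5,
    left=LogisticLeaf({"feature_2": 4.0}, -2.0),
    right=LogisticLeaf({"feature_2": -4.0}, 2.0, threshold=0.7),
)

if __name__ == "__main__":
    print(lmt_predict(LMT, {"feature_1": 0.2, "feature_2": 0.5}))
    print(lmt_predict(LMT, {"feature_1": 0.8, "feature_2": 0.5}))
```

The tree partitions the feature space and each partition gets its own simple model, so the result stays readable: a handful of splits plus a handful of coefficients.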

LMT cost function used to set the logistic regression threshold:
• AccuracyCostFunction
• ConfusionMatrix
• PrecisionCostFunction
• PrecisionRecallCostFunction
• RocCostFunction
The same cost function is used as the pruning criterion.
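Setting the threshold with a cost function can be pictured as a small search: score each candidate threshold on labelled data and keep the best one. The sketch below assumes that reading. The function names are hypothetical stand-ins and are not the API of the cost-function classes listed above; the data are made up.

```python
from typing import Callable, Sequence, Tuple

Confusion = Tuple[int, int, int, int]  # (TP, FP, TN, FN)


def confusion(labels: Sequence[int], probs: Sequence[float],
              threshold: float) -> Confusion:
    """Count TP, FP, TN, FN when predicting positive for prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_score(c: Confusion) -> float:
    tp, fp, tn, fn = c
    return (tp + tn) / (tp + fp + tn + fn)


def precision_score(c: Confusion) -> float:
    tp, fp, _, _ = c
    return tp / (tp + fp) if tp + fp else 0.0


def best_threshold(labels: Sequence[int], probs: Sequence[float],
                   score: Callable[[Confusion], float],
                   candidates: Sequence[float]) -> float:
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=lambda t: score(confusion(labels, probs, t)))


# Made-up labels and predicted probabilities, for illustration only.
LABELS = [0, 0, 1, 1, 1]
PROBS = [0.2, 0.6, 0.4, 0.7, 0.9]
CANDIDATES = [0.3, 0.5, 0.8]

if __name__ == "__main__":
    print(best_threshold(LABELS, PROBS, accuracy_score, CANDIDATES))
    print(best_threshold(LABELS, PROBS, precision_score, CANDIDATES))
```

On this toy data the accuracy-based score selects 0.3 and the precision-based score selects 0.8, so the choice of cost function changes the resulting classifier.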

ADVANTAGES OF THIS IMPLEMENTATION
Big datasets: the power of Spark is used to distribute both the building of the tree and the logistic regressions.
Medium datasets: distributed tree growth combined with Weka’s logistic regression.
Small datasets: although distributing the data for the decision tree can be slow, the cost functions can still be used, along with specific optimizations for particular cases.

Example of the DLMT algorithm on a synthetic dataset

                              PREDICTION
                              Positive           Negative
TRUE CONDITION   Positive     True Positives     False Negatives
                 Negative     False Positives    True Negatives

True Positive Rate (Recall): TPR = TP/(TP+FN). Insensitive to class imbalance.
False Positive Rate: FPR = FP/(FP+TN). Insensitive to class imbalance.
Precision = TP/(TP+FP). Sensitive to class imbalance.
Accuracy = (TP+TN)/(TP+TN+FP+FN). Sensitive to class imbalance.
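The four formulas above translate directly into code, and a small experiment shows what "sensitive to imbalance" means in practice. The sketch is illustrative: the function name and the counts are made up. The second call has ten times more negatives, with the classifier behaving identically on each class.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Metrics from the slide, computed from the confusion matrix counts."""
    return {
        "tpr": _ratio(tp, tp + fn),        # recall
        "fpr": _ratio(fp, fp + tn),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up counts: 100 positives and 100 negatives.
BALANCED = rates(tp=80, fp=10, tn=90, fn=20)
# Same per-class behaviour, but 1000 negatives instead of 100.
IMBALANCED = rates(tp=80, fp=100, tn=900, fn=20)

if __name__ == "__main__":
    for name in ("tpr", "fpr", "precision", "accuracy"):
        print(name, round(BALANCED[name], 3), round(IMBALANCED[name], 3))
```

TPR and FPR are computed within a single true class, so they do not move when the class proportions change; precision and accuracy mix both classes and therefore shift.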

AUROC (AUC): area under the ROC curve, which plots TPR against FPR -> insensitive to class imbalance.
Best performance: the top-left corner of the plot (TPR = 1, FPR = 0).
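The AUROC can be computed by sweeping the threshold from high to low, recording one (FPR, TPR) point per distinct score, and integrating with the trapezoidal rule. The sketch below is a plain-Python illustration of that procedure; the function names and the data are made up.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) points obtained by lowering the threshold score by score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Tied scores cross the threshold together, giving a single point.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(auroc(labels, scores))
    # Duplicating the negatives changes the class balance but not the area.
    print(auroc(labels + [0, 0], scores + [0.1, 0.4]))
```

Both calls give 0.75: the curve is built from per-class rates, so adding more negatives with the same score distribution leaves it unchanged.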

Accuracy vs. explainability
Performance metrics: AUROC, AUPRC, accuracy
Automatic Benchmarking Framework (ABF)

THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
contact@stratio.com
www.stratio.com
@StratioBD

people@stratio.com
WE ARE HIRING


AUPRC: area under the precision-recall curve, which plots precision against recall (TPR) -> sensitive to class imbalance.
Best performance: precision = 1 and recall = 1.
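One common estimator of the AUPRC is average precision: sweep the threshold from high to low and sum the precision at each step, weighted by the increase in recall. The sketch below uses that estimator as an assumption (other integration rules give slightly different values); the function name and the data are made up.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    previous_recall = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(average_precision(labels, scores))
    # Duplicating the negatives lowers the value: the metric tracks imbalance.
    print(average_precision(labels + [0, 0], scores + [0.1, 0.4]))
```

Unlike the ROC-based area, the value drops (here from about 0.83 to 0.75) when negatives are added, because precision counts false positives against a fixed number of true positives.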
