Kaggle competition
Santander Customer Satisfaction
Objective
• To build a machine learning model from the training data set and predict which
customers in the Santander test data set are satisfied and which are unsatisfied
• A “Two-Class Boosted Decision Tree” is used to build the machine learning
model
• Decision trees are among the simpler classification techniques used to
solve real-world problems
Training data set characteristics
• As per Kaggle, the data set has anonymized columns with numeric data
• Data set has many unnamed columns
• Unnamed columns make feature extraction tedious
• Exploratory data analysis on the training set can reveal latent features which
actually contribute to the Target column
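As a rough illustration of the kind of exploratory check the last bullet refers to, the sketch below flags constant columns and ranks the remaining ones by absolute Pearson correlation with the target. It uses only the Python standard library. The tiny in-memory table and its column names (`var_a`, `var_b`, `var_c`) are invented for the example and are not columns from the Santander data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def screen_columns(columns, target):
    """Return (constant column names, other columns ranked by |correlation| with target)."""
    constant = [name for name, values in columns.items() if len(set(values)) == 1]
    ranked = sorted(
        ((name, pearson(values, target)) for name, values in columns.items()
         if name not in constant),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return constant, ranked


# Hypothetical toy data: column names and values are invented for illustration.
columns = {
    "var_a": [0, 0, 0, 0, 0, 0],  # constant, carries no information
    "var_b": [1, 2, 3, 4, 5, 6],  # rises with the target
    "var_c": [5, 1, 4, 2, 6, 3],  # only weakly related to the target
}
target = [0, 0, 0, 1, 1, 1]

constant, ranked = screen_columns(columns, target)
print("constant columns:", constant)
for name, corr in ranked:
    print(f"{name}: correlation with target = {corr:+.3f}")
```

Linear correlation is only a first-pass screen; a column can matter to a tree-based model without being linearly correlated with the target.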
Steps for developing the model
• Load the training data into Azure ML
• Use Edit Metadata to select the Target column
• Use Split Data to randomly split the data (75% training set and 25% validation set)
• Train a Two-Class Boosted Decision Tree with a single parameter set
• Score the trained model on the validation set (the 25% split)
• Use the scored output to evaluate the model and compare it with other models
• Load the test data into Azure after adding a Target column and setting the
column to 1
• Use the trained model to score the test data
• Use the output of Score Model to obtain a Kaggle score
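Outside Azure ML, the same sequence of steps could be sketched roughly as follows. This is not the Azure ML experiment itself: scikit-learn's `GradientBoostingClassifier` stands in for the Two-Class Boosted Decision Tree module, the data is synthetic (generated with `make_classification`), and all hyperparameter values are assumptions chosen for the example. ROC AUC is used as the validation metric on the assumption that it matches the metric behind the Kaggle score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set: numeric columns and an imbalanced binary target.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Random 75% / 25% split into training and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One fixed set of hyperparameters (assumed values, not taken from the slides).
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Score the validation set with predicted probabilities of the positive class.
valid_scores = model.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_scores)
print(f"validation ROC AUC: {valid_auc:.3f}")

# Score unseen rows (random numbers standing in for the test data).
rng = np.random.default_rng(1)
X_test = rng.normal(size=(5, 20))
test_scores = model.predict_proba(X_test)[:, 1]
print("test scores:", np.round(test_scores, 3))
```

Scoring with probabilities rather than hard 0/1 labels keeps the ranking information that a ranking metric such as ROC AUC relies on.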
Azure (ML) Model
[Diagram: Azure ML experiment. Modules shown: Training data, Test data, Edit Metadata (two instances), Boosted Decision Tree, Train Model, Tune Model Hyperparameters, Score Model, Evaluate Model, Convert to CSV]
Evaluation Model
[Figure: evaluation results for Train Model and Tune Model Hyperparameters]
Learnings
• The Kaggle score was 0.519 with Train Model and 0.541 with Tune Model
Hyperparameters
• Redundant columns such as ID may be excluded when building the model
• Var30 and Var38 are two important features that affect the Target column
• Var3 can be excluded as it does not help predict the Target values
• The data set is not very informative, and better feature extraction techniques
should be used to predict the Target column
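The column exclusions suggested above could be applied before training along these lines. The sketch uses only the Python standard library. The sample rows are invented, and the header spellings (`ID`, `Var3`, `Var30`, `Var38`, `Target`) follow the names used on this slide rather than the actual file headers, which may differ.

```python
import csv
import io

# Columns the slide suggests leaving out of the model inputs.
EXCLUDED = {"ID", "Var3"}


def drop_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    return [{key: value for key, value in row.items() if key not in excluded}
            for row in rows]


# Hypothetical sample data: values are invented for illustration.
sample_csv = """ID,Var3,Var30,Var38,Target
1,2,0,39205.17,0
3,2,3,49278.03,0
4,2,3,67333.77,1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
reduced = drop_columns(rows, EXCLUDED)
print("remaining columns:", list(reduced[0].keys()))
```

The ID column is still needed when writing the submission file, so it would be set aside rather than discarded.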
