0% found this document useful (0 votes)

29 views

Using CHAID For Classification Problems: Ray@hrs - Co.nz

This document discusses using CHAID (Chi-squared Automatic Interaction Detector) for classification problems. It provides two examples: 1) Using longitude and latitude to correctly classify hurricanes as either BARO or TROP. CHAID is able to correctly classify nearly all storms based on longitude alone. 2) Using real financial data to identify subgroups more likely to default on payments. CHAID identifies factors like renting vs owning, having past disputes ("mercs"), and payment plan as significant predictors of default risk. The tree shows defaulters are more common among renters, and among renters with past disputes.

Uploaded by

jogi_lal

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views

Using CHAID For Classification Problems: Ray@hrs - Co.nz

Uploaded by

jogi_lal

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 5

Using CHAID for classification problems

Dr Ray Hoare, HRS Ltd, [email protected] A paper presented at the New Zealand Statistical Association 2004 conference, Wellington.

ABSTRACT
The author believes that CHAID analysis could be much more widely used than it is, as a tool for exploring commercial and scientific data. Two examples will be presented, one of which is based on real-world New Zealand financial data, with the objective of showing that CHAID can be an everyday tool for dealing with non-linear or complex datasets, in order to find significant patterns.

Introduction
I talk to many people about statistical software, and often ask them what they know about CHAID. I usually get a blank stare, even among people who are well trained in statistics. I also make a point of reading what I can about CHAID, and often find it is presented as one of those mysterious components of insanely expensive data mining programs, and therefore not of interest to mere mortals. My own experience in dealing with clients data, and the experience of some of my more adventurous clients, is that CHAID, in one or other of its many forms, is a great way to sift certain kinds of data to find out where interesting relationships are buried, especially when the relationships are more complex than the linear or at least monotonic ones usually sought. I will present in this paper two datasets that show very different kinds of application of the CHAID method. This printed paper shows the overview of what will be shown in a live investigation at a conference.

What is CHAID?
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. Although it can be used for regression problems, in this paper I will only build classification trees. The Chi-squared part of the name arises because the technique essentially involves automatically constructing many cross-tabs, and working out statistical significance of the proportions. The most significant relationships are used to control the structure of a tree diagram. Because the goal of classification trees is to predict or explain responses on a categorical dependent variable, the technique has much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of classification trees make them a very attractive analysis option, but this is not to say

that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed. Classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret than they would be if only a strict numerical interpretation were possible.

Example 1. Correctly classifying all cases in a dataset

Lets take the artificial data in barotrop.sta, part of which is shown below.
1 2 3 4 5 6 LONGITUD 61.50 61.50 62.00 63.00 63.50 64.00 LATITUDE 17.00 19.00 14.00 15.00 19.00 12.00 CLASS BARO BARO BARO TROP TROP TROP

There are two kinds of hurricane, BARO and TROPO. We want to predict which kind of hurricane it will be from the longitude and latitude. CHAID and other kinds of tree algorithm can classify the data so that if you know the latitude and longitude you will be able to say what kind of hurricane it is likely to be.

BARO TROP

Tree graph for CLASS Num. of non-terminal nodes: 1, Num. of terminal nodes: 3

ID=1 BARO

N=37

LONGITUD

<= 61.500000 ID=2 BARO N=8 ID=3

<= 67.000000 N=18 TROP ID=4

> 67.000000 N=11 BARO

The data shown was analysed with Interactive CHAID in STATISTICA, and the tree graph shown here shows the algorithm has classified the data so that nearly all storms are correctly predicted, based entirely on the longitude, even though the latitude was supplied to the program. The graph of longitude against latitude, with the hurricane type shown as different markers, shows why the classification is so good. In fact, it shows that the classification should be better! The imperfect classification arises because CHAID uses cross-tabs of categorical variables. When you have a continuous variable, such as
Data from Barotrop.sta
22

BARO TROP
20

LATITUDE

10 58

Longitude

longitude, it is automatically broken up into sets of ranges. In this case, the sets do not divide at the value which would lead to perfect classification. Not many real examples are as clear as this one. However, many cases occur where the object is similar come up with a rule that leads to as good a classification as possible. Examples of this occur when you want to do a direct marketing campaign, and you need to hit as many good prospects as possible while minimising the wasted mailings.

Example 2: Locating interesting subsets in a dataset Data used

I have been inviting customers or prospective customers to supply me with real data to explore, so that I can find out for myself how much of the manufacturers enthusiastic ravings to believe. In most practical cases, I have found that the data does not lend itself to simplistic use such as the one above. Instead, it can be used to provide useful insights on subgroups of your data. I will illustrate this with a dataset whose origin I am not at liberty to disclose, but which is local, and real apart from some descriptive labels I invented for clarity. The table shows the descriptions of the variables available for the analysis. We want to find those factors that indicate whether someone is likely to default on a payment. Most of the data is from credit agency, but some is from the application form used to sign up for the service. There are about 15,000 cases in use for the analysis.
client defaulted on payment client discontinued the service early Age Group number of defaults number of mercantiles number of judgements number of bankruptcies number of defaults last 180 days number of mercantiles last 180 days number of judgements last 180 days Number of bankrupticies last 180 days number of defaults last 365 days number of mercantiles last 365 days number of judgements last 365 days Number of bankrupticies last 365 days company's classification of the applicant housing status the category of the service applied for occupation status time with employer time at address number of enquiries at credit agency first enquiry at credit agency for this applicant application is the result of a promotion switching from another supplier month of application - don't use records where this is missing Any mercs? Any Judgements? Any Mercs this year? Any Judgements this year?

Variable screening
Very often when you have large numbers of variables, some of them are irrelevant to the task in hand. In this case we will see if we can ignore some of the variables, by applying a screening algorithm. This will tell us which variables have a p-value less than 0.01. In this case, 24 out of 28 variables are significant. The most significant, according to the screening, are whether they own their own house, the number of mercantile disputes, the age of the applicant, and the occupation status. Bankruptcies do not have any importance. I have chosen to put all the significant variables into a CHAID model, to see how well the model works.

no yes

Tree graph for bad Num. of non-terminal nodes: 5, Num. of terminal nodes: 7
ID=1 no N=15240

hous = own ID=2 no N=9021 ID=3 = rent no N=6219

Any mercs? = No ID=4 no N=8426 ID=5 no = yes N=595 ID=6 no = No N=5592

Any mercs? = yes ID=7 no N=627

class Unknown , Good ID=8 no N=5898 , ... = Dodgy ID=9 no N=2528 = Economy ID=10 no N=987

rate = Full featured, Adjustable = Pay as you go ID=11 N=1674 no ID=12 N=2930 no

The tree down to 3 levels is shown above. The height of each bar shows the proportion of yes and no answers in the Bad variable. No node has a preponderance of Yes answers in a cell. no yes This means that when you prepare the confusion matrix, no 13214.00 2024.000 or classification matrix, as shown here, no cases are yes classified Yes. This may imply that the model is of no use, but really it means that it is of no use if the task is to classify a new respondent as one likely to default. There are other uses for this sort of tree, of course. The first split on the tree tells us that the best predictor of being a bad customer is whether the person owns or rents. Bad customers are significantly more common among renters. Within the renters, those with a Merc are more likely to default. This sounds reasonable. We need to get more quantitative about the good and bad nodes. The tree is great for getting an overall impression, but STATISTICA allows you to save the data in a form that enables you to process it further. I have created a table of the numbers in each class in the terminal nodes, and then calculated the proportions of bad clients in each node. Node 7 11 5 10 12 9 8 Class No 393 1222 456 782 2477 2292 5592 Class Yes 234 452 139 205 453 236 306 Proportion =v2/(v1+v2) 37% 27% 23% 21% 15% 9% 5%

We can see that 37% of the people in node 7 are bad clients. These are the people who rent their house and have Mercs against them. Node 11 has slightly better performance, but there are more of them. These are people on two specific payment plans. Node 5 shows that people who own their own home but have Mercs are bad bets, too. Since the Mercs show up so often, maybe we could explore whether the split on this is highly significant.

STATISTICA allows you to see the p-value for the split at any node, and in this case several variables are about equally probable. (Very low p-values are characteristic of large data sets, and are not always an advantage.) In the above graph I have forced the tree to be split on Mercs, and you can see that those with Mercs are poorer bets. Within those, then house ownership is still important.

Conclusion
Although the real world dataset did not contain relationships that enabled us to define groups that were most probably bad customers, the factors that pointed to this quality of a customer could be revealed by looking at the details within the table. Most of the measured data was relevant, except the previous occurrence of bankruptcy, and the investigation could be worth the effort of manually overriding the automatic selection of splits.

Data Analytics for Beginners: Introduction to Data Analytics
From Everand
Data Analytics for Beginners: Introduction to Data Analytics
Anthony S. Williams
4/5 (18)
Machine Learning Interview Questions
From Everand
Machine Learning Interview Questions
Tech Interviews
4.5/5 (2)
Painless Statistics
From Everand
Painless Statistics
Barron's Educational Series
No ratings yet
The Average is Always Wrong: A real-world guide to putting data at the heart of your business
From Everand
The Average is Always Wrong: A real-world guide to putting data at the heart of your business
Ian Shepherd
No ratings yet
Analytics and Big Data for Accountants
From Everand
Analytics and Big Data for Accountants
Jim Lindell
No ratings yet
Classification and Regression Trees Wadsworth Statistics Probability
No ratings yet
Classification and Regression Trees Wadsworth Statistics Probability
420 pages
Basic From Wiki
No ratings yet
Basic From Wiki
15 pages
Data Analytics - Unit-IV
No ratings yet
Data Analytics - Unit-IV
21 pages
Microprediction: Building an Open AI Network
From Everand
Microprediction: Building an Open AI Network
Peter Cotton
No ratings yet
20mia1128 Bigdata Lab 1
No ratings yet
20mia1128 Bigdata Lab 1
4 pages
Classification Trees - CART and CHAID
No ratings yet
Classification Trees - CART and CHAID
50 pages
1 - Chaid and Cart
No ratings yet
1 - Chaid and Cart
62 pages
Data Analytics
From Everand
Data Analytics
Jeffery Short
1/5 (1)
CHAP 6-CHAID (Chi-Square Automatic Interaction Detector)
No ratings yet
CHAP 6-CHAID (Chi-Square Automatic Interaction Detector)
14 pages
Big Data Tips 1-2-3
From Everand
Big Data Tips 1-2-3
Richard M Batenburg, Jr
No ratings yet
Data Mining: Fundamentals and Applications
From Everand
Data Mining: Fundamentals and Applications
Fouad Sabry
No ratings yet
Data Science Career Guide Interview Preparation
From Everand
Data Science Career Guide Interview Preparation
Gradient Publication
No ratings yet
CART - Machine Learning
No ratings yet
CART - Machine Learning
29 pages
Data Mining
No ratings yet
Data Mining
18 pages
To Ask or Not To Ask, That Is The Question: Hsin-Vonn Seow, Lyn C. Thomas
No ratings yet
To Ask or Not To Ask, That Is The Question: Hsin-Vonn Seow, Lyn C. Thomas
8 pages
Digital (R)evolution: Strategies to Accelerate Business Transformation
From Everand
Digital (R)evolution: Strategies to Accelerate Business Transformation
Yuri B. Aguiar
No ratings yet
Decision Trees: at Some Point of Time You Have To Take A Decision Sitting On A Tree
100% (1)
Decision Trees: at Some Point of Time You Have To Take A Decision Sitting On A Tree
19 pages
Data Mining 101: Core Concepts and Algorithms
From Everand
Data Mining 101: Core Concepts and Algorithms
Swarnalata Verma
No ratings yet
Lecture Notes - CHAID
No ratings yet
Lecture Notes - CHAID
17 pages
Manisha 3001 Week 12
No ratings yet
Manisha 3001 Week 12
22 pages
Don't Fear the Cost Study
From Everand
Don't Fear the Cost Study
Clark Kaml
No ratings yet
Introductory Overview - Basic Ideas
No ratings yet
Introductory Overview - Basic Ideas
14 pages
Data and Analytics in Action: Project Ideas and Basic Code Skeleton in Python
From Everand
Data and Analytics in Action: Project Ideas and Basic Code Skeleton in Python
Zemelak Goraga
No ratings yet
What Does This Company Do?: Understanding a Business and its Risks
From Everand
What Does This Company Do?: Understanding a Business and its Risks
Drago Dimitrov
No ratings yet
Decision Tree (Chaid)
No ratings yet
Decision Tree (Chaid)
3 pages
A Short Guide to Marketing Model Alignment & Design: Advanced Topics in Goal Alignment - Model Formulation
From Everand
A Short Guide to Marketing Model Alignment & Design: Advanced Topics in Goal Alignment - Model Formulation
David Young
No ratings yet
Reference Papers
No ratings yet
Reference Papers
7 pages
Assignment 3 Based On Unit 3
No ratings yet
Assignment 3 Based On Unit 3
6 pages
DA-4th unit
No ratings yet
DA-4th unit
22 pages
DWM Mini Project
No ratings yet
DWM Mini Project
14 pages
Spreadsheets To Cubes (Advanced Data Analytics for Small Medium Business): Data Science
From Everand
Spreadsheets To Cubes (Advanced Data Analytics for Small Medium Business): Data Science
alasdair gilchrist
No ratings yet
The Little Book of Managing Uncertainty
From Everand
The Little Book of Managing Uncertainty
Harry Katzan Jr.
No ratings yet
Illuminating Data: A hands on guide to data visualization in R
From Everand
Illuminating Data: A hands on guide to data visualization in R
Eman Ahmad
No ratings yet
Fraud Analytics: Strategies and Methods for Detection and Prevention
From Everand
Fraud Analytics: Strategies and Methods for Detection and Prevention
Delena D. Spann
5/5 (2)
Decision Trees For Predictive Modeling (Neville)
100% (1)
Decision Trees For Predictive Modeling (Neville)
24 pages
Exploring Data Using Graphics and Visualization Due Feb 21
No ratings yet
Exploring Data Using Graphics and Visualization Due Feb 21
10 pages
Do-It-Yourself Technical Analysis Simplified by Trained Accountant
From Everand
Do-It-Yourself Technical Analysis Simplified by Trained Accountant
Anthony Brticevic
No ratings yet
What Is Data Analytics? A Complete Guide For Beginners
From Everand
What Is Data Analytics? A Complete Guide For Beginners
Piyush Kumar Jain
No ratings yet
PYTHON DATA SCIENCE: A Practical Guide to Mastering Python for Data Science and Artificial Intelligence (2023 Beginner Crash Course)
From Everand
PYTHON DATA SCIENCE: A Practical Guide to Mastering Python for Data Science and Artificial Intelligence (2023 Beginner Crash Course)
Calvert Long
No ratings yet
Understanding Big Data: A Beginners Guide to Data Science & the Business Applications
From Everand
Understanding Big Data: A Beginners Guide to Data Science & the Business Applications
Eileen McNulty-Holmes
4/5 (5)
How to be Clear and Compelling with Data: Principles, Practice and Getting Beyond the Basics
From Everand
How to be Clear and Compelling with Data: Principles, Practice and Getting Beyond the Basics
John J Burrett
No ratings yet
AnswerTree 3.0
No ratings yet
AnswerTree 3.0
6 pages
Becoming a Data Analyst: Skills, Tools, and Real-World Strategies
From Everand
Becoming a Data Analyst: Skills, Tools, and Real-World Strategies
Othman Khalifa
No ratings yet
Introduction to Business Analytics
From Everand
Introduction to Business Analytics
Dwaipayan Sethi
No ratings yet
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
From Everand
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
Janet Laane Effron
No ratings yet
Ambo University Inistitute of Technology Department of Computer Science
No ratings yet
Ambo University Inistitute of Technology Department of Computer Science
13 pages
What Is Big Data
From Everand
What Is Big Data
Jay Kassing
No ratings yet
Decision Tree Is An Upside
No ratings yet
Decision Tree Is An Upside
17 pages
Statistical Arbitrage: Algorithmic Trading Insights and Techniques
From Everand
Statistical Arbitrage: Algorithmic Trading Insights and Techniques
Andrew Pole
3/5 (4)
Lecture 2 EDA 1
No ratings yet
Lecture 2 EDA 1
26 pages
Discrete Clusters Formulation Through The Exploitation of Optimized K-Modes Algorithm For Hypotheses Validation in Social Work Research: The Case of Greek Social Workers Working With Refugees
No ratings yet
Discrete Clusters Formulation Through The Exploitation of Optimized K-Modes Algorithm For Hypotheses Validation in Social Work Research: The Case of Greek Social Workers Working With Refugees
8 pages
Data Mining Mid Project Report-Sagor
No ratings yet
Data Mining Mid Project Report-Sagor
11 pages
Business 2025: AI, Automation, and Dirty Secrets No One Taught You (A War Guide for SMEs)
From Everand
Business 2025: AI, Automation, and Dirty Secrets No One Taught You (A War Guide for SMEs)
Cristian Vanegas
No ratings yet
Reading 5 How Informative Your Segmentation
No ratings yet
Reading 5 How Informative Your Segmentation
6 pages
The Art of Statistical Thinking: Detect Misinformation, Understand the World Deeper, and Make Better Decisions.
From Everand
The Art of Statistical Thinking: Detect Misinformation, Understand the World Deeper, and Make Better Decisions.
Albert Rutherford
1/5 (1)

Using CHAID For Classification Problems: Ray@hrs - Co.nz

Uploaded by

Using CHAID For Classification Problems: Ray@hrs - Co.nz

Uploaded by

Using CHAID for classification problems

Example 1. Correctly classifying all cases in a dataset

<= 61.500000 ID=2 BARO N=8 ID=3

<= 67.000000 N=18 TROP ID=4

> 67.000000 N=11 BARO

Example 2: Locating interesting subsets in a dataset Data used

hous = own ID=2 no N=9021 ID=3 = rent no N=6219

Any mercs? = No ID=4 no N=8426 ID=5 no = yes N=595 ID=6 no = No N=5592

Any mercs? = yes ID=7 no N=627

You might also like