DecisionTreeAssignment

The document outlines an assignment to build decision tree classifiers for five datasets: Loan Default Prediction, Fruit Classification, Employee Promotion Prediction, Restaurant Success Prediction, and Customer Churn Prediction. Each dataset requires data splitting, entropy and Gini impurity calculations, tree growth through attribute selection, final tree presentation, and testing accuracy evaluation. The goal is to develop models that provide insights into the respective prediction tasks based on various attributes.

Uploaded by

mb24033

Assignment Overview

This assignment requires you to build a decision tree classifier for several problem contexts. Strictly
follow the instructions below to attempt the questions for each dataset.

1. Data Split: In each of the data sets, choose 10 records for training and 5 for testing. Create
separate tables for your chosen training and testing data for each dataset.
2. Root Node: Using your training data for each dataset, calculate the initial entropy and Gini
impurity of the target variable, showing all calculation steps clearly.

3. Tree Growth:
• For each attribute in your training data, calculate the Information Gain (using entropy) and
Gini Gain (using Gini impurity). Show all calculation steps clearly.
• Select the attribute with the highest gain (for both entropy and Gini) to split the current
node. Clearly state the gain values for all attributes and the attribute chosen for the split.
Explain your decision.
• Describe the resulting branches created by the split, based on the values of the chosen
attribute.
• Repeat this splitting process recursively for each new branch (node) until all leaf nodes are pure
(containing only one class) or there are no more attributes to split on. Show all calculations
and decisions at each step of the tree growth.
4. Final Tree: Present the final decision tree constructed for each dataset. This can be done textually
by describing the nodes and branches.
5. Testing: Use your held-out testing data for each dataset to predict the target variable by traversing
your constructed decision tree. Report the number of correctly and incorrectly classified instances
and calculate the testing accuracy for each tree.
6. Presentation: Present all the details of your work for each dataset in a PowerPoint presentation.
Use proper animations and slide transitions to clearly illustrate the steps of data splitting, root
node calculations, tree growth (including calculations and decisions at each split), the final tree
structure, and the testing results.
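
The root-node and tree-growth calculations above can be checked with a few lines of Python. This is a minimal sketch, not a prescribed solution: the function names and the four-record `toy` node are illustrative assumptions, and the definitions used are the standard ones (base-2 Shannon entropy, Gini impurity as one minus the sum of squared class proportions, and gain as parent impurity minus the size-weighted impurity of the children).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(rows, attribute, target, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    parent = impurity([r[target] for r in rows])
    children = {}
    for r in rows:
        children.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(ls) / len(rows) * impurity(ls) for ls in children.values())
    return parent - weighted


# Hypothetical four-record node, used only to exercise the functions.
toy = [
    {"A": "x", "y": "Yes"},
    {"A": "x", "y": "Yes"},
    {"A": "z", "y": "No"},
    {"A": "z", "y": "No"},
]
info_gain = gain(toy, "A", "y", entropy)  # Information Gain
gini_gain = gain(toy, "A", "y", gini)     # Gini Gain
```

Passing the impurity function as an argument lets one `gain` routine produce both the Information Gain and the Gini Gain, which mirrors the requirement to report both at every split.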

1 Data Set 1: Loan Default Prediction


The ability to accurately predict loan default is of paramount importance for financial institutions.
Prudent lending practices not only safeguard the financial health of the institution but also contribute
to economic stability. Approving loans to individuals who are likely to default can lead to significant
financial losses, strain resources, and potentially hinder the institution’s capacity to support other
deserving applicants. Conversely, overly cautious lending policies might exclude creditworthy individuals
and impede economic growth.
To address this critical challenge, we turn to the analysis of historical data. This section introduces a
dataset compiled from past loan applicants and their repayment outcomes at a lending institution. The
dataset aims to capture key characteristics of borrowers that might be indicative of their likelihood to
default on their loan obligations. By examining these past instances, we hope to uncover patterns and
relationships that can inform future lending decisions.
The dataset comprises information on twenty individual loan applicants. For each applicant, we
have recorded several pertinent attributes at the time of their loan application, along with the eventual
outcome of their loan repayment. The attributes considered are those commonly believed to influence a
borrower’s financial stability and their capacity to honor their debt commitments. The target variable
in this analysis is whether the borrower ultimately 'Defaulted' on their loan. The complete dataset is
presented in Table 1.

Table 1: Borrower Data
Home Owner | Marital Status | Annual Income | Defaulted Borrower
Yes | Single | 80K | No
No | Married | 70K | Yes
No | Single | 90K | No
Yes | Married | 60K | Yes
No | Divorced | 75K | Yes
Yes | Single | 70K | Yes
No | Married | 100K | No
Yes | Divorced | 85K | No
No | Single | 65K | Yes
Yes | Married | 95K | No
No | Single | 80K | No
Yes | Married | 70K | Yes
Yes | Divorced | 90K | No
No | Single | 60K | Yes
Yes | Single | 75K | No
No | Married | 85K | Yes
No | Divorced | 100K | No
Yes | Single | 65K | Yes
Yes | Married | 80K | No
No | Single | 70K | Yes

1.1 Field Descriptions


To effectively analyze this dataset, it is crucial to understand the meaning and potential significance of
each attribute:

• Home Owner: This is a binary categorical variable indicating whether the borrower owned a
home at the time of the loan application. The possible values are 'Yes' or 'No'. Homeownership is
often considered an indicator of financial stability and responsibility, potentially reducing the risk
of default.
• Marital Status: This is a categorical variable representing the borrower’s marital status. The
recorded values are 'Single', 'Married', or 'Divorced'. Marital status can sometimes be associated
with different patterns of financial behavior and responsibilities. For instance, married individuals
might have shared financial obligations or a more stable financial structure compared to single or
divorced individuals.
• Annual Income: This is a categorical variable representing the borrower’s approximate annual
income at the time of the loan application. The income is recorded in categories denoted by
thousands of Rupees (though represented as 'K' for simplicity). The specific income levels present
in the data are '80K', '70K', '90K', '60K', '75K', '100K', '85K', '65K', and '95K'. A borrower’s
income is a fundamental factor in assessing their ability to repay a loan; higher income generally
suggests a lower risk of default.
• Defaulted Borrower: This is the target variable for our prediction task. It is a binary categorical
variable indicating whether the borrower ultimately defaulted on their loan repayment. The
possible values are 'Yes' or 'No'. Our primary objective is to build a decision tree model that can learn
the relationships between the 'Home Owner', 'Marital Status', and 'Annual Income' attributes and
the likelihood of a borrower falling into the 'Yes' category for 'Defaulted Borrower'.

By analyzing this historical data, we aim to develop a decision tree model that can provide valuable
insights into the factors influencing loan default and ultimately aid in making more informed lending
decisions in the future.
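
As a sanity check on the arithmetic, the sketch below computes the root entropy and Gini impurity of 'Defaulted Borrower', plus the Information Gain of 'Home Owner'. It uses all twenty rows of Table 1 only for illustration; the assignment asks for these values on your own 10-record training set, so your numbers will generally differ. Variable and function names are assumptions. On the full table the classes are balanced (10 'Yes', 10 'No'), so the root entropy is 1 and the root Gini impurity is 0.5.

```python
from collections import Counter
from math import log2

# Table 1 rows: (Home Owner, Marital Status, Annual Income, Defaulted Borrower)
rows = [
    ("Yes", "Single", "80K", "No"), ("No", "Married", "70K", "Yes"),
    ("No", "Single", "90K", "No"), ("Yes", "Married", "60K", "Yes"),
    ("No", "Divorced", "75K", "Yes"), ("Yes", "Single", "70K", "Yes"),
    ("No", "Married", "100K", "No"), ("Yes", "Divorced", "85K", "No"),
    ("No", "Single", "65K", "Yes"), ("Yes", "Married", "95K", "No"),
    ("No", "Single", "80K", "No"), ("Yes", "Married", "70K", "Yes"),
    ("Yes", "Divorced", "90K", "No"), ("No", "Single", "60K", "Yes"),
    ("Yes", "Single", "75K", "No"), ("No", "Married", "85K", "Yes"),
    ("No", "Divorced", "100K", "No"), ("Yes", "Single", "65K", "Yes"),
    ("Yes", "Married", "80K", "No"), ("No", "Single", "70K", "Yes"),
]
TARGET = 3  # column index of Defaulted Borrower


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(data, col):
    """Root entropy minus the weighted entropy after splitting on column `col`."""
    groups = {}
    for r in data:
        groups.setdefault(r[col], []).append(r[TARGET])
    weighted = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy([r[TARGET] for r in data]) - weighted


labels = [r[TARGET] for r in rows]
root_entropy = entropy(labels)
root_gini = gini(labels)
home_owner_gain = information_gain(rows, 0)  # split on Home Owner
print(root_entropy, root_gini, round(home_owner_gain, 4))
```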

2 Data Set 2: Fruit Classification
The classification of produce plays a vital role in various stages, from cultivation and harvesting to market
distribution and consumer understanding. Accurate categorization based on observable characteristics is
essential for quality control, inventory management, and informed consumer choices. This section delves
into the problem of fruit classification, specifically aiming to distinguish between citrus and non-citrus
fruits based on a set of readily identifiable attributes.
To automate and streamline the sorting process, a system capable of classifying fruits based on
their visual and tactile properties would be highly beneficial. This dataset represents a collection of
observations made on various types of fruits, recording their color, size, sweetness level, and texture.
The ultimate goal is to build a decision tree model that can learn the underlying rules governing the
classification of these fruits into the categories of 'Citrus' and 'Not Citrus'.
The dataset comprises fifteen instances, each representing a different fruit. For each fruit, we have a
record of its key characteristics, which are believed to be relevant in determining its classification. The
target variable in this case is the 'Class', indicating whether the fruit belongs to the 'Citrus' category or
the 'Not Citrus' category. The data collected is presented in Table 2.

Table 2: Fruit Data
Color | Size | Sweetness | Texture | Class (Citrus/Not Citrus)
Orange | Medium | High | Smooth | Not Citrus
Yellow | Small | Low | Firm | Citrus
Green | Large | Medium | Smooth | Not Citrus
Green | Medium | Medium | Crisp | Not Citrus
Green | Small | High | Crisp | Citrus
Yellow | Medium | High | Soft | Not Citrus
Yellow | Large | Medium | Smooth | Citrus
Red | Large | High | Soft | Not Citrus
Blue | Small | Medium | Firm | Not Citrus
Orange | Small | High | Smooth | Citrus
Red | Small | High | Soft | Not Citrus
Yellow | Medium | High | Soft | Not Citrus
Green | Small | Medium | Smooth | Not Citrus
Orange | Small | High | Smooth | Citrus
Blue | Small | Medium | Firm | Not Citrus

2.1 Field Descriptions


To understand the basis for classifying these fruits, let’s examine each of the recorded attributes:

• Color: This is a categorical variable describing the predominant color of the fruit. The observed
colors in this dataset include 'Orange', 'Yellow', 'Green', 'Red', and 'Blue'. Color is often a primary
visual cue used in fruit identification.
• Size: This is a categorical variable indicating the general size of the fruit. The recorded sizes are
'Medium', 'Small', and 'Large'. Size can be another distinguishing physical characteristic.
• Sweetness: This is a categorical variable representing the perceived sweetness level of the fruit.
The levels recorded are 'High', 'Low', and 'Medium'. Sweetness is a key sensory attribute that
often correlates with fruit type.
• Texture: This is a categorical variable describing the texture of the fruit’s skin or flesh. The
textures recorded are 'Smooth', 'Firm', 'Crisp', and 'Soft'. Texture can be an important tactile
characteristic for identification.
• Class (Citrus/Not Citrus): This is the target variable, a binary categorical variable indicating
the classification of the fruit. The possible values are 'Citrus' and 'Not Citrus'. Our objective is
to build a decision tree that can predict this class based on the fruit’s color, size, sweetness, and
texture.

By constructing a decision tree using this data, we aim to create a simple yet interpretable model that
can effectively classify fruits as either citrus or non-citrus based on their readily observable characteristics.
This model could potentially be used as a foundational step in developing automated fruit sorting or
identification systems for agricultural applications.
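
Because every fruit attribute is categorical, the recursive splitting procedure from the Assignment Overview maps directly onto an ID3-style routine. The sketch below is one possible rendering under stated assumptions: it uses entropy only, breaks ties by attribute order, takes the first ten rows of Table 2 as an arbitrary stand-in training set, and stores the tree as nested dictionaries. The names `build_tree`, `predict`, and the `default` fallback for unseen attribute values are illustrative choices, not part of the assignment.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Color", "Size", "Sweetness", "Texture"]
TARGET = "Class"

# First ten rows of Table 2, used here as an arbitrary stand-in training set.
RAW = [
    ("Orange", "Medium", "High", "Smooth", "Not Citrus"),
    ("Yellow", "Small", "Low", "Firm", "Citrus"),
    ("Green", "Large", "Medium", "Smooth", "Not Citrus"),
    ("Green", "Medium", "Medium", "Crisp", "Not Citrus"),
    ("Green", "Small", "High", "Crisp", "Citrus"),
    ("Yellow", "Medium", "High", "Soft", "Not Citrus"),
    ("Yellow", "Large", "Medium", "Smooth", "Citrus"),
    ("Red", "Large", "High", "Soft", "Not Citrus"),
    ("Blue", "Small", "Medium", "Firm", "Not Citrus"),
    ("Orange", "Small", "High", "Smooth", "Citrus"),
]
train = [dict(zip(ATTRIBUTES + [TARGET], row)) for row in RAW]


def entropy(rows):
    n = len(rows)
    counts = Counter(r[TARGET] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def partition(rows, attr):
    """Group rows by their value of `attr` (one group per branch)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups


def info_gain(rows, attr):
    groups = partition(rows, attr)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def build_tree(rows, attrs):
    labels = Counter(r[TARGET] for r in rows)
    majority = labels.most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain.
    if len(labels) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {
        "attribute": best,
        "default": majority,  # fallback for values unseen during training
        "branches": {v: build_tree(g, rest) for v, g in partition(rows, best).items()},
    }


def predict(tree, row):
    """Walk from the root to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree


tree = build_tree(train, ATTRIBUTES)
print(tree)
```

Swapping `entropy` for a Gini impurity function inside `info_gain` gives the Gini-based tree that the assignment also asks for.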

3 Data Set 3: Employee Promotion Prediction


Within the organizational landscape of a growing technology firm, the process of employee promotion
is a critical aspect of talent management and workforce development. Promotions not only recognize
employee contributions and potential but also play a significant role in boosting morale and retaining
valuable talent. However, ensuring a fair and effective promotion process requires careful consideration
of various employee attributes and performance indicators.
Imagine the HR department of this company seeking to understand the factors that have historically
influenced promotion decisions. By analyzing data from past promotion cycles, they aim to identify key
characteristics that are strongly associated with an employee’s likelihood of being promoted. This insight
can then be used to refine the promotion process, ensure greater transparency, and potentially identify
high-potential employees who may be overlooked.
This section introduces a dataset compiled from the company’s employee records, focusing on a
selection of employees and whether they were promoted in the last review cycle. The dataset includes
several attributes related to the employees’ professional profiles, such as their department, salary,
education level, and performance rating. The target variable is 'Promoted', indicating the outcome of the
promotion decision. The data collected is presented in Table 3.

Table 3: Employee Promotion Data
Department | Salary (K) | Education Level | Performance Rating | Promoted
Sales | 60 | Bachelor | 4 | No
Marketing | 50 | Master | 3 | No
Sales | 75 | Bachelor | 3 | Yes
HR | 45 | Associate | 4 | No
Engineering | 90 | Master | 4 | Yes
Marketing | 40 | Bachelor | 3 | No
Sales | 70 | Master | 3 | No
HR | 55 | Bachelor | 4 | Yes
Engineering | 80 | Bachelor | 3 | No
Marketing | 42 | Associate | 4 | Yes
Sales | 85 | Master | 4 | Yes
HR | 65 | Master | 3 | No
Engineering | 78 | Bachelor | 4 | Yes
Sales | 72 | Associate | 3 | No
Marketing | 52 | Bachelor | 4 | Yes

3.1 Field Descriptions


To understand the factors considered in past promotion decisions, let’s examine each of the attributes
recorded in the dataset:

• Department: This is a categorical variable indicating the department to which the employee
belongs. The departments represented in this dataset are 'Sales', 'Marketing', 'HR' (Human
Resources), and 'Engineering'. Different departments might have varying promotion criteria and
opportunities.
• Salary (K): This is a numerical variable representing the employee’s current annual salary in
thousands of Rupees. Salary level can often be correlated with experience, seniority, and performance,
potentially influencing promotion decisions.
• Education Level: This is a categorical variable indicating the employee’s highest level of
educational attainment. The levels recorded are 'Bachelor', 'Master', and 'Associate'. Education level
can be a factor considered for certain roles and career advancements.
• Performance Rating: This is an ordinal categorical variable representing the employee’s
performance rating as assessed in their most recent performance review. The ratings in this dataset are
on a scale, with values of 3 and 4 observed. Higher performance ratings are generally expected to
increase the likelihood of promotion.
• Promoted: This is the target variable, a binary categorical variable indicating whether the
employee was promoted in the review cycle corresponding to the recorded attributes. The possible
values are 'Yes' or 'No'. Our goal is to build a decision tree model that can predict this outcome
based on the employee’s department, salary, education level, and performance rating.

By analyzing this employee data, we aim to construct a decision tree model that can shed light on
the relationships between these employee attributes and the likelihood of promotion within the company.
This model could provide valuable insights for the HR department in understanding the implicit rules
governing past promotion decisions.
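
The assignment does not say how a numerical attribute such as 'Salary (K)' should be split. One common option, shown here purely as an assumption, is a binary split of the form salary <= t, with candidate thresholds t taken at the midpoints between consecutive distinct salaries and the best one chosen by Information Gain. The sketch uses all fifteen rows of Table 3 for illustration rather than a 10-record training set, and the names are hypothetical.

```python
from collections import Counter
from math import log2

# (Salary in K, Promoted) pairs copied from Table 3.
data = [
    (60, "No"), (50, "No"), (75, "Yes"), (45, "No"), (90, "Yes"),
    (40, "No"), (70, "No"), (55, "Yes"), (80, "No"), (42, "Yes"),
    (85, "Yes"), (65, "No"), (78, "Yes"), (72, "No"), (52, "Yes"),
]


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def threshold_gain(pairs, t):
    """Information Gain of the binary split salary <= t versus salary > t."""
    left = [y for x, y in pairs if x <= t]
    right = [y for x, y in pairs if x > t]
    n = len(pairs)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy([y for _, y in pairs]) - weighted


# Candidate thresholds: midpoints between consecutive distinct salaries,
# so both sides of every candidate split are non-empty.
values = sorted({x for x, _ in data})
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

best_threshold = max(candidates, key=lambda t: threshold_gain(data, t))
print(best_threshold, round(threshold_gain(data, best_threshold), 4))
```

An alternative is to bin salaries into a few labelled ranges and treat the result as a categorical attribute; whichever approach you use, state it explicitly in your calculations.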

4 Data Set 4: Restaurant Success Prediction


The vibrant culinary scene of a diverse region is characterized by a wide range of restaurants, each striving
to attract patrons and achieve sustainable success. In this competitive environment, understanding
the factors that contribute to a restaurant’s prosperity is crucial for both aspiring entrepreneurs and
established owners looking to optimize their operations. Identifying the key elements that differentiate
highly successful eateries from those that struggle can provide valuable insights for strategic decision-
making, ranging from menu design and pricing to location selection and marketing efforts.
Consider a group of local investors contemplating opening a new restaurant. To make informed
decisions and maximize their chances of success, they have gathered data on existing restaurants in
the area and surrounding regions. This dataset aims to capture several key characteristics of these
restaurants, such as the type of cuisine they offer, their average cost per meal, the quality of their
location, and their investment in marketing. The ultimate goal is to build a predictive model that can
help understand which combinations of these factors are most strongly associated with a restaurant’s
level of success.
This section introduces a dataset comprising information on fifteen different restaurants. For each
restaurant, we have recorded its cuisine type, average cost per meal, a rating of its location quality, the
amount spent on marketing, and a classification of its overall success level. The target variable in this
analysis is 'Success', categorized as either 'High' or 'Low'. The collected data is presented in Table 4.

Table 4: Restaurant Success Data
Cuisine Type | Average Cost | Location Quality (1-5) | Marketing Spend (K) | Success (High/Low)
Mexican | 20 | 2 | 5 | High
Italian | 45 | 5 | 25 | Low
Asian | 25 | 3 | 10 | High
American | 35 | 4 | 20 | Low
Italian | 38 | 4 | 18 | High
Asian | 22 | 3 | 8 | High
Mexican | 28 | 2 | 7 | Low
Italian | 32 | 3 | 12 | Low
Asian | 20 | 5 | 6 | Low
Indian | 33 | 5 | 22 | High
Italian | 35 | 4 | 15 | High
Mexican | 25 | 3 | 10 | High
Asian | 40 | 2 | 5 | Low
American | 30 | 5 | 22 | High
Indian | 28 | 4 | 18 | Low
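
The data-split step of the assignment (10 training records, 5 testing records) can be made reproducible with a seeded random draw. The sketch below assumes the fifteen rows of Table 4 are numbered 1 to 15 from top to bottom; the seed value and variable names are arbitrary choices, and any other documented way of choosing the rows is equally acceptable.

```python
import random

# Rows of Table 4, numbered top to bottom (an assumed numbering).
record_ids = list(range(1, 16))

rng = random.Random(42)  # arbitrary fixed seed so the split can be reproduced
train_ids = sorted(rng.sample(record_ids, 10))
test_ids = [i for i in record_ids if i not in train_ids]

print("Training rows:", train_ids)
print("Testing rows:", test_ids)
```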

4.1 Field Descriptions
To understand the factors that might influence a restaurant’s success in the region, let’s examine each
of the recorded attributes:

• Cuisine Type: This is a categorical variable indicating the primary type of cuisine offered by
the restaurant. The cuisine types represented in this dataset include 'Mexican', 'Italian', 'Asian',
'American', and 'Indian'. Different cuisines might appeal to varying segments of the local
population and have different operational costs.
• Average Cost: This is a numerical variable representing the estimated average cost per meal
for a customer at the restaurant. Pricing strategy is a critical factor in attracting customers and
ensuring profitability.
• Location Quality (1-5): This is an ordinal categorical variable representing a subjective rating
of the restaurant’s location quality on a scale from 1 to 5, where 1 indicates a poor location
and 5 indicates an excellent location. Location is a fundamental aspect of a restaurant’s success,
influencing foot traffic and accessibility.
• Marketing Spend (K): This is a numerical variable representing the approximate amount of
money spent by the restaurant on marketing and advertising efforts, expressed in thousands of
Rupees. Effective marketing is essential for raising awareness and attracting customers.
• Success (High/Low): This is the target variable, a binary categorical variable indicating the
overall level of success achieved by the restaurant. The success is categorized as either 'High' or
'Low', based on factors such as profitability, customer reviews, and longevity. Our objective is to
build a decision tree model that can predict this success level based on the restaurant’s cuisine type,
average cost, location quality, and marketing spend within the context of the restaurant market.

By analyzing this data, we aim to construct a decision tree model that can provide insights into the
key factors driving restaurant success (or lack thereof). This model could be a valuable tool for potential
restaurant owners and existing establishments looking to understand the dynamics of the local culinary
market.

5 Data Set 5: Customer Churn Prediction


In the rapidly evolving telecommunications market, customer retention has become a critical focus for
service providers. The cost of acquiring new customers often outweighs the cost of retaining existing ones,
making the prediction and prevention of customer churn (the rate at which customers stop using a service)
a significant business imperative. Understanding the factors that contribute to customer attrition can
enable telecommunication companies to proactively implement strategies aimed at improving customer
satisfaction and loyalty.
Consider a local telecommunications provider experiencing a noticeable rate of customer churn. To
mitigate this issue, they have begun collecting data on their subscriber base, including information about
their account tenure, the type of service plan they have subscribed to, their monthly data usage, the
number of support tickets they have raised, and the primary device they use to access the services. By
analyzing this historical data, the company aims to identify patterns and correlations that can help
predict which customers are at a higher risk of churning.
This section introduces a dataset compiled from the records of fifteen telecommunication service
subscribers. For each subscriber, we have information on their account age, plan type, monthly data
usage, the number of support tickets they have submitted, the primary device they use, and whether
they have ultimately churned. The target variable in this analysis is 'Churn', indicating whether the
customer discontinued their service. The collected data is presented in Table 5.

Table 5: Customer Churn Data
AccountAge | PlanType | MonthlyUsageGB | SupportTickets | DeviceType | Churn
Young | Basic | High | None | Mobile | No
Mid | Premium | Low | One | Tablet | Yes
Mid | Basic | Low | None | Mobile | No
Mid | Economy | High | Two | Desktop | Yes
Old | Basic | Low | None | Desktop | No
Old | Premium | Medium | One | Mobile | Yes
Young | Basic | Medium | One | Tablet | Yes
Mid | Premium | Low | Two | Mobile | No
Old | Premium | High | One | Desktop | Yes
Young | Premium | Medium | Two | Tablet | Yes
Mid | Basic | High | None | Mobile | No
Old | Premium | Low | One | Desktop | Yes
Young | Basic | High | Two | Tablet | Yes
Mid | Premium | Low | None | Mobile | No
Old | Basic | Medium | One | Desktop | Yes

5.1 Field Descriptions


To understand the characteristics of the telecommunication subscribers and their service usage, let’s
examine each of the attributes recorded in the dataset:

• AccountAge: This is an ordinal categorical variable indicating the age of the customer’s account
(values: 'Young', 'Mid', 'Old'). Account tenure might be related to customer loyalty, with newer
accounts potentially being at higher risk of churn.
• PlanType: This is a categorical variable representing the type of service plan the customer has
subscribed to (values: 'Basic', 'Premium', 'Economy'). Different plan features and pricing might
influence customer satisfaction and churn.
• MonthlyUsageGB: This is an ordinal categorical variable indicating the customer’s average
monthly data usage in gigabytes (values: 'High', 'Low', 'Medium'). Usage patterns might
correlate with plan suitability and customer needs.
• SupportTickets: This is an ordinal categorical variable indicating the number of support tickets
the customer has submitted (values: 'None', 'One', 'Two'). A higher number of support interactions
could indicate dissatisfaction or issues with the service, potentially leading to churn.
• DeviceType: This is a categorical variable indicating the primary type of device the customer
uses to access the services (values: 'Mobile', 'Tablet', 'Desktop'). The primary device might reflect
the customer’s usage habits and needs.
• Churn: This is the target variable, a binary categorical variable indicating whether the customer
has churned (discontinued their service) (values: 'Yes', 'No'). Our objective is to build a
decision tree model that can predict this outcome based on the customer’s account age, plan type,
monthly data usage, support ticket history, and primary device type within the context of the
telecommunications market.

By analyzing this customer data, we aim to construct a decision tree model that can identify the key
factors associated with customer churn for the telecommunications provider. This model could provide
valuable insights for developing targeted retention strategies and improving overall customer experience.
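
The testing step of the assignment (traverse the tree for each held-out record, count correct and incorrect predictions, report accuracy) can be sketched as follows. The one-level tree on 'SupportTickets' is a hypothetical, hand-written stand-in and not the tree your calculations will necessarily produce; the five test rows are rows 4, 8, 11, 12, and 13 of Table 5, picked arbitrarily for illustration.

```python
COLUMNS = ["AccountAge", "PlanType", "MonthlyUsageGB",
           "SupportTickets", "DeviceType", "Churn"]

# Hypothetical one-level tree, written by hand for illustration only.
tree = {
    "attribute": "SupportTickets",
    "branches": {"None": "No", "One": "Yes", "Two": "Yes"},
}

# Rows 4, 8, 11, 12, and 13 of Table 5, used as a stand-in test set.
test_rows = [dict(zip(COLUMNS, row)) for row in [
    ("Mid", "Economy", "High", "Two", "Desktop", "Yes"),
    ("Mid", "Premium", "Low", "Two", "Mobile", "No"),
    ("Mid", "Basic", "High", "None", "Mobile", "No"),
    ("Old", "Premium", "Low", "One", "Desktop", "Yes"),
    ("Young", "Basic", "High", "Two", "Tablet", "Yes"),
]]


def predict(node, row):
    """Follow the branch matching the row's attribute value until a leaf label."""
    while isinstance(node, dict):
        node = node["branches"][row[node["attribute"]]]
    return node


correct = sum(predict(tree, r) == r["Churn"] for r in test_rows)
incorrect = len(test_rows) - correct
accuracy = correct / len(test_rows)
print(f"correct={correct}, incorrect={incorrect}, accuracy={accuracy:.2f}")
```

With this stand-in tree and test set, four of the five records are classified correctly (the 'Two'-ticket customer who did not churn is misclassified), giving a testing accuracy of 0.80.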
