DecisionTreeAssignment
This assignment requires you to build a decision tree classifier for several problem contexts. Strictly
follow the instructions below to attempt the questions for each dataset.
1. Data Split: For each dataset, choose 10 records for training and 5 for testing. Create
separate tables for your chosen training and testing data for each dataset.
2. Root Node: Using your training data for each dataset, calculate the initial entropy and Gini
impurity of the target variable, showing all calculation steps clearly.
3. Tree Growth:
For each attribute in your training data, calculate the Information Gain (using entropy) and
Gini Gain (using Gini impurity). Show all calculation steps clearly.
Select the attribute with the highest gain (for both entropy and Gini) to split the current
node. Clearly state the gain values for all attributes and the attribute chosen for the split.
Explain your decision.
Describe the resulting branches created by the split, based on the values of the chosen
attribute.
Repeat this splitting process recursively for each new branch (node) until all leaf nodes are pure
(containing only one class) or there are no more attributes to split on. Show all calculations
and decisions at each step of the tree growth.
4. Final Tree: Present the final decision tree constructed for each dataset. This can be done textually
by describing the nodes and branches.
5. Testing: Use your held-out testing data for each dataset to predict the target variable by traversing
your constructed decision tree. Report the number of correctly and incorrectly classified instances
and calculate the testing accuracy for each tree.
6. Presentation: Present all the details of your work for each dataset in a PowerPoint presentation.
Use proper animations and slide transitions to clearly illustrate the steps of data splitting, root
node calculations, tree growth (including calculations and decisions at each split), the final tree
structure, and the testing results.
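The calculations in steps 2 and 3 can be cross-checked with a short script. The following is a minimal Python sketch, not a required part of the submission. The function names (`entropy`, `gini`, `gain`) and the four-row `toy` data are hypothetical and are not taken from any table in this assignment.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 - sum(p^2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(rows, attribute, target, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    With impurity=entropy this is Information Gain; with impurity=gini it is
    Gini Gain.
    """
    parent = impurity([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return parent - weighted


# Hypothetical toy data, used only to exercise the functions.
toy = [
    {"Windy": "Yes", "Play": "No"},
    {"Windy": "Yes", "Play": "No"},
    {"Windy": "No", "Play": "Yes"},
    {"Windy": "No", "Play": "Yes"},
]

print(entropy([r["Play"] for r in toy]))     # 1.0 (two classes, evenly split)
print(gini([r["Play"] for r in toy]))        # 0.5
print(gain(toy, "Windy", "Play", entropy))   # 1.0 (the split is perfect)
print(gain(toy, "Windy", "Play", gini))      # 0.5
```

A node is pure when both impurity measures are zero, which is the stopping condition described in step 3.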
Table 1: Borrower Data
Home Owner   Marital Status   Annual Income   Defaulted Borrower
Yes          Single           80K             No
No           Married          70K             Yes
No           Single           90K             No
Yes          Married          60K             Yes
No           Divorced         75K             Yes
Yes          Single           70K             Yes
No           Married          100K            No
Yes          Divorced         85K             No
No           Single           65K             Yes
Yes          Married          95K             No
No           Single           80K             No
Yes          Married          70K             Yes
Yes          Divorced         90K             No
No           Single           60K             Yes
Yes          Single           75K             No
No           Married          85K             Yes
No           Divorced         100K            No
Yes          Single           65K             Yes
Yes          Married          80K             No
No           Single           70K             Yes
Home Owner: This is a binary categorical variable indicating whether the borrower owned a
home at the time of the loan application. The possible values are ’Yes’ or ’No’. Homeownership is
often considered an indicator of financial stability and responsibility, potentially reducing the risk
of default.
Marital Status: This is a categorical variable representing the borrower’s marital status. The
recorded values are ’Single’, ’Married’, or ’Divorced’. Marital status can sometimes be associated
with different patterns of financial behavior and responsibilities. For instance, married individuals
might have shared financial obligations or a more stable financial structure compared to single or
divorced individuals.
Annual Income: This is treated as a categorical variable representing the borrower’s approximate
annual income at the time of the loan application. Each value is an income level in thousands of
Rupees, abbreviated as ’K’. The specific income levels present
in the data are ’80K’, ’70K’, ’90K’, ’60K’, ’75K’, ’100K’, ’85K’, ’65K’, and ’95K’. A borrower’s
income is a fundamental factor in assessing their ability to repay a loan; higher income generally
suggests a lower risk of default.
Defaulted Borrower: This is the target variable for our prediction task. It is a binary categorical
variable indicating whether the borrower ultimately defaulted on their loan repayment. The
possible values are ’Yes’ or ’No’. Our primary objective is to build a decision tree model that can learn
the relationships between the ’Home Owner’, ’Marital Status’, and ’Annual Income’ attributes and
the likelihood of a borrower falling into the ’Yes’ category for ’Defaulted Borrower’.
By analyzing this historical data, we aim to develop a decision tree model that can provide valuable
insights into the factors influencing loan default and ultimately aid in making more informed lending
decisions in the future.
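As a sanity check on the hand calculations, the sketch below computes the root impurity and the per-attribute gains over all 20 rows of Table 1. This is an illustration only: the assignment asks for a 10-record training subset, so your own figures will differ. The names `ROWS`, `ATTRS`, and `gain` are hypothetical helpers introduced for this example.

```python
from collections import Counter
from math import log2

# All 20 rows of Table 1:
# (Home Owner, Marital Status, Annual Income, Defaulted Borrower)
ROWS = [
    ("Yes", "Single", "80K", "No"),    ("No", "Married", "70K", "Yes"),
    ("No", "Single", "90K", "No"),     ("Yes", "Married", "60K", "Yes"),
    ("No", "Divorced", "75K", "Yes"),  ("Yes", "Single", "70K", "Yes"),
    ("No", "Married", "100K", "No"),   ("Yes", "Divorced", "85K", "No"),
    ("No", "Single", "65K", "Yes"),    ("Yes", "Married", "95K", "No"),
    ("No", "Single", "80K", "No"),     ("Yes", "Married", "70K", "Yes"),
    ("Yes", "Divorced", "90K", "No"),  ("No", "Single", "60K", "Yes"),
    ("Yes", "Single", "75K", "No"),    ("No", "Married", "85K", "Yes"),
    ("No", "Divorced", "100K", "No"),  ("Yes", "Single", "65K", "Yes"),
    ("Yes", "Married", "80K", "No"),   ("No", "Single", "70K", "Yes"),
]
ATTRS = {"Home Owner": 0, "Marital Status": 1, "Annual Income": 2}
TARGET = 3  # column index of Defaulted Borrower


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(rows, col, impurity):
    """Parent impurity minus the size-weighted impurity after splitting on col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[TARGET])
    weighted = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return impurity([r[TARGET] for r in rows]) - weighted


labels = [r[TARGET] for r in ROWS]
print(Counter(labels))   # 10 'Yes' and 10 'No'
print(entropy(labels))   # 1.0
print(gini(labels))      # 0.5
for name, col in ATTRS.items():
    print(name, round(gain(ROWS, col, entropy), 4), round(gain(ROWS, col, gini), 4))
```

On the full table, Annual Income yields the largest gain under both measures. Part of the reason is that it has nine distinct levels, and an attribute with many values splits the data into small, nearly pure groups. Keep this in mind when explaining your choice of split.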
2 Data Set 2: Fruit Classification
The classification of produce plays a vital role in various stages, from cultivation and harvesting to market
distribution and consumer understanding. Accurate categorization based on observable characteristics is
essential for quality control, inventory management, and informed consumer choices. This section delves
into the problem of fruit classification, specifically aiming to distinguish between citrus and non-citrus
fruits based on a set of readily identifiable attributes.
To automate and streamline the sorting process, a system capable of classifying fruits based on
their visual and tactile properties would be highly beneficial. This dataset represents a collection of
observations made on various types of fruits, recording their color, size, sweetness level, and texture.
The ultimate goal is to build a decision tree model that can learn the underlying rules governing the
classification of these fruits into the categories of ’Citrus’ and ’Not Citrus’.
The dataset comprises fifteen instances, each representing a different fruit. For each fruit, we have a
record of its key characteristics, which are believed to be relevant in determining its classification. The
target variable in this case is the ’Class’, indicating whether the fruit belongs to the ’Citrus’ category or
the ’Not Citrus’ category. The data collected is presented in Table 2.
Color: This is a categorical variable describing the predominant color of the fruit. The observed
colors in this dataset include ’Orange’, ’Yellow’, ’Green’, ’Red’, and ’Blue’. Color is often a primary
visual cue used in fruit identification.
Size: This is a categorical variable indicating the general size of the fruit. The recorded sizes are
’Medium’, ’Small’, and ’Large’. Size can be another distinguishing physical characteristic.
Sweetness: This is a categorical variable representing the perceived sweetness level of the fruit.
The levels recorded are ’High’, ’Low’, and ’Medium’. Sweetness is a key sensory attribute that
often correlates with fruit type.
Texture: This is a categorical variable describing the texture of the fruit’s skin or flesh. The
textures recorded are ’Smooth’, ’Firm’, ’Crisp’, and ’Soft’. Texture can be an important tactile
characteristic for identification.
Class (Citrus/Not Citrus): This is the target variable, a binary categorical variable indicating
the classification of the fruit. The possible values are ’Citrus’ and ’Not Citrus’. Our objective is
to build a decision tree that can predict this class based on the fruit’s color, size, sweetness, and
texture.
By constructing a decision tree using this data, we aim to create a simple yet interpretable model that
can effectively classify fruits as either citrus or non-citrus based on their readily observable characteristics.
This model could potentially be used as a foundational step in developing automated fruit sorting or
identification systems for agricultural applications.
Department: This is a categorical variable indicating the department to which the employee
belongs. The departments represented in this dataset are ’Sales’, ’Marketing’, ’HR’ (Human
Resources), and ’Engineering’. Different departments might have varying promotion criteria and
opportunities.
Salary (K): This is a numerical variable representing the employee’s current annual salary in
thousands of Rupees. Salary level can often be correlated with experience, seniority, and performance,
potentially influencing promotion decisions.
Education Level: This is a categorical variable indicating the employee’s highest level of
educational attainment. The levels recorded are ’Bachelor’, ’Master’, and ’Associate’. Education level
can be a factor considered for certain roles and career advancements.
Performance Rating: This is an ordinal categorical variable representing the employee’s
performance rating as assessed in their most recent performance review. Ratings are numeric,
and only the values 3 and 4 are observed in this dataset. Higher performance ratings are generally expected to
increase the likelihood of promotion.
Promoted: This is the target variable, a binary categorical variable indicating whether the
employee was promoted in the review cycle corresponding to the recorded attributes. The possible
values are ’Yes’ or ’No’. Our goal is to build a decision tree model that can predict this outcome
based on the employee’s department, salary, education level, and performance rating.
By analyzing this employee data, we aim to construct a decision tree model that can shed light on
the relationships between these employee attributes and the likelihood of promotion within the company.
This model could provide valuable insights for the HR department in understanding the implicit rules
governing past promotion decisions.
4.1 Field Descriptions
To understand the factors that might influence a restaurant’s success in the region, let’s examine each
of the recorded attributes:
Cuisine Type: This is a categorical variable indicating the primary type of cuisine offered by
the restaurant. The cuisine types represented in this dataset include ’Mexican’, ’Italian’, ’Asian’,
’American’, and ’Indian’. Different cuisines might appeal to varying segments of the local
population and have different operational costs.
Average Cost: This is a numerical variable representing the estimated average cost per meal
for a customer at the restaurant. Pricing strategy is a critical factor in attracting customers and
ensuring profitability.
Location Quality (1-5): This is an ordinal categorical variable representing a subjective rating
of the restaurant’s location quality on a scale from 1 to 5, where 1 indicates a poor location
and 5 indicates an excellent location. Location is a fundamental aspect of a restaurant’s success,
influencing foot traffic and accessibility.
Marketing Spend (K): This is a numerical variable representing the approximate amount of
money spent by the restaurant on marketing and advertising efforts, expressed in thousands of
Rupees. Effective marketing is essential for raising awareness and attracting customers.
Success (High/Low): This is the target variable, a binary categorical variable indicating the
overall level of success achieved by the restaurant. The success is categorized as either ’High’ or
’Low’, based on factors such as profitability, customer reviews, and longevity. Our objective is to
build a decision tree model that can predict this success level based on the restaurant’s cuisine type,
average cost, location quality, and marketing spend within the context of the restaurant market.
By analyzing this data, we aim to construct a decision tree model that can provide insights into the
key factors driving restaurant success (or lack thereof). This model could be a valuable tool for potential
restaurant owners and existing establishments looking to understand the dynamics of the local culinary
market.
Table 5: Customer Churn Data
AccountAge   PlanType   MonthlyUsageGB   SupportTickets   DeviceType   Churn
Young        Basic      High             None             Mobile       No
Mid          Premium    Low              One              Tablet       Yes
Mid          Basic      Low              None             Mobile       No
Mid          Economy    High             Two              Desktop      Yes
Old          Basic      Low              None             Desktop      No
Old          Premium    Medium           One              Mobile       Yes
Young        Basic      Medium           One              Tablet       Yes
Mid          Premium    Low              Two              Mobile       No
Old          Premium    High             One              Desktop      Yes
Young        Premium    Medium           Two              Tablet       Yes
Mid          Basic      High             None             Mobile       No
Old          Premium    Low              One              Desktop      Yes
Young        Basic      High             Two              Tablet       Yes
Mid          Premium    Low              None             Mobile       No
Old          Basic      Medium           One              Desktop      Yes
AccountAge: This is an ordinal categorical variable indicating the age of the customer’s account
(values: ’Young’, ’Mid’, ’Old’). Account tenure might be related to customer loyalty, with newer
accounts potentially being at higher risk of churn.
PlanType: This is a categorical variable representing the type of service plan the customer has
subscribed to (values: ’Basic’, ’Premium’, ’Economy’). Different plan features and pricing might
influence customer satisfaction and churn.
MonthlyUsageGB: This is an ordinal categorical variable indicating the customer’s average
monthly data usage in gigabytes (values: ’High’, ’Low’, ’Medium’). Usage patterns might
correlate with plan suitability and customer needs.
SupportTickets: This is an ordinal categorical variable indicating the number of support tickets
the customer has submitted (values: ’None’, ’One’, ’Two’). A higher number of support interactions
could indicate dissatisfaction or issues with the service, potentially leading to churn.
DeviceType: This is a categorical variable indicating the primary type of device the customer
uses to access the services (values: ’Mobile’, ’Tablet’, ’Desktop’). The primary device might reflect
the customer’s usage habits and needs.
Churn: This is the target variable, a binary categorical variable indicating whether the customer
has churned, that is, discontinued their service (values: ’Yes’, ’No’). Our objective is to build a
decision tree model that can predict this outcome based on the customer’s account age, plan type,
monthly data usage, support ticket history, and primary device type within the context of the
telecommunications market.
By analyzing this customer data, we aim to construct a decision tree model that can identify the key
factors associated with customer churn for the telecommunications provider. This model could provide
valuable insights for developing targeted retention strategies and improving overall customer experience.
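The recursive tree growth and testing steps from the instructions can be sketched in Python on Table 5. This is a minimal sketch under stated assumptions, not a reference solution. It uses the first 10 rows for training and the last 5 for testing purely for illustration; your own split may differ. It splits on Information Gain only and breaks ties by attribute order. When a test record has an attribute value not seen on that branch, it falls back to the majority class at that node. The helper names (`grow`, `predict`, `info_gain`) are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["AccountAge", "PlanType", "MonthlyUsageGB", "SupportTickets", "DeviceType"]
TARGET = "Churn"
TABLE5 = """\
Young Basic High None Mobile No
Mid Premium Low One Tablet Yes
Mid Basic Low None Mobile No
Mid Economy High Two Desktop Yes
Old Basic Low None Desktop No
Old Premium Medium One Mobile Yes
Young Basic Medium One Tablet Yes
Mid Premium Low Two Mobile No
Old Premium High One Desktop Yes
Young Premium Medium Two Tablet Yes
Mid Basic High None Mobile No
Old Premium Low One Desktop Yes
Young Basic High Two Tablet Yes
Mid Premium Low None Mobile No
Old Basic Medium One Desktop Yes
"""
ROWS = [dict(zip(ATTRS + [TARGET], line.split())) for line in TABLE5.splitlines()]

# Assumption for illustration only: first 10 rows train, last 5 rows test.
train, test = ROWS[:10], ROWS[10:]


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr):
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[TARGET])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[TARGET] for r in rows]) - weighted


def grow(rows, attrs):
    """Return a class label (leaf) or a dict describing an internal node."""
    labels = [r[TARGET] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # pure node, or no attributes left to split on
    best = max(attrs, key=lambda a: info_gain(rows, a))  # ties: first in attrs
    rest = [a for a in attrs if a != best]
    branches = {}
    for r in rows:
        branches.setdefault(r[best], []).append(r)
    return {"attr": best, "default": majority,
            "children": {v: grow(sub, rest) for v, sub in branches.items()}}


def predict(tree, row):
    """Walk from the root to a leaf; unseen values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


tree = grow(train, ATTRS)
correct = sum(predict(tree, r) == r[TARGET] for r in test)
print("root split:", tree["attr"])   # SupportTickets for this particular split
print(f"test accuracy: {correct}/{len(test)}")
```

Swapping `entropy` for a Gini impurity function inside `info_gain` gives the Gini-based tree, which can then be compared against the entropy-based one.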