
Subject: Machine Learning Subject Code: IT-613

Assignment-3

1. What are the different types of regression?


2. Can a decision tree be used for regression? If yes,
explain how. If no, explain why.
3. What do you mean by information gain and
entropy? How is it used to build the decision
trees? Illustrate using an example.
4. What are issues in decision tree learning? How are
they overcome?
5. Use the following data to generate a linear
regression model for annual salary as a function of
GPA and number of months worked.
1. What are the different types of regression?

The main types of regression models used in machine learning are:

 Linear Regression: Predicts the relationship between the dependent and independent
variables by fitting a linear equation. It’s suitable for continuous data with a linear
relationship.
 Logistic Regression: Despite its name, used for binary classification problems. It
predicts the probability that a given input belongs to a particular category, using the
logistic (sigmoid) function.
 Polynomial Regression: Extends linear regression by using polynomial terms. It’s
useful when the relationship between the independent and dependent variables is
non-linear.
 Ridge Regression: A form of linear regression that includes a penalty term for large
coefficients, helping to avoid overfitting.
 Lasso Regression: Similar to ridge regression but with a penalty that can shrink some
coefficients to zero, effectively performing feature selection.
 Elastic Net Regression: A combination of ridge and lasso regression, useful for
handling multicollinearity and feature selection.
 Support Vector Regression (SVR): Uses Support Vector Machine principles to fit a
model within a margin of tolerance. It's suitable for complex relationships in small
datasets.
 Decision Tree Regression: A tree-based model where each decision node represents
a test on an attribute, and each leaf node represents an output value.
 Random Forest Regression: An ensemble method that combines multiple decision
trees for improved accuracy and reduced overfitting.
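
As a quick illustration (a minimal sketch assuming scikit-learn is installed; the toy data and alpha values are invented for demonstration), several of the types listed above share the same fit/predict interface:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Toy data: one feature with a roughly linear trend plus noise (illustrative only)
rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0.0, 1.0, 10)

# Fit plain, ridge, lasso, and elastic net regression on the same data
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)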

2. Can a decision tree be used for regression?

Yes, decision trees can be used for regression. This is known as Decision Tree Regression.

Explanation:

 In decision tree regression, the algorithm splits the data at each node on the feature
and threshold that minimize the variance (or mean squared error) of the target
variable within the resulting subsets.
 Unlike classification trees, where each leaf represents a class label, each leaf node in
a regression tree outputs a continuous value, typically the mean of the target values
of the training samples that reach that leaf.
 Decision trees for regression are particularly useful for capturing complex,
non-linear relationships between variables, as sketched below.
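
A minimal sketch (assuming scikit-learn; the sine-shaped data is invented for illustration) showing that a regression tree produces piecewise-constant predictions, each leaf returning the mean of its training targets:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Non-linear toy target: y = sin(x) sampled on [0, 6] (illustrative only)
X = np.linspace(0.0, 6.0, 40).reshape(-1, 1)
y = np.sin(X.ravel())

# Each leaf of the fitted tree predicts the mean target of its region
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[1.5], [4.0]]))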
3. What do you mean by information gain and entropy? How is it used to
build decision trees? Illustrate using an example.

 Entropy: In the context of decision trees, entropy is a measure of the impurity or
randomness in a dataset. For a dataset S with classes i, it is calculated as:

Entropy(S) = −Σᵢ pᵢ ⋅ log₂(pᵢ)

where pᵢ is the proportion of examples in S belonging to class i.

 Information Gain: Information gain is a metric that quantifies the reduction in
entropy (or impurity) achieved by splitting the dataset on an attribute A. It is
calculated as:

Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) ⋅ Entropy(Sᵥ)

where Sᵥ is the subset of S for which attribute A takes value v.

How it’s used to build decision trees:

1. Starting at the root, the algorithm computes the entropy of the current node and
the weighted entropy of the subsets produced by each candidate split.
2. It chooses the split that maximizes the information gain, thereby reducing the
dataset’s impurity the most.
3. This process continues recursively until each node is pure (or a stopping
condition is reached).

Example: Suppose we want to classify whether a person will buy a car based on their
income (High/Low). If a node holds 5 buyers and 5 non-buyers, its entropy is 1.0. Splitting
on income might yield one subset with 4 buyers and 1 non-buyer and another with 1 buyer
and 4 non-buyers; each subset has entropy ≈ 0.722, so the information gain is
1.0 − 0.722 = 0.278. At every node, the algorithm picks the attribute with the highest such
gain. The snippet below verifies this arithmetic.
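
A small sketch in plain Python (the class counts are the illustrative ones from the example above):

import math

def entropy(counts):
    # Shannon entropy of a class-count distribution, in bits
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [5, 5]                # 5 buyers, 5 non-buyers at the parent node
children = [[4, 1], [1, 4]]    # subsets after splitting on income

# Information gain = parent entropy minus weighted child entropy
weighted = sum(sum(c) / sum(parent) * entropy(c) for c in children)
print(f"Gain = {entropy(parent) - weighted:.3f}")   # Gain = 0.278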

4. What are issues in decision tree learning? How are they overcome?

Common issues with decision trees include:

 Overfitting: Decision trees can easily become too complex, capturing noise in the
data.
o Solution: Use techniques like pruning (removing branches that have low
importance), setting a minimum number of samples per leaf, or limiting tree
depth.
 High variance: Small changes in data can result in a different tree structure.
o Solution: Use ensemble methods like Random Forest, which averages the
predictions of multiple trees to reduce variance.
 Bias towards features with more levels: Splitting criteria based on raw information
gain tend to favor attributes with many distinct values.
o Solution: Use the gain ratio (as in C4.5), which normalizes information gain
by the split information, or group rare categories before training.
 Sensitivity to imbalance in classes: Decision trees might favor the majority class in
imbalanced datasets.
o Solution: Use resampling techniques like SMOTE or assign class weights to
balance the classes.
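
A short sketch (assuming scikit-learn; the synthetic dataset and parameter values are invented for illustration) of the overfitting remedies above, comparing an unconstrained tree with a pruned one:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

unpruned = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(
    max_depth=4,           # limit tree depth
    min_samples_leaf=5,    # require at least 5 samples per leaf
    ccp_alpha=0.01,        # cost-complexity (post-)pruning strength
    random_state=0,
).fit(X, y)

# The pruned tree is much shallower, which reduces variance
print(unpruned.get_depth(), pruned.get_depth())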

5. Use the following data to generate a linear regression model for annual
salary as a function of GPA and number of months worked.

To perform a multiple linear regression, where Annual Salary is predicted based on GPA and
Months Worked, follow these steps:

Step 1: Define the Linear Regression Model

The model for predicting annual salary (Y) based on GPA (X1) and months worked (X2) is:

Y = β0 + β1⋅X1 + β2⋅X2

where:

 Y is the annual salary,


 X1 is the GPA,
 X2 is the number of months worked,
 β0, β1, and β2 are the coefficients we need to estimate.

Step 2: Prepare the Data for Analysis

From the table:

Example no.  Annual Salary ($)  GPA  Months Worked
1            20000              2.8  48
2            24500              3.4  24
3            23000              3.2  24
4            25000              3.8  24
5            20000              3.2  48
6            22500              3.4  36
7            27500              4.0  24
8            19000              2.6  48
9            24000              3.2  36
10           28500              3.8  12

Step 3: Perform Linear Regression

1. Input Variables: GPA and Months Worked.


2. Target Variable: Annual Salary.

Using a statistical tool (such as Python's statsmodels or sklearn), you can run a multiple
linear regression to estimate the coefficients β0, β1, and β2.

import pandas as pd
import statsmodels.api as sm

# Create a DataFrame with the data from the table above
data = {
    'Annual_Salary': [20000, 24500, 23000, 25000, 20000, 22500, 27500, 19000, 24000, 28500],
    'GPA': [2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8],
    'Months_Worked': [48, 24, 24, 24, 48, 36, 24, 48, 36, 12]
}
df = pd.DataFrame(data)

# Define independent and dependent variables
X = df[['GPA', 'Months_Worked']]
Y = df['Annual_Salary']

# Add a constant to the independent variables (for the intercept β0)
X = sm.add_constant(X)

# Fit the ordinary least squares model
model = sm.OLS(Y, X).fit()

# Print the summary
print(model.summary())

Step 4: Interpretation of Results


After running the regression model, you would obtain values for β0, β1, and β2. Your final
equation might look like:

Annual Salary=β0+β1⋅GPA+β2⋅Months Worked

For example, if the regression analysis yields:

 β0=15000,
 β1=2500,
 β2=100,

Then the model would be:

Annual Salary=15000+2500⋅GPA+100⋅Months Worked

Using this equation, you can predict the annual salary based on any given GPA and months
worked.

Step 5: Use the Model for Prediction

Using the illustrative coefficients from Step 4, we can predict the annual salary for any
given values of GPA and months worked. For instance, if someone has a GPA of 3.5 and has
worked for 30 months, we substitute these values into the equation:

Annual Salary = 15000 + 2500⋅3.5 + 100⋅30

Calculating this:

1. 2500⋅3.5 = 8750
2. 100⋅30 = 3000
3. 15000 + 8750 + 3000 = 26750

So, the predicted annual salary would be $26,750.
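
Equivalently (a sketch reusing the fitted statsmodels model from Step 3; the GPA and months values are the same hypothetical inputs as above), the prediction can be made in code:

import pandas as pd

# Column names must match the design matrix used in fitting, including
# the 'const' column created by sm.add_constant; 3.5 GPA and 30 months
# are the hypothetical inputs from the worked example
new_point = pd.DataFrame({'const': [1.0], 'GPA': [3.5], 'Months_Worked': [30]})
print(model.predict(new_point))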
