DS_UNIT_4
• Simple linear regression establishes a relationship between two variables using a straight line.
• The goal is to determine the slope and intercept that define the line minimizing the regression errors.
Example: Predicting an employee's salary based on their years of experience. Data points for experience and
salary are plotted, and a line is fitted to predict future salaries.
• Multiple linear regression models the linear relationship between one dependent variable and two or more
independent variables.
• It assumes a linear relationship between predictors and the target variable.
Example: Predicting sales based on advertising expenditure on TV and newspapers. Data for these variables
is used to create a model that predicts sales using a linear combination of the predictors.
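The salary example above can be sketched in a few lines of NumPy; the experience/salary figures below are made-up illustrative values, not data from the text.

```python
import numpy as np

# Hypothetical data: years of experience vs. salary (in $1000s)
experience = np.array([1, 2, 3, 4, 5], dtype=float)
salary = np.array([40, 45, 52, 58, 63], dtype=float)

# Least-squares fit of the line salary = slope * experience + intercept
slope, intercept = np.polyfit(experience, salary, deg=1)

# Predict the salary for an employee with 6 years of experience
predicted = slope * 6 + intercept
```

The same call with more columns of predictors (via `np.linalg.lstsq`) generalizes this to the multiple-regression sales example.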
2. Briefly describe model evaluation using a confusion matrix, with an example.
3. What is a confusion matrix, and how is it used to evaluate classification models?
Define the terms True Positive, True Negative, False Positive, and False Negative
with an example.
A confusion matrix is a table used to evaluate the performance of a classification model by comparing
actual and predicted outcomes. It summarizes the model's performance on a test dataset for which the true
values are known.
Components of a Confusion Matrix
• True Positive (TP): The model predicted "True," and it is actually "True."
• True Negative (TN): The model predicted "False," and it is actually "False."
• False Positive (FP): The model predicted "True," but it is actually "False." (Type I Error)
• False Negative (FN): The model predicted "False," but it is actually "True." (Type II Error)
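These four counts can be tallied directly from paired label lists; the actual/predicted values below are hypothetical.

```python
# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Count each cell of the 2x2 confusion matrix
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # True Positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # True Negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type I errors
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type II errors
```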
4. What is polynomial regression? Give the types and assumptions of polynomial
regression with diagrams
Polynomial Regression
Polynomial regression is a type of regression analysis where the relationship between the independent
variable (X) and the dependent variable (Y) is modeled as an nth-degree polynomial. It is used when the data
points form a nonlinear relationship, and the goal is to fit a curve that best describes the data.
Assumptions of Polynomial Regression
1. Additive Relationship
o The dependent variable is an additive function of the independent variables and their polynomial
terms.
2. Independent Variables
o The independent variables are not correlated with one another.
3. Normally Distributed Errors
o The errors are normally distributed with a mean of zero and constant variance.
4. No Multicollinearity
o The independent variables should not be strongly correlated.
5. Explain polynomial regression and when it is preferred over linear regression. Fit a
second-degree polynomial to the following data points: X = [1, 2, 3, 4] and Y = [2.3,
4.1, 6.2, 8.5].
1. Non-Linear Relationships:
o When the relationship between X and Y is not linear and exhibits curvature, polynomial
regression is preferred.
2. Underfitting by Linear Models:
o If a linear model fails to capture the data's trend or patterns, polynomial regression provides a better
fit.
3. Smooth Approximation:
o Polynomial regression offers a smooth approximation to the data, unlike some machine learning
methods that may overfit.
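The second-degree fit asked for in the question can be computed with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Data from the question
X = np.array([1, 2, 3, 4], dtype=float)
Y = np.array([2.3, 4.1, 6.2, 8.5])

# Least-squares fit of the second-degree polynomial y = a*x^2 + b*x + c
a, b, c = np.polyfit(X, Y, deg=2)
# a = 0.125, b = 1.445, c = 0.725, i.e. y = 0.125x^2 + 1.445x + 0.725
```

Substituting back, the curve reproduces the data almost exactly (e.g. x = 4 gives 0.125·16 + 1.445·4 + 0.725 = 8.505 ≈ 8.5), confirming the slight upward curvature that a straight line would miss.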
6. Explain Logistic Regression with an example.
Logistic Regression
Logistic Regression is a statistical method used for binary classification problems, where the outcome
(dependent variable) is categorical, typically with two possible classes (e.g., yes/no, true/false,
success/failure). Unlike linear regression, which predicts a continuous output, logistic regression predicts the
probability that a given input point belongs to a certain class.
Applications of Logistic Regression
1. Binary Classification:
o Predicting customer churn (Yes/No)
o Medical diagnoses (Diseased/Healthy)
2. Multinomial Logistic Regression:
o For problems with more than two classes (e.g., classifying types of fruits: apple, banana, cherry).
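A minimal from-scratch sketch of binary logistic regression, assuming a made-up churn dataset with a single feature (months since last purchase); gradient descent on the log-loss fits the weight and bias.

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical churn data: x = months since last purchase, y = churned (1) or not (0)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Fit weight w and bias b by gradient descent on the log-loss
w, b, lr = 0.0, 0.0, 0.1
for _ in range(20000):
    p = sigmoid(w * x + b)          # predicted probability of churn
    w -= lr * np.mean((p - y) * x)  # gradient of the log-loss w.r.t. w
    b -= lr * np.mean(p - y)        # gradient of the log-loss w.r.t. b

# Classify with a 0.5 probability threshold
preds = (sigmoid(w * x + b) >= 0.5).astype(int)
```

On this toy data the learned decision boundary lands between 4 and 5 months, so all eight customers are classified correctly.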
7. What is logistic regression, and how does it differ from linear regression? Derive
the sigmoid function and explain its role in binary classification.
Logistic Regression is a statistical model used for binary classification tasks, where the outcome (dependent
variable) is categorical and typically consists of two possible outcomes (e.g., 0/1, Yes/No, True/False). It
predicts the probability that a given input point belongs to a certain class (usually class 1). Unlike linear
regression, which predicts continuous values, logistic regression uses a logistic (sigmoid) function to model
probabilities that range from 0 to 1.
• Output: Logistic regression predicts probabilities, which are continuous values between 0 and 1.
• Binary Classification: It is mainly used when the dependent variable has two categories (binary outcomes).
• Uses Sigmoid Function: The predicted value is passed through the sigmoid function to convert it into a
probability.
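The sigmoid itself follows from modeling the log-odds as a linear function: setting log(p / (1 - p)) = z and solving for p gives p = 1 / (1 + e^(-z)). A minimal sketch:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The sigmoid inverts the log-odds (logit): if p = sigmoid(z), then z = log(p / (1 - p))
p = sigmoid(2.0)
z_back = math.log(p / (1 - p))  # recovers 2.0
```

Because the output is always strictly between 0 and 1, it can be read as the probability of class 1, and thresholding it at 0.5 yields the binary decision.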
8. Explain the key evaluation metrics derived from the confusion matrix, such as
accuracy, precision, recall, and F1-score. How would you interpret these metrics in
a real-world classification task?
The confusion matrix is a tool used to evaluate the performance of a classification model, particularly for
binary classification tasks. It summarizes the results of a classification problem by comparing the predicted
and actual values. From the confusion matrix, we can derive several key evaluation metrics to assess model
performance.
1. Accuracy
Accuracy is the proportion of correctly classified instances (both positive and negative) out of all instances
in the dataset.
• Interpretation: Accuracy gives a quick overall sense of how well the model performs. However, it
can be misleading when dealing with imbalanced datasets (e.g., when one class is much more
frequent than the other). In such cases, accuracy might be high even if the model performs poorly on
the minority class.
Example: If a model correctly classifies 90 out of 100 instances, the accuracy would be 90%.
2. Precision
Precision (also called Positive Predictive Value) measures the proportion of correctly predicted positive
instances out of all instances predicted as positive.
• Interpretation: Precision answers the question: "Out of all instances the model predicted as positive,
how many were actually positive?" It is crucial when the cost of a False Positive (FP) is high, for
example, in medical diagnoses (where wrongly diagnosing a patient as diseased might lead to
unnecessary treatments).
Example: If a model predicts 50 instances as positive, but 40 are correct (True Positives) and 10 are False
Positives, the precision would be 80% (40/50).
3. Recall
Recall (also called Sensitivity or True Positive Rate) measures the proportion of correctly predicted positive
instances out of all actual positive instances in the dataset.
• Interpretation: Recall answers the question: "Out of all actual positive instances, how many did the
model correctly identify?" It is important when the cost of a False Negative (FN) is high. For
example, in detecting diseases, failing to identify a positive case (False Negative) could be
dangerous, so a high recall is desired.
Example: If 30 instances in a dataset are truly positive and the model correctly identifies 25 of them (True
Positives), the recall would be 25/30 ≈ 83.33%.
4. F1-Score
The F1-Score is the harmonic mean of precision and recall, providing a balanced metric when the class
distribution is imbalanced.
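All four metrics can be computed from the confusion-matrix counts; the counts below are hypothetical.

```python
# Metrics derived from hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 45, 10, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                     # of predicted positives, how many are right
recall    = tp / (tp + fn)                     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
```

Here accuracy is 0.85, precision 0.80, recall ≈ 0.889, and F1 ≈ 0.842; the harmonic mean pulls the F1-score toward the weaker of precision and recall, which is why it is preferred on imbalanced data.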
Interpreting These Metrics in a Real-World Classification Task
Let’s take an example of a spam email detection task, where the goal is to predict whether an email is spam
(positive class, 1) or not spam (negative class, 0).
Scenario 1: Accuracy = 95%, Recall = 50%
• The model classifies most emails correctly overall, but it misses half of the actual spam emails (False
Negatives). Despite the high accuracy, a spam filter that lets half of the spam through is not effective.
Scenario 2: Precision = 90%, Recall = 50%
• The model is very good at identifying spam emails when it predicts them, but it misses many spam emails. It
avoids falsely marking legitimate emails as spam, yet fails to catch many spam emails (False Negatives). In a
real-world setting, this might be acceptable if the user wants to avoid false alarms (legitimate emails marked
as spam).
Scenario 3: Precision = 80%, Recall = 80%
• Here, precision and recall are balanced, so the F1-score will also be high, reflecting a good trade-off
between catching most spam emails and minimizing false positives. This is the ideal situation in most
practical applications.
9. What is multiple linear regression, and how does it differ from simple linear
regression? Construct a model for predicting house prices using independent
variables such as size, location, and number of rooms.
Multiple Linear Regression is an extension of simple linear regression that models the relationship
between two or more independent variables (predictors) and a dependent variable (target). It assumes that
the dependent variable is a linear function of the independent variables.
How Does Multiple Linear Regression Differ from Simple Linear Regression?
• Simple Linear Regression involves only one independent variable to predict the dependent variable.
Its equation is:
Y = β0 + β1X + ε
It models the relationship between a single independent variable X and a dependent variable Y.
• Multiple Linear Regression involves two or more independent variables, making it suitable for cases
where multiple factors influence the dependent variable. Its equation is:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
It captures more complex relationships between the predictors and the target.
Steps to Construct a House Price Prediction Model:
1. Collect Data: Gather data for house prices, size, location, and number of rooms.
2. Prepare Data: Clean the data, handle any missing values, and encode categorical variables (e.g., location).
3. Split Data: Divide the data into training and testing sets.
4. Fit the Model: Use a statistical or machine learning technique to fit the model to the training data.
5. Evaluate the Model: Check the model’s performance using metrics like R-squared, Mean Squared Error
(MSE), etc.
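The steps above can be sketched with NumPy least squares. The house data below is synthetic, generated exactly from price = 50 + 0.1·size + 10·location + 15·rooms (prices in $1000s, location encoded as a 1-5 desirability score), so the fitted coefficients are fully recoverable.

```python
import numpy as np

# Synthetic training data: size (sq. ft), location score (1-5), number of rooms
X = np.array([
    [1400, 3, 3],
    [1600, 2, 3],
    [1700, 4, 4],
    [1875, 3, 4],
    [1100, 1, 2],
    [2350, 5, 5],
], dtype=float)
# Prices generated from: price = 50 + 0.1*size + 10*location + 15*rooms
y = np.array([265, 275, 320, 327.5, 200, 410])

# Add an intercept column and solve the least-squares problem
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b_size, b_loc, b_rooms = coef

# Predict the price of a 2000 sq. ft house, location score 4, 4 rooms
predicted = b0 + b_size * 2000 + b_loc * 4 + b_rooms * 4
```

With real data the fit would not be exact; the residuals would feed the R-squared and MSE evaluation described in step 5.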
10. What is regression analysis? Briefly describe simple and multiple regression.