
IML UNIT-4 QUESTIONS

1. What is regression analysis? What are the assumptions and challenges in regression analysis? Explain two main problems in regression analysis.

Ans:

Regression Analysis:

Regression analysis is a supervised learning technique.

It is used to predict quantitative (numerical) data.

Regression analysis is a statistical method that models the relationship between a dependent variable and one or more independent variables.

Assumptions in Regression Analysis:

The assumptions in regression analysis are conditions that must be met for the model to provide reliable estimates and valid inferences. Common assumptions include a linear relationship between predictors and response, independence of observations, constant error variance (homoscedasticity), and normally distributed errors.

Common forms of regression model:

• Simple Linear Regression: In simple linear regression there is only one dependent variable and one independent variable, so the model has a single predictor.
▪ The mathematical equation for the simple linear regression model is shown below.
▪ y = ax + b
o where y is the dependent variable
o x is the independent variable
o a, b are the regression coefficients
• Polynomial Regression: Polynomial regression is a non-linear regression analysis that allows flexible curve fitting by fitting a polynomial equation to the data.
▪ The mathematical expression for the polynomial regression model is shown below.
▪ y = a0 + a1x + a2x^2 + ... + anx^n

o where y is the dependent variable
o x is the independent variable
o a0, a1, ..., an are the coefficients of the polynomial terms.
• Exponential Regression: It is a non-linear type of regression that can be expressed in two ways and is used in fields such as finance, biology, and physics; a fitting sketch follows this list.
▪ The mathematical expression for the exponential regression is:
▪ y = ae^(bx)
o where y is the dependent variable
o x is the independent variable
o a, b are the regression coefficients.
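To make these model forms concrete, here is a hedged sketch fitting the exponential model y = ae^(bx) with NumPy and SciPy; the data values are invented purely for illustration, not taken from any real dataset.

import numpy as np
from scipy.optimize import curve_fit

# Synthetic data roughly following y = 2e^(0.5x) with added noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.5, size=x.size)

# Exponential regression model: y = a * e^(b*x).
def exponential(x, a, b):
    return a * np.exp(b * x)

# curve_fit estimates a and b by non-linear least squares.
(a_hat, b_hat), _ = curve_fit(exponential, x, y, p0=(1.0, 0.1))
print(a_hat, b_hat)  # should recover values near a = 2 and b = 0.5

A similar least-squares fit underlies the simple linear and polynomial forms as well.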

Challenges in Regression Analysis:

1. Model Overfitting or Underfitting: An overly complex model may fit the training
data too closely (overfitting) and fail to generalize, while a simple model may miss
patterns (underfitting).

2. Outliers and Influential Points: Outliers can heavily skew results, especially in small
datasets, affecting the accuracy of the model.

3. Missing Data: Missing values can bias the model or reduce the power of analysis if not
handled correctly.

4. Nonlinear Relationships: If relationships are nonlinear, linear regression might not capture them effectively, necessitating transformations or alternative models.

TWO MAIN PROBLEMS IN REGRESSION

Two main problems in regression analysis are multicollinearity and heteroscedasticity.

▪ Multicollinearity: It happens when independent variables are strongly correlated, making it difficult to assess the unique effect of each on the dependent variable. This overlap in variance causes inflated standard errors and can make regression coefficients unstable, potentially misleading analysis.
▪ Heteroscedasticity: Heteroscedasticity occurs when the error variance is not constant across values of the independent variables, violating the assumption of equal error variance.
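A common practical check for multicollinearity is the variance inflation factor (VIF). Below is a hedged sketch using statsmodels on invented data; the column names and the built-in collinearity between x1 and x2 are assumptions constructed purely for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # x2 is deliberately almost a copy of x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A VIF above roughly 10 (some use 5) is a common rule of thumb for harmful
# collinearity; expect very large VIFs for x1 and x2 here, and a small one for x3.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

For heteroscedasticity, diagnostics such as residual plots or the Breusch-Pagan test (also available in statsmodels) are typically used.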
2. What is Linear Regression? Explain the concept of Linear Regression with an example. How can we improve the accuracy of a linear regression model? Mention the Applications, Advantages and disadvantages of Linear Regression.

Ans:

Linear regression:

Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting a
linear equation to observed data.

Concept of Linear Regression:

Linear regression is a statistical method that models the relationship between two variables by
fitting a straight line to the data. This line, known as the "line of best fit," is used to make
predictions.

For example, imagine you want to predict someone’s monthly expenses based on their income.
Using linear regression, you plot income on the x-axis and expenses on the y-axis. The
regression model will calculate the best-fitting line through these points. With this line, you
can predict a person’s expenses for a given income.

The line has an equation of the form:

y = mx + b

where:
• y is the predicted value (e.g., expenses),
• x is the input variable (e.g., income),
• m is the slope of the line (how much y changes with each unit increase in x),
• b is the y-intercept (the value of y when x is 0).

Improving Accuracy of the Model:

The accuracy of a linear regression model can be improved by handling outliers and missing values, selecting relevant features, transforming skewed variables, scaling the inputs, and applying regularization techniques such as Ridge or Lasso to reduce overfitting.
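To make the income/expenses example concrete, here is a hedged sketch using scikit-learn; the income and expense figures are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented monthly income (x) and expenses (y), both in dollars.
income = np.array([[2000], [3000], [4000], [5000], [6000]])
expenses = np.array([1500, 2100, 2700, 3200, 3900])

model = LinearRegression().fit(income, expenses)
print("slope m:", model.coef_[0])        # change in expenses per extra dollar of income
print("intercept b:", model.intercept_)  # predicted expenses at zero income
print("prediction at 4500:", model.predict([[4500]])[0])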
Applications:

1. Risk Assessment in Finance: Used to estimate the relationship between financial indicators and stock prices, aiding in risk management.

2. Medical Predictive Analysis: Assists in predicting patient outcomes by modeling relationships between health indicators (e.g., age, weight) and disease progression.

3. Real Estate Price Estimation: Helps in estimating property prices by analyzing variables such as location, area, and amenities.

4. Market Trend Analysis: Used in economics to understand and predict trends in consumer spending, GDP, and employment rates.

5. Manufacturing Quality Control: Helps detect correlations between production variables and quality measures, improving product consistency.

ADVANTAGES:

• Linear regression is computationally efficient and can handle large datasets effectively.

• Linear regression is relatively robust to outliers compared to other machine learning algorithms; outliers may have a smaller impact on the overall model performance.

• Linear regression often serves as a good baseline model for comparison with more complex machine learning algorithms.

• Linear regression is a well-established algorithm with a rich history and is widely available in various machine learning libraries and software packages.
DISADVANTAGES:

• Linear regression assumes a linear relationship between the dependent and independent variables. If the relationship is not linear, the model may not perform well.

• Linear regression is sensitive to multicollinearity, which occurs when there is a high correlation between independent variables. Multicollinearity can inflate the variance of the coefficients and lead to unstable model predictions.

• Linear regression assumes that the features are already in a suitable form for the model. Feature engineering may be required to transform features into a format that can be effectively used by the model.
3. With relevant examples, explain Multiple Linear Regression in detail. Write the basic formula and calculation procedure for Multiple Linear Regression. List out the differences between Linear Regression and Multiple Regression. Mention the Applications, Advantages and disadvantages of Multiple Linear Regression.

Ans:

Multiple Linear Regression:

Multiple linear regression is a widely used form of predictive analysis. It lets you model the relationship between a continuous dependent variable and two or more independent variables.

Multiple regression analysis allows for the simultaneous control of several factors that affect the dependent variable, and it can be used to examine the link between the independent variables and the dependent variable.

Let k denote the number of independent variables x1, x2, x3, …, xk.

We assume that we have k independent variables that we may set; these variables then probabilistically determine the outcome Y:

Y = β0 + β1x1 + β2x2 + … + βkxk + ε

Example:

Predicting House Prices

Imagine you are trying to predict the price of a house (y) based on:

1. Size (in square feet, x1)
2. Number of bedrooms (x2)
3. Distance to the city center (in miles, x3)

The model might look like this:

Price = 50,000 + 200·(Size) − 5,000·(Distance) + 10,000·(Bedrooms)

If a house is 1500 sq. ft., has 3 bedrooms, and is 10 miles from the city:

Price = 50,000 + 200·1500 − 5,000·10 + 10,000·3 = 330,000
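A hedged sketch of this calculation in code; the coefficients come from the worked example above, while the scikit-learn training data at the end is invented solely to show how such a model would be fitted.

import numpy as np
from sklearn.linear_model import LinearRegression

# Coefficients from the worked example: intercept, then size, bedrooms, distance.
intercept = 50_000
coefs = np.array([200, 10_000, -5_000])  # size, bedrooms, distance

house = np.array([1500, 3, 10])          # 1500 sq. ft., 3 bedrooms, 10 miles
price = intercept + coefs @ house
print(price)  # 330000, matching the hand calculation

# Fitting such a model from (invented) data with scikit-learn:
X = np.array([[1400, 3, 12], [2000, 4, 5], [900, 2, 20], [1700, 3, 8]])
y = np.array([280_000, 420_000, 150_000, 330_000])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)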

Difference Between Linear and Multiple Regression:

Multiple linear regression is preferable to simple linear regression when predicting the outcome of a complex process. Simple linear regression can accurately capture a straightforward relationship between two variables, whereas multiple linear regression can identify more intricate interactions that demand deeper analysis.

A multiple regression model uses several independent variables. It is not constrained by the limitations of the simple regression equation and, with suitably transformed predictors, can capture curved and non-linear relationships. The uses of multiple linear regression are as follows.

• Control and planning.

• Forecasting or prediction

Estimating relationships between variables can be fascinating and helpful. Like all other regression models, the multiple regression model evaluates relationships between variables in terms of their capacity to forecast the value of the dependent variable.

Applications:

• Economics and Finance: Modeling stock market trends, demand-supply curves, and price elasticity.
• Engineering and Physical Sciences: Capturing complex relationships in physics (e.g., projectile motion) or material behavior.
• Medical Studies: Predicting disease progression or drug effectiveness based on various parameters.
• Agriculture: Analyzing crop yields based on environmental factors.
Advantages:

• Captures Non-Linearity
• Easy to Implement
• Flexible and Interpretable
• Enhances Prediction Accuracy

Limitations:

• Overfitting
• Sensitivity to Outliers
• Lack of Extrapolation Power
• Computational Complexity
• Interpretability Issues
4. How does Polynomial Regression work? Explain polynomial regression with a real-life example. How can we overcome the problems of overfitting and underfitting in Polynomial Regression? Mention the Applications, Advantages and Limitations of Polynomial Regression.
Ans:
Polynomial Regression:
It is a form of linear regression in which the relationship between the independent variable x
and dependent variable y is modelled as an nth-degree polynomial. Polynomial regression fits
a nonlinear relationship between the value of x and the corresponding conditional mean of y,
denoted E(y | x).

Polynomial regression is a type of regression analysis used in statistics and machine learning when the relationship between the independent variable (input) and the dependent variable (output) is not linear.

How Polynomial Regression Works:

To evolve from linear regression to polynomial regression, we simply add higher-order terms of the input features to the feature space. This is sometimes loosely described as feature engineering, though not exactly.

When the relationship is non-linear, a polynomial regression model introduces higher-degree polynomial terms.

The general form of a polynomial regression equation of degree n is:

y = β0 + β1x + β2x^2 + … + βnx^n + ε

• y is the dependent variable.
• x is the independent variable.
• β0, β1, …, βn are the coefficients of the polynomial terms.
• n is the degree of the polynomial.
• ε represents the error term.

The basic goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable x. In simple linear regression, we used the following equation: y = a + bx + e, where y is the dependent variable, a is the y-intercept, b is the slope and e is the error term. In general, we can model the nth-degree case as above.
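As a hedged sketch of this idea, the snippet below expands the input into higher-order terms and then fits an ordinary linear model; the data and the choice of degree 2 are assumptions made purely for illustration.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative non-linear data: y roughly follows 1 + 2x + 3x^2 plus noise.
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 40).reshape(-1, 1)
y = (1 + 2 * x + 3 * x**2).ravel() + rng.normal(0, 1, 40)

# The degree-2 expansion turns [x] into [1, x, x^2]; the regression itself stays
# linear in the coefficients, which is why this is still a form of linear regression.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))  # prediction near 1 + 2(2) + 3(4) = 17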

Polynomial Regression Real-Life Example:


Let’s consider a real-life example to illustrate the application of polynomial regression. Suppose you are working in the field of finance, and you are analyzing the relationship between the years of experience an employee has and their corresponding salary (in dollars). You suspect that the relationship might not be linear and that higher degrees of the polynomial might better capture the salary progression over time.

Now, let’s apply polynomial regression to model the relationship between years of
experience and salary. We’ll use a quadratic polynomial (degree 2) for this example.

The quadratic polynomial regression equation is:

Salary = β0 + β1·Experience + β2·Experience^2 + ε

Now, to find the coefficients that minimize the difference between the predicted salaries and the actual salaries in the dataset, we can use the method of least squares. The objective is to minimize the sum of squared differences between the predicted values and the actual values.
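A hedged sketch of this quadratic fit using NumPy's least-squares polynomial fitting; the experience/salary pairs are invented for illustration, not real salary data.

import numpy as np

# Invented data: years of experience vs. salary in dollars.
experience = np.array([1, 2, 3, 5, 7, 10, 12, 15])
salary = np.array([40_000, 45_000, 52_000, 64_000, 78_000, 95_000, 104_000, 115_000])

# polyfit with deg=2 solves the least-squares problem for beta2, beta1, beta0.
beta2, beta1, beta0 = np.polyfit(experience, salary, deg=2)

# Predicted salary at 8 years of experience using the fitted quadratic.
print(beta0 + beta1 * 8 + beta2 * 8**2)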

Overfitting and Underfitting in Polynomial Regression:

Polynomial regression often leads to overfitting when the model's complexity increases to fit the training data too closely, resulting in poor performance on new data. To address this, regularization techniques like Ridge and Lasso regression are used. These methods penalize large model weights, reducing overfitting by discouraging overly complex models. Ridge regression minimizes the sum of squared weights, while Lasso regression encourages sparsity by driving some weights to zero, simplifying the model. Regularization ensures a balance between fitting the training data and maintaining generalization to unseen data. Underfitting, conversely, is handled by increasing the polynomial degree or adding relevant features; cross-validation can help select a degree that balances the two.
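A hedged sketch of applying regularization in scikit-learn; the degree, alpha values, and data are assumptions chosen only to illustrate the technique.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)

# A degree-10 polynomial would normally overfit 30 noisy points; the Ridge
# penalty (alpha) shrinks the weights, and Lasso drives some weights to zero.
ridge = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0)).fit(x, y)
lasso = make_pipeline(PolynomialFeatures(degree=10), Lasso(alpha=0.01, max_iter=50_000)).fit(x, y)
print(ridge.predict([[0.5]]), lasso.predict([[0.5]]))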

Application:

The reason behind the wide range of use cases of polynomial regression is that much real-world data is non-linear in nature, so fitting a non-linear model or curvilinear regression line often gives far better results than standard linear regression. Some of the use cases of polynomial regression are stated below:

• The growth rate of tissues.

• Progression of disease epidemics

• Distribution of carbon isotopes in lake sediments

Advantages:

• A broad range of functions can be fit under it.

• Polynomial regression fits a wide range of curvatures.

• Polynomial regression can provide a good approximation of the relationship between the dependent and independent variables.

Disadvantages:

• Polynomial models are highly sensitive to outliers.

• The presence of one or two outliers in the data can seriously affect the results of
nonlinear analysis.

• In addition, there are unfortunately fewer model validation tools for the detection of
outliers in nonlinear regression than there are for linear regression.
5. What are odds? Why are they used in logistic regression? With an example, describe how we can solve a multiclass classification problem using Logistic Regression. Discuss the assumptions in logistic regression. Mention the Applications, Advantages and Limitations of Logistic Regression.
Ans:

Odds:
Odds represent the ratio of the probability of an event occurring to the probability of it not
occurring. Mathematically:
Odds = P(event) / (1 − P(event))
In logistic regression, odds are used because the model predicts probabilities of outcomes, and
the log of the odds (logit) provides a linear relationship between the predictors and the target.
This makes it easier to estimate parameters and interpret the relationship between variables.
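A tiny hedged sketch of the odds/logit relationship; the probability value 0.8 is an arbitrary example.

import math

p = 0.8                      # probability of the event (arbitrary example)
odds = p / (1 - p)           # 4.0: the event is four times as likely as not
logit = math.log(odds)       # log-odds, which logistic regression models linearly

# The sigmoid inverts the logit back into a probability.
sigmoid = 1 / (1 + math.exp(-logit))
print(odds, logit, sigmoid)  # 4.0, ~1.386, 0.8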

Solving Multiclass Classification with Logistic Regression:

Logistic regression is inherently a binary classifier, but multiclass problems can be solved using
techniques like:

• One-vs-Rest (OvR): The problem is divided into multiple binary classification tasks,
where one class is treated as the positive class and all others as negative. For example,
to classify handwritten digits (0-9), 10 binary classifiers are trained, one for each digit.
• One-vs-One (OvO): A binary classifier is trained for every pair of classes. For example, for classifying digits, classifiers for (0 vs 1), (0 vs 2), (1 vs 2), and so on are created, resulting in (10 choose 2) = 10·9/2 = 45 classifiers. The final prediction is based on a voting mechanism.

In logistic regression, multiclass classification is handled using techniques like One-vs-Rest (OvR) or Softmax Regression.

Suppose we want to predict the branch of engineering a student will choose based on their scores in Physics, Chemistry, and Mathematics. The possible branches are:

1: Computer Science
2: Electrical Engineering
3: Mechanical Engineering

Steps:

Using One-vs-Rest (OvR), logistic regression creates separate binary classifiers:

Classifier 1: "Computer Science" vs. "Not Computer Science"
Classifier 2: "Electrical Engineering" vs. "Not Electrical Engineering"
Classifier 3: "Mechanical Engineering" vs. "Not Mechanical Engineering"
Each classifier predicts probabilities, and the branch with the highest probability is selected as
the final prediction.
Alternatively, Softmax Regression computes probabilities directly for all classes in a single
model, assigning the class with the highest probability.
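A hedged sketch of both approaches in scikit-learn; the subject scores and labels are invented, and the encoding 0 = Computer Science, 1 = Electrical, 2 = Mechanical follows the example above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Invented [Physics, Chemistry, Mathematics] scores with branch labels.
X = np.array([[90, 70, 95], [60, 85, 70], [75, 65, 60],
              [88, 72, 91], [58, 80, 68], [70, 60, 55]])
y = np.array([0, 1, 2, 0, 1, 2])

# One-vs-Rest: one binary logistic classifier per class; highest probability wins.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# In recent scikit-learn versions, plain LogisticRegression fits a single
# multinomial (softmax) model on multiclass data with its default solver.
softmax = LogisticRegression(max_iter=1000).fit(X, y)

student = [[85, 75, 90]]
print(ovr.predict(student), softmax.predict_proba(student))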

Assumptions:
❖ Independent observations: Each observation is independent of the others, meaning the observations are not correlated with one another.
❖ Binary dependent variables: It takes the assumption that the dependent variable must
be binary or dichotomous, meaning it can take only two values.
❖ Linearity relationship between independent variables and log odds: The
relationship between the independent variables and the log odds of the dependent
variable should be linear.
❖ No outliers: There should be no outliers in the dataset.
❖ Large sample size: The sample size should be sufficiently large to yield reliable estimates.

Applications:
❖ Healthcare: Predicting the presence of a disease (e.g., diabetes or cancer detection).
❖ Marketing: Classifying customer responses (e.g., whether they will purchase a product
or not).
❖ Finance: Assessing credit risk or likelihood of loan default.
❖ Education: Predicting student performance or dropout risk.
❖ Social Science: Modeling voting behavior or survey responses.
Advantages:
• Simplicity and Interpretability
• Efficiency
• Probability Output
• Works Well with Linearly Separable Data
• Less Prone to Overfitting

Disadvantages:
• Linear Decision Boundaries
• Sensitive to Outliers
• Assumes Independence of Features
• Limited to Classification Tasks (binary, or multiclass via extensions)
• Requires Large Datasets
6. Write the difference between Probability and Likelihood. In detail, explain the concept of Maximum Likelihood Estimation with an example. Mention the Applications, Advantages and disadvantages of Maximum Likelihood Estimation.
Ans:

Difference Between Probability and Likelihood:

Although probability and likelihood appear similar in working and intuition, there is a subtle difference. Likelihood is a function that tells us how well particular parameter values explain the observed data; it measures how plausible the parameters are given the data, and it is what a machine learning algorithm maximizes when fitting a model to a data distribution.

Probability, in simple words, is a term that describes the chance of some event happening given fixed conditions or circumstances, often expressed as a conditional probability.

Also, the probabilities of all possible outcomes of a particular problem sum to one and cannot exceed it, whereas likelihood values need not sum to one and can be greater than one.

Maximum Likelihood Estimation:


Maximum Likelihood Estimation (MLE) aims to find the parameters that maximize the
likelihood of observing the given data. In a classification problem where the independent
variable is student marks and the target is whether a student gets placed ("Yes" or "No"), MLE
estimates the probabilities for each data point based on the target conditions. The algorithm
then plots the data points and identifies the best-fit line that separates the two classes (placed
vs. not placed). This process continues for several iterations (epochs) to refine the line. Once
the best-fit line is found, it is used to classify new data points by plotting them on the graph.

Example:

Maximum likelihood estimation is the basis of several machine learning and deep learning approaches used for classification problems. One example is logistic regression, where the algorithm classifies data points using the best-fit line (decision boundary) on the graph.

In such a plot, the data observations are placed in a two-dimensional diagram where the x-axis represents the independent column (the training data) and the y-axis represents the target variable. A line is drawn to separate the two kinds of observations, positives and negatives. According to the algorithm, observations that fall above the line are considered positive, and data points below the line are regarded as negative.
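As a hedged sketch of MLE in code, the snippet below fits the marks-vs-placement example by numerically minimizing the negative Bernoulli log-likelihood with SciPy; the marks and placement labels are invented for illustration.

import numpy as np
from scipy.optimize import minimize

# Invented data: student marks and whether they were placed (1) or not (0).
marks = np.array([35, 45, 50, 55, 60, 65, 70, 80, 85, 90], dtype=float)
placed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

def neg_log_likelihood(params):
    b0, b1 = params
    p = 1 / (1 + np.exp(-(b0 + b1 * marks)))  # sigmoid probability of placement
    eps = 1e-9                                # guard against log(0)
    return -np.sum(placed * np.log(p + eps) + (1 - placed) * np.log(1 - p + eps))

# MLE: find the parameters that maximize the likelihood of the observed labels.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
b0, b1 = result.x
print(b0, b1)  # fitted logistic regression coefficients

This is the same objective that logistic regression libraries optimize internally, usually with more robust solvers.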

Applications:
• Machine Learning: MLE is used to estimate the parameters of various models,
including logistic regression, Naive Bayes, and hidden Markov models.
• Econometrics: It helps in estimating the parameters of economic models, such as
demand and supply functions.
• Reliability Engineering: Used to estimate the failure rates of systems and components.
• Signal Processing: MLE is applied to estimate parameters in signal noise models, such
as in radar or communication systems.

Advantages:
• Asymptotically Efficient
• Flexibility
• Consistent
• Statistical Inference
• Broad Applicability
Disadvantages:
• Computational Complexity
• Sensitivity to Outliers
• Requires Large Sample Sizes
• Model Assumptions
• Local Maxima
