
Experiment No 4: Linear Regression Model

Objectives:

● Intuition
○ Understanding the basics of linear regression
○ Conceptualizing how linear regression works

● Dataset Specification
○ Describing the dataset used in the practical
○ Identifying the variables and their significance

● Data Pre-processing
○ Cleaning and handling missing data
○ Feature scaling and normalization
○ Encoding categorical variables

● Data Splitting
○ Dividing the dataset into training and testing sets
○ Determining the split ratio

● Model Selection
○ Choosing linear regression as the predictive model
○ Justification for selecting linear regression

● Model Training
○ Implementing a simple linear regression model
○ Training the model using the training dataset

● Model Evaluation (Mean Squared Error and R-squared Metrics)
○ Calculating the mean squared error (MSE)
○ Computing the R-squared value to assess model fit
○ Interpreting the results

● Generalization and Application
○ Discussing the practical applications of linear regression
○ Generalization of the model for real-world scenarios
○ Sharing insights and takeaways
Intuition

Understanding the basics of linear regression

Linear regression is a supervised learning algorithm that is used to predict continuous values.
It is one of the most fundamental and widely used machine learning algorithms.

Linear regression works by finding the best-fit line through a set of data points. The best-fit
line is the line that minimizes the distance between the data points and the line itself.

Once the best-fit line has been found, it can be used to predict the value of the dependent
variable for new input values.

Let's assume there is a linear relationship between X and Y; then the value of Y can be predicted using:

ŷi = θ1 + θ2 · xi

Here,
● yi are the labels of the data (supervised learning)
● xi are the input independent training data (univariate – one input variable/parameter)
● ŷi are the predicted values.

A linear regression model can be trained with the optimization algorithm gradient descent, which iteratively modifies the model's parameters to reduce the mean squared error (MSE) on a training dataset. To update θ1 and θ2 so as to reduce the cost function and achieve the best-fit line, the model uses gradient descent: the idea is to start with random θ1 and θ2 values and then iteratively update them in the direction that lowers the cost.
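A minimal sketch of this update loop, assuming a univariate model ŷ = θ1 + θ2·x with NumPy arrays and a fixed learning rate (an illustration, not the document's prescribed implementation):

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    # Fit y ≈ theta1 + theta2 * x by minimizing MSE with gradient descent
    theta1, theta2 = 0.0, 0.0  # initial values (here fixed for simplicity)
    n = len(x)
    for _ in range(epochs):
        error = (theta1 + theta2 * x) - y
        # Gradients of MSE = (1/n) * sum(error**2) w.r.t. theta1 and theta2
        theta1 -= lr * (2 / n) * error.sum()
        theta2 -= lr * (2 / n) * (error * x).sum()
    return theta1, theta2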

The assumptions of linear regression are:

● The relationship between the dependent variable and the independent variables is
linear.
● The variance of the residuals is constant across all values of the independent
variables.
● The errors are independent of each other.

If these assumptions are not met, the results of the linear regression analysis may not be
reliable.
Conceptualizing how linear regression works

Linear regression is a statistical method used to predict the value of a dependent variable
based on the values of one or more independent variables. It is a supervised learning
algorithm, which means that it is trained on a dataset of known input-output pairs.

There are several ways to conceptualize how linear regression works. One way is to think of
it as a line that best fits a set of data points. The line is fitted to the data using a least squares
approach, which minimizes the sum of the squared distances between the data points and the
line.

Another way to conceptualize linear regression is to think of it as a way to model the relationship between the dependent and independent variables. The linear regression model assumes that the relationship between the two variables is linear, i.e., that a change in the independent variable results in a proportional change in the dependent variable.

Once the linear regression model has been fitted to the data, it can be used to predict the
value of the dependent variable for new input values. For example, if the linear regression
model is used to predict the weight of a person based on their height, the model can be used
to predict the weight of a person of any height.

Linear regression is a powerful tool that can be used to predict the value of a dependent
variable based on the values of one or more independent variables. It is a versatile algorithm
that can be used for a wide variety of tasks, including machine learning.

Here's a step-by-step conceptual overview of how linear regression operates:

1. Data Points :
- Linear regression starts with a dataset containing observations or data points. Each data point
consists of pairs of values: one or more independent variables (features) and the corresponding
dependent variable (the target).

2. Scatter Plot :
- To visualize the relationship between the independent variable(s) and the dependent variable,
you can create a scatter plot. Each data point is plotted, with the independent variable(s) on the
x-axis and the dependent variable on the y-axis.

3. Linear Equation :
- Linear regression assumes that there is a linear relationship between the independent
variable(s) and the dependent variable. This relationship is represented by a linear equation of
the form:
y = b0 + b1 * x

- `y` is the predicted value of the dependent variable.
- `x` is the value of the independent variable.
- `b0` is the intercept (the value of `y` when `x` is 0).
- `b1` is the slope (the change in `y` for a one-unit change in `x`).

4. Best-Fit Line :
- The goal of linear regression is to find the best-fit line that minimizes the sum of the squared
differences between the actual data points and the predicted values along this line. This line
represents the model's estimate of the linear relationship.

5. Training the Model :
- Linear regression finds the values of `b0` and `b1` that make the line fit the data as closely as possible. This process involves minimizing the mean squared error (MSE) or a similar cost function.

6. Predictions :
- Once the model is trained and you have the values of `b0` and `b1`, you can use the linear
equation to make predictions for new, unseen data points. Simply plug in the values of the
independent variable(s) to estimate the dependent variable.

7. Interpretation :
- The coefficients `b0` and `b1` have interpretive significance. `b1` indicates how much the
dependent variable changes for a one-unit change in the independent variable. A positive `b1`
suggests a positive relationship, and a negative `b1` suggests a negative relationship.

8. Model Evaluation :
- Linear regression models are evaluated using various metrics such as R-squared, MSE, or
RMSE. These metrics assess how well the model fits the data and makes accurate predictions.

9. Limitations and Considerations :
- It's important to be aware of the assumptions of linear regression, including linearity, independence of errors, homoscedasticity, and normality of residuals. Violations of these assumptions can affect the model's accuracy.
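As a small end-to-end illustration of steps 3–6 (a sketch with made-up data, using NumPy's least-squares helper rather than any method prescribed above):

import numpy as np

# Made-up example data: x (independent variable) and y (dependent variable)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Steps 4-5: fit the best-fit line y = b0 + b1 * x by least squares
b1, b0 = np.polyfit(x, y, 1)  # polyfit returns the slope first

# Step 6: predict the dependent variable for a new input value
x_new = 6.0
print(b0 + b1 * x_new)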
Dataset Specification

Describing the dataset used in the practical

About Dataset:

● R&D Spend: The amount spent annually by a startup in Research and Development
for their product/service.
● Administration: Amount spent annually in managing workforce, including salaries,
machine costs, etc.
● Marketing Spend: Amount spent annually for promoting the product/service both
online and offline.
● State: The name of State where the organization is located or operating from.
● Profit: The net profit amount of the startup company annually.

This dataset can be used for a variety of purposes, such as:

● Predicting the profit to be earned in the future.
● Identifying factors that affect the profit of a startup.
● Segmenting companies based on states.
● Developing marketing campaigns for certain startups.

Identifying the variables and their significance

What are variables in a dataset?

In a dataset, a variable refers to a specific characteristic, attribute, or field that holds information about individual data points. Variables are the columns or features within the dataset that store data values, and they play a fundamental role in data analysis and statistical modeling.
1. R&D Spend (Amount):
- Significance : It reflects the investment in innovative activities that can lead to the
development of new products, improved processes, or other competitive advantages. Generally,
higher R&D spending might be associated with higher future profits if those investments
translate into successful products or services.

2. Administration:
- Significance : It can include costs such as salaries for administrative staff, office rent,
utilities, and other overhead expenses. The significance of this column depends on how
efficiently these administrative expenses are managed. High administrative expenses relative to
revenue could negatively impact profitability.

3. Marketing:
- Significance : It's important because marketing is essential for promoting products or
services, expanding the customer base, and increasing sales. Effective marketing can lead to
higher revenue and, ultimately, higher profit. The significance of this column depends on the
effectiveness of the marketing efforts.

4. State:
- Significance : The significance of this categorical variable depends on various factors, such
as state-specific economic conditions, market size, regulatory environment, and consumer
behavior. Different states may offer different business opportunities and challenges, and the
choice of state can impact profitability.

5. Profit:
- Significance : This is the target variable you want to predict. It represents the company's
financial performance, and it's the primary measure of success. The goal is to predict and
maximize profit, so understanding the significance of the other columns in relation to "Profit" is
essential for making informed business decisions.
Data Pre-processing

Data preprocessing is a crucial phase in our startup profit prediction project using linear
regression. This phase involves several key steps to ensure that our dataset is prepared for
effective model training and evaluation. Additionally, data splitting helps assess your model's
performance accurately.

Cleaning and Handling Missing Data:

Missing data is a common issue in datasets. It can lead to inaccurate results and cause problems for machine learning models. Start by addressing missing data in the dataset, particularly in essential columns such as 'Profit', 'Marketing Spend', and 'Administration'. Utilize techniques like mean imputation for numerical features and mode imputation for categorical attributes. Clean data ensures that the linear regression model receives high-quality inputs.
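A hedged sketch of these imputation steps, assuming the column names from the dataset description above:

import pandas as pd

df = pd.read_csv('50_Startups.csv')

# Mean imputation for the numerical columns
for col in ['R&D Spend', 'Administration', 'Marketing Spend', 'Profit']:
    df[col] = df[col].fillna(df[col].mean())

# Mode imputation for the categorical 'State' column
df['State'] = df['State'].fillna(df['State'].mode()[0])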

Feature Scaling and Normalization:

Standardization (Z-score normalization): Standardization scales data so that it has a mean of 0 and a standard deviation of 1. It is appropriate when the data is approximately normally distributed and doesn't have strong outliers.

Min-Max scaling transforms data into a specific range, often [0, 1], by using the minimum and maximum values of the feature. It's suitable when you want to constrain data to a specific range.
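A brief sketch of both options, assuming the numerical column names from the dataset description above (note that the final script in this document does not scale features, since plain linear regression does not strictly require it):

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

df = pd.read_csv('50_Startups.csv')
num_cols = ['R&D Spend', 'Administration', 'Marketing Spend']

# Standardization: each column gets mean 0 and standard deviation 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Alternatively, Min-Max scaling to the [0, 1] range:
# df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])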

Encoding Categorical Variables:

One-Hot Encoding is a technique used in data preprocessing, particularly in the context of machine learning and data analysis, to convert categorical variables into a numerical format that can be used in statistical and machine learning algorithms. It is particularly useful when dealing with categorical data that cannot be directly used in models that expect numerical input.
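As a small illustration (a sketch using pandas, whereas the full script at the end of this document uses scikit-learn's OneHotEncoder inside a ColumnTransformer):

import pandas as pd

df = pd.read_csv('50_Startups.csv')

# One-hot encode the categorical 'State' column; drop_first avoids
# perfectly collinear dummy columns (the "dummy variable trap")
df_encoded = pd.get_dummies(df, columns=['State'], drop_first=True)
print(df_encoded.head())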
Data Splitting

Dividing the Dataset into Training and Testing Sets:

It is essential to split the dataset into two subsets: a training set and a testing set. In the context of our startup profit prediction project, this division plays a vital role. The training set is where our machine learning model learns patterns and relationships within the data, such as the impact of features like 'R&D Spend', 'Administration', and 'Marketing Spend'. The testing set, on the other hand, serves as a means to evaluate how well our model performs in predicting the profit earned by a startup when presented with new, unseen data. This division ensures that our model not only learns from the data but also generalizes effectively, making reliable predictions for profit-seeking startups.

Determining the Split Ratio:

The split ratio is a critical decision in our startup profit prediction project. While common ratios like 70/30 or 80/20 are often used, the choice depends on the size of our dataset and the specific goals of our project. In our case, a larger training set allows our model to learn more comprehensively from historical startup data, enabling it to capture complex profit determinants. However, we must balance this with the need for a sufficiently substantial testing set, which is vital for evaluating our model's performance accurately and ensuring that it can handle diverse spending patterns across categories such as salaries, marketing, and research. The choice of the split ratio is a strategic decision, and it's essential to find the right balance between model learning and evaluation.

Splitting the dataset into train and test sets with an 80:20 ratio produces:

● X_train: the features for the training set.
● y_train: the corresponding target values for the training set.
● X_test: the features for the testing set.
● y_test: the corresponding target values for the testing set.

test_size=0.2 reserves 20% of the dataset for the test set. random_state controls the randomness of the split: setting it to a fixed value (e.g., 0) ensures you get the same split every time you run the code; if you don't set it, the split will differ on each run.

You can use X_train and y_train to train your machine learning model, and then use X_test to
make predictions, which you can compare to y_test to evaluate the model's performance. This
splitting ensures that you have a separate dataset for testing the model's performance, helping to
assess how well it generalizes to new, unseen data.
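A minimal sketch of this split, mirroring the full script at the end of this document:

from sklearn.model_selection import train_test_split

# 80:20 split; random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)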

Model Selection

Why Linear Regression?

● Linearity Assumption: Linear regression assumes a linear relationship between the independent variables (features) and the dependent variable (Profit). In business and financial contexts, it is often reasonable to assume that there is a linear relationship between certain financial indicators (such as R&D Spend, Administration, and Marketing Spend) and profit. Linear regression can capture this relationship well.
● Interpretability: Linear regression provides straightforward interpretability. You can
easily interpret the coefficients of the regression equation, which represent the change in
profit associated with a one-unit change in each independent variable while holding all
other variables constant. This interpretability is valuable for making business decisions
and understanding the impact of different factors on profit.
● Simplicity: Linear regression is a simple and well-understood model. It doesn't require
complex assumptions or a large amount of data. If the relationship between the features
and profit is roughly linear, a simple linear regression model may provide adequate
predictions.
● Quick to Implement: Linear regression is easy to implement and computationally
efficient. This makes it a good choice for quick initial analysis and as a baseline model.
● Assumption Testing: You can perform various diagnostic tests to assess whether the
assumptions of linear regression are met. If the assumptions are reasonably satisfied,
linear regression can provide reliable predictions.
● Feature Importance: Linear regression provides coefficients for each feature, indicating their importance in predicting profit. This information can help you identify which variables have the most significant impact on profit.

Here, we can use a linear regression model with Y (the dependent variable) as Profit, which needs to be predicted, and the independent variables:
1. X1 = R&D Spend
2. X2 = Administration
3. X3 = Marketing Spend

We can then fit Y = b0 + aX1 + bX2 + cX3 (multiple linear regression, where b0 is the intercept).

Model Training

We will use the LinearRegression class from Python's scikit-learn library. The following code trains the model, stores the predicted values in the y_pred variable, and prints the predicted and test-set values so they can be compared.
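A minimal sketch of this step, assuming X_train, X_test, y_train, and y_test from the split above (the complete script appears at the end of this document):

from sklearn.linear_model import LinearRegression

# Fit the multiple linear regression model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Store the test-set predictions and print a few of them next to the
# actual values for comparison
y_pred = regressor.predict(X_test)
print(y_pred[:5])
print(y_test[:5])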
Model Evaluation

Computing the R-squared metric to assess the accuracy of the model.

Unexplained Variation, or SSR, is the Sum of Squared Residuals (also known as the Sum of Squared Errors, SSE): it measures the total squared differences between the observed values (the actual target values) and the predicted values from the model. A lower SSR indicates a better fit. Total Variation, or SST, is the Total Sum of Squares: it measures the total squared differences between the observed values and the mean of the observed values. It represents the total variability in the dependent variable and depends only on the data, not on the model. R-squared combines the two:

R² = 1 − SSR / SST

● If R² = 1, the model perfectly fits the data, explaining all the variability in the dependent variable.
● If R² = 0, the model doesn't explain any of the variability, and it's no better than a horizontal line (the mean of the dependent variable).

We will use the r2_score function from scikit-learn's metrics module.
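As a small check of the formula above (assuming NumPy arrays y_test and y_pred from the model), R-squared can be computed by hand and compared with scikit-learn's result:

import numpy as np
from sklearn.metrics import r2_score

ssr = np.sum((y_test - y_pred) ** 2)         # unexplained variation (SSR)
sst = np.sum((y_test - y_test.mean()) ** 2)  # total variation (SST)
print(1 - ssr / sst)                         # manual R-squared
print(r2_score(y_test, y_pred))              # matches scikit-learn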

Computing the Mean Squared Error metric to assess the accuracy of the model.

The Mean Squared Error (MSE) is a commonly used metric for evaluating the performance of regression models. It measures the average squared difference between the predicted values and the actual (observed) values of the dependent variable (target):

MSE = (1/n) · Σ (yi − ŷi)²

A lower MSE indicates a better fit of the model to the data.

● MSE is a quadratic function of the model's parameters, so its cost surface is convex with a single global minimum and no local minima.
● Gradient descent on MSE therefore converges to the minimum efficiently.
● MSE penalizes the model for large errors by squaring them.
● Because squaring puts extra weight on large errors, MSE is particularly sensitive to outliers.

We will use the mean_squared_error function from scikit-learn's metrics module.


Following code implements both metrics:
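A minimal sketch of both metrics, assuming y_test and y_pred from the trained model above (the complete script below wraps the same calls in a helper function):

from sklearn.metrics import r2_score, mean_squared_error

print("R-squared on the test set:", r2_score(y_test, y_pred))
print("MSE on the test set:", mean_squared_error(y_test, y_pred))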
Generalization

Predicting a single value:


Assuming:
● R&D Spend = 56000
● Administration = 67000
● Marketing Spend = 68000

The predicted value is obtained by passing these inputs to the trained model, as sketched below.
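A hedged sketch of this single prediction, assuming the fitted regressor and ColumnTransformer (ct) from the full script below. Because the model is trained on one-hot-encoded states, a state value must also be supplied; 'New York' here is an illustrative assumption, not part of the original example.

import numpy as np

# Raw feature order: R&D Spend, Administration, Marketing Spend, State
# (the state value is an illustrative assumption)
sample = np.array([[56000, 67000, 68000, 'New York']], dtype=object)

sample_encoded = np.array(ct.transform(sample))  # apply the fitted encoder
print(regressor.predict(sample_encoded))         # predicted profit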

Plotting actual vs. predicted profit values, along with the ideal-fit line:

From the above accuracy-metric calculations, we can infer that:

● Accuracy of the linear regression model according to the R-squared metric on the test set = 93%
● Accuracy of the linear regression model according to the Mean Squared Error metric on the test set = 83.5%

Python code for the above regression task:

# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Importing the dataset
df = pd.read_csv('50_Startups.csv')

# Assigning the independent variables (all columns but the last) and the
# dependent variable (the last column, Profit)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Encoding the categorical 'State' column (column index 3)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Splitting the dataset into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Training the multiple linear regression model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results and printing them next to the actual values
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1),
                      y_test.reshape(len(y_test), 1)), 1))

# Computing the R-squared and MSE metrics
model = LinearRegression()

def model_prediction(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    x_train_pred = model.predict(x_train)
    x_test_pred = model.predict(x_test)

    # Calculate the R-squared value for training and test predictions
    a = r2_score(y_train, x_train_pred)
    b = r2_score(y_test, x_test_pred)
    print(f"R2 Score of {type(model).__name__} model on Training Data: {a:.2f}")
    print(f"R2 Score of {type(model).__name__} model on Testing Data: {b:.2f}")

    # Calculate the Mean Squared Error for training and test predictions
    mse_train = mean_squared_error(y_train, x_train_pred)
    mse_test = mean_squared_error(y_test, x_test_pred)
    print(f"Mean Squared Error on the training set: {mse_train:.2f}")
    print(f"Mean Squared Error on the test set: {mse_test:.2f}")

print(f"Evaluating {type(model).__name__} model : ")
model_prediction(model, X_train, y_train, X_test, y_test)
print("\n")

# Plotting actual vs. predicted profit values to assess the fit
plt.scatter(y_test, y_pred, color='blue', label='Actual vs. Predicted')
plt.xlabel('Actual Values (y_test)')
plt.ylabel('Predicted Values (y_pred)')
plt.title('Actual vs. Predicted Values')
plt.grid()

# Reference line: points on this line would be perfectly predicted
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         color='red', linestyle='--', linewidth=2, label='Perfect Prediction')
plt.legend()

# Show the plot
plt.show()

Applications

Linear regression is a very versatile algorithm and can be used for a wide variety of tasks,
including:
● Predicting the price of a house based on its square footage and number of bedrooms.
● Predicting the risk of a customer churning based on their past purchase history.
● Predicting the demand for a product based on historical sales data.
● Predicting the performance of a student on a test based on their past test scores.

Linear regression is a relatively simple algorithm to understand and implement. However, it is important to understand the assumptions of linear regression and to use it appropriately.

References used by students:

https://www.analyticsvidhya.com/blog/2021/10/evaluation-metric-for-regression-models/

https://www.investopedia.com/terms/r/r-squared.asp
