Group_1_Practical
Objectives:
● Intuition
○ Understanding the basics of linear regression
○ Conceptualizing how linear regression works
● Dataset Specification
○ Describing the dataset used in the practical
○ Identifying the variables and their significance
● Data Pre-processing
○ Cleaning and handling missing data
○ Feature scaling and normalization
○ Encoding categorical variables
● Data Splitting
○ Dividing the dataset into training and testing sets
○ Determining the split ratio
● Model Selection
○ Choosing linear regression as the predictive model
○ Justification for selecting linear regression
● Model Training
○ Implementing a simple linear regression model
○ Training the model using the training dataset
Linear regression is a supervised learning algorithm that is used to predict continuous values.
It is one of the most fundamental and widely used machine learning algorithms.
Linear regression works by finding the best-fit line through a set of data points. The best-fit line is the one that minimizes the sum of squared vertical distances between the data points and the line.
Once the best-fit line has been found, it can be used to predict the value of the dependent
variable for new input values.
Let's assume there is a linear relationship between X and Y; then the value of Y can be predicted using:

ŷ = θ1 + θ2 * x

Here,
● y values are the labels of the data (supervised learning)
● x values are the input independent training data (univariate: one input variable/parameter)
● ŷ values are the predicted values
A linear regression model can be trained with the optimization algorithm gradient descent, which iteratively modifies the model's parameters to reduce the mean squared error (MSE) of the model on a training dataset. To update the θ1 and θ2 values so as to reduce the cost function (minimizing the MSE) and achieve the best-fit line, the model uses gradient descent: start with random θ1 and θ2 values and iteratively update them in the direction that decreases the cost, until it converges to a minimum.
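As an illustration, here is a minimal NumPy sketch of this procedure; the toy data, learning rate, and iteration count are illustrative assumptions, not values from this practical:

```python
import numpy as np

# Toy univariate data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

theta1, theta2 = 0.0, 0.0  # intercept and slope, starting values
lr = 0.01                  # learning rate

for _ in range(5000):
    y_hat = theta1 + theta2 * x  # current predictions
    error = y_hat - y
    # Gradients of the MSE cost with respect to theta1 and theta2
    grad_theta1 = 2 * error.mean()
    grad_theta2 = 2 * (error * x).mean()
    theta1 -= lr * grad_theta1
    theta2 -= lr * grad_theta2

print(theta1, theta2)  # approaches the least-squares intercept and slope
```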
For its results to be reliable, linear regression relies on several assumptions:
● The relationship between the dependent variable and the independent variables is linear.
● The variance of the residuals is constant across all values of the independent
variables.
● The errors are independent of each other.
If these assumptions are not met, the results of the linear regression analysis may not be
reliable.
Conceptualizing how linear regression works
Linear regression is a statistical method used to predict the value of a dependent variable
based on the values of one or more independent variables. It is a supervised learning
algorithm, which means that it is trained on a dataset of known input-output pairs.
There are several ways to conceptualize how linear regression works. One way is to think of
it as a line that best fits a set of data points. The line is fitted to the data using a least squares
approach, which minimizes the sum of the squared distances between the data points and the
line.
Once the linear regression model has been fitted to the data, it can be used to predict the
value of the dependent variable for new input values. For example, if the linear regression
model is used to predict the weight of a person based on their height, the model can be used
to predict the weight of a person of any height.
Linear regression is a powerful tool that can be used to predict the value of a dependent
variable based on the values of one or more independent variables. It is a versatile algorithm
that can be used for a wide variety of tasks, including machine learning.
1. Data Points :
- Linear regression starts with a dataset containing observations or data points. Each data point
consists of pairs of values: one or more independent variables (features) and the corresponding
dependent variable (the target).
2. Scatter Plot :
- To visualize the relationship between the independent variable(s) and the dependent variable,
you can create a scatter plot. Each data point is plotted, with the independent variable(s) on the
x-axis and the dependent variable on the y-axis.
3. Linear Equation :
- Linear regression assumes that there is a linear relationship between the independent
variable(s) and the dependent variable. This relationship is represented by a linear equation of
the form:
y = b0 + b1 * x
where b0 is the intercept and b1 is the slope of the line.
4. Best-Fit Line :
- The goal of linear regression is to find the best-fit line that minimizes the sum of the squared
differences between the actual data points and the predicted values along this line. This line
represents the model's estimate of the linear relationship.
5. Predictions :
- Once the model is trained and you have the values of `b0` and `b1`, you can use the linear equation to make predictions for new, unseen data points. Simply plug in the values of the independent variable(s) to estimate the dependent variable (see the sketch after this list).
6. Interpretation :
- The coefficients `b0` and `b1` have interpretive significance. `b1` indicates how much the
dependent variable changes for a one-unit change in the independent variable. A positive `b1`
suggests a positive relationship, and a negative `b1` suggests a negative relationship.
7. Model Evaluation :
- Linear regression models are evaluated using various metrics such as R-squared, MSE, or
RMSE. These metrics assess how well the model fits the data and makes accurate predictions.
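A short sketch, using hypothetical height/weight values and NumPy's least-squares fit, ties these steps together:

```python
import numpy as np

# 1. Data points: hypothetical heights (cm) and weights (kg)
x = np.array([150, 160, 165, 170, 180, 185], dtype=float)
y = np.array([50, 56, 61, 66, 74, 79], dtype=float)

# 3-4. Fit the best-fit line y = b0 + b1 * x by least squares
b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns the slope first

# 5. Prediction for a new input value
predicted_weight = b0 + b1 * 175.0

# 6. Interpretation: b1 is the change in weight per extra cm of height
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, prediction = {predicted_weight:.2f}")
```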
About the Dataset:
● R&D Spend: The amount spent annually by a startup in Research and Development
for their product/service.
● Administration: Amount spent annually in managing workforce, including salaries,
machine costs, etc.
● Marketing Spend: Amount spent annually for promoting the product/service both
online and offline.
● State: The name of the state where the organization is located or operating from.
● Profit: The net profit amount of the startup company annually.
The significance of each column is summarized below:
1. R&D Spend:
- Significance : Spending on research and development often drives product innovation and differentiation, and for startups it is frequently a strong determinant of revenue and, ultimately, profit.
2. Administration:
- Significance : It can include costs such as salaries for administrative staff, office rent,
utilities, and other overhead expenses. The significance of this column depends on how
efficiently these administrative expenses are managed. High administrative expenses relative to
revenue could negatively impact profitability.
3. Marketing Spend:
- Significance : It's important because marketing is essential for promoting products or
services, expanding the customer base, and increasing sales. Effective marketing can lead to
higher revenue and, ultimately, higher profit. The significance of this column depends on the
effectiveness of the marketing efforts.
4. State:
- Significance : The significance of this categorical variable depends on various factors, such
as state-specific economic conditions, market size, regulatory environment, and consumer
behavior. Different states may offer different business opportunities and challenges, and the
choice of state can impact profitability.
5. Profit:
- Significance : This is the target variable you want to predict. It represents the company's
financial performance, and it's the primary measure of success. The goal is to predict and
maximize profit, so understanding the significance of the other columns in relation to "Profit" is
essential for making informed business decisions.
Data Pre-processing
Data preprocessing is a crucial phase in our startup profit prediction project using linear
regression. This phase involves several key steps to ensure that our dataset is prepared for
effective model training and evaluation. Additionally, data splitting helps assess the model's performance accurately.
Missing data is a common issue in datasets. It can lead to inaccurate results and cause problems for machine learning models. Start by addressing missing data in the dataset, particularly in essential columns such as 'Profit', 'Marketing Spend', and 'Administration'. Utilize techniques like mean imputation for numerical features and mode imputation for categorical attributes. Since the 'State' column is categorical, it must also be encoded into numerical form (for example, with one-hot encoding) before it can be used by the model. Clean data ensures that the linear regression model receives high-quality inputs.
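A sketch of these preprocessing steps with pandas; the file name `50_Startups.csv` is an assumption, and the column names follow the dataset description above:

```python
import pandas as pd

# Load the dataset (the file name is an assumption)
df = pd.read_csv("50_Startups.csv")

# Mean imputation for the numerical columns
for col in ["R&D Spend", "Administration", "Marketing Spend", "Profit"]:
    df[col] = df[col].fillna(df[col].mean())

# Mode imputation for the categorical column
df["State"] = df["State"].fillna(df["State"].mode()[0])

# One-hot encode 'State' so the model receives numerical inputs
df = pd.get_dummies(df, columns=["State"], drop_first=True)

# Separate features (X) from the target (y)
X = df.drop(columns=["Profit"])
y = df["Profit"]
```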
It is essential to split the dataset into two subsets: a training set and a testing set. In the context of
our startup profit prediction project, this division plays a vital role. The training set is where our
machine learning model learns patterns and relationships within the data, such as the impact of features like 'R&D Spend', 'Administration', and 'Marketing Spend' on profit.
The testing set, on the other hand, serves as a means to evaluate how well our model performs in
predicting the profit earned by the startup when presented with new, unseen data. This division
ensures that our model not only learns from the data but also generalizes effectively, making
reliable predictions for new, profit-seeking startups.
The split ratio is a critical decision in our startup profit prediction project. While common ratios
like 70/30 or 80/20 are often used, the choice depends on the size of our dataset and the specific
goals of our project. In our case, a larger training set allows our model to learn more
comprehensively from historical startup data, enabling it to capture complex profit determinants. However, we must balance this with the need for a sufficiently large testing set. This is vital for evaluating our model's performance accurately and ensuring that it can handle diverse spending patterns across categories such as salaries, marketing, and research. The choice of the split ratio is a strategic decision, and it's essential to find the right
balance between model learning and evaluation.
Splitting the dataset into train and test sets with an 80:20 ratio:
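A minimal sketch using scikit-learn's train_test_split, assuming X and y were prepared as in the preprocessing sketch above:

```python
from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```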
test_size = 0.2 specifies that 20% of the dataset should be included in the test set.
random_state is used to control the randomness of the data split. Setting it to a fixed value (e.g.,
0) ensures that you get the same random split every time you run the code. If you don't set it, the
split will be different each time you run the code.
You can use X_train and y_train to train your machine learning model, and then use X_test to
make predictions, which you can compare to y_test to evaluate the model's performance. This
splitting ensures that you have a separate dataset for testing the model's performance, helping to
assess how well it generalizes to new, unseen data.
Model Selection
We choose linear regression as the predictive model because the target variable, 'Profit', is continuous, and the spending features ('R&D Spend', 'Administration', 'Marketing Spend') can reasonably be assumed to have a roughly linear relationship with it. Linear regression is also simple, fast to train, and easy to interpret: each coefficient directly shows how much predicted profit changes per unit of spend in that category.
Model Training
We will use the LinearRegression() class from Python's scikit-learn library. The following code trains the model:
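A minimal sketch, assuming X_train and y_train from the split above:

```python
from sklearn.linear_model import LinearRegression

# Fit the linear regression model on the training data
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_)  # b0
print(regressor.coef_)       # one coefficient per feature
```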
Now, store the predicted values in the y_pred variable, and print the predicted and test-set values to compare them.
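A sketch of this step, continuing from the trained regressor above:

```python
import pandas as pd

# Store the predictions for the test set in y_pred
y_pred = regressor.predict(X_test)

# Print predicted and actual profits side by side for comparison
print(pd.DataFrame({"Predicted": y_pred, "Actual": y_test.values}))
```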
Model Evaluation
Computing the R-squared metric to measure how well the model fits the data.
Unexplained Variation, or SSR, is the Sum of Squares of Residuals (also known as the Sum of Squared Errors, SSE): it measures the total squared differences between the observed values (the actual target values) and the predicted values from the model. A lower SSR indicates a better fit.
Total Variation, or SST, is the Total Sum of Squares: it measures the total squared differences between the observed values and the mean of the observed values. It represents the total variability in the dependent variable and depends only on the data, not on the model.
R-squared compares the two:
R2 = 1 − (SSR / SST)
● If R2 = 1, it means that the model perfectly fits the data, explaining all the variability in
the dependent variable.
● If R2 = 0, it means that the model doesn't explain any of the variability, and it's no better
than a horizontal line (the mean of the dependent variable).
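The following sketch computes R-squared both from its definition and with scikit-learn's r2_score, continuing with y_test and y_pred from above; the two results should match:

```python
import numpy as np
from sklearn.metrics import r2_score

ssr = np.sum((y_test - y_pred) ** 2)         # unexplained variation (SSR / SSE)
sst = np.sum((y_test - y_test.mean()) ** 2)  # total variation (SST)
r2 = 1 - ssr / sst

print(r2)                        # R-squared from the definition
print(r2_score(y_test, y_pred))  # R-squared from scikit-learn
```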
Computing the Mean Squared Error (MSE) metric to evaluate the model's prediction error.
The Mean Squared Error (MSE) is a commonly used metric for evaluating the performance of
regression models. It measures the average squared difference between the predicted values and
the actual (observed) values of the dependent variable (target). A lower MSE indicates a better fit
of the model to the data.
● The MSE is a quadratic function of the model parameters, so its cost surface is convex with a single global minimum and no local minima; gradient descent can therefore converge to it reliably.
● MSE penalizes the model for having large errors by squaring them.
● Because squaring puts more weight on large errors, MSE makes the model sensitive to outliers: a few points with large errors can dominate the metric.
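A sketch computing MSE (and its square root, RMSE) with scikit-learn, continuing from the snippets above:

```python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # RMSE is in the same units as 'Profit'
print(mse, rmse)
```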
Plotting the actual vs. predicted profit values, together with a reference line for a perfect linear fit:
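A sketch of such a plot with matplotlib, assuming y_test and y_pred from the earlier snippets; the reference line marks where predictions would equal actual values:

```python
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, label="Test-set points")
# A perfect model would place every point on the line y = x
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, color="red", label="Perfect fit (y = x)")
plt.xlabel("Actual profit")
plt.ylabel("Predicted profit")
plt.title("Actual vs. predicted profit")
plt.legend()
plt.show()
```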
Applications
Linear regression is a very versatile algorithm and can be used for a wide variety of tasks,
including:
● Predicting the price of a house based on its square footage and number of bedrooms.
● Predicting the risk of a customer churning based on their past purchase history.
● Predicting the demand for a product based on historical sales data.
● Predicting the performance of a student on a test based on their past test scores.