Activity 4 CGPA Vs Placement Package Program

Uploaded by

Himanshu Sharma
11/18/24, 12:53 PM ML PROJECT 2: CGPA VS Package.ipynb - Colab

Data processing is an important part of any data-driven task, because it is the step that turns raw data into meaningful insights. Python is a widely used programming language, and it offers many libraries and tools for data processing.

In this article, we cover data processing in Python: loading data, printing rows and columns, summarizing a data frame, finding missing values, sorting and merging data frames, applying functions, and visualizing data frames.

Data Preprocessing involves a series of steps such as:

1. Data Collection.

2. Data Cleaning.

3. Data Transformation.

4. Feature Engineering: Scaling, Normalization and Standardization.

5. Feature Selection.

6. Handling Imbalanced Data.

7. Encoding Categorical Features.

8. Data Splitting.
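
As a rough illustration of step 4, the sketch below standardizes and normalizes a single numeric feature with NumPy. The CGPA values are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical CGPA values, invented for illustration only
cgpa = np.array([6.1, 7.4, 8.0, 5.9, 9.2])

# Standardization: subtract the mean and divide by the standard deviation
cgpa_standardized = (cgpa - cgpa.mean()) / cgpa.std()

# Min-max normalization: rescale the values to the range [0, 1]
cgpa_normalized = (cgpa - cgpa.min()) / (cgpa.max() - cgpa.min())

print(cgpa_standardized)
print(cgpa_normalized)
```

After standardization the feature has mean 0 and standard deviation 1; after min-max normalization its smallest value is 0 and its largest is 1.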

IMPORT THE DEPENDENCIES/NECESSARY LIBRARIES


CGPA v/s Package (in LPA) Prediction using Simple Linear Regression

Dataset Link: https://www.kaggle.com/datasets/parvmodi/cgpa-vs-package-in-lpa

# import important libraries


import numpy as np # for linear algebra
import pandas as pd # for data frames processing
import matplotlib.pyplot as plt # for plotting basic graphs
import seaborn as sns # for plotting advanced graphics & data visualization
from sklearn.linear_model import LinearRegression # for linear regression model
from sklearn.model_selection import train_test_split # for splitting the dataset for training and testing
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # for evaluating the model performance

LOAD/IMPORT DATASET FROM CSV FILE TO PANDAS DATA FRAMES

# import the data


placement_data = pd.read_csv('/content/Placement.csv')
print(placement_data) # print command is used to show the output

---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-2-9d60e68a70f6> in <cell line: 2>()
1 # import the data
----> 2 placement_data = pd.read_csv('/content/Placement.csv')
3 print(placement_data) # print command is used to show the output

4 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map,
is_text, errors, storage_options)
871 if ioargs.encoding and "b" not in ioargs.mode:
872 # Encoding
--> 873 handle = open(
874 handle,
875 ioargs.mode,

FileNotFoundError: [Errno 2] No such file or directory: '/content/Placement.csv'
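
The traceback above means that `/content/Placement.csv` did not exist when the cell ran, so every later cell that uses `placement_data` would fail as well. In Colab the file has to be uploaded to the session before it can be read. A minimal sketch of a guarded load follows; the helper name `load_placement_data` is hypothetical and not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_placement_data(csv_path):
    """Read the CSV if it exists; otherwise fail with a clearer message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Upload the dataset to this location first."
        )
    return pd.read_csv(path)
```

With the file in place, `placement_data = load_placement_data('/content/Placement.csv')` behaves like the original `pd.read_csv` call.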


# SHOW/DISPLAY FIRST 5 ROWS OF DATASET


placement_data.head()

# IF WE WANT TO SEE FIRST 10 ROWS


placement_data.head(10)

# SHOW THE LAST 5 ROWS OF DATASET


placement_data.tail()

https://colab.research.google.com/drive/17Px7K5hY0IQ4R396TXofcZRG1ZD-qxUT#scrollTo=3NZ4Y6FmQiR0&printMode=true
# SHOW THE LAST 10 ROWS OF DATASET
placement_data.tail(10)

# FIND THE SHAPE/DIMENSION OF DATASET


placement_data.shape

# GET MORE INFORMATION ABOUT DATASET (SUCH AS NO. OF ROWS, COLUMNS, COLUMN NAMES, DATA TYPES, NON-NULL COUNTS)
placement_data.info() # summary only; info() does not impute or change any data

# FIND MISSING VALUE IN DATASET


placement_data.isnull().sum() # isnull command tells missing entries in a particular column
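
If the count above reports missing entries, they have to be handled before modelling. The sketch below shows two common options on a small made-up data frame. The values are hypothetical, and which option suits the placement data is a judgment call that the notebook does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing CGPA and one missing package
toy = pd.DataFrame({
    'cgpa': [6.5, np.nan, 8.1, 7.2],
    'package': [2.8, 3.1, np.nan, 3.0],
})

print(toy.isnull().sum())  # number of missing entries per column

# Option 1: drop every row that has at least one missing value
toy_dropped = toy.dropna()

# Option 2: fill each missing value with the mean of its column
toy_filled = toy.fillna(toy.mean())

print(toy_dropped)
print(toy_filled)
```

Dropping rows is simplest but loses data; filling keeps every row at the cost of inventing plausible values.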

# GET STATISTICAL INFORMATION ABOUT DATASET


placement_data.describe()

# VISUALISE DATASET IN SCATTER PLOT


sns.regplot(x = placement_data['cgpa'], y = placement_data['package']) # regplot is used to plot regression plot
plt.show()

Step 2: Performing Simple Linear Regression

Steps of Model Building

1. Create X and y

2. Create train and test sets

3. Train the model on training set (i.e. learn the coefficients)

4. Evaluate the model on training set and test set

# SEPARATE INDEPENDENT VARIABLE (X) AND DEPENDENT VARIABLE (Y)


# 'cgpa' is an independent variable and 'package' is a dependent variable
X = placement_data['cgpa']
Y = placement_data['package']

# print independent and dependent variable


print(X)
print(Y)

# train test split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.70, random_state = 2)
print('The shape of X_train is', X_train.shape)
print('The shape of X_test is', X_test.shape)
print('The shape of Y_train is', Y_train.shape)
print('The shape of Y_test is', Y_test.shape)
X_test.head()
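
To see what `train_size` and `random_state` do, the split above can be tried on a tiny made-up array. This is only a sketch: the ten samples and the `_demo` names are assumptions for illustration, not the placement data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples standing in for the real columns
X_demo = np.arange(10)
Y_demo = X_demo * 2

# train_size = 0.70 keeps 70% of the rows for training and the rest for testing
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, train_size=0.70, random_state=2)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split
X_tr_again, _, _, _ = train_test_split(X_demo, Y_demo, train_size=0.70, random_state=2)
print(np.array_equal(X_tr, X_tr_again))
```

The rows of X and Y are shuffled together, so each target value stays paired with its own feature value.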

# X_train and X_test are pandas Series (1D), but scikit-learn expects a 2D array of shape (n_samples, n_features)
# reshape X_train and X_test to (n, 1)
X_train_lm = X_train.values.reshape(-1, 1)
print('The shape of X_train_lm is', X_train_lm.shape)
X_test_lm = X_test.values.reshape(-1, 1)
print('The shape of X_test_lm is', X_test_lm.shape)
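
The effect of `reshape(-1, 1)` is easiest to see on a small array. The three values below are hypothetical.

```python
import numpy as np

values = np.array([6.9, 7.5, 8.2])  # hypothetical CGPA values

print(values.shape)             # (3,)   -> 1D: three numbers, no feature axis
column = values.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
print(column.shape)             # (3, 1) -> 3 samples, 1 feature
```

The data is unchanged; only the shape differs, which is what the model's `fit` and `predict` methods check.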

# create an object of linear regression


lm = LinearRegression()

# fit the model


lm.fit(X_train_lm, Y_train)

# see the parameters


print('The coefficient is', round(lm.coef_[0], 2))
print('The intercept is', round(lm.intercept_, 2))
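
For simple linear regression with one feature, the coefficient and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) minus slope times mean(x). The sketch below computes both with NumPy on made-up numbers. It illustrates what the two parameters mean and is not the notebook's own code.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 0.5 * x - 1.0
x = np.array([6.0, 7.0, 8.0, 9.0])
y = np.array([2.0, 2.5, 3.0, 3.5])

x_mean, y_mean = x.mean(), y.mean()

# Slope: how much y changes for a one-unit change in x
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the value of y on the fitted line when x is 0
intercept = y_mean - slope * x_mean

print(round(slope, 2), round(intercept, 2))
```

In the notebook's terms, the slope corresponds to `lm.coef_[0]` (change in package per one point of CGPA) and the intercept to `lm.intercept_`.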

# make predictions on the training dataset


x_train_data_pred = lm.predict(X_train_lm)

# plot the model


plt.scatter(X_train, Y_train)
plt.plot(X_train, x_train_data_pred, color = 'r')
plt.show()


Step 4: Predictions and Evaluation on Test Set

# make predictions on the test set


x_test_data_pred = lm.predict(X_test_lm)
#print(x_test_data_pred)

**R squared Method**

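R squared (the coefficient of determination) is a regression metric that measures how well the model fits the data. It is the proportion of the variance in the response/target variable that the independent variables explain.

Thus, R squared describes how well the target variable is explained by the independent variables taken together as a single unit.

For a least-squares fit with an intercept, evaluated on its own training data, R squared lies between 0 and 1. On unseen data it can be negative, which means the model predicts worse than always predicting the mean. It is given by the formula below:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Here,

$SS_{res}$: the sum of squares of the residual errors, i.e. the squared differences between the actual and the predicted values.

$SS_{tot}$: the total sum of squares, i.e. the squared differences between the actual values and their mean.

The higher the R squared value, the better the model fits the data.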
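
The formula can be checked by hand. The sketch below computes R squared directly from the two sums of squares on made-up numbers; the function name `r_squared` is hypothetical. The last call shows that the value drops below 0 when the predictions are worse than simply predicting the mean.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot


actual = [2.0, 3.0, 4.0]  # made-up package values

print(r_squared(actual, [2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(actual, [3.0, 3.0, 3.0]))  # always predicting the mean
print(r_squared(actual, [4.0, 3.0, 2.0]))  # worse than predicting the mean
```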

# Model Evaluation on training data


MAE_train = round(mean_absolute_error(y_true = Y_train, y_pred = x_train_data_pred), 2)
MSE_train = round(mean_squared_error(y_true = Y_train, y_pred = x_train_data_pred), 2)
RMSE_train = round(np.sqrt(mean_squared_error(y_true = Y_train, y_pred = x_train_data_pred)), 2)
r_square_train = round(r2_score(y_true = Y_train, y_pred = x_train_data_pred), 2)
# Print each evaluation metric for the training data
print('The Mean Absolute Error for the training set is', MAE_train)
print('The Mean Square Error for the training set is', MSE_train)
print('The Root Mean Square Error for the training set is', RMSE_train)
print('The R Square score for the training set is', r_square_train)

# Model Evaluation on test data


MAE_test = round(mean_absolute_error(y_true = Y_test, y_pred = x_test_data_pred), 2)
MSE_test = round(mean_squared_error(y_true = Y_test, y_pred = x_test_data_pred), 2)
RMSE_test = round(np.sqrt(mean_squared_error(y_true = Y_test, y_pred = x_test_data_pred)), 2)
r_square_test = round(r2_score(y_true = Y_test, y_pred = x_test_data_pred), 2)
# Print each evaluation metric for the test data
print('The Mean Absolute Error for the Test Data is', MAE_test)
print('The Mean Square Error for the Test Data is', MSE_test)
print('The Root Mean Square Error for the Test Data is', RMSE_test)
print('The R Square score for the Test Data is', r_square_test)
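
The three error metrics used above are simple averages over the prediction errors. As a sketch, they can be computed by hand with NumPy; the actual and predicted values below are made up.

```python
import numpy as np

y_actual = np.array([3.0, 2.5, 4.0, 3.5])     # made-up packages in LPA
y_predicted = np.array([2.5, 2.5, 4.5, 3.0])  # made-up model predictions

errors = y_actual - y_predicted

mae = np.mean(np.abs(errors))  # average size of the error
mse = np.mean(errors ** 2)     # average squared error, penalizes large misses more
rmse = np.sqrt(mse)            # square root of MSE, in the same unit as the target

print(mae, mse, rmse)
```

Lower values are better for all three, while for R squared higher is better.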

# plot the model with the test set


plt.scatter(X_train, Y_train)
plt.scatter(X_test, Y_test)
plt.plot(X_test, x_test_data_pred, color = 'g')
plt.show()

# Make predictive model


# NOW PREDICT ON NEW, UNSEEN INPUT DATA
input_data = 7.48 # a single cgpa value
input_data_as_numpy_array = np.asarray(input_data)
# RESHAPE THE NUMPY ARRAY TO (1, 1) AS WE ARE PREDICTING FOR A SINGLE DATA INSTANCE AT A TIME
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = lm.predict(input_data_reshaped)
print(prediction)
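
For a single feature, `lm.predict` evaluates the fitted line: package = intercept + coefficient * cgpa. The sketch below makes that explicit. The coefficient and intercept are hypothetical placeholders, since the notebook's fitted values are not shown, and `predict_package` is a name invented for this example.

```python
# Hypothetical parameters; the real ones come from lm.coef_[0] and lm.intercept_
coefficient = 0.5
intercept = -1.0


def predict_package(cgpa):
    """Evaluate the fitted line for one CGPA value."""
    return intercept + coefficient * cgpa


print(predict_package(8.0))
```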
