DRONACHARYA COLLEGE OF ENGINEERING, KHENTAWAS, GURGAON, HARYANA
Submitted By
Certificate
STUDENT DECLARATION
I hereby declare that the Practical Training Report entitled “Data Science and
Machine Learning” is an authentic record of my own work, carried out as part of
the requirements of the 8-week Industrial Training during the period from
26-06-2023 to 26-08-2023 for the award of the degree of B.Tech. (Computer
Science & Engineering), Dronacharya College of Engineering.
Signature of student
Rishabh Pandey
24140
Date: 28-08-2023
Certified that the above statement made by the student is correct to the best of
our knowledge and belief.
Signatures
Acknowledgement
Rishabh Pandey
“YBI Foundation”
Company Background
The platform provides free online instructor-led classes for students to excel in
data science, business analytics, machine learning, cloud computing, and big
data. They aim to focus on innovation, creativity, and a technology-driven
approach, and they keep themselves in sync with present industry requirements.
They endeavor to support learners in achieving the highest possible goals in
their academics and professions.
They offer free programs, scholarships for girls, a dual internship program, a
full-stack dual certificate program, and a guaranteed placement assistance
program for students, freshers, and working professionals.
Anyone who wants to learn machine learning, data science, and other emerging
technologies for Industry 4.0 to build a career in them, whether a beginner or a
professional, is welcome to enroll in the programs.
TABLE OF CONTENTS
1. Chapter 1: Scope of Data Science
2. Chapter 2: Introduction to Python
3. Chapter 3: Train-Test Split and Regression
4. Chapter 4: Fundamental Projects
   4.1 Fraud Detection
Chapter-1
SCOPE OF DATA SCIENCE
The field of Data Science is one of the fastest growing in India. In
recent years, there has been a surge in the amount of data
available, and businesses are increasingly looking for ways to
make use of it. Data Science is a relatively new field, covering a
wide range of topics, from machine learning and artificial
intelligence to statistics and cloud computing.
Chapter-2
Introduction to Python
Python is a widely used general-purpose, high-level programming
language. It was created by Guido van Rossum, first released in 1991,
and further developed by the Python Software Foundation. It was
designed with an emphasis on code readability, and its syntax allows
programmers to express their concepts in fewer lines of code.
Python is a programming language that lets you work quickly and
integrate systems more effectively.
There are two major Python versions: Python 2 and Python 3. The two
are quite different, and Python 2 reached its end of life in 2020.
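As a quick, generic illustration of this readability (not taken from the training
material), the following snippet filters and squares a list of numbers in a
single line:
nums = [1, 2, 3, 4, 5]
even_squares = [n * n for n in nums if n % 2 == 0]  # list comprehension
print(even_squares)  # [4, 16]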
INTRODUCTION TO GOOGLE COLAB
Google is quite aggressive in AI research. Over many years, Google
developed an AI framework called TensorFlow and a development tool
called Colaboratory. Today TensorFlow is open-sourced, and since 2017
Google has made Colaboratory free for public use. Colaboratory is now
known as Google Colab or simply Colab.
Another attractive feature that Google offers developers is the use of a
GPU. Colab supports GPU acceleration, and it is completely free. One
reason for making it free to the public may be to establish its software as
a standard in academia for teaching machine learning and data science.
Google may also have the long-term goal of building a customer base for
its Google Cloud APIs, which are sold on a per-use basis.
Irrespective of the reasons, the introduction of Colab has eased the
learning and development of machine learning applications.
1. Pandas
All of us can do data analysis with pen and paper on small datasets, but
we require specialized tools and techniques to analyze and derive
meaningful information from massive datasets. Pandas is one such Python
library for data analysis; it contains high-level data structures and tools to
manipulate data in a simple way. It provides an effortless yet effective
way to analyze data, with the ability to index, retrieve, split, join,
restructure, and otherwise work with both single- and multi-dimensional
data.
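A minimal sketch of these operations on a small hand-made DataFrame (the
column names here are purely illustrative):
import pandas as pd

# Build a small DataFrame from a dictionary
df = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meera'],
                   'score': [82, 74, 91]})
print(df[df['score'] > 80])   # index/retrieve rows by a condition
print(df['score'].mean())     # aggregate a single column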
2. NumPy
NumPy (short for Numerical Python) is a Python library for numerical
calculations and scientific computing. NumPy provides numerous
features which Python enthusiasts and programmers can use to work with
high-performing arrays and matrices. NumPy arrays provide vectorization
of mathematical operations, which gives them a performance boost over
Python's looping constructs.
Pandas Series and DataFrame objects rely primarily on NumPy arrays for
all the mathematical calculations like slicing elements and performing
vector operations.
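A small sketch of vectorization: the same element-wise operation written once
over a whole array instead of in a Python loop:
import numpy as np

arr = np.arange(1_000_000)
squares = arr ** 2            # vectorized: executed in optimized C code
# The equivalent Python loop is far slower:
# squares = [x ** 2 for x in arr]
print(squares[:5])            # [ 0  1  4  9 16]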
3. SciPy
SciPy (short for Scientific Python) is an assortment of mathematical
functions and algorithms built on NumPy, Python's numerical extension.
SciPy provides various high-level commands and classes for manipulating
and visualizing data. SciPy is useful for data processing and for
prototyping systems.
Apart from this, SciPy provides other advantages for building scientific
applications and many specialized, sophisticated applications backed by a
robust and fast-growing Python community.
SciPy does not provide any plotting function because its focus is on
numerical objects and algorithms.
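As one small example of SciPy's high-level commands, the snippet below
numerically integrates a function with scipy.integrate.quad:
import numpy as np
from scipy import integrate

# Integrate sin(x) from 0 to pi; the exact answer is 2
result, abs_error = integrate.quad(np.sin, 0, np.pi)
print(result)   # approximately 2.0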
4. Scikit-Learn
For machine learning practitioners, Scikit-Learn is a savior. It provides
supervised and unsupervised machine learning algorithms ready for
production applications. Scikit-Learn focuses on code quality,
documentation, ease of use, and performance, which makes the library
easy to pick up despite its breadth.
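A minimal sketch of the library's uniform fit/predict interface, using its
built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000)   # max_iter raised to ensure convergence
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))          # accuracy on held-out data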
5. PyCaret
PyCaret is a fully accessible machine learning package for data
processing and model deployment. It saves time because it is a low-code
library. It's a user-friendly machine learning library that helps you run
end-to-end machine learning experiments, whether you're imputing
missing values, analyzing categorical data, engineering features, tuning
hyperparameters, or building ensemble models.
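A rough sketch of PyCaret's low-code workflow; exact arguments can vary
between PyCaret versions, so treat this as an outline rather than exact API:
from sklearn.datasets import load_breast_cancer
from pycaret.classification import setup, compare_models

# Any DataFrame with a labeled target column works here
data = load_breast_cancer(as_frame=True).frame
s = setup(data=data, target='target', session_id=123)  # preprocessing in one call
best_model = compare_models()  # trains and ranks many candidate models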
6. TensorFlow
TensorFlow is a free end-to-end open-source platform for Machine
Learning that includes a wide range of tools, libraries, and resources. The
Google Brain team first released it on November 9, 2015. TensorFlow
makes it simple to design and train Machine Learning models using high-
level APIs like Keras. It also offers various abstraction levels, allowing you
to select the best approach for your model. TensorFlow also enables you
to deploy Machine Learning models in multiple environments, including
the cloud, browser, and your device. If you want the complete experience,
choose TensorFlow Extended (TFX); TensorFlow Lite if you're going to use
TensorFlow on mobile devices; and TensorFlow.js if you're going to train
and deploy models in JavaScript contexts.
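A minimal sketch of the high-level Keras API mentioned above, defining a tiny
untrained model:
import tensorflow as tf

# A small fully connected network for 10-class classification
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()   # print the layer structure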
7. OpenCV
Licensed under the BSD, OpenCV is a free machine learning and computer
vision library. It offers a shared architecture for computer vision
applications to streamline the implementation of computer vision in
commercial products.
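A small sketch of typical OpenCV usage; 'photo.jpg' is a placeholder file name:
import cv2

img = cv2.imread('photo.jpg')                  # load an image (BGR order)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # convert to grayscale
edges = cv2.Canny(gray, 100, 200)              # Canny edge detection
cv2.imwrite('edges.jpg', edges)                # save the result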
The read_csv method of the Pandas library takes a CSV file as a parameter and
returns a dataframe.
import pandas as pd
df = pd.read_csv('my_csv.csv')
The read_pickle method of the Pandas library takes a pickle file as a parameter and
returns a dataframe.
import pandas as pd
df = pd.read_pickle('my_pkl.pkl')
The read_excel method of the Pandas library takes an Excel file as a parameter and
returns a dataframe.
import pandas as pd
df = pd.read_excel('my_excel.xlsx')
Once the data has been read into a DataFrame, display the DataFrame to check
that the data has been read correctly.
Chapter-3
Train test split
The train_test_split() method is used to split our data into train and
test sets.
First, we need to divide our data into features (X) and labels (y).
The DataFrame gets divided into X_train, X_test, y_train, and y_test.
The X_train and y_train sets are used for training and fitting the model,
while the X_test and y_test sets are used for testing whether the model
predicts the right outputs/labels. We can explicitly set the sizes of the
train and test sets, and it is advisable to keep the train set larger than
the test set, as shown in the sketch below.
Train set: the subset of the data used to fit the model. This data is seen
and learned by the model during training.
Test set: the subset of the data held out from training and used to give
an unbiased evaluation of the final model fit.
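A minimal sketch of the split on toy data (the 80/20 ratio is just one common
choice):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% train, 20% test
print(X_train.shape, X_test.shape)          # (8, 2) (2, 2)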
Linear regression
Linear Regression is an algorithm that belongs to supervised Machine
Learning. It tries to find a relation that predicts the outcome of an event
based on the independent variable data points. The relation is usually a
straight line that fits the different data points as closely as possible. The
output is of a continuous form, i.e., a numerical value. For example, the
output could be revenue or sales in currency, the number of products
sold, etc. The independent variable can be single or multiple.
1. Linear Regression Equation
y = β₀ + β₁x + ε
Here,
y = dependent variable
x = independent variable
β₀ = intercept of the line
β₁ = linear regression coefficient (slope of the line)
ε = random error
3. Non-Linear Regression
When the best fitting line is not a straight line but a curve, it is
referred to as Non-Linear Regression.
import pandas as pd

car_data = pd.read_csv('/content/car_data.csv')
car_data.head()
car_data.info()
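A sketch of the fitting step that would follow. The report does not list the
columns of car_data.csv, so 'Mileage' and 'Price' below are hypothetical
placeholders:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = car_data[['Mileage']]   # hypothetical feature column
y = car_data['Price']       # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out test set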
Advantages of Linear Regression
1. Easy to Implement
2. Scalability
3. Interpretability
4. Applicability in real-time
Logistic regression
o Logistic regression is one of the most popular Machine Learning
algorithms and comes under the Supervised Learning technique. It is
used for predicting a categorical dependent variable from a given set of
independent variables.
o Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete value: it
can be Yes or No, 0 or 1, True or False, etc. However, instead of giving
exact values of 0 and 1, it gives probabilistic values which lie between
0 and 1.
o Logistic Regression is quite similar to Linear Regression except in how
the two are used. Linear Regression is used for solving regression
problems, whereas Logistic Regression is used for solving classification
problems.
o In Logistic Regression, instead of fitting a straight regression line, we
fit an "S"-shaped logistic function, which predicts two maximum values
(0 or 1).
o The curve from the logistic function indicates the likelihood of
something, such as whether cells are cancerous or not, or whether a
mouse is obese or not based on its weight.
o Logistic Regression is a significant machine learning algorithm because
it can provide probabilities and classify new data using both continuous
and discrete datasets.
o Logistic Regression can be used to classify observations using
different types of data and can easily determine the most effective
variables for the classification. The logistic (sigmoid) function is
f(x) = 1 / (1 + e^(−x)),
which produces the characteristic S-shaped curve.
For this problem, we will build a Machine Learning model using the
Logistic Regression algorithm. In this problem, we will predict the
purchased variable (dependent variable) using age and salary
(independent variables).
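The data-loading code was not preserved in the report; the following is a
sketch, assuming the dataset is saved as 'User_Data.csv' (the file name is a
placeholder):
import pandas as pd

dataset = pd.read_csv('User_Data.csv')  # file name is assumed
dataset.head()                          # preview the first rows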
By executing the above lines of code, we load the dataset into memory.
Now, we will extract the dependent and independent variables from the
given dataset. Below is the code for it:
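(The original code block was lost in extraction; the following is a
reconstruction based on the description below.)
# Extracting the independent (x) and dependent (y) variables
x = dataset.iloc[:, [2, 3]].values   # age and salary columns
y = dataset.iloc[:, 4].values        # purchased column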
In the above code, we have taken columns [2, 3] for x because our
independent variables, age and salary, are at indexes 2 and 3. And we
have taken index 4 for the y variable because our dependent variable,
purchased, is at index 4.
Now we will split the dataset into a training set and test set. Below is the
code for it:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
We have prepared our dataset well, and now we will train the model
using the training set. To fit the model to the training set, we will import
the LogisticRegression class from the sklearn library.
After importing the class, we will create a classifier object and use it to
fit the model to the training data. Below is the code for it:
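(Reconstructed from the description above.)
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)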
By executing the above code, the classifier is fitted to the training data.
Our model is well trained on the training set, so we will now predict the
result by using test set data. Below is the code for it:
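(Reconstructed from the description below.)
# Predicting the test set results
y_pred = classifier.predict(x_test)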
In the above code, we have created a y_pred vector to predict the test set
result.
The y_pred vector shows, for each test-set user, whether they are
predicted to purchase the car or not.
Now we will create the confusion matrix to check the accuracy of the
classification. To create it, we need to import the confusion_matrix
function of the sklearn library. After importing the function, we will call it
and store the result in a new variable cm. The function takes two main
parameters, y_true (the actual values) and y_pred (the values predicted
by the classifier). Below is the code for it:
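(Reconstructed from the description above.)
# Creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)   # rows: actual, columns: predicted
print(cm)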
The confusion matrix shows 11 incorrect predictions in total.
Finally, we will visualize the training set result. To visualize the result,
we will use the ListedColormap class of the matplotlib library. Below is
the code for it:
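(The plotting code was lost in extraction; below is a reconstruction in the
usual style for this kind of decision-region plot. The grid step sizes assume
unscaled age and salary values.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
# Build a grid covering the feature space (age on axis 0, salary on axis 1)
x1, x2 = np.meshgrid(
    np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.25),
    np.arange(x_set[:, 1].min() - 1000, x_set[:, 1].max() + 1000, 250))
# Colour each grid point by the class the classifier predicts there
plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
# Overlay the actual training observations
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()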
By executing the above code, we get a plot of the classifier's decision
regions with the training observations:
o In the above graph, we can see that there are some green points
within the green region and some purple points within the purple
region.
o All these data points are the observation points from the training
set, which show the result for the purchased variable.
o This graph is made using the two independent variables, i.e., Age on
the x-axis and Estimated Salary on the y-axis.
o The purple points are observations for which purchased (the
dependent variable) is probably 0, i.e., users who did not purchase
the SUV.
o The green points are observations for which purchased is probably
1, i.e., users who did purchase the SUV.
o We can also estimate from the graph that younger users with low
salaries did not purchase the car, whereas older users with high
estimated salaries purchased the car.
o But there are some purple points in the green region (buying the
car) and some green points in the purple region (not buying the car):
some younger users with a high estimated salary did purchase the
car, whereas some older users with a low estimated salary did not.
We have successfully visualized the training set result for the logistic
regression. Our goal for this classification is to separate the users who
purchased the SUV from those who did not. From the output graph, we
can clearly see the two regions (purple and green) with the observation
points. The purple region is for the users who didn't buy the car, and the
green region is for the users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a straight line, i.e., linear
in nature, because we used a linear model for Logistic Regression. In
further topics, we will learn about non-linear classifiers.
Our model is well trained using the training dataset. Now, we will
visualize the result for new observations (the test set). The code for the
test set remains the same as above, except that here we use x_test and
y_test instead of x_train and y_train.
As with the training set, only a few points are misclassified, which
matches the errors we counted using the confusion matrix (11 incorrect
predictions). Hence our model is pretty good and ready to make new
predictions for this classification problem.
Project 1
Fraud Detection
Fraud in credit card transactions is common today. The advancement of
technology has led to an increase in online transactions, but it has also
led to an increase in credit card fraud, causing huge losses.
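A minimal sketch of how such a detector could be built with the tools covered
in this report. The file name 'creditcard.csv' and the 'Class' column follow
the widely used Kaggle credit-card fraud dataset and are assumptions here:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = pd.read_csv('creditcard.csv')        # assumed file name
X = data.drop(columns=['Class'])            # transaction features
y = data['Class']                           # 1 = fraud, 0 = genuine (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))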
Project 2
Project 3
Project 4
User_ID: Unique ID of the user. There are a total of 5891 users in the
dataset.
Age: indicates the age group of the person making the transaction.
Project 5
The goal is to reduce the cost and time of training, improve the planning
of courses, and categorize candidates. Information related to
demographics, education, and experience is available from candidates'
signup and enrollment. This dataset is also designed to help HR
researchers understand the factors that lead a person to leave their
current job. Using models built on current credentials, demographics,
and experience data, you will predict the probability that a candidate
will look for a new job or stay with the company, as well as interpret the
factors affecting the employee's decision.
The whole dataset is divided into train and test sets. The target is not
included in the test set, but the test target values file is provided for
related tasks. A sample submission corresponding to the enrollee_id of
the test set is provided too, with columns: enrollee_id, target.
target: 0 – Not looking for job change, 1 – Looking for a job change