
DATA SCIENCE AND MACHINE LEARNING
USING PYTHON

Practical Training I/II Report

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF
THE DEGREE OF BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE ENGINEERING

Submitted By:
Name: Rishabh Pandey
University Roll No: 24140

Submitted To:
Dr. Ashima Mehta

Department of Computer Science & Engineering
DRONACHARYA COLLEGE OF ENGINEERING, KHENTAWAS, GURGAON, HARYANA
PRACTICAL TRAINING I/II REPORT

DATA SCIENCE AND MACHINE LEARNING

Submitted in partial fulfillment of the requirements for the award of the
Degree of Bachelor of Technology in Computer Science Engineering

Submitted By:
Name: Rishabh Pandey
University Roll No: 24140

Submitted To:
Dr. Ashima Mehta (HOD)

Department of Computer Science & Engineering
MAHARISHI DAYANAND UNIVERSITY, ROHTAK
CERTIFICATE

STUDENT DECLARATION
I hereby declare that the Practical Training Report entitled "Data Science and
Machine Learning" is an authentic record of my own work, carried out as a requirement
of the 8-week Industrial Training during the period from 26-06-2023 to 26-08-2023, for
the award of the degree of B.Tech. (Computer Science & Engineering), Dronacharya
College of Engineering.

Signature of student

Rishabh Pandey

24140

Date: 28-08-2023

Certified that the above statement made by the student is correct to the best of
our knowledge and belief.

Signatures

Examined by: Head of Department


(Signature and Seal)
Acknowledgement

I take this opportunity to express my profound gratitude and deep regards to my
teachers, Prof. Dr. Ashima Mehta and Prof. Ms. Vimmi Malhotra of Dronacharya
College of Engineering, for their exemplary guidance, monitoring, and constant
encouragement throughout the course of this project. The blessings, help, and
guidance given by them from time to time shall carry me a long way in the journey of
life on which I am about to embark.

I must acknowledge the faculty and staff of Dronacharya College of
Engineering for their continuous guidance and teaching support, which enabled me
to successfully complete this training report.

It is my great pleasure to acknowledge my colleagues for providing constant
support and motivation to complete this training report. I am especially grateful
to my friends for constantly supporting me and helping me complete this
entire training report.

Rishabh Pandey

Computer Science Engineering

Roll No. : 24140



About the Company

“YBI Foundation”

Company Background

YBI Foundation is a Delhi-based not-for-profit edutech company that aims to
enable the youth to grow in the world of emerging technologies. It offers a mix
of online and offline approaches to bring new skills, education, and technologies to
students, academicians, and practitioners, and believes in a learn-anywhere,
learn-anytime approach to reach out to learners.

The platform provides free online instructor-led classes for students to excel in
data science, business analytics, machine learning, cloud computing, and big
data. It focuses on innovation, creativity, and a technology-driven approach, and keeps
itself in sync with present industry requirements. It endeavors to
support learners in achieving the highest possible goals in their academics and
professions.

The foundation offers free programs, scholarships for girls, a dual internship program, a
full-stack dual certificate program, and a guaranteed placement assistance program for
students, freshers, and working professionals.

Anyone who wants to learn machine learning, data science, and other emerging
technologies for Industry 4.0 to make a career in them, whether a beginner
or a professional, is welcome to enroll in its programs.
TABLE OF CONTENTS

Chapter 1: Introduction
  1.1 Scope of Data Science

Chapter 2: Introduction to Python
  2.1 Introduction to Google Colab
  2.2 Python Libraries for Data Science and Machine Learning
  2.3 Read Data as DataFrame

Chapter 3: Data Science and Machine Learning
  3.1 Train-Test Split
  3.2 Linear Regression Model
  3.3 Logistic Regression Model

Chapter 4: Fundamental Projects
  4.1 Fraud Detection
  4.2 Car Price Prediction
  4.3 Thyroid Disease Prediction
  4.4 Black Friday Sales Prediction
  4.5 Data Science Jobs

Chapter 1
SCOPE OF DATA SCIENCE
The field of Data Science is one of the fastest growing in India. In
recent years, there has been a surge in the amount of data
available, and businesses are increasingly looking for ways to
make use of it. Data Science is a relatively new field, covering a wide range of
topics, from machine learning and artificial intelligence to
statistics and cloud computing.

Data Science is a relatively new field in India, so there is still a
lot of excitement and interest surrounding it.

The potential applications of data science are vast, and
Indian businesses are just beginning to scratch the surface of
what is possible.

Many Indian companies are investing heavily in Data Science as
they realize the competitive advantage that it can provide.

The Indian government also supports Data Science careers in
India, investing in infrastructure and initiatives to promote
the adoption of data-driven practices.

The talent pool of data scientists in India is rapidly growing as
more people see data science's future scope in India.

There are already many success stories of Data Science
applications in India, and this trend is likely to continue in the
future.

Data Science is one of the fastest-growing fields in India. Data
Science job opportunities in India are increasing as organizations
seek to harness the power of data to drive decision-making.
Some of the most common job roles in data science include data
analysts, data scientists, and big data engineers. Data scientist
eligibility in India requires knowledge of extracting, cleaning,
and analyzing data. Data scientists use mathematical and
statistical methods to mine data for insights.

Big data engineers build and maintain the infrastructure to store
and process large amounts of data. With the rapid growth of
data-driven businesses in India, these job roles are in high
demand. As more organizations seek to leverage data to gain a
competitive edge, the demand for skilled workers in the data
science domain is only expected to increase.

Chapter 2
INTRODUCTION TO PYTHON
Python is a widely used general-purpose, high-level programming
language. It was created by Guido van Rossum in 1991 and further
developed by the Python Software Foundation. It was designed with
an emphasis on code readability, and its syntax allows
programmers to express their concepts in fewer lines of code.
Python is a programming language that lets you work quickly and
integrate systems more efficiently.
There are two major Python versions, Python 2 and Python 3, and
the two are quite different.

Python is an easy-to-learn yet powerful and versatile scripting
language, which makes it attractive for application development.

With its interpreted nature, Python's syntax and dynamic typing
make it an ideal language for scripting and rapid application
development.

Python supports multiple programming paradigms, including object-
oriented, imperative, and functional or procedural programming
styles.

Python is not tied to a particular area, such as web
programming. It is a multipurpose programming language
because it can be used for web, enterprise, 3D CAD, and other applications.

We don't need to declare data types for variables because Python is
dynamically typed, so we can write a = 10 to assign an integer
value to a variable (a small example follows this list).

Python makes development and debugging fast because no
compilation step is included in Python development, and the
edit-test-debug cycle is very fast.
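A small illustration of the dynamic typing mentioned above: the same name can be
bound to values of different types, and no type declarations are needed.

a = 10          # a refers to an integer
print(type(a))  # <class 'int'>

a = "ten"       # the same name can later refer to a string
print(type(a))  # <class 'str'>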

WHY LEARN PYTHON?

o Easy to Use and Learn: Python has a simple and easy-to-
understand syntax, unlike traditional languages like C, C++, Java,
etc., making it easy for beginners to learn.
o Expressive Language: It allows programmers to express complex
concepts in just a few lines of code, reducing development time.
o Interpreted Language: Python does not require compilation,
allowing rapid development and testing. It uses an interpreter instead
of a compiler.
o Object-Oriented Language: It supports object-oriented
programming, making it easy to write reusable and modular code.
o Open Source Language: Python is open source and free to use,
distribute, and modify.
o Extensible: Python can be extended with modules written in C, C++,
or other languages.
o Large Standard Library: Python's standard library contains many
modules and functions that can be used for various tasks, such as
string manipulation, web programming, and more.
o GUI Programming Support: Python provides several GUI
frameworks, such as Tkinter and PyQt, allowing developers to create
desktop applications easily.
o Integrated: Python can easily integrate with other languages and
technologies, such as C/C++, Java, and .NET.
o Embeddable: Python code can be embedded into other
applications as a scripting language.
o Dynamic Memory Allocation: Python automatically manages
memory allocation, making it easier for developers to write complex
programs without worrying about memory management.
o Wide Range of Libraries and Frameworks: Python has a vast
collection of libraries and frameworks, such as NumPy, Pandas,
Django, and Flask, that can be used to solve a wide range of
problems.
o Versatility: Python is a universal language used in various domains such
as web development, machine learning, data analysis, scientific
computing, and more.
o Large Community: Python has a vast and active community of
developers contributing to its development and offering support.
This makes it easy for beginners to get help and learn from
experienced developers.
o Career Opportunities: Python is a highly popular language in the
job market. Learning Python can open up several career
opportunities in data science, artificial intelligence, web
development, and more.
o High Demand: With the growing demand for automation and
digital transformation, the need for Python developers is rising.
Many industries seek skilled Python developers to help build their
digital infrastructure.
o Increased Productivity: Python has a simple syntax and powerful
libraries that can help developers write code faster and more
efficiently. This can increase productivity and save time for
developers and organizations.
o Big Data and Machine Learning: Python has become the go-to
language for big data and machine learning. With
libraries like NumPy, Pandas, Scikit-learn, and TensorFlow, it has become
popular among data scientists and machine learning engineers.

WHERE IS PYTHON USED


o Data Science: Data Science is a vast field, and Python is an
important language for this field because of its simplicity, ease of
use, and availability of powerful data analysis and visualization
libraries like NumPy, Pandas, and Matplotlib.
o Desktop Applications: PyQt and Tkinter are useful libraries for building
GUI (Graphical User Interface)-based desktop
applications. Other languages may suit this field better, but Python can be
combined with them when building such applications.
o Console-based Applications: Python is also commonly used to
create command-line or console-based applications because of its
ease of use and support for advanced features such as input/output
redirection and piping.
o Mobile Applications: While Python is not commonly used for
creating mobile applications, it can still be combined with
frameworks like Kivy or BeeWare to create cross-platform mobile
applications.
o Software Development: Python is considered one of the best
languages for building software, and it works well for everything
from small-scale to large-scale projects.
o Artificial Intelligence: AI is an emerging Technology, and Python
is a perfect language for artificial intelligence and machine learning
because of the availability of powerful libraries such as TensorFlow,
Keras, and PyTorch.
o Web Applications: Python is commonly used in web development
on the backend with frameworks like Django and Flask and on the
front end with tools like JavaScript and HTML.
o Enterprise Applications: Python can be used to develop large-
scale enterprise applications with features such as distributed
computing, networking, and parallel processing.
o 3D CAD Applications: Python can be used for 3D computer-aided
design (CAD) applications through libraries such as Blender.
o Machine Learning: Python is widely used for machine learning due
to its simplicity, ease of use, and availability of powerful machine
learning libraries.
o Computer Vision or Image Processing Applications: Python
can be used for computer vision and image processing applications
through powerful libraries such as OpenCV and Scikit-image.
o Speech Recognition: Python can be used for speech recognition
applications through libraries such as SpeechRecognition and
PyAudio.
o Scientific computing: Libraries like NumPy, SciPy, and Pandas
provide advanced numerical computing capabilities for tasks like
data analysis, machine learning, and more.
o Education: Python's easy-to-learn syntax and availability of many
resources make it an ideal language for teaching programming to
beginners.
o Testing: Python is used for writing automated tests, providing
frameworks like unittest and pytest that help write test cases and
generate reports.
o Gaming: Python has libraries like Pygame, which provide a platform
for developing games using Python.
o IoT: Python is used in IoT for developing scripts and applications for
devices like Raspberry Pi, Arduino, and others.
o Networking: Python is used in networking for developing scripts
and applications for network automation, monitoring, and
management.
o DevOps: Python is widely used in DevOps for automation and
scripting of infrastructure management, configuration management,
and deployment processes.
o Finance: Python has libraries like Pandas, Scikit-learn, and
Statsmodels for financial modeling and analysis.
o Audio and Music: Python has libraries like Pyaudio, which is used
for audio processing, synthesis, and analysis, and Music21, which is
used for music analysis and generation.
o Writing Scripts: Python is used for writing utility scripts to
automate tasks like file operations, web scraping, and data
processing.
INTRODUCTION TO GOOGLE COLAB
Google has been quite aggressive in AI research. Over many years, Google
developed an AI framework called TensorFlow and a development tool
called Colaboratory. Today TensorFlow is open-sourced, and since 2017
Google has made Colaboratory free for public use. Colaboratory is now known
as Google Colab or simply Colab.
Another attractive feature that Google offers to developers is the use
of GPUs. Colab supports GPU and it is totally free. One reason for making it
free for the public could be to make its software a standard in academia
for teaching machine learning and data science. Google may also have a long-
term perspective of building a customer base for Google Cloud APIs, which
are sold on a per-use basis.
Irrespective of the reasons, the introduction of Colab has eased the
learning and development of machine learning applications.

What Colab offers you

 Write and execute code in Python
 Document your code with support for mathematical
equations
 Create/Upload/Share notebooks
 Import/Save notebooks from/to Google Drive (a short example follows this list)
 Import/Publish notebooks from GitHub
 Import external datasets, e.g. from Kaggle
 Integrate PyTorch, TensorFlow, Keras, and OpenCV
 Free cloud service with free GPU
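As an example of the Google Drive integration listed above, a Colab notebook can mount
Drive and read a dataset stored there. The file path below is only an illustration.

from google.colab import drive
import pandas as pd

# Mount Google Drive inside the Colab runtime
drive.mount('/content/drive')

# Read a CSV stored in Drive (hypothetical path, for illustration only)
df = pd.read_csv('/content/drive/MyDrive/my_dataset.csv')
df.head()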
PYTHON LIBRARIES FOR DATA SCIENCE AND MACHINE LEARNING

1. Pandas
All of us can do data analysis using pen and paper on small data sets, but we
require specialized tools and techniques to analyze and derive meaningful
information from massive datasets. Pandas is one of those Python libraries
for data analysis: it contains high-level data structures and tools to
manipulate data in a simple way. Providing an effortless yet effective way
to analyze data requires the ability to index, retrieve, split, join,
restructure, and perform various other operations on both multi- and single-
dimensional data.

Key Features of Pandas


Pandas data analysis library has some unique features that provide these
capabilities-
i) The Series and DataFrame Objects
These two are high-performance array and table structures for
representing the heterogeneous and homogeneous data sets in Pandas
Python.
ii) Restructuring of Data Sets
Pandas python provides the flexibility for reshaping the data structures to
be inserted in both rows and columns of tabular data.
iii) Labelling
To allow automatic data alignment and indexing, pandas provide labeling
on series and tabular data.
iv) Multiple Labels for a Data Item
Heterogeneous indexing of data spread across multiple axes, which helps
in creating more than one label on each data item.
v) Grouping
The functionality to perform split-apply-combine on series as well on
tabular data.
vi) Identify and Fix Missing Data
Programmers can quickly identify and fix missing data in both floating and non-
floating point numbers using pandas.
vii) Powerful capabilities to load and save data from various formats such
as JSON, CSV, HDF5, etc.
viii) Conversion from NumPy and Python data structures to pandas
objects.
ix) Slicing and sub-setting of datasets, including merging and joining data
sets with SQL- like constructs.
Although pandas provides many statistical methods, it is not enough on its own to do
data science in Python. Pandas depends upon other Python libraries for
data science, like NumPy, SciPy, Scikit-learn, Matplotlib, and ggvis in the
Python ecosystem, to draw conclusions from large data sets. This makes it
possible for Pandas applications to take advantage of the robust and
extensive Python framework.
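To make features such as grouping (split-apply-combine) and fixing missing data
concrete, here is a small, self-contained sketch; the column names are made up purely
for illustration.

import pandas as pd

# A tiny DataFrame with a missing value (columns are illustrative only)
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'revenue': [100.0, None, 250.0, 300.0],
})

# Identify and fix missing data
sales['revenue'] = sales['revenue'].fillna(sales['revenue'].mean())

# Split-apply-combine: total revenue per region
print(sales.groupby('region')['revenue'].sum())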

Pros of using Pandas


 Pandas allow you to represent data effortlessly and in a simpler
manner, improving data analysis and comprehension. For data
science projects, such a simple data representation helps glean
better insights.
 Pandas is highly efficient as it enables you to perform any task by
writing only a few lines of code.
 Pandas provide users with a broad range of commands to analyze
data quickly.

Cons of using Pandas


 The learning curve for Pandas may appear to be simple at first, but
as you start working with it, you may find it challenging to grasp.
 One of the most evident flaws of Pandas is that it isn’t suitable for
working with 3D matrices.

2. NumPy
NumPy (short for Numerical Python) is a Python library for numerical
calculations and scientific computing. NumPy provides numerous
features which Python enthusiasts and programmers can use to work with
high-performing arrays and matrices. NumPy arrays provide vectorization
of mathematical operations, which gives them a performance boost over
Python's looping constructs.
Pandas Series and DataFrame objects rely primarily on NumPy arrays for
all the mathematical calculations like slicing elements and performing
vector operations.

Key Features of NumPy


Below are some of the features provided by NumPy-
1. Integration with legacy languages.
2. Mathematical Operations: It provides all the standard functions
required to perform operations on large data sets swiftly and
efficiently, which otherwise have to be achieved through looping
constructs.
3. ndarray: It is a fast and efficient multidimensional array that can
perform vector-based arithmetic operations and has powerful
broadcasting capabilities.
4. I/O Operations: It provides various tools which can be used to
write/read huge data sets from disk. It also supports I/O operations
on memory-based file mappings.
5. Fourier transform capabilities, Linear Algebra, and Random Number
Generation.
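A small sketch of the vectorized operations described above, contrasted with an explicit
Python loop that computes the same result:

import numpy as np

a = np.arange(100_000, dtype=np.float64)

# Vectorized: the whole-array operation runs in optimized C code
b = a * 2.0 + 1.0

# Equivalent Python loop (much slower for large arrays)
c = np.empty_like(a)
for i in range(a.size):
    c[i] = a[i] * 2.0 + 1.0

print(np.allclose(b, c))  # True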

Pros of using NumPy


 NumPy provides efficient and scalable data storage and better data
management for mathematical calculations.
 The Numpy array contains a variety of functions, methods, and
variables that make computing matrices simpler.

Cons of using NumPy

 "NaN" stands for "not a number" and is intended to deal with the
issue of missing values. Although NumPy supports "nan," the lack of
cross-platform support for it in Python makes it challenging for users.
As a result, we may run into issues while comparing values within
the Python interpreter.
 When data is stored in contiguous memory addresses, insertion and
deletion become expensive, since the remaining elements must be shifted.

3. SciPy
SciPy (short for Scientific Python) is an assortment of mathematical
functions and algorithms built on Python's extension NumPy. SciPy
provides various high-level commands and classes for manipulating and
visualizing data. SciPy is useful for data processing and prototyping
systems.
Apart from this, SciPy provides other advantages for building scientific
applications and many specialized, sophisticated applications backed by a
robust and fast-growing Python community.

Pros of using SciPy

 Visualizing and manipulating data with high-level commands and


classes.
 Python sessions that are both robust and interactive.
 For parallel programming, there are classes and web and database
procedures.

Cons of using SciPy

 SciPy does not provide any plotting function because its focus is on
numerical objects and algorithms.

4. Sci-Kit Learn
For machine learning practitioners, Sci-Kit Learn is the savior. It has
supervised and unsupervised machine learning algorithms for production
applications. Sci-Kit Learn focuses on code quality, documentation, ease
of use, and performance as this library provides learning algorithms. Sci-
Kit Learn has a steep learning curve.

Pros of using Sci-Kit Learn


 The scikit learn library is a helpful tool to predict customer behavior,
develop neuroimages, and more.
 It's simple to use and completely free.

Cons of using Sci-Kit Learn


 It isn't designed to work with graph algorithms.
 It isn't very adept at handling strings.

5. PyCaret
PyCaret is a fully accessible machine learning package for model
deployment and data processing. It allows you to save time because it is a
low-code library. It's a user-friendly machine learning library that will help
you run end-to-end machine learning experiments, whether you're trying to
impute missing values, analyze categorical data, engineer features,
tune hyperparameters, or generate ensemble models.

Key Features of PyCaret


 PyCaret is a low-code library that can help you save time.
 It's a basic and easy machine learning library.
 It allows you to design quickly and efficiently from the comfort of
your notebook.
 It gives a ready-to-use solution.

Pros of using PyCaret


 Pycaret has 60 plots to analyze and interpret model performance
and offer instant results without creating complex coding.
 It works with a high degree of automation in several data
preprocessing phases.

Cons of using PyCaret


 PyCaret isn't well-suited to deep learning, and it doesn't support
Keras or PyTorch models.
 More advanced machine learning tasks, such as image
categorization and text creation, are impossible with PyCaret.

6. TensorFlow
TensorFlow is a free end-to-end open-source platform for Machine
Learning that includes a wide range of tools, libraries, and resources. The
Google Brain team first released it on November 9, 2015. TensorFlow
makes it simple to design and train Machine Learning models using high-
level APIs like Keras. It also offers various abstraction levels, allowing you
to select the best approach for your model. TensorFlow also enables you
to deploy Machine Learning models in multiple environments, including
the cloud, browser, and your device. If you want the complete experience,
choose TensorFlow Extended (TFX); TensorFlow Lite if you're going to use
TensorFlow on mobile devices; and TensorFlow.js if you're going to train
and deploy models in JavaScript contexts.

Key Features of TensorFlow


 It is a Google-developed open-source framework.
 Deep learning networks and machine learning principles are
supported.
 It's simple to use and provides for rapid debugging.

Pros of using TensorFlow


 TensorFlow offers smooth performance, quick upgrades, and regular
new releases with additional features.
 Tensorflow allows you to run subparts of a graph, giving it an
advantage because it can insert and retrieve data samples onto an
edge, making it an excellent debugging tool.
 Compared to other libraries like Torch and Theano,
TensorFlow offers better native computational graph visualizations.
 TensorFlow is intended to explore a variety of backend software
(GPU, ASIC, etc.).

Cons of using TensorFlow


 TensorFlow does not have the symbolic loops feature, although a
workaround involves finite unfolding (bucketing).
 When compared to its competitors, TensorFlow lags in terms of
speed and usability.
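As a small illustration of the high-level Keras API mentioned above, here is a minimal
sketch that defines and compiles a tiny feed-forward model; the layer sizes and input
shape are arbitrary choices for illustration.

import tensorflow as tf

# A tiny feed-forward network built with the high-level Keras API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Compile with an optimizer, loss, and metric
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()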

7. OpenCV
Licensed under the BSD, OpenCV is a free machine learning and computer
vision library. It offers a shared architecture for computer vision
applications to streamline the implementation of computer vision in
commercial products.

Key Features of OpenCV


 OpenCV's source code is open to modification and customization to
meet the customer's needs.
 OpenCV was initially written in C++. It has the same performance
as C++, and the Python wrappers run C++ code in the background.
 It uses Numpy arrays to perform operations.
 You can easily create prototypes using the Python OpenCV module.

Pros of using OpenCV


 OpenCV is a much faster program to use. In the case of OpenCV, the
speed-to-cost ratio can sometimes exceed 80 percent.
 It includes more than 2500 optimized algorithms, covering many
traditional and cutting-edge computer vision and machine learning
techniques.

Cons of using OpenCV


 Due to the absence of documentation and error handling codes,
OpenCV is difficult to comprehend.
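As a quick illustration of the OpenCV Python bindings described above, here is a minimal
sketch that loads an image and converts it to grayscale; the file name is hypothetical.

import cv2

# Load an image from disk (hypothetical file name, for illustration only)
img = cv2.imread('sample.jpg')
if img is None:
    raise FileNotFoundError('sample.jpg could not be read')

# Convert the BGR image to grayscale and save the result
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imwrite('sample_gray.jpg', gray)
print(img.shape, '->', gray.shape)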

READ DATA AS DATAFRAME

There are various instances when the functioning of a program
requires a large amount of data; this kind of data is often so large
that it's practically impossible to input it using the keyboard and
console.

It is therefore the programming norm to read data from files. Three of
the most popular file formats for data are:

 comma-separated values (.csv)
 pickle (.pkl)
 Excel (.xlsx)

Data that consists of thousands of records is mostly stored in
one of the three file formats mentioned above.
The Pandas library in Python is one of the most widely used
libraries for data reading, cleaning, and analysis.

1. Reading a CSV file

The read_csv method of the Pandas library takes a CSV file as a parameter and
returns a dataframe.

import pandas as pd
df = pd.read_csv('my_csv.csv')

2. Reading a pickle file

The read_pickle method of the Pandas library takes a pickle file as a parameter and
returns a dataframe.

import pandas as pd
df = pd.read_pickle('my_pkl.pkl')

3. Reading an Excel file

The read_excel method of the Pandas library takes an excel file as a parameter and
returns a dataframe.

import pandas as pd
df = pd.read_excel('my_excel.xlsx')

Once the data has been read into a dataframe, display the dataframe to see if the
data has been read correctly.

The DataFrame.to_csv() function in pandas is used to write a DataFrame object
to a CSV file.

Syntax

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None,
columns=None, header=True, index=True, index_label=None, mode='w',
encoding=None, compression='infer', quoting=None, quotechar='"',
line_terminator=None, chunksize=None, date_format=None, doublequote=True,
escapechar=None, decimal='.', errors='strict', storage_options=None)

Parameters

All of the function's parameters are optional.

Below are the commonly used parameters:

 path_or_buf: File path or object.
 sep: Field separator; a string of length 1. The default value is ','.
 na_rep: Missing data representation. The default value is '' (an empty string).
 float_format: A format string for floating-point numbers. The default
value is None.
 columns: Columns to write.
 header: Whether to write out the column names. The default value is True.

Return value

If path_or_buf is provided, the function returns None and the CSV is written to the
given path or buffer. If path_or_buf is None, DataFrame.to_csv() returns the
resulting CSV data as a string.

CODE

import pandas as pd

df = pd.DataFrame({'Name': ['Raphael', 'Donatello'],
                   'Colors': ['red', 'purple']})

# the dataframe is written to file_name.csv
df.to_csv('file_name.csv')

Chapter 3
TRAIN-TEST SPLIT
The train_test_split() method is used to split our data into train and
test sets.

First, we need to divide our data into features (X) and labels (y).
The dataframe gets divided into X_train, X_test, y_train, and y_test.
The X_train and y_train sets are used for training and fitting the model.
The X_test and y_test sets are used for testing whether the model is
predicting the right outputs/labels. We can explicitly set the sizes of
the train and test sets. It is suggested to keep the train set larger
than the test set.

train set: The training dataset is the set of data utilized to
fit the model, i.e. the dataset on which the model is trained. This data
is seen and learned by the model.

test set: The test dataset is a subset of the data, held out from training, that is
utilized to give an accurate evaluation of the final model fit.

validation set: A validation dataset is a sample of data held out from the
model's training data that is used to estimate model performance
while tuning the model's hyperparameters.

Syntax: sklearn.model_selection.train_test_split()
Parameters:
 *arrays: sequence of indexables. Lists, NumPy arrays, SciPy
sparse matrices, and pandas dataframes are all valid inputs.
 test_size: int or float, by default None. If float, it should be
between 0.0 and 1.0 and represent the proportion of the
dataset to include in the test split. If int, it refers to the absolute number of
test samples. If the value is None, the complement of the train
size is used. It will be set to 0.25 if train_size is also None.
 train_size: int or float, by default None.
 random_state: int, by default None. Controls how the data is
shuffled before the split is applied. For reproducible output
across several function calls, pass an int.
 shuffle: boolean, by default True. Whether or not
the data should be shuffled before splitting. stratify must be
None if shuffle=False.
 stratify: array-like, by default None. If not None, the data is split in a
stratified fashion, using this as the class labels.
Returns: splitting: list
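A minimal usage sketch of train_test_split; the toy arrays below stand in for features
and labels prepared from a dataframe.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels (illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 25% of the rows for testing; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)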

LINEAR REGRESSION
Linear Regression is an algorithm that belongs to
supervised Machine Learning. It tries to learn a relation that will predict the
outcome of an event based on the independent variable data points. The
relation is usually a straight line that fits the different data points as closely
as possible. The output is of a continuous form, i.e., a numerical value. For
example, the output could be revenue or sales in currency, the number of
products sold, etc. In such examples, there can be a single independent variable
or several.
1. Linear Regression Equation

Linear regression can be expressed mathematically as:

y = β0 + β1x + ε

Here,

 y = Dependent Variable
 x = Independent Variable
 β0 = Intercept of the line
 β1 = Linear regression coefficient (slope of the line)
 ε = Random error
The last parameter, the random error ε, is required because the best-fit
line does not pass through the data points perfectly.

2. Linear Regression Model

Since the Linear Regression algorithm represents a linear
relationship between a dependent (y) variable and one or more
independent (x) variables, it is known as Linear Regression. This
means it finds how the value of the dependent variable changes
according to the change in the value of the independent
variable. The relation between independent and dependent
variables is a straight line with a slope.

Types of Linear Regression

Linear Regression can be broadly classified into two types of


algorithms:

1. Simple Linear Regression

A simple straight-line equation involving a slope and an
intercept is utilized in simple Linear Regression. A simple form is:

y = mx + c, where y denotes the output, x is the independent
variable, m is the slope, and c is the intercept (the value of y when x = 0).
With this equation, the algorithm trains the machine learning model and gives
the most accurate output.

2. Multiple Linear Regression

When the number of independent variables is more than one, the
governing linear equation applicable to regression takes a
different form:

y = c + m1x1 + m2x2 + … + mnxn, where m1, m2, …, mn represent the coefficients
responsible for the impact of the different independent variables
x1, x2, etc. This machine learning algorithm, when applied, finds
the values of the coefficients m1, m2, etc., and gives the best-fitting
line.

3. Non-Linear Regression

When the best fitting line is not a straight line but a curve, it is
referred to as Non-Linear Regression.

How Does Linear Regression Work?

After understanding the concept of Linear Regression and its
adoption to solve many engineering and business problems, we
will now consider the process of applying Linear Regression in
a Machine Learning project. Let us import the necessary
libraries:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

We will load the dataset using the following command:

# Loading the data

car_data = pd.read_csv('/content/car_data.csv')

Let us check the first few rows of the dataset:

car_data.head()

We can describe the dataset using the .info() command:

# Getting some information about the dataset

car_data.info()
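The report's car price example stops at inspecting the data, so the following is only a
hedged sketch of how the workflow would typically continue; the target column name
'Selling_Price' and the one-hot encoding of the remaining columns are assumptions for
illustration, not details taken from the actual dataset.

# Continuation sketch (column names are assumptions for illustration)
X = pd.get_dummies(car_data.drop(columns=['Selling_Price']), drop_first=True)
y = car_data['Selling_Price']  # assumed target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('R^2 on test set:', metrics.r2_score(y_test, y_pred))
print('MAE on test set:', metrics.mean_absolute_error(y_test, y_pred))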
Assumptions of Linear Regression

Most statistical tests and results rely upon some
specific assumptions regarding the variables involved. Naturally, if these
assumptions are not met, the results will not be reliable. Linear
Regression comes under the same consideration. There are
some common assumptions to be considered while using Linear
Regression:

1. Linearity: Linear Regression models must be linear in the
sense that the output must have a linear association with the input values,
and the method only suits data that has a linear relationship between the two
entities.
2. Homoscedasticity: Homoscedasticity means that the standard
deviation and the variance of the residuals (y − ŷ) must
be the same for any value of x. Multiple Linear Regression assumes that
the amount of error in the residuals is similar at each point of the linear
model. We can check homoscedasticity using scatter plots.
3. Non-multicollinearity: The data should not have multicollinearity, which
means the independent variables should not be highly correlated with
each other. If this occurs, it will be difficult to identify the specific
variables that actually contribute to the variance in the dependent
variable. We can check the data for this using a correlation matrix (see the
short sketch after this section).
4. No Autocorrelation: When data are collected across time, the conventional
Linear Regression model assumes that successive values of the disturbance
component are independent of one another. When this
assumption is not met, the situation is referred to as autocorrelation.
5. Not applicable to Outliers: The value of the dependent variable cannot
be estimated for a value of the independent variable which lies outside the
range of values in the sample data.
All the above assumptions are critical because, if they are not
met, they can lead to conclusions that are invalid and unreliable.
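A short sketch of the multicollinearity check mentioned in assumption 3, using a
correlation matrix over the numeric features of the car dataset loaded above; the column
selection is illustrative.

# Check for multicollinearity among numeric features via a correlation matrix
numeric_features = car_data.select_dtypes(include='number')
corr_matrix = numeric_features.corr()
print(corr_matrix)

# Optional heatmap (seaborn was imported above as sns)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()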

Advantages of Linear Regression


1. For linear datasets, Linear Regression performs well to find the nature of
the relationship among different variables.
2. Linear Regression algorithms are easy to train and the Linear
Regression models are easy to implement.
3. Although Linear Regression models are prone to over-fitting, this can be
avoided using dimensionality reduction techniques, regularization
(L1 and L2), and cross-validation.
Disadvantages of Linear Regression
1. An important disadvantage of Linear Regression is that it assumes
linearity between the dependent and independent variables, which is
rarely represented in real-world data. It assumes a straight-line
relationship between the dependent and independent variables, which is
unlikely many times.
2. It is prone to noise and overfitting. In datasets where the number of
observations is lesser than the attributes, Linear Regression might not be
a good choice as it can lead to overfitting. This is because the algorithm
can start considering the noise while building the model.
3. Sensitive to outliers, it is essential to pre-process the dataset and remove
the outliers before applying Linear Regression to the data.
4. It assumes there is no multicollinearity. If there is any relationship between
the independent variables, i.e., multicollinearity, then it needs to be
removed using dimensionality reduction techniques before applying Linear
Regression, as the algorithm assumes that there is no relationship among the
independent variables.
Key Benefits of Linear Regression

Linear Regression is popular in statistics. It offers several
benefits in Data Science, as follows:

1. Easy to Implement

A Linear Regression machine learning model is computationally
simple and does not require much engineering overhead. Hence,
it is easy to implement and maintain.

2. Scalability

Since Linear Regression is computationally inexpensive, it can
be applied to cases where scaling is needed, such as
applications that handle big data.

3. Interpretability

Linear Regression is easy to interpret and very efficient to train.
It is relatively simple, unlike deep learning neural networks,
which require more data and time to train efficiently.

4. Applicability in Real Time

As Linear Regression models are easy to train and do not require
much computational power, they can be retrained quickly with
new data and hence can be applied to scenarios where real-
time predictions are important.

LOGISTIC REGRESSION
o Logistic regression is one of the most popular Machine Learning
algorithms, and it comes under the Supervised Learning technique. It is
used for predicting the categorical dependent variable using a given set of
independent variables.
o Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete value. It
can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the
exact values 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in
how it is used. Linear Regression is used for solving regression
problems, whereas Logistic Regression is used for solving
classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-
shaped logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something,
such as whether cells are cancerous or not, or whether a mouse is obese or not
based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
o Logistic Regression can be used to classify observations using
different types of data and can easily determine the most effective
variables used for the classification. The logistic function is described
below.

Logistic regression uses the concept of predictive modeling as
regression, which is why it is called logistic regression, but it is used to
classify samples; therefore, it falls under classification
algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the
predicted values to probabilities (a short code sketch follows this list).
o It maps any real value to another value within the range of 0 and 1.
o The value given by logistic regression must be between 0 and 1; since it
cannot go beyond this limit, it forms a curve like the "S" form. The S-
form curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of the threshold value, which
defines the probability of either 0 or 1. Values above the threshold
tend towards 1, and values below the threshold tend towards 0.
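A minimal sketch of the sigmoid function described above, mapping real values into the
(0, 1) range:

import numpy as np

def sigmoid(z):
    # Map any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # approx [0.0067 0.5 0.9933]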

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multicollinearity.

Logistic Regression Equation:

The logistic regression equation can be obtained from the Linear
Regression equation. The mathematical steps to get the Logistic Regression
equation are given below:

o We know the equation of a straight line can be written as:

  y = b0 + b1x1 + b2x2 + … + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's
divide the above equation by (1 − y):

  y / (1 − y); this is 0 for y = 0 and infinity for y = 1

o But we need a range between −infinity and +infinity, so taking the logarithm of
the equation, it becomes:

  log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.

Types of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into
three types:

o Binomial: In binomial logistic regression, there can be only two possible
types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog",
or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible
ordered types of the dependent variable, such as "Low", "Medium", or "High".

To understand the implementation of Logistic Regression in Python, we
will use the example below:

Example: We are given a dataset which contains information about
various users, obtained from a social networking site. A car
manufacturing company has recently launched a new SUV, and the
company wants to check how many users from the dataset want to
purchase the car.

For this problem, we will build a Machine Learning model using the
Logistic Regression algorithm. In
this problem, we will predict the purchased variable (dependent
variable) by using age and salary (independent variables).

Steps in Logistic Regression: To implement Logistic Regression
using Python, we will use the same steps as we have followed in previous
regression topics. Below are the steps:

o Data pre-processing step
o Fitting Logistic Regression to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result

1. Data Pre-processing step: In this step, we will pre-process/prepare
the data so that we can use it in our code efficiently. It will be the same as
we have done in the data pre-processing topic. The code for this is given
below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the
output.

Now, we will extract the dependent and independent variables from the
given dataset. Below is the code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken columns [2, 3] for x because our independent
variables, age and salary, are at indexes 2 and 3. And we have taken index 4
for the y variable because our dependent variable is at index 4.

Now we will split the dataset into a training set and a test set. Below is the
code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

In logistic regression, we will do feature scaling because we want accurate
results from the predictions. Here we will only scale the independent variables
because the dependent variable has only 0 and 1 values. Below is the code
for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

2. Fitting Logistic Regression to the Training Set

We have prepared our dataset, and now we will train the model
using the training set. For fitting the model to the
training set, we will import the LogisticRegression class of
the sklearn library.

After importing the class, we will create a classifier object and use it to fit
the logistic regression model. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the
result by using the test set data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to hold the predicted test set
results.

Output: By executing the above code, a new vector (y_pred) will be
created under the variable explorer option. It contains the
predictions, indicating which users want to purchase the car and which do not.

4. Test Accuracy of the Result

Now we will create the confusion matrix to check the accuracy of the
classification. To create it, we need to import
the confusion_matrix function of the sklearn library. After importing the
function, we will call it and store the result in a new variable, cm. The function takes two
main parameters, y_true (the actual values) and y_pred (the values predicted
by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:

By executing the above code, a new confusion matrix will be created.

We can find the accuracy of the predicted result by interpreting the
confusion matrix. From the output, we can interpret that 65 + 24 = 89
predictions are correct and 8 + 3 = 11 predictions are incorrect.
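The same figure can also be computed directly with accuracy_score; a short sketch,
assuming the 100-row test set produced by the 25% split above:

from sklearn.metrics import accuracy_score

# Accuracy = correct predictions / total predictions
print(accuracy_score(y_test, y_pred))  # about 0.89 for the counts above (89 out of 100)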

5. Visualizing the Training Set Result

Finally, we will visualize the training set result. To visualize the result, we
will use the ListedColormap class of the matplotlib library. Below is the code for
it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the
Matplotlib library to create the colormap for visualizing the result. We
have created two new variables, x_set and y_set, to
replace x_train and y_train. After that, we have used
the nm.meshgrid command to create a rectangular grid, which ranges
from the minimum of each feature minus 1 to its maximum plus 1. The pixel points we
have taken are of 0.01 resolution.

To create a filled contour, we have used the mtp.contourf command; it
creates regions of the provided colors (purple and green). In this function, we
have passed classifier.predict to show the data points
predicted by the classifier.

Output: Executing the above code produces a plot of the training set
observations over the classifier's decision regions.

The graph can be explained in the below points:

o In the above graph, we can see that there are some Green
points within the green region and Purple points within the purple
region.

o All these data points are the observation points from the training
set, which shows the result for purchased variables.
o This graph is made by using two independent variables i.e., Age on
the x-axis and Estimated salary on the y-axis.
o The purple point observations are for which purchased
(dependent variable) is probably 0, i.e., users who did not purchase
the SUV car.
o The green point observations are for which purchased
(dependent variable) is probably 1 means user who purchased the
SUV car.
o We can also estimate from the graph that the users who are
younger with low salary, did not purchase the car, whereas older
users with high estimated salary purchased the car.
o But there are some purple points in the green region (predicted as buying the
car) and some green points in the purple region (predicted as not buying the
car). These are observations that do not follow the general trend and
correspond to the errors already counted in the confusion matrix.

The goal of the classifier:

We have successfully visualized the training set result for the logistic
regression, and our goal for this classification is to divide the users who
purchased the SUV car and who did not purchase the car. So from the
output graph, we can clearly see the two regions (Purple and Green) with
the observation points. The Purple region is for those users who didn't buy
the car, and Green Region is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a straight line or linear in
nature, as we have used the linear model for Logistic Regression. In
further topics, we will learn about non-linear classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize
the result for new observations (Test set). The code for the test set will
remain same as above except that here we will use x_test and
y_test instead of x_train and y_train. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The resulting graph shows the test set
result. As we can see, the graph is divided into two regions (purple and
green), and green observations are in the green region while purple
observations are in the purple region. So we can say it is a good prediction
and model. Some of the green and purple data points are in different
regions, which can be ignored, as we have already accounted for this error
using the confusion matrix (11 incorrect outputs). Hence our model is pretty
good and ready to make new predictions for this classification problem.

Project 1

Fraud Detection

Fraud in credit card transactions is common today. With the
advancement of technology there has been an increase in online transactions,
but there has also been an increase in credit card fraud, causing huge losses.

Therefore, there is a need for effective methods to detect and prevent
fraud. You are encouraged to use multiple Machine Learning algorithms,
such as support vector machines (SVM), k-nearest neighbors (KNN), and
artificial neural networks (ANN), to predict the occurrence of fraud (a short
sketch follows below).

Further, analyze and comment on the results achieved by different
machine learning and deep learning techniques to classify fraudulent and non-
fraudulent transactions.
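A hedged sketch of comparing two of the suggested classifiers; the synthetic, imbalanced
dataset below is only a stand-in for a real credit card fraud dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Synthetic, imbalanced stand-in for a fraud dataset (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare two of the suggested algorithms
for name, model in [('SVM', SVC()), ('KNN', KNeighborsClassifier())]:
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))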

Project 2

Car Price Prediction

Car price prediction is a really interesting machine learning problem, as
there are many factors that influence the price of a car in the second-hand
market. In this project, we will be looking at a dataset based on the
sale/purchase of cars, where our end goal will be to predict the price of a
car given its features, in order to maximize profit.

Project 3

Thyroid Disease Prediction

Thyroid disease is a major concern in medical diagnosis, and
predicting its onset remains a difficult problem in medical
research. The thyroid gland is one of the most important organs in our body.
The secretion of thyroid hormones is responsible for controlling the
metabolism. Hyperthyroidism and hypothyroidism are the two
common diseases of the thyroid, which releases thyroid hormones to
regulate the rate of the body's metabolism. Data cleansing techniques were
applied to make the data clean enough for performing analytics to
show the risk of patients developing thyroid disease. Machine learning plays a
decisive role in the process of disease prediction.

Project 4

Black Friday Sales Prediction

This dataset comprises sales transactions captured at a retail store. It is
a classic dataset to explore and expand your feature engineering skills
and day-to-day understanding of multiple shopping experiences. This is
a regression problem. The dataset has 550,069 rows and 12 columns.
Problem: Predict the purchase amount.

Data Overview: The training dataset has 537,577 rows (transactions) and 10 columns,
as described below:

User_ID: Unique ID of the user. There are a total of 5,891 users in the
dataset.

Product_ID: Unique ID of the product. There are a total of 3,623 products in
the dataset.

Gender: Indicates the gender of the person making the transaction.

Age: Indicates the age group of the person making the transaction.

Occupation: Shows the occupation of the user, already labeled with
numbers 0 to 20.

City_Category: The user's living city category. Cities are categorized into 3
different categories: 'A', 'B' and 'C'.

Stay_In_Current_City_Years: Indicates how long the user has lived in
this city.

Marital_Status: 0 if the user is not married and 1 otherwise.

Product_Category_1 to _3: Category of the product. All 3 are already
labeled with numbers.

Purchase: Purchase amount.

Project 5

Data Science Jobs

A company which is active in Big Data and Data Science wants to hire
data scientists from among the people who successfully pass some courses
conducted by the company. Many people sign up for this training. The company
wants to know which of these candidates really want to work for the
company after training and which are looking for new employment, because this helps
to reduce cost and time as well as improve the quality of training, the planning of
the courses, and the categorization of candidates. Information related to
demographics, education, and experience is available from the candidates' signup
and enrollment. This dataset is also designed to understand the factors that lead
a person to leave their current job, which is useful for HR research. Using model(s) that
take the current credentials, demographics, and experience data, you will predict
the probability that a candidate will look for a new job or will work for the
company, as well as interpret the factors affecting the employee's decision.
The whole dataset is divided into train and test sets. The target isn't included in the
test set, but the test target values data file is available for related tasks. A sample
submission corresponding to the enrollee_id of the test set is provided too, with
the columns: enrollee_id, target.

Note: The dataset is imbalanced. Most features are categorical (nominal,
ordinal, binary), some with high cardinality. Missing-value imputation can be a
part of your pipeline as well.

Features:

enrollee_id: Unique ID for the candidate

city: City code

city_development_index: Development index of the city (scaled)

gender: Gender of candidate

relevent_experience: Relevant experience of candidate

enrolled_university: Type of University course enrolled if any

education_level: Education level of candidate

major_discipline :Education major discipline of candidate

experience: Candidate total experience in years

company_size: No of employees in current employer's company

company_type : Type of current employer

lastnewjob: Difference in years between previous job and current job

training_hours: training hours completed

target: 0 – Not looking for job change, 1 – Looking for a job change

Inspiration

Predict the probability of a candidate working for the company.

Interpret the model(s) in such a way as to illustrate which features affect
the candidate's decision.