Unit 1 Notes

Uploaded by

bronew601
Copyright
© © All Rights Reserved
Course Name: Machine Learning

with Python
Course Code: BCA 311
Syllabus - UNIT–I

 Introduction to Machine Learning, Why Machine
Learning, Types of Machine Learning Problems,
Applications of Machine Learning. Supervised
Machine Learning- Regression and Classification.
Binary Classifier, Multiclass Classification, Multilabel
Classification. Performance Measures, Confusion
Matrix, Accuracy, Precision & recall, ROC Curve.
 Advanced Python- NumPy, Pandas.
 Python Machine Learning Library Scikit-Learn,
Linear Regression with one Variable, Linear
Regression with Multiple Variables, Logistic
Regression.
Why Machine learning
 Business organizations use huge amounts of data for their daily
activities.
 Earlier, the full potential of this data was not utilized, for
two reasons.
 First, data was scattered across different archive
systems, and organizations were not able to integrate these
sources fully.
 Second, there was a lack of awareness about software tools that
could help unearth useful information from the data.
 Not anymore! Business organizations have now started to use
machine learning for this purpose.
 Machine learning has become popular for three
reasons:
 1. High volume of available data: Big companies
such as Facebook, Twitter, and YouTube generate huge amounts
of data that grow at a phenomenal rate. It is estimated that
the volume of data roughly doubles every year.
 2. The cost of storage has reduced, and the
hardware cost has also dropped. It is therefore easier now to
capture, process, store, distribute, and transmit digital
information.
 3. Complex algorithms are now readily
available. Especially with the
advent of deep learning, many algorithms are available for
machine learning.
 With its popularity and ready adoption by
business organizations, machine learning has become a dominant technology
trend.
Knowledge Pyramid

All facts are data. Data can be numbers or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data with data sources such as flat
files, databases, or data warehouses in different storage formats. Processed data is called
information. This includes patterns, associations, or relationships among data. For example,
sales data can be analyzed to extract information such as which product is the fastest selling.
Condensed information is called knowledge. For example, the historical patterns and future
trends obtained in the above sales data can be called knowledge. Unless knowledge is
extracted, data is of no use. Similarly, knowledge is not useful unless it is put into action.
Intelligence is the applied knowledge for actions. An actionable form of knowledge is called
intelligence. Computer systems have been successful till this stage. The ultimate objective of
knowledge pyramid is wisdom that represents the maturity of mind that is, so far, exhibited only
by humans. Here comes the need for machine learning. The objective of machine learning is to
process these archival data for organizations to take better decisions to design new products,
improve the business processes, and to develop effective decision support systems.
Introduction to Machine
Learning
 The term machine learning was first introduced by Arthur Samuel in 1959. We can define
it as: “Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.”
 Machine learning is a growing technology which enables computers to learn
automatically from past data. Machine learning uses various algorithms for building
mathematical models and making predictions using historical data or information.
Currently, it is being used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender system, and many
more.
 Just as humans take decisions based on experience, computers build models based on
patterns extracted from the input data and then use these models for prediction
and decision making.
 In statistical learning, the relationship between the input x and output y is modeled
as a function in the form y = f(x). Here, f is the learning function that maps the input
x to output y. Learning of function f is the crucial aspect of forming a model in
statistical learning. In machine learning, this is simply called mapping of input to
output. The learning program summarizes the raw data in a model.
 Formally stated, a model is an explicit description of patterns within the data in the
form of:
 1. Mathematical equation
 2. Relational diagrams like trees/graphs
 3. Logical if/else rules, or
 4. Groupings called clusters
 The difference between pattern and model is that the former is local and applicable
only to certain attributes but the latter is global and fits the entire dataset. For
example, a model can be helpful to examine whether a given email is spam or not.
The point is that the model is generated automatically from the given data.
 Tom Mitchell, another pioneer of AI, defines machine learning as follows: “A
computer program is said to learn from experience E, with respect to task T
and some performance measure P, if its performance on T measured by P
improves with experience E.”
 For example, while creating a program for playing chess:
 T : Playing the game
 E : Experience of games played by previous players
 P : Probability of winning the game
How does Machine Learning work
 A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of predicted output depends upon the amount
of data, as the huge amount of data helps to build a better model which
predicts the output more accurately.
 Suppose we have a complex problem where we need to make some
predictions. Instead of writing code for it, we feed the data
to generic algorithms, and with the help of these algorithms the machine builds
the logic from the data and predicts the output. Machine learning has
changed our way of thinking about the problem. The block diagram below
explains the working of a Machine Learning algorithm:
Machine Learning and Artificial Intelligence
 Machine learning is an important branch of AI, which is a much broader
subject. The aim of AI is to develop intelligent agents. An agent can be a
robot, a human, or any autonomous system. Machine learning is the
sub-branch of AI whose aim is to extract patterns for prediction. It is a
broad field that includes learning from examples and other areas like
reinforcement learning.
 AI is a bigger concept to create intelligent machines that can simulate human thinking
capability and behavior, whereas, machine learning is an application or subset of AI that
allows machines to learn from data without being programmed explicitly.
 Artificial intelligence
• AI allows a machine to simulate human intelligence to solve problems
• The goal is to develop an intelligent system that can perform complex tasks
• We build systems that can solve complex tasks like a human
• AI has a wide scope of applications
• AI uses technologies in a system so that it mimics human decision-making
• AI works with all types of data: structured, semi-structured, and unstructured
• AI systems use logic and decision trees to learn, reason, and self-correct
 Machine learning
• ML allows a machine to learn autonomously from past data
• The goal is to build machines that can learn from data to increase the accuracy of the
output
• We train machines with data to perform specific tasks and deliver accurate results
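• Machine learning has a narrower scope of applications than AI
• ML uses self-learning algorithms to produce predictive models
• ML is most often applied to structured and semi-structured data, although deep learning models also handle unstructured data such as images and text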
• ML systems rely on statistical models to learn and can self-correct when provided with
new data
Types of Machine Learning Problems
 Based on the methods and way of learning, machine learning is
divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Supervised Machine Learning
 Supervised learning is the type of machine learning in which machines are
trained using well "labelled" training data, and on the basis of that data, machines
predict the output. Labelled data means that some input data is already tagged with
the correct output.
 In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly.
 Imagine a teacher supervising a class. The teacher already knows the correct
answers but the learning process doesn’t stop until the students learn the answers
as well. This is the essence of Supervised Machine Learning Algorithms. Here, the
algorithm learns from a training dataset and makes predictions that are compared
with the actual output values. If the predictions are not correct, then the algorithm
is modified until it is satisfactory. This learning process continues until the algorithm
achieves the required level of performance.
 Supervised learning is a process of providing input data as well as correct output
data to the machine learning model. The aim of a supervised learning algorithm is
to find a mapping function to map the input variable(x) with the output
variable(y).
 In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
How Supervised Learning Works?
In supervised learning, models are trained using a labelled dataset, where
the model learns about each type of data. Once the training process is
completed, the model is tested on test data (a held-out portion of the
labelled dataset that was not used for training), and then it predicts the output.

In the example given below, the machine is already trained on all types
of shapes, and when it finds a new shape, it classifies the shape on the
basis of number of sides, and predicts the output.
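The shape example can be sketched with scikit-learn as follows. This is only a minimal sketch: the tiny dataset is made up for illustration, the single feature (number of sides) is an assumption taken from the description above, and a decision tree is used as a stand-in since the notes do not say which model the figure depicts.

```python
from sklearn.tree import DecisionTreeClassifier

# Labelled training data (hypothetical): one feature, the number of sides.
X_train = [[3], [4], [6], [3], [4], [6]]
y_train = ["triangle", "square", "hexagon", "triangle", "square", "hexagon"]

# Training: the model learns a mapping from the feature to the label.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: a new shape with 4 sides is classified from its number of sides.
prediction = model.predict([[4]])[0]
print(prediction)
```

The same fit-then-predict pattern applies to the other scikit-learn estimators mentioned in these notes.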
Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems
based upon the type of output they produce: classification and regression.
Classification
 Classification is a type of supervised learning where the data can be
divided into categories or classes. Classification is used to predict a discrete label. The
outputs fall under a finite set of possible outcomes.
 For example:
 A bank may have a customer dataset containing credit history, loans, investment
details, etc. and they may want to know if any customer will default. In the historical
data, we will have Features and Target.
• Features will be attributes of a customer such as credit history, loans, investments,
etc.
• Target will represent whether a particular customer has defaulted in the past
(normally represented by 1 or 0 / True or False / Yes or No).
 There are many machine learning algorithms that can be used for classification tasks.
Some of them are:
• Logistic Regression
• Decision Tree Classifier
• K Nearest Neighbor Classifier
• Random Forest Classifier
• Support Vector Machine (SVM)
Binary Classifier
 A binary classifier is a type of supervised learning algorithm that can
predict whether a new observation belongs to one of two possible
classes, such as yes or no, spam or not spam, etc. A binary classifier
learns from a labeled dataset, where each observation has a known
class, and then applies a classification rule to assign a class to a new
observation.
 It can only have 2 outcomes. E.g.
• Predict whether an email is spam or not
• Predict whether it will rain or not
• Predict whether a user is a power user or a casual user
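A binary classifier for the spam case might look like the sketch below. The feature (a count of suspicious words per email) and the eight training rows are hypothetical, and logistic regression is just one of the classifiers listed earlier that could be used here.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: number of suspicious words in an email.
X = [[0], [1], [2], [3], [7], [8], [9], [10]]
# Labels: 0 = not spam, 1 = spam.
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# Classify two new emails, with 1 and 9 suspicious words respectively.
predictions = clf.predict([[1], [9]])
# Each row of probabilities holds P(not spam), P(spam) for one email.
probabilities = clf.predict_proba([[1], [9]])
print(predictions)
print(probabilities)
```

Only two outcomes are possible for each email, which is what makes this a binary problem.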
Multiclass Classification
 Multi-class classification has the same idea behind binary
classification, except instead of two possible outcomes, there are
three or more.
 For example:
• Predict whether a photo contains a pear, apple, or peach
• Predict what letter of the alphabet a handwritten character is
• Predict whether a piece of fruit is small, medium, or large
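The fruit-size example can be sketched in the same way, now with three classes instead of two. The weights below are invented for illustration, and k-nearest neighbors is an arbitrary choice of classifier for this sketch.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature: weight of a piece of fruit in grams.
X = [[50], [60], [70], [150], [160], [170], [300], [320], [340]]
y = ["small", "small", "small",
     "medium", "medium", "medium",
     "large", "large", "large"]

# Each prediction is a majority vote among the 3 closest training points.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

predictions = list(clf.predict([[65], [155], [310]]))
print(predictions)
```

Each prediction is exactly one of the three classes, never a combination of them.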
Multilabel Classification
 An important note about binary and multi-class classification is that in both, each outcome
has one specific label. However, in multi-label classification, there are multiple possible
labels for each outcome.
 Multi-label classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict multiple non-exclusive labels for each instance. For
example, an algorithm can learn to predict which traffic signs are present in an image, as
illustrated below.
 This is useful for customer segmentation, image categorization, and sentiment analysis for
understanding text.
 In multi-label classification, each instance can belong to none, one, or more than one class.
For example, an email can be classified as spam or not spam (binary classification), a fruit
can be classified as apple, banana, or orange (multi-class classification), but a movie can
be classified as comedy, drama, action, romance, etc. (multi-label classification).
 However, most classification algorithms were originally designed for single-label problems and need to be
adapted for multi-label problems. There are two main approaches to adapt single-label
algorithms for multi-label problems: problem transformation and algorithm
adaptation.
 Problem transformation methods convert the multi-label problem into one or more single-
label problems that can be solved by existing algorithms.
 Algorithm adaptation methods modify the existing algorithms to directly handle multi-label
data without transforming the problem.
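The problem transformation approach can be sketched with scikit-learn as below. This is a sketch under stated assumptions: the two movie features and the genre labels are hypothetical, and the transformation shown is the simplest one (one independent binary classifier per label), built from MultiLabelBinarizer and OneVsRestClassifier.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical movie features: [has_jokes, has_fight_scenes], repeated for more rows.
X = [[1, 0], [0, 1], [1, 1], [0, 0]] * 5
# Each movie may carry zero, one, or several genre labels.
genres = [["comedy"], ["action"], ["comedy", "action"], []] * 5

# Problem transformation: turn the label sets into one 0/1 column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# One independent binary classifier is trained per label column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# Map the predicted 0/1 rows back to genre names.
predicted = mlb.inverse_transform(clf.predict([[1, 1], [0, 0]]))
print(predicted)
```

A movie with both features receives both labels, while a movie with neither receives none, which a multi-class classifier could not express.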
Regression
 Regression is a type of supervised machine learning where algorithms
learn from the data to predict continuous values such as sales, salary,
weight, or temperature. For example:
 Given a dataset containing features of houses such as lot size, number of
bedrooms, number of baths, neighborhood, etc. and the price of each
house, a Regression algorithm can be trained to learn the relationship
between the features and the price of the house.
 Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
For simple linear regression, the model is y = mx + b,
where m is the slope of the regression line and b is the
y-intercept.
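As a rough illustration of the model y = mx + b, the sketch below fits a line with scikit-learn's LinearRegression. The data is synthetic and generated without noise from y = 2x + 1, an assumption made so that the recovered slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2x + 1 (no noise, so the fit is exact).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()
model.fit(x, y)

m = model.coef_[0]      # slope of the regression line
b = model.intercept_    # y-intercept
predicted = model.predict([[6.0]])[0]
print(m, b, predicted)
```

With real data the points would not lie exactly on a line, and m and b would be the values that minimize the squared error.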
 Advantages of Supervised learning:
• With the help of supervised learning, the model can predict the
output on the basis of prior experiences.
• In supervised learning, we can have an exact idea about the classes
of objects.
• Supervised learning model helps us to solve various real-world
problems such as fraud detection, spam filtering, etc.
 Disadvantages of supervised learning:
• Supervised learning models may struggle with very complex tasks unless
large amounts of labelled data are available.
• Supervised learning may not predict the correct output if the test data
differs substantially from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes
of objects.
Unsupervised Machine Learning
 Unsupervised learning is the training of a machine using information that is neither
classified nor labeled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
 Unlike supervised learning, no teacher is provided, which means no labelled examples are given
to the machine. Therefore the machine is left to find the hidden structure in
unlabeled data by itself.
 Unsupervised learning cannot be directly applied to a regression or classification
problem because unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the underlying
structure of dataset, group that data according to similarities, and represent that
dataset in a compressed format.
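Grouping unlabeled data by similarity can be sketched with k-means clustering. The six points below are made up, and the number of clusters (2) is an assumption supplied by hand; note that no labels are passed to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points (hypothetical): two visually separate groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.0], [8.0, 9.5]])

# Only the inputs are given; the algorithm assigns a cluster index to each point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
print(kmeans.cluster_centers_)
```

The cluster indices themselves are arbitrary; what matters is which points end up sharing an index.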
 Some commonly used unsupervised algorithms are:
• K-means clustering
• KNN (k-nearest neighbors), used here for unsupervised neighbor search; as a classifier, KNN is a supervised algorithm
• Hierarchical clustering
• Neural Networks
• Principal Component Analysis
• Apriori algorithm
 Some common applications of unsupervised algorithms are network analysis,
recommendation systems, anomaly detection, and singular value
decomposition (extracting singular values from data).
Types of Unsupervised Learning Algorithm:

 Unsupervised learning algorithms can be further categorized
into two types of problems:
• Clustering: Clustering is a method of grouping objects into
clusters such that objects with the most similarities remain in one
group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of
those commonalities.
• Association: Association rule learning is an unsupervised learning
method used for finding relationships between
variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules can make marketing
strategy more effective. For example, people who buy item X (say,
bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
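The bread and butter rule can be quantified with two standard measures, support and confidence. The sketch below computes them in plain Python; the five shopping baskets are hypothetical, and the helper names support and confidence are chosen for this illustration rather than taken from any library.

```python
# Hypothetical shopping baskets for a market basket analysis.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "butter"},
    {"bread", "butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain butter.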
 Advantages of unsupervised learning:
• It does not require training data to be labeled.
• Dimensionality reduction can be easily accomplished using unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Flexibility: Unsupervised learning is flexible in that it can be applied to a wide variety of
problems, including clustering, anomaly detection, and association rule mining.
• Exploration: Unsupervised learning allows for the exploration of data and the discovery of
novel and potentially useful patterns that may not be apparent from the outset.
• Low cost: Unsupervised learning is often less expensive than supervised learning because it
doesn’t require labeled data, which can be time-consuming and costly to obtain.
 Disadvantages of unsupervised learning :
• Difficult to measure accuracy or effectiveness due to lack of predefined answers during
training.
• The results are often less accurate.
• The user needs to spend time interpreting and labelling the classes that result from the grouping.
• Lack of guidance: Unsupervised learning lacks the guidance and feedback provided by
labeled data, which can make it difficult to know whether the discovered patterns are relevant
or useful.
• Sensitivity to data quality: Unsupervised learning can be sensitive to data quality, including
missing values, outliers, and noisy data.
• Scalability: Unsupervised learning can be computationally expensive, particularly for large
datasets or complex algorithms, which can limit its scalability.
Semi-supervised machine learning
 Semi-Supervised learning is a type of Machine Learning
algorithm that lies between Supervised and Unsupervised
machine learning. It represents the intermediate ground between
Supervised (With Labelled training data) and Unsupervised learning
(with no labelled training data) algorithms and uses the combination
of labelled and unlabeled datasets during the training period.
 It solves classification problems, which means you’ll ultimately need a
supervised learning algorithm for the task. But at the same time, you
want to train your model without labeling every single training
example, for which you’ll get help from unsupervised machine
learning techniques.
 The main aim of semi-supervised learning is to effectively use all the
available data, rather than only labelled data as in supervised
learning. Initially, similar data is clustered using an unsupervised
learning algorithm, and the clusters then help to assign labels to the unlabeled
data. This is useful because labelled data is comparatively more
expensive to acquire than unlabeled data.
 A real-world example of semi-supervised machine learning, a
text document classifier, involves the following steps:
 Start with a small amount of labeled data and a large amount of
unlabeled data.
 Convert the text documents into numerical features using feature
extraction techniques.
 Apply a semi-supervised learning algorithm to train a model using
both the labeled and unlabeled data.
 Evaluate the model's performance on a validation set.
 Use the trained model to make predictions on new, unseen data.
 The text document classifier can be used to classify documents into
different categories, such as spam or non-spam emails, news articles,
or customer reviews. By leveraging the large amount of unlabeled
data, semi-supervised learning can help improve the accuracy and
efficiency of the text document classifier, making it more effective in
real-world applications.
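The idea of training on a few labeled and many unlabeled examples can be sketched with scikit-learn's LabelSpreading. This sketch makes several assumptions: it skips the text feature extraction step and uses synthetic two-dimensional points as a stand-in for document features, it keeps only five labels per class, and it checks the result against the hidden true labels rather than a separate validation set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for document features: two well-separated groups.
X, y_true = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=0)

# Keep only 5 labels per class; scikit-learn marks unlabeled points with -1.
y_partial = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = np.where(y_true == cls)[0][:5]
    y_partial[keep] = cls

# Labels spread from the labeled points to their neighbors in feature space.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every training point.
accuracy = (model.transduction_ == y_true).mean()
print(accuracy)
```

Only 10 of the 200 points carry a label during training; the rest are labeled by the algorithm.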
Advantages and disadvantages of Semi-supervised Learning

 Advantages:
• It is simple and easy to understand the algorithm.
• It is highly efficient.
• It is used to solve drawbacks of Supervised and Unsupervised
Learning algorithms.
 Disadvantages:
• Iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.
Reinforcement learning
• Reinforcement Learning is a feedback-based Machine learning technique in
which an agent learns to behave in an environment by performing the actions
and seeing the results of actions. For each good action, the agent gets positive
feedback, and for each bad action, the agent gets negative feedback or penalty.
• In Reinforcement Learning, the agent learns automatically using feedback
without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience
only.
• RL solves a specific type of problem where decision making is sequential, and
the goal is long-term, such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary
goal of an agent in reinforcement learning is to improve the performance by
getting the maximum positive rewards.
• The agent learns through trial and error, and based on the experience,
it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method
where an intelligent agent (computer program) interacts with the
environment and learns to act within that." How a robotic dog learns the
movement of its limbs is an example of Reinforcement learning.
Terms used in reinforcement learning
• Agent: An entity that can perceive/explore the environment and
act upon it.
• Environment: The situation in which an agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which
means it is random in nature.
• Action: Actions are the moves taken by an agent within the
environment.
• State: The situation returned by the environment after each
action taken by the agent.
• Reward: Feedback returned to the agent from the environment
to evaluate the action of the agent.
• Policy: The strategy applied by the agent to choose the next
action based on the current state.
• Value: The expected long-term return with the discount factor applied,
as opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one
additional parameter, the current action (a).
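The terms above can be tied together in a small tabular Q-learning sketch. Everything here is an illustrative assumption: the environment is a made-up five-cell corridor with a reward only at the right end, it is deterministic (simpler than the stochastic case mentioned above), and the learning rate, discount factor, and exploration rate are arbitrary choices.

```python
import random

random.seed(0)

N_STATES = 5            # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration

# Q-table: Q[state][action], initialised to zero.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(state):
    """Policy: epsilon-greedy on the Q-table, random when the values tie."""
    if random.random() < EPSILON or Q[state][0] == Q[state][1]:
        return random.choice(ACTIONS)
    return 0 if Q[state][0] > Q[state][1] else 1

for _ in range(500):    # episodes
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q toward reward + discounted best future value.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# Greedy policy for the non-terminal states (1 means "move right").
policy = [0 if q[0] > q[1] else 1 for q in Q[:-1]]
print(policy)
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest path to the reward.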
Working of Reinforcement learning
Performance Measures
 Evaluating the performance of a Machine learning model is one of the important
steps while building an effective ML model.
 To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
 These performance metrics help us understand how well our model has
performed for the given data. In this way, we can improve the model's
performance by tuning the hyper-parameters.
 Each ML model aims to generalize well on unseen/new data, and performance
metrics help determine how well the model generalizes on the new dataset.
 In supervised machine learning, tasks are broadly divided into classification and
regression. Different evaluation metrics are used for regression and
classification tasks. They are as follows:
 Performance Metrics for Classification : Confusion Matrix, Accuracy, Precision
& recall, ROC Curve.
 Performance Metrics for Regression : Mean Absolute Error, Mean Squared
Error, R2 Score.
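The three regression metrics named above can be computed by hand as in the sketch below. The true and predicted values are made up for illustration; the formulas are the standard definitions of Mean Absolute Error, Mean Squared Error, and the R2 score.

```python
# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]
n = len(y_true)

# Mean Absolute Error: average size of the errors.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Mean Squared Error: average of the squared errors.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# R2 score: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

Lower MAE and MSE are better, while an R2 score closer to 1 indicates a better fit.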
Confusion Matrix
 A confusion matrix is a tabular representation of the prediction outcomes of
a classifier, which is used to describe the performance of the
classification model on a set of test data when true values are known.
 It is a tabular visualization of the ground-truth labels versus model
predictions.
 Each row of the confusion matrix represents the instances in a predicted
class and each column represents the instances in an actual class.
 Confusion Matrix is not exactly a performance metric but sort of a basis
on which other metrics evaluate the results.
 For binary classification, the matrix is a 2×2 table. For multi-class
classification, the matrix size depends on the number of classes, i.e.
for n classes it is n×n.
 The following 4 are the basic terminology which will help us in determining
the metrics we are looking for.
• True Positives (TP): when the actual value is Positive and predicted is
also Positive.
• True negatives (TN): when the actual value is Negative and prediction is
also Negative.
• False positives (FP): When the actual is negative but prediction is
Positive. Also known as the Type 1 error
• False negatives (FN): When the actual is Positive but the prediction is
Negative. Also known as the Type 2 error
A good model is one which has high TP and TN rates and low FP and FN rates.
Understanding Confusion Matrix in an easier way:
 We have a total of 20 cats and dogs and our model predicts
whether it is a cat or not.
 Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’,
‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’,
‘dog’, ‘dog’, ‘cat’]
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,
‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’,
‘cat’, ‘dog’, ‘dog’, ‘cat’]
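The four cells of the confusion matrix for these two lists can be tallied with a few lines of plain Python. The sketch treats cat as the positive class, which follows from the statement that the model predicts whether an animal is a cat or not.

```python
actual = ["dog", "cat", "dog", "cat", "dog", "dog", "cat",
          "dog", "cat", "dog", "dog", "dog", "dog", "cat",
          "dog", "dog", "cat", "dog", "dog", "cat"]
predicted = ["dog", "dog", "dog", "cat", "dog", "dog", "cat",
             "cat", "cat", "cat", "dog", "dog", "dog", "cat",
             "dog", "dog", "cat", "dog", "dog", "cat"]

POSITIVE = "cat"  # the class the model is asked to detect

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == POSITIVE and p == POSITIVE)  # cat predicted as cat
tn = sum(1 for a, p in pairs if a != POSITIVE and p != POSITIVE)  # dog predicted as dog
fp = sum(1 for a, p in pairs if a != POSITIVE and p == POSITIVE)  # dog predicted as cat
fn = sum(1 for a, p in pairs if a == POSITIVE and p != POSITIVE)  # cat predicted as dog

print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```

For these lists the tally is TP = 6, FP = 2, FN = 1, and TN = 11, which sums to the 20 animals in the example.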
Accuracy
 Using the confusion matrix we calculate the
classification measures.
 Accuracy defines how often the model
predicts the correct output.
 It can be calculated as the ratio of the number
of correct predictions made by the classifier to
the total number of predictions made by the
classifier.
 In terms of the confusion matrix:
Accuracy=(TP+TN)/(TP+TN+FP+FN)
 Accuracy is a valid choice of evaluation for
classification problems which are well
balanced, i.e. not skewed by
class imbalance.
Precision
 It is a measure of correctness that is
achieved in true prediction.
 In simple words, it tells us how many
predictions are actually positive out of all
the total positive predicted.
 Precision is defined as the ratio of the total
number of correctly classified positive
classes to the total number of
predicted positive classes: Precision=TP/(TP+FP). Or, out of all the
predicted positive classes, how many we
predicted correctly. Precision should be
high (ideally 1).
 “Precision is a useful metric in cases
where False Positive is a higher concern
than False Negatives”
 Ex 1:- In Spam Detection : Need to focus
on precision
 Suppose a mail is not spam but the model
predicts it as spam : FP (False Positive). We
always try to reduce FP.
Recall
 It is a measure of actual
observations which are predicted correctly,
i.e. how many observations of positive class
are actually predicted as positive. It is also
known as Sensitivity. Recall is a valid
choice of evaluation metric when we want to
capture as many positives as possible.
 Recall is defined as the ratio of the total
number of correctly classified positive
classes to the total number of positive
classes: Recall=TP/(TP+FN). Or, out of all the positive classes,
how many we have predicted
correctly. Recall should be high (ideally 1).
 “Recall is a useful metric in cases
where False Negative trumps False
Positive”
 Ex 1:- Predicting whether a person has cancer
or not: the person is suffering from cancer but the model
predicts that they are not (a False Negative).
F-measure / F1-Score
 The F1 score is a number between 0 and 1 and is the harmonic
mean of precision and recall: F1=2×Precision×Recall/(Precision+Recall). We use the harmonic mean because it is
not sensitive to extremely large values, unlike simple averages.
 The F1 score maintains a balance between the precision and
recall for your classifier.
 In practice, when we try to increase the precision of our model, the
recall often goes down and vice-versa. The F1-score captures both
trends in a single value.
 F-score should be high (ideally 1).
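As a worked example, the four measures can be computed from confusion matrix counts. The counts below are what one gets by tallying the cat/dog lists given earlier in these notes with cat as the positive class; the formulas are the standard definitions in terms of TP, FP, FN, and TN.

```python
# Counts tallied from the cat/dog example, with "cat" as the positive class.
tp, fp, fn, tn = 6, 2, 1, 11

# Share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Share of predicted positives that were truly positive.
precision = tp / (tp + fp)

# Share of actual positives that the model found.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For these counts accuracy is 17/20, precision is 6/8, recall is 6/7, and the F1 score is 0.8, which lies between precision and recall.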
ROC (Receiver Operating Characteristic) Curve
 An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots two
parameters:
 True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR=TP/(TP+FN)
 False Positive Rate (FPR) is defined as follows: FPR=FP/(FP+TN)
 An ROC curve plots TPR vs. FPR at different classification thresholds.
 Lowering the classification threshold classifies more items as positive, thus increasing
both False Positives and True Positives. The following figure shows a typical ROC curve.
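The effect of the threshold on TPR and FPR can be sketched in plain Python. The six labels and scores below are hypothetical model outputs, and the helper name roc_point is chosen for this illustration only.

```python
# Hypothetical true labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

# Each threshold gives one point on the ROC curve.
thresholds = [0.85, 0.5, 0.25]
roc_points = [roc_point(t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

As the threshold is lowered from 0.85 to 0.25, both the FPR and the TPR rise, tracing the curve from the lower left toward the upper right.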
AUC: Area Under the ROC curve
 To compute the points in an ROC curve, we could evaluate a logistic
regression model many times with different classification thresholds,
but this would be inefficient. Fortunately, an efficient, sorting-based
algorithm can provide this information, and the curve is summarized by a single number called AUC.
 AUC stands for "Area under the ROC Curve." That is, AUC measures
the entire two-dimensional area underneath the entire ROC curve.
 One way of interpreting AUC is as the probability that the model ranks a random positive
example more highly than a random negative example. For example,
if examples are arranged from left to right in
ascending order of logistic regression predictions,
AUC represents the probability that a random positive (green)
example is positioned to the right of a random negative (red)
example.
 AUC ranges in value from 0 to 1. A model whose predictions are 100%
wrong has an AUC of 0.0; one whose predictions are 100% correct has
an AUC of 1.0.
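The ranking interpretation of AUC can be checked directly by comparing every positive example with every negative example. The scores below are hypothetical, and this brute-force pairwise count is only a sketch of the definition, not the efficient sorting-based method.

```python
from itertools import product

# Hypothetical predicted scores, split by true class.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]

# Compare every positive with every negative.
pairs = list(product(positives, negatives))

# A pair counts 1 if the positive is ranked higher, 0.5 on a tie, 0 otherwise.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)

auc = wins / len(pairs)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly; the only mistake is the positive scored 0.4 ranking below the negative scored 0.7.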
Python Machine Learning Library Scikit-Learn
 scikit-learn is a general-purpose open-source library for machine learning and data analysis
written in Python. It is based on other Python libraries: NumPy, SciPy,
and matplotlib.
 scikit-learn contains implementations of many popular
machine learning algorithms.
 It implements a range of machine learning, pre-processing, cross-
validation, and visualization algorithms using a unified interface.
 For most installations, the pip package manager can install scikit-learn
and all of its dependencies:
 pip install scikit-learn
 To check that you have scikit-learn, execute in a shell:
 python -c 'import sklearn; print(sklearn.__version__)'
Important features of scikit-learn:
• Simple and efficient tools for data mining and data analysis. It
features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient
boosting, k-means, etc.
• Accessible to everybody and reusable in various contexts.
• Built on top of NumPy, SciPy, and matplotlib.
• Open source, commercially usable – BSD license.
• scikit-learn is a community project and anyone can contribute to it.
• Currently, there are more than 2058 contributors on its GitHub
repository.
Steps to build ML models using scikit-learn
 Step 1: Loading the dataset – We can either load an external
dataset into a pandas DataFrame or Series, or use one of the toy
datasets available in the sklearn library.
Syntax:
import pandas as pd
# Importing the loader from the datasets module of sklearn
from sklearn.datasets import load_iris
# Loading the dataset
iris = load_iris()
# Creating the DataFrame of the dataset
df = pd.DataFrame(iris.data, columns=iris.feature_names)
 Step 2 : Summarize the dataset
 In this step we look at the data in a few different ways. Here
dataset is a pandas DataFrame that holds the features plus a label
column named 'class':
 Dimensions of the dataset: print(dataset.shape)
 Peek at the data itself: print(dataset.head(20))
 Statistical summary of all attributes: print(dataset.describe())
 Breakdown of the data by the class variable:
print(dataset.groupby('class').size())
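The summary calls can be tried on the iris toy dataset. This is a minimal sketch under one assumption: the label column is added by hand and named 'class' so that it matches the calls listed in this step.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the label column as species names; the column name 'class' is an
# assumption chosen to match the groupby call in the notes.
dataset["class"] = iris.target_names[iris.target]

print(dataset.shape)        # (rows, columns)
print(dataset.head(20))     # first 20 rows
print(dataset.describe())   # count, mean, std, min, quartiles, max per attribute

class_counts = dataset.groupby("class").size()
print(class_counts)         # number of rows per class
```

The iris data has 150 rows, 50 per class, so the breakdown shows a perfectly balanced dataset.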
 Step 3 : Data Visualization
 We are going to look at two types of plots:
 Univariate plots to better understand each attribute: dataset.hist()
 Multivariate plots to better understand the relationships between
attributes: scatter_matrix(dataset), imported from pandas.plotting,
draws scatterplots of all pairs of attributes.
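Both kinds of plot can be produced with a few lines. The sketch below assumes the iris features as the dataset and selects the non-interactive "Agg" backend only so that it also runs on a machine without a display. In a notebook or interactive session, drop that line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)

# Univariate: one histogram per attribute.
hist_axes = dataset.hist(figsize=(8, 6))

# Multivariate: a grid of scatterplots, one for every pair of attributes
# (the diagonal shows each attribute's own histogram).
pair_axes = scatter_matrix(dataset, figsize=(8, 8))
```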
 Step 4 : Evaluate Some Algorithms
1. Separate out a validation dataset. - The loaded dataset is split into two: one
portion is used to train, evaluate and select among the models, and the
other is held back as a validation dataset.
from sklearn.model_selection import train_test_split
array = dataset.values
X = array[:, 0:4]  # the four feature columns
y = array[:, 4]    # the class label column
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
2. Set up the test harness to use 10-fold cross validation - We will use stratified
10-fold cross validation to estimate model accuracy. This splits the training data into
10 parts, trains on 9 and tests on 1, and repeats until each part has been used once for testing.
Stratified means that each fold or split of the dataset will aim to have the same
distribution of examples by class as exists in the whole training dataset.
from sklearn.model_selection import StratifiedKFold, cross_val_score
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
# model is any scikit-learn estimator, e.g. LogisticRegression()
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
3. Build multiple different models to predict labels.
4. Select the best model.
Step 5: Make predictions
 model.fit(X, y) (trains the model on X and y)
 X_test = [[...], [...]] (test data)
 y_pred = model.predict(X_test) (making predictions on test data)
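Steps 4 and 5 can be put together as follows. This is a sketch, not a prescribed recipe: the three candidate models and the use of the iris data are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Hold back 20% of the data as a validation set.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# 3. Candidate models (this particular selection is an assumption).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=1),
}

# 2. Stratified 10-fold cross validation on the training portion only.
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
mean_scores = {}
for name, model in models.items():
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    mean_scores[name] = cv_results.mean()
    print(f"{name}: {cv_results.mean():.3f} ({cv_results.std():.3f})")

# 4. Select the model with the best mean cross-validation accuracy.
best_name = max(mean_scores, key=mean_scores.get)
best_model = models[best_name]

# Step 5: train the chosen model and predict on the held-back validation set.
best_model.fit(X_train, Y_train)
validation_accuracy = accuracy_score(Y_validation, best_model.predict(X_validation))
print(best_name, validation_accuracy)
```

The validation set is touched only once, after model selection, so its accuracy is an honest estimate for the chosen model.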
Linear Regression with one Variable
 Linear regression with one variable (simple linear regression) is an
approach for predicting a response using a single feature.
 In linear regression, we assume that the two variables, i.e. the dependent
and independent variables, are linearly related. Hence, we try to find a
linear function that predicts the response value (y) as accurately as
possible as a function of the feature or independent variable (x).
 For example, we can predict the price of a house based on its area.
The example that follows uses a small sample of house areas and the
corresponding prices.
In our house prices example, the line equation is y = m*x + b, where
the variable y is the price of the house and the variable x is the area
of the house. Area is called an independent variable (generally the
variable on the x-axis) and price is called a dependent variable
(generally the variable on the y-axis) because we are calculating the
price based on area.
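For a single feature, the slope m and intercept b of the best-fit line can also be computed by hand with the ordinary least squares formulas. The sketch below uses the house sample from the example in these notes; the variable names are chosen for illustration.

```python
import numpy as np

# House sample: area in sq. ft. and price in dollars.
area = np.array([2600, 3000, 3200, 3600, 4000], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)

# Ordinary least squares for one variable:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   b = mean(y) - m * mean(x)
x_mean, y_mean = area.mean(), price.mean()
m = np.sum((area - x_mean) * (price - y_mean)) / np.sum((area - x_mean) ** 2)
b = y_mean - m * x_mean

# Predicted price of a 3300 sq. ft. house from y = m*x + b.
predicted_price = m * 3300 + b
print(m, b, predicted_price)  # approx. 135.79, 180616.44, 628715.75
```

scikit-learn's LinearRegression solves the same least squares problem, so its coef_ and intercept_ should agree with m and b up to rounding.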
Example of Linear regression with one variable for house price prediction
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data: area of each house and its price
df = pd.DataFrame({'Area': [2600, 3000, 3200, 3600, 4000],
                   'Price': [550000, 565000, 610000, 680000, 725000]})
print(df)

# Scatter plot of the raw data
plt.xlabel("Area (sq. ft.)")
plt.ylabel("Price ($)")
plt.scatter(df.Area, df.Price)

# Fit the model: Price = m * Area + b
reg = LinearRegression()
reg.fit(df[['Area']], df.Price)

# Predict the price of a 3300 sq. ft. house
print(reg.predict(pd.DataFrame({'Area': [3300]})))
# mx + b, where m is the coefficient and b is the intercept
print("The regression equation is", reg.coef_[0], "* Area +", reg.intercept_)

# Draw the fitted line over the data points
plt.plot(df.Area, reg.predict(df[['Area']]), color='blue')
plt.show()
Output
Practice question
 Load the Canada_pci dataset.
 Visualize the data
 Generate a Linear Regression model
 Predict the per capita income for 2021
Linear Regression with Multiple Variables
 When we predict a value (dependent variable) based on more
than one independent variable, it is called linear regression
with multiple variables or Multiple Regression.
 For example, the house price dataset now includes two new columns:
bedrooms (number of bedrooms) and age (age of the house), because
the price of a house does not depend only on the area of the
house.
 So, we will try to predict the price of a house (dependent variable)
based on number of bedrooms, age of house and area of the house
(independent variables).
 We will try to predict the price of houses
with the following properties:
• 3000 sq. ft. area, 3 bedrooms, 40 years old
• 2500 sq. ft. area, 4 bedrooms, 5 years old
Sample of the dataset (a dash marks the missing bedrooms value):
area   bedrooms   age   price
2600   3          20    550000
3000   4          15    565000
3200   -          18    610000
3600   3          30    595000
 So, our linear equation with the three independent variables becomes:
 price = m1*area + m2*bedrooms + m3*age + b
 Note: The data has a missing value in the bedrooms column. So we first preprocess
the data and handle the missing value by replacing it with the median of the column,
rounded down to a whole number.
import math
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("house_prices_mv.csv")
print(df)

# Replace the missing bedrooms value with the column median, rounded down
median_bedrooms = math.floor(df.bedrooms.median())
df.bedrooms = df.bedrooms.fillna(median_bedrooms)
print(df)

# Fit price = m1*area + m2*bedrooms + m3*age + b
features = ['area', 'bedrooms', 'age']
reg = LinearRegression()
reg.fit(df[features], df.price)

# Predict the prices of the two houses
print(reg.predict(pd.DataFrame([[3000, 3, 40]], columns=features)))
print(reg.predict(pd.DataFrame([[2500, 4, 5]], columns=features)))
print("The regression coefficients are", reg.coef_)
print("The intercept is", reg.intercept_)
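To try the same steps without the CSV file, the four rows listed in these notes can be typed in directly. This is only a sketch of the mechanics: an in-memory DataFrame stands in for the file, and four rows are far too few for a meaningful fit. It also checks that the linear equation, evaluated by hand, gives the same value as predict().

```python
import math
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# In-memory stand-in for the CSV file: only the four rows listed in the notes.
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600],
    "bedrooms": [3, 4, np.nan, 3],
    "age": [20, 15, 18, 30],
    "price": [550000, 565000, 610000, 595000],
})

# Handle the missing value: fill with the column median, rounded down.
median_bedrooms = math.floor(df["bedrooms"].median())
df["bedrooms"] = df["bedrooms"].fillna(median_bedrooms)

features = ["area", "bedrooms", "age"]
reg = LinearRegression().fit(df[features], df["price"])

# price = m1*area + m2*bedrooms + m3*age + b, evaluated by hand ...
m1, m2, m3 = reg.coef_
manual_price = m1 * 3000 + m2 * 3 + m3 * 40 + reg.intercept_

# ... gives the same value as predict() for the same house.
house = pd.DataFrame([[3000, 3, 40]], columns=features)
model_price = reg.predict(house)[0]
print(manual_price, model_price)
```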
Exercise for multiple regression
Dataset (a dash marks a missing value):
Experience   Score   Int_score   salary
-            8       9           50000
-            8       6           45000
5            6       7           60000
2            10      10          65000
7            9       6           70000
3            7       10          62000
10           -       7           72000
11           7       8           80000
 Q1. Preprocess the data
 Q2. Build a multiple regression model
 Q3. Test the model with the following data:
• 2 yrs experience, 9 test score, 6 interview score
• 12 yrs experience, 10 test score, 10 interview score
Logistic Regression
 Logistic regression aims to solve classification problems.
 It does this by predicting categorical outcomes, unlike linear regression that
predicts a continuous outcome.
 In logistic regression, the dependent variable is a binary variable that contains data
coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic
regression model predicts the probability P(Y=1) as a function of X.
 If the problem has two possible outcomes, it is called Binomial logistic regression.
For example, deciding whether a tumor is malignant or benign.
 If there are more than 2 outcomes to classify, it is called Multinomial Logistic
Regression. For example, classifying an object as one of 3 different shapes.
 Logistic regression is a method we can use to fit a regression model when the
response variable is binary.
 Logistic regression uses a method known as maximum likelihood estimation to find an
equation of the following form:
 log[p(X) / (1-p(X))] = β0 + β1X1 + β2X2 + … + βpXp
 where:
• Xj: The jth predictor variable
• βj: The coefficient estimate for the jth predictor variable
 The formula on the right side of the equation predicts the log odds of the response variable
taking on a value of 1.
 Thus, when we fit a logistic regression model we can use the following equation
to calculate the probability that a given observation takes on a value of 1:
 p(X) = e^(β0 + β1X1 + β2X2 + … + βpXp) / (1 + e^(β0 + β1X1 + β2X2 + … + βpXp))
 We then use some probability threshold to classify the observation as either 1
or 0.
 For example, we might say that observations with a probability greater than or
equal to 0.5 will be classified as “1” and all other observations will be classified
as “0.”
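The path from log odds to probability to class label can be sketched in a few lines of NumPy. The coefficients below are hypothetical values chosen so the arithmetic is easy to follow, and predict_probability is a helper name introduced only for this illustration.

```python
import numpy as np

def predict_probability(X, beta0, beta):
    """Return p(X) = 1 / (1 + e^-(beta0 + X @ beta)) for each row of X.

    This is algebraically the same as e^z / (1 + e^z) with z the log odds.
    """
    log_odds = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical coefficients for a model with two predictors.
beta0 = -3.0
beta = np.array([1.0, 0.5])

X = np.array([[1.0, 2.0],   # log odds = -3 + 1 + 1 = -1
              [2.0, 2.0],   # log odds = -3 + 2 + 1 =  0
              [4.0, 2.0]])  # log odds = -3 + 4 + 1 =  2

probabilities = predict_probability(X, beta0, beta)

# Apply the 0.5 threshold: probability >= 0.5 is classified as 1, else 0.
labels = (probabilities >= 0.5).astype(int)
print(probabilities, labels)
```

A log odds of 0 corresponds to a probability of exactly 0.5, which is why the second observation sits on the decision boundary and is classified as 1 under the "greater than or equal to" rule.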
Example
import numpy
from sklearn.linear_model import LogisticRegression

# Goal: predict whether a tumor is cancerous from the size of the tumor
# X holds 12 tumor sizes, reshaped into one column; -1 means this dimension is inferred
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
# y holds the labels: 0 = not cancerous, 1 = cancerous
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(X)

logr = LogisticRegression()
logr.fit(X, y)

# Predict the class of a tumor of size 3.46
predicted = logr.predict(numpy.array([3.46]).reshape(-1, 1))
print(predicted)

# coef_ is the change in log odds per unit of size; exponentiating it gives the odds ratio
log_odds = logr.coef_
odds = numpy.exp(log_odds)
print(odds)
# This tells us that as the size of a tumor increases by 1 mm, the odds
# of it being a cancerous tumor increase by about 4x.
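The fitted coefficients can also be turned into a probability by hand and compared with scikit-learn's predict_proba. This is a sketch on the same tumor sample; it relies on the default LogisticRegression settings for a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
              4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = LogisticRegression().fit(X, y)

new_size = np.array([[3.46]])

# Log odds from the fitted equation: beta0 + beta1 * size
log_odds = logr.intercept_[0] + logr.coef_[0, 0] * new_size[0, 0]
# Convert the log odds to a probability with the logistic function
manual_probability = 1.0 / (1.0 + np.exp(-log_odds))

# scikit-learn's own estimate; column 1 holds the probability of class 1
sklearn_probability = logr.predict_proba(new_size)[0, 1]
print(manual_probability, sklearn_probability)
```

predict() then simply reports class 1 when this probability is above 0.5 and class 0 otherwise.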
Example with train-test splitting
 Data: Pima Indians Diabetes Database (kaggle.com)
 The objective of the dataset is to diagnostically predict whether or
not a patient has diabetes, based on certain diagnostic
measurements included in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")
df.info()  # info() prints its summary directly

# Split the data into feature and target variables
X = df.drop('Outcome', axis=1)
y = df['Outcome']
print(y.value_counts())  # check the label distribution

# Split the data into training and test sets (25% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
print(y_test.value_counts())  # check the label distribution in the test set

# max_iter is raised from its default of 100 to avoid a convergence warning;
# as the size of the data increases, a larger value may be needed
logr = LogisticRegression(max_iter=1000)
logr.fit(X_train.values, y_train)  # train the model
predicted = logr.predict(X_test.values)  # predict and store the predicted values

cm = metrics.confusion_matrix(y_test, predicted)  # generate a confusion matrix
print(cm)
print(metrics.accuracy_score(y_test, predicted))  # print the accuracy
print(metrics.classification_report(y_test, predicted))  # generate classification report

# Plot the confusion matrix
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
cm_display.plot()
plt.show()
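Accuracy, precision and recall can all be read off the confusion matrix. The sketch below shows the arithmetic on a stand-in dataset: the breast cancer data bundled with scikit-learn is used here only because it needs no download, so the numbers are not those of the diabetes example.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in binary dataset bundled with scikit-learn (not the diabetes data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16, stratify=y
)

logr = LogisticRegression(max_iter=10000)
logr.fit(X_train, y_train)
predicted = logr.predict(X_test)

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found
print(accuracy, precision, recall)
```

These hand-computed values should match metrics.accuracy_score, metrics.precision_score and metrics.recall_score on the same predictions.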
Types of Logistic Regression
• Binary Logistic Regression: The target variable has only two possible
outcomes, such as Spam or Not Spam, Cancer or No Cancer.
• Multinomial Logistic Regression: The target variable has three or more
nominal categories, such as predicting the type of wine.
• Ordinal Logistic Regression: The target variable has three or more
ordinal categories, such as a restaurant or product rating from 1 to 5.
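The multinomial case can be tried on the wine toy dataset in scikit-learn, which has three classes. This is a minimal sketch; the scaling step and the split parameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # three wine classes: 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=16, stratify=y
)

# Scaling the features first helps the solver converge.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

probabilities = clf.predict_proba(X_test)  # one column per class, rows sum to 1
predicted = clf.predict(X_test)            # the class with the highest probability
test_accuracy = clf.score(X_test, y_test)
print(probabilities[:3], predicted[:3], test_accuracy)
```

Instead of a single probability P(Y=1), the model now returns one probability per class and predicts the most likely one.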