Unit 1 Notes
Machine Learning with Python
Course Code: BCA 311
Syllabus - UNIT–I
All facts are data. Data can be numbers or text that can be processed by a computer. Today, organizations accumulate vast and growing amounts of data from sources such as flat files, databases, or data warehouses, in different storage formats. Processed data is called information. This includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product is the fastest selling. Condensed information is called knowledge. For example, the historical patterns and future trends obtained from the above sales data can be called knowledge. Unless knowledge is extracted from it, data is of no use. Similarly, knowledge is not useful unless it is put into action. Intelligence is knowledge applied to take action; an actionable form of knowledge is called intelligence. Computer systems have been successful up to this stage. The ultimate objective of the knowledge pyramid is wisdom, a maturity of mind that, so far, is exhibited only by humans. Here comes the need for machine learning. The objective of machine learning is to process this archival data so that organizations can take better decisions, design new products, improve business processes, and develop effective decision support systems.
Introduction to Machine Learning
The term machine learning was first introduced by Arthur Samuel in 1959. We can define it as: "Machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things without being explicitly programmed."
Machine learning is a growing technology that enables computers to learn automatically from past data. Machine learning uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is being used for tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
Just as humans take decisions based on experience, computers build models based on patterns extracted from the input data and then use these models to make predictions and take decisions.
In statistical learning, the relationship between the input x and the output y is modeled as a function of the form y = f(x). Here, f is the learning function that maps the input x to the output y. Learning the function f is the crucial aspect of forming a model in statistical learning. In machine learning, this is simply called mapping of input to output. The learning program summarizes the raw data in a model.
Formally stated, a model is an explicit description of patterns within the data in the
form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
The difference between a pattern and a model is that the former is local and applies only to certain attributes, whereas the latter is global and fits the entire dataset. For example, a model can be used to examine whether a given email is spam or not. The point is that the model is generated automatically from the given data.
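As a minimal illustration of learning f from data (a sketch with made-up (x, y) pairs, using NumPy's polyfit to fit a straight line; the "model" here is simply the fitted equation):
import numpy as np
# Toy data: y is roughly 2x + 1 with a little noise (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
# "Learning" f: fit a degree-1 polynomial y = m*x + b to the data
m, b = np.polyfit(x, y, 1)
# The fitted model summarizes the raw data as a mathematical equation
print(f"f(x) = {m:.2f}*x + {b:.2f}")
print("Prediction for x = 6:", m * 6 + b)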
Another pioneer of AI, Tom Mitchell, defines machine learning as follows: "A computer program is said to learn from experience E, with respect to task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
For example, while creating a program for playing chess:
T : to play the game of chess
E : experience of games played previously
P : probability of winning or losing the game
How does Machine Learning work?
A machine learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data: a larger amount of data helps build a better model, which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic as per the data and predicts the output. Machine learning has changed our way of thinking about such problems. In block-diagram form, a machine learning algorithm works as: past data → learning algorithm → logical model → prediction of output for new data.
Machine Learning and Artificial Intelligence
Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is to develop intelligent agents. An agent can be a robot, a human, or any other autonomous system. Machine learning is the sub-branch of AI whose aim is to extract patterns for prediction. It is a broad field that includes learning from examples as well as other areas like reinforcement learning.
AI is the bigger concept: creating intelligent machines that can simulate human thinking capability and behavior. Machine learning is an application or subset of AI that allows machines to learn from data without being explicitly programmed.
Artificial intelligence
• AI allows a machine to simulate human intelligence to solve problems
• The goal is to develop an intelligent system that can perform complex tasks
• We build systems that can solve complex tasks like a human
• AI has a wide scope of applications
• AI uses technologies in a system so that it mimics human decision-making
• AI works with all types of data: structured, semi-structured, and unstructured
• AI systems use logic and decision trees to learn, reason, and self-correct
Machine learning
• ML allows a machine to learn autonomously from past data
• The goal is to build machines that can learn from data to increase the accuracy of the
output
• We train machines with data to perform specific tasks and deliver accurate results
• Machine learning has a limited scope of applications
• ML uses self-learning algorithms to produce predictive models
• ML can only use structured and semi-structured data
• ML systems rely on statistical models to learn and can self-correct when provided with
new data
Types of Machine Learning Problems
Based on the methods and way of learning, machine learning is mainly divided into four types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly.
Imagine a teacher supervising a class. The teacher already knows the correct answers, but the learning process doesn't stop until the students learn them as well. This is the essence of supervised machine learning algorithms. Here, the algorithm learns from a training dataset and makes predictions that are compared with the actual output values. If the predictions are not correct, the algorithm is modified until it is satisfactory. This learning process continues until the algorithm achieves the required level of performance.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
How does Supervised Learning work?
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a portion of the dataset held out from training), and then it predicts the output.
In the example given below, the machine is already trained on all types of shapes; when it encounters a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
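A minimal sketch of this shape example in code (the features and labels here are hypothetical: each shape is described only by its number of sides):
from sklearn.tree import DecisionTreeClassifier
# Feature: number of sides; label: shape name (hypothetical training data)
X_train = [[3], [3], [4], [4], [6], [6]]
y_train = ['triangle', 'triangle', 'square', 'square', 'hexagon', 'hexagon']
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# A new, unseen 4-sided shape is classified on the basis of its number of sides
print(model.predict([[4]]))   # ['square']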
Types of Supervised Machine Learning Algorithms:
Supervised learning can be further divided into two types of problems: Regression and Classification.
Semi-Supervised Machine Learning
Semi-supervised learning lies between supervised and unsupervised learning: it trains on a small amount of labelled data together with a large amount of unlabelled data.
Advantages:
• The algorithm is simple and easy to understand.
• It is highly efficient.
• It is used to solve the drawbacks of both Supervised and Unsupervised Learning algorithms.
Disadvantages:
• The iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.
Reinforcement learning
• Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
• In reinforcement learning, the agent learns automatically from this feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience only.
• RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.
• The agent learns by trial and error, and based on its experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its limbs is an example of reinforcement learning.
Terms used in reinforcement learning
• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by the agent within the environment.
• State: A state is the situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
• Policy: A policy is the strategy the agent applies to decide the next action based on the current state.
• Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
Working of Reinforcement learning
In each step of the loop, the agent takes an action in the environment, the environment returns a new state and a reward, and the agent uses this feedback to improve its policy.
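A minimal sketch of this loop using tabular Q-learning (the two-state environment, the reward scheme, and the alpha/gamma hyper-parameters below are all made up for illustration):
import random
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]   # table of Q-values
alpha, gamma = 0.1, 0.9                            # learning rate, discount factor
def step(state, action):
    # Made-up environment: only action 1 in state 0 earns a reward
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    next_state = (state + 1) % n_states
    return next_state, reward
state = 0
for _ in range(1000):
    action = random.randrange(n_actions)           # explore by acting randomly
    next_state, reward = step(state, action)
    # Q-learning update: move Q toward reward + discounted best future value
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state
print(Q)   # Q[0][1] ends up largest: the agent has learned the rewarding action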
Performance Measures
Evaluating the performance of a Machine learning model is one of the important
steps while building an effective ML model.
To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
These performance metrics help us understand how well our model has
performed for the given data. In this way, we can improve the model's
performance by tuning the hyper-parameters.
Each ML model aims to generalize well on unseen/new data, and performance
metrics help determine how well the model generalizes on the new dataset.
In machine learning, each task or problem is divided into classification and regression. Different evaluation metrics are used for regression and classification tasks. They are as follows:
Performance Metrics for Classification: Confusion Matrix, Accuracy, Precision & Recall, ROC Curve.
Performance Metrics for Regression: Mean Absolute Error, Mean Squared Error, R2 Score.
Confusion Matrix
A confusion matrix is a tabular representation of the prediction outcomes of any binary classifier; it is used to describe the performance of the classification model on a set of test data for which the true values are known.
It is a tabular visualization of the ground-truth labels versus model predictions. Each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class.
The confusion matrix is not exactly a performance metric, but rather a basis on which other metrics evaluate the results.
For binary classification, the matrix is a 2×2 table. For multi-class classification, the matrix shape equals the number of classes, i.e., for n classes it is n×n.
The following four terms are the basic terminology that helps determine the metrics we are looking for:
• True Positives (TP): when the actual value is Positive and the prediction is also Positive.
• True Negatives (TN): when the actual value is Negative and the prediction is also Negative.
• False Positives (FP): when the actual value is Negative but the prediction is Positive. Also known as a Type 1 error.
• False Negatives (FN): when the actual value is Positive but the prediction is Negative. Also known as a Type 2 error.
A good model is one with high TP and TN rates and low FP and FN rates.
Understanding the confusion matrix in an easier way:
We have a total of 20 cats and dogs, and our model predicts whether each is a cat or not.
Actual values = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Predicted values = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
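We can verify the counts with scikit-learn's confusion_matrix (a sketch; scikit-learn sorts the labels alphabetically, so here the rows are the actual classes cat, dog and the columns are the predicted classes):
from sklearn.metrics import confusion_matrix
actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
print(confusion_matrix(actual, predicted))
# [[ 6  1]    with 'cat' as the positive class: TP = 6, FN = 1,
#  [ 2 11]]   FP = 2, TN = 11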
Accuracy
Using the confusion matrix we calculate the classification measures.
Accuracy defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a valid choice of evaluation metric for classification problems which are well balanced and not skewed, i.e., where there is no class imbalance.
Precision
Precision is a measure of correctness achieved in positive prediction. In simple words, it tells us how many of the predicted positives are actually positive.
Precision is defined as the ratio of the total number of correctly classified positive instances to the total number of predicted positive instances:
Precision = TP / (TP + FP)
In other words: out of all the predicted positive classes, how many did we predict correctly? Precision should be high (ideally 1).
"Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative."
Example: spam detection needs to focus on precision. Suppose a mail is not spam but the model predicts it as spam: that is an FP (False Positive). We always try to reduce FP.
Recall
Recall is a measure of how many actual positive observations are predicted correctly, i.e., how many observations of the positive class are actually predicted as positive. It is also known as Sensitivity. Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
Recall is defined as the ratio of the total number of correctly classified positive instances to the total number of actual positive instances:
Recall = TP / (TP + FN)
In other words: out of all the actual positive classes, how many did we predict correctly? Recall should be high (ideally 1).
"Recall is a useful metric in cases where a False Negative trumps a False Positive."
Example: predicting whether a person has cancer. If the person has cancer but the model predicts that they do not, that is an FN, which is far more dangerous than a false alarm; here we need to focus on recall.
F-measure / F1-Score
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
We use the harmonic mean because, unlike a simple average, it is not pulled up by one extremely large value: it is dominated by the smaller of the two, so a classifier cannot get a high F1 score with high precision but very low recall, or vice-versa. The F1 score thus maintains a balance between precision and recall for your classifier.
In practice, when we try to increase the precision of our model, the recall goes down, and vice-versa. The F1-score captures both trends in a single value. The F1 score should be high (ideally 1).
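Continuing the cat/dog example, all four metrics can be computed with scikit-learn (a sketch; pos_label='cat' treats cat as the positive class, and the lists are the same as in the confusion-matrix example above):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
print(accuracy_score(actual, predicted))                    # 0.85   = (6+11)/20
print(precision_score(actual, predicted, pos_label='cat'))  # 0.75   = 6/(6+2)
print(recall_score(actual, predicted, pos_label='cat'))     # ~0.857 = 6/(6+1)
print(f1_score(actual, predicted, pos_label='cat'))         # 0.8    = harmonic mean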
ROC (Receiver Operating Characteristic) Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots two
parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. A typical ROC curve bows up toward the top-left corner, where TPR = 1 and FPR = 0.
AUC: Area Under the ROC curve
To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there is an efficient, sorting-based algorithm that can provide this information for us, called AUC.
AUC stands for "Area Under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve. AUC can be interpreted as the probability that the model ranks a random positive example more highly than a random negative example.
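A sketch of computing ROC points and AUC with scikit-learn (y_true and y_scores below are made-up: ground-truth labels and the model's predicted probabilities for the positive class):
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
y_true   = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]                        # labels (made up)
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.65, 0.3]  # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # TPR and FPR at each threshold
print("AUC =", roc_auc_score(y_true, y_scores))
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()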
Scikit-learn (sklearn)
• Simple and efficient tools for data mining and data analysis. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, etc.
• Accessible to everybody and reusable in various contexts.
• Built on top of NumPy, SciPy, and matplotlib.
• Open source, commercially usable (BSD license).
• Sklearn is a community project and anyone can contribute to it. Currently, there are more than 2058 contributors on its GitHub repository.
Steps to build ML models using scikit-learn
Step 1: Loading the dataset. We can either load an external dataset using a pandas DataFrame/Series, or use the toy datasets available in the sklearn library.
Syntax:
# Importing the dataset from the datasets module of sklearn
import pandas as pd
from sklearn.datasets import load_iris
# Loading the dataset
iris = load_iris()
# Creating the dataframe of the dataset
df = pd.DataFrame(iris.data, columns=iris.feature_names)
Step 2: Summarize the dataset
In this step we look at the data in a few different ways:
• Dimensions of the dataset: print(dataset.shape)
• Peek at the data itself: print(dataset.head(20))
• Statistical summary of all attributes: print(dataset.describe())
• Breakdown of the data by the class variable: print(dataset.groupby('class').size())
Step 3: Data Visualization
We look at two types of plots:
• Univariate plots, to better understand each attribute: dataset.hist()
• Multivariate plots, to better understand the relationships between attributes: scatter_matrix(dataset) draws scatterplots of all pairs of attributes (scatter_matrix is imported from pandas.plotting).
Step 4: Evaluate Some Algorithms
1. Separate out a validation dataset. The loaded dataset is split into two: one portion is used to train, evaluate, and select among the models, and the other is used as a validation dataset.
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
2. Set up the test harness to use 10-fold cross validation. We use stratified 10-fold cross validation to estimate model accuracy. This splits our dataset into 10 parts; we train on 9 and test on 1, and repeat for all combinations of train-test splits. Stratified means that each fold or split of the dataset aims to have the same distribution of examples by class as exists in the whole training dataset.
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
(train_test_split, StratifiedKFold, and cross_val_score are imported from sklearn.model_selection.)
3. Build multiple different models to predict labels.
4. Select the best model.
Step 5: Make predictions
model.fit(X, y)                   # trains the model on X and y
X_test = [[...], [...]]           # test data
y_test = model.predict(X_test)    # making predictions on the test data
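Putting Steps 1-5 together on the iris dataset (a sketch; the choice of LogisticRegression here is just one possible model, and max_iter=200 is an assumed setting to ensure convergence):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Step 1: load the data
iris = load_iris()
X, y = iris.data, iris.target
# Step 4.1: hold out 20% of the data for validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)
# Step 4.2: estimate accuracy with stratified 10-fold cross validation
model = LogisticRegression(max_iter=200)
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
print("CV accuracy: %.3f (%.3f)" % (cv_results.mean(), cv_results.std()))
# Step 5: fit on the training portion and predict on the held-out validation set
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("Validation accuracy:", accuracy_score(Y_validation, predictions))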
Linear Regression with one Variable
Linear regression is an approach for predicting a response using a single feature.
In linear regression, we assume that the two variables, i.e., the dependent and independent variables, are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x). For example, predicting the price of a house based on the area of the house. Given below is a sample of data showing the area of houses and the corresponding prices.
In our house prices example, the variable y in the line equation is the price of the house and the variable x is the area of the house. Area is called an independent variable (generally the variable on the x-axis) and price is called a dependent variable (generally the variable on the y-axis), because we are calculating the price based on the area.
Example of Linear regression with one variable for house price prediction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'Area': [2600, 3000, 3200, 3600, 4000],
                   'Price': [550000, 565000, 610000, 680000, 725000]})
print(df)
plt.xlabel("Area (sq. ft.)")
plt.ylabel("Price ($)")
plt.scatter(df.Area, df.Price)
reg = LinearRegression()
reg.fit(df[['Area']], df.Price)
print(reg.predict([[3300]]))
# y = m*x + b, where m is the coefficient (slope) and b is the intercept
print("The regression equation is ", reg.coef_, " * Area + ", reg.intercept_)
plt.plot(df.Area, reg.predict(df[['Area']]), color='blue')
plt.show()
Output: the predicted price for an area of 3300 sq. ft. is approximately 628715.75, with coefficient m ≈ 135.79 and intercept b ≈ 180616.44.
Practice question
Linear Regression with Multiple Variables
Now suppose the price depends on more than one feature, e.g., the area, number of bedrooms, and age of the house (a sample row of such data: area 3600 sq. ft., 3 bedrooms, 30 years old, price 595000).
So, our linear equation with the three independent variables becomes:
price = m1*area + m2*bedrooms + m3*age + b
Note: The data has missing values, so we first preprocess the data and handle the missing values by replacing them with the median value (as the code below does for the bedrooms column).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LinearRegression
df=pd.read_csv("house_prices_mv.csv")
print(df)
median_bedrooms=math.floor(df.bedrooms.median())
df.bedrooms=df.bedrooms.fillna(median_bedrooms)
print(df)
reg=LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df.price)
print(reg.predict([[3000, 3, 40]]))
print(reg.predict([[2500, 4, 5]]))
print("The regression coefficients are ",reg.coef_)
print (“The intercept is ",reg.intercept_)
Exercise for multiple regression
Given the following hiring data (a dash marks a missing value), build a model to predict salary from experience, test score, and interview score. A sketch of one possible approach follows the table.

Experience   Score   Int_score   Salary
    -          8         9        50000
    -          8         6        45000
    5          6         7        60000
    2         10        10        65000
    7          9         6        70000
    3          7        10        62000
   10          -         7        72000
   11          7         8        80000
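One possible way to set up this exercise in code (a sketch; it assumes the dashes above are missing values, that missing experience can be treated as 0, and that a missing test score can be filled with the column median — these preprocessing choices are illustrative, not the only correct ones):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
    'experience': [np.nan, np.nan, 5, 2, 7, 3, 10, 11],
    'score':      [8, 8, 6, 10, 9, 7, np.nan, 7],
    'int_score':  [9, 6, 7, 10, 6, 10, 7, 8],
    'salary':     [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})
# Handle missing values before fitting
df.experience = df.experience.fillna(0)
df.score = df.score.fillna(df.score.median())
reg = LinearRegression()
reg.fit(df[['experience', 'score', 'int_score']], df.salary)
print(reg.predict([[2, 9, 6]]))   # e.g., salary for 2 yrs experience, scores 9 and 6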
Types of Logistic Regression
• Binary Logistic Regression: The target variable has only two possible outcomes, such as Spam or Not Spam, Cancer or No Cancer.
• Multinomial Logistic Regression: The target variable has three or more nominal categories, such as predicting the type of wine.
• Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
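A minimal sketch of binary logistic regression with scikit-learn (the data is made up: hours studied as the feature, pass/fail as the binary target):
from sklearn.linear_model import LogisticRegression
# Feature: hours studied; label: 0 = fail, 1 = pass (made-up data)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[3.5], [6.5]]))        # predicted classes
print(clf.predict_proba([[3.5], [6.5]]))  # probabilities for each class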