
Machine Learning
Algorithms and Applications
OTHER TITLES FROM AUERBACH PUBLICATIONS AND CRC PRESS

Adaptive, Dynamic, and Resilient Systems
Edited by Niranjan Suri and Giacomo Cabri
ISBN 978-1-4398-6848-5

Anti-Spam Techniques Based on Artificial Immune System
Ying Tan
ISBN 978-1-4987-2518-7

Case Studies in Secure Computing: Achievements and Trends
Edited by Biju Issac and Nauman Israr
ISBN 978-1-4822-0706-4

Cognitive Robotics
Edited by Hooman Samani
ISBN 978-1-4822-4456-4

Computational Intelligent Data Analysis for Sustainable Development
Edited by Ting Yu, Nitesh Chawla, and Simeon Simoff
ISBN 978-1-4398-9594-8

Computational Trust Models and Machine Learning
Xin Liu, Anwitaman Datta, and Ee-Peng Lim
ISBN 978-1-4822-2666-9

Enhancing Computer Security with Smart Technology
V. Rao Vemuri
ISBN 978-0-8493-3045-2

Exploring Neural Networks with C#
Ryszard Tadeusiewicz, Rituparna Chaki, and Nabendu Chaki
ISBN 978-1-4822-3339-1

Generic and Energy-Efficient Context-Aware Mobile Sensing
Ozgur Yurur and Chi Harold Liu
ISBN 978-1-4987-0010-8

Network Anomaly Detection: A Machine Learning Perspective
Dhruba Kumar Bhattacharyya and Jugal Kumar Kalita
ISBN 978-1-4665-8208-8

Risks of Artificial Intelligence
Vincent C. Müller
ISBN 978-1-4987-3482-0

The Cognitive Early Warning Predictive System Using the Smart Vaccine: The New Digital Immunity Paradigm for Smart Cities and Critical Infrastructure
Rocky Termanini
ISBN 978-1-4987-2651-1

The State of the Art in Intrusion Prevention and Detection
Edited by Al-Sakib Khan Pathan
ISBN 978-1-4822-0351-6

Zeroing Dynamics, Gradient Dynamics, and Newton Iterations
Yunong Zhang, Lin Xiao, Zhengli Xiao, and Mingzhi Mao
ISBN 978-1-4987-5376-0

Machine Learning
Algorithms and Applications

Mohssen Mohammed
Muhammad Badruddin Khan
Eihab Bashier Mohammed Bashier

CRC Press
Taylor & Francis Group
Boca Raton   London   New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20160428

International Standard Book Number-13: 978-1-4987-0538-7 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Mohammed, Mohssen, 1982- author. | Khan, Muhammad Badruddin, author. | Bashier, Eihab Bashier Mohammed, author.
Title: Machine learning : algorithms and applications / Mohssen Mohammed,
Muhammad Badruddin Khan, and Eihab Bashier Mohammed Bashier.
Description: Boca Raton : CRC Press, 2017. | Includes bibliographical
references and index.
Identifiers: LCCN 2016015290 | ISBN 9781498705387 (hardcover : alk. paper)
Subjects: LCSH: Machine learning. | Computer algorithms.
Classification: LCC Q325.5 .M63 2017 | DDC 006.3/12--dc23


LC record available at https://lccn.loc.gov/2016015290

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com

To our parents, families, brothers and sisters, and


to our students, we dedicate this book.
Contents

Preface................................................................................xiii
Acknowledgments ............................................................. xv
Authors .............................................................................. xvii
Introduction ...................................................................... xix
1 Introduction to Machine Learning...........................1
1.1 Introduction ................................................................ 1
1.2 Preliminaries ............................................................... 2
1.2.1 Machine Learning: Where Several
Disciplines Meet ............................................... 4
1.2.2 Supervised Learning ........................................ 7
1.2.3 Unsupervised Learning.................................... 9
1.2.4 Semi-Supervised Learning ..............................10
1.2.5 Reinforcement Learning..................................11
1.2.6 Validation and Evaluation ...............................11
1.3 Applications of Machine Learning Algorithms .........14
1.3.1 Automatic Recognition of Handwritten
Postal Codes....................................................15
1.3.2 Computer-Aided Diagnosis .............................17
1.3.3 Computer Vision .............................................19
1.3.3.1 Driverless Cars ....................................20
1.3.3.2 Face Recognition and Security...........22
1.3.4 Speech Recognition ........................................22

1.3.5 Text Mining .....................................................23


1.3.5.1 Where Text and Image Data Can Be Used Together ...............24
1.4 The Present and the Future ......................................25
1.4.1 Thinking Machines .........................................25
1.4.2 Smart Machines .............................................. 28
1.4.3 Deep Blue .......................................................30
1.4.4 IBM’s Watson ..................................................31
1.4.5 Google Now ....................................................32
1.4.6 Apple’s Siri ......................................................32
1.4.7 Microsoft’s Cortana .........................................32
1.5 Objective of This Book .............................................33
References ..........................................................................34

SECTION I: SUPERVISED LEARNING ALGORITHMS


2 Decision Trees .......................................................37
2.1 Introduction ...............................................................37
2.2 Entropy ......................................................................38
2.2.1 Example ..........................................................38
2.2.2 Understanding the Concept of Number
of Bits ..............................................................40
2.3 Attribute Selection Measure ......................................41
2.3.1 Information Gain of ID3.................................41
2.3.2 The Problem with Information Gain ............ 44
2.4 Implementation in MATLAB® .................................. 46
2.4.1 Gain Ratio of C4.5 ..........................................49
2.4.2 Implementation in MATLAB ..........................51
References ..........................................................................52
3 Rule-Based Classifiers............................................53
3.1 Introduction to Rule-Based Classifiers ......................53
3.2 Sequential Covering Algorithm .................................54
3.3 Algorithm ...................................................................54
3.4 Visualization ..............................................................55
3.5 Ripper ........................................................................55
3.5.1 Algorithm ........................................................56
3.5.2 Understanding Rule Growing Process ...........58


3.5.3 Information Gain ............................................65
3.5.4 Pruning............................................................66
3.5.5 Optimization .................................................. 68
References ..........................................................................72
4 Naïve Bayesian Classification.................................73
4.1 Introduction ...............................................................73
4.2 Example .....................................................................74
4.3 Prior Probability ........................................................75
4.4 Likelihood ..................................................................75
4.5 Laplace Estimator...................................................... 77
4.6 Posterior Probability ..................................................78
4.7 MATLAB Implementation .........................................79
References ..........................................................................82
5 The k-Nearest Neighbors Classifiers ......................83
5.1 Introduction ...............................................................83
5.2 Example .................................................................... 84
5.3 k-Nearest Neighbors in MATLAB®........................... 86
References ......................................................................... 88
6 Neural Networks ....................................................89
6.1 Perceptron Neural Network ......................................89
6.1.1 Perceptrons .................................................... 90
6.2 MATLAB Implementation of the Perceptron
Training and Testing Algorithms ..............................94
6.3 Multilayer Perceptron Networks .............................. 96
6.4 The Backpropagation Algorithm.............................. 99
6.4.1 Weights Updates in Neural Networks .......... 101
6.5 Neural Networks in MATLAB .................................102
References ........................................................................105
7 Linear Discriminant Analysis ..............................107
7.1 Introduction .............................................107
7.2 Example ...................................................................108
References ........................................................................ 114

8 Support Vector Machine ...................................... 115


8.1 Introduction ........................................... 115
8.2 Definition of the Problem ..................................... 116
8.2.1 Design of the SVM ....................................120
8.2.2 The Case of Nonlinear Kernel ..................126
8.3 The SVM in MATLAB® ..........................................127
References ........................................................................128

SECTION II: UNSUPERVISED LEARNING ALGORITHMS


9 k-Means Clustering ..............................................131
9.1 Introduction ........................................................... 131
9.2 Description of the Method ....................................132
9.3 The k-Means Clustering Algorithm .......................133
9.4 The k-Means Clustering in MATLAB® ..................134
10 Gaussian Mixture Model ......................................137
10.1 Introduction ...........................................................137
10.2 Learning the Concept by Example .......................138
References ........................................................................143
11 Hidden Markov Model ......................................... 145
11.1 Introduction ........................................................... 145
11.2 Example .................................................................146
11.3 MATLAB Code ......................................................148
References ........................................................................ 152
12 Principal Component Analysis............................. 153
12.1 Introduction ........................................................... 153
12.2 Description of the Problem ................................... 154
12.3 The Idea behind the PCA ..................................... 155
12.3.1 The SVD and Dimensionality
Reduction .............................................. 157
12.4 PCA Implementation ............................................. 158
12.4.1 Number of Principal Components to Choose .................................. 159
12.4.2 Data Reconstruction Error ........................160

12.5 The Following MATLAB® Code Applies the PCA ............................... 161
12.6 Principal Component Methods in Weka ...............163
12.7 Example: Polymorphic Worms Detection
Using PCA .............................................................. 167
12.7.1 Introduction ............................................... 167
12.7.2 SEA, MKMP, and PCA ...............................168
12.7.3 Overview and Motivation for Using
String Matching .........................................169
12.7.4 The KMP Algorithm .................................. 170
12.7.5 Proposed SEA ............................................ 171
12.7.6 An MKMP Algorithm ................................ 173
12.7.6.1 Testing the Quality of the
Generated Signature for
Polymorphic Worm A ................. 174
12.7.7 A Modified Principal Component
Analysis ..................................................... 174
12.7.7.1 Our Contributions in the PCA..... 174
12.7.7.2 Testing the Quality of
Generated Signature for
Polymorphic Worm A ................. 178
12.7.7.3 Clustering Method for Different
Types of Polymorphic Worms .....179
12.7.8 Signature Generation Algorithms
Pseudo-Codes............................................ 179
12.7.8.1 Signature Generation Process .....180
References ........................................................................187

Appendix I: Transcript of Conversations with Chatbot ...........................189
Appendix II: Creative Chatbot .................................................. 193
Index ..........................................................................195

Preface

If you are new to machine learning and do not know which book to start with, then this is the book for you. If you know some of the theory of machine learning but do not know how to write your own algorithms, then again you should start with this book.
This book focuses on supervised and unsupervised machine learning methods. Its main objective is to introduce these methods in a simple and practical way, so that even beginners can understand them and benefit from them.
In each chapter, we discuss the algorithms behind the chapter's methods and implement them in MATLAB®. We chose MATLAB as the main programming language of the book because it is simple and widely used among scientists, and at the same time it supports machine learning methods through its statistics toolbox.
The book consists of 12 chapters, divided into two sections:

I: Supervised Learning Algorithms


II: Unsupervised Learning Algorithms
In the first section, we discuss decision trees, rule-based classifiers, naïve Bayesian classification, k-nearest neighbors, neural networks, linear discriminant analysis, and support vector machines.


In the second section, we discuss the k-means, Gaussian mixture model, hidden Markov model, and principal component analysis in the context of dimensionality reduction.
We have written the chapters in such a way that all are
independent of one another. That means the reader can start
from any chapter and understand it easily.

MATLAB® is a registered trademark of The MathWorks, Inc.


For product information, please contact:

The MathWorks, Inc.


3 Apple Hill Drive
Natick, MA 01760-2098 USA
Tel: 508-647-7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com
Acknowledgments

We are deeply thankful to all those who have contributed directly or indirectly to the publication of this book. Special thanks go to Dr. Mohsin Hashim, University of Khartoum, Sudan, for his valuable advice.
We would like to thank our colleagues at the Imam
Muhammad bin Saud University, Qatar University,
and University of Khartoum for their suggestions and
encouragement.
We are grateful to Richard O’Hanley of Taylor & Francis
Group for his guidance during the preparation of this book.
We would also like to thank all the teams from Taylor &
Francis Group/CRC Press for their help in the development and
editing of this book.
Authors

Mohssen Mohammed earned a BSc (Honors) in computer science at Future University, Khartoum, Sudan, in 2003.
In 2006, he earned an MSc degree in computer science from
the Faculty of Mathematical Sciences, University of Khartoum,
Sudan. In 2012, he earned a PhD in network security from
the Electrical Engineering Department, Cape Town University,
South Africa. His PhD dissertation was titled “Automated
Signature Generation for Zero-Day Polymorphic Worms Using
a Double-Honeynet.” His areas of interest include network and
information security with a focus on malware detection and
analysis methods. Dr. Mohammed has published more than
15 papers at international conferences and in journals. His
first book, Automatic Defense against Zero-Day Polymorphic
Worms in Communication Networks, was classified by IEEE
as one of the best books in network security. He is an assistant professor at the College of Computer and Information
Sciences, Al-Imam Muhammad Ibn Saud Islamic University,
Riyadh, Saudi Arabia.

Muhammad Badruddin Khan earned a PhD in 2011 at the Tokyo Institute of Technology, Japan. Since 2012, he has been a full-time assistant professor in the Department of Information Systems of Al-Imam Muhammad Ibn Saud Islamic University.
His research interests are mainly focused on data and text mining. He is currently involved in a number of research projects related to machine learning and the Arabic language, including Arabic sentiment analysis, improvement of Arabic semantic resources, an intelligent Arabic search engine, stylometry, Arabic chatbots, trend analysis using Arabic Wikipedia, Arabic proverbs classification, and violent/nonviolent video categorization using YouTube video content and Arabic comments. He has also published a number of research papers in various conferences and journals.

Eihab Bashier Mohammed Bashier earned a BSc and an MSc at the University of Khartoum, Sudan. He obtained
a postgraduate diploma in mathematical sciences from the
African Institute of Mathematical Sciences, Stellenbosch
University, South Africa. He then earned a PhD at the
University of the Western Cape in South Africa. He is an
associate professor of applied mathematics at the University
of Khartoum, Sudan. Recently, he has joined the Department
of Mathematics, Physics, and Statistics of Qatar University.
His research interests include numerical methods for differential equations with applications to biology, and information
and computer security. Dr. Bashier supervises postgraduate
students. He has also published several research articles in
international journals. Dr. Bashier received the African Union
and the Third World Academy of Science (AU-TWAS) Young
Scientists’ National Award in Basic Sciences, Technology, and
Innovation in 2011. He is a reviewer for many international
journals and is an IEEE member.
Introduction

Since their evolution, humans have used many types of tools to accomplish various tasks. The creativity of the human brain led to the invention of different machines. These machines have made human life easier by enabling people to meet various needs, including travel, industry, construction, and computing.
Despite rapid developments in the machine industry, intelligence has remained the fundamental difference between humans and machines in performing their tasks. A human uses his or her senses to gather information from the surrounding environment; the human brain analyzes that information and makes suitable decisions accordingly. Machines, in contrast, are not intelligent by nature. A machine does not have the ability to analyze data and make decisions. For example, a machine is not expected to understand the story of Harry Potter, jump over a hole in the street, or interact with other machines through a common language.
The era of intelligent machines started in the mid-twentieth century, when Alan Turing asked whether it is possible for machines to think. Since then, the artificial intelligence (AI) branch of computer science has developed rapidly. Humans have long dreamed of creating machines that have the same level of intelligence as humans. Many science fiction movies have expressed these dreams, such as Artificial Intelligence; The Matrix; The Terminator; I, Robot; and Star Wars.

The history of AI started in the year 1943, when Warren McCulloch and Walter Pitts introduced the first neural network model. Alan Turing contributed the next notable work in the development of AI in 1950, when he asked his famous question: can machines think? He introduced the B-type neural networks and also the concept of a test of intelligence. In 1955, Oliver Selfridge proposed the use of computers for pattern recognition.
In 1956, John McCarthy, Marvin Minsky, Nathan Rochester of IBM, and Claude Shannon organized the first summer AI conference at Dartmouth College in the United States. At the second Dartmouth conference, the term artificial intelligence was used for the first time. The term cognitive science originated in 1956, during a symposium on information science at MIT in the United States.
Rosenblatt invented the first perceptron in 1957. Then, in 1959, John McCarthy invented the LISP programming language. David Hubel and Torsten Wiesel proposed the use of neural networks for computer vision in 1962. Joseph Weizenbaum developed the first expert system, Eliza, which could diagnose a disease from its symptoms. The National Research Council (NRC) of the United States founded the Automatic Language Processing Advisory Committee (ALPAC) in 1964 to advance research in natural language processing. But after many years, the two organizations terminated the research because of the high expenses and low progress.
Marvin Minsky and Seymour Papert published their book Perceptrons in 1969, in which they demonstrated the limitations of neural networks. As a result, organizations stopped funding research on neural networks. The period from 1969 to 1979 witnessed a growth in research on knowledge-based systems; the programs Dendral and Mycin are examples of this research. In 1979, Paul Werbos proposed the first efficient neural network model with backpropagation. However, in 1986, David Rumelhart, Geoffrey Hinton, and
Chapter 4

Naïve Bayesian Classification

4.1 Introduction
Naïve Bayesian classifiers [1] are simple probabilistic classifiers built on the application of Bayes' theorem with a strong (naïve) assumption of independence among the features. The following equation [2] states Bayes' theorem in mathematical terms:

P(A|B) = P(A) P(B|A) / P(B)

where:
A and B are events
P(A) and P(B) are the prior probabilities of A and B without
regard to each other
P(A|B), also called posterior probability, is the probability of observing event A given that B is true
P(B|A), also called likelihood, is the probability of observing
event B given that A is true

Suppose that vector X = (x1, x2, …, xn) is an instance (with n independent features) to be classified and cj denotes one of K classes; then, using Bayes' theorem, we can calculate the posterior probability, P(cj|X), from P(cj), P(X), and P(X|cj). The naïve Bayesian classifier makes a simplistic (naïve) assumption called class conditional independence: that the effect of the value of a predictor (xi) on a given class cj is independent of the values of the other predictors.
Without going into mathematical details, for each of the K
classes, the calculation of P(cj|X) for j = 1 to K is performed.
The instance X will be assigned to class ck, if and only if

P(ck|X) > P(cj|X)   for 1 ≤ j ≤ K, j ≠ k

The idea will become clearer when an example classification using the naïve Bayesian classifier is discussed, along with its implementation in MATLAB®.
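
As a minimal illustration of this decision rule (our sketch, not code from the book), the comparison can be carried out in MATLAB once the priors and likelihoods are available; the numbers below are the ones obtained later, in Sections 4.5 and 4.6, for the example instance X:

priors      = [10/16, 6/16];        % P(Play = Yes), P(Play = No) after the Laplace correction
likelihoods = [0.070305, 0.00574];  % P(X|Play = Yes), P(X|Play = No)
scores = priors .* likelihoods;     % proportional to the posteriors P(cj|X)
[~, k] = max(scores);               % index of the winning class
classNames = {'Yes', 'No'};
predictedClass = classNames{k}      % 'Yes' for this example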

4.2 Example
To demonstrate the concept of the naïve Bayesian classifier,
we will again use the following dataset:

Outlook     Temperature   Humidity   Windy   Play
Sunny       Hot           High       False   No
Sunny       Hot           High       True    No
Overcast    Hot           High       False   Yes
Rainy       Mild          High       False   Yes
Rainy       Cool          Normal     False   Yes
Rainy       Cool          Normal     True    No
Overcast    Cool          Normal     True    Yes
Sunny       Mild          High       False   No
Sunny       Cool          Normal     False   Yes
Rainy       Mild          Normal     False   Yes
Sunny       Mild          Normal     True    Yes
Overcast    Mild          High       True    Yes
Overcast    Hot           Normal     False   Yes
Rainy       Mild          High       True    No

4.3 Prior Probability


Our task is to predict, using the different features, whether tennis will be played or not. Since there are almost twice as many examples of "Play = Yes" (9 examples) as of "Play = No" (5 examples), it is reasonable to believe that a new, unobserved case is almost twice as likely to belong to class "Yes" as to class "No." In the Bayesian paradigm, this belief, based on previous experience, is known as the prior probability.
Since there are 14 available examples, 9 of which are Yes and 5 are No, our prior probabilities for class membership are as follows:

Prior Probability P(Play = Yes) = 9 / 14


Prior Probability P(Play = No) = 5 / 14
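
As a small illustrative sketch (ours, not the book's code), these priors can be computed directly from the Play column in MATLAB:

labels  = {'No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No'}; % Play column
classes = unique(labels);                                 % {'No','Yes'}
counts  = cellfun(@(c) sum(strcmp(labels, c)), classes);  % [5, 9]
priors  = counts / numel(labels)                          % [5/14, 9/14] = [0.3571, 0.6429]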

4.4 Likelihood
Let X be the new example for which we want to predict whether tennis is going to be played or not. We can assume that the closer the (Play = Yes) (or No) examples are to X, the more likely it is that the new case belongs to (Play = Yes) (or No).
Let X = (Outlook = Overcast, Temperature = Mild, Humidity = Normal, Windy = False); then we have to compute
the conditional probabilities given in the following table:

Outlook
P(Sunny| Play = Yes) = 2/9 P(Sunny| Play = No) = 3/5
P(Overcast| Play = Yes) = 4/9 P(Overcast| Play = No) = 0
P(Rain| Play = Yes) = 3/9 P(Rain| Play = No) = 2/5

Temperature
P(Hot| Play = Yes) = 2/9 P(Hot| Play = No) = 2/5
P(Mild| Play = Yes) = 4/9 P(Mild| Play = No) = 2/5
P(Cool| Play = Yes) = 3/9 P(Cool| Play = No) = 1/5

Humidity
P(High| Play = Yes) = 3/9 P(High| Play = No) = 4/5
P(Normal| Play = Yes) = 6/9 P(Normal| Play = No) = 1/5

Windy
P(True| Play = Yes) = 3/9 P(True| Play = No) = 3/5
P(False| Play = Yes) = 6/9 P(False| Play = No) = 2/5

Using the above probabilities, we can obtain the two likelihoods of X belonging to either of the two classes:

1. P(X/Play = Yes)
2. P(X/Play = No)

The two probabilities can be obtained by the following calculations:

P(X/Play = Yes) = P(Outlook = Overcast| Play = Yes) × P(Temperature = Mild| Play = Yes) × P(Humidity = Normal| Play = Yes) × P(Windy = False| Play = Yes)

P(X/Play = No) = P(Outlook = Overcast| Play = No) × P(Temperature = Mild| Play = No) × P(Humidity = Normal| Play = No) × P(Windy = False| Play = No)
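
Plugging in the values from the table above makes the problem addressed in the next section visible (this arithmetic is ours, derived directly from the table):

P(X/Play = Yes) = 4/9 × 4/9 × 6/9 × 6/9 ≈ 0.0878
P(X/Play = No) = 0 × 2/5 × 1/5 × 2/5 = 0

Because P(Overcast| Play = No) is zero, the whole product for the "No" class collapses to zero, regardless of how strongly the other attributes favor that class.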
4.5 Laplace Estimator


One of the evident problems in calculating P(X/Play = No) is the presence of a zero value for the conditional probability P(Outlook = Overcast/Play = No). This makes the whole product equal to zero. In order to handle this problem, we will use the Laplace estimator.
The new Prior Probabilities will be as follows:

Prior Probability P(Play = Yes) = (9 + 1) / (14 + 2) = 10/16
Prior Probability P(Play = No) = (5 + 1) / (14 + 2) = 6/16
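
The same add-one correction is applied to every conditional probability: one is added to each count, and the number of distinct values of the attribute is added to the denominator. For example (our arithmetic, using the counts from Section 4.4):

P(Overcast| Play = No) = (0 + 1) / (5 + 3) = 1/8
P(Normal| Play = No) = (1 + 1) / (5 + 2) = 2/7

since Outlook has three possible values and Humidity has two.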

The following table describes the conditional probabilities after the Laplace correction:

Outlook
P(Sunny| Play = Yes) = 3/12 P(Sunny| Play = No) = 4/8
P(Overcast| Play = Yes) = 5/12 P(Overcast| Play = No) = 1/8
P(Rain| Play = Yes) = 4/12 P(Rain| Play = No) = 3/8

Temperature
P(Hot| Play = Yes) = 3/12 P(Hot| Play = No) = 3/8
P(Mild| Play = Yes) = 5/12 P(Mild| Play = No) = 3/8
P(Cool| Play = Yes) = 4/12 P(Cool| Play = No) = 2/8

Humidity
P(High| Play = Yes) = 4/11 P(High| Play = No) = 5/7
P(Normal| Play = Yes) = 7/11 P(Normal| Play = No) = 2/7

Windy
P(True| Play = Yes) = 4/11 P(True| Play = No) = 4/7
P(False| Play = Yes) = 7/11 P(False| Play = No) = 3/7

The two likelihoods can be calculated easily as follows:

P(X/Play = Yes) = P(Outlook = Overcast| Play = Yes) × P(Temperature = Mild| Play = Yes) × P(Humidity = Normal| Play = Yes) × P(Windy = False| Play = Yes)
P(X/Play = Yes) = 5/12 × 5/12 × 7/11 × 7/11 = 0.070305

P(X/Play = No) = P(Outlook = Overcast| Play = No) × P(Temperature = Mild| Play = No) × P(Humidity = Normal| Play = No) × P(Windy = False| Play = No)
P(X/Play = No) = 1/8 × 3/8 × 2/7 × 3/7 = 0.00574

4.6 Posterior Probability


In order to calculate the posterior probability, we need three
things.

1. Prior probability
2. Likelihood
3. Evidence

The following formula shows the relationship among the three quantities used to calculate the posterior probability:

Posterior = (Prior × Likelihood) / Evidence
For classification purposes, we are interested in calculating and comparing only the numerator of the above fraction, because the evidence in the denominator is the same for both classes. In other words, the posterior is proportional to the likelihood times the prior:

Posterior ∝ Prior × Likelihood

The numerator, prior × likelihood, for the two classes can be calculated by simply multiplying the respective prior probabilities and likelihoods.
P(Play = Yes/X) ∝ P(Play = Yes) × P(X/Play = Yes) = 10/16 × 0.070305 = 0.043941
P(Play = No/X) ∝ P(Play = No) × P(X/Play = No) = 6/16 × 0.00574 = 0.002152

Since P(Play = Yes) × P(X/Play = Yes) > P(Play = No) × P(X/Play = No), we will assign class "Yes" to the new case X.

4.7 MATLAB Implementation


In MATLAB, one can perform calculations related to the naïve
Bayesian classifier easily.
We will first load the same dataset that we have discussed
in the chapter as an example in the MATLAB environment and
we will then calculate different parameters related to the naïve
Bayesian classifier.
The following code snippet loads the data from “data.csv”
into the MATLAB environment.

fid = fopen('C:\Naive Bayesian\data.csv');            % open the CSV file
out = textscan(fid, '%s%s%s%s%s', 'delimiter', ',');  % read 5 string columns
fclose(fid);
num_featureswithclass = size(out,2);                  % 4 attributes + 1 class column
tot_rec = size(out{size(out,2)},1) - 1;               % number of records (first row is the header)
for i = 1:tot_rec
    yy{i} = out{num_featureswithclass}{i+1};          % class labels (skip the header row)
end
for i = 1:num_featureswithclass
    xx{i} = out{i};                                   % attribute columns, header row included
end

For calculating the prior probabilities of the class variable, the following code snippet will perform the job.
% Calculation of prior probabilities
yu = unique(yy);                          % unique class labels
nc = length(yu);                          % number of classes
fy = zeros(nc,1);                         % prior probability of each class
num_of_rec_for_each_class = zeros(nc,1);
for i = 1:nc
    for j = 1:tot_rec
        if strcmp(yy{j}, yu{i})           % strcmp rather than == for string labels
            num_of_rec_for_each_class(i) = num_of_rec_for_each_class(i) + 1;
        end
    end
end
fy = num_of_rec_for_each_class / tot_rec; % the priors, e.g., 5/14 and 9/14

In order to calculate the likelihood table, the following code snippet works:
prob_table = zeros(num_featureswithclass-1, 10, nc);   % counts, later turned into probabilities
for col = 1:num_featureswithclass-1
    unique_value = unique(xx{col});                    % distinct values of the attribute (header included)
    rec_unique_value{col} = unique_value;
    for i = 2:length(unique_value)
        for j = 2:tot_rec+1
            if strcmp(xx{col}{j}, unique_value{i}) == 1 && ...
               strcmp(xx{num_featureswithclass}{j}, yu{1}) == 1
                prob_table(col, i-1, 1) = prob_table(col, i-1, 1) + 1;
            end
            if strcmp(xx{col}{j}, unique_value{i}) == 1 && ...
               strcmp(xx{num_featureswithclass}{j}, yu{2}) == 1
                prob_table(col, i-1, 2) = prob_table(col, i-1, 2) + 1;
            end
        end
    end
end
prob_table(:,:,1) = prob_table(:,:,1) ./ num_of_rec_for_each_class(1);
prob_table(:,:,2) = prob_table(:,:,2) ./ num_of_rec_for_each_class(2);

The matrix prob_table used in the above code has dimensions 4 × 10 × 2, where 4 is the number of attributes in the dataset, 10 is the maximum number of unique values allowed per attribute (in this example the actual maximum is 3), and 2 is the number of classes. Looking at the values stored in prob_table makes this clearer.

[Display of prob_table omitted: for each class (P = Yes and N = No), it lists one row of conditional probabilities per attribute over the values Outlook (overcast, Rain, Sunny), Temperature (cool, hot, mild), Humidity (high, normal), and Windy (false, true).]

Predicting for an unlabeled record:


Now that we have a naïve Bayesian classifier in the form of
tables, we can use them to predict newly arriving unlabeled
records. The following code snippet describes the prediction
process in MATLAB.

A = {'Sunny', 'Hot', 'High', 'False'};            % the unlabeled record (spelling must match data.csv)
A1 = find(ismember(rec_unique_value{1}, A{1}));   % position of the Outlook value
A11 = 1;                                          % row index of the Outlook attribute
A2 = find(ismember(rec_unique_value{2}, A{2}));
A21 = 2;
A3 = find(ismember(rec_unique_value{3}, A{3}));
A31 = 3;
A4 = find(ismember(rec_unique_value{4}, A{4}));
A41 = 4;
ProbN = prob_table(A11,A1-1,1) * prob_table(A21,A2-1,1) * ...
        prob_table(A31,A3-1,1) * prob_table(A41,A4-1,1) * fy(1);   % prior x likelihood, class 1
ProbP = prob_table(A11,A1-1,2) * prob_table(A21,A2-1,2) * ...
        prob_table(A31,A3-1,2) * prob_table(A41,A4-1,2) * fy(2);   % prior x likelihood, class 2
if ProbN > ProbP
    prediction = 'N'
else
    prediction = 'P'
end
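
For readers who prefer a built-in routine, MATLAB's Statistics and Machine Learning Toolbox also offers the fitcnb function. The following sketch is ours, not the book's code; it assumes the categorical attributes have been encoded as integers (Outlook: Sunny = 1, Overcast = 2, Rainy = 3; Temperature: Hot = 1, Mild = 2, Cool = 3; Humidity: High = 1, Normal = 2; Windy: False = 1, True = 2) and treats every predictor as categorical (multivariate multinomial):

% The 14 training records from Section 4.2, integer-encoded as described above
X = [1 1 1 1; 1 1 1 2; 2 1 1 1; 3 2 1 1; 3 3 2 1; 3 3 2 2; 2 3 2 2; ...
     1 2 1 1; 1 3 2 1; 3 2 2 1; 1 2 2 2; 2 2 1 2; 2 1 2 1; 3 2 1 2];
Y = {'No';'No';'Yes';'Yes';'Yes';'No';'Yes';'No';'Yes';'Yes';'Yes';'Yes';'Yes';'No'};
Mdl = fitcnb(X, Y, 'DistributionNames', 'mvmn');  % 'mvmn' treats predictors as categorical
label = predict(Mdl, [2 2 2 1])  % Overcast, Mild, Normal, False

Whether or not smoothing is applied internally, the prediction for this record is "Yes": without any correction the "No" posterior collapses to zero (Section 4.4), and with the Laplace correction the "Yes" posterior is still the larger one (Section 4.6).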

References
1. Good, I. J. The Estimation of Probabilities: An Essay on Modern
Bayesian Methods. Cambridge: MIT Press, 1965.
2. Kendall, M. G. and Stuart, A. The Advanced Theory of Statistics.
London: Griffin, 1968.
Chapter 5

The k-Nearest Neighbors Classifiers

5.1 Introduction
In pattern recognition, the k-nearest neighbors algorithm
(or k-NN for short) is a nonparametric method used for
classification and regression [1]. In both cases, the input consists of the k closest training examples in the feature space.
The output depends on whether k-NN is used for classification
or regression:

◾ In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k-NN (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
◾ In k-NN regression, the output is the property value for the object. This value is the average of the values of its k-NN.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
For both classification and regression, it can be useful to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
The neighbors are taken from a set of objects for which
the class (for k-NN classification) or the object property value
(for k-NN regression) is known. This can be thought of as the
training set for the algorithm, though no explicit training step
is required.
A shortcoming of the k-NN algorithm is that it is sensitive
to the local structure of the data. The algorithm has nothing
to do with and is not to be confused with k-means, another
popular machine learning technique.
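
To make the classification rule concrete, here is a minimal MATLAB sketch of k-NN classification (ours, not the book's code; it assumes numeric attributes and numeric class labels):

% Xtrain: N-by-m matrix of training points, ytrain: N-by-1 numeric class labels,
% xquery: 1-by-m point to classify, k: number of neighbors to use.
function label = knn_classify(Xtrain, ytrain, xquery, k)
    d = sqrt(sum(bsxfun(@minus, Xtrain, xquery).^2, 2));  % Euclidean distance to every training point
    [~, idx] = sort(d);                                   % order the training points by distance
    label = mode(ytrain(idx(1:k)));                       % majority vote among the k nearest
end

For k-NN regression, the last line would instead return mean(ytrain(idx(1:k))), the average of the k nearest responses.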

5.2 Example
Suppose that we have two-dimensional data, consisting of circles, squares, and diamonds, as in Figure 5.1.
We would like to classify each of the diamonds as either a circle or a square; the k-NN method is a good choice for this classification task.
The k-NN method is an instance-based learning method that can be used for both classification and regression.
Suppose that we are given a set of data points {(x1, C1), (x2, C2), ..., (xN, CN)}, where each of the points xj, j = 1, ..., N, has m attributes aj1, aj2, ..., ajm, and C1, ..., CN are taken from some discrete or continuous space K.
[Figure 5.1 Two classes of squares and circle data, with unclassified diamonds. (Scatter plot over attributes x1 and x2, both on the range 0.0 to 1.0.)]

It is clear that f : X → K, with f(xj) = Cj, where X = {x1, ..., xN} is a subset of some space Y.
Given an unclassified point xs = (as1, as2, ..., asm) ∈ Y, we would like to find Cs ∈ K such that f(xs) = Cs. At this point, we have two scenarios [2]:

1. The space K is discrete and finite: in this case, we have a classification problem with Cs ∈ K, where K = {C1, ..., CN}, and the k-NN method sets f(xs) to be the majority vote of the k-NN of xs.
2. The space K is continuous: in this case, we have a regression problem, and the k-NN method sets f(xs) = (1/k) Σ_{j=1}^{k} f(xsj), where {xs1, xs2, ..., xsk} is the set of k-NN of xs. That is, f(xs) is the average of the values of the k-NN of the point xs.

The k-NN method belongs to the class of instance-based supervised learning methods. This class of methods does not create an approximating model from a training set, as happens in model-based supervised learning methods.


To determine the k-NN of a point xs, a distance measure must be used to determine the k points closest to xs: {xs1, xs2, ..., xsk}. Assuming that d(xs, xj), j = 1, ..., N, measures the distance between xs and xj, and {xsi : i = 1, ..., k} is the set of k-NN of xs according to the distance metric d, the approximation f(xs) = (1/k) Σ_{j=1}^{k} f(xsj) assumes that all k neighboring points contribute equally to the classification of the point xs.
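
A common choice for d, and the one used by the MATLAB classifier in the next section, is the Euclidean distance:

d(xs, xj) = sqrt((as1 - aj1)^2 + (as2 - aj2)^2 + ... + (asm - ajm)^2)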

5.3 k-Nearest Neighbors in MATLAB®


MATLAB enables the construction of a k-NN model through the method "ClassificationKNN.fit," which receives a matrix of attributes and a vector of corresponding classes. The output of ClassificationKNN.fit is a k-NN model. The default number of neighbors is 1, but it is possible to change this number by setting the attribute "NumNeighbors" to the desired value.
The following MATLAB script applies the k-NN classifier to
the ecoli dataset:

clear; clc;
EcoliData = load('ecoliData.txt');     % loading the ecoli dataset
EColiAttrib = EcoliData(:, 1:end-1);   % ecoli attributes
EColiClass = EcoliData(:, end);        % ecoli classes
% knnmodel = ClassificationKNN.fit(EColiAttrib(1:280,:), EColiClass(1:280), ...
%     'NumNeighbors', 5, 'DistanceWeight', 'Inverse');
% The commented line above would, in addition, weight the neighbors by inverse distance.
% Fitting the ecoli data with the k-nearest neighbors method (5 neighbors):
knnmodel = ClassificationKNN.fit(EColiAttrib(1:280,:), EColiClass(1:280), ...
    'NumNeighbors', 5)
Pred = knnmodel.predict(EColiAttrib(281:end,:));
% fraction of correctly classified test records, as a percentage
knnAccuracy = 1 - length(find(EColiClass(281:end) - Pred))/length(EColiClass(281:end));
knnAccuracy = knnAccuracy * 100

The results are as follows:

knnmodel =
ClassificationKNN
PredictorNames: {'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7'}
ResponseName: 'Y'
ClassNames: [1 2 3 4 5 6 7 8]
ScoreTransform: 'none'
NObservations: 280
Distance: 'euclidean'
NumNeighbors: 5
Properties, Methods

knnAccuracy =
98.2143

Another approach assumes that a point closer to $x_s$ should contribute more to the classification of $x_s$. This leads to the idea of distance-based weights, where the weight of each neighbor is proportional to its closeness to $x_s$. If $\{x_{s_1}, x_{s_2}, \ldots, x_{s_k}\}$ denotes the k-NN of $x_s$, let

$$w(x_s, x_{s_j}) = \frac{e^{-d(x_s, x_{s_j})}}{\sum_{i=1}^{k} e^{-d(x_s, x_{s_i})}}$$

It is obvious that $\sum_{i=1}^{k} w(x_s, x_{s_i}) = 1$. Finally, $f(x_s)$ is approximated as

$$f(x_s) = \sum_{j=1}^{k} w(x_s, x_{s_j}) \cdot f(x_{s_j})$$
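
A minimal MATLAB sketch of this weighted rule might look as follows; the function name weightedKNNPredict and the use of the Euclidean distance are illustrative assumptions, not taken from the text:

function fs = weightedKNNPredict(X, F, xs, k)
% X  : N-by-m matrix of training points
% F  : N-by-1 vector of target values f(x_j)
% xs : 1-by-m query point
% k  : number of neighbors
d = sqrt(sum((X - repmat(xs, size(X,1), 1)).^2, 2)); % distances to all training points
[dk, idx] = sort(d); % sort the points by distance
dk = dk(1:k); idx = idx(1:k); % keep the k nearest neighbors
w = exp(-dk) / sum(exp(-dk)); % normalized exponential weights
fs = sum(w .* F(idx)); % weighted average of the neighbors' values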


MATLAB enables the use of weighted distances by changing the attribute 'DistanceWeight' from 'equal' to either 'inverse' or 'squaredinverse'. This is done by using

knnmodel = ClassificationKNN.fit(EColiAttrib(1:280,:), EColiClass(1:280),...
'NumNeighbors', 5, 'DistanceWeight', 'inverse');

Applying the k-NN classifier with a weighted distance provides the following results:

knnAccuracy =
98.2143

which is the same accuracy as the model with equal weights.

References

1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., New York: Springer-Verlag, February 2009.
2. Vapnik, V. N. The Nature of Statistical Learning Theory, 2nd Ed., New York: Springer-Verlag, 1999.

Chapter 6

Neural Networks

6.1 Perceptron Neural Network


A neural network is a model of reasoning inspired by biological neural networks, such as the central nervous system of an animal brain. The human brain consists of a huge number of interconnected nerve cells called neurons. A neuron consists of a cell body (the soma), a number of fibers called dendrites, and a single long fiber called the axon [1] (Figure 6.1).
The main function of the dendrites is to receive messages from other neurons. The signal then travels to the main cell body, the soma. From the soma, the signal travels down the axon to the axon tips, where it crosses the synapse, the space between neurons, and from there the message can move on to the next neuron.
The human brain incorporates nearly 10 billion neurons and 60 trillion connections (synapses) between them. By using a massive number of neurons simultaneously, the brain can process data and perform its functions very fast [2].
The structure of the biological neural system, together with the way it performs its functions, has inspired the idea of artificial neural networks (ANNs).


Figure 6.1 A biological neuron.

The first neural network conceptual model was introduced in 1943 by Warren McCulloch and Walter Pitts. They described the concept of a neuron as an individual cell that communicates with other cells in a network. This cell receives data from other cells, processes the inputs, and passes the outputs to other cells. Since then, scientists and researchers have carried out intensive research to develop ANNs. Nowadays, ANNs are considered among the most efficient tools for pattern recognition, regression, and classification [3].
The big developments in ANNs during the past few decades have motivated human ambitions to create intelligent machines with human-like brains. Many Hollywood movies are based on the idea that human-like smart machines will aim at controlling the universe (Artificial Intelligence; The Matrix; The Terminator; I, Robot; Star Wars; Autómata; etc.). The robots of I, Robot are designed to have artificial brains consisting of ANNs. However, although the performance of ANNs is still far from that of the human brain, to date ANNs remain one of the leading computational intelligence tools.

6.1.1 Perceptrons
A perceptron is the simplest kind of ANN, invented in 1957 by Frank Rosenblatt. A perceptron is a neural network that consists of a single neuron, which can receive multiple inputs and produces a single output (Figure 6.2).

Figure 6.2 A perceptron with m input features $(a_1, \ldots, a_m)$.
Perceptrons are used to classify linearly separable classes by finding a hyperplane in the m-dimensional feature space that separates the instances of the two classes. In a perceptron model, the weighted sum

$$\sum_{j=1}^{m} w_j \cdot x_j = w_1 \cdot x_1 + \cdots + w_m \cdot x_m$$

is evaluated and passed to an activation function, which compares it to a predetermined threshold θ. If the weighted sum is greater than the threshold θ, then the perceptron fires and outputs 1; otherwise, it outputs 0. There are many kinds of activation functions that can be used with the perceptron, but the step, sign, linear, and sigmoid functions are the most popular ones. The step function has the following form:

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$
The sign function is given by

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Figure 6.3 Activation functions: (a) step activation function, (b) sign activation function, (c) linear activation function, and (d) sigmoid activation function.

The linear function is $f(x) = x$, and the sigmoid function is (Figure 6.3):

$$f(x) = \frac{1}{1 + e^{-x}}$$
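
As a quick illustration, the four activation functions can be written as one-line anonymous functions in MATLAB; this is a sketch for experimentation, not code from the text:

step = @(x) double(x > 0); % step function
sgn = @(x) sign(x); % sign function (built-in)
lin = @(x) x; % linear function
sigm = @(x) 1 ./ (1 + exp(-x)); % sigmoid function
x = -5:0.1:5;
plot(x, sigm(x)); % for example, plot the sigmoid over [-5, 5]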

All the above-mentioned activation functions are triggered at a threshold θ = 0. However, it is often more convenient to have a threshold other than zero. To do that, a bias b is introduced to the perceptron as an additional input, besides the m inputs $x_1, x_2, \ldots, x_m$. The role of the bias b is to move the threshold function to the left or right, in order to change the activation threshold (Figure 6.4).
Changing the value of the bias b does not change the shape of the activation function, but together with the other weights, it determines when the perceptron fires. It is worth noting that the input associated with the bias is always one.

Figure 6.4 A perceptron with m inputs $x_1, \ldots, x_m$, a bias b, and output $y = f\left(b + \sum_{j=1}^{m} w_j \cdot x_j\right)$.

Now, training the perceptron aims at determining the optimal weights and bias value at which the perceptron fires.
In the given classified data $(x_j, y_j)$, $j = 1, \ldots, N$, each feature vector $x_j$ has m features $(a_{j1}, \ldots, a_{jm})$. Each feature vector belongs to either class $C_1$ or class $C_2$, and $y_j \in \{-1, 1\}$. If the two given classes are linearly separable, then the perceptron can be used to classify them. We will assume that instances that belong to class $C_1$ are classified as 1, whereas instances that belong to class $C_2$ are classified as −1. Therefore, we will use the sign activation function for the perceptron. MATLAB's function sign(x) returns 1 if x is positive and −1 if x is negative.
Given a training feature vector $x_j = (a_{j1}, \ldots, a_{jm})$ classified as $y_j$, we now show how the perceptron deals with the vector $x_j$ during the training stage. The first thing we do is to set

$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_m \\ b \end{bmatrix}, \qquad x_j = \begin{bmatrix} a_{j1} \\ \vdots \\ a_{jm} \\ 1 \end{bmatrix}$$

and generate random values for the (m + 1)-dimensional vector w. We notice that

$$w^T \cdot x_j = w_1 a_{j1} + \cdots + w_m a_{jm} + b$$

is the weighted sum. Since we are interested in the sign of the weighted sum, we can use MATLAB's sign function to determine the sign of $w^T \cdot x_j$. Now, we define the error in classifying $x_j$ to be the value

$$E_j = y_j - \mathrm{sign}(w^T \cdot x_j)$$

The error $E_j$ can take the value 2, 0, or −2. The two nonzero values indicate the occurrence of an error in classifying the feature vector $x_j$. Knowing the error helps us adjust the weights.
To readjust the weights, we define a learning rate parameter $\alpha \in (0, 1)$. This parameter determines how fast the weights are changed and, hence, how fast the perceptron learns during the training phase. Given the learning rate α, the correction to the weight vector is given by

$$\Delta w = \alpha x_j E_j$$

and the new weight vector becomes

$$w_{\mathrm{new}} = w_{\mathrm{old}} + \Delta w$$

Applying this round of constructing the weighted sum, evaluating the error, and adjusting the weight vector to all the instances in the training set is called an epoch. The perceptron learning process consists of an appropriate number of epochs.
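
For instance, a single update step for one training instance can be written directly in MATLAB as follows; the numbers are arbitrary, and the feature vector is assumed to be already augmented with the constant bias input 1:

alpha = 0.1; % learning rate (illustrative value)
x = [0.4, 0.7, 1]; % one feature vector, augmented with the bias input 1
y = -1; % its actual class
w = 0.5*rand(1, 3); % random initial weights (the last entry is the bias)
if sum(w.*x) > 0 % weighted sum and sign activation
    yhat = 1;
else
    yhat = -1;
end
E = y - yhat; % classification error: 2, 0, or -2
w = w + alpha * E * x; % corrected weight vector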

6.2 MATLAB Implementation of the Perceptron Training and Testing Algorithms

The following MATLAB function applies the perceptron learning algorithm:

function w = PerceptronLearning(TrainingSet, Class, Epochs, LearningRate)
%% The output w is the vector of weights, including the bias weight,
%% provided the training instances are augmented with a constant input of 1.
%% TrainingSet is an n-by-m matrix, where the rows represent the instances
%% and the columns represent the features.
%% Class is an n-dimensional vector of 1's and -1's corresponding to the
%% instances of the training set. Epochs determines the number of epochs,
%% and LearningRate determines the rate at which the weights are corrected.
[n, m] = size(TrainingSet);
w = 0.5*rand(1, m); % initializing the weights
a = LearningRate;
for epoch = 1: Epochs
    for j = 1: n
        x = TrainingSet(j,:); % picking an instance x_j from the training set
        wsum = sum(w.*x); % constructing the weighted sum
        if wsum > 0
            y = 1;
        else
            y = -1;
        end
        Error = Class(j) - y; % difference between the actual and predicted class
        w = w + Error*x*a; % correcting the weights according to the error
    end
end

For testing the algorithm, the PerceptronTesting function can be used. The following is the code for the PerceptronTesting function:

function [PredictedClass, Accuracy] = PerceptronTesting(TestingSet, Class, w)
%% The outputs are a vector of predicted classes and the prediction
%% accuracy as a percentage. The accuracy is the ratio between the number
%% of correctly classified instances in the testing set and the total
%% number of instances in the testing set, expressed as a percentage.
%% TestingSet is an N-by-m matrix, and Class contains the corresponding
%% classes of the feature vectors in the testing set. The vector w is the
%% vector of weights obtained during the training phase.
[N, m] = size(TestingSet);
PredictedClass = zeros(N, 1);
for j = 1: N
    x = TestingSet(j,:);
    wsum = sum(w.*x);
    if wsum > 0
        PredictedClass(j) = 1;
    else
        PredictedClass(j) = -1;
    end
end
Error = Class - PredictedClass;
Accuracy = (1 - length(find(Error))/length(Error))*100;
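
As a usage sketch, the two functions above might be called as follows; the file name twoClassData.txt, the 70/30 split, and the parameter values are illustrative assumptions, and the data are assumed to be augmented with a column of ones for the bias, with classes coded as 1 and -1:

Data = load('twoClassData.txt'); % hypothetical file: features first, class label in the last column
X = [Data(:, 1:end-1), ones(size(Data,1), 1)]; % augmenting the features with the bias input
Y = Data(:, end); % classes coded as 1 and -1
nTrain = floor(0.7*size(X,1)); % using 70% of the instances for training
w = PerceptronLearning(X(1:nTrain,:), Y(1:nTrain), 100, 0.1);
[Pred, Acc] = PerceptronTesting(X(nTrain+1:end,:), Y(nTrain+1:end), w);
Acc % prediction accuracy as a percentage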

6.3 Multilayer Perceptron Networks

A single perceptron can solve any classification problem for linearly separable classes. Given two nonlinearly separable classes, a single-layer perceptron network will fail to classify them. The most common simple nonlinearly separable problem is the logical XOR problem. The XOR logical operation takes two logical inputs x and y, and

the output of the XOR operator is as given in Table 6.1.


The output of the XOR operation is either 0 or 1, but the two classes are not linearly separable in the feature space (Figure 6.5). From Figure 6.5, it is obvious that the true instances cannot be separated from the false instances by a straight line. Such a nonlinearly separable problem is solved by using a multilayer perceptron network. Indeed, the decision boundaries

Table 6.1 XOR Logical Operation


x y x⊕ y
0 0 1
0 1 0
1 0 0
1 1 1

Figure 6.5 The XOR binary operation: True instances are plotted in circles and false instances are plotted with squares.


in a multilayer perceptron network have a more complex geometric shape in the feature space than a single hyperplane.
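
To make this concrete, the following MATLAB sketch hand-wires a tiny two-layer network of step units that computes XOR; the particular weights and thresholds are illustrative choices, not taken from the text. One hidden unit acts as an OR gate, the other as an AND gate, and the output unit fires only when OR is true and AND is false:

step = @(v) double(v > 0); % step activation
xorNet = @(x, y) step(step(x + y - 0.5) ... % hidden unit 1: OR gate
                    - step(x + y - 1.5) ... % hidden unit 2: AND gate
                    - 0.5); % output unit: fires when OR is 1 and AND is 0
[xorNet(0,0), xorNet(0,1), xorNet(1,0), xorNet(1,1)] % returns [0 1 1 0]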


In a multilayer perceptron neural network, each perceptron receives a set of inputs from other perceptrons and, according to whether the weighted sum of the inputs is above some threshold value, either fires or does not. As in a single perceptron network, the bias (which determines the threshold), in addition to the weights, is adjusted during the training phase (Figure 6.6).
In the final neural network model, a specific set of neurons fires for a given input. Changing the input changes the set of neurons that fires. The main purpose of training the neural network is to learn when each neuron should fire in response to a specific input.
To train a neural network, random weights and biases are first generated. Then, a training instance is passed to the neural network, where the output of each layer is passed to the next layer until the predicted output is computed at the output layer, according to the initial weights. The error at the output layer is computed as the difference between the actual and predicted outputs. According to the error, the weights between the output layer and the hidden layer are corrected, and then the weights between the hidden layer and the input layer are adjusted in a backward fashion.
Figure 6.6 A multilayer perceptron neural network.



Another training instance is then passed to the neural network, the error at the output layer is evaluated, and the weights between the different layers are corrected from the output layer back to the input layer. Repeating this process for a number of epochs trains the neural network.

6.4 The Backpropagation Algorithm

The backpropagation algorithm consists of two stages:

Step 1: This step is called the feed-forward stage; at this stage, the inputs are fed to the network and the outputs are computed at both the hidden and output layers.
Step 2: The prediction error is computed at the output layer, and this error is propagated backward to adjust the weights. This step is called backpropagation. The backpropagation algorithm uses deterministic optimization to minimize the squared error sum using the gradient descent method. The gradient descent method requires computing the partial derivatives of the activation function with respect to the weights of the inputs. Therefore, it is not applicable to use the hard-limit activation functions (the step and sign functions).

Generally, a function $f : \mathbb{R} \to [0, 1]$ is an activation function if it satisfies the following properties:

1. The function f has a first derivative f′.
2. The function f is a nondecreasing function, that is, $f'(x) > 0$ for all $x \in \mathbb{R}$.
3. The function f has horizontal asymptotes at both 0 and 1.
4. Both f and f′ are computable functions.

The sigmoid functions are the most popular activation functions in neural networks. Two sigmoid functions are usually used as activation functions:

1. $S_1(x) = \dfrac{1}{1 + e^{-x}}$
2. $S_2(x) = \dfrac{1 - e^{-x}}{1 + e^{-x}}$

The derivatives of the two functions $S_1(x)$ and $S_2(x)$ are given by

$$\frac{dS_1(x)}{dx} = S_1(x)\left(1 - S_1(x)\right)$$

and

$$\frac{dS_2(x)}{dx} = \frac{1}{2}\left(1 - S_2(x)^2\right)$$

The second sigmoid function $S_2(x)$ is a rescaled hyperbolic tangent function, $S_2(x) = \tanh(x/2)$. Choosing sigmoid functions guarantees the continuity and differentiability of the error function. The graphs of the two sigmoid functions $S_1(x)$ and $S_2(x)$ are shown in Figure 6.7.
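
A quick numerical check of these derivative formulas can be done in MATLAB with a central-difference approximation; this is an illustrative sketch, not code from the text:

S1 = @(x) 1 ./ (1 + exp(-x));
S2 = @(x) (1 - exp(-x)) ./ (1 + exp(-x));
x = 0.7; h = 1e-6; % arbitrary test point and step size
numeric1 = (S1(x+h) - S1(x-h)) / (2*h); % numerical derivative of S1
numeric2 = (S2(x+h) - S2(x-h)) / (2*h); % numerical derivative of S2
[numeric1, S1(x)*(1 - S1(x))] % the two entries agree
[numeric2, 0.5*(1 - S2(x)^2)] % the two entries agree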

Figure 6.7 The two sigmoid functions $S_1(x)$ and $S_2(x)$.



6.4.1 Weights Updates in Neural Networks

Given the classified data $(x_j, y_j)$, $j = 1, \ldots, N$, where each feature vector $x_j$ has m features $(a_{j1}, \ldots, a_{jm})$, let $p_j$ be the output predicted by the neural network when the feature vector $x_j$ is presented. Then, the error is given as follows:

$$E = \frac{1}{2}\sum_{j=1}^{N} \left(y_j - p_j\right)^2$$

The backpropagation algorithm works to find a local minimum of the error function, where the optimization process runs over the weights and biases of the neural network. It is worth remembering that the biases are embedded in the weights. If the neural network contains a total of L weights, then the gradient of the error function with respect to the network weights is

$$\nabla E = \begin{bmatrix} \dfrac{\partial E}{\partial w_1} \\ \vdots \\ \dfrac{\partial E}{\partial w_L} \end{bmatrix}$$

In the gradient descent method, the update to the weight vector is proportional to the negative of the gradient. That is,

$$\Delta w_j = -\alpha \frac{\partial E}{\partial w_j}, \quad j = 1, \ldots, L$$

where α > 0 is a constant representing the learning rate.
We will assume that the activation function $S_1(x)$ is used throughout the network. Then, a unit i that receives inputs $(z_1, \ldots, z_M)$ with weights $(w_{i1}, \ldots, w_{iM})$ gives an output $o_i$, where

$$o_i = S_1\left(\sum_{k=1}^{M} w_{ik} z_k\right)$$


If the target output of unit i is $u_i$, then by using the chain rule, we get

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial w_{ij}} = -\left(u_i - o_i\right) o_i \left(1 - o_i\right) z_j$$

Therefore, the correction to the weight $w_{ij}$ is given as follows:

$$\Delta w_{ij} = \alpha \left(u_i - o_i\right) o_i \left(1 - o_i\right) z_j$$

The weights are first adjusted at the output layer (the weights from the hidden layer to the output layer) and then at the hidden layer (the weights from the input layer to the hidden layer), assuming that the network has one hidden layer.
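
The following MATLAB sketch carries out one such update for a tiny network with one hidden layer and a single sigmoid output; the network size, learning rate, and variable names are illustrative assumptions, and the biases are omitted for brevity:

S1 = @(v) 1 ./ (1 + exp(-v)); % sigmoid activation
alpha = 0.5; % learning rate
x = [0.2; 0.9]; % one input vector (two features)
u = 1; % target output
W1 = 0.5*rand(3, 2); % input-to-hidden weights (three hidden units)
W2 = 0.5*rand(1, 3); % hidden-to-output weights
% Feed-forward stage
z = S1(W1 * x); % hidden-layer outputs
o = S1(W2 * z); % network output
% Backpropagation stage
delta_o = (u - o) * o * (1 - o); % error term at the output unit
delta_h = (W2' * delta_o) .* z .* (1 - z); % error terms at the hidden units
W2 = W2 + alpha * delta_o * z'; % correcting the hidden-to-output weights
W1 = W1 + alpha * delta_h * x'; % correcting the input-to-hidden weights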

6.5 Neural Networks in MATLAB


MATLAB enables the implementation of a neural network
through its neural network toolbox [4]. The initialization of
the neural network is done through the MATLAB command
feedforwardnet. The following MATLAB script is used to
train a network with the “ecoli” dataset (Figure 6.8).

clear; clc;
A = load('EColi1.txt'); % Loading the ecoli data, with the classes in the last column
C = A(1:end, 1:end-1)'; % C is the matrix of feature vectors (one column per instance)
T = A(:, end)'; % T is the vector of classes
net = feedforwardnet; % Initializing a feed-forward neural network 'net'
net = configure(net, C, T);
hiddenLayerSize = 10; % Setting the hidden layer size (number of neurons) to 10

net = patternnet(hiddenLayerSize); % Pattern recognition network
net.divideParam.trainRatio = 0.7; % Ratio of training data is 70%
net.divideParam.valRatio = 0.2; % Ratio of validation data is 20%
net.divideParam.testRatio = 0.1; % Ratio of testing data is 10%
[net, tr] = train(net, C, T); % Training the network; the resulting model is the output net
outputs = net(C); % Applying the model to the data
errors = gsubtract(T, outputs); % Computing the classification errors
performance = perform(net, T, outputs)
view(net)

The outputs of the above script are as follows:

>> performance =
0.7619
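
To turn the continuous network outputs into class labels and estimate the classification accuracy, one might round the outputs and compare them with the true classes, for example (an illustrative addition, not part of the original script):

predClass = round(outputs); % nearest class label for each instance
accuracy = sum(predClass == T) / length(T) * 100 % percentage of correctly classified instances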

Figure 6.8 The output neural network model for classifying the
ecoli data.
