DataScience - Project (Banknote Authentication) - SHILANJOY BHATTACHARJEE EE

This document is a project report submitted by Shilanjoy Bhattacharjee for a course on Data Science and Data Analytics. The report discusses various machine learning algorithms, including Support Vector Machines, Random Forest Classifier, and K-Nearest Neighbours. It provides an overview of these algorithms and their applications in areas such as text classification, image recognition, protein classification, and more. The document was submitted as part of a course project to analyse a banknote authentication dataset using machine learning techniques.

BANKNOTE AUTHENTICATION

A PROJECT REPORT

Submitted by

SHILANJOY BHATTACHARJEE (EE 25)


Enrolment Id - 12017009005027

Data Science & Data Analytics Lab (CS – 695A)

Project Guide - Prof. Sankhadeep Chatterjee & Prof. Moumita Basu

B. Tech 3rd year


2020

University of Engineering & Management (UEM)


University Area, Plot No. III - B/5, New Town, Action Area - III, Kolkata, West Bengal 700156
CERTIFICATE

Certified that this project report “Banknote Authentication” is the
bona fide work of

SHILANJOY BHATTACHARJEE (EE 25)
Enrolment Id - 12017009005027

of B.Tech, EEE, who carried out the project work under our supervision.

………………………………..

………………………………..

SIGNATURE

Examiner:
ACKNOWLEDGEMENT

The completion of this project could not have been accomplished
without the support of our teachers and guides, Prof. Sankhadeep Chatterjee
and Prof. Moumita Basu. We are thankful to them for allowing us the time to
research and write.

We are also very thankful to our respected teachers for their cooperation and
suggestions regarding the project work.

Last but not least, we are very thankful to our HOD, Prof. Sanjoy Bhadra,
for giving us the opportunity to do such an interesting project.

- Shilanjoy Bhattacharjee
OVERVIEW

Machine Learning is the field of study that gives computers the capability to
learn without being explicitly programmed. ML is one of the most exciting
technologies one could come across.

As the name suggests, it gives the computer the quality that makes it more
similar to humans: the ability to learn.

Machine learning is actively being used today, perhaps in many more places
than one would expect.

The basic premise of machine learning is to build algorithms that can receive
input data and use statistical analysis to predict an output while updating outputs
as new data becomes available.

TYPES OF LEARNING:
1. Supervised Learning
2. Unsupervised Learning
INTRODUCTION
Data mining is concerned with locating hidden relationships present in
business data, enabling organizations to make predictions for later use.

Data mining has emerged as a key business intelligence technology.


The purpose of data mining is to extract implicit, previously unknown, and
potentially useful (or actionable) patterns from data.

Data mining encompasses many modern techniques, including classification
(neural networks, k-nearest neighbours, naive Bayes, and decision trees),
clustering (density-based clustering, k-means, hierarchical clustering), and
association (constraint-based, multilevel, multidimensional, and
one-dimensional association).

Years of experience show that data mining is a process, and its successful
application calls for data pre-processing (cleaning, noise/outlier removal,
dimensionality reduction), post-processing (presentation, interpretability,
summarization), and a sound knowledge of the problem domain.

All traditional algorithms are affected to some degree by the class-imbalance
problem. The correct choice of metric (or combination of metrics) to
evaluate, and ultimately improve, is essential for the success of a data
mining effort in such areas, since most of the time improving one metric
degrades others.
ALGORITHMS

SUPPORT VECTOR MACHINE (SVM):

A support-vector machine constructs a hyperplane or set of hyperplanes in a
high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks such as outlier detection. Intuitively, a good
separation is achieved by the hyperplane that has the largest distance to the
nearest training-data point of any class (the so-called functional margin),
since in general the larger the margin, the lower the generalization error of
the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it
often happens that the sets to discriminate are not linearly separable in
that space. For this reason, it was proposed that the original
finite-dimensional space be mapped into a much higher-dimensional space,
presumably making the separation easier in that space.

To keep the computational load reasonable, the mappings used by SVM schemes
are designed to ensure that dot products of pairs of input data vectors may be
computed easily in terms of the variables in the original space, by defining them
in terms of a kernel function k(x, y) selected to suit the problem.
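As an illustration of such a kernel, the widely used radial basis function (RBF) kernel can be evaluated directly from distances in the original space; the sketch below is a minimal NumPy example with made-up sample points:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

    It depends on the inputs only through their distance in the original
    space, so no explicit high-dimensional mapping is ever computed.
    """
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, x))                      # identical points -> 1.0
print(rbf_kernel(x, np.array([3.0, 4.0])))   # distant points -> near 0
```

Note how the kernel value shrinks as the two points move apart, which is exactly the property used below to measure closeness.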

The hyperplanes in the higher-dimensional space are defined as the set of
points whose dot product with a vector in that space is constant, where such
a set of vectors is an orthogonal (and thus minimal) set of vectors that
defines a hyperplane.

The vectors defining the hyperplanes can be chosen to be linear combinations
with parameters a_i of images of feature vectors x_i that occur in the
database. With this choice of hyperplane, the points x in the feature space
that are mapped into the hyperplane are defined by the relation
sum_i a_i k(x_i, x) = constant.

Note that if k(x, y) becomes small as y grows farther from x, each term in
the sum measures the degree of closeness of the test point x to the
corresponding database point. In this way, the sum of kernels above can be
used to measure the relative nearness of each test point to the data points
originating in one or the other of the sets to be discriminated.

Note that the set of points x mapped into any hyperplane can be quite
convoluted as a result, allowing much more complex discrimination between
sets that are not convex at all in the original space.
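A minimal sketch of applying an RBF-kernel SVM with scikit-learn follows. It uses synthetic two-class data as a stand-in for the four banknote features (variance, skewness, kurtosis, and entropy of the wavelet-transformed image), since the dataset file itself is not bundled with this report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the four banknote features and the genuine/forged label.
X, y = make_classification(n_samples=400, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), as discussed above.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Loading the real banknote authentication CSV in place of `make_classification` is a one-line change once the file is available.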
RANDOM FOREST CLASSIFIER:
Decision trees are a popular method for various machine learning tasks. Tree
learning "comes closest to meeting the requirements for serving as an
off-the-shelf procedure for data mining", because it is invariant under
scaling and various other transformations of feature values, is robust to
the inclusion of irrelevant features, and produces inspectable models.

However, single trees are seldom accurate. In particular, trees that are
grown very deep tend to learn highly irregular patterns: they overfit their
training sets, i.e. they have low bias but very high variance.

Random forests are a way of averaging multiple deep decision trees, trained on
different parts of the same training set, with the goal of reducing the variance.

This comes at the expense of a small increase in bias and some loss of
interpretability, but generally greatly boosts the performance of the final
model.
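The bias–variance trade-off above can be sketched by comparing a single fully grown tree with a forest of such trees; again, synthetic stand-in data is used in place of the actual banknote file:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=4, n_informative=4,
                           n_redundant=0, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# One fully grown tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Averaging many deep trees, each trained on a bootstrap sample with random
# feature subsets, reduces the variance.
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(
    X_train, y_train)

print("single tree:", tree.score(X_test, y_test))
print("forest:    ", forest.score(X_test, y_test))
```

On noisy data such as this (10% label noise), the forest typically generalizes better than the single tree, though the exact scores depend on the random seed.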
K-NEAREST NEIGHBOUR:

The training examples are vectors in a multidimensional feature space, each
with a class label. The training phase of the algorithm consists only of
storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabelled
vector (a query or test point) is classified by assigning the label which is
most frequent among the k training samples nearest to that query point.

A commonly used distance metric for continuous variables is Euclidean
distance. For discrete variables, such as in text classification, another
metric can be used, such as the overlap metric (or Hamming distance).

In the context of gene expression microarray data, for example, k-NN has also
been employed with correlation coefficients such as Pearson and Spearman.

Often, the classification accuracy of k-NN can be improved significantly if
the distance metric is learned with specialized algorithms such as Large
Margin Nearest Neighbor or Neighborhood Components Analysis.

A drawback of the basic "majority voting" classification occurs when the class
distribution is skewed. That is, examples of a more frequent class tend to
dominate the prediction of the new example, because they tend to be common
among the k nearest neighbors due to their large number.

One way to overcome this problem is to weight the classification, taking into
account the distance from the test point to each of its k nearest neighbors. The
class (or value, in regression problems) of each of the k nearest points is
multiplied by a weight proportional to the inverse of the distance from that point
to the test point.
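The inverse-distance weighting described above can be sketched in a few lines of NumPy; the distances and labels below are made up for illustration:

```python
import numpy as np

# Distances from a test point to its k = 3 nearest neighbours, with labels.
dists = np.array([0.5, 1.0, 2.0])
labels = np.array([1, 0, 0])

weights = 1.0 / dists                                # inverse-distance weights
scores = np.bincount(labels, weights=weights, minlength=2)
pred = int(np.argmax(scores))

# Plain majority voting would pick class 0 (two votes to one), but the single
# very close neighbour outweighs the two farther ones: 2.0 vs 1.5.
print("predicted class:", pred)
```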
Another way to overcome skew is by abstraction in data representation.

For example, in a self-organizing map (SOM), each node is a representative (a
center) of a cluster of similar points, regardless of their density in the
original training data. k-NN can then be applied to the SOM.
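Putting the pieces together, a minimal scikit-learn sketch of k-NN with distance-weighted voting (once more on synthetic stand-in data rather than the actual banknote file) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=4, n_informative=4,
                           n_redundant=0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2)

# weights="distance" multiplies each neighbour's vote by the inverse of its
# distance to the query point, as described above; the default "uniform"
# gives plain majority voting.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```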
APPLICATION

- SVMs are helpful in text and hypertext categorization, as their application
can significantly reduce the need for labelled training instances in both the
standard inductive and transductive settings. Some methods for shallow
semantic parsing are based on support vector machines.
- Classification of images can also be performed using SVMs. Experimental
results show that SVMs achieve significantly higher search accuracy than
traditional query-refinement schemes after just three to four rounds of
relevance feedback. This is also true for image segmentation systems,
including those using a modified version of SVM that applies the privileged
approach suggested by Vapnik.
- Hand-written characters can be recognized using SVMs.
- The SVM algorithm has been widely applied in the biological and other
sciences. SVMs have been used to classify proteins with up to 90% of the
compounds classified correctly. Permutation tests based on SVM weights have
been suggested as a mechanism for the interpretation of SVM models, and
support-vector machine weights have also been used to interpret SVM models
in the past. Post hoc interpretation of support-vector machine models in
order to identify the features used by a model to make predictions is a
relatively new area of research with special significance in the biological
sciences.
CONCLUSION
After the successful completion of this project, we can conclude that the
Banknote Authentication dataset can be analysed effectively with standard
machine learning techniques.

Here we have used the Support Vector Machine, Random Forest, and K-Nearest
Neighbour classifiers for the analysis of banknote authentication.
