Review of Some Machine Learning Techniques For Big Data
www.ijtrd.com
Abstract: Machine learning is a field of computer science which gives computers the ability to learn without being explicitly programmed. Machine learning is used in a variety of computational tasks where designing and programming explicit algorithms with good performance is not easy. Big data is now rapidly expanding in all science and engineering domains. The potential of these increased volumes of data is obviously very significant to every aspect of our lives, but using it to aid decision making and future prediction requires new ways of thinking and new learning techniques that address the various challenges. Traditional analytical approaches are insufficient for big data, which is large-scale, unstructured, and captured in real time. Machine learning (ML) addresses this challenge by enabling a system to automatically learn patterns from data that can be leveraged in future predictions. This paper reviews some machine learning techniques for big data, highlighting their uses/applications, strengths, and weaknesses in learning data patterns. The techniques reviewed include Bayesian networks, association rules, naïve Bayes, decision trees, nearest neighbor, and support vector machines (SVM).

Keywords: Machine Learning, Supervised Learning, Unsupervised Learning, Classification, Big Data

I. INTRODUCTION

Machine learning is a branch of artificial intelligence that allows computer systems to learn directly from examples, data, and experience. By enabling computers to perform specific tasks intelligently, machine learning systems can carry out complex processes by learning from data rather than following pre-programmed rules. Big Data generally refers to data that exceeds the typical storage, processing, and computing capacity of conventional databases and data analysis techniques. As a resource, Big Data requires tools and methods that can be applied to analyze and extract patterns from large-scale data. The rise of Big Data has been caused by increased data storage capabilities, increased computational processing power, and availability of increased volumes of data, which give organizations more data than they have computing resources and technologies to process. In addition to the obvious great volumes of data, Big Data is also associated with other specific complexities, often referred to as the four Vs: Volume, Variety, Velocity, and Veracity (Grolinger et al, 2014). The unmanageably large Volume of data poses an immediate challenge to conventional computing environments and requires scalable storage and a distributed strategy for data querying and analysis. However, this large Volume of data is also a major positive feature of Big Data. Many companies, such as Facebook, Yahoo, and Google, already have large amounts of data and have recently begun tapping into its benefits (Almeida & Bernardino, 2015). A general theme in Big Data systems is that the raw data is increasingly diverse and complex, consisting of largely uncategorized/unsupervised data along with perhaps a small quantity of categorized/supervised data. Working with the Variety among different data representations in a given repository poses unique challenges with Big Data, which require preprocessing of unstructured data in order to extract structured/ordered representations of the data for human and/or downstream consumption. In today's data-intensive technology era, data Velocity – the increasing rate at which data is collected and obtained – is just as important as the Volume and Variety characteristics of Big Data. While the possibility of data loss exists with streaming data if it is not immediately processed and analyzed, there is the option of saving fast-moving data into bulk storage for batch processing at a later time. However, the practical importance of dealing with Velocity associated with Big Data is the quickness of the feedback loop, that is, the process of translating data input into useable information. This is especially important in the case of time-sensitive information processing. Some companies such as Twitter, Yahoo, and IBM have developed algorithms that address the analysis of streaming data (Wu et al, 2014). Veracity in Big Data deals with the trustworthiness or usefulness of results obtained from data analysis, and brings to light the old adage "Garbage-In-Garbage-Out" for decision making based on Big Data Analytics. As the number of data sources and types increases, sustaining trust in Big Data Analytics presents a practical challenge.

Algorithm models for dealing with big data take different shapes, depending on their purpose. Using different algorithms to provide comparisons can offer some surprising results about the data being used. Models can come as a collection of scenarios, an advanced mathematical analysis, or even a decision tree. Some models function best only for certain data and analyses. For example, classification algorithms with decision rules can be used to screen out problems, such as a loan applicant with a high probability of defaulting.

Unsupervised clustering algorithms can be used to find relationships within an organization's dataset. These algorithms can be used to find different kinds of groupings within a customer base, or to decide what customers and services can be grouped together. An unsupervised clustering approach can offer some distinct advantages compared to supervised learning approaches. One example is the way novel applications can be discovered by studying how connections are grouped when a new cluster is formed.

There are different existing models for machine learning on big data; they include decision tree based models, linear regression based models, neural networks, Bayesian networks, nearest neighbor, and many others.

Brief review of some machine learning techniques

Generally, the field of machine learning is divided into three subdomains: supervised learning, unsupervised learning, and reinforcement learning (Adams et al, 2008). Briefly, supervised learning requires training with labeled data which has inputs and desired outputs. In contrast with the supervised approach, unsupervised learning works on unlabeled data, from which it must discover structure on its own.
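As a concrete illustration of the unsupervised, label-free setting – and of the customer-grouping use mentioned above – the following is a minimal k-means clustering sketch in Python. The two-feature customer records (annual spend, visits per month) and the choice of two clusters are illustrative assumptions, not taken from the paper.

```python
# Minimal k-means sketch: group customers by (annual_spend, visits_per_month)
# without any labels. Data and cluster count are illustrative assumptions.
from math import dist

customers = [(120.0, 2.0), (130.0, 3.0), (900.0, 14.0), (950.0, 16.0)]

def kmeans(points, k, iters=10):
    centroids = list(points[:k])  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid's cluster.
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
# Low-spend and high-spend customers end up in separate clusters.
print(clusters)
```

No desired outputs are supplied anywhere; the grouping emerges purely from the geometry of the data, which is exactly the contrast with supervised learning drawn above.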
Weaknesses:
The addition of redundant attributes skews the learning process, because they are treated as though they were completely independent.

Support Vector Machines (SVM)

SVM offers a principled approach to machine learning problems because of its mathematical foundation in statistical learning theory. SVM constructs its solution in terms of a subset of the training input. SVM has been extensively used for classification, regression, novelty detection tasks, and feature reduction.

In classification tasks, a discriminant machine learning technique aims at finding, based on an independent and identically distributed (iid) training dataset, a discriminant function that can correctly predict labels for newly acquired instances. Unlike generative machine learning approaches, which require computations of conditional probability distributions, a discriminant classification function takes a data point x and assigns it to one of the different classes that are part of the classification task. Less powerful than generative approaches, which are mostly used when prediction involves outlier detection, discriminant approaches require fewer computational resources and less training data, especially for a multidimensional feature space and when only posterior probabilities are needed. From a geometric perspective, learning a classifier is equivalent to finding the equation for a multidimensional surface that best separates the different classes in the feature space.

SVM is a discriminant technique and, because it solves the convex optimization problem analytically, it always returns the same optimal hyperplane parameters – in contrast to genetic algorithms (GAs) or perceptrons, both of which are widely used.

Strengths:
Simple technique that is easily implemented.
Building the model is cheap.
Extremely flexible classification scheme.
Well suited for multi-modal classes and records with multiple class labels.

Weaknesses:
Choosing a "good" kernel function is not easy.
Long training time for large datasets.
Difficult to understand and interpret the final model, variable weights and individual impact.
Since the final model is not so easy to see, one cannot make small calibrations to it, hence it is tough to incorporate business logic.

Association rules:

Association rules are often sought for very large datasets, and efficient algorithms are highly valued. In practice, the amount of computation needed to generate association rules depends critically on the minimum coverage specified. The accuracy has less influence because it does not affect the number of passes that one must make through the dataset. In many situations one will want to obtain a certain number of rules – say 50 – with the greatest possible coverage at a prespecified minimum accuracy level. One way to do this is to begin by specifying the coverage to be rather high and to then successively reduce it, re-executing the entire rule-finding algorithm for each coverage value and repeating this until the desired number of rules has been generated.

Association rules are often used when attributes are binary – either present or absent – and most of the attribute values associated with a given instance are absent.

Uses/applications:
Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together.

In terms of data preparation, there is not much to do. As long as the data is formalized in something like comma-separated variables, one can create a working model. This also makes it easy to validate the model using various tests. With decision trees one uses white-box testing, meaning the internal logic of the model is visible and can be inspected directly.
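Returning to the association-rule search described above, the coverage-lowering loop can be sketched in a few lines of Python. The toy transactions, the accuracy (confidence) floor of 0.6, and the target of four rules are illustrative assumptions; the paper's own example targets 50 rules.

```python
# Sketch of the procedure above: start with a high minimum coverage and
# relax it, re-running the whole rule finder, until enough rules are found.
# Transactions, accuracy floor, and target count are illustrative assumptions.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def find_rules(transactions, min_coverage, min_accuracy):
    """One-item -> one-item rules meeting the coverage and accuracy floors."""
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        item_count.update(t)
        for a, b in combinations(sorted(t), 2):
            pair_count[(a, b)] += 1
    rules = []
    for (a, b), n in pair_count.items():
        if n < min_coverage:
            continue  # the rule does not cover enough transactions
        for lhs, rhs in ((a, b), (b, a)):
            accuracy = n / item_count[lhs]
            if accuracy >= min_accuracy:
                rules.append((lhs, rhs, n, accuracy))
    return rules

wanted, min_accuracy = 4, 0.6
coverage = len(transactions)  # start with the highest possible coverage
rules = []
while coverage >= 1:
    rules = find_rules(transactions, coverage, min_accuracy)
    if len(rules) >= wanted:
        break
    coverage -= 1  # relax the floor and re-execute the rule finder
print(coverage, len(rules))  # prints: 2 4
```

Note that only the coverage floor drives the repeated passes over the dataset; the accuracy floor is just a filter applied to each candidate rule, which matches the observation above that accuracy has less influence on the amount of computation.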