A Review of Machine Learning Techniques
ABSTRACT- In recent years, data has grown exponentially, and such data is termed Big Data. Data amount, data speed and data variation are the three major parameters of Big Data. Big Data raises many challenges, of which data storage, data analysis and data management are the biggest. To deal with these challenges, Machine Learning, a subset of Artificial Intelligence, provides various tools and techniques. This paper gives an overview of Big Data and Machine Learning. It also includes a detailed literature review of Big Data case studies that have been solved with Machine Learning techniques.

KEYWORDS- Big Data, Data Analytics, Machine Learning, Deep Neural Network, Supervised Learning, Neural Net, Data Mining, Computing

Manuscript Received April 21, 2020
Dr. Yojna Arora, Assistant Professor, Department of Computer Science & Engineering, Amity School of Engineering & Technology, Amity University, Haryana, Manesar, Gurgaon, Haryana, India (email: [email protected])

I. INTRODUCTION TO BIG DATA

A. Big Data Definition
Big Data refers to extremely large, very fast, highly diverse and complex data that cannot be managed with traditional data management tools [1], [2], [3], [4]. It is being generated across the globe at an unprecedented speed. Big Data has 5 dimensions associated with it: Volume, Variety, Velocity, Value and Veracity. The term evolved from 3 dimensions to 4 dimensions, and now 5 dimensions define Big Data. Big Data can be examined at two levels, the Basic Level and the Advanced Level. At the basic level it is treated as any other collection of data that can help in business analytics. At the advanced level it is treated as a special type of data that brings great challenges and great benefits. Big Data differs from traditional data in every way, i.e. space, time and function. Also, Big Data is not just data in the form of rows and columns; it includes text, audio, images, videos and other varied data representation formats. Big Data is mostly unstructured in nature.

B. Characteristics of Big Data
Big Data is not just a huge amount of data, or data arriving at high speed, or data in various formats, or data in doubt; rather, it is a combination of all of these. These define the V's of Big Data. Big Data was initially characterized by 3 V's [5], then a 4th V was added [6], and now a 5 V architecture of Big Data is defined [7]. These characteristics are explained below:
i. Volume: Data is being generated exponentially, almost doubling every 12-18 months. Earlier, data was measured in GBs and TBs, but it has now grown to such an extent that it is measured in Petabytes (PB) and Exabytes (EB). The data is so large that it is almost impossible to look up a piece of information in it within a reasonable period of time. Thus, Volume is one important factor that turns data into Big Data.
ii. Velocity: The second V of Big Data is Velocity. Velocity means that data arrives at a very high speed with no control over it. This rapid generation of data is due to easy and fast access to the internet. Data is generated by many devices and communicated very fast. Managing such fast-moving data and finding relevant information in it is a difficult task.
iii. Variety: Variety refers to data in different forms and formats. Data is categorized on three aspects: Form of Data, Function of Data and Source of Data. Form of Data refers to data in different formats, i.e. Text, Audio, Images, Video, Graph, Map etc., or a combination of two or more formats. Each format has its own storage requirements and analysis complexity. Function of Data refers to data such as Human Conversation, Songs, Transaction Data, Machine Operation Data etc. Each of these has to be analyzed in a different way, with different expectations of the result. Lastly, Source of Data refers to data coming from different sources, such as Structured, Semi Structured and Unstructured Data. Structured Data is the most organized form, i.e. rows and columns; Semi Structured Data is partially organized data such as XML files, log files etc.; Unstructured Data is the least organized form and includes Text, Audio, Video and Images.
iv. Veracity: Veracity refers to data in doubt. It takes data quality into consideration. Since Big Data is huge in nature, it is difficult to maintain the truthfulness and quality of the data while reducing the noise. The reason behind the depreciation of data quality may be an unauthorized data source, human or machine generated errors, or it can be an intentional attempt
[Figure: Big Data and its V's (Volume, Variety, Velocity, Veracity) in relation to Analytics and Business Intelligence]
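The distinction between Structured, Semi Structured and Unstructured Data described under Variety can be illustrated with a rough heuristic. The sketch below is only an assumption-laden illustration, not a method from the paper: the function name `classify_record` and the rules (JSON or XML counts as semi-structured, uniform comma-separated rows count as structured) are hypothetical choices made for this example.

```python
import json
import xml.etree.ElementTree as ET


def classify_record(raw: str) -> str:
    """Roughly label a text payload by how organized it is (heuristic only)."""
    text = raw.strip()

    # Semi-structured: self-describing markup such as JSON objects/arrays or XML.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, (dict, list)):
            return "semi-structured"
    except ValueError:
        pass
    try:
        ET.fromstring(text)
        return "semi-structured"
    except ET.ParseError:
        pass

    # Structured: several lines that all split into the same number (>1)
    # of comma-separated fields, i.e. rows and columns.
    lines = [line for line in text.splitlines() if line]
    widths = {len(line.split(",")) for line in lines}
    if len(lines) > 1 and len(widths) == 1 and widths.pop() > 1:
        return "structured"

    # Everything else (free text, and by extension audio/video/images).
    return "unstructured"


samples = {
    "table": "id,name\n1,Ann\n2,Bob",
    "log": '<log><entry level="info"/></log>',
    "note": "The quick brown fox jumps over the lazy dog",
}
labels = {name: classify_record(payload) for name, payload in samples.items()}
print(labels)
```

A real pipeline would inspect file types and schemas rather than guess from the text alone; the point here is only that the three categories differ in how much organization the data carries with it.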
C. Big Data Architecture
Big Data Architecture contains three logical layers: Data Source as the input layer, Data Processing as the middle layer, and Data Consumption for result analysis and interpretation. This generalised Big Data Architecture is resilient, secure, cost effective and adaptive in nature. Major organizations have modified the architecture to suit their own infrastructure and requirements. The function of each layer of the Big Data Architecture is explained below:
i. Source Layer: The source of data is identified by the application over which the analysis is to be performed. It will vary greatly in speed, size, form, function etc.
ii. Ingest Layer: This layer receives the data coming from various sources, in different amounts and at variable speed. It then decides whether the data has to be sent for Batch Processing, sent for Stream Processing, or stored in the underlying database.
iii. Batch Processing Layer: It receives data from the Data Ingest Layer, the file system or NoSQL. It processes the data using parallel processing techniques.
iv. Stream Processing: It receives data only from the Ingest Layer. It works on real-time data that is being continuously generated and produces the desired results.
v. Data Organization Layer: This layer receives data from both the Batch Processing and Stream Processing layers. It is referred to as a NoSQL database. This layer is added to organize the data for easy access.
vi. Infrastructure Layer: The Infrastructure Layer provides all the basic support, including storage, computation and communication.
vii. Distributed File System: This is the underlying data store, which can hold a huge amount of data. It provides data to all other layers.
viii. Data Consumption Layer: This is the final layer. It receives the output from the organization layer and presents it in the form of reports, graphs and visualizations.
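The flow between the Ingest, Batch Processing, Stream Processing and Data Consumption layers can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the architecture's actual implementation: the `mode` routing field, the `sensor`/`value` record layout and the alert threshold are all hypothetical, and in-memory lists stand in for the distributed file system and NoSQL store.

```python
from collections import defaultdict


def ingest(records):
    """Ingest Layer: route each record to batch, stream, or plain storage.

    The "mode" field is a hypothetical routing hint used only in this sketch.
    """
    routes = defaultdict(list)
    for record in records:
        routes[record.get("mode", "store")].append(record)
    return routes


def batch_process(records):
    """Batch Processing Layer: aggregate a whole set of records at once."""
    totals = defaultdict(float)
    for record in records:
        totals[record["sensor"]] += record["value"]
    return dict(totals)


def stream_process(records, threshold=100.0):
    """Stream Processing: handle records one at a time as they arrive."""
    for record in records:
        if record["value"] > threshold:
            yield {"alert": record["sensor"], "value": record["value"]}


def consume(batch_result, alerts):
    """Data Consumption Layer: present organized results as a text report."""
    lines = [f"{sensor}: total={total}" for sensor, total in sorted(batch_result.items())]
    lines += [f"ALERT {a['alert']}: {a['value']}" for a in alerts]
    return "\n".join(lines)


records = [
    {"mode": "batch", "sensor": "s1", "value": 10.0},
    {"mode": "batch", "sensor": "s1", "value": 5.0},
    {"mode": "stream", "sensor": "s2", "value": 150.0},
    {"mode": "store", "sensor": "s3", "value": 1.0},
]
routes = ingest(records)
report = consume(batch_process(routes["batch"]), list(stream_process(routes["stream"])))
print(report)
```

In a production system each function would be a separate distributed service; the sketch only shows the direction in which data moves from one layer to the next.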
II. INTRODUCTION TO MACHINE LEARNING

A. Basic Definition
Machine Learning is an application of Artificial Intelligence which allows a system to learn automatically from experience without being explicitly programmed. Learning comes from observing data. It helps in decision making by studying patterns in datasets. Machine Learning algorithms can be categorized as Supervised, Unsupervised, Semi Supervised and Reinforcement [12].
Supervised Learning algorithms apply predefined knowledge to a new set of data. They follow the data classification method: the data set is labelled, and new data arriving for analysis is classified on the basis of those labels [12]. Unsupervised Learning algorithms do not follow a classification or labelling approach; rather, they draw inferences from the data. Semi Supervised Learning algorithms combine Supervised and Unsupervised Learning, using both labelled and unlabelled data for training. Reinforcement Learning algorithms allow the system to learn from its environment and gain knowledge.
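The difference between the supervised and unsupervised settings can be made concrete with two very small routines. The sketch below is illustrative only and uses techniques chosen for brevity (a 1-nearest-neighbour classifier and a two-cluster k-means on one-dimensional values); the function names and the toy data are assumptions, not material from the paper.

```python
def nearest_neighbour_predict(train, query):
    """Supervised: return the label of the labelled point closest to `query`."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - query))
    return closest_label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two groups (k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centres
    group_low, group_high = list(values), []
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_low or not group_high:
            break  # all values fell into one cluster; nothing more to refine
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: labels are given, and new data is classified from them.
labelled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
print(nearest_neighbour_predict(labelled, 8.5))

# Unsupervised: no labels, the grouping is inferred from the data itself.
print(two_means([1.0, 2.0, 9.0, 10.0]))
```

The supervised routine cannot work without the labels in `labelled`, whereas `two_means` receives only raw values and infers the structure, which mirrors the distinction drawn above.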
Ordinary Least Square Regression | K Nearest Neighbours | Ridge Regression | Classification & Regression Tree | Naïve Bayes | k-Means | Boosting | Perceptrons | Convolutional Neural Network | Principal Component Analysis
Linear Regression | Learning Vector Quantization | Least Absolute Shrinkage and Selection Operator | Iterative Dichotomiser | Gaussian Naïve Bayes | K-Medians | Bootstrapped Aggregation | Multilayer Perceptrons | Recurrent Neural Networks | Principal Component Regression
Logistic Regression | Self Organizing Maps | Elastic Net | C4.5 and C5.0 | Multinomial Naïve Bayes | Expectation Maximization | AdaBoost | Back-Propagation | Long Short-Term Memory Networks | Partial Least Squares Regression
Multivariate Adaptive Regression | Locally Weighted Learning | Least Angle Regression | Chi-squared Automatic Interaction Detection | Averaged One-Dependence Estimators | Hierarchical Clustering | Weighted Average | Stochastic Gradient Descent | Stacked Auto-Encoders | Sammon Mapping
Locally Estimated Scatterplot Smoothing | Support Vector Machines | - | Decision Stump | Bayesian Belief Network | - | Stacked Generalization | Hopfield Network | Deep Boltzmann Machine | Multidimensional Scaling
Step Wise Regression | - | - | Conditional Decision Trees | Bayesian Network | - | Gradient Boosting Machines | Radial Basis Function Network | Deep Belief Networks | Projection Pursuit
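As one concrete instance from the list above, ordinary least squares regression with a single input variable has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below is a minimal illustration with a hypothetical function name `fit_line` and made-up data; it is not taken from the paper.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

The function assumes at least two distinct x values; otherwise `sxx` is zero and the slope is undefined.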
P. Y. Wu et al. [25] | To analyze Biomedical Big | Logistic regression, PCA, HMM, Local | Case study taken from real biomedical | More accurate prediction | Big Data Analytics over biomedical data helped in
III. CONCLUSION
The paper addresses the problem of Big Data and presents Machine Learning as a tool for handling it. Initially, the paper explains basic Big Data terms, definitions and characteristics such as Volume, Variety and Velocity. It then shows a basic Big Data Architecture which can be used by various organizations and modified according to their requirements. The later part of the paper explains Machine Learning, which is a subset of Artificial Intelligence. Machine Learning provides various algorithms which can be used to deal with Big Data problems. Lastly, a detailed literature review of Big Data case studies and the Machine Learning technique applied in each is presented.

REFERENCES

[1] Stephen Kaisler, Frank Armour, J. Alberto, "Big Data: Issues and Challenges Moving Forward", 46th Hawaii International Conference on System Science, IEEE, 2012
[2] Sam Padden, "From database to Big Data", in IEEE Computer Society, 2012
[3] Dan Garlasu, "Data Implementation Based on Grid Computing", 11th RoEdunet International Conference, IEEE, 2013
[4] Avita Katal, Mohammad Wazid and R H Goudar, "Big Data: Issues, Challenges, Tools and good Practices", in IEEE, 2013
[5] Doug Laney, "3 D Data Management: Controlling Data Volume, Velocity and Variety", in Application Delivery Strategies, Meta Group, 2001
[6] Firat Tekiner and John A Keane, "Big Data Framework", in IEEE International Conference on Systems, Man and Cybernetics, IEEE, 2013
[7] Parth Chandarana and M Vijayalakshmi, "Big Data Analytics Framework", in International Conference on Circuits, System, Communication and Information Technology Applications, IEEE, 2014
[8] Anuja Priyam, Abhijeet, Rahul Gupta, Anju Rathee and Saurabh Srivastava, "Comparative Analysis of Decision Tree Classification Algorithms", International Journal of Current Engineering and Technology, Vol 3, No 2, June 2013
[9] Prof. Neha Soni & Prof. Amit Ganatra, "Categorization of Several Clustering Algorithms from Different Perspective: A Review", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 8, August 2012
[10] Amanpreet Singh, Narina Thakur, Aakanksha Sharma, "A Review of Supervised Machine Learning Algorithms", 3rd International Conference on Computing for Sustainable Global Development, IEEE, 2016
[11] Ajay Shrestha & Ausif Mahmood, "Review of Deep Learning Algorithms and Architectures", IEEE Access, Vol 7, 2019
[12] Ayon Dey, "Machine Learning Algorithms: A Review", International Journal of Computer Science and Information Technologies, Vol 7, 2017
[13] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang, "Traffic Flow Prediction With Big Data: A Deep Learning Approach", IEEE Transactions on Intelligent Transportation Systems, Vol. 16, No. 2, April 2015
[14] Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J., "Classification and Regression Trees", Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984
[15] Altman, N. S., "An introduction to kernel and nearest-neighbor nonparametric regression", The American Statistician 46 (3): 175–185, 1992
[16] Russell, S.; Norvig, P., "Artificial Intelligence: A Modern Approach" (2nd edition), Prentice Hall, 2003
[17] Cortes, C.; Vapnik, V., "Support-vector networks", Machine Learning 20 (3): 273, 1995
[18] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford: Oxford University Press, 1995
[19] Jianpeng Qi et al, "An effective and efficient hierarchical K-means clustering algorithm", International Journal of Distributed Sensor Network, 2017
[20] Syoj Kobashi, Belayat Hossain, Manabu Nii, Syunichiro Kambara, Takatoshi Morooka, Makiko Okuno & Shiichi Yoshya, "Prediction of Post