
A_Review_of_Machine_Learning_Techniques

This document reviews machine learning techniques applied to big data case studies, highlighting the exponential growth and complexity of big data characterized by its volume, velocity, variety, value, and veracity. It discusses the architecture of big data, including various layers for data processing and consumption, and the interdependence of big data and machine learning for effective data analysis. The paper also presents a literature review of various machine learning algorithms used in big data applications, showcasing their advantages and results.


International Journal of Innovative Research in Computer Science & Technology (IJIRCST)

ISSN: 2347-5552, Volume-8, Issue-3, May 2020


https://doi.org/10.21276/ijircst.2020.8.3.34
www.ijircst.org

A Review of Machine Learning Techniques over Big Data Case Studies
Dr Yojna Arora

ABSTRACT- In recent years, data has increased exponentially and is termed Big Data. Data Amount, Data Speed and Data Variation are three major parameters of Big Data. Many challenges have emerged, of which Data Storage, Data Analysis and Data Management are the biggest. To deal with these challenges, Machine Learning, a subset of Artificial Intelligence, provides various tools and techniques. This paper gives details about Big Data and Machine Learning. It also includes a detailed literature review of various Big Data case studies that are solved by Machine Learning techniques.

KEYWORDS- Big Data, Data Analytics, Machine Learning, Deep Neural Network, Supervised Learning, Neural Net, Data Mining, Computing

I. INTRODUCTION TO BIG DATA

A. Big Data Definition
Big Data refers to extremely large, very fast, highly diverse and complex data that cannot be managed with traditional data management tools [1],[2],[3],[4]. It is being generated across the globe at an unprecedented speed. Big Data has 5 dimensions associated with it: Volume, Variety, Velocity, Value and Veracity. The term evolved from 3 dimensions to 4 dimensions, and now 5 dimensions define Big Data. Big Data can be examined at two levels: Basic Level and Advanced Level. At the basic level it is treated as any other collection of data which can help in Business Analytics. At the advanced level it is treated as a special type of data which has great challenges and great benefits. Big Data is different from traditional data in every way, i.e. space, time and function. Also, Big Data is not just data in the form of rows and columns; it includes text, audio, images, videos and other varied data representation formats. Big Data is largely unstructured in nature.

B. Characteristics of Big Data
Big Data is not just a huge amount of data, or data coming at high speed, or data coming in various formats, or data in doubt; rather, it is a combination of all of these. These define the V's of Big Data. Big Data was initially characterized by 3 V's [5], then a 4th V was added [6], and now a 5 V Architecture of Big Data is defined [7]. These characteristics are explained below:

i. Volume: Data is being generated exponentially. It is almost doubling every 12-18 months. Earlier, data was measured in GBs and TBs, but now the data has increased to such an extent that it is measured in Petabytes (PB) and Exabytes (EB). The data is so huge in nature that it is almost impossible to look up some information in it in a reasonable period of time. Thus, Volume is one important factor which has made Data into Big Data.

ii. Velocity: The second V of Big Data refers to Velocity. Velocity means that the data is arriving at a very high speed with no control over it. This speedy generation of data is due to easy and speedy access to the internet. Data is generated from many devices and communicated very fast. Managing this high-speed data and finding relevant information from it is a difficult task.

iii. Variety: Variety refers to data in different forms and formats. There are three aspects based on which data is categorized: Form of Data, Function of Data and Source of Data. Form of Data refers to data in different formats, i.e. Text, Audio, Images, Video, Graph, Map etc., or a combination of any two or more formats. Each format has its own storage requirements and analysis complexity. Function of Data refers to data like Human Conversation, Songs, Transaction Data, Machine Operation Data etc. All these data have to be analyzed in different ways with different result expectations. Lastly, Source of Data refers to data coming from different sources such as Structured, Semi Structured and Unstructured Data. Structured Data is the most organized form, i.e. in the form of rows and columns; Semi Structured is partially organized data such as XML files, Log files etc.; Unstructured Data is the least organized form, which includes Text, Audio, Video and Images.

iv. Veracity: Veracity refers to data in doubt. It takes Data Quality into consideration. Since Big Data is huge in nature, it is difficult to maintain the truthfulness and quality of data while reducing the noise. The reason behind depreciation of data quality may be an unauthorized data source, human or machine generated errors, or an intentional attempt to tamper with data. This characteristic has greater emphasis because the degree of data quality determines its applicability in various domains.

Manuscript Received April 21, 2020
Dr. Yojna Arora, Assistant Professor, Department of Computer Science & Engineering, Amity School of Engineering & Technology, Amity University, Haryana, Manesar, Gurgaon, Haryana, India (email: [email protected])

Copyright © 2020. Innovative Research Publication. All Rights Reserved 225

[Figure: Volume, Variety, Velocity and Veracity surrounding Big Data Analytics, with Business Intelligence as the outcome.]
Fig 1: Big Data Environment
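
The Volume characteristic states that data almost doubles every 12-18 months. As a rough worked example of what that rate implies, the sketch below estimates how long it takes a data store to grow from one terabyte to one petabyte. The function name `doublings_needed` and the decimal unit sizes are assumptions made for illustration; they do not come from the paper.

```python
import math

TB = 10 ** 12  # one terabyte in bytes (decimal units, an assumption)
PB = 10 ** 15  # one petabyte in bytes


def doublings_needed(start_bytes: float, target_bytes: float) -> int:
    """Smallest number of doublings that takes start_bytes to at least target_bytes."""
    return math.ceil(math.log2(target_bytes / start_bytes))


n = doublings_needed(TB, PB)

# Convert doublings to years for both ends of the 12-18 month range.
years_if_12_months = n * 12 / 12
years_if_18_months = n * 18 / 12

print(f"{n} doublings: between {years_if_12_months} and {years_if_18_months} years")
```

Under these assumptions a thousandfold increase takes ten doublings, so the stated doubling rate moves a data store from the terabyte range to the petabyte range in roughly a decade.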

C. Big Data Architecture
Big Data Architecture contains three logical layers: Data Source as Input Layer, Data Processing as Middle Layer and Data Consumption for Result Analysis and Interpretation. This generalised Big Data Architecture is resilient, secure, cost effective and adaptive in nature. Major organizations have modified the architecture according to their optimised infrastructure and requirements. The function of each layer of Big Data Architecture is explained as:

i. Source Layer: The source of data can be identified based on the application over which analysis is to be performed. It will vary greatly in its speed, size, form, function etc.

ii. Ingest Layer: This layer receives the data coming from various sources, in different amounts and at variable speed. It then decides whether the data has to be sent for Batch Processing, Stream Processing or stored in the underlying database.

iii. Batch Processing Layer: It receives data from the Data Ingest Layer, File System or NoSQL. It processes the data using parallel processing techniques.

iv. Stream Processing: It receives data only from the Ingest Layer. It works on Real Time Data which is continuously generated and produces the desired results.

v. Data Organization Layer: This layer receives data from both the Batch Processing and Stream Processing Layers. It is referred to as the NoSQL database. This layer is added to organize the data for easy access.

vi. Infrastructure Layer: The Infrastructure Layer provides all the basic support including storage, computation and communication support.

vii. Distributed File System: This is the underlying data source which can store a huge amount of data. It provides the data to all other layers.

viii. Data Consumption Layer: This layer is the final layer. It receives the output from the organizing layer and provides the output in the form of reports, graphs and visualization methods.

[Figure: Big Data Ecosystem. Components shown: Data Sources (Human-Human Communication, Human-Machine Communication, Machine-Machine Communication, Business Transactions); Data Ingest; Stream Processing and Batch Processing; Data Organizing; Data Consumption (Data Mining, Data Visualization, Reports, Mobile Access, Dashboards); Distributed File System; Compute, Storage, Network Infrastructure.]
Fig 2: Big Data Architecture
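
The flow of data through the layers described above can be sketched as a small in-memory pipeline. This is only an illustrative toy under stated assumptions: the record fields (`sensor`, `value`, `realtime`), the alert threshold of 30 and every function name are hypothetical and are not part of the paper's architecture, which would run on distributed storage and processing frameworks.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def ingest(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Ingest layer: route each record to stream or batch processing."""
    stream, batch = [], []
    for record in records:
        (stream if record["realtime"] else batch).append(record)
    return stream, batch


def stream_process(records: List[Record]) -> List[Record]:
    """Stream processing: handle records one by one (here: flag high readings)."""
    return [{"sensor": r["sensor"], "alert": r["value"] > 30} for r in records]


def batch_process(records: List[Record]) -> Dict[str, float]:
    """Batch processing: aggregate stored records (here: mean value per sensor)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for r in records:
        groups[r["sensor"]].append(r["value"])
    return {sensor: sum(vals) / len(vals) for sensor, vals in groups.items()}


def organize(stream_out: List[Record], batch_out: Dict[str, float]) -> Dict[str, Record]:
    """Data organization layer: merge both outputs into one keyed store."""
    store = {s: {"mean": m, "alerts": 0} for s, m in batch_out.items()}
    for item in stream_out:
        entry = store.setdefault(item["sensor"], {"mean": None, "alerts": 0})
        entry["alerts"] += int(item["alert"])
    return store


def consume(store: Dict[str, Record]) -> List[str]:
    """Data consumption layer: turn the organized data into report lines."""
    return [f"{s}: mean={e['mean']}, alerts={e['alerts']}" for s, e in sorted(store.items())]


# Source layer: a handful of hypothetical sensor readings.
source = [
    {"sensor": "a", "value": 20.0, "realtime": False},
    {"sensor": "a", "value": 24.0, "realtime": False},
    {"sensor": "b", "value": 35.0, "realtime": True},
    {"sensor": "a", "value": 31.0, "realtime": True},
]

stream, batch = ingest(source)
report = consume(organize(stream_process(stream), batch_process(batch)))
print("\n".join(report))
```

Each function stands in for one layer, so the call chain on the last lines mirrors the order Source, Ingest, Batch and Stream Processing, Data Organization, Data Consumption.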


II. INTRODUCTION TO MACHINE LEARNING

A. Basic Definition
Machine Learning is an application of Artificial Intelligence which allows a system to learn automatically from experience without being explicitly programmed. The learning process comes from observation of data. It helps in decision making by studying patterns in datasets. Machine Learning Algorithms can be categorized as Supervised, Unsupervised, Semi Supervised and Reinforcement [12].

Supervised Learning algorithms apply predefined knowledge to a new set of data. They follow the Data Classification method: the data set is labelled, and new data that arrives for analysis is classified based on those labels [12]. Unsupervised Learning Algorithms do not follow a classification or labelling approach; rather, they generate inferences from the data. Semi Supervised Learning Algorithms are a combination of both Supervised and Unsupervised Learning and use both labelled and unlabelled data for training. Reinforcement Learning Algorithms allow the system to learn from its environment and gain knowledge.
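
The difference between learning from labelled data and learning without labels can be illustrated with a minimal sketch. The two functions below are toy stand-ins chosen for brevity (a one-nearest-neighbour classifier and a one-dimensional k-means with k = 2); the function names and the sample values are assumptions for illustration, not material from the paper.

```python
from typing import List, Tuple


def nearest_neighbour_label(train: List[Tuple[float, str]], x: float) -> str:
    """Supervised: predict the label of x from the closest labelled example."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def two_means(values: List[float], iterations: int = 10) -> Tuple[float, float]:
    """Unsupervised: group unlabelled values around two centres (1-D k-means, k=2)."""
    lo, hi = min(values), max(values)  # initial centres: the two extremes
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not near_hi:  # all values identical, nothing to split
            break
        lo, hi = sum(near_lo) / len(near_lo), sum(near_hi) / len(near_hi)
    return lo, hi


# Supervised: labels are given with the training data.
labelled = [(1.0, "small"), (2.0, "small"), (10.0, "large"), (12.0, "large")]
print(nearest_neighbour_label(labelled, 3.0))
print(nearest_neighbour_label(labelled, 9.0))

# Unsupervised: only raw values are given; the structure is inferred.
unlabelled = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(two_means(unlabelled))
```

The supervised function needs the labels to answer at all, while the unsupervised one only discovers that the values fall into two groups and leaves their interpretation to the analyst.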

Table 1: Machine Learning Algorithms

| Regression Algorithms | Instance Based Algorithms | Regularization Algorithms | Decision Tree Algorithms [8] | Bayesian Algorithms | Clustering Algorithms [9] | Ensemble Algorithms | Artificial Neural Network [10] | Deep Learning Algorithms [11] | Dimensionality Reduction |
|---|---|---|---|---|---|---|---|---|---|
| Ordinary Least Square Regression | K Nearest Neighbours | Ridge Regression | Classification & Regression Tree | Naïve Bayes | k-Means | Boosting | Perceptrons | Convolutional Neural Network | Principal Component Analysis |
| Linear Regression | Learning Vector Quantization | Least Absolute Shrinkage and Selection Operator | Iterative Dichotomiser | Gaussian Naïve Bayes | K-Medians | Bootstrapped Aggregation | Multilayer Perceptrons | Recurrent Neural Networks | Principal Component Regression |
| Logistic Regression | Self Organizing Maps | Elastic Net | C4.5 and C5.0 | Multinomial Naïve Bayes | Expectation Maximization | AdaBoost | Back-Propagation | Long Short-Term Memory Networks | Partial Least Squares Regression |
| Multivariate Adaptive Regression | Locally Weighted Learning | Least Angle Regression | Chi-squared Automatic Interaction Detection | Averaged One-Dependence Estimators | Hierarchical Clustering | Weighted Average | Stochastic Gradient Descent | Stacked Auto-Encoders | Sammon Mapping |
| Locally Estimated Scatterplot Smoothing | Support Vector Machines | | Decision Stump | Bayesian Belief Network | | Stacked Generalization | Hopfield Network | Deep Boltzmann Machine | Multidimensional Scaling |
| Step Wise Regression | | | Conditional Decision Trees | Bayesian Network | | Gradient Boosting Machines | Radial Basis Function Network | Deep Belief Networks | Projection Pursuit |

B. Big Data and Machine Learning Altogether

Big Data is the huge amount of data which is being generated at a very high speed and in various formats. Analyzing this varied data is the biggest challenge. This data analysis helps in identifying hidden patterns which can help in taking better business decisions. Machine Learning, on the other hand, is a subset of Artificial Intelligence which helps the machine in taking future decisions based on the information which has already been fed into it.

Big Data and Machine Learning are mutually dependent. Big Data provides the datasets, and Machine Learning provides methods and techniques which can be applied to analyze that data. Big Data deals with storage, ingestion and extraction tools, whereas Machine Learning deals with prediction methods. A detailed literature review of the application of Machine Learning techniques to Big Data by various researchers is shown in Table 2.


Table 2: Big Data & Machine Learning Case Studies

| Author's Name | Aim | Technique Applied | Key Features | Advantages | Results Attained |
|---|---|---|---|---|---|
| Yisheng Lv et al [13] | To propose a Traffic Flow Prediction Model | Deep Learning Approach with SAE Model | Use of Auto encoders as building blocks; Greedy Layer wise unsupervised algorithm | Model can discover latent traffic flow feature representation | Proposed model performed better than Back Propagation, Support Vector Machine and RBF Neural Network |
| Machine Learning for Data Mining (Paper 3) | | | | | |
| Breiman, L. et al [14] | To build a Decision Tree | Recursive Partition Tree | Classification & Regression Tree | Gain of information for each feature | Decision tree used for Regression |
| Altman, N. S [15] | To implement algorithm for memorizing new items | K Nearest Neighbour | Non Parametric Regression | Not strongly dependent on shape of Regression Function | Parametric models are implemented for data description |
| Russell S et al [16] | To compute probability of new item belonging to a class | Bayesian Networks | Bayes Theorem | Consideration of independence or dependence of features | The probability of new item is computed based on the values of features of each item belonging to each class |
| Cortes C et al [17] | To implement Classification Model | Support Vector Machine | Data Representation in Hyper Space | Provides good support for unknown data sets. Best suited for semi structured and unstructured data | Respective hyper planes that better divide different classes |
| Bishop C.M [18] | To classify, predict or label data | Artificial Neural Network | Connecting Neurons and Connecting Layer | Ability to work with incomplete information | A self learning model is implemented to classify and predict data |
| Jianpeng Qi et al [19] | To implement Clustering Algorithms | K Means | Random Selection | Easy adaptation to new data sets | Group data by similarities and highlight differences and similarities between groups found |
| Syoji Kobashi et al [20] | To implement a postoperative prediction model | Feature Extraction using Support Vector Machine; Prediction using Machine Learning | Principal Component Analysis | Helps in Pre Operative Planning | The performance of prediction model was evaluated based on correlation coefficient and root-mean-squared error |
| Aras Can Onal et al [21] | To implement a framework for Weather Data Analysis | Weather Clustering; Sensor Anomaly Detection | Implementation of k means clustering algorithm | Integration of data retrieval, processing and learning layer | Meaningful information is extracted using the proposed framework |
| J. L. Berral-Garcia [22] | To study various machine learning algorithms | Decision tree algorithms, K-Nearest neighbor algorithms, Bayesian algorithms, SVM, ANN, K-means | Execution framework and tools, platforms and libraries are explained | Detailed description of all machine learning algorithms for big data analytics | Analysis of machine learning algorithms for Classification, Prediction and Modelling |
| J. Qiu, Q. Wu et al [23] | To study various Machine Learning algorithms on Big Data | Gaussian Mixture models, Hidden Markov Models, SVM, logistic regression, Kernel Regression | Deep neural networks, Deep belief networks | Supports identification of patterns and trends | Various traditional and new machine learning algorithms over Big Data are analyzed |
| M.U. Bokhari et al [24] | To propose a model for Big Data storage and analysis | HDFS for Data Storage; ANN, SVM for analysis | Layered Architecture Model | Combination of Big Data Technology for storage and Machine Learning for analysis | A 3 layer architecture model is implemented |
| P. Y. Wu et al [25] | To analyze Biomedical Big Data | Logistic regression, PCA, HMM, Local regression, Cox regression | Case study taken from real biomedical data | More accurate prediction | Big Data Analytics over biomedical data helped in precision medicine |
| M. R. Bendre et al [26] | To implement a prediction model | Map Reduce Linear Regression | Supervised Learning Approach | Model implementation based on past records | A model is implemented for better prediction of rainfall |
| Ananthi Sheshasaayee et al [27] | To develop a model for temperature prediction | Apache Spark based model | Tree based machine learning algorithm for training data | Map Reduce method of parallelizing is replaced | The proposed model optimizes the machine learning technique in a distributed environment |
| Junfei Qiu et al [28] | To integrate Big Data Analysis using Machine Learning | The functionalities of Apache Spark MLlib | Regression, Classification, Dimension Reduction method and Rule extraction | Open source, scalable, platform independent machine learning library | Various qualitative and quantitative attributes of the library are analyzed using real world data sets |
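
Several of the case studies above combine a learning algorithm with a parallel processing model, for example linear regression over Map Reduce. The sketch below shows one common way such a combination can work: each data partition is reduced to a few sums in a map step, the sums are added together in a reduce step, and the regression line is computed from the totals. It is a single-machine illustration under assumptions, not the method of any cited paper; the partition contents and the names `map_partition`, `combine` and `fit_line` are hypothetical.

```python
from functools import reduce
from typing import List, Tuple

# Sufficient statistics for simple linear regression:
# (count, sum of x, sum of y, sum of x*y, sum of x*x)
Stats = Tuple[float, float, float, float, float]
Point = Tuple[float, float]


def map_partition(points: List[Point]) -> Stats:
    """Map step: summarise one partition of (x, y) points as five sums."""
    return (
        float(len(points)),
        sum(x for x, _ in points),
        sum(y for _, y in points),
        sum(x * y for x, y in points),
        sum(x * x for x, _ in points),
    )


def combine(a: Stats, b: Stats) -> Stats:
    """Reduce step: add two partial summaries element by element."""
    return tuple(p + q for p, q in zip(a, b))  # type: ignore[return-value]


def fit_line(stats: Stats) -> Tuple[float, float]:
    """Least-squares intercept and slope from the combined sums."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data split across three partitions; all points lie on y = 2x + 1.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]

total = reduce(combine, map(map_partition, partitions))
intercept, slope = fit_line(total)
print(f"y = {slope} * x + {intercept}")
```

Because the map step emits only five numbers per partition, the amount of data moved between machines stays constant regardless of partition size, which is what makes this formulation attractive for large datasets.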

III. CONCLUSION
The paper addresses the problem of Big Data and presents Machine Learning as a tool for it. Initially, the paper explains basic Big Data terms, definitions and basic characteristics such as Volume, Variety and Velocity. It further shows a basic Big Data Architecture which can be used by various organizations and modified according to their requirements. The later part of the paper explains Machine Learning, which is a subset of Artificial Intelligence. Machine Learning provides various algorithms which can be used to deal with Big Data problems. Lastly, a detailed literature review of Big Data case studies and their respective Machine Learning techniques is presented.

REFERENCES

[1] Stephen Kaisler, Frank Armour, J. Alberto, “Big Data: Issues and Challenges Moving Forward”, 46th Hawaii International Conference on System Science, IEEE, 2012
[2] Sam Madden, “From database to Big Data”, in IEEE Computer Society, 2012
[3] Dan Garlasu, “Data Implementation Based on Grid Computing”, 11th RoEdunet International Conference, IEEE, 2013
[4] Avita Katal, Mohammad Wazid and R H Goudar, “Big Data: Issues, Challenges, Tools and Good Practices”, in IEEE, 2013
[5] Doug Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety”, in Application Delivery Strategies, Meta Group, 2001
[6] Firat Tekiner and John A Keane, “Big Data Framework”, in IEEE International Conference on Systems, Man and Cybernetics, IEEE, 2013
[7] Parth Chandarana and M Vijayalakshmi, “Big Data Analytics Framework”, in International Conference on Circuits, System, Communication and Information Technology Applications, IEEE, 2014
[8] Anuja Priyam, Abhijeet, Rahul Gupta, Anju Rathee and Saurabh Srivastava, “Comparative Analysis of Decision Tree Classification Algorithms”, International Journal of Current Engineering and Technology, Vol 3, No 2, June 2013
[9] Prof. Neha Soni & Prof. Amit Ganatra, “Categorization of Several Clustering Algorithms from Different Perspective: A Review”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 8, August 2012
[10] Amanpreet Singh, Narina Thakur, Aakanksha Sharma, “A Review of Supervised Machine Learning Algorithms”, 3rd International Conference on Computing for Sustainable Global Development, IEEE, 2016
[11] Ajay Shrestha & Ausif Mahmood, “Review of Deep Learning Algorithms and Architectures”, IEEE Access, Vol 7, 2019
[12] Ayon Dey, “Machine Learning Algorithms: A Review”, International Journal of Computer Science and Information Technologies, Vol 7, 2017
[13] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang, “Traffic Flow Prediction With Big Data: A Deep Learning Approach”, IEEE Transactions on Intelligent Transportation Systems, Vol. 16, No. 2, April 2015
[14] Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J., “Classification and Regression Trees”, Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984
[15] Altman, N. S., “An introduction to kernel and nearest-neighbor nonparametric regression”, The American Statistician 46 (3): 175–185, 1992
[16] Russell, S.; Norvig, P., “Artificial Intelligence: A Modern Approach” (2nd edition), Prentice Hall, 2003
[17] Cortes, C.; Vapnik, V., “Support-vector networks”, Machine Learning 20 (3): 273, 1995
[18] Bishop, C. M., “Neural Networks for Pattern Recognition”, Oxford: Oxford University Press, 1995
[19] Jianpeng Qi et al, “An effective and efficient hierarchical K-means clustering algorithm”, International Journal of Distributed Sensor Networks, 2017
[20] Syoji Kobashi, Belayat Hossain, Manabu Nii, Syunichiro Kambara, Takatoshi Morooka, Makiko Okuno & Shinichi Yoshiya, “Prediction of Post Operative Implanted Knee Function using Machine Learning in Clinical Big Data”, International Conference on Machine Learning and Cybernetics, 2016
[21] Aras Can Onal, Omer Berat Sezer, Murat Ozbayoglu & Erdogan Dogdu, “Weather Data Analysis and Sensor Fault Detection Using An Extended IoT Framework with Semantics, Big Data, and Machine Learning”, International Conference on Big Data, 2017
[22] J. L. Berral-Garcia, “A quick view on current techniques and machine learning algorithms for big data analytics”, 18th International Conf. on Transparent Optical Networks, pp. 1-4, 2016
[23] J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng, “A survey of machine learning for big data processing”, EURASIP Journal on Advances in Signal Processing, Springer, vol. 2016:67, pp. 1-16, 2016
[24] M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui, “An effective model for big data analytics”, 3rd International Conference on Computing for Sustainable Global Development, pp. 3980-3982, 2016
[25] P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang, “–Omic and Electronic Health Record Big Data Analytics for Precision Medicine”, IEEE Transactions on Biomedical Engineering, vol. 64, issue 2, pp. 263-273, 2017
[26] M. R. Bendre, R. C. Thool and V. R. Thool, “Big data in precision agriculture: Weather forecasting for future farming”, 1st International Conf. on Next Generation Computing Technologies, pp. 744-750, 2015
[27] Ananthi Sheshasaayee & J V N Lakshmi, “An insight into Tree Based Machine Learning techniques for Big Data Analytics using Apache Spark”, International Conference on Intelligent Computing, Instrumentation and Control Technologies, 2017
[28] Junfei Qiu, Qihui Wu, Guoru Ding, Yuhua Xu and Shuo Feng, “A survey of Machine Learning for Big Data Processing”, Journal of Advances in Signal Processing, 2016
