A Review on Big Data
Abstract: Big Data refers to the storage and processing of huge amounts of data, including text, audio, video, images, etc. Such large volumes of data are used in many applications, such as fraud detection, telecommunications, health and life sciences, e-commerce and customer service. Big data is a powerful asset for modern businesses that use intelligent automation. Machine learning is based on algorithms that learn from data, and big data is what feeds machine learning's analytical systems. Apart from machine learning, big data also plays a major role in artificial intelligence, deep learning and IoT. Big data challenges include data storage, data analysis, capturing data, visualization, querying, search, sharing, transfer, updating, information privacy and data sources.
Keywords: 6 V's of big data, Machine learning, Deep learning, Data visualization, HADOOP, Issues of big data, Applications of big data.
[Figure-1 is a diagram of the 6 V's of Big Data (Volume, Velocity, Variety, Veracity, Value, Variability), annotated with descriptors such as records, tables, batch, real-time, processes, structured, unstructured, multi-factor, trustworthiness, authenticity, origin/reputation, statistical, events, correlations, changing data and changing model.]
Figure-1: 6 V's of Big Data
1.1. Volume: It refers to the ever-increasing amount of data. Because of the large volume of data, distributed systems can be used to manage it: in a distributed system, data can be stored in different locations and brought together by software when required. For example, on Facebook, 10 billion messages, 4.5 billion likes and 350 million pictures are uploaded every day. Protecting and analyzing this increasing volume of data is a very big challenge. Big data analytics is a strategy used to analyze large data sets and uncover hidden patterns and connections among the data. It is used to help businesses achieve more profit, discover new revenue opportunities, improve the efficiency of customer service delivery, etc., and it deals with the challenges of unstructured and vast data. Hadoop is the best-known framework for big data analytics; it takes the incoming data and divides it across cheap disks. This technology is used to take better decisions in businesses.

1.3. Value: It is mainly about business growth. The value of the data is related to the cost of collecting and analyzing the data, to guarantee that the data can be monetized. A link between the data and insights does not always mean that the data has value.

1.4. Veracity: It is about ultra-reliable data sets. Veracity also deals with the noise and abnormality of data. In a big data strategy the data should be kept clean; gathering loads and loads of data is of no use if it is not of good quality.

1.5. Variability: To what extent, and how fast, is the structure of your data changing? And how often does the meaning or shape of your data change?

1.6. Variety: It opens new forms of data for investigation. Variety means the different data types and categories in a big data repository. Nowadays not all data is structured in data tables; approximately 80% of data is unstructured. The latest big data technology handles both structured and unstructured data.

2. CONCEPTS

Most organizations and industries are facing a rapidly growing volume of data. Data science is a recent area which helps to collect, analyze, visualize, manage and preserve huge data sets. Data mining is the process of extracting important patterns from large data sets. Artificial Intelligence and its sub-branches (for example Machine Learning, Deep Learning, Neural Networks) are all algorithm based. These algorithmic methods are applied to vast amounts of data (Big Data) to produce desired results and to find trends, patterns and predictions. Complex analytical tasks are performed on Big Data, faster than humanly possible, with the help of Machine Learning and Artificial Intelligence.

Machine learning is classified into supervised and unsupervised learning. In supervised learning, the training data includes both inputs and desired results; this kind of learning is fast and accurate. Supervised learning is further classified into two families of algorithms: classification and regression. Classification algorithms are suitable where the output is categorical, while regression algorithms are used where the output value is real-valued. In unsupervised learning, by contrast, the information is neither classified nor labeled, and the algorithm acts on the information without guidance; unlike supervised learning, no training labels or teacher are provided. Unsupervised learning is again classified into two families: clustering and association. Clustering groups related data into several different groups, while association is suitable for describing relationships across large portions of the data.
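To make the supervised/unsupervised distinction concrete, the following is a minimal sketch in Python. It assumes the scikit-learn library and uses a tiny dataset invented here purely for illustration; the paper itself does not prescribe any particular library.

```python
# A minimal sketch contrasting supervised and unsupervised learning
# (assumes scikit-learn is installed; the toy dataset is invented for illustration).
from sklearn.linear_model import LogisticRegression  # classification: labelled output
from sklearn.cluster import KMeans                   # clustering: no labels given

# Toy records: [transaction amount, number of transactions per day]
X = [[120, 3], [150, 4], [4000, 40], [3800, 35], [130, 2], [4200, 50]]
y = [0, 0, 1, 1, 0, 1]  # labels available -> supervised (0 = normal, 1 = unusual)

# Supervised learning: the training data contains inputs AND desired results.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[140, 3], [3900, 45]]))  # predicts a category for new records

# Unsupervised learning: only the inputs are given; the algorithm groups
# related records by itself (clustering).
groups = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(groups)
```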
Deep learning [1] is a subset of machine learning. It uses multi-layered artificial neural networks to deliver tasks such as speech recognition, object detection, language translation, etc. Deep learning helps to perform predictive analysis in a highly accurate manner. NVIDIA's GPU-accelerated frameworks are well suited to implementing deep learning and provide interfaces to programming languages such as C, C++ and Python. TensorFlow and PyTorch are other frameworks used by scientists and researchers to improve productivity [2].
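As a concrete illustration of a multi-layered network of the kind described above, here is a minimal sketch using the Keras API of TensorFlow (one of the frameworks mentioned). The layer sizes and the 28x28-image classification task are assumptions made only for illustration.

```python
# Minimal multi-layer neural network sketch (assumes TensorFlow is installed).
# The layer sizes and the 28x28-image task are illustrative assumptions.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),            # e.g. a small grayscale image
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer: 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random stand-in data; in practice this would be a real labelled dataset.
x_train = np.random.rand(256, 28, 28).astype("float32")
y_train = np.random.randint(0, 10, size=256)
model.fit(x_train, y_train, epochs=2, batch_size=32, verbose=0)
```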
[Figure-2 is a nested diagram with the labels Artificial Intelligence, Machine Learning and Data Science.]
Figure-2: Relationship between Big Data, Artificial Intelligence, Machine Learning and Deep Learning
3. ADVANTAGES OF BIG DATA

Big data is mainly used for the benefit of human beings. It is also used in science, technology and business.

3.1. Customer services:
Big data helps customers to create predictive models for specific tasks. This kind of prediction is done by data analysis, and it helps customers to understand the behavior of the entire process. A customer relationship management system is used to help customers contact the enterprise.

3.2. Increased productivity:
Big data analytics is used to increase the productivity of business processes. Vendors predict the stock of any product through social media data, weather forecasts and web search trends. Supply chain management is one of the best applications of big data analytics, and big data analytics also improves HR processes.

3.3. Reduced costs:
Big data tools and automation help to reduce costs. These big data analytics tools are used to automate self-driving cars with cameras, sensors, GPS and powerful computers. By implementing the new technologies the cost can be reduced.

3.4. Improved customer service:
Customer service is very important for all businesses and organizations, because customers' feedback on a service is collected in a common repository and later analyzed to produce better decisions or results. Customers can meet the product management team to make the service better than others in the market. If the organization does not respond to customer service requests, it loses customers, which will affect the business.

3.5. Fraud detection:
Fraud is a false representation of normal data, and it frequently happens in the financial industry. Anomalies can be detected easily by the machine learning techniques of big data; this has helped banking and credit card companies to spot stolen cards easily [3]. Fraud can be detected easily in structured data, but it is a big challenge to detect it in unstructured data, which does not follow any model. Another common use of big data is improved security: big data tools help police departments to catch criminals and detect their activities, and national security agencies also use big data analytics to detect terrorists living among us. Some big data techniques also help to detect cyber-security attacks.
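As a sketch of how anomalies might be flagged in transaction data, the example below uses scikit-learn's IsolationForest. The feature names, toy transactions and contamination rate are assumptions for illustration, not a method prescribed by the paper.

```python
# A minimal anomaly-detection sketch for fraud-like records
# (assumes scikit-learn; the toy transactions are invented for illustration).
from sklearn.ensemble import IsolationForest

# Each record: [amount, hour of day, distance from home in km]
transactions = [
    [25.0, 12, 3], [40.0, 18, 5], [31.5, 9, 2], [28.0, 20, 4],
    [35.0, 14, 6], [4999.0, 3, 2200],   # the last record looks abnormal
]

detector = IsolationForest(contamination=0.2, random_state=0).fit(transactions)
flags = detector.predict(transactions)   # -1 = anomaly, 1 = normal
for record, flag in zip(transactions, flags):
    print(record, "-> suspicious" if flag == -1 else "-> normal")
```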
4. ISSUES IN BIG DATA

4.1. Problems in managing data: Data from different sectors is very large, which requires more space to store it and management tools to process it. Some tools are used to manage the heterogeneous formats [8] of data, but this is a tedious process; if the data is not managed properly, it gives unacceptable results. Many firms have chosen business intelligence to manage the huge amount of data, but it is difficult for them to move from their traditional working platform to the new platform. Therefore we still need advanced technology and tools to manage this situation.

4.2. Storage issues:
For every business application, or any kind of firm, storing the large amount of big data is a major issue. Big data volumes are commonly measured in Exabytes; an Exabyte is about a million Terabytes, so storing one would need on the order of 25,000 disks (for example, at roughly 40 TB per disk), which is not possible in a single system. So we need to store the data in the cloud [8]. Even when the data is stored in the cloud, it takes a long time to load the variety of data collections and to retrieve them from the cloud. This is a major issue in storing bulky data.

4.3. Processing issues:
Most organizations are moving to the online mode of processing to boost their business or customer services. For this mode, storage on the order of Zettabytes is required, and processing this huge amount of data is still a challenging task. Some organizations use the MapReduce tool [8], which helps to do batch processing over a long period. It gives accurate results, but the processing is still slow.

5. PLATFORM, TOOLS AND SOFTWARE USED IN BIG DATA

HADOOP (High Availability Distributed Object-Oriented Platform): Hadoop is a framework for both formatted (structured) and unformatted (unstructured) data on distributed servers. A characteristic Hadoop platform stack is HDFS + Hive + HBase + Pig.

Hadoop Distributed File System (HDFS) is the file management framework for data distribution and storage in the system. In this storage system, files are stored sequentially as blocks of the same size, except for the last block. This file system makes data handling and storage easy.
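The fixed-size-block idea can be illustrated with a small sketch. The 128 MB figure matches HDFS's common default block size, but the splitting function here is purely conceptual, not HDFS code.

```python
# Conceptual sketch of fixed-size block splitting, as HDFS does:
# every block has the same size except (possibly) the last one.
# 128 MB is HDFS's common default; the function itself is not HDFS code.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size would occupy."""
    full_blocks = file_size_bytes // block_size
    remainder = file_size_bytes % block_size
    return [block_size] * full_blocks + ([remainder] if remainder else [])

# A 300 MB file -> two full 128 MB blocks plus one smaller final block.
print(split_into_blocks(300 * 1024 * 1024))
```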
Hive – Apache Hive enables users to process data without explicitly writing MapReduce code. The Hive language, HiveQL (Hive Query Language), resembles Structured Query Language (SQL). A Hive table structure consists of rows and columns: the rows typically correspond to some record, transaction, or particular entity (for example, a customer), and the values of the corresponding columns represent the various attributes or characteristics of each row. A user may consider using Hive if the user has experience with SQL and the data is already in HDFS. Hive is not intended for real-time querying.
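To show how SQL-like HiveQL is, here is a small sketch. The table name, columns and the commented-out PyHive connection details are assumptions made for illustration; the query text could equally be run directly in the Hive shell.

```python
# A small HiveQL sketch; table and column names are invented for illustration.
# The query strings are HiveQL and could also be run directly in the Hive shell.
create_table = """
CREATE TABLE IF NOT EXISTS customers (
    customer_id INT,
    name        STRING,
    city        STRING,
    total_spend DOUBLE
)
STORED AS TEXTFILE
"""

top_cities = """
SELECT city, COUNT(*) AS customers, AVG(total_spend) AS avg_spend
FROM customers
GROUP BY city
ORDER BY avg_spend DESC
LIMIT 10
"""

# One optional way to submit these from Python is the PyHive client
# (hostname and port below are placeholders):
# from pyhive import hive
# cursor = hive.Connection(host="hive-server", port=10000).cursor()
# cursor.execute(create_table)
# cursor.execute(top_cities)
# print(cursor.fetchall())
print(top_cities)
```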
HBase – HBase is built on top of HDFS. HBase uses a key/value structure to store the contents of an HBase table; each value is the data stored at the intersection of a row, a column and a version. Each key consists of the following elements: row length, row (sometimes called the row key), column family length, column family, column qualifier, version and key type.
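The cell model described above can be mimicked with a plain Python mapping. The column families, qualifiers and values below are invented for illustration, and this is a conceptual model only, not the HBase client API.

```python
# Conceptual sketch of the HBase cell model: a value lives at the intersection
# of row key, column family, column qualifier and version (timestamp).
# Families, qualifiers and values are invented; this is not the HBase client API.
table = {}

def put(row_key, family, qualifier, version, value):
    table[(row_key, family, qualifier, version)] = value

def get_latest(row_key, family, qualifier):
    versions = [k[3] for k in table if k[:3] == (row_key, family, qualifier)]
    return table[(row_key, family, qualifier, max(versions))] if versions else None

put("cust001", "info", "name", 1, "Asha")
put("cust001", "info", "city", 1, "Chennai")
put("cust001", "info", "city", 2, "Coimbatore")   # newer version of the same cell

print(get_latest("cust001", "info", "city"))   # -> 'Coimbatore'
```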
Pig – Apache Pig consists of a data flow language, Pig Latin, and an environment to execute Pig code. The main benefit of using Pig is to utilize the power of MapReduce in a distributed system while simplifying the tasks of developing and executing a MapReduce job.

Mahout – Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms on top of Hadoop, the open-source Apache framework that stores and processes big data in a distributed environment across clusters of computers using simple programming models. Mahout implements popular machine learning techniques such as recommendation, classification and clustering.

YARN – The technology used for job scheduling and resource management, and one of the main components of Hadoop, is called YARN. YARN stands for Yet Another Resource Negotiator, though developers usually just call it Yarn; it was previously called MapReduce 2 and NextGen MapReduce. YARN enables Hadoop to support different processing types: it runs interactive queries, streaming data and real-time applications, and supports a broader range of applications. YARN combines a central resource manager with different containers; it can allocate resources dynamically to different applications, and the operations are monitored well.

MapReduce – A software framework for distributed processing of huge data sets on computing clusters, and a core component of Hadoop. MapReduce is performed in two different operations: the map operation converts a set of data into another set of data in which the elements are broken up into key/value (tuple) pairs, and the reduce operation combines all the tuples based on the key and modifies the key's value accordingly.
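A word-count sketch in plain Python conveys the two operations just described. This only simulates the map and reduce steps locally; it is not Hadoop code.

```python
# Plain-Python simulation of the two MapReduce operations (word count).
# This is only a local illustration, not Hadoop code.
from collections import defaultdict

documents = ["big data needs big storage", "big data analytics"]

# Map: break each record into key/value (tuple) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + Reduce: combine all tuples that share the same key.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # e.g. {'big': 3, 'data': 2, 'needs': 1, 'storage': 1, 'analytics': 1}
```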
Spark – An open source parallel processing framework that is fast because it works in memory. Spark is another big data processing engine, with built-in machine learning capabilities, and is reported to run some in-memory workloads up to 100 times faster than Hadoop MapReduce.
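Below is a minimal PySpark sketch of such in-memory processing. The sample rows are invented, and a local Spark installation (the pyspark package) is assumed.

```python
# Minimal PySpark sketch (assumes the pyspark package / a local Spark install).
# The sample rows are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("review-example").getOrCreate()

rows = [("books", 120.0), ("books", 80.0), ("toys", 45.5), ("toys", 30.0)]
df = spark.createDataFrame(rows, ["category", "amount"])

# In-memory aggregation, distributed across the cluster (or local cores).
df.groupBy("category").sum("amount").show()

spark.stop()
```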
Apache Hadoop – A free, Java-based framework to store large amounts of data in a cluster. It splits big data and distributes the pieces across the nodes of the cluster.

Non-Relational Database – It stores massive sets of data. Many organizations use non-relational databases to replace XML and to transmit structured data between a web app and the server.
[Figure: the Hadoop ecosystem stack, with Sqoop, Flume, Pig, Hive, Mahout, Oozie, ZooKeeper and HBase running on top of YARN/MapReduce v2 and the Hadoop Distributed File System (HDFS).]