
A Review on Big Data - Concept, Challenges, Algorithms,

Applications and Future Scope


Prof. Bhavana A. Khivsara1, Assistant Professor, Computer Engineering, SNJB's Late Sau K.B. Jain COE, Chandwad
Dr. M. R. Sanghavi2, Professor & Head, Computer Engineering, SNJB's Late Sau K.B. Jain COE, Chandwad
Prof. Kainjan M. Sanghavi3, Associate Professor, Computer Engineering, SNJB's Late Sau K.B. Jain COE, Chandwad

Abstract: Big Data refers to the storage and processing of huge amounts of data, including text, audio, video, images, etc. Such large volumes of data are used in many applications such as fraud detection, telecommunications, health and life sciences, e-commerce and customer service. Big data is most powerful for modern businesses that use intelligent automation. Machine learning is based on algorithms that learn from data, and big data is supplied to the analytical systems of machine learning. Apart from machine learning, big data also plays a major role in artificial intelligence, deep learning and IoT. Big data challenges include data storage, data analysis, data capture, visualization, querying, search, sharing, transfer, updating, information privacy and data sources.

Keywords: 6 V's of big data, Machine learning, Deep learning, Data Visualization, HADOOP, Issues of big data, Applications of big data.

1. INTRODUCTION

Big data may include both structured and unstructured data for business use. Its importance does not depend on the amount of data alone, but on what the organization does with the collected data. These data can be analyzed in depth to make good decisions for business moves. Figure 1 shows the six V's that characterize big data.

Figure-1: 6 V's of Big Data (Volume: terabytes, records, tables; Velocity: batch, real-time, processes; Variety: structured, unstructured, multi-factor; Veracity: trustworthiness, authenticity, origin/reputation; Value: statistical, events, correlations; Variability: changing data, changing model)

1.1. Volume: It refers to the ever-increasing amount of data. Because of the large volume of data, distributed systems are used to manage it: the data can be stored in different locations and brought together by software when required. For example, on Facebook around 10 billion messages, 4.5 billion likes and 350 million pictures are uploaded every day. Analyzing this kind of data is a very big challenge in engineering.

1.2. Velocity: It accelerates the data analysis process. Velocity deals with the speed at which data is generated, collected and analyzed. Data grows every second, so the speed of transmission and analysis must keep up; newer big data technologies analyze the data as it is being generated, before it is even put into a database.

1.3. Value: It is mainly about business growth. The value of the data is related to the cost of collecting and analyzing it, so that the data can be monetized. A link between data and insights does not always mean that the data has value.

1.4. Veracity: It concerns ultra-reliable data sets. Veracity deals with noise and abnormality in the data; in a big data strategy the data should be kept clean. Gathering loads and loads of data is of no use if it is not of good quality.

1.5. Variability: To what extent, and how fast, is the structure of the data changing? And how often does the meaning or shape of the data change?

1.6. Variety: It brings new forms of data for investigation. Variety means the different data types and categories in a big data repository. Nowadays not all data are structured in data tables; approximately 80% of data is unstructured. The latest big data technologies handle both structured and unstructured data.

2. CONCEPTS

Most organizations and industries are facing a big challenge in protecting and analyzing the increasing volume of data. Big data analytics is a strategy used to analyze large data sets and uncover the hidden patterns and connections among the data. Big data analytics supports businesses in achieving more profit, discovering new revenue opportunities, improving the efficiency of customer service delivery, etc. Big data analytics deals with the challenges of unstructured and vast data. Hadoop is a widely used framework for big data analytics: it takes the incoming data and divides it across cheap disks. This technology is used to take better decisions in business.
Data science is a recent area which helps to collect, analyze, visualize, manage and preserve huge data. Data mining is the process of extracting important patterns from large data sets.
Artificial Intelligence and its sub-branches (for example Machine Learning, Deep Learning and Neural Networks) are all algorithm based. These algorithmic methods are applied to vast amounts of data (Big Data) to produce the desired results and to find trends, patterns and predictions. Composite analytical tasks beyond human imagination are performed on Big Data with the help of Machine Learning and Artificial Intelligence.
Machine learning is classified into supervised and unsupervised learning. In supervised learning, the training data includes both the inputs and the desired outputs. This kind of learning is fast and accurate. Supervised learning is further divided into two families of algorithms: classification and regression. Classification algorithms are suitable where the output is categorical, while regression algorithms are used where the output value is real-valued. In unsupervised learning, on the other hand, the information is neither classified nor labeled, and the algorithm acts on the information without guidance; unlike supervised learning, no training signal or teacher is provided. Unsupervised learning is again divided into two families: clustering and association. Clustering groups related data into several different types, while association is suitable for describing large portions of data (a short sketch contrasting the two settings is given at the end of this section).
Deep learning [1] is a subset of machine learning. It uses multi-layered artificial neural networks to deliver tasks such as speech recognition, object detection, language translation, etc. Deep learning helps to perform predictive analysis in a more accurate manner. NVIDIA's GPU-accelerated frameworks are well suited to implementing deep learning and provide interfaces to programming languages like C, C++ and Python. TensorFlow and PyTorch are other frameworks used by scientists and researchers to improve productivity [2].
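To make the distinction between supervised and unsupervised learning concrete, the sketch below trains a classifier and a clustering model on a small synthetic data set. It uses scikit-learn, which is not discussed in this paper; the data, model choices and parameters are illustrative assumptions only.

```python
# Minimal sketch: supervised classification vs. unsupervised clustering.
# scikit-learn and the synthetic data are illustrative choices, not from the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Synthetic data: 500 samples, 4 numeric features, 2 classes.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: inputs AND desired outputs (labels) are given.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Unsupervised learning: only the inputs are given; the algorithm groups them itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in (0, 1)])
```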

Figure-2: Relationship between Big Data, Data Science, Artificial Intelligence, Machine Learning and Deep Learning
3. ADVANTAGES OF BIG DATA

Big data is mainly used for the benefit of human beings. It is also used in science, technology and business.

3.1. Customer services:
Big data helps customers to create predictive models for specific tasks. This kind of prediction is done by data analysis, which helps customers to understand the behavior of the entire process. Customer relationship management systems are used to help customers stay in contact with the enterprise.

3.2. Increased productivity:
Big data analytics is used to increase the productivity of business processes. Vendors predict the stock of any product through social media data, weather forecasts and web search trends. Supply chain management is one of the best applications of big data analytics. Big data analytics also improves HR processes.

3.3. Reduced costs:
Big data tools and automation help to reduce costs. These big data analytics tools are used, for example, to automate self-driving cars with cameras, sensors, GPS and powerful computers. By implementing such new technologies the cost can be reduced.

3.4. Improved customer service:
Customer service is very important for all businesses and organizations. Customer feedback on a service is collected into a common repository and later analyzed, which produces better decisions and results. Customers can meet the product management team to improve the service beyond what others in the market offer. If the organization does not respond to customer service requests, it loses customers, which will affect the business.

3.5. Fraud detection:
Fraud is a false representation of normal data and happens frequently in the financial industries. Anomalies can be detected easily by the machine learning techniques of big data. This technique has helped banks and credit card companies to spot stolen cards easily [3]. Frauds can be detected easily in structured data, but it is a big challenge to detect them in unstructured data, which does not follow any model.
Another common use for big data is improved security: big data tools help police departments to catch criminals and detect their activities. National security agencies also use big data analytics to detect terrorists living among us. Some big data techniques help to detect cyber-security threats.
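As a concrete illustration of the anomaly-detection idea in Section 3.5, the sketch below flags unusual card transactions with an isolation forest. The library (scikit-learn), the synthetic transaction amounts and the contamination rate are all illustrative assumptions; the paper itself does not prescribe a particular algorithm.

```python
# Minimal anomaly-detection sketch for card transactions (illustrative only).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Mostly ordinary transaction amounts, plus a few extreme ones.
normal = rng.normal(loc=50.0, scale=15.0, size=(500, 1))
suspicious = np.array([[900.0], [1250.0], [780.0]])
amounts = np.vstack([normal, suspicious])

# contamination ~ expected fraction of anomalies (an assumption, tune per data set).
model = IsolationForest(contamination=0.01, random_state=0).fit(amounts)
labels = model.predict(amounts)          # +1 = normal, -1 = anomaly
print("flagged amounts:", amounts[labels == -1].ravel())
```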
4. ISSUES IN BIG DATA

4.1. Problems in managing data: Data from different sectors is very large, which requires more space to store and management tools to process. Some tools are used to manage the heterogeneous formats [8] of data, but this is a tedious process, and if the data is not managed properly the results are unacceptable. Many firms have adopted business intelligence to manage the huge amount of data, but it is difficult for them to move from their traditional working platform onto the new platform. Therefore we still need advanced technology and tools to manage this situation.

4.2. Storage issues:
For every business application, or any kind of firm, storage of the large volume of big data is a major issue. Big data volumes are normally measured in exabytes; storing a single exabyte would already require on the order of 25,000 disks (see the short estimate after this section), which is not possible in a single system. So we need to store the data in the cloud [8]. Even if the data is stored in the cloud, it takes a long time to store it from a variety of data collections and to retrieve it again. This is a major issue in storing bulky data.

4.3. Processing issues:
Most organizations are moving to the online mode of processing to boost their business or customer services. For this mode, storage in the range of zettabytes is required, and processing this huge amount of data is still a challenging task. Some organizations use the MapReduce tool [8], which performs batch processing over a long time: it gives accurate results, but the processing is still slow.
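To put the 25,000-disk figure from Section 4.2 in perspective, the short calculation below works out how many drives one exabyte would occupy. The 40 TB drive capacity is purely an assumption chosen to match that estimate, and real deployments also add replication overhead.

```python
# Back-of-the-envelope estimate for the storage claim in Section 4.2 (assumptions only).
EXABYTE_TB = 1_000_000          # 1 EB = 1,000,000 TB (decimal units)
DISK_TB = 40                    # assumed capacity of a single drive
REPLICATION = 3                 # HDFS-style 3x replication, a common default

disks_raw = EXABYTE_TB / DISK_TB
disks_replicated = disks_raw * REPLICATION
print(f"raw: {disks_raw:,.0f} disks, with 3x replication: {disks_replicated:,.0f} disks")
# raw: 25,000 disks, with 3x replication: 75,000 disks
```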
5. PLATFORM, TOOLS AND SOFTWARE USED IN BIG DATA

HADOOP (High Availability Distributed Object-Oriented Platform): Hadoop is a framework for handling both formatted (structured) and unformatted (unstructured) data on distributed servers. A characteristic Hadoop platform stack is HDFS + Hive + HBase + Pig.

HDFS – The Hadoop Distributed File System (HDFS) is a file management framework for data distribution and storage in the system. In this storage system the files are stored as a sequence of blocks of the same size, except for the last block. This file system makes data handling and storage easy.

Hive – Apache Hive enables users to process data without explicitly writing MapReduce code. The Hive language, HiveQL (Hive Query Language), resembles Structured Query Language (SQL). A Hive table consists of rows and columns: the rows typically correspond to some record, transaction, or particular entity (for example, a customer), and the values of the corresponding columns represent the various attributes or characteristics of each row. A user may consider using Hive if the user has experience with SQL and the data is already in HDFS. Hive is not intended for real-time querying.
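Since HiveQL reads much like SQL, a small query may help make this concrete. The snippet below submits HiveQL from Python through the PyHive client; the host, table name and columns are invented for illustration and assume a running HiveServer2 instance, none of which comes from the paper.

```python
# Illustrative HiveQL query submitted via PyHive (table/host names are made up).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM transactions
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
for customer_id, total_spent in cursor.fetchall():
    print(customer_id, total_spent)
cursor.close()
conn.close()
```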
HBase – HBase is built on top of HDFS. HBase uses a key/value structure to store the contents of an HBase table. Each value is the data to be stored at the intersection of a row, a column, and a version. Each key consists of the following elements: row length, row (sometimes called the row key), column family length, column family, column qualifier, version, and key type.

Pig – Apache Pig consists of a data flow language, Pig Latin, and an environment to execute Pig code. The main benefit of using Pig is to utilize the power of MapReduce in a distributed system while simplifying the tasks of developing and executing a MapReduce job.

Mahout – Hadoop is an open-source framework from Apache that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models. Apache Mahout is an open-source project built on top of it that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as recommendation, classification and clustering.

YARN – The technology used for job scheduling and resource management, and one of the main components of Hadoop, is called YARN. YARN stands for Yet Another Resource Negotiator, though the developers simply call it Yarn. YARN was previously called MapReduce2 and NextGen MapReduce. It enables Hadoop to support different processing types: it runs interactive queries, streaming data and real-time applications, and it supports a broader range of applications. YARN combines a central resource manager with per-application containers; it can assign resources dynamically to different applications, and the operations are monitored well.

MapReduce – A software framework for distributed processing of huge data sets on computing clusters and a core component of Hadoop. MapReduce is performed in two operations: the map operation converts a set of data into another set of data in which individual elements are broken up into key/value (tuple) pairs, and the reduce operation combines all the tuples that share a key and modifies the key's value accordingly. A small word-count sketch in this style is given below, after this list of tools.

Spark – Spark is an open-source parallel processing framework that is faster for in-memory operations. It is another big data processing engine with built-in machine learning capabilities, and it can be up to 100 times faster than Hadoop MapReduce.

Apache Hadoop – A free, Java-based framework to store large amounts of data in a cluster. It splits big data and distributes the data across the nodes of the cluster.

Non-Relational Databases – They store massive sets of data. Many organizations use non-relational databases to replace XML and to transmit structured data between a web app and the server.
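The sketch below shows the map and reduce steps described above as a word count written in the style of Hadoop Streaming, where the mapper and reducer are plain scripts reading standard input. It is a minimal illustration, not taken from the paper; a real job would run the two functions as separate scripts under the Hadoop streaming jar.

```python
# Word count in the MapReduce style (Hadoop-Streaming-like, illustrative only).
from itertools import groupby

def mapper(lines):
    """Map: break each line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce: combine all tuples sharing a key and sum their values."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data needs big storage", "big data needs fast processing"]
    for word, count in reducer(mapper(text)):
        print(word, count)
    # e.g. big 3, data 2, needs 2, ...
```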

Figure-3: Hadoop Ecosystem: HUE (Hadoop User Experience), Sqoop, Pig, Hive, Mahout, Oozie, Flume, ZooKeeper, HBase, YARN/MapReduce v2, and the Hadoop Distributed File System (HDFS)


6. APPLICATIONS

Nowadays Big Data is used everywhere. It plays a major role in the business, health and financial sectors.

6.1. Data Visualization

Visualization is the process of creating visual images from data, and it is used in complex applications such as detective agencies, police enquiries and health issues. It is done with computer graphics software. The relationships among multiple values can be identified very easily, even in a large data set. Data visualization is applied in big data to represent data patterns and insights. This kind of pictorial or graphical representation of a large data set makes it easy for decision makers to take good decisions. With big data visualization a scientist can explore the data efficiently, and it improves the return on investment in business. Many software vendors offer good visualization tools, such as TIBCO, Qlik and Tableau software.

Tableau [4]: Tableau Desktop is interactive data visualization software. It uses drag and drop of data fields to represent the data visually, so programming skills are not required to use it. It is a very easy and fast tool, and it is free for students. Typical sheets in a Tableau dashboard include bubble charts, tree maps and line graphs, for example (a small programmatic equivalent of the bubble sheet is sketched after this list):
1. Bubble: Partner (text), Measure (Trade value - size), Filter.
2. Tree Map: Commodity description (text), Measure (Commodity value share - size), Trade Flow, Filter.
3. Line Graph: Measure (Trade value - size).
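The bubble sheet above maps one measure to bubble size. The same idea can be sketched programmatically; the snippet below uses matplotlib (an illustrative substitute for Tableau, with made-up trade figures) to draw a simple bubble chart.

```python
# Illustrative bubble chart: bubble area encodes a "trade value" measure.
import matplotlib.pyplot as plt

partners = ["A", "B", "C", "D"]
exports = [120, 340, 90, 210]        # x-axis measure (made-up numbers)
imports = [200, 150, 80, 260]        # y-axis measure (made-up numbers)
trade_value = [300, 900, 150, 600]   # bubble size measure (made-up numbers)

plt.scatter(exports, imports, s=trade_value, alpha=0.5)
for name, x, y in zip(partners, exports, imports):
    plt.annotate(name, (x, y))       # label each bubble with its partner
plt.xlabel("Exports")
plt.ylabel("Imports")
plt.title("Trade value by partner (bubble size = trade value)")
plt.show()
```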
6.2. Big Data in Healthcare:

Big Data plays a major role in all areas of medicine; in this paper three categories are explained: image processing, signal processing and genomics. For almost every activity in a hospital, images are very important for spotting diseases. Medical images are used for processing such as:
1. Diagnosis
2. Computed Tomography (CT)
3. Magnetic Resonance Imaging (MRI)
Signal processing is another technology used for high-resolution acquisition, where a multitude of monitors is connected to the patient. Healthcare systems now use these singular physiological waveform data [5].
6.3. Big Data in Finance:

Financial data sets accumulate historical data over long periods of time. The financial process can be re-engineered with big data to manage this growing volume of data. Big data in banking is very useful for enhancing customer services, increasing revenue and improving business engagement. Currently big data technologies are being combined with financial services to improve the efficiency of the services offered to customers. Securing the stored data in banking and financial institutions is a challenging process; security should also extend to online banking and the electronic communication of sensitive information. Dell's SharePlex connector for Hadoop is used to improve security in these firms.
6.4. Big Data in Fraud Detection:

Today people use credit cards for shopping and bill payment [10]. They spend up to the card limit and then pay the amount back later through the bank. If a card is stolen and used by some other person, the transactions show abnormal expenditure; this is called a fraudulent transaction. Identification of fraud is a complex process and it is a challenging task to detect it. Classification methods are used to find the fraud [6].
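To illustrate the classification approach just mentioned, the sketch below trains a classifier on a small synthetic set of transactions labeled normal or fraudulent. scikit-learn, the two features and the class_weight setting are illustrative assumptions; references [6] and [10] describe more elaborate techniques.

```python
# Illustrative fraud classification on synthetic transactions (not from the paper).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(7)
# Features: [amount, hour of day]; label 1 = fraudulent, 0 = normal.
normal = np.column_stack([rng.normal(60, 20, 980), rng.integers(8, 22, 980)])
fraud = np.column_stack([rng.normal(800, 150, 20), rng.integers(0, 6, 20)])
X = np.vstack([normal, fraud])
y = np.array([0] * 980 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# class_weight="balanced" compensates for fraud being rare in the data.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=["normal", "fraud"]))
```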
6.5. Big Data and Sentiment Analysis:

Sentiment analysis on big data is mainly applied to social media such as WhatsApp, Facebook and other social networks. The major purpose of sentiment analysis is to determine the user's attitude and mood: the opinion is expressed as a positive or negative emotion. Other people's opinions are very important for taking decisions. A Hadoop-based environment is used to perform the sentiment analysis: the different opinions are collected from different users and the gathered information is stored in the HDFS environment. The data are classified at sentence level, and machine learning techniques and algorithms are used to find whether the sentiment is positive, negative or neutral. Sentiment analysis is an application of Natural Language Processing (NLP). IBM developed IBM Social Media Analytics [7], which captures structured and unstructured data from social media networks to develop a common understanding of opinions, attitudes and trends.
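As a toy version of the sentence-level classification described above, the sketch below trains a Naive Bayes classifier on a handful of labeled sentences and scores a new one. The tiny training set and the scikit-learn pipeline are illustrative assumptions; a production system would train on large volumes of social media text stored in HDFS.

```python
# Toy sentence-level sentiment classifier (illustrative data and model choice).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "I love this product, excellent service",
    "great experience, will buy again",
    "terrible support, very disappointed",
    "worst purchase ever, total waste",
    "it was okay, nothing special",
    "average delivery time, acceptable",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_sentences, train_labels)

print(model.predict(["the service was excellent and fast"]))   # likely 'positive'
```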
7. CONCLUSION

This paper gives a literature survey of big data and analytics. It covers the introduction to big data, its concepts, and its applications. We also discussed big data in the Hadoop environment. One of the major challenges is to manage big data as the field of Hadoop, with its continuous innovations, keeps getting bigger. Revolutionary technologies need to be developed to exploit big data completely in the future. Another challenge in big data is to provide greater security on social media networks.

REFERENCES

[1] Mao, Feng, et al. "Small Boxes Big Data: A Deep Learning Approach to Optimize Variable Sized Bin Packing." arXiv preprint arXiv:1702.04415 (2017).
[2] NVIDIA Deep Learning. Available from: https://developer.nvidia.com/deep-learning/
[3] Sharma, Vikash, Bhavna Pandey, and Vipin Kumar. "Importance of big data in financial fraud detection." International Journal of Automation and Logistics 2.4 (2016): 332-348.
[4] Tableau. Available from: https://www.tableau.com//
[5] Belle, Ashwin, et al. "Big data analytics in healthcare." BioMed Research International 2015 (2015).
[6] Kamaruddin, Sk, and Vadlamani Ravi. "Credit card fraud detection using big data analytics: use of PSOAANN based one-class classification." Proceedings of the International Conference on Informatics and Analytics. ACM, 2016.
[7] IBM Social Media Analytics. Available from: http://www-01.ibm.com/software/analytics/solutions/customer-analytics/social-media-analytics/
[8] Wani, Mudasir Ahmad, and Suraiya Jabin. "Big Data: Issues, Challenges, and Techniques in Business Intelligence." Big Data Analytics. Springer, Singapore, 2018. 613-628.
[9] Satyanarayana, L. "A Survey on Challenges and Advantages in Big Data." International Journal of Computer Science and Technology 6.2 (2015): 115-119.
[10] Sharma, Vikash, Bhavna Pandey, and Vipin Kumar. "Importance of big data in financial fraud detection." International Journal of Automation and Logistics 2.4 (2016): 332-348.
