Challenges in Big Data Analytics Techniques

This document discusses challenges in big data analytics techniques. It begins by defining big data and its characteristics of volume, velocity, and variety. Traditional data management and analytics techniques are unable to handle big data due to its massive scale. The document then surveys various big data analytics techniques including text mining, social media analysis, and predictive analytics. It also discusses challenges faced by these techniques in areas like data storage, processing speed, and handling diverse data types and sources.

Challenges in Big Data Analytics Techniques: A Survey

C. Komalavalli
Information Technology
Jagan Institute of Management Studies, Rohini
New Delhi, India
[email protected]

Chetna Laroiya
Information Technology
Jagan Institute of Management Studies, Rohini
New Delhi, India
[email protected]

978-1-5386-5933-5/19/$31.00 ©2019 IEEE

Abstract— Explosive growth in several areas of business, engineering, health care, scientific studies and social networking results in enormous volumes of data. The Internet of Things, mobile computing and cloud computing also play a significant role in data generation. Knowledge recovery and decision making from these huge volumes of data are very challenging. Big Data computing is a new paradigm that combines large datasets with techniques for analysing them efficiently. Big data refers to huge datasets characterized by the 3 V's: Volume, Velocity and Variety. Proper processing of data enables us to identify new opportunities and gain knowledge about the market. Traditional techniques are not able to store and manage big data. In this paper, we focus on various data analytics techniques and the challenges in those techniques.

Keywords— Big data, Text Mining, Social Media, Audio, Video, Predictive analytics

I. INTRODUCTION

Our century has been defined as the Electronic age. New advancements in semiconductor technologies, innovations in information technology and a world connected through the Internet of Things are leading to faster computing and large-scale storage. This trend enables the generation of a tremendous amount of data, including sensor data, health care data, social media data, stock exchange data, transport data, grid data, etc. [1]. The collection of these data is known as Big Data.

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making [2].

Big Data has gained the attention of all IT communities of the digital world. The rate of data generation and collection is increasing exponentially every day. The data produced by the rapid evolution and adoption of Big Data in industry needs to be processed in order to take intelligent business decisions and optimize business processes. Big Data is derived from many devices such as personal computers, smartphones, satellites, GPS devices, sensors, monitoring devices, RFID, etc. The importance of Big Data can be seen in various aspects of business, such as cost reduction, time reduction, smart decisions, optimized offerings, satisfying the customer, understanding customer needs, etc.

Data management is defined as the processes and technologies for storing and retrieving data for analysis. Big data analytics is a sub-process for extracting inferences from big data.

Fig 1 Big Data Processes [3]

II. BIG DATA CHARACTERISTICS

• Volume
Volume refers to the size of data. In the era of information technology, the volume of data generated by PCs, Facebook, mobile devices, etc. is in terabytes or petabytes. Big data volumes vary with respect to time and type of data. Two datasets with the same volume may need different data management techniques for storage and retrieval, depending on their data type [3].

• Velocity
The term 'velocity' refers to the speed at which data is generated. The speed at which data is generated and must be processed to meet demand is hard to imagine. The data flows from sources such as processes, machines, networks and human interaction with social media sites are massive and continuous. This rate of data growth drives the need for real-time analytics and evidence-based planning.

• Variety
Variety refers to the heterogeneous data sources and data formats: structured, semi-structured and unstructured data. Nowadays data can be in the form of audio, video, email, photos, Facebook chats, etc. These are unstructured and pose the problem of storing, processing and analysing the data.

• Veracity
The quality of the collected data is also to be considered one of the characteristics of Big Data. When considering data quality, it is important to set rules about which types of data are collected and from which sources. The frequency of new data collection and the type of data sources are also to be considered [4].

• Variability
Variability refers to the dynamic nature and inconsistency of the data. Data flows from different sources can be inconsistent, with daily, seasonal and event-triggered peak data loads. Big data is also variable because of the data dimensions arising from multiple sources and data types.

• Validity
Validity refers to the accuracy and correctness of the data for the analysis. Big data analytics is useful for decision making only if the underlying data is valid. Validity requires that the data is made free from noise during the cleaning step.

• Value
Value is about deriving business value from the data, which is very important. It helps people understand the customer better and optimize processes, thus improving machine or business performance.

• Volatility
Volatility concerns the storage of big data and the validity of what is stored. For real-time data, the relevance of the data to the analysis needs to be defined.

• Vulnerability
Vulnerability refers to the security concerns of the data.

A. Type of Data
Traditionally, data is stored in a structured format, in the form of rows and columns. Current data volumes are stored in structured, unstructured and semi-structured formats.

Structured data: It refers to traditional database data, which is stored in a systematic manner. 20% of total existing data is structured and used in most programming activities. Structured data can be generated by humans and machines. Web logs and sensor data fall under this category. Human-generated data includes personal data of a human.

Unstructured data: 80% of the total data is unstructured. As the name suggests, there is no format in storage. Unstructured data is also classified as machine-generated or human-generated. Machine-generated data can be satellite images, scientific data, radar data, etc. Human-generated data includes social media data, website content, audio and video from YouTube, images from Facebook, Twitter, etc., and text messages from various social media sites and mobile applications.

Semi-structured data: It does not conform to the formal structure of the data models associated with databases, but contains tags and semantic elements that enforce hierarchies of records and fields within the data. A self-describing structure defines semi-structured data.

B. Need for Big Data Analytics:
Storage and retrieval of large amounts of structured, unstructured and semi-structured data is a challenging task for decision makers. Marketers focus on target marketing, insurance providers focus on providing personalized insurance to their customers, and healthcare providers focus on providing quality, low-cost treatment to patients [3]. Even though advancements in technology help in predicting human behaviour, it is necessary to understand the driving and regulating factors for handling Big Data. Current Big Data techniques include rule-based systems, pattern mining, decision trees and other data mining techniques for developing business rules from large volumes of data. Big data applications need effective storage mechanisms, horizontal computational scalability and availability of data in the system.

C. Big Data Analytics:
It is the process of analysing big data, i.e. large and varied datasets, to discover hidden patterns, unknown correlations, market trends, customer preferences and other useful information that can help organizations take business decisions. It is a set of technologies and techniques used for finding hidden patterns in large datasets. The goal of big data analytics is to help business organizations take better decisions. Since a large volume of the data is unstructured, effective extraction techniques are applied to extract structured data. Because the extracted data is a summary of the unstructured data, some data could be lost in the extraction.

The process of extracting meaningful insights from big data involves two steps: 1) data management and 2) analytics. Data management covers the tools and techniques involved in generating data at the sources, collecting the data from the various sources, pre-processing the data and retrieving the data for further analysis. Analytics involves the tools and techniques used to analyse the data and acquire knowledge from the patterns. These techniques can be applied to structured, unstructured and semi-structured data.

D. Big Data and Warehousing Difference:
There is a lot of confusion between these two terms; the differences are discussed in Table 1:

TABLE 1: DIFFERENCE BETWEEN BIG DATA AND WAREHOUSING

| | Data Warehousing | Big Data |
| Meaning | Architecture for extracting data from RDBMS and data repository | It is a technology |
| Use | Organizations need to make informed decisions, e.g. on the current scenario, next year's planning, etc. | Organizations compare with large volumes of data to take better decisions, gain a larger customer base, more profit, etc. |

224 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence)
TABLE 1 (continued)

| | Data Warehousing | Big Data |
| Sources of data | Data from homogeneous or heterogeneous databases | Data from any type of resource, such as RDBMS, social media data, etc. |
| Data type format | Structured | Structured, unstructured, semi-structured |
| Type of data | Historical data | Real data, streaming data |
| Volatile nature of data | Non-volatile, subject-oriented data | Non-volatile, subject-oriented data |
| Processing | It takes more processing time | HDFS takes less time for processing the data |

E. Challenges in Analytics on Big Data
Big data analytics poses three-fold challenges. The attributes of big data themselves pose the first challenge. Data processing, which includes aggregation and cleaning, should result in valuable and valid data. Aggregating data from different sources and in different formats is also a research challenge. Availability of data is also compromised.

Fig 2: Challenges in Big Data Analytics [5]
[Figure: three groups of challenges.
Data challenges (related to the data itself): volume, velocity, variety, value, veracity, validity.
Process challenges (related to the analytics of data): 1. Data acquisition and analytics; 2. Data aggregation & integration; 3. Analysis & modelling; 4. Data interpretation; 5. Missing data & cleaning.
Management challenges (privacy, security, governance and lack of skill for data analytics): 1. Data & information sharing; 2. Data ownership; 3. Cost; 4. Security & privacy; 5. Data governance.]

1) Analytics of Unstructured Data:
Unstructured data constitutes 95% of big data. The format of semi-structured data does not conform to strict standards. Extensible Markup Language (XML), a textual language for exchanging data on the Web, is a typical example of semi-structured data. XML documents contain user-defined data tags which make them machine-readable.

2) Unstructured Text Analytics [6]:
Social network feeds, emails, blogs, online forums, survey responses, corporate documents, news and call center logs are examples of textual data held by organizations. Text analytics involves statistical analysis, computational linguistics and machine learning.

Modern text analytics techniques need some of the features listed in Table 2:

TABLE 2: FEATURES OF LATEST GENERATION OF TEXT ANALYTICS SOLUTION

| | Feature | Business Value |
| Foundation | Data-driven | Requires self-learning systems |
| Speed | Real time instead of periodic runs | Continuous listening to textual data streams |
| Logic | Probabilistic inferences | Explores different answers and gains trust of users |
| Output | Intuitive visualizations | Enables self-service text analytics for lines of business communication to the users |

Most of the available data is in textual form in databases. In this context, manual analysis or extraction of important information is not possible, so it is necessary to provide automatic tools for analysing large textual data [6][19]. The following are popular techniques applied for text analytics.

a. Information Extraction (IE): The algorithm extracts structured data from unstructured text. The two sub-tasks involved in IE are Relation Extraction (RE) and Entity Recognition (ER). For example, given the sentence "Steve Jobs co-founded Apple Inc. in 1976", the RE sub-system extracts relations such as Founder of [Steve Jobs, Apple Inc.] or Founded In [Apple Inc., 1976] [3].

b. Question based Information Retrieval: The question-based approach has three components. The first is question processing, which identifies the focus area of the question and the expected answer. The second component is document processing. The third component is answer processing, which extracts candidate answers from the output of the previous component, ranks them and returns the highest-ranked candidate as the output of the QA system.

c. Text summarization: Applications where this method is applicable include scientific and news articles, advertisements, emails and blogs [3]. There are two methods, extractive and abstractive. In the extractive method, the summary is built by determining the salient units of a text, based on their location and frequency in the text. In the abstractive method, Natural Language Processing is used to generate the summary. Such systems generate more coherent summaries than extractive systems, but implementation is very difficult.
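The extractive method described under text summarization scores units of text by their frequency and location. A minimal sketch of that idea in Python is given here. The stop-word list, the position_weight parameter and the scoring formula are illustrative assumptions, not a method taken from the surveyed literature.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real systems use larger lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s.strip() for s in parts if s.strip()]


def tokenize(sentence):
    """Return lowercase word tokens with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2, position_weight=0.5):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = split_sentences(text)
    # Frequency: how often each word occurs in the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            continue
        frequency_score = sum(freq[w] for w in words) / len(words)
        # Location: earlier sentences receive a larger bonus (assumed heuristic).
        location_score = position_weight / (i + 1)
        scored.append((frequency_score + location_score, i))
    top = sorted(scored, reverse=True)[:max_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Calling summarize(text, max_sentences=1) returns only the single highest-scoring sentence, and raising position_weight favours sentences near the start of the document. An abstractive system would instead generate new sentences, which this sketch does not attempt.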

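The relation extraction example given under Information Extraction (item a) can be made concrete with a deliberately naive, pattern-based sketch. The regular expression, the function name extract_founding_relations and the relation labels founder_of and founded_in are assumptions made for illustration. Practical IE systems rely on trained entity recognizers and relation classifiers rather than one hand-written pattern.

```python
import re

# Hypothetical pattern: "<Person> founded|co-founded <Organization> in <Year>".
# Names are approximated as runs of capitalized words, which is a simplification.
PATTERN = re.compile(
    r"(?P<person>[A-Z][\w.]*(?: [A-Z][\w.]*)*) (?:co-)?founded "
    r"(?P<org>[A-Z][\w.]*(?: [A-Z][\w.]*)*) in (?P<year>\d{4})"
)


def extract_founding_relations(sentence):
    """Return (relation, subject, object) triples found in the sentence."""
    relations = []
    for match in PATTERN.finditer(sentence):
        person = match.group("person")
        org = match.group("org")
        year = match.group("year")
        relations.append(("founder_of", person, org))
        relations.append(("founded_in", org, year))
    return relations


if __name__ == "__main__":
    for triple in extract_founding_relations(
        "Steve Jobs co-founded Apple Inc. in 1976"
    ):
        print(triple)
```

In this sketch the capitalized-word groups stand in for the entity recognition sub-task and the fixed verb pattern for the relation extraction sub-task. Sentences phrased differently are simply not matched.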
d. Sentiment analysis (opinion mining) techniques [3]: These are used to analyse opinionated text, which contains people's opinions toward products and organizations. Document-level techniques determine whether the complete document holds a negative or a positive sentiment, assuming that the sentiment is about one entity only. Sentence-level techniques determine the polarity of a single sentence about an entity expressed in that sentence.

e. Text mining is one of the important fields of data mining dealing with unstructured or semi-structured data. The categorization (supervised) technique counts the words in a document, and the topic of the document is decided by the counts. In this process, text documents are classified into predefined class labels. The segmentation technique is used for outlier analysis. The association technique is used for affinity analysis, link analysis and sequence analysis.

Fig 3 Steps of Text Analytic Process [6]

3) Social Media Analytics [7]:
This analytics can be categorized depending upon the type of data.
• Content based: This is used for data based upon content, such as customer feedback, product reviews, images and videos.
• Structure based: This is based on extracting intelligence from the relationships among the participating entities. An example could be a research document.
Automated sentiment analysis of social media digital texts uses elements from machine learning such as latent semantic analysis and semantic orientation [7]. Techniques for social media analytics belong to three broad areas:
• Computational statistics: refers to computationally intensive statistical methods including resampling methods, local regression, kernel density estimation and principal components analysis.
• Machine learning: a system capable of autonomously acquiring and integrating the knowledge learnt from experience and analytical observation.
• Complexity science: integrates complex simulation models derived from statistical physics and nonlinear dynamics.

4) Audio Analytics [8]:
Audio analytics is also known as speech analytics. We are always in contact with audio, and the brain is continuously processing audio data. The audio format can be WAV, MP3 or Windows Media format. Audio data is unstructured and needs to be pre-processed before the analytics. The major applications of speech analytics are call centres and health care. Every day, thousands of calls are recorded in the different domains of call centres. This analytics helps in increasing the customer base [9]. As the volume of audio data is growing rapidly, there is a need for technology-based analytics solutions. Technology-based analytics is more efficient and cost effective compared to manual processing. Phonetic search and automatic speech recognition (ASR) can be used for analytics. In the ASR approach, the raw audio stream is transcribed and the search terms are looked up in the transcript. Phonetic search is used for proper names, fuzzy matching, out-of-vocabulary words and uncommon language varieties. Automatic speech recognition is used for short search terms and context-sensitive search [10].

Automatic speech recognition is widely known as large vocabulary continuous speech recognition. This process has two phases, indexing and searching. In the first, the speech recognition algorithm matches sounds to words from a predefined dictionary. If an exact match cannot be found, it picks the most similar one. In the second, text-based methods are used to find the closest match in the index file. The phonetic approach searches for phonetic closeness between the search term and the raw audio stream. It uses less linguistic context, since it matches the audio signal to a sequence of speech sounds. ASR generates transcripts by matching the audio signal to words [18].

The choice of search strategy determines the dependency on linguistic context. The ASR approach is more context sensitive and depends on a statistical language model. The phonetic approach uses very little linguistic context. The technology can be selected depending on the user's choice.

5) Video Analytics:
It is also known as video content analysis, and videos are a major contributor of unstructured data [11]. Digital cameras, surveillance cameras, CCTV and YouTube videos are

contributing to this analytics [16]. Analysing video streams and extracting useful information from them is a highly challenging task. Video analytics techniques are categorized into server-based and edge-based architectures. In the server-based architecture [12], all the videos are stored on a centralized server for processing. Before that, the videos are compressed by reducing the frame rate or image resolution. This may affect the accuracy of the analysis. In the second approach, videos are processed on the local system, on the raw data captured by the camera. This gives more accurate results than the first approach, since the whole video stream is available for the analysis. Deep learning can be used in video analytics [13][14]. It extracts hierarchical representations from large-scale data such as images and video by using deep architecture models with multiple layers of non-linear transformation. Deep learning methods differentiate between levels of abstraction through the design of the layer depth, width and features needed for the learning tasks.

III. BIG DATA ANALYTICS PROBLEMS: OUR PERSPECTIVE

From the study, we observed that human-generated data is authentic or based on personal liking. Human-generated data could be biased and, if so, could lead to false findings whose impact depends on the domain of application. We face the problem of spurious correlation when applying regression analysis with two or more variables. Is that because the data under analysis is not 100% authentic?

On the other hand, machine-generated data is comparatively authentic. Social media data is accessible through APIs, but due to the commercial value of the data, most of the major sources such as Facebook and Google are making it increasingly difficult for academics to obtain comprehensive access to their 'raw' data. Do we have complete data for the domain, or is part of the data, possibly holding important findings, unavailable? The data taken for analytics has to represent the generic set; no disjoint set of data related to the domain under analysis should be stripped off. If stripped-off data holds some insight that is not revealed by the collected dataset included in the analytics, the result could be compromised. Another issue observed from the study of existing technologies for unstructured data analytics is that performance measurement of social media analytics has no fixed algorithm. In video analytics, the issues are the size of the video stream and the junk created by the video stream. Technology-based analytics is also required for audio data.

IV. CONCLUSION

This paper presents the fundamental concepts of Big Data and its characteristics. We presented the various types of big data and the role of Big Data analytics in the current scenario. Our primary focus has been on various types of analytic techniques and on gaining valuable information from big data. In this paper, we have presented the various unstructured big data analytics techniques in detail and the challenges faced while applying those techniques. Our future work is aimed at solutions for the challenges in big data discussed above and at the existing techniques for its analytics.

REFERENCES

[1] K. A. I. Hammad, M. A. I. Fakharaldien and J. M. Zain, "Big Data Analysis and Storage," in International Conference on Operations Excellence and Service Engineering, Orlando, Florida, USA, September 10-11, 2015.
[2] M. A. Beyer and D. Laney, "The Importance of 'Big Data': A Definition," Gartner, 2012.
[3] A. Gandomi and M. Haider, "Beyond the hype: Big data concepts, methods, and analytics," International Journal of Information Management, vol. 35, no. 2, pp. 137-144, April 2015.
[4] J. Chen, Y. Chen, et al., "Big data challenge: a data management perspective," Front. Comput. Sci., vol. 7, no. 2, pp. 157–164, 2013.
[5] O. Müller, I. Junglas, S. Debortoli and J. v. Brocke, "Using Text Analytics to Derive Customer Service Management Benefits from Unstructured Data," MIS Quarterly Executive, December 2016.
[6] J. P. Verma, S. Agrawal and B. Patel, "Big Data Analytics: Challenges And Applications for Text, Audio, Video And Social Media Data," International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), vol. 5, no. 1, February 2016.
[7] B. Batrinca and P. C. Treleaven, "Social media analytics: a survey of techniques, tools and platforms," AI & SOCIETY, vol. 30, no. 1, pp. 89–116, February 2015.
[8] Audio analytics: new opportunities in litigation and investigation, Ernst & Young Global Limited, UK, 2016.
[9] N. Khan, I. Yaqoob, et al., "Big Data: Survey, Technologies, Opportunities, and Challenges," Scientific World Journal, p. 18, Volume 2014.
[10] F. B. Adamu, A. Habbal, S. Hassan, R. L. Cottrell, B. White and I. Abdullahi, "A Survey On Big Data Indexing Strategies".
[11] U. Sivarajah, M. M. Kamal, Z. Irani and V. Weerakkody, "Critical analysis of Big Data challenges and analytical methods," Journal of Business Research, vol. 70, pp. 263-286, 2017.
[12] G. D. Puri and D. D. Haritha, "Big Data Analytics, Privacy Concerns, Privacy Preserving Methods and Privacy Concerns," Indian Journal of Science and Technology, vol. 9, no. 17, May 2016.
[13] C. Ji, Y. Li, et al., "Big Data Processing: Big Challenges and Opportunities," Journal of Interconnection Networks, vol. 13, no. 3 & 4, 2012.
[14] R. Feldman and J. Sanger, The Text Mining Handbook, Cambridge University Press, 2007.
[15] C. Nyce, "Predictive Analytics White Paper," American Institute for CPCU.
[16] S. Sagiroglu and D. Sinanc, "A big Data Review," International Conference on Collaboration Technologies and Systems (CTS), IEEE Xplore, 25 July 2013.
[17] S. Kaisler, F. Armour, J. A. Espinosa and W. Money, "Big Data: Issues and Challenges Moving Forward," 46th Hawaii International Conference on System Sciences, IEEE, 18 March 2013.

[18] U. Sivarajah, M. M. Kamal, Z. Irani, et al., "Critical analysis of Big Data challenges and analytical methods," Journal of Business Research, vol. 70, pp. 263-286, 2017.
[19] X. Wu, X. Zhu, Gong-Qing Wu and Wei Ding, "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, vol. 6, issue 1, January 2014, pp. 97-107.
