A Review Report on Big Data Analytics

Muhammad Faizan Berlas


Department of Computer Science
Virtual University of Pakistan
Lahore, Pakistan
[email protected]

Abstract—The emergence of modern information systems, social media platforms, Internet of Things (IoT) devices, web and mobile applications, and other new technologies has resulted in the generation of huge amounts of data. Big data refers to extensive and diverse datasets that are generated from a wide range of sources and are difficult to process and analyze using traditional data tools. Software systems that are designed to handle large and complex datasets are known as big data software systems. These are a collection of tools, technologies, and platforms designed to handle, process, analyze, and extract insights from big data. This research paper presents a comprehensive review of various aspects of big data software engineering, development tools, literature reviews, techniques, and terminologies. It offers valuable insights into the significant areas of big data that have undergone extensive research in recent years. By providing a thorough understanding of big data software systems, the paper serves as a valuable resource for researchers venturing into the realm of big data and seeking potential avenues for future research.

Keywords—Big data, Big data software engineering, Big data analytics, Big data software development.

I. INTRODUCTION
The term big data refers to large, complex, diverse, and difficult-to-process datasets that are generated from a wide range of sources including social media platforms, Internet of Things (IoT) devices, web and mobile applications, machine and sensor data, transactional data, government and public data, scientific and research data, multimedia data and other heterogeneous data sources [2]. Because of its complexity, this type of data is difficult to process with traditional data management tools or data processing applications [4]. Today’s digital world is also called the era of big data [5]. The emergence of new technologies and advancements in data collection methods have generated a lot of data from a wide range of data sources. All of this data needs to be stored, processed, and analyzed for various reasons.

Big data stores a lot of valuable information. From this data, valuable insights and patterns can be generated. This information is helpful in business and its careful examination is a crucial factor for staying ahead of the competition. However, data management and analysis of big data present serious challenges and require significant resources, new methods, and powerful technologies [2]. Likewise, developing big data software systems requires scalable architecture, and hence software requirement engineering for developing big data systems must consider pervasive distribution, variable request loads, computation-intensive analytics, high availability, and sustainability [9].

In this paper, several publications are reviewed to provide an overview of the research work done in the field of big data analysis and software engineering. The surveyed publications include published literature from 2015 to 2023.

The first step towards developing a big data software system is to understand big data and the technologies and terminologies associated with it. Hence, this paper first presents an overview of different published literature and surveys related to big data technologies and terminologies [2] and [6]. Once a foundational understanding of big data is established, further review is done on published literature related to big data software engineering and architecture for developing big data systems [1], [3], [5], [8], [9], and [12].

Big data processing, analysis, and utilization are a few of the most important aspects of big data research. Published literature related to this area of big data research has also been reviewed in this research paper [4], [7], [10], and [11].

This paper is organized as follows: Section II presents the research work that was collected and analyzed for the literature review. Section III provides the findings from numerous research papers and discusses them in the context of big data software engineering. Finally, Section IV presents the conclusions, the limitations of the current research work, and ideas for future research.

II. RELATED WORK
To conduct a comprehensive literature review, a diverse collection of research publications covering various aspects of big data, including analytics, software engineering, development tools, techniques, terminologies, and literature reviews, was gathered. These publications were sourced from reputable online research databases, with a specific focus on selecting journals indexed in either Science Citation Index Expanded (SCIE) or Emerging Sources Citation Index (ESCI) for their rigorous peer-review process and academic authority.

Table I shows the list of 12 papers that were collected and presents them in ascending order by the year of publication. For each paper, the area of research work is presented along with a brief description of the research area and the proposed approach provided in the paper.

TABLE I. RELATED WORK IN BIG DATA

Paper | Year | Research Area | Proposed Approach
Gorton and Klein [8] | 2015 | Big data software development | Case study
Acharjya and Ahmed [4] | 2016 | Big data challenges, open research issues, and tools | Narrative literature review
Gorton et al. [9] | 2016 | Big data software engineering | Narrative literature review
Wang et al. [7] | 2017 | Big data processing | Fuzzy set techniques
Osvaldo et al. [12] | 2017 | Big data software development | MapReduce model
Heidari et al. [11] | 2018 | Big data processing | Systematic literature review
Oussous et al. [2] | 2018 | Big data technologies | Narrative literature review
Gurcan and Cagiltay [5] | 2019 | Big data software engineering | Latent Dirichlet Allocation (LDA) based topic modeling
Grüning et al. [3] | 2019 | Big data software engineering | Narrative literature review
Davoudian and Liu [1] | 2020 | Big data software engineering | Narrative literature review
Abdalla [6] | 2022 | Big data techniques and terminologies | Narrative literature review
Alyahya et al. [10] | 2023 | Big data analytics capabilities and impact on sustainable performance | Structural equation modeling (SEM) method

[Figure: the 4 V's of big data. Variety (structured, unstructured); Veracity (availability, accountability); Velocity (fast generation, growth rate); Volume (terabytes, petabytes).]

Fig. 1. Attributes of Big Data [4].


III. DISCUSSION
This section provides the findings and proposed approaches presented in the research work that was selected for the literature review. The research questions are designed to determine the big data research areas and identify the knowledge gaps in big data software engineering, analysis, and development. The objective of this discussion is to provide a comprehensive overview of different aspects of big data research and identify research areas for future work.

A. RQ1: What is big data and what are the different aspects of big data research?
This question attempts to develop a foundational understanding of big data and its associated technologies and terminologies. Oussous et al. [2] provide a detailed overview of big data applications, challenges, and technologies developed for big data systems to overcome these challenges. Furthermore, these technologies are compared with each other based on different system layers such as Data Storage Layer, Data Processing Layer, Data Querying Layer, Data Access Layer, and Management Layer. Table II shows the list of big data applications and challenges provided by Oussous et al. [2].

TABLE II. BIG DATA APPLICATIONS AND CHALLENGES BY OUSSOUS ET AL. [2]

Big Data Applications | Big Data Challenges
Smart Grid Case | Big data management
E-health | Big data cleaning
Internet of Things (IoT) | Big data aggregation
Public Utilities | Imbalanced systems capacities
Transportation and logistics | Imbalanced big data
Political services and government monitoring | Big data analytics
Software development frameworks | Big data machine learning. This includes: 1. Data stream learning; 2. Deep learning; 3. Incremental and ensemble learning; 4. Granular computing

Abdalla [6] provides a comprehensive review of techniques and tools used for big data processing. NoSQL, Cassandra, Hadoop, Storm, Spark, Hive, and OpenRefine are the tools utilized for analyzing big data [6]. Machine learning, deep learning, cloud computing, fog computing, edge computing, and concentric computing are the techniques utilized in big data processing applications. The significance of the work in [6] lies in its analysis of big data terminologies and techniques across various factors, including publication year, performance metrics, achievement of existing models, and employed methods.

The work by Davoudian and Liu [1] offers valuable insights into three key software engineering activities within the context of big data software systems. These activities include requirement elicitation, software design and construction based on specified requirements, and software quality assurance. The work of [1] is based on state-of-the-art research and industrial practices used in big data software applications. Likewise, [3] presents a comprehensive explanation of several software engineering aspects and provides a set of guidelines for developing big data software systems. These guidelines can be considered as a unified framework of procedures, offering a valuable resource to both new and experienced scientists for building robust big data software systems.

Gurcan and Cagiltay [5] identify a lack of knowledge about the skill sets and domains required for big data software engineering. Software requirement engineering for big data software applications requires more advanced, progressive, and specific professional knowledge such as scalable architecture, real-time data processing and coding, integration, and testing. To determine the skill sets and knowledge domains for big data software engineering, [5] analyzed the online job ads for big data software engineering and collected data from indeed.com, an online employment site offering comprehensive search and filter options. The job ads were queried and filtered to find ads containing “big data”, “big data developer”, “big data software”, etc. A total of 2,638 online job ads were selected. The collected data was then processed to convert uppercase letters to lower case and to remove stop words, HTML tags, and other unnecessary characters. With the completion of this step, each job ad was characterized by a list of unique words; hence, a total of 2,638 job ads were characterized by 10,432 terms. The words belonging to each job ad were then combined to create a document in which each job consists of a list of unique words. The resulting document, known as the document-term matrix, was used for quantitative analysis. The document-term matrix was analyzed to determine major competency areas and their required knowledge and skills. This was done by the Latent Dirichlet Allocation (LDA) model, which is a probabilistic model used to determine the abstract topics from the systematic set of words provided in the document-term matrix. The LDA model was implemented with the MALLET tool. This tool employs the Gibbs sampling algorithm for implementing the LDA model. The granularity level of discovered topics and the ideal number of selected topics were obtained by analyzing the relevance and logical connection between discovered topics and their keywords. Most of the topic names were given by a meaningful combination of the first four keywords. Some topic names were assigned by considering the general meaning of all keywords.

In this way, 48 trending topics were determined that reveal the knowledge domains and skill sets of big data software engineering. These skills and knowledge domains were then mapped into 10 core competency areas and a competency map was developed. Furthermore, the 15 most in-demand programming languages, the 15 most in-demand programming tools, the 15 most in-demand databases and warehouses, the 15 most in-demand big data tools, the 15 most in-demand combinations of programming languages and databases or data warehouses, and the 15 most in-demand combinations of tools and databases or data warehouses were identified. These results help in determining the skill sets and knowledge areas that are required for big data software engineering. Table III shows the list of big data competency areas, programming languages, and programming tools provided by [5].

Gorton and Klein [8] present a case study of their work to enhance a system for consolidating data from multiple petascale medical-record databases for clinical applications. For attaining high scalability and availability, [8] use NoSQL databases for data aggregation. Architecture tactics, which are elemental design decisions embodying architectural knowledge for satisfying the quality attributes of a particular design, have been used by [8] for developing the architecture of big data software systems.

Gorton et al. [9] describe big data design challenges that must be addressed to develop data-intensive systems and elaborate on five issues that cause such additional design challenges.

The first issue is pervasive distribution. High scalability and availability are achieved by highly distributed systems.

Big data systems must support write-heavy workloads [9]; this requirement constitutes the second challenge. Big data systems, spanning from social media platforms to high-resolution sensor data collection in the power grid, necessitate the capability to endure and handle substantial volumes of write-intensive operations.

The third issue revolves around the management of variable request loads. Systems often encounter highly fluctuating workloads due to factors such as product promotions, emergencies, and statutory deadlines like tax submissions. To avoid the expenses associated with overprovisioning to handle occasional spikes, cloud platforms offer elasticity, enabling applications to dynamically add processing capacity when necessary and release resources during low-demand periods. Effectively leveraging this deployment mechanism necessitates an architecture that incorporates application-specific strategies to detect workload surges, swiftly allocate additional resources to distribute the load, and release resources as the workload diminishes.
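As a rough illustration of the third issue described by [9], the cycle of detecting a surge, allocating resources, and releasing them can be sketched as a simple scaling rule. This is a minimal sketch and not an approach taken from [9]: the function names `desired_nodes` and `scaling_actions`, the per-node capacity, the node limits, and the headroom factor are all hypothetical values chosen for illustration.

```python
import math


def desired_nodes(request_rate, per_node_capacity=100.0,
                  min_nodes=2, max_nodes=50, headroom=1.2):
    """Return how many nodes to run for the observed request rate.

    All parameters are hypothetical: per_node_capacity is the number of
    requests per second one node is assumed to serve, and headroom adds
    spare capacity so a surge does not immediately saturate the nodes.
    """
    needed = math.ceil(request_rate * headroom / per_node_capacity)
    # Clamp to limits: stay above the availability floor and below
    # the budget ceiling.
    return max(min_nodes, min(max_nodes, needed))


def scaling_actions(observed_rates, **kwargs):
    """Yield (rate, nodes, action) for a sequence of observed loads."""
    current = kwargs.get("min_nodes", 2)
    for rate in observed_rates:
        target = desired_nodes(rate, **kwargs)
        if target > current:
            action = "allocate"   # workload surge detected
        elif target < current:
            action = "release"    # workload has diminished
        else:
            action = "hold"
        current = target
        yield rate, target, action


if __name__ == "__main__":
    # A quiet period, a surge (for example a tax deadline), then decay.
    for rate, nodes, action in scaling_actions([150, 900, 2400, 600, 100]):
        print(f"rate={rate:>5} req/s -> {nodes:>2} nodes ({action})")
```

A production system would usually smooth the observed load and add a cooldown period so that short spikes do not cause repeated allocation and release.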
TABLE III. BIG DATA COMPETENCY AREAS AND PROGRAMMING LANGUAGES AND PROGRAMMING TOOLS BY GURCAN AND CAGILTAY [5]

Competency Areas | Programming Languages | Programming Tools
Big data frameworks | Java | Jenkins
Big data processes | Python | Maven
Big data analytics | Scala | Spring MVC
Data processing types | R | SVN
Software development life cycle | C++ | Github
Programming | JavaScript | Hibernate
Software development frameworks | .Net | Node.js
Vocational background | Ruby | Angular.js
Soft skills | C | JQuery
Work style | Perl | Backbone.js
Interoperability | C# | Sprint
 | Typescript | Lucene
 | Go | NumPy
 | Julia | Ant
 | Php | Flask

Computation-intensive analytics is the fourth issue in designing big data software systems. In big data systems, there is a need to support a wide range of query workloads, encompassing both quick-response requests and long-running queries that involve complex analytics on substantial portions of the data collection. Effectively addressing the diverse requirements of transactional and analytical workloads at a large scale poses a significant software engineering design challenge, with cost-effectiveness as a key consideration.

The fifth issue in designing big data software systems is to achieve high availability for an application comprising thousands of nodes. Hardware and network failures are inevitable in such applications; hence, the resulting distributed software and data architecture must be resilient.

Osvaldo et al. [12] utilize the MapReduce model and propose a practical approach that utilizes Model Driven Engineering (MDE) for the semi-automated development of software systems for Big Data platforms. The model presented by [12] is of great importance because it helps in extracting valuable information from big data by preserving the business logic and employing big data features throughout the development process.

[12] also provides a comparison of the proposed model with other published models. The evaluation criteria used by [12] include the Practical solution, Approach, Requirement phase, Design phase, Implementation phase, and Test phase.
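For readers unfamiliar with the MapReduce model that [12] builds on, the following is a minimal single-process sketch of its map, shuffle, and reduce phases, using word counting as the task. It is an illustration only and is not the model-driven approach of [12]; the function names `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are hypothetical, and a real platform such as Hadoop would run these phases in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    logs = ["big data systems", "Big data analytics", "data platforms"]
    print(map_reduce(logs))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be distributed without shared state, which is what makes the model suitable for large datasets.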
TABLE IV. BIG DATA CHALLENGES AND OPEN RESEARCH ISSUES BY ACHARJYA AND AHMED [4]

Big Data Challenges | Big Data Open Research Issues
Data storage and analysis | IoT for big data analytics
Knowledge discovery and computational complexities | Cloud computing for big data analytics
Scalability and visualization of data | Bio-inspired computing for big data analytics
Information security | Quantum computing for big data analytics

Acharjya and Ahmed [4] describe the characteristics of big data software systems and identify open research issues in big data analytics. Like [12], [4] also describe the challenges faced during the design and development of big data systems; however, unlike [12], the challenges identified by [4] are related to computational complexities, information security, and computation methods of big data software systems. Table IV provides the list of big data challenges and open research issues in big data analytics detailed by [4].

Wang et al. [7] focus on fuzzy set techniques that are used for big data processing and elaborate on the benefits of fuzzy sets in solving big data processing problems. Furthermore, [7] present a critical assessment of big data processing problems and propose an advanced augmentation of fuzzy sets and their integration with other tools to provide a unique and promising environment for overcoming big data processing problems.

Alyahya et al. [10] identify the potential impact of big data analytics capabilities on sustainable performance, specifically through the lens of strategic agility. [10] use the resource-based view and dynamic capabilities view to build a theoretical framework. A survey employing positivist methodology was conducted to gather data from 410 managers in Saudi Arabia. The collected data was analyzed using the Structural Equation Modeling (SEM) method. The findings of [10] indicate that big data analytics capabilities have a significant effect on economic, environmental, social, and sustainable performance.

The storage of big data in the form of graphs is becoming increasingly popular. A graph is a computational method for performing analysis on huge datasets [11]. Heidari et al. [11] investigated and categorized graph processing frameworks and systems that are used for big data analysis.

B. RQ2: What are the specific areas of Big Data research that have been the most extensively studied?
This question attempts to identify the most prominent areas in big data research. After careful examination of the published literature provided in this research paper, it can be concluded that big data software engineering is the most extensively studied area of big data research [1], [3], [5], and [9]. Big data processing [7] and [11], big data software development [8] and [12], and big data technologies and terminologies [2] and [6] are other areas of big data research, with similar amounts of research work in each area. The impact of big data analytics capabilities on business and sustainable performance [10] is the least studied area of big data research.

IV. CONCLUSION AND FUTURE WORK
Big data software engineering poses significant challenges because of the unique requirements associated with big data systems. Likewise, big data software design and development also require state-of-the-art tools and technologies to meet the quality and design requirements of big data systems. This research work establishes a foundational understanding of various aspects of big data by providing a detailed overview of published literature. The research work specifically focuses on big data software engineering, development, processing, and published literature reviews and attempts to identify research gaps in big data research.

Although there is a lot of published literature on big data, given the relevance and the vast and difficult nature of the topic, it appears that only a small amount of work has been published thus far. The published research work focuses on various aspects of big data, and several approaches have been proposed to solve specific challenges; however, there is a need for detailed empirical studies evaluating their usefulness in real-life business and industrial problems. Moreover, these proposed approaches are yet to be implemented on an industrial scale. Published literature on the utilization of big data in e-commerce is rare. Similarly, case studies describing the benefits and exploring the competitive advantages of big data analysis in business and marketing are scarce.

One of the limitations of this research paper is the small number of published research papers selected for the literature review. This is because only those research papers were selected which have open access and which exist in either the Science Citation Index Expanded (SCIE) indexed journals or Emerging Sources Citation Index (ESCI) indexed journals. Moreover, this literature review tries to combine various aspects of big data research in a single article: big data software engineering, development, processing, analysis, and literature reviews are all combined in this research work. Big data is an emerging field with a vast scope and it requires a more systematic approach to review its published literature. Reviewing the published literature based on specific inclusion and exclusion criteria and focusing on any particular aspect of big data research will provide a more focused overview of the subject.

REFERENCES
[1] A. Davoudian and M. Liu, “Big Data Systems,” ACM Computing Surveys, vol. 53, no. 5, pp. 1–39, Oct. 2020, doi: 10.1145/3408314.
[2] A. Oussous, F.-Z. Benjelloun, A. Ait Lahcen, and S. Belfkih, “Big Data technologies: A survey,” Journal of King Saud University - Computer and Information Sciences, vol. 30, no. 4, pp. 431–448, Oct. 2018, doi: 10.1016/j.jksuci.2017.06.001.
[3] B. A. Grüning, S. Lampa, M. Vaudel, and D. Blankenberg, “Software engineering for scientific big data analysis,” GigaScience, vol. 8, no. 5, May 2019, doi: 10.1093/gigascience/giz054.
[4] D. P. Acharjya and K. Ahmed, “A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 2, Feb. 2016, doi: 10.14569/IJACSA.2016.070267.
[5] F. Gurcan and N. E. Cagiltay, “Big Data Software Engineering: Analysis of Knowledge Domains and Skill Sets Using LDA-Based Topic Modeling,” IEEE Access, vol. 7, pp. 82541–82552, 2019, doi: 10.1109/ACCESS.2019.2924075.
[6] H. B. Abdalla, “A brief survey on big data: technologies, terminologies and data-intensive applications,” Journal of Big Data, vol. 9, no. 1, Nov. 2022, doi: 10.1186/s40537-022-00659-3.
[7] H. Wang, Z. Xu, and W. Pedrycz, “An overview on the roles of fuzzy
set techniques in big data processing: Trends, challenges and
opportunities,” Knowledge-Based Systems, vol. 118, pp. 15–30, Feb.
2017, doi: 10.1016/j.knosys.2016.11.008.
[8] I. Gorton and J. Klein, “Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems,” IEEE Software, vol. 32, no. 3, pp. 78–85, May–June 2015, doi: 10.1109/MS.2014.51.
[9] I. Gorton, A. B. Bener, and A. Mockus, “Software Engineering for Big Data Systems,” IEEE Software, vol. 33, no. 2, pp. 32–35, Mar.–Apr. 2016, doi: 10.1109/MS.2016.47.
[10] M. Alyahya, M. Aliedan, G. Agag, and Z. H. Abdelmoety,
“Understanding the Relationship between Big Data Analytics
Capabilities and Sustainable Performance: The Role of Strategic
Agility and Firm Creativity,” Sustainability, vol. 15, no. 9, p. 7623,
May 2023, doi: 10.3390/su15097623.
[11] S. Heidari, Y. Simmhan, R. N. Calheiros, and R. Buyya, “Scalable
Graph Processing Frameworks: A Taxonomy and Open
Challenges,” ACM Computing Surveys, vol. 51, no. 3, pp. 1–53, Jul.
2018, doi: 10.1145/3199523.
[12] S. S. Osvaldo, D. Lopes, A. C. Silva, and Z. Abdelouahab, “Developing
software systems to Big Data platform based on MapReduce model:
An approach based on Model Driven Engineering,” Information and
Software Technology, vol. 92, pp. 30–48, Dec. 2017, doi:
10.1016/j.infsof.2017.07.0.
