Big Data Ecosystem
All content following this page was uploaded by Rajat Kumar Behera on 25 October 2018.
1 Introduction
The concept of big data has evolved from the explosion of data generated by sources
such as data centers, the cloud, the internet, the Internet of Things (IoT), mobile
devices and sensors [1]. Data has entered every industry and every business operation
and is now regarded as a major factor of production [2]. Big data is a broad term that
encompasses new technology developments and trends [3]. It refers to enormous
volumes of structured, semi-structured and unstructured data, usually measured in
petabytes (PB) or exabytes (EB). The data are collected at an unparalleled scale, which
makes intelligent decision-making difficult. For instance, during data acquisition,
deciding which data to keep and which to discard remains a challenging task, and
storing reliable data with the right metadata is a major concern. Although such
decisions can be based on the data itself, much of the data is still not in a structured
format. Blog content, tweets and Instagram feeds are weakly structured pieces of
text, while machine-generated data such as satellite images, photographs and videos
are well structured for storage and visualization but not for semantic content and
search. Transforming such content into a structured format for data analysis is
difficult. Undoubtedly, big data has the potential to help industries improve their
operations and make intelligent decisions.
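To illustrate what transforming weakly structured text into a structured format can involve, the following is a minimal sketch in Python. It is not taken from any system discussed in this paper: the record type StructuredPost, the function structure_post and the sample post (including its example.org link) are hypothetical names chosen for illustration, and the regular expressions cover only simple hashtags, mentions and links.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredPost:
    """Hypothetical structured record derived from a free-text post."""
    text: str
    hashtags: list = field(default_factory=list)
    mentions: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def structure_post(raw: str) -> StructuredPost:
    """Pull the few machine-readable tokens out of a loosely structured post."""
    urls = re.findall(r"https?://\S+", raw)
    # Strip URLs first so that '#' fragments inside links are not read as hashtags.
    without_urls = re.sub(r"https?://\S+", " ", raw)
    hashtags = [h.lower() for h in re.findall(r"#(\w+)", without_urls)]
    mentions = [m.lower() for m in re.findall(r"@(\w+)", without_urls)]
    text = " ".join(without_urls.split())  # normalise whitespace
    return StructuredPost(text=text, hashtags=hashtags, mentions=mentions, urls=urls)


# Hypothetical sample input.
post = structure_post(
    "Loving the new #BigData stack from @ApacheKafka https://example.org/a#b"
)
print(post.hashtags, post.mentions, post.urls)
```

Even this small example has to make a cleaning decision (discarding link fragments before looking for hashtags), which hints at why the same task is hard for images, audio and video, where no such surface markers exist.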
1.1 The Five V’s of Big Data
Every day, 2500 petabytes (PB) of data are created from digital pictures, videos,
transaction records, social media posts, intelligent sensors, etc. [4]. Big data
thus describes massive and complex data sets that are infeasible to manage with
traditional software tools, statistical analysis and machine learning algorithms.
Big data can therefore be characterized by the creation of data and its storage,
analysis and retrieval, as defined by the 5 V's [5].
1. Volume: It denotes the enormous quantity of data generated in a very short time.
Volume determines the value and potential of the data under consideration and
requires special computational platforms for analysis.
2. Velocity: It refers to the speed at which data is created and processed to meet the
challenges and demands that lie in the path of development and growth.
3. Variety: It refers to the types of content available for data analysis. Big data is
not just the acquisition of strings, dates and numbers. It is also the data collected
from various sources like sensors, audio, video, structured, semi-structured and
unstructured texts.
4. Veracity: It is added by some organizations that focus on the quality of, and the
variability in, the captured data. It refers to the trustworthiness of the data and
the reputation of the data sources.
5. Value: It refers to the significance of the data being collated for analysis.
The value proposition is realized when the data are easy to access and yield
quality analytics (descriptive, diagnostic, prescriptive) that produce insightful
action in a time-bound manner.
Additionally, two more V’s, visualization and variability (i.e. constantly changing
data), are commonly added to make 7 V’s, but this extended list still fails to address
additional requirements such as usability and privacy, which are equally important.
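The daily volume quoted at the start of this subsection can be restated in other units with a short calculation. The sketch below assumes decimal (SI) prefixes, i.e. 1 PB = 10^15 bytes and 1 EB = 10^18 bytes; the constant and function names are illustrative only.

```python
# Decimal (SI) prefixes are assumed: 1 PB = 10**15 bytes, 1 EB = 10**18 bytes.
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18


def pb_to_bytes(petabytes: int) -> int:
    """Convert petabytes to bytes under the SI assumption above."""
    return petabytes * BYTES_PER_PB


daily_pb = 2500                        # figure quoted in the text
daily_bytes = pb_to_bytes(daily_pb)    # 2.5 * 10**18 bytes
daily_eb = daily_bytes / BYTES_PER_EB  # the same volume in exabytes

print(daily_bytes)  # 2500000000000000000
print(daily_eb)     # 2.5
```

Under this assumption, 2500 PB per day equals 2.5 EB, or 2.5 quintillion bytes, which matches the figure in the title of reference [4].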
Fig. 1. Three layers of λ architecture, namely Batch, Speed and Serving Layer
Open-source technology stacks for λ architecture are presented in Table 1.
Table 1. λ Architecture Open Source Technology Stack
Area             Technology Stack
Data Ingestion   Apache Kafka, Apache Flume and Apache Samza
Batch Layer      Apache Hadoop, Apache MapReduce, Apache Spark and Apache Pig
Batch Views      Apache HBase, ElephantDB, and Apache Impala
Speed Layer      Apache Storm, Apache Spark Streaming
Real-time View   Apache Cassandra, Apache HBase
Manual Merge     Apache Impala
Query            Apache Hive, Apache Pig, Apache Spark SQL and Apache Impala
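The interplay of the batch, speed and serving layers named in Fig. 1 can be sketched in a few lines of Python. This is a toy, in-memory illustration of the commonly described λ pattern, not an implementation built on any of the technologies in Table 1; the page-view events, the function names and the SpeedLayer class are hypothetical.

```python
from collections import Counter

# Hypothetical page-view events as (page, timestamp) pairs.
master_dataset = [("home", 1), ("docs", 2), ("home", 3)]  # absorbed by the last batch run
recent_events = [("home", 4), ("blog", 5)]                # arrived after that run


def batch_view(events):
    """Batch layer: recompute the view from the full master dataset."""
    return Counter(page for page, _ in events)


class SpeedLayer:
    """Speed layer: update a real-time view incrementally, one event at a time."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, page, _timestamp):
        self.realtime_view[page] += 1


def query(page, batch, realtime):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch[page] + realtime[page]


batch = batch_view(master_dataset)
speed = SpeedLayer()
for page, ts in recent_events:
    speed.on_event(page, ts)

print(query("home", batch, speed.realtime_view))  # 3
print(query("blog", batch, speed.realtime_view))  # 1
```

The merge in query corresponds to the "Manual Merge" row of Table 1: the batch view is complete but stale, the real-time view is fresh but partial, and only their combination answers a query correctly.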
Fig. 5. Mu Architecture
The open-source technology stack for Mu architecture remains the same as that for
Kappa architecture.
6. Zeta architecture: It is built from pluggable components that together form a
holistic architecture [18]. Zeta is characterized by seven components,
namely:
1. Distributed File System (DFS): It is the common data location for all needs
and is reliable and scalable.
2. Real-time Data Storage: It is based on real-time distributed technologies,
especially NoSQL solutions, and is meant for delivering responses to user
requests promptly.
3. Enterprise Applications (EA): EA aims to realize all business goals of the
system. Examples of this layer are web servers and business
applications.
4. Solution Architecture (SA): SA focuses on a specific business problem.
Unlike EA, it concerns a narrower problem. Different solutions can be
combined to construct the solution for a more global problem.
5. Pluggable Compute Model (PCM): It implements all analytic computations
and is pluggable in nature, as it has to cater to different needs.
6. Dynamic Global Resource Management (DGRM): It allows dynamic
allocation of resources, which enables the business to easily accommodate
priority tasks.
7. Deployment/Container Management System: It guarantees a single,
standardized method of deployment, so that deployed resources are
unaffected by changes of environment, i.e. deployment in the local
environment is identical to deployment in the production environment.
The architecture is outlined in Figure 6.
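The pluggable nature of the compute model (component 5 above) can be illustrated with a small registry in which compute engines share one interface and are selected by name. This is a sketch under our own assumptions: the class PluggableComputeModel, the two toy engines and their names are hypothetical and do not correspond to any API of a Zeta implementation.

```python
from typing import Callable, Dict, Iterable

# Assumption: a compute engine is modelled as a function from records to a number.
ComputeEngine = Callable[[Iterable[float]], float]


class PluggableComputeModel:
    """Hypothetical registry: engines are plugged in by name and chosen per job."""

    def __init__(self) -> None:
        self._engines: Dict[str, ComputeEngine] = {}

    def register(self, name: str, engine: ComputeEngine) -> None:
        self._engines[name] = engine

    def run(self, name: str, records: Iterable[float]) -> float:
        if name not in self._engines:
            raise KeyError(f"no compute engine registered under {name!r}")
        return self._engines[name](records)


def batch_mean(records: Iterable[float]) -> float:
    """Toy batch engine: needs the whole data set before it can answer."""
    data = list(records)
    return sum(data) / len(data)


def streaming_max(records: Iterable[float]) -> float:
    """Toy streaming engine: keeps a running result while consuming records."""
    current = float("-inf")
    for record in records:
        current = max(current, record)
    return float(current)


pcm = PluggableComputeModel()
pcm.register("batch-mean", batch_mean)
pcm.register("streaming-max", streaming_max)

print(pcm.run("batch-mean", [2, 4, 6]))     # 4.0
print(pcm.run("streaming-max", [2, 9, 4]))  # 9.0
```

Because callers depend only on the registry interface, an engine can be added or replaced without changing the applications that submit jobs, which is the property the PCM component is meant to provide.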
3 Discussion
This paper briefly reviews big data ecosystem architectures, to the best of the
authors' knowledge, for discussion and use in research, academia and industry. The
information presented covers some research papers in the literature and a number of
systems, but it addresses only a small fraction of the existing big data technologies
and architectures, and there are many different attributes that carry equal
importance, weight and a rationale for comparison.
4 Conclusion
References
1. Salisu Musa Borodo, Siti Mariyam Shamsuddin, Shafaatunnur Hasan. “Big Data Platforms
and Techniques”, Vol. 17, No. 1, January 2016, pp. 191-200.
2. James M, Michael C, Brad B, Jacques B, Richard D, Charles R. Big data: The next
frontier for innovation, competition, and productivity. McKinsey Glob Inst. 2011.
3. 10 emerging technologies for Big Data – TechRepublic, 2012,
http://www.techrepublic.com/blog/big-data-analytics/10-emerging-technologies-for-big-data/
4. Every Day Big Data Statistics – 2.5 Quintillion Bytes of Data Created Daily, 2015,
http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/
5. Big Data: The 5 Vs Everyone Must Know, 2014,
https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know
6. Notes from Marz’ Big Data – principles and best practices of scalable real-time data
systems – chapter 1, 2017,
https://markobigdata.com/2017/01/08/notes-from-marz-big-data-principles-and-best-practices-of-scalable-real-time-data-systems-chapter-1/
7. The Secrets of Building Realtime Big Data Systems, 2011,
https://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems/15-2_Low_latency_reads_and
8. lambda architecture, 2017, http://lambda-architecture.net/
9. Big Data Using Lambda Architecture, 2015,
http://www.talentica.com/pdf/Big-Data-Using-Lambda-Architecture.pdf
10. wikipedia lambda architecture, https://en.wikipedia.org/wiki/Lambda_architecture
11. Kreps, Jay. "Questioning the Lambda Architecture", 2014, radar.oreilly.com
12. Lambda Architecture for Big Data by Tony Siciliani, 2015,
https://dzone.com/articles/lambda-architecture-big-data
13. Kappa architecture, http://milinda.pathirage.org/kappa-architecture.com/
14. Data processing architectures – Lambda and Kappa, 2015,
https://www.ericsson.com/research-blog/data-processing-architectures-lambda-and-kappa/
15. Microservices Architecture: An Introduction to Microservices, 2017,
http://www.bmc.com/blogs/microservices-architecture-introduction-microservices/
16. Data Integration Design Patterns With Microservices by Mike Davison, 2016,
https://blogs.technet.microsoft.com/cansql/2016/12/05/data-integration-design-patterns-with-microservices/
17. Real Time Big Data #TD3PI, 2015,
http://jtonedm.com/2015/06/04/real-time-big-data-td3pi/
18. Zeta architecture, 2017,
http://www.waitingforcode.com/general-big-data/zeta-architecture/read
19. The Lambda architecture: principles for architecting realtime Big Data systems,
http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for
20. iot-a : the ιnternet of thιngs archιtecture, http://iot-a.info