Modeling of Big Data Processing
HADI HASHEM
LOR Software-Networks department, Télécom SudParis, Institut Mines-Télécom, 9 rue Charles Fourier,
Evry, 91000, France
[email protected]
http://lor.telecom-sudparis.eu/
DANIEL RANC
LOR Software-Networks department, Télécom SudParis, Institut Mines-Télécom, 9 rue Charles Fourier,
Evry, 91000, France
[email protected]
http://lor.telecom-sudparis.eu/
More than 2.5 exabytes of data are created every day from the user information automatically
generated over the Internet. Social networks, mobile devices, emails, blogs, videos, banking transactions
and other consumer interactions now drive successful marketing campaigns by establishing
a new digital channel between brands and their audiences. Powerful tools are needed to store and
explore this daily expanding BigData, in order to enable easy and reliable processing of user
information. For marketers and industry, quick and high-quality results are as important as the
considerable investments behind them. Traditional modeling tools reach their limits in this challenge, as the
information keeps growing in volume and variety, a growth that can be handled only by non-relational
data modeling techniques.
1. Introduction
Data engines based on SQL (Structured Query Language), first created in the 1970s,
show a high performance indicator when processing small relational data, but are very
limited in the face of data expansion in volume and variety. MPP (Massively Parallel
Processing), first created in the early 1980s, has slowly improved the performance
indicator for complex volumes of data. Still, it cannot keep pace with BigData in its
daily expansion. Hadoop MapReduce (explained in this section) is considered the most
efficient recent processing technique, as it performs best when dealing with
complex, high volumes of data. In sections 2, 3 and 4 we give a preview of the existing
non-relational data models and the available related modeling techniques. Section 5
details the main activity of distributed systems, in terms of data consistency, data
placement and system coordination. The last part of the paper (sections 6, 7 and 8)
explains the purpose of this study and the processing model we seek, after presenting the
analysis and the results of testing. Finally, the conclusion is addressed in section 9.
1.1. MapReduce
In 2004, Google published a paper introducing a simplified data
computing technique that shows high performance when processing complex volumes of
data. An easy-to-use model such as MapReduce does not require programming skills in
parallel and distributed systems: all the details of parallelization, fault tolerance, locality
optimization and load balancing [Perera (2013)] are embedded in a plug-and-play
framework.
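As an illustration, the canonical word-count job below shows the shape of a MapReduce program written against the standard Hadoop Java API. It is a minimal sketch (the class names and the input/output paths passed on the command line are illustrative), not part of the toolchain discussed in this paper.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token of the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together and are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner for local aggregation.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Parallelization, fault tolerance and data locality are handled entirely by the framework; the programmer only supplies the map and reduce functions above.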
RDBMS (Relational Database Management Systems) are unable to handle this task for
several reasons:
(1) The primary constraining factor is the database schema, because of the continuously
changing structure of schema-less BigData.
(2) The complexity and size of the data overflow the capacity of traditional RDBMS
to acquire, manage and process data at reasonable cost (computing time and
performance).
(3) Entity-Relationship modeling of BigData does not adapt easily to fault-tolerant,
distributed systems.
NoSQL (Non-relational SQL), first used in 1998, is increasingly chosen as a viable
alternative to relational databases, particularly for interactive web applications and
services [Li (2013)], [Tudorica (2011)].
In a BigTable-style database, performing a join between two tables would be terribly inefficient. Instead, the
programmer has to implement such logic in the application, or design the application so
as not to need it. BigTable comprises a client library (linked with the user's code), a
master server that coordinates activity, and many tablet servers, which can be added or
removed dynamically.
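As a minimal sketch of such application-side logic, the plain-Java example below uses in-memory maps to stand in for two BigTable-style tables (the table names, row keys and columns are illustrative): the client reads a row from one table and uses one of its columns as the row key of the other, since the store itself offers no join operator.

import java.util.HashMap;
import java.util.Map;

public class ClientSideJoin {
  public static void main(String[] args) {
    // Two "tables": row key -> column family (plain maps stand in for the store).
    Map<String, Map<String, String>> customers = new HashMap<>();
    Map<String, Map<String, String>> orders = new HashMap<>();

    customers.put("c42", Map.of("name", "Alice", "city", "Evry"));
    orders.put("o1", Map.of("customerId", "c42", "amount", "99.90"));

    // "Join" implemented in the client: fetch the order row, then look up the
    // referenced customer row by its key.
    Map<String, String> order = orders.get("o1");
    Map<String, String> customer = customers.get(order.get("customerId"));

    System.out.println(order.get("amount") + " spent by " + customer.get("name"));
  }
}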
In a graph database, each relationship connects a starting and/or an ending node and
carries a set of properties. Graph databases apply graph theory
[Lai (2012)] to the storage of information about the relationships between entries. The
relationships between people in social networks are the most obvious example; the
relationships between items and attributes in recommendation engines are another.
Relational databases are poorly suited to storing such relationship data: relationship
queries can be complex, slow and unpredictable. Since graph databases are designed for
this sort of workload, the queries are more reliable. Graph data models are, however,
limited in performance in some situations:
(1) When a query must traverse all the nodes, response times are very slow.
Search queries should therefore be anchored on at least one identified entity.
(2) Schema-less graph databases avoid database schema upgrades when the model
changes, but at the price of a manual update of all existing database objects.
Graph databases are well-suited for analyzing interconnections, which is why there
has been a lot of interest in using graph databases to mine data from social media. Graph
databases are also related to document databases because many implementations allow
one to model a value as a map or document. The concept behind graph databases is
often credited to the 18th-century mathematician Leonhard Euler.
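The hypothetical classes below sketch this property-graph model in plain Java (they do not correspond to any particular product API): nodes and relationships both carry properties, and the query is anchored on an identified entity, as recommended above.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PropertyGraph {
  static class Node {
    final String id;
    final Map<String, String> properties = new HashMap<>();
    final List<Edge> outgoing = new ArrayList<>();
    Node(String id) { this.id = id; }
  }

  static class Edge {
    final Node start, end;
    final Map<String, String> properties = new HashMap<>();
    Edge(Node start, Node end) { this.start = start; this.end = end; }
  }

  public static void main(String[] args) {
    Node alice = new Node("alice");
    Node bob = new Node("bob");
    Edge follows = new Edge(alice, bob);
    follows.properties.put("since", "2012");
    alice.outgoing.add(follows);

    // Traversal starts from an identified entity instead of scanning all nodes.
    for (Edge e : alice.outgoing) {
      System.out.println(alice.id + " -> " + e.end.id + " (since " + e.properties.get("since") + ")");
    }
  }
}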
(4) Since sorting makes things more complex, an unordered key-value data model can
simply be partitioned across multiple servers by hashing the key (the enumerable
keys technique), as sketched below.
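A minimal sketch of such hash-based partitioning, assuming a fixed number of servers and illustrative key names:

public class KeyPartitioner {
  private final int numServers;
  KeyPartitioner(int numServers) { this.numServers = numServers; }

  int serverFor(String key) {
    // floorMod guards against negative hash codes; the modulo picks the server.
    return Math.floorMod(key.hashCode(), numServers);
  }

  public static void main(String[] args) {
    KeyPartitioner p = new KeyPartitioner(4);
    System.out.println("user:1001 -> server " + p.serverFor("user:1001"));
    System.out.println("user:1002 -> server " + p.serverFor("user:1002"));
  }
}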
Last but not least, inverted search consists of using an index to find data that meets
given criteria, then aggregating the data using its original representation. Inverted search is
more a data processing pattern than a data modeling technique.
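A minimal sketch of the inverted search pattern, with illustrative documents: the index maps each term to the identifiers of the matching documents, and the query result is then aggregated from the original representation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedSearch {
  public static void main(String[] args) {
    Map<String, String> documents = new HashMap<>();
    documents.put("d1", "big data modeling");
    documents.put("d2", "graph data processing");

    // Build the inverted index: term -> list of document ids.
    Map<String, List<String>> index = new HashMap<>();
    documents.forEach((id, text) -> {
      for (String term : text.split("\\s+")) {
        index.computeIfAbsent(term, t -> new ArrayList<>()).add(id);
      }
    });

    // Query the index, then aggregate using the original documents.
    for (String id : index.getOrDefault("data", List.of())) {
      System.out.println(id + ": " + documents.get(id));
    }
  }
}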
The nested sets model also belongs to hierarchical data modeling techniques. It is about
modeling tree-like structures and is used in RDBMS. This model is perfectly applicable
to key-value stores and document databases. It consists of storing the leaves of the tree in an
array and mapping each non-leaf node to a range of leaves using start and end indexes
(a minimal sketch follows the list below):
(1) Documents processed by search engines can be modeled hierarchically. This
approach consists of flattening nested documents by using numbered field names. It
causes, however, scalability issues related to query complexity.
(2) Nested documents can also be flattened by proximity queries that limit the
acceptable distance between words in the document, which solves the
scalability issues.
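A minimal sketch of the nested sets model, using an illustrative catalog tree: the leaves are kept in an array and each non-leaf node is stored as a start/end range of leaf indexes.

import java.util.Arrays;
import java.util.Map;

public class NestedSets {
  public static void main(String[] args) {
    // Leaves of the tree, stored in a flat array.
    String[] leaves = {"shirt", "jeans", "novel", "biography"};

    // Non-leaf nodes mapped to [start, end] index ranges over the leaf array.
    Map<String, int[]> ranges = Map.of(
        "clothing", new int[]{0, 1},
        "books",    new int[]{2, 3},
        "catalog",  new int[]{0, 3});

    // Fetching a subtree is a single range read, which suits key-value stores.
    int[] r = ranges.get("books");
    System.out.println(Arrays.asList(leaves).subList(r[0], r[1] + 1));
  }
}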
4. Graph Processing
Batch graph processing related to graph databases can be done using
MapReduce routines [Lin (2010)], in order to explore the neighborhood of a given node
or the relationships between two or a few nodes. This approach makes key-value stores,
document databases and BigTable-style databases suitable for processing large graphs.
Adjacency list representation can be used in graph processing. Graphs are serialized into
key-value pairs using the identifier of the vertex as the key and the record comprising the
vertex’s structure as the value. In the MapReduce process, the shuffle and sort phase can be
exploited to propagate information between vertices using a form of distributed message
passing. In the reduce phase, all messages that have the same key arrive together and
another computation is performed. Combiners in MapReduce are responsible for
performing local aggregation which reduces the amount of data to be shuffled across the
cluster. They are only effective if there are multiple key-value pairs with the same key
computed on the same machine that can be aggregated.
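As an illustration of this message-passing pattern, the sketch below implements one superstep of a breadth-first traversal over an adjacency-list graph in the Hadoop Java API. The value format ("distance|neighbour,neighbour"), the use of the vertex identifier as key, and the surrounding driver (e.g. a KeyValueTextInputFormat-style input) are assumptions for this sketch, not the authors' implementation.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BfsStep {

  public static class BfsMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text vertexId, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\|", 2);
      int distance = Integer.parseInt(parts[0]);
      String neighbours = parts.length > 1 ? parts[1] : "";

      // Re-emit the vertex structure so the graph survives the iteration.
      context.write(vertexId, new Text("S|" + value.toString()));

      // Send a distance message to every neighbour (distributed message passing
      // realized through the shuffle and sort phase).
      if (distance < Integer.MAX_VALUE) {
        for (String n : neighbours.split(",")) {
          if (!n.isEmpty()) {
            context.write(new Text(n), new Text("D|" + (distance + 1)));
          }
        }
      }
    }
  }

  public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text vertexId, Iterable<Text> messages, Context context)
        throws IOException, InterruptedException {
      int best = Integer.MAX_VALUE;
      String structure = "";
      // All messages for the same vertex arrive together after shuffle and sort.
      for (Text m : messages) {
        String s = m.toString();
        if (s.startsWith("S|")) {
          String[] parts = s.substring(2).split("\\|", 2);
          structure = parts.length > 1 ? parts[1] : "";
          best = Math.min(best, Integer.parseInt(parts[0]));
        } else {
          best = Math.min(best, Integer.parseInt(s.substring(2)));
        }
      }
      context.write(vertexId, new Text(best + "|" + structure));
    }
  }
}

Each MapReduce job corresponds to one superstep; the driver re-runs the job on its own output until the distances stop changing.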
6. Current study
The biggest challenge nowadays is to get high quality processing results with a
reduced computing time and costs. To do so, the processing sequence must be reviewed
on the top, so that we could add one or more modeling tools. Unfortunately, the existing
processing models do not take in consideration this requirement and focus on getting high
calculation performances which will increase the computing time and costs. The needed
modeling tools and operators will help the user/developer to identify the processing field
on the top of the sequence and to send into the computing module only the data related to
the requested result. The remaining data is not relevant and it will slow down the
processing. The second improvement would be to override the cloud providers' pricing
policy by being able to decentralize the processing over one or more cloud engines, in
parallel or consecutively, based on the best available computing costs.
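A minimal sketch of the first idea in plain Java, with an assumed record type and selection predicate: a modeling operator placed at the top of the processing sequence forwards only the relevant records to the computing engine.

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PreFilterOperator {
  // Hypothetical record representing one consumer interaction.
  record Interaction(String channel, String country, double amount) {}

  // The "processing field" is expressed as a predicate applied before computation.
  static List<Interaction> selectProcessingField(List<Interaction> input,
                                                 Predicate<Interaction> field) {
    return input.stream().filter(field).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Interaction> raw = List.of(
        new Interaction("mobile", "FR", 12.0),
        new Interaction("email",  "US", 30.0));

    // Only French mobile interactions are forwarded to the computing module.
    List<Interaction> relevant =
        selectProcessingField(raw, i -> "FR".equals(i.country) && "mobile".equals(i.channel));
    System.out.println(relevant.size() + " record(s) sent to the computing module");
  }
}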
8. BigData workbench
The previous experiment shows the impact of the modeling tools on non-relational
data processing. In order to implement a new abstraction based on model-driven
architecture, we thought about creating new automatic-programming software that allows
users/developers, through drag & drop features, to do the following:
(1) Add one or more components from available data sources (data files, social
networks, web services…)
(2) Apply predefined analysis on sample data in order to dynamically define the
structure of the files/messages.
(3) Apply one or more non-relational data modeling tools by connecting the
components.
(4) Select a Hadoop processing engine available on a local or distant network.
We believe that such a software solution could help users reduce data processing
costs by:
(1) Designing their own processing chain.
(2) Decentralizing the processing on different computing engines.
(3) Reducing the volume of data to compute.
9. Conclusion
The data model provides a visual way to manage data resources and creates a
fundamental data architecture, so that more applications can optimize data
reuse and reduce computing costs. Each technique has strengths and weaknesses in the way
it addresses each audience. Most are oriented more toward designers than toward
the user community. These techniques produce models that are very intricate and focus
on making sure that all possible constraints are described, often at the
expense of readability. The evaluation must therefore be based on the technical completeness of
each technique and on its readability at the same time. Technical completeness covers
the representation of:
(1) Entities and attributes.
(2) Relationships.
(3) Unique identifiers.
(4) Sub-types and super-types.
(5) Constraints between relationships.
A technique’s readability is characterized by its graphic treatment of relationship
lines and entity boxes, as well as its adherence to the general principles of good graphic
design. The complexity of a relational database limits the scalability of data storage, but
makes it very easy to query data through a traditional RDBMS. Non-relational database
systems have the opposite characteristics: unlimited scalability with more limited query
capabilities. The challenge of BigData is querying data easily. Creating data models over
the physical data and the computing path helps manage raw data. The future will bring more
hybrid systems combining the attributes of both approaches [Zhu (2012)]. Meanwhile,
the dynamic model discussed in this paper offers help in the challenge of managing
BigData.
References
Agarwal, S. (2010). Volley: Automated Data Placement for Geo-Distributed Cloud
Services. In NSDI '10: Proceedings of the 7th USENIX Conference on Networked
Systems Design and Implementation, pages 2-2.
Anderson, E. (2009). Efficient tracing and performance analysis for large distributed
systems. In MASCOTS '09: IEEE International Symposium on Modeling, Analysis &
Simulation of Computer and Telecommunication Systems. Print ISBN 978-1-4244-4927-9.
Chang, F. (2006). Bigtable: A Distributed Storage System for Structured Data. In
OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and
Implementation, pages 205-218. Print ISBN 1-931971-47-1.
Ghemawat, S., Gobioff, H., Leung, S.K. (2003). The Google File System. In SOSP '03:
Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles,
pages 29-43. Print ISBN 1-58113-757-5.
Kaur, K., Rani, R. (2013). Modeling and querying data in NoSQL databases. In 2013
IEEE International Conference on Big Data. INSPEC Accession Number 13999217.
Lai, S. (2012). Graph-theory model based E-commerce website design and realize. In
2012 8th International Conference on Computing and Networking Technology (ICCNT).
Li, Y., Manoharan, S. (2013). A performance comparison of SQL and NoSQL databases.
In 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal
Processing (PACRIM). ISSN 1555-5798.
Lin, J., Schatz, M. (2010). Design Patterns for Efficient Graph Algorithms in MapReduce.
In MLG '10: Proceedings of the Eighth Workshop on Mining and Learning with Graphs,
pages 78-85. Print ISBN 978-1-4503-0214-2.
Perera, S., Gunarathne, T. (2013). Hadoop MapReduce Cookbook. Packt Publishing.
Print ISBN 978-1-84951-728-7.
Tudorica, B.G., Bucur, C. (2011). A comparison between several NoSQL databases with
comments and notes. In 2011 10th Roedunet International Conference (RoEduNet).
Print ISBN 978-1-4577-1233-3.
Vora, M.N. (2011). Hadoop-HBase for large-scale data. In 2011 International Conference
on Computer Science and Network Technology (ICCSNT). Print ISBN 978-1-4577-1586-0.
Wang, G., Tang, J. (2012). The NoSQL Principles and Basic Application of Cassandra
Model. In 2012 International Conference on Computer Science & Service System (CSSS).
Print ISBN 978-1-4673-0721-5.
Yu, B. (2012). On Managing Very Large Sensor-Network Data Using Bigtable. In 2012
12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
(CCGrid). Print ISBN 978-1-4673-1395-7.
Zhu, J., Wang, A. (2012). Data Modeling for Big Data. CA Technologies, pages 75-80.