Distributed Databases, NoSQL Systems and Big Data
• Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −
• CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT;
Horizontal Fragmentation
• Horizontal fragmentation groups the tuples of a table according to the values of
one or more fields. Horizontal fragmentation should also conform to the rule of
reconstructiveness. Each horizontal fragment must have all columns of the
original base table.
• For example, in the student schema, if the details of all students of the Computer
Science course need to be maintained at the School of Computer Science, then
the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT WHERE COURSE = 'Computer Science';
Hybrid Fragmentation
• In hybrid fragmentation, a combination of horizontal and vertical fragmentation
techniques is used. This is the most flexible fragmentation technique since it
generates fragments with minimal extraneous information. However,
reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical fragments
from one or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments
from one or more of the vertical fragments.
CREATE TABLE Hybrid AS
SELECT Stu_id, Stu_name FROM Student WHERE Stu_id = 12;

Stu_id   Stu_name
12       Arav
Data Replication
Data replication is the process of storing data at more than one site or node. It is
useful in improving the availability of data. It is simply copying data from one
server to another so that all users can share the same data without any
inconsistency. The result is a distributed database in which users can access data
relevant to their tasks without interfering with the work of others.
• Data replication is the process of storing separate copies of the database at two
or more sites. It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
• Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime
hours. Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become
simpler in nature.
Disadvantages of Data Replication
• Increased Storage Requirements − Maintaining multiple copies of data is
associated with increased storage costs. The storage space required is in
multiples of the storage required for a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the
different sites. This requires complex synchronization techniques and protocols.
• Undesirable Application–Database Coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex coordination at the
application level. This results in undesirable coupling between the application and
the database.
What are the different strategies for placing data in a distributed database?
• A distributed database implementation follows one of the following data
placement strategies for ease of access. The selection is influenced by many
factors such as locality, reliability, performance, and storage and communication
costs.
• Centralized − Database tables are stored at a single location/site (server). All the
other sites have to forward their requests to the central site to access data.
• Fragmented (Partitioned) − Database tables are fragmented (vertically,
horizontally, or both) and the different fragments are stored at different sites.
• Complete replication − Database tables are fully replicated (duplicated) into two
or more copies and each copy is stored at a different site.
• Selective replication − Among the set of tables, certain tables are made into
multiple copies and stored at different sites. The tables that are accessed most
frequently are replicated.
• Hybrid − A mix of all of the above. Most distributed databases follow this
strategy: some tables are replicated, some are fragmented, and so on.
Types of Distributed Databases
• Distributed databases can be broadly classified into homogeneous and
heterogeneous distributed database environments, each with further sub-
divisions.
Experience
User Id   Role                       Company
1         Full Time Faculty          CAB College
2         Principal                  New Summit College
2         Visiting Faculty Member    KMC
RDBMS vs NoSQL
• RDBMS − Users know RDBMS well, as it is old and many organizations use this
database for properly formatted data. NoSQL − Relatively new; experts in NoSQL
are fewer, as these databases are evolving day by day.
• RDBMS − User interface tools to access data are available in the market, so users
can work with all the schemas in the RDBMS infrastructure; this helps them
interact with the data well and understand it better. NoSQL − User interface tools
to access and manipulate data in NoSQL are very few, so users do not have many
options to interact with the data.
• RDBMS − Scalability and performance face some issues if the data is huge;
servers may not run properly with the available load, and this leads to
performance issues. NoSQL − Works well with high loads; scalability is very good,
which makes performance better compared with RDBMS, and huge amounts of
data can be handled easily.
• RDBMS − Multiple tables can be joined easily, and this does not cause any latency
in the working of the database; the primary key helps in this case. NoSQL −
Multiple tables cannot be joined, as joins are not an easy task for the database
and do not work well for performance.
• RDBMS − The availability of the database depends on server performance, and it
is mostly available whenever the database is opened; the data provided is
consistent and does not confuse users. NoSQL − Though the databases are readily
available, the consistency provided by some databases is weaker; this affects the
performance of the database, and users should check availability often.
• RDBMS − Data analysis and querying can be done easily, even when the queries
are complex; slicing and dicing can be done on the available data to make a
proper analysis. NoSQL − Data analysis can also be done in NoSQL, and it works
well for real-time data analytics; reports are not done in the database itself, but if
an application has to be built, NoSQL is a solution for the same.
Advantages of NoSQL:
The main advantages are high scalability and high availability.
• High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of
data and placing it on multiple machines in such a way that the order of the data is
preserved. Vertical scaling means adding more resources to the existing machine,
whereas horizontal scaling means adding more machines to handle the data.
Vertical scaling is not that easy to implement, but horizontal scaling is easy to
implement. Examples of horizontally scaling databases are MongoDB, Cassandra,
etc. NoSQL can handle huge amounts of data because of this scalability: as the data
grows, NoSQL scales itself to handle that data efficiently (a small sketch of the
sharding idea follows this list).
• High availability –
The auto-replication feature in NoSQL databases makes them highly available,
because in case of any failure the data can be restored from another replica to a
consistent state.
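The sharding idea mentioned under high scalability can be sketched in a few lines of
Python. This is only an illustrative hash-based routing scheme; the shard names and
the modulo placement rule are assumptions, not how any particular NoSQL system
does it.

import hashlib

# Illustrative hash-based sharding: route each key to one of several shards.
SHARDS = ["shard-0", "shard-1", "shard-2"]   # hypothetical machine names

def shard_for(key: str) -> str:
    # Hash the key so records spread evenly across the available machines.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Each record lands on a deterministic shard; adding machines rebalances keys.
for user_id in ["u1", "u2", "u3", "u4"]:
    print(user_id, "->", shard_for(user_id))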
The CAP Theorem
The three letters in CAP refer to three desirable properties of distributed systems
with replicated data: consistency (among replicated copies), availability (of the
system for read and write operations) and partition tolerance (in the face of the
nodes in the system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support
two of the following three properties:
• Consistency –
Consistency means that the nodes will have the same copies of a replicated data
item visible to various transactions: a guarantee that every node in a distributed
cluster returns the same, most recent, successful write. Consistency refers to every
client having the same view of the data. There are various types of consistency
models; consistency in CAP refers to sequential consistency, a very strong form of
consistency.
• Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In simple
terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.
• Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
Disadvantages of NoSQL
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• It does not offer any traditional database capabilities, like consistency when
multiple transactions are performed simultaneously.
• When the volume of data increases, it becomes difficult to maintain unique keys
• Doesn't work as well with relational data
• The learning curve is steep for new developers
• Mostly open-source options, so not yet as popular with enterprises
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
• Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
• For example, a key-value pair may contain a key like “name” associated with a
value like “shikha”.
• It is one of the most basic NoSQL database examples. This kind of NoSQL database
is used as a collection, dictionary, associative array, etc. Key-value stores help the
developer to store schema-less data. They work best for shopping cart contents.
• Redis, Dynamo and Riak are some NoSQL examples of key-value store databases;
Dynamo and Riak are based on Amazon's Dynamo paper. A short usage sketch
follows this list.
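As a concrete illustration of the key/value access pattern above, here is a minimal
sketch using the redis-py client. The connection parameters, key names and cart
fields are assumptions made only for this example.

import redis

# Assumed local Redis instance; adjust host/port for a real deployment.
r = redis.Redis(host="localhost", port=6379, db=0)

# Store and fetch a simple string value under a unique key.
r.set("name", "shikha")
print(r.get("name"))          # b'shikha'

# A shopping-cart-style value: one hash per cart, item ids as fields.
r.hset("cart:1001", mapping={"sku-42": 2, "sku-7": 1})
print(r.hgetall("cart:1001"))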
Column-based
• Column-oriented databases work on columns and are based on Google's BigTable
paper. Every column is treated separately. Values of a single column are stored
contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG,
MIN etc. as the data is readily available in a column.
• Column-based NoSQL databases are widely used to manage data warehouses,
business intelligence, CRM, library card catalogs, etc.
• HBase, Cassandra and Hypertable are NoSQL examples of column-based
databases (a small access sketch follows).
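The aggregation-friendly layout of column stores can be illustrated with the DataStax
cassandra-driver for Python. The contact point, keyspace, table and column names
below are assumptions for this sketch, not a prescribed schema.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # assumed local Cassandra node
session = cluster.connect("shop")     # hypothetical keyspace

# Aggregation query: a column store serves SUM efficiently because the
# 'amount' values of the column are stored contiguously.
rows = session.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'APAC' ALLOW FILTERING"
)
for row in rows:
    print(row)

cluster.shutdown()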
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key value pair but
the value part is stored as a document. The document is stored in JSON or XML
formats. The value is understood by the DB and can be queried.
• In a relational database you have rows and columns and need to know in advance
which columns you have. A document database, in contrast, stores data as
JSON-like objects; you do not need to define the structure up front, which makes
it flexible.
• The document type is mostly used for CMS systems, blogging platforms, real-
time analytics and e-commerce applications. It should not be used for complex
transactions which require multiple operations or queries against varying
aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMS systems. A small storage and query sketch follows.
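A minimal sketch of the document model using the pymongo client; the database,
collection and field names are assumptions chosen only to illustrate schema-less
storage and querying.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB
posts = client["blog"]["posts"]                     # hypothetical database/collection

# Documents need no predefined schema; fields can differ between documents.
posts.insert_one({"title": "Hello", "tags": ["nosql", "cms"], "views": 10})
posts.insert_one({"title": "CAP explained", "author": {"name": "Asha"}})

# Query on a field inside the value (document) part of the key/value pair.
for doc in posts.find({"views": {"$gt": 5}}):
    print(doc["title"])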
Graph-Based
• A graph-type database stores entities as well as the relations among those
entities. An entity is stored as a node and a relationship as an edge. An edge
gives the relationship between nodes. Every node and edge has a unique
identifier.
• Compared to a relational database, where tables are loosely connected, a graph
database is multi-relational in nature. Traversing relationships is fast, as they are
already captured in the DB and there is no need to calculate them.
• Graph databases are mostly used for social networks, logistics and spatial data.
• Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based
databases (a small sketch follows).
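A small sketch of nodes and an edge using the official Neo4j Python driver; the URI,
credentials, labels and property names are assumptions for illustration only.

from neo4j import GraphDatabase

# Assumed local Neo4j instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Two Person nodes connected by a FRIENDS_WITH edge; traversals follow
    # stored edges directly instead of computing joins.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Arav", b="Seema",
    )
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        a="Arav",
    )
    print([record["friend"] for record in result])

driver.close()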
Big Data
Big Data is a collection of data that is huge in volume, yet growing exponentially
with time. It is data of such large size and complexity that none of the traditional
data management tools can store or process it efficiently. In short, Big Data is data,
but of huge size.
What is an Example of Big Data?
Following are some of the Big Data examples-
• The New York Stock Exchange is an example of Big Data that generates
about one terabyte of new trade data per day.
• Social Media: Statistics show that 500+ terabytes of new data get ingested into
the databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, posting
comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, the data generated reaches many
petabytes.
Types Of Big Data
Following are the types of Big Data:
• Structured
• Unstructured
• Semi-structured
Structured
• Any data that can be stored, accessed and processed in the form of a fixed
format is termed 'structured' data. Over the period of time, talent in computer
science has achieved greater success in developing techniques for working with
such data (where the format is well known in advance) and also deriving value
out of it. However, nowadays we foresee issues when the size of such data grows
to a huge extent; typical sizes are in the range of multiple zettabytes.
An ‘Employee’ table in a database is an example of Structured Data
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000
Unstructured
Any data with unknown form or structure is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges in terms of
processing it to derive value. A typical example of unstructured data is a
heterogeneous data source containing a combination of simple text files, images,
videos, etc. Nowadays organizations have a wealth of data available with them but,
unfortunately, they do not know how to derive value out of it, since this data is in
its raw, unstructured form.
Examples Of Un-structured Data
• The output returned by ‘Google Search’
Semi-structured
• Semi-structured data can contain both forms of data. We can see semi-structured
data as structured in form, but it is actually not defined with, for example, a table
definition in a relational DBMS. An example of semi-structured data is data
represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
• <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
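To show how such semi-structured records can still be processed programmatically,
here is a small parsing sketch using Python's standard xml.etree.ElementTree. The
wrapping <people> root element is an assumption added so the snippet is
well-formed XML.

import xml.etree.ElementTree as ET

# The <rec> records from above, wrapped in an assumed root element.
xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Tags act like loose column names, but no relational schema is enforced.
    print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))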
Characteristics Of Big Data
Big data can be described by the following characteristics:
• Volume
• Variety
• Velocity
• Variability
• (i) Volume – The name Big Data itself is related to an enormous size. The size of
data plays a very crucial role in determining its value. Whether particular data can
actually be considered Big Data or not also depends on its volume. Hence,
'Volume' is one characteristic which needs to be considered while dealing with
Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. During earlier days, spreadsheets and databases were the
only sources of data considered by most of the applications. Nowadays, data in
the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also
being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet demands determines the real
potential of the data.
• Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.
Big data solutions typically involve one or more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
• Data sources: All big data solutions start with one or more data sources.
Examples include:
– Application data stores, such as relational databases.
– Static files produced by applications, such as web server log files.
– Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various formats.
This kind of store is often called a data lake. Options for implementing this
storage include Azure Data Lake Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading
source files, processing them, and writing the output to new files. Options
include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or
custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or
Python programs in an HDInsight Spark cluster (a minimal batch-job sketch
appears after this component list).
• Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-
out processing, reliable delivery, and other message queuing semantics.
Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.
• Stream processing: After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise preparing the data for
analysis. The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams. You can
also use open source Apache streaming technologies like Storm and Spark
Streaming in an HDInsight cluster.
• Analytical data store: Many big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the distributed data
store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and
Spark SQL, which can also be used to serve data for analysis.
• Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modeling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis Services. It
might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting
can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to leverage their existing skills with Python
or R. For large-scale data exploration, you can use Microsoft R Server, either
standalone or with Spark.
• Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report or dashboard. To automate
these workflows, you can use an orchestration technology such as Azure Data
Factory or Apache Oozie and Sqoop.
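As referenced under the batch-processing component, the following is a minimal
PySpark batch-job sketch: read raw files, filter and aggregate them, and write
prepared output. The file paths, column names and local master setting are
assumptions; a real HDInsight job would point at cluster storage instead.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("batch-demo").getOrCreate()

# Read semi-structured CSV logs from a (hypothetical) data-lake folder.
logs = spark.read.option("header", True).csv("/data/lake/weblogs/*.csv")

# Filter and aggregate: count requests per HTTP status code.
summary = (logs.filter(F.col("status").isNotNull())
               .groupBy("status")
               .count())

# Write the prepared result to new files for the analytical store to pick up.
summary.write.mode("overwrite").parquet("/data/lake/curated/status_counts")
spark.stop()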
Advantages Of Big Data Processing
The ability to process Big Data in a DBMS brings in multiple benefits, such as −
• Businesses can utilize outside intelligence while taking decisions
Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and natural
language processing technologies are being used to read and evaluate consumer
responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
What are the Disadvantages of Big Data?
• Since all the information collected requires a lot of effort and resources to gather,
storing it before it can be examined needs vast space. Although the analysis of
enormous amounts of information seems possible, some significant disadvantages
of Big Data come to light in terms of space, cost, and user security.
1. Unstructured Data
• The data collected may not be arranged and may be present as random
information. More variation in the data can make it difficult to process results
and generate solutions. If the information is broken or unstructured, many
users can get neglected while deriving future outcomes or analyzing present
scenarios.
2. Security concerns are among the most dreaded disadvantages of Big Data
• For highly secured data or confidential information, highly secured networks are
needed for its transfer and storage. Furthermore, with the increased global
politics and complex situations between nations, leaked data can be used as an
advantage by enemies, so keeping it secure is essential and requires building
such a network.
3. Expensive
• The process of data generation and its analysis is costly, without any surety of
favorable results. Mainly the top businesses can afford to research this field, much
like the space sector, where only the wealthiest companies and individuals carry
out research. The cost of setting up supercomputers is one of the leading
disadvantages of Big Data analytics. Even if this cost is covered somehow, the
information usually resides on the cloud, which has to be arranged for and
requires maintenance.
4. Skilled Analysts
• The professionals needed to carry out research and run complicated software
are highly paid and hard to find. There is a scarcity of individuals skilled for the
data analyst job despite the increasing scope in this area of knowledge. Data is
the resource of the new generation; to remain in the market, it is necessary to
keep yourself updated with further information.
5. Hardware and Storage
• The servers and hardware needed to store and run high-quality software are
very costly and hard to build. Also, the information is available in bulk with
continuous changes, and processing requires faster software and applications.
And we cannot forget the uncertainty involved with getting accurate results.
Map Reduce
MapReduce is a programming model for writing applications that can process Big
Data in parallel on multiple nodes. MapReduce provides analytical capabilities for
analyzing huge volumes of complex data; a toy word-count sketch is given at the
end of this section.
Why MapReduce?
• Traditional enterprise systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. The traditional model is certainly not suitable for processing
huge volumes of scalable data, which cannot be accommodated by standard
database servers. Moreover, the centralized system creates too much of a
bottleneck while processing multiple files simultaneously.
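In contrast to the centralized model described above, MapReduce splits the work
into a map phase and a reduce phase that can run on many nodes in parallel. Below
is the toy, in-process word-count sketch referenced earlier; the documents and
function names are assumptions used only to illustrate the map, shuffle and reduce
steps.

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for the same key (word).
    return word, sum(counts)

documents = ["big data needs parallel processing",
             "map reduce processes big data"]

# Shuffle step: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # e.g. {'big': 2, 'data': 2, 'map': 1, ...}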