
Distributed Databases, NoSQL Systems and Big Data


Distributed databases
A distributed database is defined as a logically related collection of shared data that is
physically distributed over a computer network across different sites.
A distributed DBMS (DDBMS) is defined as the software that manages the distributed
database and makes the distributed data available to users.
A distributed DBMS manages a single logical database that is divided into a number
of pieces called fragments. In a DDBMS, each site is capable of independently
processing users' requests.
Some general features of distributed databases are:
• Location independency - Data is physically stored at multiple sites and managed
by an independent DDBMS.
• Distributed query processing - Distributed databases answer queries in a
distributed environment that manages data at multiple sites. High-level queries
are transformed into a query execution plan for simpler management.
• Distributed transaction management - Provides a consistent distributed
database through commit protocols, distributed concurrency control
techniques, and distributed recovery methods in the presence of concurrent
transactions and site failures.
• Seamless integration - Databases in a collection usually represent a single
logical database, and they are interconnected.
• Network linking - All databases in a collection are linked by a network and
communicate with each other.
• Transaction processing - Distributed databases incorporate transaction
processing, which is a program including a collection of one or more database
operations. Transaction processing is an atomic process that is either entirely
executed or not at all.
The contents of a distributed database are spread across multiple locations. That
means the contents may be stored in different systems that are located in the same
place or geographically far away. However, the database still appears uniform to the
users, i.e., the fact that the database is stored at multiple locations is transparent to
them.
• The different components of a distributed database are −
Users
• There are many users who use the distributed database. For them, the fact that
the database is spread across multiple locations is transparent and they perceive
the database to be one whole construct.
Global schema
• The global schema shows the overall design of the database. It helps to logically
understand the design of the whole database, since in reality the database is not
stored in any one place but is spread over various systems.
Database modules
• The database modules are the parts of the database stored in multiple locations.
In a homogeneous distributed database, all these parts use the same data
model, while in a heterogeneous distributed database the parts may use
different data models.
Advantages of DDBMS
• The database is easier to expand as it is already spread across multiple systems
and it is not too complicated to add a system.
• The distributed database can have the data arranged according to different
levels of transparency, i.e., data with different transparency levels can be stored at
different locations.
• The database can be stored according to departmental information in an
organisation. In that case, hierarchical organisational access is easier.
• If there were a natural catastrophe such as a fire or an earthquake, all the data would
not be destroyed, because it is stored at different locations.
• It is cheaper to create a network of systems each containing a part of the database.
Such a database can also easily be expanded or shrunk.
• Even if some of the data nodes go offline, the rest of the database can continue its
normal functions.
Disadvantages of DDBMS
• The distributed database is quite complex and it is difficult to make sure that a user
gets a uniform view of the database because it is spread across multiple locations.
• This database is more expensive as it is complex and hence, difficult to maintain.
• It is difficult to provide security in a distributed database as the database needs to
be secured at all the locations it is stored. Moreover, the infrastructure connecting
all the nodes in a distributed database also needs to be secured.
• It is difficult to maintain data integrity in the distributed database because of its
nature. There can also be data redundancy in the database as it is stored at
multiple locations.
• The distributed database is complicated and it is difficult to find people with the
necessary experience who can manage and maintain it.
Distributed DBMS - Design Strategies
We will study the strategies that aid in adopting the designs. The strategies can be
broadly divided into replication and fragmentation. However, in most cases, a
combination of the two is used.
Fragmentation
• Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical).
Horizontal fragmentation can further be classified into two techniques: primary
horizontal fragmentation and derived horizontal fragmentation.
• Fragmentation should be done in such a way that the original table can be
reconstructed from the fragments whenever required. This requirement is
called "reconstructiveness."
Advantages of Fragmentation
• Since data is stored close to the site of usage, efficiency of the database system
is increased.
• Local query optimization techniques are sufficient for most queries since data is
locally available.
• Since irrelevant data is not available at the sites, security and privacy of the
database system can be maintained.
Disadvantages of Fragmentation
• When data from different fragments are required, the access speeds may be
very low.
• In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
• Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.
Vertical Fragmentation
• In vertical fragmentation, the fields or columns of a table are grouped into
fragments. In order to maintain reconstructiveness, each fragment should
contain the primary key field(s) of the table. Vertical fragmentation can be used
to enforce privacy of data.
• For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.
STUDENT (Regd_No, Name, Course, Address, Semester, Fees, Marks)

• Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −
• CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT;
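As an illustration (not from the original slides), the sketch below uses Python with SQLite as a stand-in for the distributed sites; the STD_INFO fragment name and the sample row are assumptions made up for the example. It shows that joining the vertical fragments on the primary key Regd_No reconstructs the original STUDENT rows.

import sqlite3

# Vertical fragmentation sketch: SQLite stands in for the distributed sites.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT, Fees INTEGER)")
con.execute("INSERT INTO STUDENT VALUES (1, 'Indra', 'Computer Science', 650000)")

# Each vertical fragment keeps the primary key so the table stays reconstructible.
con.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
con.execute("CREATE TABLE STD_INFO AS SELECT Regd_No, Name, Course FROM STUDENT")  # hypothetical second fragment

# Reconstructiveness: joining the fragments on Regd_No yields the original rows.
rows = con.execute(
    "SELECT i.Regd_No, i.Name, i.Course, f.Fees "
    "FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No").fetchall()
print(rows)  # [(1, 'Indra', 'Computer Science', 650000)]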
Horizontal Fragmentation
• Horizontal fragmentation groups the tuples of a table in accordance with the values of
one or more fields. Horizontal fragmentation should also conform to the rule of
reconstructiveness. Each horizontal fragment must have all columns of the
original base table.
• For example, in the student schema, if the details of all students of Computer
Science Course needs to be maintained at the School of Computer Science, then
the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT WHERE Course = 'Computer Science';
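Similarly, a minimal illustrative sketch (again Python with SQLite as a stand-in; the OTHER_STD fragment and sample rows are invented for the example) shows that horizontal fragments keep all columns and that the original table is rebuilt with a UNION of the fragments.

import sqlite3

# Horizontal fragmentation sketch: rows are split by Course across fragments.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STUDENT (Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT)")
con.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)",
                [(1, "Indra", "Computer Science"), (2, "Chokraj", "Mathematics")])

con.execute("CREATE TABLE COMP_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'")
con.execute("CREATE TABLE OTHER_STD AS SELECT * FROM STUDENT WHERE Course <> 'Computer Science'")  # hypothetical fragment

# Reconstructiveness: the UNION of the horizontal fragments gives back the original rows.
rows = con.execute("SELECT * FROM COMP_STD UNION SELECT * FROM OTHER_STD ORDER BY Regd_No").fetchall()
print(rows)  # [(1, 'Indra', 'Computer Science'), (2, 'Chokraj', 'Mathematics')]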
Hybrid Fragmentation
• In hybrid fragmentation, a combination of horizontal and vertical fragmentation
techniques is used. This is the most flexible fragmentation technique since it
generates fragments with minimal extraneous information. However,
reconstruction of the original table is often an expensive task.
Hybrid fragmentation can be done in two alternative ways −
• At first, generate a set of horizontal fragments; then generate vertical fragments
from one or more of the horizontal fragments.
• At first, generate a set of vertical fragments; then generate horizontal fragments
from one or more of the vertical fragments.
CREATE TABLE Hybrid AS
SELECT Stu_id, Stu_name FROM Student
WHERE Stu_id = 12;

Stu_id Stu_name
12 Arav
Data Replication
Data Replication is the process of storing data in more than one site or node. It is
useful in improving the availability of data. It simply means copying data from a
database on one server to another server so that all users can share the same
data without any inconsistency. The result is a distributed database in which users
can access data relevant to their tasks without interfering with the work of others.
Data Replication
• Data replication is the process of storing separate copies of the database at two
or more sites. It is a popular fault tolerance technique of distributed databases.
• Advantages of Data Replication
• Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime
hours. Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables
located at different sites and minimal coordination across the network. Thus,
they become simpler in nature.
Disadvantages of Data Replication
• Increased Storage Requirements − Maintaining multiple copies of data is
associated with increased storage costs. The storage space required is in
multiples of the storage required for a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the
different sites. This requires complex synchronization techniques and protocols.
• Undesirable Application–Database Coupling − If complex update mechanisms
are not used, removing data inconsistency requires complex coordination at the
application level. This results in undesirable application–database coupling.
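To make the update-propagation cost described above concrete, here is a toy Python sketch (the site names and data layout are purely illustrative, not part of the original material): a write is acknowledged only after it has been applied at every site holding a copy.

# Toy synchronous-replication sketch: every update must reach all copies.
replicas = [{"site": "Site_A", "data": {}},
            {"site": "Site_B", "data": {}},
            {"site": "Site_C", "data": {}}]

def replicated_write(key, value):
    # The coordinator applies the update at every site before acknowledging it.
    for replica in replicas:
        replica["data"][key] = value
    return "acknowledged"

replicated_write("STUDENT:1:Fees", 650000)
print(all(r["data"]["STUDENT:1:Fees"] == 650000 for r in replicas))  # True: all copies agree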

What are the different strategies for placing data in a distributed database?
• A distributed database implementation follows one of the following data
placement strategies for ease of access. The selection is influenced by many
factors such as locality, reliability, performance, and storage and communication
costs.
• Centralized − Database tables are stored at a single location/site (server). All the
other sites have to forward their requests to the central site in order to access data.
• Fragmented (Partitioned) − Database tables are fragmented (vertically,
horizontally, or both) and the different fragments are stored at different sites.
• Complete replication − Database tables are fully replicated (duplicated) into two
or more copies, and each copy is stored at a different site.
• Selective replication − Among the set of tables, certain tables are made into
multiple copies and stored at different sites. The tables that are most frequently
accessed are the ones replicated.
• Hybrid − A mix of all of the above. Most distributed databases follow this
strategy: some tables are replicated, some are fragmented, and so on.
Types of Distributed Databases
• Distributed databases can be broadly classified into homogeneous and
heterogeneous distributed database environments, each with further sub-
divisions, as shown in the following illustration.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process
user requests.
• The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
• Autonomous − Each database is independent and functions on its own. They
are integrated by a controlling application and use message passing to share
data updates.
• Non-autonomous − Data is distributed across the homogeneous nodes and a
central or master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating
systems, DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in
processing user requests.
Types of Heterogeneous Distributed Databases
• Federated − The heterogeneous database systems are independent in nature
and integrated together so that they function as a single database system.
• Un-federated − The database systems employ a central coordinating module
through which the databases are accessed.
Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −
• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and
the degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models,
system components and databases.
General Architecture of Pure Distributed Databases
• In the figure, which describes the generic schema architecture of a DDB, the
enterprise is presented with a consistent, unified view showing the logical
structure of the underlying data across all nodes. This view is represented by the
global conceptual schema (GCS), which provides network transparency.
• The logical organization of data at each site is specified by the local conceptual
schema (LCS).
• To accommodate potential heterogeneity in the DDB, each node is shown as
having its own local internal schema (LIS) based on physical organization details
at that particular site.
• External View or Schema : Depicts user view of data.
Federated Database Schema Architecture
• A typical five-level schema architecture to support global applications in an FDBS
environment is shown in the figure. In this architecture, the local schema is the
conceptual schema (full database definition) of a component database, and
the component schema is derived by translating the local schema into a
canonical data model or common data model (CDM) for the FDBS.
• Schema translation from the local schema to the component schema is
accompanied by generating mappings to transform commands on a component
schema into commands on the corresponding local schema. The export
schema represents the subset of a component schema that is available to the
FDBS. The federated schema is the global schema or view, which is the result of
integrating all the shareable export schemas. The external schemas define the
schema for a user group or an application, as in the three-level schema
architecture.
An Overview of Three-Tier Client-Server Architecture
Full-scale DDBMSs have not been developed to support all the types of
functionality that we have discussed so far. Instead, distributed database
applications are being developed in the context of the client-server architectures. It
is now more common to use a three-tier architecture, particularly in Web
applications. This architecture is illustrated in Figure.
• Presentation layer (client). This provides the user interface and interacts with
the user. The programs at this layer present Web interfaces or forms to the
client in order to interface with the application. Web browsers are often
utilized, and the languages and specifications used include HTML, XHTML, CSS,
Flash, MathML, Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex,
and others. This layer handles user input, output, and navigation by accepting
user commands and displaying the needed information, usually in the form of
static or dynamic Web pages. The latter are employed when the interaction
involves database access. When a Web interface is used, this layer typically
communicates with the application layer via the HTTP protocol.
• Application layer (business logic). This layer programs the application logic. For
example, queries can be formulated based on user input from the client, or
query results can be formatted and sent to the client for presentation.
Additional application functionality can be handled at this layer, such as
security checks, identity verification, and other functions. The application layer
can interact with one or more databases or data sources as needed by
connecting to the database using ODBC, JDBC, SQL/CLI, or other database
access techniques.
• Database server. This layer handles query and update requests from
the application layer, processes the requests, and sends the results. Usually
SQL is used to access the database if it is relational or object-relational and
stored database procedures may also be invoked. Query results (and
queries) may be formatted into XML when transmitted between the
application server and the database server.
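As a rough sketch of how the application layer mediates between the presentation layer and the database server (this is an illustration under assumed names, using Python's built-in sqlite3 as a stand-in for an ODBC/JDBC connection), a business-logic function can build a query from user input and return the result formatted for the client:

import json
import sqlite3

# Stand-in for the database-server tier; the schema is invented for the example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STUDENT (Regd_No INTEGER, Name TEXT, Course TEXT)")
con.execute("INSERT INTO STUDENT VALUES (1, 'Indra', 'Computer Science')")

def get_students_by_course(course):
    # Application layer: formulate the query from user input, then format the
    # rows (e.g., as JSON) for the presentation layer.
    cur = con.execute("SELECT Regd_No, Name FROM STUDENT WHERE Course = ?", (course,))
    return json.dumps([{"Regd_No": regd_no, "Name": name} for regd_no, name in cur.fetchall()])

print(get_students_by_course("Computer Science"))  # [{"Regd_No": 1, "Name": "Indra"}]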
Introduction to NoSQL
• NoSQL, originally referring to "non-SQL" or "non-relational", is a database that
provides a mechanism for storage and retrieval of data. This data is modeled in
means other than the tabular relations used in relational databases.
• NoSQL databases are used in real-time web applications and big data, and their
use is increasing over time. NoSQL systems are also sometimes called "Not only
SQL" to emphasize the fact that they may support SQL-like query languages.
• Traditional RDBMS uses SQL syntax to store and retrieve data for further
insights. Instead, a NoSQL database system encompasses a wide range of
database technologies that can store structured, semi-structured, unstructured
and polymorphic data.
Why NoSQL?
• The concept of NoSQL databases became popular with Internet giants like
Google, Facebook, Amazon, etc. who deal with huge volumes of data. The
system response time becomes slow when you use RDBMS for massive volumes
of data.
• To resolve this problem, we could “scale up” our systems by upgrading our
existing hardware. This process is expensive. The alternative for this issue is to
distribute database load on multiple hosts whenever the load increases. This
method is known as “scaling out.”
• A NoSQL database is non-relational, so it scales out better than relational
databases, since NoSQL systems are designed with web applications in mind.
Following are key differences between RDBMS vs NoSQL:
• RDBMS systems are relational databases, while NoSQL systems are distributed
databases with no fixed relations between the stored data sets. Where an RDBMS
relies on structured data to identify records by primary key, NoSQL provides a
proper method for working with unstructured data.
• RDBMS is scalable vertically while NoSQL is scalable horizontally. Hence in
RDBMS, more powerful servers have to be added, which makes scaling an RDBMS
expensive. In NoSQL, we just need to add more machines, which does not make
the database expensive.
• Maintenance of RDBMS is expensive, as manpower is needed to manage the
servers added to the database. NoSQL is mostly automatic and does some
repairs on its own, so data distribution and administration effort is lower in NoSQL.
Let's take an example of data stored in an RDBMS:

User
Uid   Fname     Lname
1     Indra     Chaudhary
2     Chokraj   Dawadi

Skill
User Id   Skill Name
1         Big Data
1         Cloud
2         Calculus

Experience
User Id   Role                      Company
1         Full Time Faculty         CAB College
2         Principal                 New Summit College
2         Visiting Faculty Member   KMC

Joining the three tables gives:
UID   Fname     Lname       Skill Name   Role                Company
1     Indra     Chaudhary   Big Data     Full Time Faculty   CAB College
1     Indra     Chaudhary   Cloud        Full Time Faculty   CAB College
2     Chokraj   Dawadi      Calculus     Principal           New Summit College
2     Chokraj   Dawadi      Calculus     Visiting Faculty    KMC
• In NoSQL we can express the above three tables in the JSON document below:
{
  "User": [
    {
      "Uid": 1,
      "Firstname": "Indra",
      "Lastname": "Chaudhary"
    },
    {
      "Uid": 2,
      "Firstname": "Chokraj",
      "Lastname": "Dawadi"
    }
  ],
  "Skill": ["Big Data", "Cloud", "Calculus"],
  "Experience": [
    {
      "Role": "Full Time Faculty",
      "Company": "CAB College"
    },
    {
      "Role": "Principal",
      "Company": "New Summit College"
    },
    {
      "Role": "Visiting Faculty",
      "Company": "KMC"
    }
  ]
}
RDBMS vs NoSQL Comparison Table

RDBMS: Users know RDBMS well as it is old, and many organizations use this
database for properly formatted (structured) data.
NoSQL: This is relatively new, and NoSQL experts are fewer as this database
technology is evolving day by day.

RDBMS: User interface tools to access data are widely available in the market, so
users can work with any schema on the RDBMS infrastructure. This helps users
interact with the data and understand it better.
NoSQL: Very few user interface tools are available to access and manipulate data
in NoSQL, so users do not have many options to interact with the data.

RDBMS: RDBMS scalability and performance face some issues if the data is huge.
Servers may not run properly under the available load, and this leads to
performance issues.
NoSQL: It works well under high loads. Scalability is very good in NoSQL, which
makes its performance better when compared with RDBMS. A huge amount of
data can easily be handled by users.

RDBMS: Multiple tables can be joined easily in RDBMS, and this does not cause any
latency in the working of the database. The primary key helps in this case.
NoSQL: Multiple tables cannot be joined in NoSQL, as it is not an easy task for the
database and does not work well for performance.

RDBMS: The availability of the database depends on the server performance, and it
is mostly available whenever the database is opened. The data provided is
consistent and does not confuse users.
NoSQL: Though the databases are readily available, the consistency provided by
some databases is weaker. This affects the results returned by the database, and
users should check availability often.

RDBMS: Data analysis and querying can be done easily with RDBMS, even when the
queries are complex. Slicing and dicing of the available data can be done to make a
proper analysis of the given data.
NoSQL: Data analysis can also be done in NoSQL, and it works well for real-time
data analytics. Reports are not produced in the database, but if an application has
to be built, then NoSQL is a solution for that.

RDBMS: Documents cannot be stored in RDBMS because data in the database
should be structured and in a proper format to create identifiers.
NoSQL: Documents can be stored in a NoSQL database, as this data is unstructured
and not in a rows-and-columns format.

RDBMS: Partitions cannot be created in the database. Key-value pairs are needed to
identify the data in a particular format specified in the schema of the database.
NoSQL: Partitions can be created in the database easily, and key-value pairs are not
needed to identify the data in the source. Software as a service can be integrated
with NoSQL.
Advantages of NoSQL:
The main advantages are high scalability and high availability.
• High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding is partitioning the
data and placing it on multiple machines in such a way that the order of the data
is preserved. Vertical scaling means adding more resources to the existing
machine, whereas horizontal scaling means adding more machines to handle the
data. Vertical scaling is not easy to implement, but horizontal scaling is. Examples
of horizontally scaling databases are MongoDB, Cassandra, etc. (a minimal
sharding sketch follows this list). NoSQL can handle huge amounts of data
because of this scalability; as the data grows, a NoSQL system scales itself to
handle that data efficiently.
• High availability –
The auto-replication feature in NoSQL databases makes them highly available
because, in case of any failure, the data replicates itself back to the previous
consistent state.
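To make "scaling out" concrete, here is a minimal range-based sharding sketch in Python; the node names and key ranges are assumptions for illustration only. Each machine serves one key range, so key order is preserved across shards, and adding capacity means adding another node with a new range rather than upgrading a single server.

# Range-based sharding sketch: each node serves one contiguous key range.
shards = {
    "node1": (0, 999),
    "node2": (1000, 1999),
    "node3": (2000, 2999),
}

def shard_for(user_id):
    # Route a key to the machine whose range covers it.
    for node, (low, high) in shards.items():
        if low <= user_id <= high:
            return node
    raise KeyError("no shard covers this key")

print(shard_for(42))    # node1
print(shard_for(1500))  # node2
# Scaling out: add "node4": (3000, 3999) instead of buying a bigger machine.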
The CAP Theorem
The three letters in CAP refer to three desirable properties of distributed systems
with replicated data: consistency (among replicated copies), availability (of the
system for read and write operations) and partition tolerance (in the face of the
nodes in the system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support
two of the following three properties:
• Consistency –
Consistency means that the nodes will have the same copies of a replicated data
item visible to various transactions. It is a guarantee that every node in a
distributed cluster returns the same, most recent, successful write.
Consistency refers to every client having the same view of the data. There are
various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.
• Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In simple
terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.

• Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
Disadvantages of NoSQL
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• It does not offer any traditional database capabilities, like consistency when
multiple transactions are performed simultaneously.
• When the volume of data increases, it becomes difficult to maintain unique
values for keys
• Doesn't work as well with relational data
• The learning curve is steep for new developers
• Being open-source options, they are not yet as popular with enterprises.
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
• Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
• For example, a key-value pair may contain a key like “name” associated with a
value like “shikha”.
• It is one of the most basic NoSQL database examples. This kind of NoSQL database
is used for collections, dictionaries, associative arrays, etc. Key-value stores help
the developer to store schema-less data. They work best for shopping cart
contents.
• Redis, Dynamo, and Riak are some examples of key-value store databases;
Dynamo and Riak are based on Amazon's Dynamo paper.
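A minimal sketch of the key-value model using the redis-py client is shown below; it assumes a Redis server is running locally on the default port, and the keys and values simply mirror the "name"/"shikha" example above.

import redis  # pip install redis

# Key-value sketch: assumes a Redis server on localhost:6379.
r = redis.Redis(host="localhost", port=6379, db=0)
r.set("name", "shikha")                         # store a key-value pair
print(r.get("name"))                            # b'shikha'
r.set("cart:42", '{"items": ["pen", "book"]}')  # the value can be a JSON string, a BLOB, etc.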
Column-based
• Column-oriented databases work on columns and are based on the BigTable paper
by Google. Every column is treated separately. Values of a single column are
stored contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG,
MIN, etc., as the data is readily available in a column.
• Column-based NoSQL databases are widely used to manage data
warehouses, business intelligence, CRM, and library card catalogs.
• HBase, Cassandra, and Hypertable are examples of column-based databases.
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key value pair but
the value part is stored as a document. The document is stored in JSON or XML
formats. The value is understood by the DB and can be queried.
• In this diagram, on the left you can see rows and columns, and on the right a
document database with a structure similar to JSON. For the relational database,
you have to know in advance what columns you have, and so on. For a document
database, you store data as JSON-like objects; you do not need to define the
structure in advance, which makes it flexible.
• The document type is mostly used for CMS systems, blogging platforms, real-
time analytics and e-commerce applications. It should not be used for complex
transactions that require multiple operations or queries against varying
aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular
document-oriented DBMS systems.
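Below is a minimal document-store sketch using pymongo; it assumes a MongoDB server on localhost, and the database, collection and field names are invented for the example (they echo the User/Skill data shown earlier).

from pymongo import MongoClient  # pip install pymongo

# Document-store sketch: assumes MongoDB is running on localhost:27017.
client = MongoClient("mongodb://localhost:27017")
db = client["university"]

db.users.insert_one({
    "Uid": 1,
    "Firstname": "Indra",
    "Lastname": "Chaudhary",
    "Skills": ["Big Data", "Cloud"],  # no fixed schema: each document may have different fields
})
print(db.users.find_one({"Uid": 1}))  # the stored JSON-like document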
Graph-Based
• A graph-type database stores entities as well as the relations amongst those
entities. An entity is stored as a node, with the relationships as edges. An edge
gives a relationship between nodes. Every node and edge has a unique
identifier.
• Compared to a relational database, where tables are loosely connected, a
graph database is multi-relational in nature. Traversing relationships is fast as
they are already captured in the DB, and there is no need to calculate them.
• Graph-based databases are mostly used for social networks, logistics, and spatial data.
• Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based
databases.
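For illustration, a small sketch with the official Neo4j Python driver is given below; the connection URI, credentials, labels and property names are placeholders, and a local Neo4j instance is assumed.

from neo4j import GraphDatabase  # pip install neo4j

# Graph-database sketch: nodes are entities, the KNOWS edge is the relationship.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Indra", b="Chokraj",
    )
    # Traversal follows the stored edge directly; no join needs to be computed.
    result = session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name")
    print([record.values() for record in result])

driver.close()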
Big Data
Big Data is a collection of data that is huge in volume, yet growing exponentially
with time. It is data of such large size and complexity that none of the traditional
data management tools can store or process it efficiently. In short, big data is just
data, but of huge size.
What is an Example of Big Data?
Following are some of the Big Data examples-
• The New York Stock Exchange is an example of Big Data: it generates
about one terabyte of new trade data per day.
• Social media: statistics show that 500+ terabytes of new data get ingested
into the databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches
many petabytes.
Types Of Big Data
Following are the types of Big Data:
• Structured
• Unstructured
• Semi-structured
Structured
• Any data that can be stored, accessed and processed in the form of a fixed format
is termed 'structured' data. Over a period of time, talent in computer
science has achieved great success in developing techniques for working with
such data (where the format is well known in advance) and deriving
value out of it. However, nowadays we foresee issues when the size of such
data grows to a huge extent, with typical sizes in the range of multiple
zettabytes.
An ‘Employee’ table in a database is an example of Structured Data
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000

Unstructured
Any data with unknown form or the structure is classified as unstructured data. In
addition to the size being huge, un-structured data poses multiple challenges in
terms of its processing for deriving value out of it. A typical example of
unstructured data is a heterogeneous data source containing a combination of
simple text files, images, videos, etc. Nowadays, organizations have a wealth of data
available to them but unfortunately they don't know how to derive value out of
it, since this data is in its raw form or unstructured format.
Examples Of Un-structured Data
• The output returned by ‘Google Search’

Semi-structured
• Semi-structured data can contain both forms of data. We can see semi-
structured data as structured in form, but it is actually not defined by, e.g., a
table definition as in a relational DBMS. An example of semi-structured data is data
represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
• <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
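Because the structure of such records is carried by the tags rather than by a table definition, they can be parsed directly; the short Python sketch below (standard library only) reads a couple of the sample records above.

import xml.etree.ElementTree as ET

# Semi-structured data sketch: the tags themselves describe the structure.
xml_data = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))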
Characteristics Of Big Data
Big data can be described by the following characteristics:
• Volume
• Variety
• Velocity
• Variability
• (i) Volume – The name Big Data itself is related to a size which is enormous. Size
of data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent
upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to
be considered while dealing with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. During earlier days, spreadsheets and databases were the
only sources of data considered by most of the applications. Nowadays, data in
the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also
being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How
fast the data is generated and processed to meet the demands, determines real
potential in the data.
• Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.

Big data solutions typically involve one or more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
• Data sources: All big data solutions start with one or more data sources.
Examples include:
– Application data stores, such as relational databases.
– Static files produced by applications, such as web server log files.
– Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various formats.
This kind of store is often called a data lake. Options for implementing this
storage include Azure Data Lake Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading
source files, processing them, and writing the output to new files. Options
include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or
custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or
Python programs in an HDInsight Spark cluster.
• Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-
out processing, reliable delivery, and other message queuing semantics.
Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.
• Stream processing: After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise preparing the data for
analysis. The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams. You can
also use open source Apache streaming technologies like Storm and Spark
Streaming in an HDInsight cluster.
• Analytical data store: Many big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the distributed data
store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and
Spark SQL, which can also be used to serve data for analysis.
• Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modeling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis Services. It
might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting
can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to leverage their existing skills with Python
or R. For large-scale data exploration, you can use Microsoft R Server, either
standalone or with Spark.
• Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report or dashboard. To automate
these workflows, you can use an orchestration technology such as Azure Data
Factory or Apache Oozie and Sqoop.
Advantages Of Big Data Processing
Ability to process Big Data in DBMS brings in multiple benefits, such as-
• Businesses can utilize outside intelligence while making decisions
Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and natural
language processing technologies are being used to read and evaluate consumer
responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
What are the Disadvantages of Big Data?
• Since all the information collected requires a lot of effort and resources, storing it
before it can be examined needs a vast space. Although the analysis of enormous
information seems possible, some significant disadvantages of Big Data come to
light in terms of space, cost, and user security.
1. Unstructured Data
• The data collected can be arranged or present in the form of random
information. More variations in data can create difficulty in processing results
and generating solutions. If the information is broken or unstructured, many
users can get neglected while deriving future outcomes or analyzing present
scenarios.
2. Security Concerns are most dreaded disadvantages of Big Data
• For highly secured data or confidential information, highly secured networks are
needed for its transfer and storage. Furthermore, with the increased global
politics and complex situations between nations, leaked data can be used as an
advantage by enemies, so keeping it secure is essential and requires building
such a network.
3. Expensive
• The process of data generation and its analysis is costly without any surety of
favorable results. Mainly top businesses can afford to research this field, much as
in the space sector, where only the wealthiest companies and individuals carry out
research. The cost of setting up supercomputers is one of the leading
disadvantages of Big Data analytics. Even when the cost can be met, the
information usually resides on the cloud, which has to be paid for and will require
maintenance.
4. Skilled Analysts
• The professionals needed to carry out research and run complicated software
are highly paid and hard to find. There is a scarcity of individuals skilled for the
data analyst job, despite the increasing scope of this area of knowledge. Data is
the resource of the new generation; to remain in the market, it is necessary to
keep yourself updated with the latest information.
5. Hardware and Storage
• The servers and hardware needed to store and run high-quality software are
very costly and hard to build. Also, the information is available in bulk with
continuous changes, and processing requires faster software and applications.
And we cannot forget the uncertainty involved with getting accurate results.
Map Reduce
MapReduce is a programming model for writing applications that can process Big
Data in parallel on multiple nodes. MapReduce provides analytical capabilities for
analyzing huge volumes of complex data.
Why MapReduce?
• Traditional Enterprise Systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. The traditional model is certainly not suitable for processing huge
volumes of scalable data, which cannot be accommodated by standard database
servers. Moreover, the centralized system creates too much of a bottleneck
while processing multiple files simultaneously.

• Google solved this bottleneck issue using an algorithm called MapReduce.


MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected at one place and integrated to form the result
dataset.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines those
data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.
• Input Phase − Here we have a Record Reader that translates each record in an
input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from
the map phase into identifiable sets. It takes the intermediate keys from the
mapper as input and applies a user-defined code to aggregate the values in a
small scope of one mapper. It is not a part of the main MapReduce algorithm; it
is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a larger
data list. The data list groups the equivalent keys together so that their values
can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each one of them. Here, the data can be aggregated,
filtered, and combined in a number of ways, and it requires a wide range of
processing. Once the execution is over, it gives zero or more key-value pairs to
the final step.
• Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
• Let us try to understand the two tasks, Map and Reduce, with the help of a small
example −
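The following minimal word-count sketch (plain Python, not the Hadoop API; the sample records are made up for the example) illustrates the phases described above: Map emits (word, 1) pairs, shuffle-and-sort groups them by key, and Reduce aggregates each group into a smaller set of pairs.

from collections import defaultdict

def map_task(record):
    # Map: turn one input record into zero or more intermediate key-value pairs.
    return [(word, 1) for word in record.split()]

def reduce_task(key, values):
    # Reduce: combine all values seen for one key into a single output pair.
    return (key, sum(values))

records = ["big data big ideas", "data drives decisions"]

intermediate = [pair for record in records for pair in map_task(record)]  # Map phase

groups = defaultdict(list)  # Shuffle and sort: group intermediate pairs by key
for key, value in intermediate:
    groups[key].append(value)

print([reduce_task(key, values) for key, values in sorted(groups.items())])
# [('big', 2), ('data', 2), ('decisions', 1), ('drives', 1), ('ideas', 1)]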
Hadoop
Hadoop is an open-source framework that allows us to store and process big data in a
distributed environment across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network
all the complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems.
HDFS (Hadoop Distributed File System) has a unique design that provides storage
for extremely large files with a streaming data access pattern, and it runs
on commodity hardware. Let's elaborate on these terms:
• Extremely large files: here we are talking about data in the range of
petabytes (1,000 TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be processed
any number of times.
• Commodity hardware: hardware that is inexpensive and easily available in the
market. This is one of the features that especially distinguishes HDFS from other
file systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
• NameNode (MasterNode):
– Manages all the slave nodes and assigns work to them.
– It executes filesystem namespace operations like opening, closing, and renaming
files and directories.
– It should be deployed on reliable, high-configuration hardware, not
on commodity hardware.
• DataNode(SlaveNode):
– Actual worker nodes, which do the actual work like reading, writing,
processing, etc.
– They also perform creation, deletion, and replication upon instruction from
the master.
– They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
• Namenodes:
– Run on the master node.
– Store metadata (data about data) like file path, the number of blocks, block
Ids. etc.
– Require high amount of RAM.
– Store metadata in RAM for fast retrieval, i.e., to reduce seek time, though a
persistent copy of it is kept on disk.
• DataNodes:
– Run on slave nodes.
– Require high memory as data is actually stored here.
Hadoop can run in 3 different modes.
1. Standalone(Local) Mode
By default, Hadoop is configured to run in a non-distributed mode. It runs as a single Java
process. Instead of HDFS, this mode utilizes the local file system. This mode is useful for
debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-
site.xml, masters & slaves. Standalone mode is usually the fastest mode in Hadoop.

2. Pseudo-Distributed Mode(Single node)


Hadoop can also run on a single node in Pseudo-Distributed mode. In this mode, each
daemon runs in a separate Java process. In this mode, custom configuration is
required (core-site.xml, hdfs-site.xml, mapred-site.xml). Here, HDFS is utilized for input
and output. This mode of deployment is useful for testing and debugging purposes.

3. Fully Distributed Mode


This is the production mode of Hadoop. In this mode, typically one machine in the
cluster is designated exclusively as the NameNode and another as the Resource Manager.
These are the masters. All other nodes act as DataNodes and Node Managers; these are
the slaves. Configuration parameters and the environment need to be specified for the
Hadoop daemons. This mode offers fully distributed computing capability, reliability,
fault tolerance and scalability.
