
MODULE 3

Hadoop and the Data Warehouse: Friends or Foes?
3.1 Comparing and Contrasting Hadoop with Relational Databases
KEYPOINTS:
• Database models and database systems
• The dominant database technology is the relational database management system (RDBMS)
• Structured Query Language (SQL) is the common programming language for managing data stored in an RDBMS
• The 1980s and 1990s brought object databases
• A newer class of technologies: NoSQL databases and Hadoop
 NoSQL data stores
• "Just Say No to SQL": since the 1980s, some developers have pushed back against the limitations of (SQL-based) relational databases.
• They were tired of forcing square pegs into round holes, solving problems that relational databases weren't designed for.

• A relational database is a powerful tool, but for some kinds of data (like key-value pairs, or graphs) and some usage patterns (like extremely large-scale storage) a relational database just isn't practical. And when it comes to high-volume storage, relational databases can be expensive, both in terms of database license costs and hardware costs.
• NoSQL databases are well suited to these data storage and processing problems.
• These NoSQL databases typically provide massive scalability by way of clustering, and are often designed to enable high throughput and low latency.
• The name NoSQL is somewhat misleading, because many databases that fit the category do have SQL support (rather than "NoSQL" support). Think of the name instead as "Not Only SQL."
The NoSQL offerings available today can be broken down into
four distinct categories, based on their design and purpose:

✓ Key-value stores
✓ Column family stores
✓ Document stores
✓ Graph databases
• ✓ Key-value stores:
• This offering provides a way to store any kind of data without having to use a schema.
• This is in contrast to relational databases, where you need to define the schema (the table structure) before any data is inserted.
• Since key-value stores don't require a schema, you have great flexibility to store data in many formats.
• In a key-value store, a row simply consists of a key (an identifier) and a value, which can be anything from an integer value to a large binary data string.
• Many implementations of key-value stores are based on Amazon's Dynamo paper.
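The row model described above can be sketched with a plain Python class standing in for a key-value store. This is a toy illustration of the concept, not the API of any real product; the class and key names are made up for the example.

```python
# Toy key-value store: a row is just (key, value), and no schema is declared,
# so values of completely different shapes can live side by side.
class KeyValueStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # The value can be anything: an integer, a binary blob, a nested structure.
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

store = KeyValueStore()
store.put("user:42:age", 29)                    # integer value
store.put("user:42:avatar", b"\x89PNG")         # binary data string
store.put("user:42:prefs", {"theme": "dark"})   # nested structure, no schema change
```

Note how three differently shaped values coexist under one keyspace, which is exactly what a fixed relational schema would forbid.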
• ✓ Column family stores:
• Here you have databases in which columns are grouped into column families and stored together on disk.
• Despite the name, many of these databases aren't column-oriented, because they're based on Google's BigTable paper, which stores data as a multidimensional sorted map.
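The BigTable-style "multidimensional sorted map" can be pictured as nested dictionaries keyed by (row key, column family, column qualifier). This is a rough sketch of the data model only, using invented row and family names, not the storage format of any real column family store.

```python
from collections import defaultdict

# Rough model of a BigTable-style map:
# (row key -> column family -> column qualifier -> value).
# Columns in the same family are grouped together, mirroring how column
# family stores keep a family's columns together on disk.
table = defaultdict(lambda: defaultdict(dict))

row = "com.example/index.html"
table[row]["contents"]["html"] = "<html>...</html>"   # "contents" family
table[row]["anchor"]["other.com"] = "Example link"    # "anchor" family

# Reading one family touches only that family's columns:
contents = table[row]["contents"]
```

The point of the grouping is locality: a scan that needs only the "contents" family never has to read the "anchor" columns.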
• ✓ Document stores:
• This offering relies on collections of similarly encoded and formatted documents to improve efficiencies.
• Document stores enable individual documents in a collection to include only a subset of fields, so only the data that's needed is stored.
• For sparse data sets, where many fields are often not populated, this can translate into significant space savings.
• There are no empty columns. Document stores also enable schema flexibility, because only the fields that are needed are stored, and new fields can be added.
• By contrast, relational table structures are defined up front before data is stored, and changing columns is a tedious task that impacts the entire data set.
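The sparse-document idea above can be illustrated with a list of Python dicts as a stand-in collection. The field names and values are invented for the example; no particular document database's API is implied.

```python
# Toy document collection: each document carries only the fields it needs,
# so sparse data produces no empty columns.
collection = [
    {"_id": 1, "name": "Megha", "email": "megha@example.com"},
    {"_id": 2, "name": "Ravi"},                        # no email field at all
    {"_id": 3, "name": "Asha", "phone": "555-0100"},   # new field, no schema change
]

# Query only the documents that actually contain an email field:
with_email = [doc for doc in collection if "email" in doc]
```

Adding the "phone" field to one document required no migration of the other documents, which is the schema flexibility the bullet points describe.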
• ✓ Graph databases:
• Here you have databases that store graph structures: representations that show collections of entities (vertices or nodes) and their relationships (edges) with each other.
• These structures make graph databases extremely well suited for storing complex structures, like the linking relationships between all known web pages.
• (For example, individual web pages are nodes, and the edges connecting them are links from one page to another.)
• Google, of course, is all over graph technology, and invented a graph processing engine called Pregel to power its PageRank algorithm.
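The web-page example can be sketched as a directed graph in an adjacency list, with pages as nodes and hyperlinks as edges. The page names are made up for the example; this is a model of the structure, not of any graph database's query language.

```python
# Toy web graph: pages are nodes, hyperlinks are directed edges
# (adjacency list: source page -> pages it links to).
links = {
    "home.html": ["about.html", "blog.html"],
    "about.html": ["home.html"],
    "blog.html": ["home.html", "about.html"],
}

def inbound(page):
    """Pages that link *to* the given page (edges arriving at the node)."""
    return sorted(src for src, dests in links.items() if page in dests)
```

A relational schema could store the same edges in a two-column table, but traversals like "pages linking to X" are the access pattern graph databases are built around.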
 ACID versus BASE data stores
• A hallmark of relational database systems is something known as ACID compliance:
• ✓ Atomicity: The database transaction must completely succeed or completely fail. Partial success is not allowed.
• ✓ Consistency: During the database transaction, the RDBMS progresses from one valid state to another. The state is never invalid.
• ✓ Isolation: The client's database transaction must occur in isolation from other clients attempting to transact with the RDBMS.
• ✓ Durability: The data operation that was part of the transaction must be reflected in nonvolatile storage (computer memory that retains stored information even when not powered, like a hard disk) and persist after the transaction successfully completes.
• Transaction failures cannot leave the data in a partially committed state.
Many use cases for RDBMSs, such as online transaction processing, depend on ACID-compliant transactions between the client and the RDBMS for the system to function properly.
• A great example of an ACID-compliant transaction is a transfer of funds from one bank account to another.
• Whereas ACID defines the key characteristics required for reliable transaction processing, the NoSQL world requires different characteristics to enable flexibility and scalability.
• These opposing characteristics are cleverly captured in the acronym BASE:
BASE
• ✓ Basically Available: The system is guaranteed to be available for querying by all users. (No isolation here.)
• ✓ Soft State: The values stored in the system may change because of the eventual consistency model, as described in the next bullet.
• ✓ Eventually Consistent: As data is added to the system, the system's state is gradually replicated across all nodes.
For example, in Hadoop, when a file is written to HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system isn't consistent.
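The replication window described above can be simulated with three dicts standing in for data nodes. This is a toy model of the timing, not of the actual HDFS replication pipeline; the node and block names are invented.

```python
# Toy model of eventual consistency: a block lands on one node first,
# then is copied to replicas; until that finishes, the nodes disagree.
nodes = {"dn1": {}, "dn2": {}, "dn3": {}}

def write_block(block_id, data):
    nodes["dn1"][block_id] = data          # original write lands on one node

def replicate(block_id):
    data = nodes["dn1"][block_id]
    for name in ("dn2", "dn3"):            # replicas created afterwards
        nodes[name][block_id] = data

write_block("blk_1", b"hello")
# Soft state: right after the write, not every node has the block yet.
inconsistent = any("blk_1" not in blocks for blocks in nodes.values())
replicate("blk_1")
# Eventually consistent: after replication, every node agrees.
consistent = all(blocks.get("blk_1") == b"hello" for blocks in nodes.values())
```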
No discussion of NoSQL would be complete without mentioning the CAP theorem, which represents the three kinds of guarantees that architects aim to provide in their systems:
• ✓ Consistency: Similar to the C in ACID, all nodes in the system have the same view of the data at any time.
• ✓ Availability: The system always responds to requests.
• ✓ Partition tolerance: The system remains online if network problems occur between system nodes.
Figure 11-1: CAP theorem guarantees and implementation examples.
In the figure:
• ✓ Systems using traditional relational technologies normally aren't partition tolerant, so they can guarantee consistency and availability.
• If one part of these traditional relational systems is offline, the whole system is offline.
• ✓ Systems where partition tolerance and availability are of primary importance can't guarantee consistency, because updates (that destroyer of consistency) can be made on either side of the partition.
• The key-value stores Dynamo and CouchDB and the column-family store Cassandra are popular examples of partition-tolerant/availability (PA) systems.
• ✓ Systems where partition tolerance and consistency are of primary importance can't guarantee availability, because the systems return errors until the partitioned state is resolved.
Hadoop-based data stores are considered CP systems (consistent and partition tolerant).
With data stored redundantly across many slave nodes, outages to large portions (partitions) of a Hadoop cluster can be tolerated.
Hadoop is considered to be consistent because it has a central metadata store (the NameNode), which maintains a single, consistent view of data stored in the cluster.
We can't say that Hadoop guarantees availability, because if the NameNode fails, applications cannot access data in the cluster.
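The classifications discussed above can be summarized in a small lookup table, restating only what the text itself assigns to each system (C = consistency, A = availability, P = partition tolerance):

```python
# Each system guarantees two of the three CAP properties, per the discussion above.
cap_guarantees = {
    "traditional RDBMS": {"C", "A"},  # not partition tolerant
    "Dynamo":            {"A", "P"},
    "CouchDB":           {"A", "P"},
    "Cassandra":         {"A", "P"},
    "Hadoop (HDFS)":     {"C", "P"},  # NameNode failure sacrifices availability
}

# Which systems trade consistency away for availability under partition?
pa_systems = sorted(name for name, props in cap_guarantees.items()
                    if props == {"A", "P"})
```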
 Structured data storage and processing in Hadoop
Hadoop's core characteristics:
Hadoop is, first and foremost, a general-purpose data storage and processing platform designed to scale out to thousands of compute nodes and petabytes of data.
There's no data model in Hadoop itself; data is simply stored on the Hadoop cluster as raw files. As such, the core components of Hadoop itself have no special capabilities for cataloging, indexing, or querying structured data.
Hadoop can, however, be extended for highly specific purposes. The Hadoop community has done just that with a number of Apache projects that, in totality, make up the Hadoop ecosystem.

• ✓ Hive: A data warehousing framework for Hadoop. Hive catalogs data in structured files and provides a query interface with the SQL-like language named HiveQL.
• ✓ HBase: A distributed database (a NoSQL database that relies on multiple computers rather than on a single CPU, in other words) that's built on top of Hadoop.
• ✓ Giraph: A graph processing engine for data stored in Hadoop.
3.2 Modernizing the Warehouse with Hadoop
There are four (specific) ways that Hadoop can modernize the warehouse:
✓ Landing Zone for All Data
✓ Queryable Archive of Cold Data
✓ Preprocessing Engine
✓ Data Discovery Zone
The landing zone
What exactly is the landing zone?
The landing zone is merely the central place where data will land in your enterprise: weekly extractions of data from operational databases, for example, or data from systems generating log files.
✓ It can handle all kinds of data.
✓ It's easily scalable.
✓ It's inexpensive.
✓ Once you land data in Hadoop, you have the flexibility to query, analyze, or process the data in a variety of ways.
A Hadoop-based landing zone is shown in Figure 11-2 (The enterprise doorstep: Hadoop serves as a landing zone for incoming data).
• In Figure 11-2, we can see the data warehouse presented as the primary resource for the various kinds of analysis listed on the far right side of the figure.
• Here we also see the concept of a landing zone represented, where Hadoop will store data from a variety of incoming data sources.
• To enable a Hadoop landing zone, you'll need to ensure you can write data from the various data sources to HDFS.
• For relational databases, a good solution would be to use Sqoop.
• Organizations started using data warehouses to generate reports from relational data as their data storage capabilities improved in the 1980s.
• Early databases were designed for Online Transaction Processing (OLTP), which is efficient for transactions but not for reporting.
• Relational Online Analytical Processing (ROLAP) databases were developed to support large-scale reporting and analysis.
• This led to the creation of data warehouses, which are separate from operational data stores and optimized for analysis and reporting.
• The core idea is using purpose-built tools: OLTP systems for transactions and data warehouses for analytics.
Data warehouses are under increasing stress, though, for the following reasons:
✓ Increased demand to keep longer periods of data online.
✓ Increased demand for processing resources to transform data for use in other warehouses and data marts.
✓ Increased demand for innovative analytics, which requires analysts to pose questions on the warehouse data, on top of the regular reporting that's already being done. This can incur significant additional processing.
A queryable archive of cold warehouse data
