Big Data Technologies
ITS66904
ITS66904 - Big Data Technologies Overview of Big Data Technologies Slide <2> of 9
Learning Outcomes
Key Terms You Must Be Able To Use
• If you have mastered this topic, you should be able to use the
following terms correctly in your assignments and exams:
• Hadoop MapReduce
• Key-value
• Document-based
• NOSQL
• RDBMS
What Technology Do We Have for Big Data?
Hadoop for Big Data
Source: https://hadoop.apache.org/
Hadoop Creation History
Hadoop: Assumptions
• Flexible: can easily access new data sources and tap into
different types of data (structured and unstructured)
NOSQL
• The name:
– Stands for “Not Only SQL”
– The term NOSQL was introduced by Carlo Strozzi
in 1998 as the name of his file-based database
– It was reintroduced by Eric Evans in 2009, when an
event was organized to discuss open-source
distributed databases
– Eric states that “… but the whole point of seeking
alternatives is that you need to solve a problem
that relational databases are a bad fit for. …”
Key features (advantages)
– non-relational
– do not require a fixed schema
– data are replicated to multiple nodes (so copies are identical and the system is fault-tolerant) and can be partitioned:
• down nodes are easily replaced
• no single point of failure
– horizontally scalable
– cheap and easy to implement (open source)
– massive write performance
– fast key-value access
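The replication and partitioning points above can be illustrated with a minimal sketch. This is a toy model, not how any particular NOSQL product works: the `Cluster` class, the node names, and the "hash the key, then copy to the next node" placement rule are all assumptions made for illustration.

```python
import hashlib


class Cluster:
    """Toy cluster: each key is hashed to a home node and copied to the next ones."""

    def __init__(self, node_names, replicas=2):
        self.names = list(node_names)
        self.replicas = replicas
        self.stores = {name: {} for name in self.names}  # one dict per node
        self.down = set()                                 # nodes marked as failed

    def owners(self, key):
        # A stable hash, so the same key always maps to the same nodes (partitioning).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every live owner of the key (replication).
        for name in self.owners(key):
            if name not in self.down:
                self.stores[name][key] = value

    def get(self, key):
        # Read from the first live owner that holds the key.
        for name in self.owners(key):
            if name not in self.down and key in self.stores[name]:
                return self.stores[name][key]
        raise KeyError(key)


cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.put("user:42", "Alice")

# Simulate a failure of the first owner; the replica still answers the read.
cluster.down.add(cluster.owners("user:42")[0])
print(cluster.get("user:42"))  # prints "Alice"
```

Because every key lives on two nodes in this sketch, losing one node does not lose data, which is the "no single point of failure" idea in miniature. Adding more nodes spreads the keys further, which is what horizontal scaling means here.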
Disadvantages
Who is using them?
NOSQL categories
1. Key-value
• Example: DynamoDB, Voldemort, Scalaris
2. Document-based
• Example: MongoDB, CouchDB
3. Column-based
• Example: BigTable, Cassandra, HBase
4. Graph-based
• Example: Neo4j, InfoGrid
• “No-schema” is a common characteristic of most NOSQL storage systems
• Provide “flexible” data types
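To make the four categories concrete, the sketch below stores the same made-up customer record in the shape each category would typically use. Plain Python dictionaries and lists stand in for the databases; the field names and values are invented for illustration and do not come from any of the products listed above.

```python
import json

# 1. Key-value: one key, one opaque value (here a JSON string the store does not interpret).
key_value = {"user:42": json.dumps({"name": "Alice", "city": "Kuala Lumpur"})}

# 2. Document-based: the value is a structured document the database can look inside.
document = {
    "_id": 42,
    "name": "Alice",
    "city": "Kuala Lumpur",
    "orders": [{"item": "book", "qty": 2}],
}
# "No-schema": a second document in the same collection may have different fields.
other_document = {"_id": 43, "name": "Bob", "phone": "000-0000"}

# 3. Column-based: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Alice", "city": "Kuala Lumpur"},
        "orders": {"order:1": "book x2"},
    }
}

# 4. Graph-based: nodes with properties, plus labelled edges between them.
nodes = {42: {"label": "User", "name": "Alice"}, 7: {"label": "Item", "name": "book"}}
edges = [(42, "BOUGHT", 7)]

print(json.loads(key_value["user:42"])["name"])     # key-value: decode, then read
print(document["orders"][0]["item"])                # document: navigate nested fields
print(column_family["user:42"]["profile"]["city"])  # column-based: family, then column
print([nodes[dst]["name"] for src, rel, dst in edges if src == 42 and rel == "BOUGHT"])
```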
Key-value
Pros:
– very fast
– very scalable (horizontally distributed to nodes based on key)
– simple data model
– eventual consistency
– fault tolerance
Cons:
– cannot model more complex data structures, such as objects
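The simple data model and its main limitation can be seen in a few lines. The `KeyValueStore` class below is a hypothetical in-memory stand-in, not the API of any real product: it only knows keys and opaque byte values, so an object has to be serialised by the client before it is stored.

```python
import json


class KeyValueStore:
    """Hypothetical in-memory key-value store: values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()

# The client flattens the object into bytes before storing it.
profile = {"name": "Alice", "tags": ["admin", "staff"]}
store.put("user:42", json.dumps(profile).encode("utf-8"))

# The store cannot filter on "name" or "tags"; the client must fetch by key
# and decode the whole value itself.
loaded = json.loads(store.get("user:42").decode("utf-8"))
print(loaded["name"])  # prints "Alice"
```

Lookups by key stay fast and easy to distribute because the store never inspects the value. The cost is that any query on the contents of the value has to be done in application code.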
Document-based
Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (data stored in column order), but with a twist
• Tables are similar to those in an RDBMS, but can handle semi-structured data
• Data model:
– Collection of column families
– Column family = (key, value), where value = a set of related columns (standard, super)
– Indexed by row key, column key and timestamp
• Allows key-value pairs to be stored (and retrieved by key) in a massively parallel system
• Storing principle: big hashed distributed tables
• Properties: partitioning (horizontal and/or vertical), high availability, etc. are completely transparent to the application
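The data model above (column families, indexed by row key, column key and timestamp) can be sketched with nested dictionaries. The `ColumnFamilyTable` class and its method names are assumptions made for this example; real systems such as HBase or Cassandra expose different APIs and store the data on disk, not in memory.

```python
class ColumnFamilyTable:
    """Toy table: row key -> column family -> column key -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = (self.rows.setdefault(row_key, {})
                         .setdefault(family, {})
                         .setdefault(column, {}))
        cell[timestamp] = value  # older versions are kept alongside newer ones

    def get(self, row_key, family, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None                 # sparse: a missing cell is simply absent
        if timestamp is None:
            timestamp = max(versions)   # newest version by default
        return versions.get(timestamp)


table = ColumnFamilyTable(["profile", "activity"])
table.put("user:42", "profile", "name", "Alice", timestamp=1)
table.put("user:42", "profile", "name", "Alice Tan", timestamp=2)
table.put("user:43", "profile", "email", "bob@example.com", timestamp=1)

print(table.get("user:42", "profile", "name"))               # prints "Alice Tan"
print(table.get("user:42", "profile", "name", timestamp=1))  # prints "Alice"
print(table.get("user:43", "profile", "name"))               # prints "None"
```

Note that the two rows hold different columns inside the same family, and that a missing cell takes no space. Only the column families are declared in advance.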
Column-based
• One column family can have a variable number of columns
• Cells within a column family are sorted “physically”
• Very sparse: most cells have null values
• Comparison: RDBMS vs column-based NOSQL
– Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue them together
• Column-based NOSQL: only fetches the column families containing the columns required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation, giving data locality)
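The data-locality argument can be sketched as follows. The layout and the "one read per column family" cost model are simplifying assumptions for illustration; real engines read blocks or files and the actual cost depends on the storage format. The sample rows are made up.

```python
# Each column family is stored as its own block: family -> row key -> columns.
storage = {
    "profile": {
        "user:1": {"name": "Alice"},
        "user:2": {"name": "Bob", "city": "Ipoh"},
    },
    "activity": {
        "user:1": {"logins": 3},
        "user:2": {"logins": 7},
    },
}


def scan(families, reads):
    """Return the requested families for all rows.

    Appends one entry to `reads` per family touched, as a stand-in for
    one disk read that fetches the whole family block.
    """
    result = {}
    for family in families:
        reads.append(family)
        for row_key, columns in storage[family].items():
            result.setdefault(row_key, {}).update(columns)
    return result


reads = []
names = scan(["profile"], reads)
print(names)  # every row's profile columns
print(reads)  # prints "['profile']": the activity family was never touched
```

A query that needs only profile columns reads one family and gets those columns for every row, while the `activity` data stays untouched on disk.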
Graph-based
Apache Hive
Hive is not
• A relational database
• A design for online transaction processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
SQL-on-Hadoop
Summary of Main Teaching Points
Question and Answer Session
Q&A
What we will cover next