
NoSql-Unit-2

The document discusses NoSQL storage architecture, focusing on column-oriented databases and their advantages over traditional relational databases. It highlights the flexibility of schema design, data processing, and the management of sparse datasets, particularly in systems like Cassandra and MongoDB. Additionally, it covers CRUD operations, indexing, and the unique features of HBase within the Hadoop ecosystem.

Uploaded by

Ambar Majumdar

Dr. SELVA KUMAR S
B.M.S COLLEGE OF ENGINEERING
▪ NOSQL Storage Architecture
▪ Performing CRUD operations
▪ Querying NoSQL stores
▪ Modifying and Managing NOSQL Data stores
▪ Indexing and ordering datasets

2
UNDERSTANDING THE STORAGE ARCHITECTURE

▪ Column-oriented databases are among the most popular types of non-relational databases.

Google's publication on Bigtable established beyond a doubt that a cluster of
inexpensive hardware can be leveraged to hold huge amounts of data, far more than a single
machine can hold, and that this data can be processed effectively and efficiently within a reasonable time frame.

Three key themes emerged:

1. Data needs to be stored in a networked filesystem that can expand to multiple machines.

2. Data needs to be stored in a structure that provides more flexibility than the traditional
normalized relational database structures.

The storage scheme needs to allow for effective storage of huge, sparse data sets, and it
needs to accommodate changing schemas without requiring alteration of the underlying
tables.

3. Data needs to be processed in a way that computations on it can be performed in isolated


subsets of the data and then combined to generate the desired output.

3
WORKING WITH COLUMN-ORIENTED DATABASES

▪ Using Tables and Columns in Relational Databases

4
▪ 1) An RDBMS table has a few columns, sometimes tens of them, yet may hold millions of
rows. At that scale, data access can grind to a halt unless special considerations like
denormalization are applied.

▪ 2) As you begin to use your table, you may need to alter it to hold a few additional attributes.

As newer records store these attributes, the existing records are left with null values for
them.

▪ 3) As the table keeps a greater variety of attributes, the likelihood of sparse data sets
(sets with nulls in many cells) becomes increasingly real.

5
Consider that this data is evolving and you have to store each version of the cell value as it
evolves.

Think of it like a three-dimensional Excel spreadsheet, where the third dimension is time.

Then the values as they evolve through time could be thought of as cell values in multiple
spreadsheets put one behind the other in chronological order

Therefore, altering the table as data evolves, storing a lot of sparse cells, and working
through value versions can get complex.

6
CONTRASTING COLUMN DATABASES WITH RDBMS

• First and foremost, a column-oriented database imposes minimal need for upfront schema
definition and can easily accommodate newer columns as the data evolves.

• In a typical column-oriented store, you predefine a column-family and not a column.

• A column-family is a set of columns grouped together into a bundle.


• In a column database, a column-family is analogous to a column in an RDBMS.
• Both are typically defined before data is stored in tables and are fairly static in nature.
• Columns in RDBMS define the type of data they can store.
• Column-families have no such limitation; they can contain any number of columns, which
can store any type of data, as long as it can be persisted as an array of bytes.
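Since any value that can be serialized to bytes is storable, a column-family can be modeled as nested maps. The following Python sketch (illustrative names only, not any real database's API) shows rows that hold cells only for the columns they actually use:

```python
# A toy model of a column-family store (illustrative only, not a real API):
# rows hold cells only for the columns they have, and every cell value is
# persisted as an array of bytes, so columns need no predeclared types.
table = {}  # row_key -> {column_family -> {column_name -> bytes}}

def put(row_key, family, column, value):
    """Store a cell; the value is serialized to bytes."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = str(value).encode("utf-8")

def get(row_key, family, column):
    """Return the cell's bytes, or None if the row never stored that column."""
    return table.get(row_key, {}).get(family, {}).get(column)

put("emp1", "personal", "name", "ramu")
put("emp1", "personal", "city", "hyderabad")
put("emp2", "personal", "name", "kiran")  # emp2 simply has no "city" cell
```

Note that emp2 stores no placeholder for the missing city column, which is how column-oriented stores avoid the sparse-null problem described above.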

7
▪ Each row of a column-oriented database table stores data values in only those columns for
which it has valid values.

▪ Continuously evolving data would get stored in a column database as shown in Figure

8
▪ SELECT * FROM emp WHERE id=1

9
▪ SELECT first_name FROM emp WHERE ssn=666

10
11
• Cassandra is designed such that it has no master or slave
nodes.
• It has a ring-type architecture, that is, its nodes are logically
distributed like a ring.
• Data is automatically distributed across all the nodes.
• Data is replicated across the nodes for redundancy.
• Data is kept in memory and lazily written to the disk.
• Hash values of the keys are used to distribute the data
among nodes in the cluster.
12
13
1.Data is written to a commit log on disk.
2.The data is sent to the responsible node, chosen based on the hash value of the key.
3.The node writes the data to an in-memory table called a memtable.
4.The memtable is flushed to an SSTable (Sorted String Table) on disk,
which holds the consolidated data of all the updates to the table.
5.SSTables are periodically merged so that the table's data stays consolidated.
6.If the responsible node is down, the data is written to another
node identified as a tempnode. The tempnode holds the data
temporarily until the responsible node comes back up (Cassandra calls this
mechanism hinted handoff).
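The write path above can be sketched in a few lines of Python (a simplification with assumed names, not Cassandra's real internals): writes go to the commit log and memtable, and a flush produces a key-sorted SSTable.

```python
# A simplified sketch of the write path (assumed names, not Cassandra's real
# internals): writes append to a commit log for durability, land in an
# in-memory memtable, and are flushed as an immutable key-sorted SSTable.
commitlog, memtable, sstables = [], {}, []

def write(key, value):
    commitlog.append((key, value))  # step 1: durable log entry first
    memtable[key] = value           # step 3: update the in-memory table

def flush():
    """Steps 4-5: persist the memtable as a Sorted String Table."""
    sstables.append(sorted(memtable.items()))  # sorted by key, hence the name
    memtable.clear()

write("banana", 1)
write("apple", 2)
write("cherry", 3)
flush()
```

The sort at flush time is what makes SSTables cheap to merge and search later.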
14
15
• Data on the same node is given first preference and is
considered data local.
• Data on the same rack is given second preference and is
considered rack local.
• Data on the same data center is given third preference and is
considered data center local.
• Data in a different data center is given the least preference.
▪ Data in the memtable and sstable is checked first so that the
data can be retrieved faster if it is already in memory.
16
▪ Cassandra performs transparent distribution of data by
horizontally partitioning the data in the following manner:
• A hash value is calculated based on the primary key of the data.
• The hash value of the key is mapped to a node in the cluster
• The first copy of the data is stored on that node.
• The distribution is transparent as you can both calculate the
hash value and determine where a particular row will be stored.

17
▪ The following diagram depicts a four node cluster with token
values of 0, 25, 50 and 75.
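Placement on such a ring can be sketched as follows. This is a toy model: a real partitioner hashes keys into a much larger token space, while here hashes are reduced to the range 0-99 for illustration.

```python
# A toy model of token-based placement for the four-node ring above
# (tokens 0, 25, 50, 75). Hashes are reduced to the range 0-99 purely for
# illustration; a real partitioner uses a much larger token space.
TOKENS = [0, 25, 50, 75]  # node i owns hashes from TOKENS[i] up to the next token

def owner_of(hash_value):
    """Return the index of the node whose token range covers the hash."""
    node = 0
    for i, token in enumerate(TOKENS):
        if hash_value >= token:
            node = i
    return node
```

Because the mapping is a pure function of the key's hash, any client can compute where a row lives, which is exactly the "transparent distribution" property described above.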

18
▪ Cassandra uses a gossip protocol to communicate with nodes in a cluster.
▪ It is an inter-node communication mechanism similar to the heartbeat protocol in
Hadoop.
▪ The gossip process runs periodically on each node and exchanges state
information with three other nodes in the cluster.
▪ Eventually, information is propagated to all cluster nodes.
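The propagation behavior can be simulated with a toy model (purely illustrative; real gossip also exchanges versioned state between nodes, not just a flag):

```python
import random

# A toy simulation of gossip-style propagation (illustrative only): each
# round, every node that has the new state exchanges it with three random
# peers, so the information eventually reaches every node in the cluster.
def gossip_rounds(n_nodes, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    informed = {0}             # node 0 starts with new state information
    rounds = 0
    while len(informed) < n_nodes:
        for node in list(informed):
            informed.update(rng.sample(range(n_nodes), 3))
        rounds += 1
    return rounds
```

The number of informed nodes grows roughly geometrically, which is why gossip converges in a handful of rounds even for large clusters.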

19
20
▪ MongoDB achieves replication using the concept of replica sets.
▪ A replica set is a group of mongod instances that host the same data set.
▪ One of the nodes is selected as the primary or main node.
▪ The primary node receives all the operations from the user, and the secondaries
are updated from the primary by replaying the same operations, which maintains
consistency.
▪ If the primary node goes down, one of the secondary nodes is elected as the new
primary node and the operations are carried forward.
▪ When the fallen node recovers, it rejoins the cluster as a secondary node.
▪ We can control our cluster of mongod instances using MongoDB Atlas.

21
22
▪ Sharding is used by MongoDB to store data across multiple machines.
▪ It uses horizontal scaling, adding more machines to distribute data and operations
as load and demand grow.
▪ Sharding arrangement in MongoDB has mainly three components:
▪ 1. Shards or replica sets
▪ 2. Configuration Servers
▪ 3. Query Router
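The division of labor among the three components can be sketched as follows; the range boundaries and shard names are made up for illustration. The config servers' metadata maps shard-key ranges to shards (replica sets), and the query router consults it:

```python
# A schematic sketch of the three components (all names and ranges here are
# made up for illustration): config servers hold metadata mapping shard-key
# ranges to shards, and the query router uses it to target operations.
config_metadata = [((0, 500), "shardA"), ((500, 1000), "shardB")]  # replica sets

def route(shard_key):
    """Query router: forward the operation to the shard owning the key."""
    for (low, high), shard in config_metadata:
        if low <= shard_key < high:
            return shard
    raise ValueError("no shard owns this key")
```

A query that includes the shard key touches only one shard; without it, the router would have to broadcast to every shard.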

23
24
25
26
1.Monitoring — ensuring main and secondary instances are
working as expected.
2.Notification — notify system admins about occurrences in
the Redis instances.
3.Failover management — Sentinel nodes can start a failover
process if the primary instance isn't available and enough
(a quorum of) nodes agree that it is down.
4.Configuration management — Sentinel nodes also serve as
a point of discovery of the current main Redis instance.
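The failover decision in step 3 reduces to a quorum check, sketched here (illustrative only, not Redis Sentinel's actual API):

```python
# A minimal sketch of the failover decision in step 3 (illustrative): the
# sentinels start a failover only when at least a quorum of them agree that
# the primary instance is unreachable.
def should_failover(down_votes, quorum):
    """True when enough sentinel nodes agree the primary is down."""
    return down_votes >= quorum
```

Requiring a quorum prevents a single sentinel with a broken network link from triggering an unnecessary failover.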
27
28
29
▪ At any given time, a Redis Enterprise cluster node can include
between zero and a few hundred Redis databases in one of the
following types:
• A simple database, i.e. a single primary shard
• A highly available (HA) database, i.e. a pair of primary and
replica shards
• A clustered database, which contains multiple primary shards,
each managing a subset of the dataset (or in Redis terms, a
different range of “hash-slots”)
• An HA clustered database, i.e. multiple pairs of primary/replica
shards
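Hash-slot sharding can be sketched as below. Open-source Redis computes CRC16(key) mod 16384 to pick a slot; Python's built-in hash() stands in for CRC16 here, which is an assumption for illustration only.

```python
# A sketch of hash-slot sharding. Open-source Redis computes
# CRC16(key) mod 16384 to pick a slot; Python's built-in hash() stands in
# for CRC16 here, which is an assumption for illustration only.
N_SLOTS = 16384

def slot_for(key):
    """Map a key to one of the fixed hash slots."""
    return hash(key) % N_SLOTS

def shard_for(key, n_primary_shards):
    """Each primary shard manages a contiguous range of hash slots."""
    return slot_for(key) * n_primary_shards // N_SLOTS
```

Fixing the slot count while varying the number of shards means resharding only moves slot ranges between shards; keys never need to be rehashed.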

30
31
1.Reading the graph data from Neo4j Database
2.Loading (projecting) the data into an in-memory graph
3.Running an algorithm on a projected graph
4.Writing the results back to Neo4j Database (if the algorithm
runs in write mode)
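The four steps can be mimicked in plain Python (the real workflow runs inside Neo4j via the GDS library; this sketch substitutes a trivial degree-centrality algorithm for brevity, and the "database" is just a list of edges):

```python
# The four steps mimicked in plain Python (the real workflow runs inside
# Neo4j via the GDS library; a trivial degree-centrality algorithm stands in
# for brevity, and the "database" is just a list of edges).
stored_edges = [("a", "b"), ("a", "c"), ("b", "c")]  # 1. graph data at rest

def project(edges):
    """2. Load (project) the data into an in-memory adjacency list."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph

def degree_centrality(graph):
    """3. Run an algorithm on the projected graph."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

results = degree_centrality(project(stored_edges))
written_back = dict(results)  # 4. write results back (mocked as a plain dict)
```

The point of the projection step is that algorithms run against a compact in-memory structure rather than against the transactional store.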

32
33
• HBase is a distributed column-oriented database built on top of the Hadoop file
system.

• It is an open-source project and is horizontally scalable.

• HBase's data model is similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data.

• It leverages the fault tolerance provided by the Hadoop File System (HDFS).

• It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.

34
• HBase deployment adheres to a master-worker pattern.

• Therefore, there is usually a master and a set of workers, commonly known as region servers.

• When HBase starts, the master allocates a set of regions to a region server.

• Each region stores an ordered set of rows, where each row is identified by a unique row-key.

• As the number of rows stored in a region grows in size beyond a configured threshold, the
region is split into two and rows are divided between the two new ranges.
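The split behavior can be sketched as follows; the threshold value and the median split point are illustrative simplifications of HBase's actual size-based split policy:

```python
# A sketch of region splitting (the threshold and the median split point are
# simplifications of HBase's actual size-based split policy).
SPLIT_THRESHOLD = 4  # assumed row-count threshold, for illustration

def maybe_split(region):
    """region is an ordered list of row-keys; split it once it grows too big."""
    if len(region) <= SPLIT_THRESHOLD:
        return [region]
    mid = len(region) // 2
    return [region[:mid], region[mid:]]  # rows divided between two new ranges

regions = maybe_split(["r01", "r02", "r03", "r04", "r05", "r06"])
```

Because each region holds an ordered, contiguous range of row-keys, splitting at a single key cleanly divides the range in two.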

35
• HBase stores columns in a column-family together.

• Therefore, each region maintains a separate store for each column-family

• Each store in turn maps to a physical file that is stored in the underlying distributed
filesystem.

• For each store, HBase abstracts access to the underlying filesystem with the help of a thin
wrapper that acts as the intermediary between the store and the underlying physical file.

36
▪ Each region has an in-memory store, or cache, and a write-ahead-log (WAL)
▪ When data is written to a region, it’s first written to the write-ahead-log.
▪ Soon afterwards, it’s written to the region’s in-memory store.
▪ If the in-memory store is full, data is flushed to disk and persisted in the underlying
distributed storage.
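The value of the write-ahead-log shows up on recovery: because every edit reaches the log before the in-memory store, the store can be rebuilt by replaying the log. A simplified sketch (names are illustrative, not HBase's internal API):

```python
# A simplified sketch of the WAL's purpose: every edit reaches the log before
# the in-memory store, so after a crash the store is rebuilt by replaying the
# log in order (names are illustrative, not HBase's internal API).
wal, memstore = [], {}

def write(row_key, value):
    wal.append((row_key, value))  # write-ahead-log first
    memstore[row_key] = value     # then the region's in-memory store

write("row1", "a")
write("row2", "b")
write("row1", "c")

memstore.clear()                  # simulate losing the in-memory store
for row_key, value in wal:        # recovery: replay the log in order
    memstore[row_key] = value
```

Replaying in log order is what guarantees the latest value for each row wins, exactly as before the crash.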

37
38
▪ Hadoop is a framework that enables processing of large data sets that reside on
clusters of machines.
▪ Hadoop is made up of several modules that are supported by a large ecosystem of
technologies.
▪ Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems.
▪ There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop
Common.

39
40
HOW IS HBASE DIFFERENT FROM OTHER NOSQL
MODELS
• HBase stores data in the form of key/value pairs in a columnar
model.
• In this model, all the columns are grouped together as Column
families.
• HBase provides a flexible data model and low latency access to
small amounts of data stored in large data sets.
• Running HBase on top of Hadoop increases the throughput and performance of a
distributed cluster setup.
• In turn, it provides faster random read and write operations.

41
42
43
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the ‘regex’ given in the command.

44
• put - Puts a cell value at a specified column in a specified
row in a particular table.
• get - Fetches the contents of row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.

45
▪ hbase> create 'emp', 'personal data', 'professional data'
▪ hbase> list
▪ TABLE
▪ emp
▪ hbase> scan 'emp'
▪ 1 column=personal data:city, timestamp=1417516501, value=hyderabad
▪ 1 column=personal data:name, timestamp=1417525058, value=ramu
▪ hbase> alter 'emp', NAME => 'personal data', VERSIONS => 5
▪ hbase> alter 'table name', 'delete' => 'column family'   // deleting a column
family

46
▪ hbase> put 'selva', 'r1', 'c1', 'value', 10
▪ hbase> put 'selva', 'r1', 'c1', 'value', 15
▪ hbase> put 'selva', 'r1', 'c1', 'value', 20

▪ hbase> get 'selva', 'r1', 'c1'

▪ hbase> get 'selva', 'r1'
▪ hbase> get 'selva', 'r1', {TIMERANGE => [ts1, ts2]}
▪ hbase> get 'selva', 'r1', {COLUMN => ['c1', 'c2', 'c3']}

▪ hbase> delete 'selva', 'r1', 'c1'

▪ hbase> deleteall 'selva', 'r1'

47
INDEXING IN MONGODB
▪ MongoDB uses indexing in order to make the query processing more efficient.
▪ If there is no index, then MongoDB must scan every document in the
collection to retrieve only those documents that match the query.
▪ Indexes are special data structures that store a small part of the collection's
data in a way that can be queried easily.
▪ The indexes are ordered by the value of the field specified in the index.
▪ In MongoDB, querying without indexes is called a collection scan. A collection
scan will:
• Result in various performance bottlenecks
• Significantly slow down your application
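The difference between a collection scan and an index lookup can be sketched as follows (a simplification: real MongoDB indexes are B-trees, not sorted arrays; the collection and field names here are made up):

```python
import bisect

# A sketch contrasting a collection scan with an index lookup (simplified:
# real MongoDB indexes are B-trees, not sorted arrays). The "index" holds
# (field value, document id) pairs ordered by the indexed field.
docs = {i: {"name": f"student{i}", "score": i % 100} for i in range(1000)}

def collection_scan(score):
    """No index: every document in the collection must be examined."""
    return [doc_id for doc_id, d in docs.items() if d["score"] == score]

index = sorted((d["score"], doc_id) for doc_id, d in docs.items())
index_keys = [score for score, _ in index]

def index_lookup(score):
    """With an index: binary-search the ordered entries instead."""
    lo = bisect.bisect_left(index_keys, score)
    hi = bisect.bisect_right(index_keys, score)
    return [doc_id for _, doc_id in index[lo:hi]]
```

Both functions return the same documents, but the scan touches all 1000 documents while the lookup examines only the matching index entries.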

48
▪ use students
▪ db.createCollection("studentgrades")
▪ db.studentgrades.insertMany(
▪ [
▪ {name: "Barry", subject: "Maths", score: 92},
▪ {name: "Kent", subject: "Physics", score: 87},
▪ {name: "Harry", subject: "Maths", score: 99, notes: "Exceptional Performance"},
▪ {name: "Alex", subject: "Literature", score: 78},
▪ {name: "Tom", subject: "History", score: 65, notes: "Adequate"}
▪ ]
▪ )
▪ db.studentgrades.find({},{_id:0})

49
50
▪ db.<collection>.createIndex(<Key and Index Type>, <Options>)

▪ When creating an index, you need to define the field to be indexed and the
direction of the key (1 or -1), indicating ascending or descending order.

▪ db.studentgrades.createIndex(
▪ {name: 1},
▪ {name: "student name index"}
▪ )

51
▪ db.<collection>.getIndexes()

▪ db.studentgrades.getIndexes()

52
▪ db.<collection>.dropIndex(<Index Name / Field Name>)

▪ db.studentgrades.dropIndex("student name index")

▪ db.studentgrades.dropIndexes()

53
• Single field index
• Compound index
• Multikey index

54
▪ db.studentgrades.createIndex({name: 1})

55
▪ db.studentgrades.createIndex({subject: 1, score: -1})

56
▪ MongoDB supports indexing array fields.
▪ These multikey indexes enable users to query documents using the elements within the array.
▪ MongoDB will automatically create a multikey index when it encounters an array field, without
requiring the user to explicitly define the multikey type.
▪ db.createCollection("studentperformance")
▪ db.studentperformance.insertMany(
▪ [
▪ {name: "Barry", school: "ABC Academy", grades: [85, 75, 90, 99] },
▪ {name: "Kent", school: "FX High School", grades: [74, 66, 45, 67]},
▪ {name: "Alex", school: "XYZ High", grades: [80, 78, 71, 89]},
▪ ]
▪ )
▪ db.studentperformance.find({},{_id:0}).pretty()

57
▪ Built-in (secondary) indexes are the best option on a table that has many
rows containing the indexed value.
▪ The more unique values a particular column holds, the more overhead there is,
on average, to query and maintain the index.
▪ Indexing therefore works best on low-cardinality columns, where many rows
share each indexed value.

58
▪ The command 'CREATE INDEX' creates an index on the column specified by the
user.
▪ After an index is created, Cassandra indexes new data automatically as it is
inserted.
▪ An index cannot be created on the primary key, as the primary key is already
indexed.
▪ Indexes on collections were not supported in early versions of Cassandra;
current versions do support them, as the collection examples later in this unit show.
▪ Without an index on a column, Cassandra cannot filter on that column unless it
is a primary key.

59
60
61
62
▪ SELECT * FROM ratings_by_movies WHERE rating=8;

63
▪ CREATE INDEX IndexName ON KeyspaceName.TableName(ColumnName);

▪ CREATE INDEX [ IF NOT EXISTS ] index_name
▪ ON [keyspace_name.]table_name
▪ ( column_name | [ KEYS | VALUES | ENTRIES | FULL ] (column_name) );
64
65
▪ The command 'DROP INDEX' drops the specified index.
• If the index does not exist, the command returns an error, unless IF EXISTS is used, in
which case it is a no-op.
• When dropping an index, specify the keyspace name along with the index name; otherwise,
the index is assumed to be in the current keyspace.
▪ Syntax:

▪ DROP INDEX IF EXISTS KeyspaceName.IndexName;

66
▪ After successful execution of the command, DeptIndex will be dropped from the keyspace.
Now data cannot be filtered by the column dept.

67
▪ Indexes can be created on multiple columns and used in queries.

▪ cqlsh> CREATE TABLE cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname
text, birthday timestamp, nationality text, weight text, height text );
▪ cqlsh> CREATE INDEX birthday_idx ON cycling.cyclist_alt_stats ( birthday );
▪ cqlsh> CREATE INDEX nationality_idx ON cycling.cyclist_alt_stats ( nationality );

▪ cqlsh> SELECT * FROM cycling.cyclist_alt_stats WHERE birthday = '1982-01-29'
AND nationality = 'Russia';

68
▪ The indexes have been created on appropriate low cardinality columns, but the query still
fails.
▪ The error is not due to multiple indexes, but the lack of a partition key definition in the query.
▪ When you attempt a potentially expensive query, such as searching a range of rows,
Cassandra requires the ALLOW FILTERING directive.
▪ cqlsh> SELECT * FROM cycling.cyclist_alt_stats WHERE birthday = '1990-05-27'
AND nationality = 'Portugal' ALLOW FILTERING;

69
▪ Collections can be indexed and queried to find a collection containing a particular value.
▪ Set and List collections

▪ CREATE INDEX team_idx ON cycling.cyclist_career_teams ( teams );

▪ SELECT * FROM cycling.cyclist_career_teams WHERE teams CONTAINS
'Nederland bloeit';

70
▪ Maps
▪ CREATE TABLE cycling.birthday_list (cyclist_name text PRIMARY KEY, blist
map<text,text>);
▪ CREATE INDEX blist_idx ON cycling.birthday_list (ENTRIES(blist));
▪ SELECT * FROM cycling.birthday_list WHERE blist['age'] = '23';

71
72
