NoSql-Unit-2
SELVA KUMAR S
B.M.S COLLEGE OF ENGINEERING
▪ NOSQL Storage Architecture
▪ Performing CRUD operations
▪ Querying NoSQL stores
▪ Modifying and Managing NOSQL Data stores
▪ Indexing and ordering datasets
UNDERSTANDING THE STORAGE ARCHITECTURE
▪ Column-oriented databases are among the most popular types of non-relational databases.
▪ Google's Bigtable publication established beyond a doubt that a cluster of inexpensive hardware can be leveraged to hold huge amounts of data, far more than a single machine can hold, and to process it effectively and efficiently within a reasonable time frame. Two requirements follow from this:
1. Data needs to be stored in a networked filesystem that can expand to multiple machines.
2. Data needs to be stored in a structure that provides more flexibility than traditional normalized relational database structures. The storage scheme needs to allow effective storage of huge, sparse data sets and to accommodate changing schemas without requiring alteration of the underlying tables.
WORKING WITH COLUMN-ORIENTED DATABASES
▪ 1) An RDBMS table has a few columns, sometimes tens of them. Millions of rows could potentially be held in a relational table, which may bring data access to a halt unless special considerations such as denormalization are applied.
▪ 2) As you begin to use your table, you may need to alter it to hold a few additional attributes. As newer records are stored, the existing records may carry null values for these attributes.
▪ 3) As the table keeps a greater variety of attributes, the likelihood of sparse data sets (sets with nulls in many cells) becomes increasingly real.
Consider that this data is evolving and you have to store each version of a cell value as it evolves.
Think of it like a three-dimensional Excel spreadsheet, where the third dimension is time. The values as they evolve through time can then be thought of as cell values in multiple spreadsheets placed one behind the other in chronological order.
Therefore, altering the table as the data evolves, storing a lot of sparse cells, and working through value versions can all get complex.
CONTRASTING COLUMN DATABASES WITH RDBMS
• First and foremost, a column-oriented database imposes minimal need for upfront schema
definition and can easily accommodate newer columns as the data evolves.
▪ Each row of a column-oriented database table stores data values in only those columns for
which it has valid values.
▪ Continuously evolving data would get stored in a column database as shown in the figure.
▪ SELECT * FROM emp WHERE id = 1
▪ SELECT first_name FROM emp WHERE ssn = 666
• Cassandra is designed such that it has no master or slave
nodes.
• It has a ring-type architecture, that is, its nodes are logically
distributed like a ring.
• Data is automatically distributed across all the nodes.
• Data is replicated across the nodes for redundancy.
• Data is kept in memory and lazily written to the disk.
• Hash values of the keys are used to distribute the data
among nodes in the cluster.
1. Data is written to a commit log on disk.
2. The data is sent to the responsible node based on the hash value of its key.
3. Nodes write the data to an in-memory table called a memtable.
4. From the memtable, data is flushed to an SSTable on disk. SSTable stands for Sorted String Table; it holds the consolidated data of all the updates to the table.
5. From the SSTable, data is updated to the actual table.
6. If the responsible node is down, the data is written to another node identified as a tempnode. The tempnode holds the data temporarily until the responsible node comes alive.
• Data on the same node is given first preference and is considered data local.
• Data on the same rack is given second preference and is considered rack local.
• Data in the same data center is given third preference and is considered data center local.
• Data in a different data center is given the least preference.
▪ Data in the memtable and SSTables is checked first, so that data already in memory can be retrieved faster.
▪ Cassandra performs transparent distribution of data by horizontally partitioning it in the following manner:
• A hash value is calculated based on the primary key of the data.
• The hash value of the key is mapped to a node in the cluster.
• The first copy of the data is stored on that node.
• The distribution is transparent, as you can both calculate the hash value and determine where a particular row will be stored.
▪ The following diagram depicts a four-node cluster with token values of 0, 25, 50, and 75. Each node owns the range of hash values ending at its own token, so, for example, a key whose hash falls between 50 and 75 would typically be placed on the node with token 75.
▪ Cassandra uses a gossip protocol to communicate with nodes in a cluster.
▪ It is an inter-node communication mechanism similar to the heartbeat protocol in
Hadoop.
▪ The gossip process runs periodically on each node and exchanges state
information with three other nodes in the cluster.
▪ Eventually, information is propagated to all cluster nodes.
▪ MongoDB achieves replication using the concept of replica sets.
▪ A replica set is a group of mongod instances that host the same data set.
▪ One of the nodes is selected as the primary or main node.
▪ The primary node receives all the operations from the user, and the secondaries are updated from the primary by applying the same operations, so as to maintain consistency.
▪ If the primary node goes down, one of the secondary nodes is elected as the new primary and operations are carried forward.
▪ When the fallen node recovers, it rejoins the cluster as a secondary node.
▪ We can control our cluster of mongo instances using MongoDB Atlas.
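▪ A minimal sketch of standing up such a replica set from the mongo shell; the set name rs0 and the host names are illustrative assumptions, not values from these slides:
▪ rs.initiate({
▪   _id: "rs0",  // assumed replica set name
▪   members: [
▪     { _id: 0, host: "mongo1.example.com:27017" },  // eligible to become primary
▪     { _id: 1, host: "mongo2.example.com:27017" },
▪     { _id: 2, host: "mongo3.example.com:27017" }
▪   ]
▪ })
▪ rs.status()  // shows which member is primary and which are secondaries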
▪ Sharding is used by MongoDB to store data across multiple machines.
▪ It uses horizontal scaling, adding more machines to distribute data and operations as load and demand grow.
▪ A sharding arrangement in MongoDB has mainly three components:
▪ 1. Shards (each usually a replica set)
▪ 2. Configuration servers
▪ 3. Query routers
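▪ A hedged sketch of how these components are exercised from the mongo shell once a sharded cluster is running; the database name, collection name, and shard key are assumptions for illustration:
▪ sh.enableSharding("school")  // allow collections in this database to be sharded
▪ sh.shardCollection("school.studentgrades", { name: "hashed" })  // a hashed shard key spreads documents evenly across shards
▪ sh.status()  // reports shards, config server state, and chunk distribution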
1. Monitoring: ensuring that the main and secondary instances are working as expected.
2. Notification: notifying system admins about occurrences in the Redis instances.
3. Failover management: Sentinel nodes can start a failover process if the primary instance isn't available and enough (a quorum of) nodes agree that this is true.
4. Configuration management: Sentinel nodes also serve as a point of discovery for the current main Redis instance.
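▪ A minimal sentinel.conf sketch covering these duties; the master name mymaster, the address, quorum, and timeouts are illustrative assumptions:
▪ # watch a primary named "mymaster"; 2 sentinels must agree it is down (the quorum)
▪ sentinel monitor mymaster 192.168.1.10 6379 2
▪ # declare the primary down after 5 seconds without a valid reply
▪ sentinel down-after-milliseconds mymaster 5000
▪ # abort a failover attempt that takes longer than 60 seconds
▪ sentinel failover-timeout mymaster 60000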
▪ At any given time, a Redis Enterprise cluster node can host anywhere from zero to a few hundred Redis databases, each of one of the following types:
• A simple database, i.e. a single primary shard
• A highly available (HA) database, i.e. a pair of primary and
replica shards
• A clustered database, which contains multiple primary shards,
each managing a subset of the dataset (or in Redis terms, a
different range of “hash-slots”)
• An HA clustered database, i.e. multiple pairs of primary/replica
shards
1. Reading the graph data from the Neo4j database
2. Loading (projecting) the data into an in-memory graph
3. Running an algorithm on the projected graph
4. Writing the results back to the Neo4j database (if the algorithm runs in write mode)
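▪ A hedged Cypher sketch of this workflow using the Graph Data Science library (GDS 2.x syntax); the graph name, node label, relationship type, and the choice of PageRank are illustrative assumptions:
▪ // project Person nodes and KNOWS relationships into an in-memory graph
▪ CALL gds.graph.project('myGraph', 'Person', 'KNOWS');
▪ // run PageRank on the projected graph in write mode, persisting
▪ // each node's score back to the Neo4j database
▪ CALL gds.pageRank.write('myGraph', { writeProperty: 'pagerank' })
▪ YIELD nodePropertiesWritten;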
• HBase is a distributed column-oriented database built on top of the Hadoop file
system.
• HBase has a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
• It leverages the fault tolerance provided by the Hadoop File System (HDFS).
• HBase deployment adheres to a master-worker pattern.
• Therefore, there is usually a master and a set of workers, commonly known as region servers.
• When HBase starts, the master allocates a set of regions to a region server.
• Each region stores an ordered set of rows, where each row is identified by a unique row-key.
• As the number of rows stored in a region grows in size beyond a configured threshold, the
region is split into two and rows are divided between the two new ranges.
• HBase stores the columns of a column family together in a store.
• Each store in turn maps to a physical file in the underlying distributed filesystem.
• For each store, HBase abstracts access to the underlying filesystem with a thin wrapper that acts as the intermediary between the store and the underlying physical file.
▪ Each region has an in-memory store, or cache, and a write-ahead log (WAL).
▪ When data is written to a region, it is first written to the write-ahead log.
▪ Soon afterwards, it is written to the region's in-memory store.
▪ If the in-memory store is full, data is flushed to disk and persisted in the underlying distributed storage.
▪ Hadoop is a framework that enables processing of large data sets residing on clusters of machines.
▪ Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
▪ The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems.
▪ There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common.
HOW IS HBASE DIFFERENT FROM OTHER NOSQL MODELS
• HBase stores data in the form of key/value pairs in a columnar
model.
• In this model, all the columns are grouped together as Column
families.
• HBase provides a flexible data model and low latency access to
small amounts of data stored in large data sets.
• Running HBase on top of Hadoop increases the throughput and performance of a distributed cluster setup.
• In turn, it provides faster random read and write operations.
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the ‘regex’ given in the command.
• put - Puts a cell value at a specified column in a specified
row in a particular table.
• get - Fetches the contents of a row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
▪ hbase> create 'emp', 'personal data', 'professional data'
▪ hbase> list
▪ TABLE
▪ emp
▪ hbase> scan 'emp'
▪ 1 column = personal data:city, timestamp = 1417516501, value = hyderabad
▪ 1 column = personal data:name, timestamp = 1417525058, value = ramu
▪ hbase> alter 'emp', NAME => 'personal data', VERSIONS => 5
▪ hbase> alter 'table name', 'delete' => 'column family'  # deleting a column family
▪ hbase> put 'selva', 'r1', 'c1', 'value', 10
▪ hbase> put 'selva', 'r1', 'c1', 'value', 15
▪ hbase> put 'selva', 'r1', 'c1', 'value', 20
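▪ Reading these as three versions of the same cell (interpreting 10, 15, and 20 as the timestamps in put's table/row/column/value/timestamp form), a hedged sketch of reading the versions back:
▪ hbase> get 'selva', 'r1', {COLUMN => 'c1', VERSIONS => 3}  # returns up to 3 stored versions of the cell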
INDEXING IN MONGODB
▪ MongoDB uses indexing to make query processing more efficient.
▪ Without an index, MongoDB must scan every document in the collection to retrieve only those documents that match the query.
▪ Indexes are special data structures that store a small part of the collection's data in a way that can be queried easily.
▪ The indexes are ordered by the value of the field specified in the index.
▪ In MongoDB, querying without indexes is called a collection scan. A collection scan will:
• Result in various performance bottlenecks
• Significantly slow down your application
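▪ One hedged way to observe a collection scan is the explain() helper on a query; the plan stage names below reflect standard MongoDB output:
▪ db.studentgrades.find({name: "Barry"}).explain("executionStats")
▪ // the winning plan shows a COLLSCAN stage while no usable index exists,
▪ // and an IXSCAN stage once an index on the queried field is in place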
▪ use students
▪ db.createCollection("studentgrades")
▪ db.studentgrades.insertMany(
▪ [
▪ {name: "Barry", subject: "Maths", score: 92},
▪ {name: "Kent", subject: "Physics", score: 87},
▪ {name: "Harry", subject: "Maths", score: 99, notes: "Exceptional Performance"},
▪ {name: "Alex", subject: "Literature", score: 78},
▪ {name: "Tom", subject: "History", score: 65, notes: "Adequate"}
▪ ]
▪ )
▪ db.studentgrades.find({},{_id:0})
▪ db.<collection>.createIndex(<Key and Index Type>, <Options>)
▪ db.studentgrades.createIndex(
▪ {name: 1},
▪ {name: "student name index"}
▪ )
▪ db.<collection>.getIndexes()
▪ db.studentgrades.getIndexes()
▪ db.<collection>.dropIndex(<Index Name / Field Name>)
▪ db.studentgrades.dropIndexes()
• Single field index
• Compound index
• Multikey index
▪ db.studentgrades.createIndex({name: 1})
▪ db.studentgrades.createIndex({subject: 1, score: -1})
▪ MongoDB supports indexing array fields.
▪ These multikey indexes enable users to query documents using the elements within the array.
▪ MongoDB automatically creates a multikey index when it encounters an array field, without requiring the user to explicitly define the multikey type.
▪ db.createCollection("studentperformance")
▪ db.studentperformance.insertMany(
▪ [
▪ {name: "Barry", school: "ABC Academy", grades: [85, 75, 90, 99] },
▪ {name: "Kent", school: "FX High School", grades: [74, 66, 45, 67]},
▪ {name: "Alex", school: "XYZ High", grades: [80, 78, 71, 89]},
▪ ]
▪ )
▪ db.studentperformance.find({},{_id:0}).pretty()
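▪ As a hedged illustration, indexing the grades array field below yields a multikey index, and a query can then match individual array elements (the value 90 is simply one of Barry's grades above):
▪ db.studentperformance.createIndex({grades: 1})  // becomes a multikey index because grades holds arrays
▪ db.studentperformance.find({grades: 90}, {_id: 0})  // matches documents whose grades array contains 90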
▪ Built-in indexes are the best option on a table that has many rows containing the indexed value.
▪ The more unique values a particular column holds, the more overhead there is, on average, to query and maintain the index.
▪ Indexing therefore works best on low-cardinality columns, where many rows share each indexed value.
▪ The 'CREATE INDEX' command creates an index on the column specified by the user.
▪ After creating an index, Cassandra indexes new data automatically when data is inserted.
▪ The index cannot be created on the primary key, as a primary key is already indexed.
▪ Indexes on collections were not supported in early versions of Cassandra; current versions support them, as shown later.
▪ Without an index on a column, Cassandra can't filter that column unless it is a primary key.
▪ SELECT * FROM ratings_by_movies WHERE rating = 8;
▪ CREATE INDEX IndexName ON KeyspaceName.TableName (ColumnName);
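▪ For instance, a hedged example of the index that would let the earlier ratings query filter on rating; the index and keyspace names are assumptions:
▪ CREATE INDEX rating_idx ON movielens.ratings_by_movies (rating);
▪ SELECT * FROM movielens.ratings_by_movies WHERE rating = 8;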
▪ After successful execution of the command, DeptIndex is dropped from the keyspace. Data can then no longer be filtered by the column dept.
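▪ The command itself is not shown on the slide; a hedged reconstruction of the DROP INDEX syntax it refers to, with the keyspace name left as a placeholder:
▪ DROP INDEX IF EXISTS KeyspaceName.DeptIndex;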
▪ Indexes can be created on multiple columns and used in queries.
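▪ A hedged sketch using the cycling keyspace that appears below; creating one secondary index per column lets both columns be referenced in a query (the index names are assumptions):
▪ CREATE INDEX birthday_idx ON cycling.cyclist_alt_stats (birthday);
▪ CREATE INDEX nationality_idx ON cycling.cyclist_alt_stats (nationality);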
▪ The indexes have been created on appropriate low-cardinality columns, but the query still fails.
▪ The error is not due to the multiple indexes, but to the lack of a partition key definition in the query.
▪ When you attempt a potentially expensive query, such as searching a range of rows, Cassandra requires the ALLOW FILTERING directive.
▪ cqlsh> SELECT * FROM cycling.cyclist_alt_stats WHERE birthday = '1990-05-27' AND nationality = 'Portugal' ALLOW FILTERING;
▪ Collections can be indexed and queried to find a collection containing a particular value.
▪ Set and List collections are indexed on the column itself, as sketched below.
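▪ A hedged CQL sketch for a set-valued column, modeled on the cycling examples; the table definition, index name, and team value are assumptions:
▪ CREATE TABLE cycling.cyclist_career_teams (id UUID PRIMARY KEY, lastname text, teams set<text>);
▪ CREATE INDEX team_idx ON cycling.cyclist_career_teams (teams);
▪ -- CONTAINS finds rows whose collection includes the given value
▪ SELECT lastname, teams FROM cycling.cyclist_career_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';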
▪ Maps
▪ CREATE TABLE cycling.birthday_list (cyclist_name text PRIMARY KEY, blist
map<text,text>);
▪ CREATE INDEX blist_idx ON cycling.birthday_list (ENTRIES(blist));
▪ SELECT * FROM cycling.birthday_list WHERE blist['age'] = '23';