The document discusses the business drivers for NoSQL databases, including volume, velocity, variability, and agility, highlighting their advantages over traditional RDBMS. It explains the CAP theorem, the base properties of NoSQL, and different types of NoSQL databases, such as key-value, document, column, and graph databases. Additionally, it covers concepts like shared nothing architecture, sharding, consistent hashing, Bloom filters, and applications of data visualization.
BDA IA2 - bda

B.Tech (Pillai College of Engineering)


1) Business drivers of NoSQL


Volume:
The need to scale out (also known as horizontal scaling), rather than scale up (buying faster processors), moved organizations from serial to parallel processing, where a data problem is split into separate paths and sent to separate processors to divide and conquer the work.

Velocity:
How quickly data is generated and how quickly that data moves, which strains the ability of a single-processor system to rapidly read and write data. When a single-processor RDBMS is used as the back end to a web storefront, random bursts in web traffic slow down response for everyone, and tuning these systems can be costly when both high read and high write throughput are desired.

Variability:
The number of inconsistencies in the data. Capturing and reporting on exception data is a struggle under the rigid schema structures imposed by RDBMS systems. For example, if a business unit wants to capture a few custom fields for a particular customer, every customer row in the database must store this information even where it doesn't apply. Adding new columns to an RDBMS requires the system to be shut down and ALTER TABLE commands to be run. When a database is large, this process can impact system availability, losing time and money.

Agility:
The speed of putting data into and getting data out of the database. If your data has nested and repeated subgroups, you need to include an object-relational mapping layer whose responsibility is to generate the correct combination of INSERT, UPDATE, DELETE and SELECT SQL statements to move object data to and from the RDBMS persistence layer. This process is not simple, and it is one of the largest barriers to rapid change when developing new applications or modifying existing ones.

2) CAP theorem for NoSQL


CAP stands for Consistency, Availability, and Partition tolerance.

Consistency: All nodes in the network see the same data at the same time. A transaction cannot be executed partially; it is always all or none. If something goes wrong in the middle of a transaction, the whole transaction needs to be rolled back.

Availability: A guarantee that every request receives a response indicating whether it was successful or failed. However, it does not guarantee that a read request returns the most recent write. The more users a system can cater to, the better its availability.


Partition Tolerance is a guarantee that the system continues to operate despite arbitrary message loss or failure of part of the system.

- A single-node RDBMS provides consistency and availability but not partition tolerance (CA). HBase and Redis provide consistency and partition tolerance (CP), while MongoDB, CouchDB, Cassandra, and Dynamo favour availability and partition tolerance (AP) over strong consistency. Such databases generally settle for eventual consistency, meaning that after a while the system converges back to a consistent state.

3) BASE properties of NoSQL

Basically Available: NoSQL databases spread data across many storage systems with a high degree of replication. In the unlikely event that a failure disrupts access to a segment of data, this does not necessarily result in a complete database outage.

Soft state indicates that the state of the system may change over time, even without
input. This is because of the eventual consistency model.

Eventual consistency indicates that the system will become consistent over time.

4) NoSQL databases (aka "not only SQL") are non-tabular databases that store data differently from relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.
1. Key-value store: data is stored as simple key-value pairs.
2. Column store: columns are grouped into column families, and each column family may itself contain multiple columns, unlike the fixed columns of a traditional database.
3. Document store: key-value pairs in which the values, called documents, are structured data in a form such as text, arrays, strings, JSON, or XML.
4. Graph database: this architecture pattern deals with the storage and management of data as graphs. Graphs are structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier, and each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics connected by edges. Whereas the relational pattern has tables that are only loosely connected, the relationships in a graph are explicit and central to the model.
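The four data models above can be sketched with plain Python dictionaries (no real database involved); the keys, names, and record contents here are invented purely for illustration.

```python
# The same user record in each of the four NoSQL data models,
# modelled with plain Python dicts. All data here is made up.

# 1. Key-value store: an opaque value looked up by a single key.
kv_store = {"user:42": '{"name": "Asha", "city": "Mumbai"}'}

# 2. Column store: a row key maps to column families, each holding columns.
column_store = {
    "42": {
        "profile": {"name": "Asha", "city": "Mumbai"},   # column family 1
        "activity": {"last_login": "2023-01-05"},        # column family 2
    }
}

# 3. Document store: the value is a structured, queryable document.
document_store = {"42": {"name": "Asha", "address": {"city": "Mumbai"}}}

# 4. Graph store: nodes joined by labelled edges.
nodes = {"42": {"name": "Asha"}, "99": {"name": "Ravi"}}
edges = [("42", "follows", "99")]

# A document store can reach inside the value; a key-value store cannot.
print(document_store["42"]["address"]["city"])  # prints Mumbai
```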


5)

6) Shared nothing
Shared Nothing Architecture (SNA) is a distributed computing architecture that
consists of multiple separated nodes that don’t share resources. The nodes are
independent and self-sufficient as they have their own disk space and memory.

In such a system, the data set/workload is split into smaller sets and distributed across different nodes of the system. Each node has its own memory, storage, and independent input/output interfaces. It communicates and synchronizes with the other nodes through a high-speed interconnect network. Such a connection ensures low latency, high bandwidth, and high availability (with a backup interconnect available in case the primary fails).

This design makes it possible to scale the distributed system horizontally and increase its transmission capacity.

SNA has no shared resources. The only thing connecting the nodes is the network layer,
which manages the system and communication among nodes.
Advantages:


Easier to Scale

Eliminates Single Points of Failure

Simplifies Upgrades and Prevents Downtime

Disadvantages:
Cost

A node consists of its individual processor, memory, and disk. Having dedicated
resources essentially means higher costs when it comes to setting up the system.
Additionally, transmitting data that requires software interaction is more expensive
compared to architectures with shared disk space and/or memory.

Decreased Performance

Scaling up your system can eventually affect overall performance if the cross-communication layer isn't set up correctly.

7) Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows a larger dataset to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system.

Similarly, by distributing the data across multiple machines, a sharded database can
handle more requests than a single machine can.

Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads.
There is overhead and complexity in setting up shards, maintaining the data on each
shard, and properly routing requests across those shards.

Why use sharding:


Sharding makes the database smaller
Sharding makes the database faster
Sharding makes the database much more easily manageable
Sharding reduces the transaction cost of the database
Sharding can, however, be a complex operation to set up at times
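The routing idea behind sharding can be sketched in a few lines of Python. This is a minimal hash-modulo sketch with a made-up shard list; real systems often use range-based or directory-based sharding instead.

```python
import hashlib

# Hypothetical shard names for illustration only.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    """Route a key to one shard deterministically: hash the key,
    then take the hash modulo the number of shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard, so reads find
# the data that writes stored.
assert shard_for("user:42") == shard_for("user:42")

# Note the drawback: changing len(SHARDS) remaps almost every key,
# which is exactly the rehashing problem consistent hashing addresses.
```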

8)
A diagram of a data stream management system (DSMS) is given in TechKnowledge; follow that diagram, and explain the two types of stream queries:
1) Standing queries: queries that are stored permanently in the stream processor and executed continuously, producing new output each time a stream element arrives.


2) Ad-hoc queries: one-time questions asked about the current state of a stream (or a stored summary or sample of it), answered once rather than re-executed.
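The difference between the two query types can be sketched in Python, with the stream modelled as a plain list that grows over time and invented sensor readings:

```python
# Minimal sketch: a "stream" as a growing list, with one standing query
# and one ad-hoc query over it. The readings are made-up values.

stream = []
running_max = None  # state maintained by the standing query

def on_arrival(x):
    """Standing query: registered once, re-evaluated on every arrival.
    Here it continuously maintains the maximum value seen so far."""
    global running_max
    running_max = x if running_max is None else max(running_max, x)

for reading in [3, 7, 5]:       # elements arriving over time
    stream.append(reading)
    on_arrival(reading)

# Ad-hoc query: asked once, against whatever the stream holds right now.
current_average = sum(stream) / len(stream)

print(running_max, current_average)  # prints 7 5.0
```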

9) Consistent hashing

With naive hashing (hash of the key modulo the number of servers), rehashing is a problem: if a server crashes, the server location changes for almost all keys, including keys held on servers that did not crash. This increases load on the origin, because those keys now miss the cache and must all be rehashed and refetched.

Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning each of them a position on an abstract circle, or hash ring. This allows servers and objects to be added or removed without affecting the overall system.
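The hash-ring idea can be sketched in a few lines of Python. The server names and key below are made up, and a production ring would also place multiple virtual nodes per server to balance load.

```python
import bisect
import hashlib

def _hash(item: str) -> int:
    """Map any string to a position on the ring."""
    return int(hashlib.md5(item.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Sorted (position, server) points on the abstract circle.
        self.points = sorted((_hash(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        """A key belongs to the first server clockwise from its position."""
        pos = _hash(key)
        i = bisect.bisect(self.points, (pos, "")) % len(self.points)
        return self.points[i][1]

ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.lookup("user:42")

# Remove a server the key does NOT live on: the key keeps its owner,
# unlike modulo hashing where nearly every key would be remapped.
others = [s for s in ["server-a", "server-b", "server-c"] if s != owner]
smaller = HashRing([others[0], owner])
assert smaller.lookup("user:42") == owner
```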

10) Bloom filter


A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently,
whether an element is present in a set.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it
tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a bit vector.

● Unlike a standard hash table, a Bloom filter of a fixed size can represent a set
with an arbitrarily large number of elements.
● Adding an element never fails. However, the false positive rate increases steadily
as elements are added until all bits in the filter are set to 1, at which point all
queries yield a positive result.
● Bloom filters never generate false negative results, i.e., they will never tell you that a username doesn’t exist when it actually does.
● Deleting elements from the filter is not possible: if we deleted a single element by clearing the bits at the indices generated by its k hash functions, we might also delete a few other elements that share those bits.
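A minimal Bloom filter sketch in Python, deriving the k hash functions by salting a single hash; the sizes m and k below are arbitrary choices for illustration, not tuned values.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 256, k: int = 3):
        self.m, self.k = m, k
        self.bits = [0] * m          # the underlying bit vector

    def _positions(self, item: str):
        """k bit positions for an item, from k salted hashes."""
        for salt in range(self.k):
            h = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1       # adding never fails

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means only *maybe* present."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
assert bf.might_contain("alice")     # no false negatives, ever
# A query for an element never added is *probably* False, but can be a
# false positive if its k positions happen to collide with set bits.
```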


The applications of Bloom Filter are:

● Weak password detection


● Internet Cache Protocol
● Safe browsing in Google Chrome
● Wallet synchronization in Bitcoin
● Hash-based IP Traceback
11) DGIM
a) https://medium.com/fnplus/dgim-algorithm-169af6bb3b0c

12) Data visualization in R: https://www.google.com/amp/s/www.geeksforgeeks.org/data-visualization-in-r/amp/

Different applications of data visualization


● Healthcare Industries
● Business intelligence
● Military
● Data Science
● Finance industries
● Real estate business
● Food delivery apps
● Marketing
