bda-ia2-bda
Velocity: refers to how quickly data is generated and how quickly that data moves, and to
the ability of a single-processor system to rapidly read and write data. When
single-processor RDBMSs are used as the back end to a web storefront, random bursts in
web traffic slow down response for everyone, and tuning these systems can be costly
when both high read and write throughput are desired.
Variability:
The number of inconsistencies in the data. Capturing and reporting on exception data
is a struggle under the rigid database schema structures imposed by RDBMS
systems. For example, if a business unit wants to capture a few custom fields for a
particular customer, every customer row in the database must store this information
even though it doesn't apply to most of them. Adding new columns to an RDBMS requires
the system to be shut down and ALTER TABLE commands to be run. When the database is
large, this process can impact system availability, costing time and money.
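As a contrast, a schemaless, document-style store lets individual records carry extra fields without touching the others. A minimal Python sketch, where the customer names and the custom field names are invented for illustration:

```python
# Document-style records can carry custom fields per customer
# without an ALTER TABLE or any downtime; only the rows that
# need the extra fields actually store them.
customers = [
    {"id": 1, "name": "Acme Corp"},
    # Only this customer carries the extra custom fields.
    {"id": 2, "name": "Globex", "loyalty_tier": "gold", "referrer": "web"},
]

for c in customers:
    # Missing fields simply read as a default instead of NULL columns.
    print(c["id"], c.get("loyalty_tier", "n/a"))
```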
Agility: the ease of putting data into and getting data out of the database. If your data has
nested and repeated subgroups of data structures, you need to include an object-relational
mapping layer. The responsibility of this layer is to generate the correct combination of
INSERT, UPDATE, DELETE and SELECT SQL statements to move object data to and
from the RDBMS persistence layer. This process is not simple, and it is one of the
largest barriers to rapid change when developing new applications or modifying existing ones.
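What such a mapping layer does can be sketched in a few lines of Python; `to_insert` is a hypothetical helper shown for illustration, not a real ORM API, and it only handles the flat case (nested and repeated subgroups are exactly what makes real mappers complicated):

```python
# Hypothetical sketch of one job of an object-relational mapping layer:
# turn an in-memory object into the SQL statement that persists it.
def to_insert(table, obj):
    cols = ", ".join(obj)                       # column list from the keys
    placeholders = ", ".join(["?"] * len(obj))  # one placeholder per value
    sql = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    return sql, tuple(obj.values())

sql, params = to_insert("customer", {"id": 1, "name": "Acme"})
# sql    -> "INSERT INTO customer (id, name) VALUES (?, ?)"
# params -> (1, "Acme")
```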
Consistency: refers to all nodes in the network seeing the same data at the same
time. A transaction cannot be executed partially; it is always 'all or nothing'. If
something goes wrong in the middle of executing a transaction, the whole transaction
needs to be rolled back.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary
message loss or failure of part of the system.
- RDBMSs provide consistency and availability but not partition tolerance (CA),
while HBase and Redis provide consistency and partition tolerance (CP).
MongoDB, CouchDB, Cassandra, and Dynamo favour availability and partition
tolerance (AP) over strict consistency. Such databases generally settle for
eventual consistency, meaning that after a while the system will converge to a
consistent state.
Basic Availability: NoSQL databases spread data across many storage systems with a
high degree of replication. In the unlikely event that a failure disrupts access to a
segment of data, this does not necessarily result in a complete database outage.
Soft state indicates that the state of the system may change over time, even without
input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
4) NoSQL databases (aka "not only SQL") are non-tabular databases and store data
differently than relational tables. NoSQL databases come in a variety of types based on
their data model. The main types are document, key-value, wide-column, and graph.
They provide flexible schemas and scale easily with large amounts of data and high user
loads.
1. Key-Value Store Database (data is stored as simple key-value pairs)
2. Column Store Database (data is grouped into column families; each column family may
contain many related columns, rather than the fixed columns of traditional databases)
3. Document Database (key-value pairs in which the values, called documents, can be
text, arrays, strings, JSON, XML or any such semi-structured format)
4. Graph Database
(
Clearly, this architecture pattern deals with the storage and management of data as
graphs. Graphs are structures that depict connections between two or more
objects in some data. The objects or entities are called nodes and are joined together
by relationships called edges. Each edge has a unique identifier, and each node serves as
a point of contact for the graph. This pattern is very commonly used in social networks,
where there are a large number of entities and each entity has one or many
characteristics connected by edges. Tables in the relational pattern are only loosely
connected, whereas the relationships in a graph are explicit and strongly
connected.
)
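A toy illustration of the graph model in Python, with nodes, labelled edges, and per-edge identifiers; the people and the FOLLOWS relationship are invented for the example:

```python
# Entities as nodes, named relationships as edges,
# each edge with its own unique identifier.
nodes = {"alice": {"type": "person"}, "bob": {"type": "person"}}
edges = [
    {"id": "e1", "from": "alice", "to": "bob", "label": "FOLLOWS"},
]

def followers_of(user):
    # Traverse the edges instead of joining tables.
    return [e["from"] for e in edges
            if e["to"] == user and e["label"] == "FOLLOWS"]

# followers_of("bob") -> ["alice"]
```

A graph database answers this kind of traversal directly, whereas an RDBMS would express it as a self-join over a relationship table.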
5)
6) Shared nothing
Shared Nothing Architecture (SNA) is a distributed computing architecture that
consists of multiple separated nodes that don’t share resources. The nodes are
independent and self-sufficient as they have their own disk space and memory.
In such a system, the data set/workload is split into smaller sets and distributed across
different nodes of the system. Each node has its own memory, storage, and independent
input/output interfaces. It communicates and synchronizes with other nodes through a
high-speed interconnect network. Such a connection ensures low latency, high
bandwidth, as well as high availability (with a backup interconnect available in case the
primary fails).
This design makes it possible to scale the distributed system horizontally and to increase transmission capacity.
SNA has no shared resources. The only thing connecting the nodes is the network layer,
which manages the system and communication among nodes.
Advantages:
Easier to Scale
Disadvantages
Cost
A node consists of its individual processor, memory, and disk. Having dedicated
resources essentially means higher costs when it comes to setting up the system.
Additionally, transmitting data that requires software interaction is more expensive
compared to architectures with shared disk space and/or memory.
Decreased Performance
Scaling up your system can eventually affect the overall performance if the
cross-communication layer isn't set up correctly.
7) Sharding is a method for distributing a single dataset across multiple databases, which
can then be stored on multiple machines. This allows for larger datasets to be split into
smaller chunks and stored in multiple data nodes, increasing the total storage capacity
of the system.
Similarly, by distributing the data across multiple machines, a sharded database can
handle more requests than a single machine can.
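A minimal hash-based sharding sketch in Python, assuming a fixed number of shards and using in-memory dicts to stand in for the data nodes; the shard count and key names are arbitrary choices:

```python
import hashlib

NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # dicts stand in for data nodes

def shard_for(key):
    # Stable hash of the key, so the same key always routes
    # to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Asha"})
# get("user:42") -> {"name": "Asha"}
```

Because each key deterministically maps to one shard, reads and writes for different keys can be served by different machines in parallel.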
8)
The data stream management system is given in Tech-Knowledge; follow that. Only the
two kinds of stream queries need to be explained:
1) Standing queries: registered once and evaluated continuously, each time a new
stream element arrives.
2) Ad-hoc queries: asked once, and answered from the current state of the stream or a
stored summary of it.
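The difference between the two can be sketched in Python; the temperature stream, the 30-degree alert threshold, and the window size are invented for illustration:

```python
readings = [21.5, 22.0, 35.2, 22.1]  # e.g. a temperature stream

# 1) Standing query: registered permanently, checked on every
#    arriving element.
alerts = []
def standing_query(reading):
    if reading > 30.0:
        alerts.append(reading)

window = []
for r in readings:
    standing_query(r)       # evaluated continuously as data arrives
    window.append(r)
    window = window[-3:]    # keep only a sliding window of recent data

# 2) Ad-hoc query: asked once, answered from the current window.
avg_recent = sum(window) / len(window)
```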
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it
tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a bit vector.
● Unlike a standard hash table, a Bloom filter of a fixed size can represent a set
with an arbitrarily large number of elements.
● Adding an element never fails. However, the false positive rate increases steadily
as elements are added until all bits in the filter are set to 1, at which point all
queries yield a positive result.
● Bloom filters never generate a false negative result, i.e., they will never tell you that
a username doesn’t exist when it actually does.
● Deleting elements from the filter is not possible, because if we deleted a single
element by clearing the bits at the indices generated by its k hash functions, we
might also delete other elements that share those bits.
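The points above can be tied together in a small Python sketch of a Bloom filter; the bit-vector size (64) and number of hash functions (k = 3) are arbitrary choices for the example:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, k=3):
        self.size, self.k = size, k
        self.bits = [0] * size          # the bit vector

    def _indices(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1          # adding never fails

    def might_contain(self, item):
        # True  = "may be in the set" (small false-positive chance)
        # False = "definitely not in the set" (no false negatives)
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("alice")
# bf.might_contain("alice") -> True (no false negatives)
# bf.might_contain("zoe")   -> probably False, but could be a false positive
```

Note there is no `delete` method, and as more elements are added the fraction of set bits grows, which is exactly why the false-positive rate rises over time.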