MODULE 3 (3)
1. Schema Flexibility:
○ NoSQL databases often have a schema-less or flexible schema model. This
means they can store unstructured or semi-structured data without needing to
define the structure in advance.
○ This allows data to evolve over time, making it ideal for use cases where data
requirements change frequently.
2. Horizontal Scalability:
○ NoSQL databases are designed for horizontal scaling, meaning you can add
more servers or nodes to handle increased load, rather than upgrading a
single server (vertical scaling).
○ This is crucial for handling large-scale data and provides cost-effective scaling
solutions for growing data needs.
3. Replication:
○ NoSQL systems often support data replication across multiple nodes to
ensure high availability and fault tolerance.
○ This means that even if some nodes fail, the data remains accessible from
other replicated nodes.
4. Sharding (Partitioning):
○ Sharding involves breaking down a large dataset into smaller, more
manageable pieces, called shards, which are stored across multiple servers.
○ Each shard contains a subset of the total data and operates independently,
helping distribute the data and load across the system.
5. BASE Properties:
○ Many NoSQL databases follow BASE (Basically Available, Soft state, Eventual
consistency) rather than the ACID (Atomicity, Consistency, Isolation,
Durability) properties typical of relational databases.
■ Basically Available: The system guarantees availability, but not
always consistency.
■ Soft state: The state of the system may change over time, even
without input, due to eventual consistency.
■ Eventual consistency: The system will eventually reach a consistent
state, but not immediately after a transaction.
6. CAP Theorem:
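○ The CAP theorem states that a distributed database can guarantee at most two
of the following three properties at the same time. In practice, when a network
partition occurs, the system must choose between consistency and availability: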
■ Consistency: Every read receives the most recent write or an error.
■ Availability: Every request receives a response, without guarantee
that it contains the most recent write.
■ Partition Tolerance: The system continues to operate despite
network partition failures.
○ Many NoSQL databases prioritize availability and partition tolerance over
strict consistency, although the trade-off varies by system and is sometimes
configurable.
7. Integrated Caching:
○ Many NoSQL databases come with built-in caching capabilities to improve
performance by storing frequently accessed data in memory.
8. Support for Semi-Structured Data:
○ NoSQL databases can handle semi-structured data formats, such as JSON or
XML, allowing for flexible data models that adapt to varying data types and
structures.
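The schema flexibility and semi-structured data support described above can be sketched in a few lines. This is only an illustrative in-memory model, not any particular database's API; the names collection, insert, and find are hypothetical.

```python
import json

# Hypothetical in-memory "collection": a list of JSON-like documents.
collection = []

def insert(doc):
    """Store a document as-is; no schema is declared or checked."""
    # The JSON round-trip makes a deep copy and confirms the document is serializable.
    collection.append(json.loads(json.dumps(doc)))

def find(field, value):
    """Return documents whose top-level field equals value; documents lacking the field are skipped."""
    return [d for d in collection if field in d and d[field] == value]

# Two documents with different shapes live side by side.
insert({"name": "Asha", "email": "asha@example.com"})
insert({"name": "Ravi", "phones": ["123-456-7890"], "address": {"city": "Anytown"}})

print(find("name", "Ravi"))
```

Because nothing is declared up front, a new field such as phones can appear in later documents without migrating the earlier ones.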
NoSQL databases are designed with an architecture that supports scalability, flexibility, and
high availability. Their architecture is tailored to handle large-scale data operations and is
optimized for distributed computing. Below are the key aspects of NoSQL architecture:
2. Data Distribution:
● Sharding: NoSQL databases use sharding to partition data across multiple nodes.
Each shard contains a subset of the data and operates as an independent database.
Sharding helps distribute the load and manage large datasets efficiently.
● Replication: Data replication ensures that copies of the data exist on multiple nodes.
This setup provides fault tolerance and improves data availability, as data can still be
accessed even if a node fails.
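As a rough sketch of how sharding and replication fit together, the Python below routes each key to a shard with a stable hash and places copies on the following nodes. The node names, the replication factor of 2, and the simple modulo placement are assumptions for illustration; real systems commonly use consistent hashing or range-based partitioning instead.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
REPLICATION_FACTOR = 2  # assumed number of copies kept for each key

def shard_for(key: str) -> int:
    """Map a key to a node index using a stable hash (not Python's randomized hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(NODES)

def replicas_for(key: str) -> list[str]:
    """Return the owning node followed by the next nodes in ring order."""
    start = shard_for(key)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def readable_nodes(key: str, failed: set[str]) -> list[str]:
    """Nodes that can still serve the key when some nodes are down."""
    return [n for n in replicas_for(key) if n not in failed]

owners = replicas_for("user:42")
print(owners, readable_nodes("user:42", {owners[0]}))
```

With two copies per key, losing the first owner still leaves one node able to serve the data, which is the fault tolerance the replication bullet describes.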
3. Consistency Models:
● BASE Properties: NoSQL systems often follow BASE (Basically Available, Soft
state, Eventual consistency), which means the system guarantees availability and will
eventually become consistent.
● CAP Theorem: NoSQL databases balance between Consistency, Availability, and
Partition Tolerance. According to the CAP theorem, a distributed database can only
guarantee two out of the three properties simultaneously.
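Eventual consistency can be illustrated with a toy model: a write lands on one replica, another replica briefly serves stale data, and a synchronization pass makes them agree. The Replica class, the version numbers, and the last-write-wins rule are assumptions chosen for this sketch; real databases use a variety of conflict-resolution schemes.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last-write-wins: keep whichever entry has the higher version number.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Push every replica's entries to all others so each ends with the newest version."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)
stale = r2.read("cart")   # replica 2 has not seen the write yet
anti_entropy([r1, r2])
fresh = r2.read("cart")   # after synchronization both replicas agree
print(stale, fresh)
```

Both replicas stayed available throughout; the cost was a window in which a read could return old data.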
4. Scalability:
6. Storage Models:
● Key-Value Store: Data is stored as key-value pairs, providing simple, fast access
(e.g., Redis, DynamoDB).
● Document Store: Data is stored as documents (e.g., MongoDB, CouchDB) that can
have complex nested structures, typically in JSON or BSON formats.
● Column-Family Store: Data is organized by row key, with each row's columns grouped
into column families (e.g., Cassandra, HBase), which suits queries that access a
subset of columns.
● Graph Database: Data is stored in nodes and edges to represent relationships (e.g.,
Neo4j), focusing on the connections between data points.
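To make the four storage models concrete, the same small fact (a person living in a city) is written below in each shape using plain Python structures. This is only a mental model; the key formats, family names, and the LIVES_IN relation are made up for illustration and do not reflect how any listed product stores data on disk.

```python
# Key-value: an opaque value looked up by key.
key_value = {"user:1": b'{"name": "John Doe", "city": "Anytown"}'}

# Document: a nested, self-describing structure.
document = {"_id": 1, "name": "John Doe", "address": {"city": "Anytown"}}

# Column-family: row key -> column family -> columns.
column_family = {
    "user:1": {"profile": {"name": "John Doe"}, "location": {"city": "Anytown"}}
}

# Graph: nodes plus edges that carry the relationships.
nodes = {1: {"name": "John Doe"}, 2: {"name": "Anytown"}}
edges = [(1, "LIVES_IN", 2)]

def neighbors(node_id, relation):
    """Follow edges of one relation type from a node and return the target names."""
    return [nodes[dst]["name"] for src, rel, dst in edges if src == node_id and rel == relation]

print(neighbors(1, "LIVES_IN"))
```

Reading only the location family in the column-family shape, or following an edge in the graph shape, shows the access pattern each model is optimized for.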
7. Integrated Caching:
● Many NoSQL databases come with built-in caching mechanisms to store frequently
accessed data in memory. This enhances the performance of read operations by
reducing the need to access disk storage.
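The caching idea can be sketched as a small read-through cache with least-recently-used eviction. The class name ReadThroughCache, the capacity of 2, and the dictionary standing in for disk are assumptions for this example; actual databases implement caching internally and in different ways.

```python
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, backing_store, capacity=2):
        self.store = backing_store      # stands in for slower disk storage
        self.capacity = capacity
        self.cache = OrderedDict()      # ordered from least to most recently used
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1                  # cache miss: read from the backing store
        value = self.store[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used entry
        return value

disk = {"a": 1, "b": 2, "c": 3}
cache = ReadThroughCache(disk)
for key in ["a", "a", "b", "a", "c", "b"]:
    cache.get(key)
print(cache.disk_reads)
```

In this access sequence only four of the six lookups reach the backing store; the repeated reads of "a" are served from memory.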
MongoDB is a NoSQL database that supports horizontal scaling and high availability
through its sharding and replication mechanisms. These features are crucial for managing
large data volumes and ensuring that data is available even in the event of node failures.
1. Sharding in MongoDB:
2. Replication in MongoDB:
● Sharded Replicated Clusters: MongoDB can combine both sharding and replication
by having each shard in a sharded cluster configured as a replica set. This
configuration provides the benefits of horizontal scaling (via sharding) and high
availability (via replication).
● Operational Benefits:
○ Scalable and Fault-Tolerant: The combination allows MongoDB to scale
data across multiple nodes while ensuring that data remains available during
node failures.
○ Improved Performance: Write operations can be distributed across shards,
and read operations can be scaled out to secondary nodes.
Q4 Master-Slave Architecture
● Master Node:
○ Handles all the write operations in the system.
○ Updates made on the master node are propagated to the slave nodes.
○ Maintains the main copy of the data and ensures data consistency for writes.
● Slave Nodes:
○ Replicate the data from the master node and can handle read operations.
○ Provide additional copies of the data to distribute the load and improve read
performance.
○ Typically configured to be read-only to avoid write conflicts.
2. How It Works:
● Write Operations: All write operations are directed to the master node. Once the
master processes the write, it propagates the change to the slave nodes.
● Read Operations: Read operations can be performed on the master or distributed
across the slave nodes to balance the load and enhance read performance.
● Replication Process:
○ The master node logs all changes in a replication log or oplog.
○ The slave nodes pull updates from this log to maintain synchronization with
the master.
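The replication process above can be modeled in a few lines: the master appends each change to a log, and a slave applies whatever entries it has not yet seen. The Master and Slave classes and the pull method are hypothetical names for this sketch, not a real database interface.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.oplog = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))

class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of oplog entries already applied

    def pull(self, master):
        """Apply only the log entries added since the last pull."""
        for key, value in master.oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(master.oplog)

master, slave = Master(), Slave()
master.write("x", 1)
slave.pull(master)
master.write("x", 2)
lagging_value = slave.data["x"]   # slave has not pulled the second write yet
slave.pull(master)
print(lagging_value, slave.data["x"])
```

The gap between the second write and the second pull is the replication lag that can cause temporary inconsistencies between master and slaves.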
3. Advantages:
● High Availability: The system can still function if a slave node fails, as the master
continues to operate and can direct traffic to the remaining slaves.
● Improved Read Performance: Distributing read operations to slave nodes reduces
the load on the master, enabling better performance for read-intensive
applications.
● Data Redundancy: Ensures data backup and safety by maintaining multiple copies
of the data across slave nodes.
4. Disadvantages:
● Single Point of Failure: The master node is a single point of failure in this model. If it
goes down, write operations halt until a new master is elected or manually
configured.
● Consistency Lag: Slave nodes may experience a lag in data synchronization,
leading to temporary inconsistencies between the master and slaves.
● Limited Scalability for Writes: Since all write operations are handled by the master
node, the architecture can become a bottleneck if the write load increases
significantly.
5. Use Cases:
● Read-Heavy Applications: Ideal for scenarios where the application has more read
operations than write operations, such as data analytics or reporting.
● Backup and Failover: Can be used to create backup copies of the database, and in
the event of a failure, one of the slaves can be promoted to act as the new master.
Implementation in MongoDB:
Master-slave replication is foundational for providing data reliability and read scalability, but it
has limitations that modern replication models, such as peer-to-peer or multi-master
architectures, aim to address.
Q5 CRUD Operations in Cassandra
Cassandra is a distributed NoSQL database known for handling large volumes of data
across multiple nodes with high availability and fault tolerance. It uses the Cassandra Query
Language (CQL) to perform CRUD (Create, Read, Update, Delete) operations. Below is an
explanation of how these operations work in Cassandra:
1. Create Operation:
Example:
INSERT INTO keyspace_name.table_name (column1, column2, column3)
VALUES ('value1', 'value2', 'value3');
● Explanation: The INSERT command specifies the keyspace, table, and column
values to be added. If a row with the same primary key already exists, the INSERT
overwrites the listed columns, so it effectively acts as an UPDATE (an upsert).
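The way an INSERT on an existing primary key behaves like an UPDATE can be modeled in plain Python. This is a conceptual sketch, not Cassandra driver code; the table dictionary and insert function are hypothetical stand-ins.

```python
table = {}  # primary key -> row (dict of column values)

def insert(primary_key, **columns):
    """Model of upsert behaviour: create the row, or overwrite only the given columns."""
    row = table.setdefault(primary_key, {})
    row.update(columns)

insert("u1", name="Asha", city="Pune")
insert("u1", city="Mumbai")   # same key: only 'city' is overwritten
print(table)
```

No error is raised on the second call; the existing row keeps its name column and only city changes.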
2. Read Operation:
Example:
SELECT column1, column2 FROM keyspace_name.table_name
WHERE primary_key_column = 'value';
● Explanation: The SELECT statement can retrieve specific columns or all columns
using SELECT *. The WHERE clause filters rows and normally must restrict primary key
or indexed columns; other filters are rejected unless ALLOW FILTERING is added.
3. Update Operation:
Example:
UPDATE keyspace_name.table_name
SET column1 = 'new_value'
WHERE primary_key_column = 'value';
● Explanation: The UPDATE command specifies the table, column(s) to be updated,
and the condition to identify the rows. The WHERE clause must include the primary
key to locate the specific row(s) for updating.
4. Delete Operation:
Example:
DELETE column1 FROM keyspace_name.table_name
WHERE primary_key_column = 'value';
● Explanation: The DELETE command can remove specific columns or entire rows.
The WHERE clause is required to identify which row(s) should be deleted.
● Primary Key Requirement: Most CRUD operations require the WHERE clause to
specify the primary key, ensuring efficient data access.
● Denormalization: Cassandra encourages denormalized data models to optimize for
read-heavy operations. Data is often duplicated across tables to allow for faster
reads.
● Tunable Consistency: Cassandra allows specifying consistency levels for read and
write operations, such as ONE, QUORUM, or ALL, depending on the desired balance
between availability and consistency.
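● Batch Operations: Cassandra supports batch operations for executing multiple
INSERT, UPDATE, or DELETE statements as a single atomic operation: if any statement
in a logged batch is applied, all of them eventually are. Batches spanning several
partitions are not isolated, so other clients may briefly see partial results.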
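The trade-off behind tunable consistency can be worked out with simple arithmetic: a read is guaranteed to overlap the latest write when the replicas contacted for the read plus those acknowledged for the write exceed the replication factor. The helper names below are hypothetical, and only the ONE, QUORUM, and ALL levels mentioned above are modeled.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1

def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    return {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]

def reads_see_latest_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when the read set and write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor

print(reads_see_latest_write("QUORUM", "QUORUM", 3))
print(reads_see_latest_write("ONE", "ONE", 3))
```

With a replication factor of 3, QUORUM reads and writes overlap (2 + 2 > 3), while ONE and ONE do not (1 + 1 is not greater than 3), which favors availability and latency over consistency.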
Q6 Data Storage in MongoDB
1. Document Model:
Example Document:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "John Doe",
"email": "[email protected]",
"address": {
"street": "123 Main St",
"city": "Anytown",
"zipcode": "12345"
},
"phone_numbers": ["123-456-7890", "987-654-3210"]
}
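Nested fields and array elements in a document like the one above are commonly addressed with dotted paths such as address.city. The sketch below mimics that lookup on a plain Python dictionary; get_path is a hypothetical helper written for this example, not a MongoDB function, and the document omits the _id and email fields for brevity.

```python
document = {
    "name": "John Doe",
    "address": {"street": "123 Main St", "city": "Anytown", "zipcode": "12345"},
    "phone_numbers": ["123-456-7890", "987-654-3210"],
}

def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; numeric parts index into lists."""
    current = doc
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current

print(get_path(document, "address.city"))
print(get_path(document, "phone_numbers.1"))
```

Keeping the address and phone numbers inside the same document means one lookup returns everything, with no join across tables.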
2. Collections:
3. BSON Format:
4. Storage Engine:
6. Replication:
● Replica Sets: MongoDB supports data replication through replica sets, where data
is replicated across multiple nodes for high availability and redundancy. One node
acts as the primary (handling writes), while other nodes act as secondaries
(replicating data from the primary).
● Automatic Failover: If the primary node fails, a secondary node is elected as the
new primary, ensuring continuous data availability.
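Automatic failover can be pictured with a heavily simplified election: if a majority of members is still reachable, the most up-to-date surviving member becomes primary. The function elect_primary, the member names, and the oplog positions are assumptions for illustration; MongoDB's actual election protocol takes additional factors into account.

```python
def elect_primary(members, failed):
    """members maps name -> last applied oplog position. Returns the new primary or None."""
    alive = {name: pos for name, pos in members.items() if name not in failed}
    if len(alive) <= len(members) // 2:
        return None  # no majority reachable: no primary can be elected
    # Sorting the names first makes ties break deterministically.
    return max(sorted(alive), key=lambda name: alive[name])

members = {"node1": 120, "node2": 118, "node3": 120}
print(elect_primary(members, failed={"node1"}))
print(elect_primary(members, failed={"node1", "node2"}))
```

Requiring a majority is what keeps two isolated halves of a cluster from each electing their own primary.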
7. Storage Mechanics:
● Flexible Schema: Allows storing various types of data structures without enforcing a
rigid schema.
● Scalability: Supports sharding for horizontal scaling to handle large datasets
efficiently.
● High Availability: Ensures data availability through replica sets and automatic
failover.
● Rich Data Types: BSON format supports a wide range of data types, making
MongoDB versatile for different data storage needs.
Cassandra Query Language (CQL) is used to interact with Cassandra databases. It provides
SQL-like syntax for defining, querying, and manipulating data. Below is a breakdown of
commonly used CQL commands:
● ALTER: Modifies an existing keyspace or table structure.
● DROP: Deletes a keyspace, table, or index.
Example: Dropping a table.
DROP TABLE my_keyspace.users;
● SELECT: Retrieves data from a table.
● UPDATE: Modifies existing data in a table.
● DELETE: Removes data from a table.
○ Example: Deleting a full row.
DELETE FROM my_keyspace.users WHERE user_id = 'some-uuid';
● Definition: A key-value store is a simple, schema-less data store where each key
maps directly to a value. The value can be any type of data, such as text, images, or
multimedia files, stored as a BLOB (Binary Large Object).
● Characteristics:
○ High performance for quick data retrieval.
○ Excellent scalability for managing very large datasets.
○ High flexibility in storing different types of data.
1. Versatile Data Storage: Can store any type of data in the value field (text, hypertext,
images, video, audio). The system retrieves and returns the data in its entirety when
requested.
2. Fast Querying: Simple queries that request values using keys result in high-speed
data retrieval.
3. Schema Flexibility: Unlike traditional relational databases, key-value stores do not
enforce a rigid schema, allowing for dynamic and varied data structures.
4. Eventual Consistency: Ensures data availability across distributed systems, with
consistency eventually achieved across replicas.
5. Hierarchical or Ordered Storage: Some implementations allow for hierarchical
storage (nested structures) or ordered key-value pairs.
6. Flexible Key Representation:
○ Keys can be generated in different formats (e.g., hash values, logical paths,
REST endpoints).
○ Auto-generated or synthetic keys help in unique identification.
7. Scalability and Reliability: Designed to handle horizontal scaling efficiently, which
supports adding more nodes to manage increased data loads. Provides portability
and incurs low operational costs.
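The features listed above boil down to a very small interface: put, get, and delete by key, with the value treated as an opaque blob. The sketch below is a minimal in-memory model; the KeyValueStore class and the example keys are hypothetical and not tied to Redis, DynamoDB, or any other product.

```python
class KeyValueStore:
    def __init__(self):
        self._items = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value      # the store never inspects the value

    def get(self, key: str) -> bytes:
        return self._items[key]       # the whole value is returned in one piece

    def delete(self, key: str) -> None:
        self._items.pop(key, None)    # deleting a missing key is a no-op

store = KeyValueStore()
store.put("/users/42/avatar", b"\x89PNG\r\n")    # logical-path style key, binary value
store.put("session:ab12", b'{"user": 42}')       # a different kind of value, same interface
print(store.get("session:ab12"))
```

Because every operation is a direct lookup by key, the same three calls work whether the value is an image, a JSON string, or anything else.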
Limitations of Key-Value Stores
1. Lack of Indexed Search: Values are typically opaque to the store and not indexed,
so it cannot search for subsets or filter on value conditions without scanning every key.
2. Basic Database Capabilities: Does not natively support advanced database
features like:
○ Atomic Transactions: Often little or no built-in support for complex
multi-operation transactions.
○ Consistency Guarantees: Ensuring strong consistency across transactions
can be difficult without additional implementation.
3. Key Uniqueness Management: Maintaining unique and identifiable keys can
become challenging as the volume of data grows.
4. Limited Querying: Cannot query specific fields within the value. Unlike SQL-based
databases, key-value stores do not support complex queries (e.g., WHERE clauses or
value-based filtering).
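The querying limitation can be seen in a short example: finding records by a field inside the value means fetching and decoding every value, unless the application maintains its own index. The store contents, find_by_city_scan, and city_index are hypothetical names used only for this sketch.

```python
import json

store = {
    "user:1": json.dumps({"name": "Asha", "city": "Pune"}),
    "user:2": json.dumps({"name": "Ravi", "city": "Delhi"}),
    "user:3": json.dumps({"name": "Meera", "city": "Pune"}),
}

def find_by_city_scan(city):
    """Without value indexes the client must fetch and decode every value."""
    return sorted(k for k, v in store.items() if json.loads(v)["city"] == city)

# Workaround: the application builds and maintains its own secondary index.
city_index = {}
for key, raw in store.items():
    city_index.setdefault(json.loads(raw)["city"], []).append(key)

print(find_by_city_scan("Pune"))
print(sorted(city_index["Pune"]))
```

The index makes the lookup fast, but the application now has to keep it in step with every write, which is the kind of work a document or relational database would do on its behalf.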