
Assignment-1

Big Data Analysis(21CS71)

Last date to submit assignment 26-10-2024

1. Define Data. Explain the classification of Data. Explain Big Data characteristics.

2. Explain vertical scalability and horizontal scalability.

3. What is Grid Computing? List and explain the features of Grid Computing.

4. What is meant by Data Quality? Explain the factors affecting data quality.

5. Define cloud computing. Explain the types of services over the cloud.

6. What is meant by Data Analysis? Discuss the phases in Data Analytics.

7. Discuss the functions of each of the five layers in Big Data Architecture Design.

8. With a neat diagram, write a note on the Hadoop tool.

9. Explain any three Big Data Applications.

10. List and explain the features of distributed computing architecture.

11. Explain ACID properties in SQL transactions.

12. Explain Brewer’s CAP Theorem.

13. Discuss the BASE properties in NoSQL databases.

14. What is Document store data architecture? Discuss the features of document store
NoSQL data architecture.

15. Explain Key-Value store NoSQL database architecture.


1. Define Data. Explain the classification of Data. Explain Big Data characteristics.

Definition of Data

Data refers to information that can be stored, processed, and analyzed. It includes facts,
statistics, and observations used for calculations or decision-making. Data can be
presented in various forms such as numbers, text, images, and videos.

Classification of Data

Data can be classified into the following types:

1. Structured Data
o Data that conforms to a specific schema or data model.
o Found in tables with rows and columns, like in relational databases
(RDBMS).
o Examples: SQL databases, spreadsheets.
o Features:
▪ Enables operations such as insert, delete, update, and append.
▪ Indexing for faster retrieval.
▪ Supports transaction processing following ACID rules.
2. Semi-Structured Data
o Data that contains tags or markers separating elements but does not
conform to a fixed schema.
o Examples: XML, JSON documents.
o Semi-structured data has some organizational properties but is more
flexible than structured data (a brief JSON sketch follows this list).
3. Multi-Structured Data
o Data that consists of multiple formats, including structured, semi-
structured, and unstructured data.
o Found in non-transactional systems and in various formats like streaming
data.
o Examples: Customer interaction logs, sensor data, enterprise server data.
4. Unstructured Data
o Data that lacks a predefined structure, such as text, images, and videos.
o Examples: Emails, social media posts, text documents, audio, and video
files.
o Unstructured data may have internal structure but does not follow a
formal schema.
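To make the semi-structured category above concrete, here is a minimal Python sketch that builds two JSON-style records (the records and their fields are hypothetical); the elements are separated by tags (keys), but the two records do not share a fixed schema:

import json

# Two hypothetical semi-structured records: tagged fields, but no fixed schema.
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phones": ["98xxx", "99xxx"], "address": {"city": "Mysuru"}},
]
print(json.dumps(records, indent=2))  # serialized as JSON, as in the XML/JSON examples above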

Big Data Characteristics

Big Data refers to large and complex data sets that traditional data processing tools
cannot handle. Its main characteristics are commonly described as the 3Vs (Volume,
Velocity, and Variety), with Veracity often added as a fourth V:

1. Volume
o Refers to the sheer size of the data.
o Big Data typically involves massive amounts of data being generated
daily.
2. Velocity
o Refers to the speed at which data is generated and processed.
o In the era of Big Data, data is often generated in real-time or near-real-
time.
3. Variety
o Refers to the different types of data formats.
o Big Data can be structured, semi-structured, or unstructured, including
text, images, videos, and more.
4. Veracity
o Refers to the quality and accuracy of the data.
o The reliability of the data can vary, and its accuracy impacts the analysis.

Examples of Big Data

• Social Networks: Data from platforms like Facebook, Twitter, and YouTube.
• Transactions Data: Credit card transactions, bookings, public records.
• Machine-Generated Data: Data from sensors, Internet of Things (IoT), and
machine logs.
• Human-Generated Data: Emails, documents, biometrics, and interactions
recorded in digital formats.

Examples of Big Data Usage

• Weather Data Monitoring and Prediction: Large volumes of weather data
collected for accurate forecasting.
• Predictive Maintenance Services: Data from connected cars for automotive
maintenance prediction.
• Vending Machine Analytics: Data collected from a large number of vending
machines to track product usage and demand.
2. Explain vertical scalability and horizontal scalability.

Vertical Scalability

• Definition: Vertical scalability involves adding more resources (like CPUs, RAM,
or storage) to a single system to enhance its performance.
• Capabilities: This approach improves analytics, reporting, and visualization
capabilities, making it suitable for handling more complex problems.
• Efficient Design: It focuses on designing algorithms that effectively use the
increased resources.
• Example: If processing x terabytes of data takes time t, and increasing
complexity requires a factor of n, scaling up should reduce the processing time
to a value equal to, less than, or much less than n × t.

Horizontal Scalability

• Definition: Horizontal scalability involves adding more systems (servers) to
work together, distributing the workload across these multiple systems.
• Workload Distribution: This approach enables the processing of different
datasets simultaneously from a larger dataset.
• Example: If r resources can process x terabytes in time t, then p × x terabytes
can be processed on p parallel distributed nodes, ideally keeping the time at t
or only slightly more due to communication between nodes (see the sketch below).
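A small Python sketch of the two scaling ideas above (purely illustrative; the function names and the overhead figure are assumptions, not from the source):

def vertical_scaling_target(t, n):
    # Scaling up: when problem complexity grows by a factor n, a well-designed
    # system should keep processing time at or below n * t (ideally much less).
    return n * t

def horizontal_scaling_estimate(t, p, comm_overhead_per_node=0.02):
    # Scaling out: p * x terabytes spread over p parallel nodes should still take
    # roughly time t, plus a small communication overhead that grows with node count.
    return t * (1 + comm_overhead_per_node * (p - 1))

print(vertical_scaling_target(t=10, n=4))      # upper bound: 40 time units
print(horizontal_scaling_estimate(t=10, p=4))  # roughly 10.6 time units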

3. What is Grid Computing? List and explain the features of Grid Computing.

Grid Computing is a form of distributed computing where multiple computers located
at different geographic locations are connected to work towards a common goal. The
computers, which may have different configurations and be geographically dispersed,
collectively form a "grid" to perform a task or process large datasets. The system pools
computing power, storage, and resources, allowing them to be shared across multiple users.

The primary use of grid computing is to enable resource sharing among users, including
individuals and organizations, to solve complex, large-scale computational problems. A
grid typically dedicates itself to one task at a time, such as scientific simulations, large-
scale data analysis, or complex problem-solving.

Features of Grid Computing

1. Resource Sharing:
o Grid computing allows for the sharing of heterogeneous resources (e.g.,
different types of computers, storage systems, and network components)
across multiple organizations or users. Resources are dispersed across
different geographic locations and can be accessed and used efficiently.
2. Scalability:
o Like cloud computing, grid computing is highly scalable. It can grow by
adding more nodes to the grid, enabling the system to handle increasing
amounts of data or computational tasks. This makes grid computing ideal
for handling tasks that require significant resources.
3. Geographical Distribution:
o Grid computing resources are distributed across various locations rather
than being centralized. This allows for the efficient utilization of
computational resources spread over a wide area, regardless of their
physical location.
4. Large-Scale Resource Coordination:
o Grid computing provides a coordinated framework for using multiple
resources. The coordination ensures that all resources are working
together seamlessly to complete a common task efficiently and securely. It
allows large amounts of data and computing power to be integrated and
utilized effectively.
5. Data-Intensive Operations:
o Grid computing is particularly suited for handling data-intensive tasks. It
is more efficient for storing and processing large datasets spread across
different grid nodes than for handling smaller, less intensive operations.
6. Task-Specific Grids:
o There are different types of grids, such as data grids (focused on
managing and distributing large amounts of data) and computational
grids (focused on processing computationally intensive tasks). Each grid
serves a specific purpose, ensuring that the resources available are
utilized appropriately for the task.
7. Security and Flexibility:
o Grid computing ensures secure and flexible sharing of resources. The
architecture provides a secure framework that protects sensitive data and
operations while still allowing flexible access for users based on their
needs.
8. Cost Efficiency:
o By pooling and sharing resources across a grid, organizations can reduce
their costs. Rather than buying expensive dedicated hardware, they can
use the collective power of distributed resources, which is often more
cost-effective.

4. What is meant by Data Quality? Explain the factors affecting data quality.

Data Quality refers to the degree to which data is accurate, reliable, and suitable for its
intended use. High-quality data represents the real-world constructs accurately and is
essential for effective operations, analysis, decision-making, and knowledge discovery.
For example, high-quality data in an artificial intelligence system enables accurate
model predictions and decisions.

A common definition of data quality can be framed around the "5 R's":

1. Relevancy: Data must be relevant to the task or decision-making process.
2. Recency: Data must be up-to-date and recent.
3. Range: Data must cover the necessary scope and variability.
4. Robustness: Data should remain effective even with small variances or errors.
5. Reliability: Data must be consistent and dependable.

Factors Affecting Data Quality

1. Data Integrity:
o Definition: Data integrity refers to the accuracy and consistency of data
over its lifecycle. The integrity of data must be maintained to ensure that
it remains correct and uncorrupted.
o Example: In a student grading system, the grades should remain
consistent and unchanged through processing.
2. Noise:
o Definition: Noise in data refers to unwanted or meaningless information
that deviates from the actual or true value. Noisy data can negatively
impact the accuracy of analysis.
o Characteristics: Noise is often random and can cause both positive and
negative deviations.
o Impact: Noisy data leads to incorrect conclusions during analysis.
3. Outliers:
o Definition: Outliers are data points that deviate significantly from the
rest of the dataset. These values may be valid or due to errors.
o Impact: If outliers are not correctly handled, they can distort results.
Removing true outliers or misclassifying valid data as outliers can lead to
incorrect analysis outcomes.
4. Missing Values:
o Definition: Missing values occur when data is absent from the dataset.
o Impact: Missing values affect the completeness of data and can skew
analysis, leading to inaccurate results. Techniques such as imputation
(estimating missing values) are often used to address this issue.
5. Duplicate Values:
o Definition: Duplicate values occur when the same data is repeated more
than once in a dataset.
o Impact: Duplicate values can lead to bias in analysis and inflated results.
Removing duplicate entries is essential to ensure accurate data
interpretation (a short pandas sketch follows this list).
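To make the last two factors concrete, here is a minimal pandas sketch (the dataset is hypothetical and the cleaning rules are illustrative, not prescriptive) showing duplicate removal, simple imputation of a missing value, and a range check for an outlier:

import pandas as pd
import numpy as np

# Hypothetical marks dataset with a missing value, an outlier, and a duplicate row.
df = pd.DataFrame({
    "student": ["A", "B", "C", "C", "D"],
    "marks":   [78, np.nan, 85, 85, 400],   # NaN = missing value, 400 = likely outlier
})

df = df.drop_duplicates()                                # remove duplicate entries
df["marks"] = df["marks"].fillna(df["marks"].median())   # impute the missing value
df_clean = df[df["marks"].between(0, 100)]               # drop out-of-range values (marks are 0-100)
print(df_clean)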

5. Define cloud computing. Explain the types of services over the cloud.

Cloud Computing is a type of Internet-based computing that provides shared processing
resources and data to computers and other devices on demand. It allows users to access
computing infrastructure, platforms, and software services without needing to manage
physical hardware directly. The primary goal of cloud computing is to perform parallel
and distributed computing, ensuring high availability, flexibility, and scalability.

Key features of cloud computing include:

1. On-demand service: Resources are available whenever needed.
2. Resource pooling: Resources are shared across multiple users.
3. Scalability: The system can scale up or down based on demand.
4. Accountability: Users can track resource usage and performance.
5. Broad network access: Services are accessible from anywhere over the Internet.

The types of services offered over the cloud are commonly grouped into three models:

1. Infrastructure as a Service (IaaS): Provides virtualized computing resources such as servers, storage, and networking on demand.
2. Platform as a Service (PaaS): Provides a managed platform (runtimes, databases, development tools) for building and deploying applications without managing the underlying infrastructure.
3. Software as a Service (SaaS): Delivers ready-to-use applications over the Internet, typically accessed through a web browser.

6. What is meant by Data Analysis? Discuss the phases in Data Analytics.

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data
with the goal of discovering useful information, suggesting conclusions, and supporting
decision-making. It involves various statistical and mathematical techniques to extract
meaningful insights from data, ultimately aiding in making informed business decisions.

Data analysis provides value by turning raw data into actionable insights, which helps
businesses understand trends, behaviors, and opportunities.

Phases in Data Analytics

Data analytics can be broken down into the following key phases:

1. Descriptive Analytics:
o Definition: Focuses on summarizing historical data to derive insights. It
helps in understanding what has happened in the past through reports
and visualizations.
o Examples: Sales reports, data visualizations like charts or dashboards
that show trends over time (a minimal pandas sketch follows this list).
2. Predictive Analytics:
o Definition: Uses statistical models and machine learning techniques to
forecast future outcomes based on historical data.
o Examples: Predicting customer churn, demand forecasting for products,
stock price prediction.
3. Prescriptive Analytics:
o Definition: Suggests actions or decisions by analyzing data and
recommending the best course of action to maximize outcomes or profits.
o Examples: Recommending product pricing strategies, optimizing supply
chain operations.
4. Cognitive Analytics:
o Definition: Uses advanced algorithms and artificial intelligence to
simulate human thought processes, enabling systems to make better
decisions in complex scenarios.
o Examples: Intelligent virtual assistants like Siri or Alexa, systems that
understand natural language and improve decision-making.
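The descriptive phase is the easiest to show in code. Below is a minimal pandas sketch that summarizes hypothetical sales data (the columns and figures are invented for illustration):

import pandas as pd

# Hypothetical sales data; in practice this would come from a file or database.
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "region":  ["North", "North", "South", "South"],
    "revenue": [12000, 15000, 11000, 18000],
})

# Descriptive analytics: summarize what has already happened.
print(sales["revenue"].describe())               # overall summary statistics
print(sales.groupby("region")["revenue"].sum())  # revenue per region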
7. Discuss the functions of each of the five layers in Big Data Architecture Design.

Functions of Each Layer in Big Data Architecture Design

Big Data architecture is organized into five layers, each with specific functions that
contribute to the overall management and processing of large volumes of data. Here's a
detailed overview of the functions of each layer:

Layer 1: Data Source Identification (L1)

Function: This layer focuses on identifying and categorizing the various data sources
from which data will be collected.

• Key Functions:
o Identify Data Sources: Determine whether the sources are internal (e.g.,
company databases) or external (e.g., social media, public datasets).
o Data Source Types: Understand the nature of the data sources, such as
structured (databases), semi-structured (JSON, XML), or unstructured
(text documents, images).
o Data Volume Considerations: Estimate the amount of data to be
ingested and plan accordingly.
o Data Formats: Recognize the formats of incoming data and prepare for
conversion or processing as necessary.

Layer 2: Data Ingestion (L2)

Function: The data ingestion layer is responsible for absorbing data from various
sources and preparing it for further processing.

• Key Functions:
o Data Acquisition: Collect data in real-time or through batch processing.
o Ingestion Processes: Utilize ETL (Extract, Transform, Load) processes to
prepare data for storage and analysis.
o Push vs. Pull Mechanisms: Implement push (data sent automatically
from the source) or pull (data requested from the source) strategies for
data ingestion.
o Real-time vs. Batch Ingestion: Decide whether to process data
continuously as it arrives or in scheduled intervals.

Layer 3: Data Storage (L3)

Function: This layer provides a reliable storage solution for the ingested data, ensuring
that it is organized and easily accessible for processing.

• Key Functions:
o Data Storage Solutions: Choose appropriate storage technologies, such
as Hadoop Distributed File System (HDFS), NoSQL databases (e.g.,
Cassandra, MongoDB), or traditional relational databases.
o Data Formats and Compression: Decide on data formats for storage and
apply compression techniques to optimize space.
o Historical vs. Incremental Storage: Determine whether to store
historical data or manage incremental updates to datasets.
o Access Patterns: Understand how the data will be queried and accessed
for subsequent processing and analytics.

Layer 4: Data Processing (L4)

Function: This layer involves transforming and analyzing the stored data using various
data processing tools and frameworks.

• Key Functions:
o Data Processing Tools: Utilize frameworks and software such as
MapReduce, Apache Spark, Hive, and Pig for data processing tasks.
o Processing Modes: Implement scheduled batch processing, real-time
processing, or a hybrid approach based on application needs.
o Synchronous vs. Asynchronous Processing: Manage processing
requirements depending on how data consumption occurs at the upper
layer (L5).
o Data Transformation: Perform necessary transformations to clean,
aggregate, or enrich data for analytics.

Layer 5: Data Consumption (L5)

Function: This layer is dedicated to the consumption of processed data through various
applications, reporting tools, and analytics platforms.

• Key Functions:
o Data Integration: Integrate processed data with business intelligence
tools and analytics platforms for seamless usage.
o Reporting and Visualization: Provide capabilities for data visualization
and reporting using tools like Tableau, Power BI, or custom dashboards.
o Analytics Applications: Support various analytics processes, including
descriptive analytics, predictive analytics, data mining, and machine
learning.
o Export Capabilities: Allow for exporting datasets to cloud storage, web
applications, or other systems for additional processing or sharing.

In summary (a brief illustrative sketch follows this list):

1. Layer 1 (L1): Identifies and categorizes data sources.
2. Layer 2 (L2): Manages data ingestion from various sources.
3. Layer 3 (L3): Provides storage solutions for ingested data.
4. Layer 4 (L4): Processes and transforms data for analysis.
5. Layer 5 (L5): Consumes processed data for reporting, visualization, and
analytics.
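As a rough illustration of how layers L3 to L5 can fit together in code, here is a minimal PySpark sketch (the paths and column names are assumptions; the source lists Spark only as one of several L4 tools, so this is not the prescribed design):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layers-sketch").getOrCreate()

# L3 (storage): read previously ingested data from a hypothetical HDFS location.
events = spark.read.json("hdfs:///data/ingested/events")

# L4 (processing): clean and aggregate the data.
daily_counts = (events
                .filter(events.event_type.isNotNull())
                .groupBy("event_date", "event_type")
                .count())

# L5 (consumption): persist results where reporting and visualization tools can read them.
daily_counts.write.mode("overwrite").parquet("hdfs:///data/processed/daily_counts")

spark.stop()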
8. With a neat diagram, write a note on the Hadoop tool.

Hadoop is a powerful framework designed to store and process vast amounts of data in
a distributed environment. It consists of various core components and features that
enhance its capabilities in handling Big Data. Below is an overview of the Hadoop core
components and its features.

Core Components of Hadoop

1. Hadoop Common:
o This module includes the libraries and utilities essential for the other
Hadoop modules.
o It provides various components and interfaces for distributed file systems
and general input/output, such as serialization, Java RPC (Remote
Procedure Call), and file-based data structures.
2. Hadoop Distributed File System (HDFS):
o HDFS is a Java-based distributed file system that can store all types of
data on disks across clusters.
o It is designed to provide high-throughput access to application data and is
optimized for large datasets.
3. MapReduce v1:
o This is the original programming model in Hadoop, utilizing the Mapper
and Reducer functions.
o It processes large sets of data in parallel and in batches, breaking tasks
into smaller, manageable sub-tasks (a minimal word-count sketch follows this component list).
4. YARN (Yet Another Resource Negotiator):
o YARN is responsible for managing resources across the Hadoop
ecosystem.
o It allows user application tasks or sub-tasks to run in parallel, using
scheduling to handle resource requests in a distributed environment.
5. MapReduce v2:
o This is the improved version of the MapReduce framework introduced
with Hadoop 2, built on the YARN architecture.
o It enhances parallel processing of large datasets and enables distributed
processing of application tasks.
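To illustrate the Mapper and Reducer functions mentioned above, here is a minimal word-count sketch written as two Hadoop Streaming scripts in Python (the file names and the way they are wired into a job are illustrative assumptions, not details from the source):

# mapper.py -- reads raw text on stdin and emits (word, 1) pairs, one per line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts per word; Hadoop Streaming delivers keys already sorted.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These two scripts would be supplied to the hadoop-streaming jar as the mapper and reducer, with HDFS input and output paths given on the command line.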
Features of Hadoop

1. Scalable, Flexible, and Modular Design:
o Hadoop features a simple and modular programming model that allows
for high scalability.
o New nodes can be added to handle larger data sets, making it suitable for
storing, managing, processing, and analyzing Big Data.
2. Robust Design of HDFS:
o HDFS is designed to ensure that Big Data applications continue running
even if an individual server or cluster fails.
o It implements backup through data replication (at least three copies for
each data block) and includes a data recovery mechanism, ensuring high
reliability.
3. Ability to Store and Process Big Data:
o Hadoop is capable of processing data characterized by the 3V attributes:
Volume, Velocity, and Variety.
4. Distributed Clusters Computing Model with Data Locality:
o The framework processes Big Data quickly as application tasks and sub-
tasks are submitted to DataNodes, enabling greater computing power by
increasing the number of computing nodes.
o This design allows for efficient processing across multiple DataNodes,
resulting in faster computation and aggregated results.
5. Hardware Fault Tolerance:
o Faults in individual nodes do not impact data and application processing,
as other nodes can manage the residual workload.
o Hadoop automatically replicates data blocks (default is three copies) to
ensure data integrity and availability.
6. Open-source Framework:
o Hadoop's open-source nature and access to cloud services facilitate large
data storage capabilities.
o It can operate on clusters of multiple inexpensive servers or in the cloud.
7. Java and Linux Based:
o Hadoop is primarily developed using Java interfaces and is designed to
run on Linux systems, featuring its own set of shell commands for
operations.

9. Explain any three Big Data Applications.

1. Big Data in Marketing and Sales

• Customer Value Analytics (CVA): Analyzes customer needs and preferences to
help businesses understand what products to offer. For instance, companies like
Amazon use CVA to provide tailored shopping experiences.
• Operational Analytics: Focuses on improving internal company operations by
analyzing data for efficiency.
• Fraud Detection: Identifies and prevents fraudulent activities, such as
individuals taking loans against the same asset from multiple banks.
• Product Innovations: Uses data to create new products or improve existing
services, like developing apps for ride-sharing.
• Enterprise Data Warehouse Optimization: Enhances how businesses manage
and utilize their data for better decision-making.

2. Big Data Analytics in Detection of Marketing Frauds

• Data Fusion: Combines data from different sources (like social media and
internal databases) to provide a comprehensive view that aids in fraud detection.
• Multiple Data Sources: Uses various types of data to enhance insights and
reporting.
• Real-Time Analytics: Analyzes data quickly to detect potential fraud before it
causes significant harm.

3. Big Data Risks

• Data Quality: Concerns about the accuracy and reliability of data, which can lead
to incorrect analyses.
• Security and Privacy: Risks related to data breaches and unauthorized access to
sensitive information.
• Financial Costs: The high costs associated with managing large volumes of data,
which can impact profitability.

4. Big Data in Credit Risk Management

• Loan Default Predictions: Analyzes data to assess the likelihood of borrowers
defaulting on loans.
• Risk Identification: Helps financial institutions identify high-risk customers and
sectors.
• Credit Opportunities: Uses data insights to find new customers and improve
lending strategies.

5. Big Data and Algorithmic Trading

• Automated Trading: Uses algorithms to execute trades based on market data
analysis, optimizing buying and selling strategies.
• Risk Analysis: Big data helps traders make informed decisions by analyzing
large volumes of market data quickly.

6. Big Data and Healthcare

• Patient Monitoring: Uses data from wearable devices to track patient health in
real time.
• Fraud Prevention: Identifies duplicate claims and unnecessary medical tests to
reduce costs in the healthcare system.
• Improving Outcomes: Predictive analytics helps in early diagnosis and
treatment of conditions, leading to better patient outcomes.

7. Big Data in Medicine


• Personalized Medicine: Analyzes large data sets to tailor treatments to
individual patients based on their unique health profiles.
• Research Advancements: Facilitates research by integrating data from various
sources (like genetics and clinical studies) to improve understanding of diseases.

10. List and explain the features of distributed computing architecture.

11. Explain ACID properties in SQL transactions.

ACID Properties in SQL Transactions

1. Atomicity
o Definition: All operations within a transaction must complete
successfully. If any operation fails, the entire transaction is rolled back,
meaning no changes are made to the database.
o Example: Consider a bank transaction where a customer withdraws
money. If the first operation deducts the amount from the account but the
second operation (updating the balance) fails, the entire transaction is
rolled back. This ensures that either both operations succeed, or none do
(a minimal sqlite3 sketch follows this list).
2. Consistency
o Definition: A transaction must bring the database from one valid state to
another, maintaining all predefined rules and constraints (like integrity
constraints).
o Example: In a bank, the total of all deposits and withdrawals must equal
the current balance. If a transaction results in an inconsistency (e.g., a
balance that does not match the sum of transactions), it must be rolled
back.
3. Isolation
o Definition: Transactions must be executed in isolation from one another,
meaning that concurrent transactions do not interfere with each other.
The result of a transaction should not be visible to other transactions until
it is committed.
o Example: If two customers are trying to withdraw money from the same
account at the same time, isolation ensures that each transaction is
processed independently. This prevents situations where both
transactions see an outdated balance.
4. Durability
o Definition: Once a transaction is committed, its changes are permanent
and must survive system failures (like crashes or power losses).
o Example: After successfully withdrawing money, even if the system
crashes, the change (the deduction from the account) must still exist in
the database once the system is restored.
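A minimal sketch of atomicity and durability using Python's built-in sqlite3 module (the account names and amounts are hypothetical); the transfer either commits as a whole or is rolled back:

import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 500), ('bob', 100)")
conn.commit()

try:
    # Both updates belong to one transaction: debit alice, credit bob.
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
    conn.commit()        # Durability: once committed, the transfer survives a restart.
except sqlite3.Error:
    conn.rollback()      # Atomicity: if anything fails, neither update is applied.
finally:
    conn.close()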
12. Explain Brewer’s CAP Theorem.

CAP Theorem

The CAP Theorem, proposed by Eric Brewer, is a fundamental principle in distributed
database systems. It states that in a distributed system, it is impossible to
simultaneously provide all three of the following guarantees:

1. Consistency (C)
2. Availability (A)
3. Partition Tolerance (P)

Components of CAP Theorem

1. Consistency (C)

• Definition: In a distributed database, consistency ensures that all nodes see the
same data at the same time. If one node updates data, all other nodes should
immediately reflect this change.
• Example: In a sales database, if a sale is recorded at one showroom, the updated
sales data must be visible in all related tables across different nodes that rely on
that information. This means that if a user queries the sales data, they should see
the latest information regardless of which node they access.

2. Availability (A)

• Definition: Availability guarantees that every request receives a response,
whether it is a success or failure. The system remains operational and can
provide responses to queries even during failures.
• Example: If a user tries to access the database and one partition is down,
another partition should still provide a response. This means that if some nodes
are unreachable, at least one node must remain available to handle requests.

3. Partition Tolerance (P)

• Definition: Partition tolerance refers to the system's ability to continue
operating even when network failures or communication issues prevent some
nodes from connecting to others. The system can tolerate the loss of
communication between nodes.
• Example: If a network segment fails, the distributed database should still
function normally within the remaining segments. It should handle requests and
updates even if some nodes cannot communicate with others.

Trade-offs in the CAP Theorem

Brewer's CAP Theorem implies that in the presence of a network partition, a distributed
system can only guarantee two of the three properties at any time:

• CA (Consistency and Availability):
o The system will ensure consistency and availability, but during a network
partition, it may become unavailable to clients (i.e., it might refuse to
serve requests if it can't guarantee the latest data).
• AP (Availability and Partition Tolerance):
o The system remains available even when partitions occur, but it may
provide outdated or inconsistent data since it prioritizes availability over
consistency.
• CP (Consistency and Partition Tolerance):
o The system ensures consistency and can tolerate partitions, but it might
sacrifice availability, meaning it may not respond to requests until it can
guarantee the most recent data.

Practical Implications

When designing distributed systems, developers must consider the trade-offs presented
by the CAP theorem based on the application's needs:

• For systems where data accuracy is critical (e.g., banking applications):
Consistency and Partition Tolerance (CP) are often prioritized.
• For systems that require high availability (e.g., social media platforms):
Availability and Partition Tolerance (AP) are typically prioritized, possibly at the
expense of strict consistency.

Conclusion

The CAP Theorem is essential for understanding the limitations and trade-offs involved
in building distributed systems. It helps developers make informed decisions based on
the specific requirements of their applications regarding consistency, availability, and
partition tolerance.
13. Discuss the BASE properties in NoSQL databases.

BASE is an acronym that represents three key properties of NoSQL databases, offering
an alternative approach to the traditional ACID properties found in SQL databases. The
components of BASE are:

1. Basic Availability
2. Soft State
3. Eventual Consistency

1. Basic Availability

• Definition: Basic availability ensures that the database system remains
operational and responsive, even when some parts of the system fail. This is
achieved through data distribution and replication.
• Implementation: Data is partitioned into shards and distributed across multiple
nodes. Each shard has multiple replicas, so if one node fails, the system can still
access data from other nodes.
• Example: In a distributed NoSQL database, if a segment containing a portion of
the data becomes unavailable due to a server failure, the system can still serve
requests using data from replicas located on other servers.

2. Soft State

• Definition: Soft state means that the state of the system may change over time,
even without new inputs, due to eventual consistency. This property allows the
system to operate even in the presence of temporary inconsistencies.
• Implementation: Unlike traditional databases that require immediate
consistency, NoSQL databases can accept data in a partially inconsistent state
and resolve inconsistencies over time. Applications are designed to handle these
inconsistencies during processing.
• Example: In a social media application, when a user updates their status, it may
take some time for that update to propagate across all nodes in the system.
During this time, some nodes may show the old status, but the application
continues to function without interruption.

3. Eventual Consistency

• Definition: Eventual consistency is a model where the database guarantees that,
given enough time without new updates, all replicas of the data will converge to
a consistent state. However, there is no strict time frame for achieving this
consistency.
• Implementation: This model allows for greater flexibility in how data is
managed, enabling high availability and partition tolerance while accepting that
data may be temporarily inconsistent.
• Example: In an e-commerce platform, if a product's price is updated, there may
be a delay before all nodes reflect the new price. However, the system guarantees
that eventually, all nodes will update to show the correct price after a certain
period.
14. What is Document store data architecture? Discuss the features of document
store NoSQL data architecture.

Document Store Data Architecture

Document store data architecture is a type of NoSQL database that stores, retrieves, and
manages semi-structured data in the form of documents. Unlike traditional relational
databases that use tables to organize data, document stores use flexible formats like
JSON, XML, or BSON to allow for a more dynamic schema. This flexibility makes
document stores suitable for applications that require rapid development and iterative
changes to data structures.

Features of Document Store NoSQL Data Architecture

1. Schema Flexibility:
o Document stores do not enforce a rigid schema. Each document can have
its own structure, allowing for diverse data types and fields within the
same collection.
2. Storage of Unstructured Data:
o They excel at managing unstructured or semi-structured data, which does
not fit neatly into traditional rows and columns.
3. Hierarchical Structure:
o Data is organized in a nested hierarchy. For example, JSON documents can
contain arrays and objects, allowing for complex data structures to be
stored in a single document.
4. Easy Querying:
o Document stores provide intuitive querying capabilities using document
attributes, enabling users to retrieve specific parts of a document
efficiently.
5. No Object-Relational Mapping (ORM):
o Unlike relational databases that require ORM for data mapping, document
stores allow direct access to data structures, making it easier to navigate
and manipulate data.
6. ACID Transactions:
o Document stores can support ACID (Atomicity, Consistency, Isolation,
Durability) properties, ensuring reliable transactions, especially
important in multi-user environments.
7. High Performance:
o They are designed for high-performance reads and writes, making them
suitable for applications that require quick data access.
8. Scalability:
o Document stores can be easily scaled horizontally by adding more servers
to accommodate growing data volumes and user loads.
9. Rich Indexing Options:
o They offer various indexing capabilities, such as full-text search and
indexing on nested fields, which enhance query performance.
10. Data Retrieval by Path:
o Users can perform queries based on the path through the document tree,
enabling fine-grained retrieval of nested data (see the sketch after this list).
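A minimal sketch of document storage and path-based querying using pymongo against a hypothetical local MongoDB instance (the database, collection, and field names are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumes a MongoDB server is running locally
products = client["shop"]["products"]

# Each document can carry its own nested structure (schema flexibility).
products.insert_one({
    "name": "laptop",
    "specs": {"ram_gb": 16, "storage": {"type": "ssd", "size_gb": 512}},
    "tags": ["electronics", "portable"],
})

# Query by a path through the document tree using dot notation.
doc = products.find_one({"specs.storage.type": "ssd"})
print(doc["name"], doc["specs"]["ram_gb"])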
Typical Use Cases

Document stores are commonly used in various applications, including:

• Content management systems
• E-commerce applications
• Real-time analytics
• User profiles and settings
• Internet of Things (IoT) applications

Examples of Document Stores

• MongoDB: Widely used for its flexibility and scalability, allowing developers to
store data in JSON-like format.
• CouchDB: Uses a schema-free JSON document storage model, providing features
like multi-master replication and MapReduce querying.

15. Explain Key-Value store NoSQL database architecture.

A key-value store is a type of NoSQL database that uses a simple schema-less data
model. It operates using key-value pairs, where each key is a unique identifier that maps
to a specific value, which can be a string, object, or BLOB (Binary Large Object). This
model is akin to a hash table, where the unique key points to a particular piece of data.

Characteristics

• High Performance: Key-value stores are designed for fast data retrieval,
allowing for quick access to data using the primary key.
• Scalability: They can easily scale to accommodate large datasets, making them
suitable for applications with varying data sizes.
• Flexibility: Data types stored in the value field can vary, supporting a wide
range of data formats (a toy in-memory sketch follows).
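To make the key-value model concrete, here is a toy in-memory sketch in Python (a plain dictionary standing in for a real store such as Redis or Amazon DynamoDB; it is an illustration of the model, not a production implementation):

class KeyValueStore:
    """Toy key-value store: unique keys map to opaque values (strings, objects, BLOBs)."""

    def __init__(self):
        self._data = {}          # in-memory hash table, matching the model described above

    def put(self, key, value):
        self._data[key] = value  # the key is the only handle used to store or fetch the value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42:profile", {"name": "Asha", "plan": "premium"})
print(store.get("user:42:profile"))   # retrieval by exact key, like looking up a word in a dictionary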
Advantages of Key-Value Stores

1. Versatile Data Types: Key-value stores can handle any data type as the value
(text, images, video, etc.). For example, querying a key retrieves the associated
data much like looking up a word in a dictionary.
2. Simple Queries: Queries return the values as a single item, simplifying data
retrieval.
3. Eventual Consistency: Data in key-value stores is eventually consistent,
meaning that updates may not be immediately visible across all nodes but will
propagate over time.
4. Hierarchical Structures: Key-value stores can support hierarchical data models
or ordered key-value stores.
5. Conversion of Returned Values: Retrieved values can be converted into
various formats, such as lists, tables, or data frame columns, enhancing usability.
6. Operational Efficiency: Key-value stores offer scalability, reliability, portability,
and low operational costs.
7. Flexible Key Representation: Keys can be synthetic, auto-generated, or
logically represent files and web-service calls. This flexibility allows for diverse
applications.

Limitations of Key-Value Stores

1. Lack of Indexing: Key-value stores do not maintain indexes on values, making it
impossible to search for a subset of values efficiently.
2. Limited Transactional Capabilities: They do not support traditional database
features like atomicity or consistency for simultaneous transactions, requiring
applications to implement these features.
3. Unique Key Management: As data volume increases, ensuring unique keys can
become challenging, making it hard to retrieve specific data when keys are not
unique.
4. Query Limitations: Queries cannot filter individual values like relational
databases (e.g., no "WHERE" clause), limiting data retrieval capabilities.

Typical Uses of Key-Value Stores

• Image Storage: Efficiently managing large collections of images.
• Document or File Store: Storing and retrieving documents or files.
• Lookup Tables: Fast access to frequently used data points.
• Query Cache: Storing the results of database queries for rapid retrieval.
