BDA Assign 1
Definition of Data
Data refers to information that can be stored, processed, and analyzed. It includes facts,
statistics, and observations used for calculations or decision-making. Data can be
presented in various forms such as numbers, text, images, and videos.
Classification of Data
1. Structured Data
o Data that conforms to a specific schema or data model.
o Found in tables with rows and columns, like in relational databases
(RDBMS).
o Examples: SQL databases, spreadsheets.
o Features:
▪ Enables operations such as insert, delete, update, and append.
▪ Indexing for faster retrieval.
▪ Supports transaction processing following ACID rules.
2. Semi-Structured Data
o Data that contains tags or markers separating elements but does not
conform to a fixed schema.
o Examples: XML, JSON documents.
o Semi-structured data has some organizational properties but is more
flexible compared to structured data.
3. Multi-Structured Data
o Data that consists of multiple formats, including structured, semi-
structured, and unstructured data.
o Found in non-transactional systems and in various formats like streaming
data.
o Examples: Customer interaction logs, sensor data, enterprise server data.
4. Unstructured Data
o Data that lacks a predefined structure, such as text, images, and videos.
o Examples: Emails, social media posts, text documents, audio, and video
files.
o Unstructured data may have an internal structure but does not follow a
formal schema.
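As a rough illustration of the classification above, the short Python sketch below (not part of the original notes; the records shown are made up) contrasts structured rows with a fixed schema, semi-structured JSON whose fields vary per record, and unstructured free text.
```python
# A minimal sketch contrasting structured, semi-structured, and unstructured data.
import json

# Structured: fixed schema, every record has the same columns.
structured_rows = [
    {"id": 1, "name": "Asha", "marks": 82},
    {"id": 2, "name": "Ravi", "marks": 74},
]

# Semi-structured: JSON with tags/keys, but fields may differ per record.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["+91-90000-00000"], "premium": true}
]
""")

# Unstructured: free text with no predefined schema.
unstructured = "Customer wrote: 'The delivery was late but support was helpful.'"

for record in semi_structured:
    # Fields must be accessed defensively because the schema is not fixed.
    print(record["name"], record.get("email", "no email on file"))
```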
Big Data and Its Characteristics
Big Data refers to data sets so large and complex that traditional data processing tools cannot handle them. Its main characteristics are commonly described as the 3Vs, together with a fourth V for veracity:
1. Volume
o Refers to the sheer size of the data.
o Big Data typically involves massive amounts of data being generated
daily.
2. Velocity
o Refers to the speed at which data is generated and processed.
o In the era of Big Data, data is often generated in real-time or near-real-
time.
3. Variety
o Refers to the different types of data formats.
o Big Data can be structured, semi-structured, or unstructured, including
text, images, videos, and more.
4. Veracity
o Refers to the quality and accuracy of the data.
o The reliability of the data can vary, and its accuracy impacts the analysis.
Sources of Big Data
• Social Networks: Data from platforms like Facebook, Twitter, and YouTube.
• Transactions Data: Credit card transactions, bookings, public records.
• Machine-Generated Data: Data from sensors, Internet of Things (IoT), and
machine logs.
• Human-Generated Data: Emails, documents, biometrics, and interactions
recorded in digital formats.
Vertical Scalability
• Definition: Vertical scalability involves adding more resources (like CPUs, RAM,
or storage) to a single system to enhance its performance.
• Capabilities: This approach improves analytics, reporting, and visualization
capabilities, making it suitable for handling more complex problems.
• Efficient Design: It focuses on designing algorithms that effectively use the
increased resources.
• Example: If processing x terabytes of data takes time t, and the problem
complexity increases by a factor of n, then scaling up should keep the
processing time equal to, less than, or much less than n × t.
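A small worked example of this scale-up expectation, using hypothetical numbers (2 hours for the current workload, complexity growing 4-fold), so the scaled-up system should finish in at most n × t = 8 hours.
```python
# A toy illustration (hypothetical numbers, not from the notes) of the
# scale-up expectation: if complexity grows by a factor n, a well-designed
# scaled-up system should finish in at most n * t.
t_hours = 2.0      # time to process x terabytes on the current system
n = 4              # factor by which the problem complexity grows

worst_acceptable = n * t_hours          # upper bound: n * t = 8.0 hours
measured_after_scale_up = 5.5           # hypothetical measured time on the scaled-up system

print(f"Acceptable upper bound: {worst_acceptable} h")
print("Scale-up effective?", measured_after_scale_up <= worst_acceptable)
```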
Horizontal Scalability
• Definition: Horizontal scalability (scaling out) means adding more systems or nodes and distributing the workload across them, rather than enlarging a single machine. Grid computing is a common example of this approach.
The primary use of grid computing is to enable resource sharing among users, including individuals and organizations, to solve complex, large-scale computational problems. A grid is typically dedicated to one task at a time, such as scientific simulation, large-scale data analysis, or complex problem-solving. Its main characteristics are:
1. Resource Sharing:
o Grid computing allows for the sharing of heterogeneous resources (e.g.,
different types of computers, storage systems, and network components)
across multiple organizations or users. Resources are dispersed across
different geographic locations and can be accessed and used efficiently.
2. Scalability:
o Like cloud computing, grid computing is highly scalable. It can grow by
adding more nodes to the grid, enabling the system to handle increasing
amounts of data or computational tasks. This makes grid computing ideal
for handling tasks that require significant resources.
3. Geographical Distribution:
o Grid computing resources are distributed across various locations rather
than being centralized. This allows for the efficient utilization of
computational resources spread over a wide area, regardless of their
physical location.
4. Large-Scale Resource Coordination:
o Grid computing provides a coordinated framework for using multiple
resources. The coordination ensures that all resources are working
together seamlessly to complete a common task efficiently and securely. It
allows large amounts of data and computing power to be integrated and
utilized effectively.
5. Data-Intensive Operations:
o Grid computing is particularly suited for handling data-intensive tasks. It
is more efficient for storing and processing large datasets spread across
different grid nodes than for handling smaller, less intensive operations.
6. Task-Specific Grids:
o There are different types of grids, such as data grids (focused on
managing and distributing large amounts of data) and computational
grids (focused on processing computationally intensive tasks). Each grid
serves a specific purpose, ensuring that the resources available are
utilized appropriately for the task.
7. Security and Flexibility:
o Grid computing ensures secure and flexible sharing of resources. The
architecture provides a secure framework that protects sensitive data and
operations while still allowing flexible access for users based on their
needs.
8. Cost Efficiency:
o By pooling and sharing resources across a grid, organizations can reduce
their costs. Rather than buying expensive dedicated hardware, they can
use the collective power of distributed resources, which is often more
cost-effective.
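The sketch below is only a local analogy for horizontal scaling on a grid or cluster: it splits a data-intensive job into chunks and farms them out to several worker processes, which stand in for grid nodes. A real grid adds scheduling, security, and geographic distribution, which this toy example omits.
```python
# A minimal, local stand-in for horizontal scaling: the work is split into
# chunks and handed to several worker processes, much as a grid or cluster
# spreads it across nodes. (Process pools are only an analogy here.)
from multiprocessing import Pool

def chunk_sum(chunk):
    # Each "node" processes its own slice of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=n_workers) as pool:
        partials = pool.map(chunk_sum, chunks)   # scatter work to the workers

    print("Total:", sum(partials))               # gather and combine partial results
```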
Data Quality refers to the degree to which data is accurate, reliable, and suitable for its
intended use. High-quality data represents the real-world constructs accurately and is
essential for effective operations, analysis, decision-making, and knowledge discovery.
For example, high-quality data in an artificial intelligence system enables accurate
model predictions and decisions.
Data quality is commonly assessed in terms of the following five factors:
1. Data Integrity:
o Definition: Data integrity refers to the accuracy and consistency of data
over its lifecycle. The integrity of data must be maintained to ensure that
it remains correct and uncorrupted.
o Example: In a student grading system, the grades should remain
consistent and unchanged through processing.
2. Noise:
o Definition: Noise in data refers to unwanted or meaningless information
that deviates from the actual or true value. Noisy data can negatively
impact the accuracy of analysis.
o Characteristics: Noise is often random and can cause both positive and
negative deviations.
o Impact: Noisy data leads to incorrect conclusions during analysis.
3. Outliers:
o Definition: Outliers are data points that deviate significantly from the
rest of the dataset. These values may be valid or due to errors.
o Impact: If outliers are not correctly handled, they can distort results.
Removing true outliers or misclassifying valid data as outliers can lead to
incorrect analysis outcomes.
4. Missing Values:
o Definition: Missing values occur when data is absent from the dataset.
o Impact: Missing values affect the completeness of data and can skew
analysis, leading to inaccurate results. Techniques such as imputation
(estimating missing values) are often used to address this issue.
5. Duplicate Values:
o Definition: Duplicate values occur when the same data is repeated more
than once in a dataset.
o Impact: Duplicate values can lead to bias in analysis and inflated results.
Removing duplicate entries is essential to ensure accurate data
interpretation.
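A minimal sketch, assuming the pandas library is available (it is not mentioned in the notes), of how the data quality issues listed above (missing values, duplicates, and outliers) might be detected and handled on a small, made-up dataset.
```python
# A small data-cleaning sketch on made-up student marks using pandas.
import pandas as pd

df = pd.DataFrame({
    "student": ["A01", "A02", "A02", "A03", "A04"],
    "marks":   [82,    74,    74,    None,  950],   # None = missing, 950 = likely noise/outlier
})

# Missing values: count them, then impute with the column median.
print("Missing per column:\n", df.isna().sum())
df["marks"] = df["marks"].fillna(df["marks"].median())

# Duplicate values: drop exact duplicate rows to avoid biased counts.
df = df.drop_duplicates()

# Outliers: flag values far outside the interquartile range (a common rule of thumb).
q1, q3 = df["marks"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["marks"] < q1 - 1.5 * iqr) | (df["marks"] > q3 + 1.5 * iqr)]
print("Flagged outliers:\n", outliers)
```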
Data Analysis is the process of inspecting, cleaning, transforming, and modeling data
with the goal of discovering useful information, suggesting conclusions, and supporting
decision-making. It involves various statistical and mathematical techniques to extract
meaningful insights from data, ultimately aiding in making informed business decisions.
Data analysis provides value by turning raw data into actionable insights, which helps
businesses understand trends, behaviors, and opportunities.
Data analytics can be broken down into the following key phases:
1. Descriptive Analytics:
o Definition: Focuses on summarizing historical data to derive insights. It
helps in understanding what has happened in the past through reports
and visualizations.
o Examples: Sales reports, data visualizations like charts or dashboards
that show trends over time.
2. Predictive Analytics:
o Definition: Uses statistical models and machine learning techniques to
forecast future outcomes based on historical data.
o Examples: Predicting customer churn, demand forecasting for products,
stock price prediction.
3. Prescriptive Analytics:
o Definition: Suggests actions or decisions by analyzing data and
recommending the best course of action to maximize outcomes or profits.
o Examples: Recommending product pricing strategies, optimizing supply
chain operations.
4. Cognitive Analytics:
o Definition: Uses advanced algorithms and artificial intelligence to
simulate human thought processes, enabling systems to make better
decisions in complex scenarios.
o Examples: Intelligent virtual assistants like Siri or Alexa, systems that
understand natural language and improve decision-making.
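The following sketch (made-up monthly sales figures, numpy assumed) contrasts descriptive analytics, which summarizes what has already happened, with a very simple predictive step that fits a trend line to forecast the next value. Prescriptive and cognitive analytics build further on such outputs and are not shown.
```python
# Descriptive vs. predictive analytics on a toy series of monthly sales.
import numpy as np

sales = np.array([120, 135, 150, 160, 172, 185])   # last six months (made up)

# Descriptive: summarize historical data.
print("Total:", sales.sum(), "Mean:", sales.mean(), "Best month:", sales.max())

# Predictive: fit a linear trend and forecast month 7.
months = np.arange(1, len(sales) + 1)
slope, intercept = np.polyfit(months, sales, deg=1)
forecast_next = slope * (len(sales) + 1) + intercept
print("Forecast for next month:", round(float(forecast_next), 1))
```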
7. Discuss the functions of each of the five layers in Big Data Architecture Design.
Functions of Each Layer in Big Data Architecture Design
Big Data architecture is organized into five layers, each with specific functions that
contribute to the overall management and processing of large volumes of data. Here's a
detailed overview of the functions of each layer:
Layer 1 (L1): Identification of Data Sources
Function: This layer focuses on identifying and categorizing the various data sources
from which data will be collected.
• Key Functions:
o Identify Data Sources: Determine whether the sources are internal (e.g.,
company databases) or external (e.g., social media, public datasets).
o Data Source Types: Understand the nature of the data sources, such as
structured (databases), semi-structured (JSON, XML), or unstructured
(text documents, images).
o Data Volume Considerations: Estimate the amount of data to be
ingested and plan accordingly.
o Data Formats: Recognize the formats of incoming data and prepare for
conversion or processing as necessary.
Layer 2 (L2): Data Ingestion
Function: The data ingestion layer is responsible for absorbing data from various
sources and preparing it for further processing.
• Key Functions:
o Data Acquisition: Collect data in real-time or through batch processing.
o Ingestion Processes: Utilize ETL (Extract, Transform, Load) processes to
prepare data for storage and analysis.
o Push vs. Pull Mechanisms: Implement push (data sent automatically
from the source) or pull (data requested from the source) strategies for
data ingestion.
o Real-time vs. Batch Ingestion: Decide whether to process data
continuously as it arrives or in scheduled intervals.
Layer 3 (L3): Data Storage
Function: This layer provides a reliable storage solution for the ingested data, ensuring
that it is organized and easily accessible for processing.
• Key Functions:
o Data Storage Solutions: Choose appropriate storage technologies, such
as Hadoop Distributed File System (HDFS), NoSQL databases (e.g.,
Cassandra, MongoDB), or traditional relational databases.
o Data Formats and Compression: Decide on data formats for storage and
apply compression techniques to optimize space.
o Historical vs. Incremental Storage: Determine whether to store
historical data or manage incremental updates to datasets.
o Access Patterns: Understand how the data will be queried and accessed
for subsequent processing and analytics.
Layer 4 (L4): Data Processing
Function: This layer involves transforming and analyzing the stored data using various
data processing tools and frameworks.
• Key Functions:
o Data Processing Tools: Utilize frameworks and software such as
MapReduce, Apache Spark, Hive, and Pig for data processing tasks.
o Processing Modes: Implement scheduled batch processing, real-time
processing, or a hybrid approach based on application needs.
o Synchronous vs. Asynchronous Processing: Manage processing
requirements depending on how data consumption occurs at the upper
layer (L5).
o Data Transformation: Perform necessary transformations to clean,
aggregate, or enrich data for analytics.
Layer 5 (L5): Data Consumption
Function: This layer is dedicated to the consumption of processed data through various
applications, reporting tools, and analytics platforms.
• Key Functions:
o Data Integration: Integrate processed data with business intelligence
tools and analytics platforms for seamless usage.
o Reporting and Visualization: Provide capabilities for data visualization
and reporting using tools like Tableau, Power BI, or custom dashboards.
o Analytics Applications: Support various analytics processes, including
descriptive analytics, predictive analytics, data mining, and machine
learning.
o Export Capabilities: Allow for exporting datasets to cloud storage, web
applications, or other systems for additional processing or sharing.
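The compressed sketch below walks a toy dataset through all five layers in one script: a CSV source is identified and ingested (L1/L2), persisted (L3), aggregated (L4), and exposed as a simple report (L5). It assumes pandas and uses local files purely for illustration; a production design would use the tools named above, such as HDFS for storage, Spark for processing, and Tableau or Power BI for reporting.
```python
# A file-based toy walk-through of the five-layer architecture.
import io
import pandas as pd

# L1/L2: identification and batch ingestion of a (simulated) CSV source.
raw_csv = io.StringIO("region,amount\nNorth,100\nSouth,250\nNorth,175\n")
incoming = pd.read_csv(raw_csv)

# L3: data storage - persist the ingested batch.
incoming.to_csv("sales_raw.csv", index=False)

# L4: data processing - aggregate the stored data.
stored = pd.read_csv("sales_raw.csv")
summary = stored.groupby("region", as_index=False)["amount"].sum()

# L5: data consumption - expose the result to reporting/visualization.
print(summary.to_string(index=False))
```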
Hadoop is a powerful framework designed to store and process vast amounts of data in
a distributed environment. It consists of various core components and features that
enhance its capabilities in handling Big Data. Below is an overview of the Hadoop core
components and its features.
Core Components of Hadoop
1. Hadoop Common:
o This module includes the libraries and utilities essential for the other
Hadoop modules.
o It provides various components and interfaces for distributed file systems
and general input/output, such as serialization, Java RPC (Remote
Procedure Call), and file-based data structures.
2. Hadoop Distributed File System (HDFS):
o HDFS is a Java-based distributed file system that can store all types of
data on disks across clusters.
o It is designed to provide high-throughput access to application data and is
optimized for large datasets.
3. MapReduce v1:
o This is the original programming model in Hadoop, utilizing the Mapper
and Reducer functions.
o It processes large sets of data in parallel and in batches, breaking tasks
into smaller, manageable sub-tasks.
4. YARN (Yet Another Resource Negotiator):
o YARN is responsible for managing resources across the Hadoop
ecosystem.
o It allows user application tasks or sub-tasks to run in parallel, using
scheduling to handle resource requests in a distributed environment.
5. MapReduce v2:
o This is the improved version of the MapReduce framework introduced
with Hadoop 2, built on the YARN architecture.
o It enhances parallel processing of large datasets and enables distributed
processing of application tasks.
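To make the Mapper/Reducer idea concrete, here is an in-process word-count sketch. In Hadoop, the map and reduce functions run in parallel across the cluster (directly in MapReduce v1, or on YARN in MapReduce v2); this local version only shows the data flow of map, shuffle, and reduce.
```python
# An in-process illustration of the MapReduce pattern (word count).
from collections import defaultdict

documents = [
    "big data needs distributed processing",
    "hadoop processes big data in batches",
]

# Map phase: emit (key, value) pairs, here (word, 1).
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```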
Features of Hadoop
• Scalable: Clusters scale out horizontally; storage and processing capacity grow by adding more commodity nodes.
• Fault-Tolerant: HDFS replicates data blocks across nodes, so jobs continue even when individual machines fail.
• Distributed and Parallel: Computation is moved to the nodes where the data resides, enabling parallel batch processing of very large datasets.
• Flexible: Stores and processes structured, semi-structured, and unstructured data.
• Economical: Open source and designed to run on clusters of inexpensive commodity hardware.
Big Data Use Cases and Challenges
Fraud Detection:
• Data Fusion: Combines data from different sources (such as social media and internal databases) to provide a comprehensive view that aids in fraud detection.
• Multiple Data Sources: Uses various types of data to enhance insights and reporting.
• Real-Time Analytics: Analyzes data quickly to detect potential fraud before it causes significant harm.
Challenges:
• Data Quality: Concerns about the accuracy and reliability of data, which can lead to incorrect analyses.
• Security and Privacy: Risks related to data breaches and unauthorized access to sensitive information.
• Financial Costs: The high costs associated with managing large volumes of data, which can impact profitability.
Healthcare:
• Patient Monitoring: Uses data from wearable devices to track patient health in real time.
• Fraud Prevention: Identifies duplicate claims and unnecessary medical tests to reduce costs in the healthcare system.
• Improving Outcomes: Predictive analytics helps in early diagnosis and treatment of conditions, leading to better patient outcomes.
ACID Properties of Transactions
1. Atomicity
o Definition: All operations within a transaction must complete
successfully. If any operation fails, the entire transaction is rolled back,
meaning no changes are made to the database.
o Example: Consider a bank transaction where a customer withdraws
money. If the first operation deducts the amount from the account but the
second operation (updating the balance) fails, the entire transaction is
rolled back. This ensures that either both operations succeed, or none do.
2. Consistency
o Definition: A transaction must bring the database from one valid state to
another, maintaining all predefined rules and constraints (like integrity
constraints).
o Example: In a bank, the total of all deposits and withdrawals must equal
the current balance. If a transaction results in an inconsistency (e.g., a
balance that does not match the sum of transactions), it must be rolled
back.
3. Isolation
o Definition: Transactions must be executed in isolation from one another,
meaning that concurrent transactions do not interfere with each other.
The result of a transaction should not be visible to other transactions until
it is committed.
o Example: If two customers are trying to withdraw money from the same
account at the same time, isolation ensures that each transaction is
processed independently. This prevents situations where both
transactions see an outdated balance.
4. Durability
o Definition: Once a transaction is committed, its changes are permanent
and must survive system failures (like crashes or power losses).
o Example: After successfully withdrawing money, even if the system
crashes, the change (the deduction from the account) must still exist in
the database once the system is restored.
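A minimal sketch of atomicity using Python's built-in sqlite3 module (the account names and amounts are made up): the two balance updates of a transfer either both commit or are both rolled back.
```python
# Atomic money transfer with sqlite3: all-or-nothing updates inside one transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 500.0), ("B", 200.0)])
conn.commit()

def transfer(src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # forces a rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError as exc:
        print("Transaction rolled back:", exc)

transfer("A", "B", 600.0)   # fails: both updates are undone
print(conn.execute("SELECT * FROM accounts").fetchall())  # balances unchanged
```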
12. Explain Brewer's CAP Theorem.
CAP Theorem
Brewer's CAP Theorem states that a distributed data store cannot simultaneously guarantee all three of the following properties:
1. Consistency (C)
2. Availability (A)
3. Partition Tolerance (P)
1. Consistency (C)
• Definition: In a distributed database, consistency ensures that all nodes see the
same data at the same time. If one node updates data, all other nodes should
immediately reflect this change.
• Example: In a sales database, if a sale is recorded at one showroom, the updated
sales data must be visible in all related tables across different nodes that rely on
that information. This means that if a user queries the sales data, they should see
the latest information regardless of which node they access.
2. Availability (A)
• Definition: Availability means that every request made to the system receives a response (success or failure), even when some nodes are down. The system remains operational and responsive at all times.
• Example: If one showroom's node fails, users can still query and record sales through the remaining nodes; the database continues to answer requests.
3. Partition Tolerance (P)
• Definition: Partition tolerance means the system continues to operate even when network failures split the nodes into groups that cannot communicate with each other (a network partition).
• Example: If the network link between two data centers breaks, each side keeps serving its local users until the link is restored.
Implications of the Theorem
Brewer's CAP Theorem implies that a distributed system can guarantee at most two of the three properties at any time; because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability:
• CP (Consistency + Partition Tolerance): the system stays consistent during a partition but may reject some requests, reducing availability.
• AP (Availability + Partition Tolerance): the system keeps responding during a partition, but some reads may return stale data until the nodes reconcile.
• CA (Consistency + Availability): achievable only when no partition occurs, which is rarely a safe assumption for a distributed system.
Practical Implications
When designing distributed systems, developers must weigh the trade-offs presented
by the CAP theorem against the application's needs:
• Applications that cannot tolerate incorrect data, such as banking or inventory systems, typically favor consistency (CP) and accept that some requests may fail during a partition.
• Applications that must stay responsive, such as social media feeds or shopping carts, typically favor availability (AP) and accept temporarily stale data.
Conclusion
The CAP Theorem is essential for understanding the limitations and trade-offs involved
in building distributed systems. It helps developers make informed decisions based on
the specific requirements of their applications regarding consistency, availability, and
partition tolerance.
13. Discuss the BASE properties in a NoSQL database.
BASE is an acronym that represents three key properties of NoSQL databases, offering
an alternative approach to the traditional ACID properties found in SQL databases. The
components of BASE are:
1. Basic Availability
2. Soft State
3. Eventual Consistency
1. Basic Availability
• Definition: The database guarantees availability in a basic sense: every request receives some response, even if part of the system has failed or the response reflects slightly stale data.
• Implementation: Data is replicated and spread across many nodes, so the failure of one node does not make the whole dataset unavailable.
• Example: In an online store, product pages continue to load even if one replica holding part of the catalog is temporarily down.
2. Soft State
• Definition: Soft state means that the state of the system may change over time,
even without new inputs, due to eventual consistency. This property allows the
system to operate even in the presence of temporary inconsistencies.
• Implementation: Unlike traditional databases that require immediate
consistency, NoSQL databases can accept data in a partially inconsistent state
and resolve inconsistencies over time. Applications are designed to handle these
inconsistencies during processing.
• Example: In a social media application, when a user updates their status, it may
take some time for that update to propagate across all nodes in the system.
During this time, some nodes may show the old status, but the application
continues to function without interruption.
3. Eventual Consistency
• Definition: The system does not guarantee that every read sees the latest write immediately; instead, if no new updates are made, all replicas eventually converge to the same value.
• Example: A product rating updated on one node may take a short time to appear on the other nodes, but all nodes will eventually show the same rating.
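A toy simulation (plain Python dictionaries, not a real database) of the soft-state and eventual-consistency behavior described above: an update is accepted on one replica, reads briefly disagree, and the replicas converge once the change propagates.
```python
# Toy eventual-consistency demo with three in-memory "replicas".
replicas = {"node1": "status: offline", "node2": "status: offline", "node3": "status: offline"}

def write(primary, value):
    replicas[primary] = value                      # accepted immediately (soft state)

def replicate(source):
    for node in replicas:                          # lazy propagation to the other nodes
        replicas[node] = replicas[source]

write("node1", "status: online")
print("Right after the write:", replicas)          # node2/node3 still return the old value

replicate("node1")
print("After replication:    ", replicas)          # all replicas now agree
```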
14. What is Document store data architecture? Discuss the features of document store NoSQL data architecture.
Document store data architecture is a type of NoSQL database that stores, retrieves, and
manages semi-structured data in the form of documents. Unlike traditional relational
databases that use tables to organize data, document stores use flexible formats like
JSON, XML, or BSON to allow for a more dynamic schema. This flexibility makes
document stores suitable for applications that require rapid development and iterative
changes to data structures.
Features of Document Store NoSQL Data Architecture
1. Schema Flexibility:
o Document stores do not enforce a rigid schema. Each document can have
its own structure, allowing for diverse data types and fields within the
same collection.
2. Storage of Unstructured Data:
o They excel at managing unstructured or semi-structured data, which does
not fit neatly into traditional rows and columns.
3. Hierarchical Structure:
o Data is organized in a nested hierarchy. For example, JSON documents can
contain arrays and objects, allowing for complex data structures to be
stored in a single document.
4. Easy Querying:
o Document stores provide intuitive querying capabilities using document
attributes, enabling users to retrieve specific parts of a document
efficiently.
5. No Object-Relational Mapping (ORM):
o Unlike relational databases that require ORM for data mapping, document
stores allow direct access to data structures, making it easier to navigate
and manipulate data.
6. ACID Transactions:
o Document stores can support ACID (Atomicity, Consistency, Isolation,
Durability) properties, ensuring reliable transactions, especially
important in multi-user environments.
7. High Performance:
o They are designed for high-performance reads and writes, making them
suitable for applications that require quick data access.
8. Scalability:
o Document stores can be easily scaled horizontally by adding more servers
to accommodate growing data volumes and user loads.
9. Rich Indexing Options:
o They offer various indexing capabilities, such as full-text search and
indexing on nested fields, which enhance query performance.
10. Data Retrieval by Path:
o Users can perform queries based on the path through the document tree,
enabling fine-grained retrieval of nested data.
Examples of Document Store Databases
• MongoDB: Widely used for its flexibility and scalability, allowing developers to
store data in JSON-like format.
• CouchDB: Uses a schema-free JSON document storage model, providing features
like multi-master replication and MapReduce querying.
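A minimal sketch using pymongo, assuming the pymongo package is installed and a MongoDB server is running on localhost; the database, collection, and field names are invented for illustration. It shows schema flexibility, nested documents, path-based querying, and indexing on a nested field.
```python
# Document-store basics with pymongo (assumes a local MongoDB instance).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["shopdb"]["customers"]

# Documents in the same collection may have different structures (schema flexibility).
customers.insert_one({"name": "Asha", "address": {"city": "Pune", "pin": "411001"}})
customers.insert_one({"name": "Ravi", "orders": [{"item": "laptop", "qty": 1}]})

# Query by a path through the document tree (nested field access).
print(customers.find_one({"address.city": "Pune"}))

# Index a nested field to speed up such queries.
customers.create_index("address.city")
```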
A key-value store is a type of NoSQL database that uses a simple schema-less data
model. It operates using key-value pairs, where each key is a unique identifier that maps
to a specific value, which can be a string, object, or BLOB (Binary Large Object). This
model is akin to a hash table, where the unique key points to a particular piece of data.
Characteristics
• High Performance: Key-value stores are designed for fast data retrieval,
allowing for quick access to data using the primary key.
• Scalability: They can easily scale to accommodate large datasets, making them
suitable for applications with varying data sizes.
• Flexibility: Data types stored in the value field can vary, supporting a wide range
of data formats.
Advantages of Key-Value Stores
1. Versatile Data Types: Key-value stores can handle any data type as the value
(text, images, video, etc.). For example, querying a key retrieves the associated
data much like looking up a word in a dictionary.
2. Simple Queries: Queries return the values as a single item, simplifying data
retrieval.
3. Eventual Consistency: Data in key-value stores is eventually consistent,
meaning that updates may not be immediately visible across all nodes but will
propagate over time.
4. Hierarchical Structures: Key-value stores can support hierarchical data models
or ordered key-value stores.
5. Conversion of Returned Values: Retrieved values can be converted into
various formats, such as lists, tables, or data frame columns, enhancing usability.
6. Operational Efficiency: Key-value stores offer scalability, reliability, portability,
and low operational costs.
7. Flexible Key Representation: Keys can be synthetic, auto-generated, or
logically represent files and web-service calls. This flexibility allows for diverse
applications.
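A minimal key-value sketch using the redis-py client, assuming the redis package is installed and a Redis server is running locally; the keys shown are made-up synthetic identifiers. The value can be any blob, so a JSON string stands in for an object here.
```python
# Key-value store basics with redis-py (assumes a local Redis server).
import json
import redis

kv = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The value can be any blob of data; here a JSON string stands in for an object.
kv.set("user:1001", json.dumps({"name": "Asha", "plan": "premium"}))
kv.set("user:1001:avatar", "https://example.com/avatars/1001.png")

# Retrieval is a single lookup by key, like finding a word in a dictionary.
profile = json.loads(kv.get("user:1001"))
print(profile["name"], kv.get("user:1001:avatar"))
```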