UNIT- 1 DDB
Course Objectives
The purpose of the course is to build on previous knowledge of database
systems and to expose the need for distributed database technology to overcome the
deficiencies of centralized database systems.
Introduce basic principles and implementation techniques
of distributed database systems.
Equip students with principles and knowledge of parallel and object-oriented
databases.
Topics include distributed DBMS architecture and design; query processing and
optimization; distributed transaction management and reliability; parallel and
object database management systems.
Course Outcomes:
Understand theoretical and practical aspects of distributed database systems.
Study and identify various issues related to the development of distributed database
systems.
Understand the design aspects of object-oriented database systems and related
developments.
Understand how to generate queries and optimize them.
Understand and implement principles of object-oriented databases.
Parallel Databases:
Database systems that run over multiprocessor systems are called parallel database
systems.
Distributed Databases:
These are groups of databases that are spread across different locations but are still
connected.
They work together as if they were a single database.
Delivery of Data:
Pull-only: Data is only sent when a user or system asks for it.
Push-only: Data is automatically sent without waiting for a request.
Hybrid: A mix of both approaches, where data can be both requested and automatically
sent.
Communication Methods:
Unicast: Data is sent directly from one computer to another.
One-to-Many (Multicast/Broadcast): Data is sent from one computer to multiple
computers at the same time.
Data Delivery Frequency
Periodic:
In periodic delivery, data are sent from the server to clients at regular intervals.
The intervals can be defined by system default or by clients using their profiles.
Conditional:
Data are sent from servers whenever certain conditions installed by clients in their
profiles are satisfied.
Ad-hoc or Irregular:
Performed mostly in a pure pull-based system. Data are pulled from servers to
clients in an ad-hoc fashion whenever clients request it.
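A minimal Python sketch of these delivery options (all class and method names are invented for illustration; a real server would also run a scheduler loop for periodic subscriptions, which is omitted here):

class DataServer:
    """Toy server illustrating pull, conditional push, and registered periodic push."""

    def __init__(self, data):
        self.data = data                  # item -> current value
        self.conditional = []             # (client, item, predicate) from client profiles
        self.periodic = []                # (client, item, interval); a real server would
                                          # run a timer loop for these, omitted here

    def pull(self, item):                 # pull-only: data moves only when requested
        return self.data.get(item)

    def subscribe_periodic(self, client, item, interval_s):
        self.periodic.append((client, item, interval_s))

    def subscribe_conditional(self, client, item, predicate):
        self.conditional.append((client, item, predicate))

    def on_update(self, item, value):     # push side: the server initiates the transfer
        self.data[item] = value
        for client, sub_item, predicate in self.conditional:
            if sub_item == item and predicate(value):
                client.receive(item, value)

class Client:
    def receive(self, item, value):
        print(f"pushed: {item} = {value}")

server = DataServer({"temperature": 20})
c = Client()
print(server.pull("temperature"))                                # ad-hoc pull
server.subscribe_conditional(c, "temperature", lambda v: v > 30)
server.on_update("temperature", 35)                              # conditional push fires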
Benefits of Distributed Database Systems
Transparency: Users don’t need to know where the data is physically located;
the system makes it look like all the data is in one place.
Reliability: If one part of the system fails, the rest can keep working, making the
overall system more reliable.
Performance: By spreading out the work across multiple computers, tasks can
be completed faster.
Scalability: The system can easily grow by adding more computers or storage
locations as needed.
Logical Data Independence: This means that changes to the logical structure or
schema of the database do not affect user applications.
Physical Data Independence: This means that changes in how data is physically
stored do not affect user applications.
In essence, when a user application is created, it should not
need to be changed just because the way data is organized or stored changes. This
separation allows the system to make adjustments for better performance without
disrupting the user's work.
Network Transparency
In a centralized database system, the main resource to manage is the data itself.
However, in a distributed database system, the network is an additional resource
that needs to be managed, and users should be shielded from its complexities.
The goal is to make the distributed database feel as seamless as a centralized one,
a concept known as network transparency or distribution transparency.
Network Transparency: Users should not have to worry about the network's
operational details or even be aware of its existence when using a distributed database.
Service Uniformity: It is important to have a consistent way to access services,
regardless of whether the database is centralized or distributed.
Distribution Transparency: Users should not need to know where the data is located.
The system handles data location details.
Location Transparency: The command to perform a task should work regardless of
where the data is stored or on which system the operation is executed.
Naming Transparency: Each object in the database should have a unique name, so
users don't need to include the location in the object's name. Without naming
transparency, users would need to specify the location as part of the object's name.
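A tiny illustration, with a made-up catalog, of how naming transparency can be realized: a global directory maps a location-independent object name to the site and local name that actually hold it, so users never write the site into the name.

# Hypothetical global catalog: location-independent name -> (site, local name).
catalog = {
    "EMP":  ("site1", "emp_local"),
    "PROJ": ("site3", "proj_local"),
}

def resolve(global_name):
    """With naming transparency the user supplies only 'EMP'; without it,
    the query would have to say something like 'site1.emp_local'."""
    site, local_name = catalog[global_name]
    return site, local_name

print(resolve("EMP"))   # ('site1', 'emp_local')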
Replication Transparency
Replication of data in a distributed database involves storing copies of the same
data on different machines across a network. This has benefits but also introduces
complexities. The key points are:
Reasons for Replication: Data is often replicated to improve performance,
reliability, and availability. For example, placing commonly accessed data on
multiple machines can make access faster and ensure data is available even if one
machine fails.
User Perspective: Ideally, users should not need to know about the existence of
multiple copies of the data. They should interact with the database as if there is
only one copy, letting the system manage the replication.
System Perspective: While hiding replication from users makes their experience
simpler, it complicates the system’s management. If users are responsible for
specifying actions on multiple copies, it simplifies transaction management but
reduces flexibility and data independence.
Replication Transparency: This is the idea that users should not have to deal
with the existence of data copies. The system should handle it. This is separate
from network transparency, which deals with where these copies are stored.
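A rough sketch of what hiding the copies from users can look like (the read-one / write-all policy shown here is just one possible choice, not something prescribed by the text): the user issues a single read or write, and the system decides which copies are touched.

class ReplicatedRelation:
    """Hides the fact that a relation is stored as several copies."""

    def __init__(self, copies):
        self.copies = copies            # list of dicts, one per site

    def read(self, key):
        # Read from any single copy (here, simply the first one).
        return self.copies[0].get(key)

    def write(self, key, value):
        # The system, not the user, propagates the update to every copy.
        for copy in self.copies:
            copy[key] = value

emp = ReplicatedRelation([{}, {}, {}])   # three copies on three sites
emp.write(101, "J. Doe")                 # user issues one write
print(emp.read(101))                     # user issues one read; copies stay invisible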
Fragmentation Transparency
In a distributed database system, fragmentation transparency is an important
concept. It involves dividing a database into smaller parts, known as fragments,
which are treated as separate database objects.
Purpose of Fragmentation:
Performance: Smaller fragments can be
processed more efficiently.
Availability and Reliability: Data can be more reliably accessed if it's
fragmented.
Reducing Replication Overhead: Only subsets of data are replicated,
requiring less storage and management.
Types of Fragmentation:
Horizontal Fragmentation: The relation is divided into sub-relations, each
containing a subset of rows (tuples) of the original relation.
Vertical Fragmentation: The relation is divided into sub-relations, each
containing a subset of columns (attributes) of the original relation.
Query Processing with Fragments:
When a user submits a query for the entire relation, the system must
translate it into multiple queries for the relevant fragments.
This process ensures that even though the user queries the whole relation,
the system efficiently executes the query on the smaller fragments.
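For example (the EMP relation and its split on salary are invented for illustration), a query on the whole relation can be rewritten by the system into queries on its horizontal fragments, whose results are combined by union:

# Two horizontal fragments of a hypothetical EMP relation, split on salary.
emp1 = [{"eno": 1, "name": "A", "sal": 25000},
        {"eno": 2, "name": "B", "sal": 28000}]   # SAL <= 30000
emp2 = [{"eno": 3, "name": "C", "sal": 40000}]   # SAL >  30000

def select_emp(predicate):
    """The user queries 'EMP'; the system runs the query on each
    fragment and unions the partial results."""
    return [t for fragment in (emp1, emp2) for t in fragment if predicate(t)]

# User's view: one relation. Actual execution: two fragment queries.
print(select_emp(lambda t: t["sal"] > 26000))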
Necessity of Transparency
There are different forms of transparency in distributed computing, and the challenge
is to balance ease of use against the costs of providing full transparency.
Here are the key points:
Full Transparency vs. Complexity:
Full transparency makes it easier for users to access DBMS services but can
make managing distributed data more difficult.
Some experts argue that full transparency leads to poor manageability,
modularity, and performance in distributed databases.
Remote Procedure Call (RPC) Approach:
Instead of full transparency, a remote procedure call mechanism allows
users to direct queries to specific DBMSs, which is a common approach in
client/server systems.
Hierarchy of Transparencies:
The different levels of transparency can be organized as a hierarchy,
with an added "language transparency" layer, which allows high-level
access to data (e.g., through graphical user interfaces or natural
language access).
Three Layers of Providing Transparency:
Access Layer: Transparency can be built into the user language, where the
compiler or interpreter translates requests into operations, shielding users from
the underlying complexity.
Operating System Layer: Some transparency is provided by the operating
system, such as handling device drivers. This can be extended to distributed
systems, though not all systems offer sufficient network transparency, and
some applications may need to bypass it for performance tuning.
DBMS Layer: The DBMS often provides transparency by handling
translations from the operating system to the user interface. This is the most
common method, but it comes with challenges related to the interaction
between the operating system and the distributed DBMS.
Easier Expansion:
• In a distributed system, it's easier to increase database size by
adding more processing and storage power to the network.
• Major system overhauls are rarely needed, though the increase in
power may not be perfectly linear due to distribution overhead.
Cost-Effective:
• Building a distributed system with multiple smaller computers is
often cheaper than investing in one large, powerful machine.
Topic 5: Problem areas
Problem 1:
Data Replication: In a distributed database system, data may be stored in different
locations on a computer network. Not every location needs to have the entire
database, but the data is spread across more than one site for reliability and
efficiency.
Data Access and Updates: The system is responsible for selecting the correct copy
of the data when retrieving information. It must also ensure that any updates to
the data are reflected on all copies of that data across different sites.
Problem 2:
If a site or communication link fails during an update, the system needs to ensure
that the update is applied to the affected sites as soon as they are back online.
Problem 3:
Because each site doesn't instantly know what actions are happening at other
sites, it is much more difficult to synchronize transactions across multiple sites
compared to a centralized system.
Autonomy:
1. Tight Integration: All databases are logically combined into one, giving users the
impression of a single database, even if the data is in multiple databases. A central data
manager controls user requests, even if they involve multiple databases.
2. Semi-Autonomous Systems: DBMSs usually operate independently but collaborate to
share their data. Each DBMS decides what data to share. They aren't fully autonomous
because they need some modifications to share information with others.
3. Total Isolation: DBMSs operate completely independently, unaware of other DBMSs.
Processing transactions that involve multiple databases is difficult because there's no
global control over them.
These are three common levels of autonomy, but other alternatives are possible
as well.
Distribution:
There are two main ways that DBMSs are distributed:
Client/Server Distribution:
In this setup, servers handle data management, while clients take care of the
application environment, including the user interface. Both clients and servers share
the communication tasks. This is a balanced way of distributing tasks between
machines, with some setups being more distributed than others. The key point is
that in a client/server model, machines are categorized as either "clients" or
"servers," and they have different roles.
Peer-to-Peer Distribution:
In this system, there is no difference between clients and servers. Every machine
has the full capabilities of a DBMS and can work with other machines to run
queries and transactions. Early distributed database systems were mostly based on
peer-to-peer architecture. They are also called fully distributed systems, although many
of the techniques also apply to client/server systems.
Heterogeneity:
Heterogeneity in distributed systems means that there can be differences in various
aspects, like hardware, networking protocols, and data management methods. The most
important differences in this context are related to data models, query languages, and
transaction management protocols.
Data Models: Different tools for representing data can cause heterogeneity because
each data model has its own strengths and limitations.
Query Languages: Differences in query languages can arise in several ways. For
example, some systems might access data one record at a time (like in some object-
oriented systems), while others access data in sets (like in relational systems). Even
within the same data model, such as SQL for relational databases, different vendors
may have their own versions of the language, which can behave slightly differently.
Peer-to-Peer Systems:
Peer-to-Peer Evolution: The concept of peer-to-peer (P2P) systems has evolved. Unlike
early systems with a few sites, modern P2P systems handle thousands of sites with diverse
and independent systems.
Classical vs. Modern P2P: The book initially focuses on the classical peer-to-peer
architecture (where all sites have the same functions) and later addresses modern P2P
database issues.
Data Organization in P2P Systems:
Local Internal Schema (LIS): Each site may have different physical data
organization.
Global Conceptual Schema (GCS): Represents the overall logical structure of data
across all sites.
Local Conceptual Schema (LCS): Describes the logical organization of data at
each site.
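A rough illustration, with invented schemas, of how a global conceptual schema can map onto different local conceptual schemas, one per site:

# Global conceptual schema (GCS): the logical relations visible to all users.
GCS = {"EMP": ["eno", "name", "title"], "PROJ": ["pno", "pname", "budget"]}

# Local conceptual schemas (LCS): which part of the GCS each site holds,
# possibly under different physical organizations (the LIS is not shown).
LCS = {
    "site1": {"EMP":  ["eno", "name", "title"]},      # full EMP relation
    "site2": {"PROJ": ["pno", "pname"]},              # vertical fragment of PROJ
    "site3": {"PROJ": ["pno", "budget"]},             # the remaining attributes
}

def sites_holding(relation):
    """Location transparency: the system, not the user, finds where data lives."""
    return [s for s, schema in LCS.items() if relation in schema]

print(sites_holding("PROJ"))   # ['site2', 'site3']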
Transparency and Independence: The architecture supports data independence,
location, and replication transparency, meaning users can query data without worrying
about its physical location.
Components of Distributed DBMS:
User Processor: Handles user interaction, query processing, and coordination of
distributed execution.
Data Processor: Manages local query optimization, data consistency, and physical
data access.
Function Placement in P2P: Unlike client/server systems, both user and data processors
are typically found on each machine in a P2P system, but there are suggestions to have
"query-only" sites with limited functionality.
Client/Server Comparison: In client/server systems, the client handles user interaction,
while the server manages data processing. Multiple server setups can have more complex
module distributions.
Requirements Analysis:
Purpose: Defines the system environment and gathers the data and processing
needs of all potential users.
Objectives: Considers performance, reliability, availability, cost-effectiveness, and
flexibility of the system.
Parallel Design Activities:
View Design: Focuses on creating interfaces for end users.
Conceptual Design: Involves analyzing the organization to identify key entities and
their relationships, split into:
Entity Analysis: Identifies entities, their attributes, and relationships.
Functional Analysis: Identifies the main functions of the organization and
how they relate to entities.
Integration of Designs:
Relationship: Conceptual design is an integration of different user views.
Future-Proofing: The conceptual model should support both current and future
applications.
View Integration: Ensures all user requirements are captured in the conceptual
schema.
Data and Application Specification:
User Specifications: Define data entities, determine applications to run on the
database, and provide statistical information like usage frequency and data volume.
Outcome: This process results in the global conceptual schema, which is crucial for
both centralized and distributed database design.
Focus on Centralized Design:
Up to this point, the design process mirrors that of centralized databases, without
yet considering the complexities of a distributed environment.
Fragmentation Alternatives
There are clearly two alternatives for dividing a relation into smaller ones:
1) dividing it horizontally, so that each fragment holds a subset of the tuples;
2) dividing it vertically, so that each fragment holds a subset of the attributes.
Fragmentation can also be nested (fragments of fragments); if the nestings are of
different types, one gets hybrid fragmentation.
Degree of Fragmentation
Fragmentation Levels:
Fragmentation can range from no fragmentation to very detailed fragmentation, like
dividing data into individual tuples (horizontal) or individual attributes (vertical).
The right level of fragmentation is a balance between extremes, depending on the
applications using the database.
Correctness Rules for Fragmentation:
Completeness: Every piece of data in the original relation should be found in one or more
fragments after fragmentation.
Reconstruction: It should be possible to rebuild the original relation from its fragments
using a relational operator.
Disjointness: Fragments should not overlap. For horizontal fragmentation, data should not
be repeated across fragments. In vertical fragmentation, only non-primary key attributes
should be disjoint.
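A small Python sketch, using an invented relation, of how the three rules can be checked for a horizontal fragmentation (reconstruction uses union, as the rule requires for horizontal fragments):

def check_horizontal_fragmentation(relation, fragments):
    """Check completeness, reconstruction (by union), and disjointness."""
    relation = [tuple(sorted(t.items())) for t in relation]
    frags = [[tuple(sorted(t.items())) for t in f] for f in fragments]

    union = [t for f in frags for t in f]
    completeness   = all(t in union for t in relation)
    reconstruction = sorted(union) == sorted(relation)        # union rebuilds R
    disjointness   = len(union) == len(set(union))            # no tuple repeated
    return completeness, reconstruction, disjointness

R  = [{"eno": 1, "sal": 25000}, {"eno": 2, "sal": 40000}]
F1 = [{"eno": 1, "sal": 25000}]          # SAL <= 30000
F2 = [{"eno": 2, "sal": 40000}]          # SAL >  30000
print(check_horizontal_fragmentation(R, [F1, F2]))   # (True, True, True)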
Allocation Alternatives:
After proper fragmentation, fragments need to be allocated to different sites in a network.
Replication: Data can be replicated for reliability and efficiency, especially for read-only
queries, but updating replicated data can be challenging.
Non-replicated Database: Contains only one copy of each fragment.
Fully Replicated Database: The entire database is replicated at each site.
Partially Replicated Database: Some fragments are replicated at multiple sites.
PHORIZONTAL Algorithm
Input:
R: The relation (table) that needs to be fragmented.
Pr: A set of simple predicates (conditions) that could be used to fragment the
relation.
Step 1: Apply COM_MIN:
Use the COM_MIN algorithm on the relation R with the set of predicates Pr.
The output is a new set of predicates Pr' that are minimal and complete.
Step 2: Identify Minterm Predicates:
Determine a set of minterm predicates M. These are combinations of the predicates
from Pr' that can be used to fragment the relation.
Step 3: Check for Contradictions:
Identify any contradictions among the minterm predicates in M.
If any predicate contradicts another, remove it from M.
Final Output:
The algorithm produces M, the final set of minterm fragments, which effectively
fragments the relation R without contradictions.
This algorithm is used to systematically divide a relation into smaller, non-conflicting
fragments based on given conditions.
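A simplified Python sketch of these steps; it assumes COM_MIN has already produced the minimal complete set Pr', and it decides contradictions with a caller-supplied satisfiability test (a full implementation would need proper predicate reasoning). All predicate names and sample values are invented.

from itertools import product

def phorizontal(Pr_prime, satisfiable):
    """Given the minimal complete predicate set Pr' and a function that decides
    whether a conjunction of (predicate, truth-value) pairs can be satisfied,
    return the non-contradictory minterm predicates M."""
    minterms = []
    # Each minterm takes every simple predicate either positively or negated.
    for signs in product([True, False], repeat=len(Pr_prime)):
        minterm = list(zip(Pr_prime, signs))
        if satisfiable(minterm):          # drop contradictory minterms
            minterms.append(minterm)
    return minterms

# Example with two predicates over SAL; satisfiability is decided by testing
# a few sample values, which is enough for this illustration.
Pr_prime = ["SAL<=30000", "SAL>30000"]
preds = {"SAL<=30000": lambda s: s <= 30000, "SAL>30000": lambda s: s > 30000}

def satisfiable(minterm):
    samples = [10000, 30000, 50000]
    return any(all(preds[p](s) == sign for p, sign in minterm) for s in samples)

for m in phorizontal(Pr_prime, satisfiable):
    print(m)   # e.g. [('SAL<=30000', True), ('SAL>30000', False)], ...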
Vertical Fragmentation
Definition:
Divides a table into smaller fragments based on columns, with each fragment
containing some columns and the primary key.
Objective:
To create fragments that minimize the execution time of applications that access the
database.
Challenges:
More complex than horizontal fragmentation due to more possible fragment
combinations.
Heuristic Approaches:
Grouping: Starts with individual attributes and combines them based on criteria.
Splitting: Starts with the full relation and splits it based on application access
patterns.
Key Consideration:
Replicate primary keys across fragments to allow reconstruction of the original
table.
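A small example, with an invented EMP relation, showing the primary key repeated in both vertical fragments so that a join can rebuild the original relation:

EMP = [{"eno": 1, "name": "A", "sal": 25000, "title": "Eng"},
       {"eno": 2, "name": "B", "sal": 40000, "title": "Mgr"}]

def project(relation, attrs):
    return [{a: t[a] for a in attrs} for t in relation]

EMP1 = project(EMP, ["eno", "name", "title"])   # the key ENO appears in both fragments
EMP2 = project(EMP, ["eno", "sal"])

def join_on_key(f1, f2, key):
    """Rebuild the original relation by joining the fragments on the repeated key."""
    index = {t[key]: t for t in f2}
    return [{**t, **index[t[key]]} for t in f1 if t[key] in index]

print(join_on_key(EMP1, EMP2, "eno") == EMP)    # True: reconstruction succeeds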
Clustering Algorithm
Grouping Attributes:
Attributes with high affinity are grouped together using a bond energy algorithm.
Goal of the Algorithm:
Maximize the similarity of grouped attributes.
Keep the computation time reasonable.
Steps:
Initialize with some attributes.
Iteratively add more attributes to maximize the grouping benefit.
Adjust rows and columns in the matrix to align similar attributes together.
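A simplified sketch of the bond energy idea, using an invented affinity matrix; the full algorithm also reorders rows and handles ties, which are omitted here. Each remaining attribute column is placed at the position that contributes most to the global affinity measure, so columns with high mutual affinity end up adjacent.

def bond(AA, x, y):
    """Bond between columns x and y of the attribute affinity matrix AA."""
    return sum(AA[z][x] * AA[z][y] for z in range(len(AA)))

def contribution(AA, i, k, j):
    """Net contribution of placing column k between columns i and j;
    out-of-range indexes stand for the imaginary zero columns at the edges."""
    def b(x, y):
        if x < 0 or y < 0 or x >= len(AA) or y >= len(AA):
            return 0
        return bond(AA, x, y)
    return 2 * b(i, k) + 2 * b(k, j) - 2 * b(i, j)

def cluster_order(AA):
    """Simplified bond energy ordering: place each remaining column at the
    position with the largest contribution to the global affinity measure."""
    n = len(AA)
    order = list(range(min(n, 2)))
    for k in range(2, n):
        best_pos, best_cont = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else -1
            right = order[pos] if pos < len(order) else n
            c = contribution(AA, left, k, right)
            if c > best_cont:
                best_pos, best_cont = pos, c
        order.insert(best_pos, k)
    return order

# Affinity matrix for four attributes (values invented for the example).
AA = [[45,  0, 45,  0],
      [ 0, 80,  5, 75],
      [45,  5, 53,  3],
      [ 0, 75,  3, 78]]
print(cluster_order(AA))   # [0, 2, 1, 3]: columns 0 and 2 end up adjacent, as do 1 and 3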
Partitioning Algorithm:
Objective:
Split the attributes into groups (fragments) that are accessed mainly by specific
applications.
Optimization:
Find the best split point in the matrix that maximizes access efficiency and minimizes
overlap.
Handling Multiple Splits:
For complex datasets, multiple splits may be required, but this can increase computational
complexity.
Algorithm's Outcome:
The result is a set of fragments where attributes are grouped based on how applications use
them.
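A simplified sketch of the split-point search over the clustered attribute ordering (the access frequencies are invented, and only single split points are tried). One standard objective in vertical partitioning is to maximize z = CTQ*CBQ - COQ^2, where CTQ and CBQ count accesses by applications that touch only the top or only the bottom fragment, and COQ counts accesses by applications that touch both.

def best_split(order, app_accesses):
    """Find the split point in the clustered ordering that maximizes z.
    app_accesses: list of (frequency, set_of_attribute_indexes) per application."""
    best = None
    for cut in range(1, len(order)):
        top, bottom = set(order[:cut]), set(order[cut:])
        ctq = sum(f for f, attrs in app_accesses if attrs <= top)
        cbq = sum(f for f, attrs in app_accesses if attrs <= bottom)
        coq = sum(f for f, attrs in app_accesses
                  if attrs & top and attrs & bottom)
        z = ctq * cbq - coq ** 2
        if best is None or z > best[0]:
            best = (z, top, bottom)
    return best

# Using the ordering from the clustering step and invented access patterns:
order = [0, 2, 1, 3]
apps = [(45, {0, 2}), (75, {1, 3}), (5, {2, 1})]   # (frequency, attributes used)
print(best_split(order, apps))   # splits into {0, 2} and {1, 3}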
Correctness:
Completeness:
Ensure that all attributes are assigned to at least one fragment.
Reconstruction:
The original table can be rebuilt by joining the fragments together.
Disjointness:
Fragments should not overlap, except for necessary attributes like keys.
Hybrid Fragmentation
Combination of Fragmentations: You can combine both horizontal and vertical
fragmentations.
Practical Limits: Usually, you don't need more than 2 levels of fragmentation because it
gets costly and complex.
Reconstruction: To reconstruct the original data, you start from the smallest fragments and
join or union them.
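A tiny sketch, with an invented relation, of reconstructing a hybrid fragmentation: the vertical leaf fragments are joined back first, and the resulting horizontal pieces are then unioned.

EMP = [{"eno": 1, "name": "A", "sal": 25000},
       {"eno": 2, "name": "B", "sal": 40000}]

def project(rel, attrs):
    return [{a: t[a] for a in attrs} for t in rel]

# Level 1: horizontal split on salary; level 2: vertical split of each piece.
low  = [t for t in EMP if t["sal"] <= 30000]
high = [t for t in EMP if t["sal"] > 30000]
frags = [(project(p, ["eno", "name"]), project(p, ["eno", "sal"])) for p in (low, high)]

def join(f1, f2, key="eno"):
    idx = {t[key]: t for t in f2}
    return [{**t, **idx[t[key]]} for t in f1]

# Reconstruction: join the smallest (vertical) fragments first, then union.
rebuilt = [t for v1, v2 in frags for t in join(v1, v2)]
print(rebuilt == EMP)   # True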