
UNIT - 1

INTRODUCTION TO DISTRIBUTED DATABASE SYSTEMS

Course Objectives
 The purpose of the course is to enrich students' prior knowledge of database
systems and expose the need for distributed database technology to confront the
deficiencies of centralized database systems.
 Introduce basic principles and implementation techniques
of distributed database systems.
 Equip students with principles and knowledge of parallel and object-oriented
databases.
 Topics include distributed DBMS architecture and design; query processing and
optimization; distributed transaction management and reliability; parallel and
object database management systems.

Course Outcomes:
 Understand theoretical and practical aspects of distributed database systems.
 Study and identify various issues related to the development of distributed database
systems.
 Understand the design aspects of object-oriented database systems and related
developments.
 Understand how to generate and optimize queries.
 Understand and implement principles of object-oriented databases.

UNIT -1: Introduction; Distributed Data Processing, Distributed Database System,


Promises of DDBSs, Problem areas.
Distributed DBMS Architecture: Architectural Models for Distributed DBMS, DDBMS
Architecture. Distributed Database Design: Alternative Design Strategies, Distribution
Design issues, Fragmentation, Allocation.
Topic 1: Introduction

Distributed database system (DDBS)


Distributed database system (DDBS) technology is the union of what appear to be
two diametrically opposed approaches to data processing:
 Database system and
 Computer network technologies.
A database system is a way to store and manage data so that it can be easily
accessed and updated. Instead of storing data in separate files, like in older systems,
a database keeps all the data together and manages it to avoid duplication.

Distributed Database System (DDBS)

 A Distributed Database System (DDBS) is like a regular database but spread


out across different computers connected by a network.
 This means that the data is not all stored in one place but is distributed across
several locations.
 Even though the data is spread out, it is managed in a way that makes it look
like a single database to the users.

Necessity of Distributed Database System


 Integration without Centralization: The goal is to combine data from different
places without needing to put everything in one central location.
 This allows for better use of resources, flexibility, and resilience (meaning the
system can keep working even if part of it fails).

Topic 2: Distributed Data Processing

Distributed Processing involves using multiple computers to work together on a


task. These computers are connected by a network, and each one can handle part of the
work. This approach can be more efficient than using a single computer.

Different Types of Task Distribution:


Processing Logic: The rules or procedures for processing data can be spread across multiple
computers.
Function Distribution: Different functions of the system can be assigned to different machines.
Data Distribution: The actual data can be stored in different locations.
Control Distribution: Decision-making processes can be spread out, so no single
computer is in charge of everything.

Necessity of Distributed Data Processing

 Distributed processing better corresponds to the organizational structure of
today's widely distributed enterprises; such a system is also more reliable and more
responsive.
 Web-based applications, electronic commerce over the Internet,
multimedia applications such as news-on-demand or medical imaging, and
manufacturing control systems are all examples of such applications.
 The fundamental reason behind distributed processing is to be better able to cope
with the large-scale data management problems that we face today, by using a
variation of the well-known divide-and-conquer rule.
Topic 3: Distributed Database System

 A distributed database is a collection of multiple, logically interrelated databases
distributed over a computer network.
 A distributed database management system (distributed DBMS) is then defined
as the software system that permits the management of the distributed database
and makes the distribution transparent to the users.
 The communication between them is done over a network instead of through
shared memory or shared disk (as would be the case with multiprocessor
systems), with the network as the only shared resource.

Components of a Distributed Database System

 Database systems that run over multiprocessor systems are called parallel database
systems.
Distributed Databases:
 These are groups of databases that are spread across different locations but are still
connected.
 They work together as if they were a single database.

Data Delivery Alternatives


Three orthogonal dimensions:
o Delivery modes,
o Frequency and
o Communication methods

Delivery of Data:
Pull-only: Data is only sent when a user or system asks for it.
Push-only: Data is automatically sent without waiting for a request.
Hybrid: A mix of both approaches, where data can be both requested and automatically
sent.
Communication Methods:
Unicast: Data is sent directly from one computer to another.
One-to-Many (Multicast/Broadcast): Data is sent from one computer to multiple
computers at the same time.
Data Delivery Frequency Measurements

 Periodic:
In periodic delivery, data are sent from the server to clients at regular intervals.
The intervals can be defined by system default or by clients using their profiles.
 Conditional:
Data are sent from servers whenever certain conditions installed by clients in their
profiles are satisfied.
 Ad-hoc or irregular
Performed mostly in a pure pull-based system. Data are pulled from servers to
clients in an ad-hoc fashion whenever clients request it.
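
As a rough illustration (the function names and the client "profile" parameters below are assumed, not part of any real delivery API), periodic delivery pushes data on a timer taken from a client profile, while conditional delivery pushes only when a client-installed predicate holds:

    import time

    def periodic_push(get_data, send, interval_s, rounds=3):
        # Periodic: push at regular intervals defined in a client profile.
        for _ in range(rounds):
            send(get_data())
            time.sleep(interval_s)

    def conditional_push(get_data, send, condition):
        # Conditional: push only when the client's installed predicate holds.
        data = get_data()
        if condition(data):
            send(data)

    periodic_push(lambda: "report", print, 0.01)
    conditional_push(lambda: 42, print, lambda d: d > 40)

Pull-only delivery, by contrast, would run neither loop; the client itself would call get_data() on demand.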
Benefits of Distributed Database Systems
 Transparency: Users don’t need to know where the data is physically located;
the system makes it look like all the data is in one place.
 Reliability: If one part of the system fails, the rest can keep working, making the
overall system more reliable.
 Performance: By spreading out the work across multiple computers, tasks can
be completed faster.
 Scalability: The system can easily grow by adding more computers or storage
locations as needed.

In a fully distributed environment, data is spread across multiple computers,


each of which can manage its own part of the data. This setup is more flexible
and scalable.
Topic 4: Promises of DDBSs
• transparent management of distributed and replicated data,
• reliable access to data through distributed transactions,
• improved performance, and
• easier system expansion.

1. Transparent Management of Distributed and Replicated Data


 Transparency in DBMS: Separation of higher-level semantics from implementation
details, allowing users to focus on their tasks without worrying about system
complexities.
 Centralized vs. Distributed DBMS: In a centralized DBMS, data is stored in a single
location, whereas in a distributed DBMS, data is localized, fragmented, and possibly
replicated across different sites.

 Example: An engineering firm with multiple offices (Boston, Waterloo, Paris,


San Francisco) maintains employee, project, salary, and assignment data. In a
centralized system, a simple SQL query retrieves specific information. In a
distributed system, data is localized and fragmented for efficiency.
 Fragmentation: The process of dividing the database into partitions and storing
each partition at different sites.
 Replication: Duplicating data across multiple sites to improve performance and
reliability.
 Fully Transparent Access: Users can query the database without needing to
understand how data is fragmented, replicated, or where it is located, as the
system resolves these issues internally.
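
To make this concrete, here is a minimal Python sketch; the EMP/ASG schema, the sample rows, and SQLite standing in for the DBMS client are all assumptions for illustration. The user issues one query as if against a single local database; a fully transparent distributed DBMS would internally decompose it into subqueries on the relevant fragments:

    import sqlite3

    # SQLite stands in for the DDBMS client; schema and rows are assumed.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE emp (eno INTEGER, ename TEXT);
        CREATE TABLE asg (eno INTEGER, resp TEXT, dur INTEGER);
        INSERT INTO emp VALUES (1, 'Ana'), (2, 'Marc');
        INSERT INTO asg VALUES (1, 'Manager', 24), (2, 'Analyst', 6);
    """)
    # The user states WHAT they want, never WHERE the data lives.
    rows = conn.execute(
        "SELECT e.ename, a.resp FROM emp e JOIN asg a ON e.eno = a.eno "
        "WHERE a.dur > 12"
    ).fetchall()
    print(rows)  # [('Ana', 'Manager')]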
Data Independence
 Data independence is a key aspect of transparency in a Database Management
System (DBMS). It means that user applications remain unaffected by changes in
how data is defined and organized. This is especially important in a centralized
DBMS.
 Data definition happens at two levels:
 Logical Level (Schema Definition): This defines the logical structure of the
data, such as tables and relationships.
 Physical Level (Physical Data Description): This defines how the data is stored
on the hardware.

Two types of data independence:

 Logical Data Independence: This means that changes to the logical structure or
schema of the database do not affect user applications.
 Physical Data Independence: This means that changes in how data is physically
stored do not affect user applications.
In essence, when a user application is created, it should not
need to be changed just because the way data is organized or stored changes. This
separation allows the system to make adjustments for better performance without
disrupting the user's work.
Network Transparency
 In a centralized database system, the main resource to manage is the data itself.
 However, in a distributed database system, the network is an additional resource
that needs to be managed, and users should be shielded from its complexities.
 The goal is to make the distributed database feel as seamless as a centralized one,
a concept known as network transparency or distribution transparency.
 Network Transparency: Users should not have to worry about the network's
operational details or even be aware of its existence when using a distributed database.
 Service Uniformity: It is important to have a consistent way to access services,
regardless of whether the database is centralized or distributed.
 Distribution Transparency: Users should not need to know where the data is located.
The system handles data location details.
 Location Transparency: The command to perform a task should work regardless of
where the data is stored or on which system the operation is executed.
 Naming Transparency: Each object in the database should have a unique name, so
users don't need to include the location in the object's name. Without naming
transparency, users would need to specify the location as part of the object's name.
Replication Transparency
 Replication of data in a distributed database involves storing copies of the same
data on different machines across a network. This has benefits but also introduces
complexities. The key points are:
 Reasons for Replication: Data is often replicated to improve performance,
reliability, and availability. For example, placing commonly accessed data on
multiple machines can make access faster and ensure data is available even if one
machine fails.
 User Perspective: Ideally, users should not need to know about the existence of
multiple copies of the data. They should interact with the database as if there is
only one copy, letting the system manage the replication.
 System Perspective: While hiding replication from users makes their experience
simpler, it complicates the system’s management. If users are responsible for
specifying actions on multiple copies, it simplifies transaction management but
reduces flexibility and data independence.
 Replication Transparency: This is the idea that users should not have to deal
with the existence of data copies. The system should handle it. This is separate
from network transparency, which deals with where these copies are stored.

Fragmentation Transparency
 In a distributed database system, fragmentation transparency is an important
concept. It involves dividing a database into smaller parts, known as fragments,
which are treated as separate database objects.
 Purpose of Fragmentation:
 Performance: Smaller fragments can be
processed more efficiently.
 Availability and Reliability: Data can be more reliably accessed if it's
fragmented.
 Reducing Replication Overhead: Only subsets of data are replicated,
requiring less storage and management.
 Types of Fragmentation:
 Horizontal Fragmentation: The relation is divided into sub-relations, each
containing a subset of rows (tuples) of the original relation.
 Vertical Fragmentation: The relation is divided into sub-relations, each
containing a subset of columns (attributes) of the original relation.
 Query Processing with Fragments:
 When a user submits a query for the entire relation, the system must
translate it into multiple queries for the relevant fragments.
 This process ensures that even though the user queries the whole relation,
the system efficiently executes the query on the smaller fragments.
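
A minimal sketch of this query rewrite, assuming EMP is horizontally fragmented by location into two fragments (the fragment names and rows are invented for the example; SQLite again stands in for the system):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE emp_boston (eno INTEGER, ename TEXT, loc TEXT);
        CREATE TABLE emp_paris  (eno INTEGER, ename TEXT, loc TEXT);
        INSERT INTO emp_boston VALUES (1, 'Ana', 'Boston');
        INSERT INTO emp_paris  VALUES (2, 'Marc', 'Paris');
    """)
    # The user writes: SELECT ename FROM emp;
    # with fragmentation transparency the system executes the rewrite:
    rows = conn.execute(
        "SELECT ename FROM emp_boston UNION ALL SELECT ename FROM emp_paris"
    ).fetchall()
    print(rows)  # [('Ana',), ('Marc',)]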
Necessity of Transparency
 Different forms of transparency in distributed computing, focusing on how to
balance ease of use with the challenges and costs of providing full transparency.
Here are the key points:
 Full Transparency vs. Complexity:
 Full transparency makes it easier for users to access DBMS services but can
make managing distributed data more difficult.
 Some experts argue that full transparency leads to poor manageability,
modularity, and performance in distributed databases.
 Remote Procedure Call (RPC) Approach:
 Instead of full transparency, a remote procedure call mechanism allows
users to direct queries to specific DBMSs, which is a common approach in
client/server systems.

Hierarchy of Transparencies:
A visual hierarchy illustrates the different levels of transparency,
including an added "language transparency" layer, which allows high-
level access to data (e.g., through graphical user interfaces or natural
language access).
 Three Layers of Providing Transparency:
 Access Layer: Transparency can be built into the user language, where the
compiler or interpreter translates requests into operations, shielding users from
the underlying complexity.
 Operating System Layer: Some transparency is provided by the operating
system, such as handling device drivers. This can be extended to distributed
systems, though not all systems offer sufficient network transparency, and
some applications may need to bypass it for performance tuning.
 DBMS Layer: The DBMS often provides transparency by handling
translations from the operating system to the user interface. This is the most
common method, but it comes with challenges related to the interaction
between the operating system and the distributed DBMS.

2. Reliability Through Distributed Transactions


 Distributed Database Management Systems (DBMSs) are designed to improve
reliability and ensure data access even when parts of the system fail. Here are the key
points:
 Reliability through Replication:
 Distributed DBMSs have replicated components, so the failure of one site or a
communication link doesn't bring down the entire system.
 Even if some data becomes unreachable, users can still access other parts of the
database with the help of distributed transactions and protocols.
 Transactions and Consistency:
 A transaction is a sequence of database operations treated as a single, atomic
action.
 Transactions ensure that the database remains consistent even when multiple
transactions run concurrently (concurrency transparency) or when failures occur
(failure atomicity).
 If a failure happens during a transaction, the system can recover and complete
the transaction or start it over.
 Example of a Transaction:
 Updating employee salaries by 10% can be encapsulated in a transaction. If the
system fails midway, the DBMS ensures that the update completes correctly after
recovery.
 Transactions also prevent errors during concurrent operations,
like calculating average salaries while the update is in progress.
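
A hedged sketch of the salary-update transaction (SQLite stands in for the DBMS and the schema is assumed): the with-block makes the update atomic, so a failure leaves salaries untouched rather than half-updated.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (eno INTEGER, sal REAL)")
    conn.execute("INSERT INTO emp VALUES (1, 1000), (2, 2000)")
    try:
        with conn:  # atomic: commits on success, rolls back on any failure
            conn.execute("UPDATE emp SET sal = sal * 1.1")
    except sqlite3.Error:
        pass        # after rollback, salaries are unchanged, never half-updated
    print(conn.execute("SELECT sal FROM emp").fetchall())  # [(1100.0,), (2200.0,)]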
 Distributed Transactions:
 Distributed transactions execute across multiple sites (e.g., Boston, Waterloo, Paris,
and San Francisco).
 With full support for distributed transactions, user applications interact with a
single, logical view of the database, and the DBMS ensures correct execution
regardless of system failures or the need to coordinate between local databases.
 Transaction Support and Transparency:
 Supporting distributed transactions involves complex protocols for concurrency
control, reliability, and recovery, such as the two-phase commit (2PC) protocol.
 These protocols are more complicated in distributed systems than in centralized
ones and also manage replica access according to specified rules.
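
The toy sketch below captures only the voting/decision skeleton of 2PC; it is an illustration of the idea under simplifying assumptions (no decision logging, no timeouts, no site recovery), not the full protocol, and all names are invented:

    def two_phase_commit(participants):
        # Phase 1 (voting): every site must vote "yes" for a global commit.
        votes = [p.prepare() for p in participants]
        decision = "commit" if all(votes) else "abort"
        # Phase 2 (decision): the coordinator's decision is applied at every site.
        for p in participants:
            p.commit() if decision == "commit" else p.abort()
        return decision

    class Site:
        def __init__(self, ok=True):
            self.ok = ok
        def prepare(self):
            return self.ok          # "yes" vote iff the site can commit locally
        def commit(self):
            print("site committed")
        def abort(self):
            print("site aborted")

    print(two_phase_commit([Site(), Site(ok=False)]))  # one "no" vote -> abort everywhere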

3. Improved Performance

 The performance of distributed Database Management Systems (DBMSs) is


generally considered better due to the following reasons:
 Data Localization:
 Distributed DBMSs store data close to where it is needed (data
localization).
 This reduces the load on any single site, decreasing competition for CPU
and I/O resources.
 It also cuts down on delays in accessing data over long distances, like in
satellite-based systems.
 Optimized Fragmentation:
 To maximize benefits, the database must be properly fragmented
and distributed.
 This minimizes the need for remote communication, which can be slow and
costly.
 Centralization vs. Distribution:
 Some argue that with high-speed networks, it's better to centralize data and access
it remotely.
 However, this view overlooks the issue of latency—how long it takes data to
travel—especially in distributed environments like satellite links.
 Latency cannot be eliminated and can cause unacceptable delays in some
applications.
 Parallelism in Distributed Systems:
 Distributed systems can run multiple queries at the same time (inter-query
parallelism) or break a single query into parts to run simultaneously (intra-query
parallelism).
 If the system only handled read-only queries, it would make sense to replicate the
database as much as possible.
 However, because databases also need to handle updates, complex protocols for
concurrency control and committing changes are necessary.
4. Easier System Expansion

 Easier Expansion:
• In a distributed system, it's easier to increase database size by
adding more processing and storage power to the network.
• Major system overhauls are rarely needed, though the increase in
power may not be perfectly linear due to distribution overhead.
 Cost-Effective:
• Building a distributed system with multiple smaller computers is
often cheaper than investing in one large, powerful machine.
Topic 5: Problem areas
 Problem 1:
Data Replication: In a distributed database system, data may be stored in different
locations on a computer network. Not every location needs to have the entire
database, but the data is spread across more than one site for reliability and
efficiency.
Data Access and Updates: The system is responsible for selecting the correct copy
of the data when retrieving information. It must also ensure that any updates to
the data are reflected on all copies of that data across different sites.
 Problem 2:
If a site or communication link fails during an update, the system needs to ensure
that the update is applied to the affected sites as soon as they are back online.

 Problem 3:
Because each site doesn't instantly know what actions are happening at other
sites, it is much more difficult to synchronize transactions across multiple sites
compared to a centralized system.
Issues:

 Distributed Database Design: Data can be partitioned or replicated across


multiple sites for efficiency and reliability.
 Distributed Directory Management: Directories track where data is stored, and
can be centralized or distributed.
 Distributed Query Processing: Efficient strategies are needed to run queries
across the distributed network while minimizing costs.
 Distributed Concurrency Control: Synchronization methods ensure database
operations are consistent and avoid conflicts.
 Distributed Deadlock Management: Mechanisms are in place to prevent or
resolve conflicts when multiple processes compete for data.
 Reliability of Distributed DBMS: Systems maintain consistency and recover
from failures, even when parts are inaccessible.
 Replication: Ensures data copies are consistent across sites, using either
immediate or delayed update protocols.
 Interconnection of Problems: Design decisions in one area affect others,
with interconnected issues in distribution, query processing, and reliability.
 Additional Issues: New challenges arise with modern systems, affecting the
relationship between distributed and parallel databases.

Topic 6: Distributed DBMS Architecture


 In late 1972, ANSI's Computer and Information Processing Committee (X3)
created a Study Group on Database Management Systems (DBMS) under its
Standards Planning and Requirements Committee (SPARC).
 The group aimed to assess the feasibility of establishing standards for DBMS and
to identify which aspects should be standardized.
 The study group released an interim report in 1975 and a final report in 1977,
which introduced the "ANSI/SPARC architecture," officially called the
"ANSI/X3/SPARC DBMS Framework."

Three Levels of Data Views:


 External View: This is the user's perspective, showing how they interact with
the database. It includes individual user views and the relationships between
the data they access.
 Conceptual View: This represents an abstract, enterprise-level view of the
data, independent of individual user needs or physical storage constraints. It
describes the data and their relationships as they exist in the real world.
 Internal View: This is the system's perspective, dealing with the physical
storage and organization of data on storage devices.

• Data Independence: The framework enables data independence by separating the


external schema (user views) from the conceptual schema (enterprise view) and the
conceptual schema from the internal schema (physical storage). This separation
allows for logical data independence and physical data independence.
• Mappings Between Levels: Transformations between these levels are achieved
through mappings that define how data at one level can be derived from data at
another level.

TOPIC 7: Architectural Models for Distributed DBMS:

 Distributed DBMS architectures can be classified with respect to three dimensions:


(1) the autonomy of local systems,
(2) their distribution, and
(3) their heterogeneity.

DBMS Implementation Alternatives:


Autonomy:
 Autonomy here refers to how much control individual database systems (DBMSs) have,
not how data is handled. It means how independently each DBMS can function. Several
factors influence this, like whether the systems share information, can run transactions on
their own, or can be modified.
 Autonomous systems should meet these requirements:
1. Local operations of individual DBMSs shouldn't be affected by being part of a larger
distributed system.
2. The way individual DBMSs handle and optimize queries should stay the same, even
when processing global queries across multiple databases.
3. The system should remain consistent even if a DBMS joins or leaves the distributed
system.
 There are three main aspects of autonomy:
1. Design Autonomy: Each DBMS can choose its own data models and methods for
managing transactions.
2. Communication Autonomy: Each DBMS can decide what information to share with
other DBMSs.
3. Execution Autonomy: Each DBMS can execute its transactions in its own way.

Three types of integration for autonomous systems are highlighted:

1. Tight Integration: All databases are logically combined into one, giving users the
impression of a single database, even if the data is in multiple databases. A central data
manager controls user requests, even if they involve multiple databases.
2. Semi-Autonomous Systems: DBMSs usually operate independently but collaborate to
share their data. Each DBMS decides what data to share. They aren't fully autonomous
because they need some modifications to share information with others.
3. Total Isolation: DBMSs operate completely independently, unaware of other DBMSs.
Processing transactions that involve multiple databases is difficult because there's no
global control over them.
 These are just three common types of autonomous systems, but there are other
possibilities as well.

Distribution:
There are two main ways that DBMSs are distributed:
 Client/Server Distribution:
 In this setup, servers handle data management, while clients take care of the
application environment, including the user interface. Both clients and servers share
the communication tasks. This is a balanced way of distributing tasks between
machines, with some setups being more distributed than others. The key point is
that in a client/server model, machines are categorized as either "clients" or
"servers," and they have different roles.
 Peer-to-Peer Distribution:
 In this system, there is no difference between clients and servers. Every machine
has the full capabilities of a DBMS and can work with other machines to run
queries and transactions. Early distributed database systems were mostly based on
peer-to-peer architecture (such systems are also called fully distributed systems), although many
of the techniques also apply to client/server systems.
 Heterogeneity:
 Heterogeneity in distributed systems means that there can be differences in various
aspects, like hardware, networking protocols, and data management methods. The most
important differences in this context are related to data models, query languages, and
transaction management protocols.
 Data Models: Different tools for representing data can cause heterogeneity because
each data model has its own strengths and limitations.
 Query Languages: Differences in query languages can arise in several ways. For
example, some systems might access data one record at a time (as in some object-oriented
systems), while others access data in sets (as in relational systems). Even
within the same data model, such as SQL for relational databases, different vendors
may have their own versions of the language, which can behave slightly differently.

TOPIC 8: Architectural Alternatives/ DDBMS Architecture:

 Three Key Factors: The distribution of databases, their potential differences


(heterogeneity), and their level of independence (autonomy) are separate issues.

 Three Focused Architectures:


 Client/Server Distributed DBMS.
 Peer-to-Peer Distributed DBMS.
 Peer-to-Peer Distributed, Heterogeneous Multidatabase System.

Client/Server Systems


 Client/Server DBMS Overview: Introduced in the 1990s, client/server DBMS architecture
divides tasks between client and server machines, making it easier to manage complex
systems.
 Function Distribution:
 Clients handle the application environment and user interface.
 Servers manage data processing, including query optimization, transaction management,
and storage.

Types of Client/Server Architectures:


 Single Server: Multiple clients connect to one server, similar to centralized databases but
with differences in transaction execution and cache management.
 Multiple Servers: Clients may connect to multiple servers directly or through a "home
server." This can lead to "heavy client" systems with more client responsibilities or "light
client" systems with more server-side processing.
 Comparison to Peer-to-Peer: Both client/server and peer-to-peer systems offer a unified
view of the data, but they differ in their architectural approach.
 Three-Tier Architecture: Extends client/server by adding specialized servers:
 Client Servers: Handle user interface tasks (e.g., web servers).
 Application Servers: Run application programs.
 Database Servers: Manage data storage and processing.
 Advantages of Database Servers:
 Improved data reliability and availability through specialized techniques.
 Enhanced performance with tight integration and advanced hardware.
 Challenges: The added communication between application and database servers can
introduce overhead, but this can be managed with high-level server interfaces for complex
queries.
 Expansion with Multiple Servers: The system can include multiple application and
database servers, where each application server is dedicated to specific tasks, and database
servers work together.

Client/Server Reference Architecture

Distributed Database Servers

Peer-to-Peer Systems:
 Peer-to-Peer Evolution: The concept of peer-to-peer (P2P) systems has evolved. Unlike
early systems with a few sites, modern P2P systems handle thousands of sites with diverse
and independent systems.
 Classical vs. Modern P2P: The book initially focuses on the classical peer-to-peer
architecture (where all sites have the same functions) and later addresses modern P2P
database issues.
 Data Organization in P2P Systems:
 Local Internal Schema (LIS): Each site may have different physical data
organization.
 Global Conceptual Schema (GCS): Represents the overall logical structure of data
across all sites.
 Local Conceptual Schema (LCS): Describes the logical organization of data at
each site.
 Transparency and Independence: The architecture supports data independence,
location, and replication transparency, meaning users can query data without worrying
about its physical location.
 Components of Distributed DBMS:
 User Processor: Handles user interaction, query processing, and coordination of
distributed execution.
 Data Processor: Manages local query optimization, data consistency, and physical
data access.
 Function Placement in P2P: Unlike client/server systems, both user and data processors
are typically found on each machine in a P2P system, but there are suggestions to have
"query-only" sites with limited functionality.
 Client/Server Comparison: In client/server systems, the client handles user interaction,
while the server manages data processing. Multiple server setups can have more complex
module distributions.

Distributed Database Reference Architecture

Components of a Distributed DBMS


Multidatabase System Architecture:
 Multidatabase Systems (MDBS): MDBS refers to systems where different databases are
fully independent and don’t need to cooperate. They might not even be aware of each
other. The focus is on distributed MDBSs, which involve multiple independent databases.
 Global Conceptual Schema (GCS): In distributed DBMSs, the GCS represents the entire
database. In MDBSs, the GCS only includes the parts of local databases that are shared.
MDBSs allow each database to decide what data to share, leading to a subset of the
combined databases.
 Schema Design: Designing the GCS in MDBSs is usually a bottom-up process (starting
from local schemas), while in distributed DBMSs, it’s a top-down process (starting with
the global schema).
 Homogeneous vs. Heterogeneous Systems:
 Homogeneous: All databases use the same data model and language.
 Heterogeneous: Different databases use different models and languages, requiring
special approaches to integrate them.
 Unilingual vs. Multilingual Systems:
 Unilingual: Users access the global database using a different language/model than
they use for local databases.
 Multilingual: Users access both local and global databases using the same
language/model as their local DBMS.
 Mediator/Wrapper Architecture:
 Mediator: A software module that integrates data from different databases to
provide information to users.
 Wrapper: Maps the data from the source database to the mediator’s view, helping
to handle differences in data models.
 Middleware Layer: The mediators and wrappers together form a middleware layer that
provides services on top of the source databases, enabling users to access and query data
across multiple databases efficiently.
TOPIC 9: Distributed Database Design

 Design Considerations: When designing a distributed computer system, decisions must


be made about where to place data and programs across different sites in a network, and
sometimes how to design the network itself.
 Distributed DBMS Focus: For distributed database management systems (DBMSs), this
involves both distributing the DBMS software and the application programs that use it.
 Architectural Models: Different architectural models address how applications are
distributed, but this section focuses specifically on data distribution.
 Three Dimensions of Distributed Systems:
 Level of Sharing: Ranges from no sharing (each application and its data are
isolated), to data sharing (programs are replicated but not data), to data-plus-
program sharing (both data and programs can be shared across sites).
 Access Patterns: Refers to how data and programs are accessed and used across the
network.
 Knowledge of Access Patterns: Involves understanding how predictable the access
patterns are.
 Types of Sharing:
 No Sharing: Applications and data are isolated to one site without communication
with others.
 Data Sharing: Programs are copied across all sites, but data files move across the
network as needed.
 Data-Plus-Program Sharing: Both data and programs can be shared between sites,
allowing more complex interactions.
 Homogeneous vs. Heterogeneous Systems:
 Homogeneous Systems: Easier to manage because the same program can run on
different hardware.
 Heterogeneous Systems: More complex due to differences in hardware and
operating systems, making it difficult to run the same program across different
platforms, though data can still be moved around relatively easily.
Access Pattern Behavior in Distributed Systems:
 Access Patterns:
 Static vs. Dynamic: Access patterns can be static (unchanging) or dynamic
(changing over time). Static patterns are easier to manage, but most real-life
applications are dynamic. The key is understanding how dynamic the system is.
 Relation to Database Design: The dynamic nature of access patterns impacts
distributed database design and query processing.
 Knowledge of Access Patterns:
 No Information: Designing a system without any knowledge of access patterns is
almost impossible.
 Complete Information: Designers have full understanding of access patterns,
making it easier to predict and plan.
 Partial Information: Designers have some knowledge, but there are deviations
from expected patterns.
 Designing Distributed Databases:
 Top-Down Approach: Best for tightly integrated, homogeneous systems where the
design is planned from a broad overview down to details.
 Bottom-Up Approach: Suited for multidatabase systems where design starts with
existing databases and integrates them.
 Challenges in Distributed Environments:
 New problems arise in distributed systems, especially in environments where data
and programs are shared, which are not present in centralized systems.

TOPIC 10: Alternative Design Strategies

 Requirements Analysis:
 Purpose: Defines the system environment and gathers the data and processing
needs of all potential users.
 Objectives: Considers performance, reliability, availability, cost-effectiveness, and
flexibility of the system.
 Parallel Design Activities:
 View Design: Focuses on creating interfaces for end users.
 Conceptual Design: Involves analyzing the organization to identify key entities and
their relationships, split into:
 Entity Analysis: Identifies entities, their attributes, and relationships.
 Functional Analysis: Identifies the main functions of the organization and
how they relate to entities.

Integration of Designs:
 Relationship: Conceptual design is an integration of different user views.
 Future-Proofing: The conceptual model should support both current and future
applications.
 View Integration: Ensures all user requirements are captured in the conceptual
schema.
 Data and Application Specification:
 User Specifications: Define data entities, determine applications to run on the
database, and provide statistical information like usage frequency and data volume.
 Outcome: This process results in the global conceptual schema, which is crucial for
both centralized and distributed database design.
 Focus on Centralized Design:
 Up to this point, the design process mirrors that of centralized databases, without
yet considering the complexities of a distributed environment.

TOPIC 11: Distribution Design Issues:


1. Why fragment at all?
2. How should we fragment?
3. How much should we fragment?
4. Is there any way to test the correctness of decomposition?
5. How should we allocate?
6. What is the necessary information for fragmentation and allocation?

1. Reasons for Fragmentation


Fragmentation vs. Full Data Distribution:
 In distributed systems, data can be fragmented (split into smaller pieces) rather than
distributed as whole files or relations.
 Early distributed systems focused on allocating entire files to network nodes.
Why Fragment Data?
 Application Views: Applications often use subsets of relations, so it makes sense
to distribute these subsets rather than entire relations.
 Location of Applications: If applications accessing the same data are at different sites,
either the data is stored at one site (causing many remote accesses), or the data is
replicated at multiple sites (which can be inefficient and problematic for updates).
2. Benefits of Fragmentation:
 Increased Concurrency: Fragmenting data allows multiple transactions to run
simultaneously, improving system throughput.
 Parallel Query Execution: A single query can be split into subqueries that run in
parallel on different data fragments.
3. Challenges of Fragmentation:
 Conflicting Application Needs: If applications need overlapping data fragments, it can
cause performance issues, like needing to join data from different fragments.
 Integrity and Data Control: Fragmentation can complicate integrity checks, since related
data may be spread across different sites, making dependency checks more difficult.
4. Minimizing Distributed Joins:
 A key challenge is reducing the need for costly operations like joining data from multiple
fragments, which can slow down performance.

Fragmentation Alternatives
 There are two clear alternatives for dividing a relation into smaller ones:
1) dividing it horizontally, and
2) dividing it vertically.
 A relation divided horizontally yields sub-relations of its tuples; divided vertically,
it yields sub-relations of its attributes.
 Fragments may themselves be fragmented further; if the nestings are of different types,
one gets hybrid fragmentation.

Degree of Fragmentation
Fragmentation Levels:
 Fragmentation can range from no fragmentation to very detailed fragmentation, like
dividing data into individual tuples (horizontal) or individual attributes (vertical).
 The right level of fragmentation is a balance between extremes, depending on the
applications using the database.
Correctness Rules for Fragmentation:
 Completeness: Every piece of data in the original relation should be found in one or more
fragments after fragmentation.
 Reconstruction: It should be possible to rebuild the original relation from its fragments
using a relational operator.
 Disjointness: Fragments should not overlap. For horizontal fragmentation, data should not
be repeated across fragments. In vertical fragmentation, only non-primary key attributes
should be disjoint.
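
A toy check of the three rules for a horizontal fragmentation, with an assumed relation EMP(eno, ename, loc) and assumed fragmentation predicates:

    emp = [(1, "Ana", "Boston"), (2, "Marc", "Paris"), (3, "Lee", "Paris")]
    fragments = [
        [t for t in emp if t[2] == "Boston"],   # fragment: loc = 'Boston'
        [t for t in emp if t[2] == "Paris"],    # fragment: loc = 'Paris'
    ]
    union = [t for f in fragments for t in f]
    assert set(union) == set(emp)         # completeness + reconstruction (by union)
    assert len(union) == len(set(union))  # disjointness: no tuple in two fragments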
Allocation Alternatives:
 After proper fragmentation, fragments need to be allocated to different sites in a network.
 Replication: Data can be replicated for reliability and efficiency, especially for read-only
queries, but updating replicated data can be challenging.
 Non-replicated Database: Contains only one copy of each fragment.
 Fully Replicated Database: The entire database is replicated at each site.
 Partially Replicated Database: Some fragments are replicated at multiple sites.

Information Requirements for Distribution Design:


 The distribution design is complex and influenced by many factors like database structure,
application locations, access patterns, and system properties.
 Information needed for distribution design includes database, application, communication
network, and computer system information.

TOPIC 12: Fragmentation:


Horizontal Fragmentation:
 Divides data (relation) based on rows.
Two types:
 Primary (based on its own data) and
 Derived (based on another relation’s data).
 Information needed for horizontal fragmentation
includes:
 Database Info: Structure and relationships between data.
 Application Info: How data is accessed by users.

Primary Horizontal Fragmentation:


 Involves dividing data based on certain conditions (predicates).
 Fragments are defined by these conditions.
Key Concepts:
 Simple Predicates: Basic conditions used for fragmentation.
 Minterm Predicates: Combinations of simple predicates.
 Completeness: All data is equally accessible.
 Minimality: Only necessary predicates are used.
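
For illustration, given two assumed simple predicates, the minterm predicates are all conjunctions in which each simple predicate appears either asserted or negated:

    from itertools import product

    # Two assumed simple predicates on an EMP relation:
    simple = ["LOC = 'Paris'", "SAL > 3000"]
    # Each minterm fixes every simple predicate as either asserted or negated:
    for signs in product([True, False], repeat=len(simple)):
        print(" AND ".join(p if s else f"NOT({p})" for p, s in zip(simple, signs)))
    # Four minterms result, one per combination of assertions and negations.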

COM_MIN Algorithm:


Purpose: The COM_MIN algorithm is designed to find the best set of simple conditions
(predicates) to divide a database relation (table) into fragments.
 Steps:
 Start:
 Choose the first predicate from the set that can effectively divide the relation
according to a rule (Rule 1).
 Add this predicate to the new set of selected predicates (Pr′).
 Remove this predicate from the original set (Pr).
 Loop (Repeat Until Complete):
 Find another predicate from the remaining set (Pr) that can further divide any
fragment created so far.
 Add this new predicate to the selected set (Pr′) and remove it from the original set
(Pr).
 Add the new fragment to the set of fragments.
 Check for Relevance:
 If a predicate in the selected set (Pr′) becomes irrelevant (doesn't contribute to
further division), remove it.
 Finish:
 Continue the process until the set of selected predicates (Pr′) can fully divide the
relation.
 Outcome:
The algorithm ensures that the final set of predicates (Pr′) is minimal (no unnecessary
predicates) and complete (fully divides the relation effectively).
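
A loose Python sketch of this control flow, under the assumption that Rule 1 ("this predicate meaningfully divides some current fragment") is supplied by the designer as a callback; this illustrates the add-then-prune structure, not the textbook's exact algorithm:

    def com_min(predicates, is_relevant):
        # is_relevant(p, selected): True if p further divides a fragment induced
        # by 'selected' in a way some application cares about (Rule 1).
        selected, remaining = [], list(predicates)
        progress = True
        while progress:                       # keep adding relevant predicates
            progress = False
            for p in list(remaining):
                if is_relevant(p, selected):
                    selected.append(p)
                    remaining.remove(p)
                    progress = True
        for p in list(selected):              # prune predicates made redundant
            if not is_relevant(p, [q for q in selected if q is not p]):
                selected.remove(p)
        return selected                       # minimal and complete w.r.t. the test

    print(com_min(["LOC = 'Paris'", "SAL > 3000"], lambda p, sel: p not in sel))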

PHORIZONTAL Algorithm
 Input:
 R: The relation (table) that needs to be fragmented.
 Pr: A set of simple predicates (conditions) that could be used to fragment the
relation.
 Step 1: Apply COM_MIN:
 Use the COM_MIN algorithm on the relation R with the set of predicates Pr.
 The output is a new set of predicates Pr' that are minimal and complete.
 Step 2: Identify Minterm Predicates:
 Determine a set of minterm predicates M. These are combinations of the predicates
from Pr' that can be used to fragment the relation.
 Step 3: Check for Contradictions:
 Identify any contradictions among the minterm predicates in M.
 If any predicate contradicts another, remove it from M.
 Final Output:
 The algorithm produces M, the final set of minterm fragments, which effectively
fragments the relation R without contradictions.
 This algorithm is used to systematically divide a relation into smaller, non-conflicting
fragments based on given conditions.
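
A sketch under the same caveat as before: minterms are enumerated from the COM_MIN output and contradictory ones are dropped, with the contradiction test supplied by the designer (the predicates below are assumed):

    from itertools import product

    def phorizontal(pr_prime, is_contradictory):
        minterms = []
        for signs in product([True, False], repeat=len(pr_prime)):
            m = list(zip(pr_prime, signs))    # (predicate, asserted?) pairs
            if not is_contradictory(m):       # Step 3: drop impossible minterms
                minterms.append(m)
        return minterms

    def clash(m):
        d = dict(m)                           # both asserted at once -> impossible
        return bool(d.get("SAL > 3000")) and bool(d.get("SAL < 1000"))

    for m in phorizontal(["SAL > 3000", "SAL < 1000"], clash):
        print(m)                              # 3 of the 4 minterms survive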

Derived Horizontal Fragmentation


 Definition:
It's a type of fragmentation based on a selection operation applied to a member relation,
influenced by the fragmentation of its owner relation.
 Key Points:
The link between the owner and member relations is defined using an equi-join.
An equi-join can be implemented using semijoins.
 Process:
Given a link L, where the owner is S and the member is R, the fragments of R
are defined based on the fragments of S (a toy semijoin sketch follows the criteria below).
This ensures that the fragmentation of R aligns with that of S, but the fragments are
defined only on the attributes of R.
 Criteria for Fragmentation:
 Choose fragmentation that:
 Has better join characteristics.
 Is used in more applications.
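
A toy version of the semijoin that defines derived fragments (the relation names EMP and PAY, the join attribute 'title', and all rows are assumed for the example):

    # Owner PAY is already fragmented; member EMP follows its fragmentation
    # via a semijoin on the shared attribute 'title'.
    pay1 = [("Elect. Eng.",), ("Mech. Eng.",)]    # owner fragment 1
    pay2 = [("Programmer",), ("Syst. Anal.",)]    # owner fragment 2
    emp = [(1, "Ana", "Elect. Eng."), (2, "Marc", "Programmer")]

    def semijoin(member, owner_frag):
        titles = {t[0] for t in owner_frag}
        return [e for e in member if e[2] in titles]

    emp1 = semijoin(emp, pay1)   # EMP tuples that join with PAY fragment 1
    emp2 = semijoin(emp, pay2)
    print(emp1, emp2)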
Checking for Correctness
Completeness:
 Ensures that all data is included in the fragmentation, based on complete selection
predicates.
 For derived fragmentation, every tuple in a fragment of R must have a matching tuple
in the corresponding fragment of S (ensuring referential integrity).
Reconstruction:
 The original relation can be recreated by merging (union) all the fragments.
 Disjointness:
 For primary fragmentation, disjointness is guaranteed if the selection conditions don’t
overlap.
 For derived fragmentation, it's more complex and requires ensuring that tuples don’t
belong to multiple fragments unless necessary.

Vertical Fragmentation
 Definition:
 Divides a table into smaller fragments based on columns, with each fragment
containing some columns and the primary key.
 Objective:
 To create fragments that minimize the execution time of applications that access the
database.
 Challenges:
 More complex than horizontal fragmentation due to more possible fragment
combinations.
 Heuristic Approaches:
 Grouping: Starts with individual attributes and combines them based on criteria.
 Splitting: Starts with the full relation and splits it based on application access
patterns.
 Key Consideration:
 Replicate primary keys across fragments to allow reconstruction of the original
table.

Information Requirements of Vertical Fragmentation


 Vertical fragmentation relies heavily on understanding how applications interact with the
database.
 Measure of "Togetherness":
 Attributes that are often accessed together should be grouped in the same fragment. This is
measured by attribute affinity.
 Attribute Usage:
 For each application, determine which attributes it accesses (e.g., which columns are
needed in a query).
 Example: If a query needs BUDGET and PNAME, these attributes have a higher affinity.
 Attribute Affinity Matrix:
 An attribute affinity matrix is created to show how often different attributes are accessed
together by applications.
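
A small sketch of how an affinity value could be computed from assumed usage and frequency data: aff(Ai, Aj) sums the access frequencies of the queries that use both attributes.

    # use[q][a] = 1 if query q accesses attribute a; freq[q] = total access
    # frequency of q across all sites (all values assumed for illustration).
    use = {
        "q1": {"PNO": 1, "PNAME": 0, "BUDGET": 1, "LOC": 0},
        "q2": {"PNO": 0, "PNAME": 1, "BUDGET": 1, "LOC": 0},
    }
    freq = {"q1": 45, "q2": 5}

    attrs = ["PNO", "PNAME", "BUDGET", "LOC"]
    aff = {(a, b): sum(freq[q] for q in use if use[q][a] and use[q][b])
           for a in attrs for b in attrs}
    print(aff[("BUDGET", "PNO")])   # 45: only q1 uses both attributes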

Clustering Algorithm
Grouping Attributes:
Attributes with high affinity are grouped together using a bond energy algorithm.
 Goal of the Algorithm:
Maximize the similarity of grouped attributes.
Keep the computation time reasonable.
 Steps:
Initialize with some attributes.
Iteratively add more attributes to maximize the grouping benefit.
Adjust rows and columns in the matrix to align similar attributes together.
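
A rough sketch of the two quantities the bond energy algorithm works with, using an assumed 3-attribute affinity matrix: bond(x, y) measures how similar two columns of the matrix are, and cont() is the net contribution of inserting a column between two others, which the algorithm maximizes at each step.

    def bond(aff, attrs, x, y):
        # Similarity of two columns of the affinity matrix.
        return sum(aff[(z, x)] * aff[(z, y)] for z in attrs)

    def cont(aff, attrs, left, new, right):
        # Net contribution of placing column 'new' between 'left' and 'right'.
        b = lambda u, v: 0 if None in (u, v) else bond(aff, attrs, u, v)
        return 2 * b(left, new) + 2 * b(new, right) - 2 * b(left, right)

    attrs = ["A", "B", "C"]
    aff = {("A", "A"): 45, ("A", "B"): 0,  ("A", "C"): 45,
           ("B", "A"): 0,  ("B", "B"): 80, ("B", "C"): 5,
           ("C", "A"): 45, ("C", "B"): 5,  ("C", "C"): 53}
    print(cont(aff, attrs, "A", "C", "B"))  # benefit of the ordering A, C, B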

Partitioning Algorithm:
 Objective:
Split the attributes into groups (fragments) that are accessed mainly by specific
applications.
 Optimization:
Find the best split point in the matrix that maximizes access efficiency and minimizes
overlap.
 Handling Multiple Splits:
For complex datasets, multiple splits may be required, but this can increase computational
complexity.
 Algorithm's Outcome:
The result is a set of fragments where attributes are grouped based on how applications use
them.
Correctness:
 Completeness:
Ensure that all attributes are assigned to at least one fragment.
 Reconstruction:
The original table can be rebuilt by joining the fragments together.
 Disjointness:
Fragments should not overlap, except for necessary attributes like keys.

Hybrid Fragmentation
 Combination of Fragmentations: You can combine both horizontal and vertical
fragmentations.
 Practical Limits: Usually, you don't need more than 2 levels of fragmentation because it
gets costly and complex.
 Reconstruction: To reconstruct the original data, you start from the smallest fragments and
join or union them.
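
A toy illustration with assumed fragment contents: reconstruction proceeds from the leaves, first taking the union of the horizontal fragments, then joining the vertical pieces on the shared key.

    # EMP(eno, ename, sal) fragmented vertically into (eno, ename) and
    # (eno, sal); the first vertical piece is also fragmented horizontally.
    v1_h1 = [(1, "Ana")]
    v1_h2 = [(2, "Marc")]
    v2 = [(1, 3000), (2, 4000)]

    v1 = v1_h1 + v1_h2            # step 1: union of the horizontal leaves
    emp = [(e, n, s)              # step 2: join the vertical pieces on the key
           for (e, n) in v1 for (e2, s) in v2 if e == e2]
    print(emp)                    # [(1, 'Ana', 3000), (2, 'Marc', 4000)]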

Topic 13: Allocation of Data:


 What is Allocation?: It's the process of deciding where to store data fragments across a
network of computers.
 Two Main Goals:
 Minimize Costs: This includes storage, communication, and processing costs.
 Improve Performance: Ensuring data is processed and retrieved quickly.
 Challenges:
 The allocation of one fragment affects the optimal allocation of other fragments.
 Complexities in query processing, integrity enforcement, and concurrency control should
be considered.
 This problem is difficult to solve optimally, so heuristic methods are often used.
Information Requirements for Allocation:
 Database Information: Size and selectivity of fragments.
 Application Information: Number of read and update accesses.
 Site Information: Storage and processing capacity at each site.
 Network Information: Communication costs between sites.
Allocation Model:
 Goal: Minimize total processing and storage costs while meeting response time, storage,
and processing constraints.
 Components:
 Total Cost: Includes storage and query processing.
 Constraints: Ensure data fits within site limits and queries are processed within acceptable
time.
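
A toy rendering of such a cost model (all names, sizes, and unit costs below are assumptions): total cost is the storage cost of each fragment at its site plus the cost of shipping query traffic from every site to the fragment's site, and an allocator would search for the assignment minimizing it subject to the capacity and response-time constraints.

    def total_cost(assign, frag_size, sites, traffic, store_cost, ship_cost):
        # assign[f] = site holding fragment f (single copy, for simplicity).
        storage = sum(store_cost[assign[f]] * frag_size[f] for f in frag_size)
        comm = sum(traffic[(f, s)] * ship_cost[(assign[f], s)]
                   for f in frag_size for s in sites)
        return storage + comm

    sites = ["S1", "S2"]
    frag_size = {"F1": 100, "F2": 200}
    store_cost = {"S1": 1.0, "S2": 0.5}                 # per unit stored
    ship_cost = {(a, b): 0 if a == b else 1.0 for a in sites for b in sites}
    traffic = {(f, s): 10 for f in frag_size for s in sites}  # accesses per site
    print(total_cost({"F1": "S1", "F2": "S2"},
                     frag_size, sites, traffic, store_cost, ship_cost))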
Data Directory
 Purpose: Stores metadata (information about the data), such as schemas and access
information.
 Types:
 Global Directory: Describes the entire distributed database.
 Local Directory: Describes data at individual sites.
 Considerations:
 Location: Can be centralized or distributed across sites.
 Replication: Having multiple copies can improve reliability but complicates
updates.
Conclusion:
 Focus: Distributed database design needs better integration of horizontal and vertical
fragmentation methods.
 Challenge: Creating a comprehensive design methodology that combines both types of
fragmentations effectively.
