
EMERGING TECHNOLOGIES IN DATA PROCESSING

UNIT-2

DBMS Architecture

o The design of a DBMS depends upon its architecture. The basic client/server architecture
is used to deal with a large number of PCs, web servers, database servers and other
components that are connected over a network.
o The client/server architecture consists of many PCs and workstations that are
connected via a network.
o DBMS architecture depends upon how users are connected to the database to get their
requests done.

Types of DBMS Architecture

Database architecture can be seen as single-tier or multi-tier. Logically, multi-tier database
architecture is of two types: 2-tier architecture and 3-tier architecture.

1-Tier Architecture

o In this architecture, the database is directly available to the user: the user sits
directly on the DBMS and uses it.
o Any changes made here are applied directly to the database itself. It doesn't provide
a handy tool for end users.
o The 1-Tier architecture is used for developing local applications, where
programmers can communicate directly with the database for a quick response.

2-Tier Architecture

o The 2-Tier architecture is the same as a basic client-server architecture. In the two-tier
architecture, applications on the client end can directly communicate with the database at the
server side. For this interaction, APIs like ODBC and JDBC are used.


o The user interfaces and application programs run on the client side.
o The server side is responsible for providing functionalities like query processing and
transaction management.
o To communicate with the DBMS, the client-side application establishes a connection
with the server side.

Fig: 2-tier Architecture
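As a concrete illustration, the following is a minimal sketch of a 2-tier client using JDBC; the server URL, database, credentials and Employee table are hypothetical, and a MySQL JDBC driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class TwoTierClient {
        public static void main(String[] args) {
            // Hypothetical server location and credentials.
            String url = "jdbc:mysql://db-server.example.com:3306/company";
            try (Connection conn = DriverManager.getConnection(url, "appUser", "secret");
                 Statement stmt = conn.createStatement();
                 // The query is shipped to the server; query processing and
                 // transaction management happen on the server side.
                 ResultSet rs = stmt.executeQuery("SELECT emp_name FROM Employee")) {
                while (rs.next()) {
                    System.out.println(rs.getString("emp_name"));
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }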

3-Tier Architecture

o The 3-Tier architecture contains another layer between the client and the server. In this
architecture, the client cannot communicate directly with the server.
o The application on the client end interacts with an application server, which in turn
communicates with the database system.
o The end user has no idea about the existence of the database beyond the application
server, and the database has no idea about any user beyond the application.
o The 3-Tier architecture is used for large web applications.


Centralized and Client Server Architecture for DBMS

Centralized Architecture of DBMS:

 Architectures for DBMSs have generally followed trends seen in architectures for larger
computer systems.
 The primary processing for all system functions, including user application programs, user
interface programs, and all DBMS capabilities, was handled by mainframe computers in
earlier systems.
 The primary cause of this was that the majority of users accessed such systems using
computer terminals with little processing power and only display capabilities.
 Only display data and controls were delivered from the computer system to the display
terminals, which were connected to the central node by a variety of communications
networks, while all processing was done remotely on the computer system.
 The majority of users switched from terminals to PCs and workstations as hardware prices
decreased.
 Initially, Database Systems operated on these computers in a manner akin to how they had
operated display terminals. As a result, the DBMS itself continued to operate as a centralized
DBMS, where all DBMS functionality, application program execution, and UI processing
were done on a single computer.
 Client/server DBMS architectures emerged as DBMS systems gradually began to take
advantage of the computing power available on the user side.

Client-server Architecture of DBMS:

 In order to handle computing settings with a high number of PCs, workstations, file servers,
printers, database servers, etc., the client/server architecture was designed.
 A network connects various pieces of software and hardware, including email and web server
software.
 The aim is to define specialized servers with a particular functionality. For instance, a
number of PCs or small workstations can be linked as clients to a file server that manages the
files of the client machines.


 A machine with connections to numerous printers can be designated as a printer
server; all print requests from clients are then directed to this machine.
 The category of specialized servers also includes web servers and email servers. Many client
machines can utilize the resources offered by specialized servers.
 The user is given the proper user interfaces for these servers as well as local processing power
to run local applications on the client devices.
 This idea can be applied to various types of software, where specialist applications, like a
CAD (computer-aided design) package, are kept on particular server computers and made
available to a variety of clients. Some devices (such as workstations or PCs with discs that
only have client software installed) would only be client sites.
 The idea of client/server architecture presupposes an underlying structure made up of
several PCs and workstations, as well as a smaller number of mainframe computers, connected
via LANs and other types of computer networks.
 In this system, a client is typically a user machine that offers local processing and user interface
capabilities. When a client needs access to extra features, like database access, that are not
available on that machine, it connects to a server that offers those features.
 A server is a computer system that includes both hardware and software that can offer client
computer services like file access, printing, archiving, or database access.
 Generally speaking, some machines install both client and server software, while others
install only client software.
 It is more typical, however, for client and server software to run on separate
machines. Two-tier and three-tier fundamental DBMS architectures were developed
on this underlying client/server framework.

Two-Tier Client Server Architecture:

Here, the term "two-tier" refers to our architecture's two layers, the Client layer and the Data layer.
There are a number of client computers in the client layer that can contact the database server. An
API on the client computer, such as JDBC or ODBC, links the computer to the database server,
since clients and the database server may be at different physical locations.

Three-Tier Client-Server Architecture:

The Business Logic Layer is an additional layer that serves as a link between the Client layer and the
Data layer. Unlike in a two-tier architecture, where queries are executed in the database server, the
business logic layer is where the application programs are processed: the application programs are
processed in the application server itself.

Server Architecture

What Does Server Architecture Mean?

Server architecture is the foundational layout or model of a server, based on which a server is
created and/or deployed.

It defines how a server is designed, different components the server is created from, and the
services that it provides.

Server architecture primarily helps in designing and evaluating the server and its associated
operations and services as a whole, before it is actually deployed.
Server architecture includes, but is not limited to:


 Physical capacity of the server (computing power and storage)
 Installed components
 Types and layers of applications and operating system
 Authentication and overall security mechanism
 Networking and other communication interfaces with other applications and/or services

Parallel and Distributed databases

1. Parallel Database :
 A parallel DBMS is a DBMS that runs across multiple processors and is designed to
execute operations in parallel, whenever possible.
 The parallel DBMS links a number of smaller machines to achieve the same
throughput as expected from a single large machine.

Features :
1. Multiple CPUs work in parallel
2. It improves performance
3. It divides large tasks into various smaller tasks
4. Completes work very quickly

2. Distributed Database :
 A distributed database is defined as a logically interrelated collection of shared data
that is physically distributed over a computer network across different sites.
 A distributed DBMS is defined as the software that allows for the management of
the distributed database and makes the distributed data available to the users.

Features :
1. It is a group of logically related shared data
2. The data gets split into various fragments
3. There may be a replication of fragments
4. The sites are linked by a communication network

The main difference between parallel and distributed databases is that the former
is tightly coupled and the latter loosely coupled.

Difference between Parallel and Distributed databases :

1. In parallel databases, the processes are tightly coupled and constitute a single database
system, i.e., a parallel database is a centralized database and the data reside in a single
location. In distributed databases, the sites are loosely coupled and share no physical
components, i.e., the database is geographically separated and the data are distributed at
several locations.

2. In parallel databases, query processing and transaction handling are complicated. In
distributed databases, query processing and transaction handling are even more complicated.

3. The distinction between local and global transactions does not apply to parallel databases.
In distributed databases, a transaction may be local (executed at one site) or global (executed
across several sites).

4. In parallel databases, the data is partitioned among various disks so that it can be retrieved
faster. In distributed databases, each site maintains a local database system for faster
processing, owing to the slow interconnection between sites.

5. Parallel databases have three types of architecture: shared memory, shared disk, and
shared-nothing. Distributed databases are generally a kind of shared-nothing architecture.

6. In parallel databases, query optimization is more complicated. In distributed databases,
query optimization techniques may differ at different sites and are easier to maintain.

7. In parallel databases, data is generally not replicated. In distributed databases, data is
replicated at any number of sites to improve the performance of the system.

8. Parallel databases are generally homogeneous in nature. Distributed databases may be
homogeneous or heterogeneous in nature.

9. Skew is the major issue with an increasing degree of parallelism in parallel databases.
Blocking due to site failure and transparency are the major issues in distributed databases.

Parallelism in Query in DBMS

Parallelism in a query allows the parallel execution of multiple queries by decomposing
them into parts that work in parallel. This can be achieved with a shared-nothing architecture.
Parallelism is also used to speed up the execution of a query, as more and more
resources like processors and disks are provided. We can achieve parallelism in a query by
the following methods:


1. I/O parallelism
2. Intra-query parallelism
3. Inter-query parallelism
4. Intra-operation parallelism
5. Inter-operation parallelism
1. I/O parallelism :
 It is a form of parallelism in which the relations are partitioned across multiple disks,
with the motive of reducing the retrieval time of relations from the disk.
 The input data is partitioned, and then each partition is processed in parallel.
 The results are merged after processing all the partitioned data. It is also known
as data-partitioning.
 Hash partitioning has the advantage that it provides an even distribution of data across
the disks, and it is also best suited for point queries that are based on the
partitioning attribute.
 It is to be noted that partitioning is useful for sequential scans of an entire table
placed on 'n' disks: the time taken to scan the relation is
approximately 1/n of the time required to scan the table on a single-disk system.
 We have four types of partitioning in I/O parallelism (a short sketch of the first three
is given after this list):
 Hash partitioning –
As we already know, a hash function is a fast mathematical function. Each row of the
original relation is hashed on the partitioning attributes. For example, let's assume that
there are 4 disks disk1, disk2, disk3, and disk4 across which the data is to be partitioned.
If the function returns 3, then the row is placed on disk3.

 Range partitioning –
In range partitioning, contiguous ranges of attribute values are assigned to the disks. For
example, with 3 disks numbered 0, 1, and 2, range partitioning may assign tuples with a
value less than 5 to disk0, values between 5 and 40 to disk1, and values greater than 40
to disk2. Its advantage is that tuples whose attribute values fall within a certain range are
placed together on one disk. (Figure 1: Range partitioning.)

 Round-robin partitioning –
In round-robin partitioning, the relation is read in any order. The ith tuple is sent
to disk number (i mod n), so the disks take turns receiving new rows of data. This technique
ensures an even distribution of tuples across disks and is ideally suited to applications
that wish to read the entire relation sequentially for each query.

 Schema partitioning –
In schema partitioning, different tables within a database are placed on different disks.


(Figure 2: Schema partitioning.)
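The first three partitioning rules reduce to simple arithmetic on the partitioning attribute. The sketch below, with made-up keys and the 3-disk range example from above, assumes an integer partitioning attribute and disks numbered 0..n-1:

    import java.util.Arrays;

    public class PartitioningDemo {
        // Hash partitioning: hash the partitioning attribute, take it modulo the disk count.
        static int hashPartition(int key, int n) {
            return Math.floorMod(Integer.hashCode(key), n);
        }

        // Round-robin partitioning: the i-th tuple goes to disk (i mod n).
        static int roundRobin(int tupleIndex, int n) {
            return tupleIndex % n;
        }

        // Range partitioning with split points {5, 41}: key < 5 -> disk 0,
        // 5..40 -> disk 1, key > 40 -> disk 2.
        static int rangePartition(int key, int[] splitPoints) {
            for (int d = 0; d < splitPoints.length; d++) {
                if (key < splitPoints[d]) return d;
            }
            return splitPoints.length;
        }

        public static void main(String[] args) {
            int[] keys = {3, 17, 42, 8, 55};   // made-up attribute values
            int n = 3;
            for (int i = 0; i < keys.length; i++) {
                System.out.printf("key %d -> hash disk %d, round-robin disk %d, range disk %d%n",
                        keys[i], hashPartition(keys[i], n), roundRobin(i, n),
                        rangePartition(keys[i], new int[]{5, 41}));
            }
            System.out.println(Arrays.toString(keys));
        }
    }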

2. Intra-query parallelism :
 Intra-query parallelism refers to the execution of a single query as a parallel process
on different CPUs, using a shared-nothing parallel architecture technique.
 This uses two types of approaches:
 First approach –
In this approach, each CPU executes the same task, each against some portion of the data.
 Second approach –
In this approach, the task is divided into different subtasks, with each CPU executing a
distinct subtask.
3. Inter-query parallelism :
 In Inter-query parallelism, there is an execution of multiple transactions by each CPU.
 It is called parallel transaction processing.
 The DBMS uses transaction dispatching to carry out inter-query parallelism.
 Some other methods, like efficient lock management, can also be used. Without
such parallelism, each query runs sequentially, which slows down the running of
long queries.
 In such cases, DBMS must understand the locks held by different transactions running
on different processes.
 Inter-query parallelism on shared-disk architecture performs best when transactions
that execute in parallel do not access the same data.
 Also, it is the easiest form of parallelism in DBMS, and it increases
transaction throughput.
4. Intra-operation parallelism :
 Intra-operation parallelism is a form of parallelism in which we parallelize the
execution of each individual operation of a task, such as sorting, joins, projections, and so
on.
 The level of parallelism is very high in intra-operation parallelism. This type of
parallelism is natural in database systems.
 Let’s take an SQL query example:
SELECT * FROM Vehicles ORDER BY Model_Number;
In the above query, the relational operation is sorting and since a relation can have a large
number of records in it, the operation can be performed on different subsets of the relation in
multiple processors, which reduces the time required to sort the data.
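The idea can be seen with the standard library alone. As a small, hedged illustration (the Model_Number values are made up), Arrays.parallelSort sorts sub-ranges of the array on different cores and merges the results, mirroring the partition-sort-merge strategy described above:

    import java.util.Arrays;

    public class ParallelSortDemo {
        public static void main(String[] args) {
            int[] modelNumbers = {42, 7, 99, 13, 58, 21, 4, 76}; // made-up Model_Number values
            // Sub-ranges are sorted in parallel via the fork/join pool and then merged,
            // just as a parallel ORDER BY sorts subsets of the relation on several CPUs.
            Arrays.parallelSort(modelNumbers);
            System.out.println(Arrays.toString(modelNumbers));
        }
    }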
5. Inter-operation parallelism :
When different operations in a query expression are executed in parallel, then it is called
inter-operation parallelism. They are of two types –
 Pipelined parallelism –
In pipelined parallelism, an output row of one operation is consumed by a second
operation even before the first operation has produced its entire set of output rows.
It is possible to run these two operations simultaneously on different CPUs, so that
one operation consumes tuples in parallel with the operation that produces them. It is
useful with a small number of CPUs and avoids writing intermediate results to disk
(a sketch is given after this list).


 Independent parallelism –
In this parallelism, the operations in a query expression that do not depend on each
other can be executed in parallel. This form of parallelism is very useful when only a
low degree of parallelism is available.
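The producer-consumer structure of pipelined parallelism can be sketched with two threads and a bounded queue; the operator names and row values below are invented for illustration, and -1 serves as an end-of-stream marker:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(4);
            // "Selection" operator: emits matching rows one at a time.
            Thread selection = new Thread(() -> {
                try {
                    for (int row = 1; row <= 10; row++) {
                        if (row % 2 == 0) pipe.put(row);
                    }
                    pipe.put(-1);   // end-of-stream marker
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            // "Projection" operator: consumes rows while the producer is still running,
            // so no intermediate result is ever written out in full.
            Thread projection = new Thread(() -> {
                try {
                    for (int row = pipe.take(); row != -1; row = pipe.take()) {
                        System.out.println("projected row " + row);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            selection.start(); projection.start();
            selection.join(); projection.join();
        }
    }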

Design of Parallel Databases | DBMS

A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is mainly
designed to execute query operations in parallel, wherever possible. The parallel DBMS
links a number of smaller machines to achieve the same throughput as expected from a
single large machine.
In Parallel Databases, mainly there are three architectural designs for parallel DBMS. They
are as follows:
1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture
Let’s discuss them one by one:
1. Shared Memory Architecture-
 In Shared Memory Architecture, there are multiple CPUs that are attached to an
interconnection network.
 They are able to share a single global main memory and common disk arrays. It is
to be noted that, in this architecture, a single copy of a multi-threaded operating
system and a multithreaded DBMS can support these multiple CPUs.
 Also, shared memory is a tightly coupled architecture in which multiple CPUs
share their memory.
 It is also known as symmetric multiprocessing (SMP).
 This architecture covers a very wide range of systems, starting from personal
workstations that support a few microprocessors in parallel, up to large RISC-based
multiprocessor systems.

Fig: Shared Memory Architecture

Advantages :
1. It has high-speed data access for a limited number of processors.
2. The communication is efficient.
Disadvantages :
1. It cannot scale beyond 80 or 100 CPUs in parallel.


2. The bus or the interconnection network gets blocked as the number of CPUs
increases.
2. Shared Disk Architectures :
 In Shared Disk Architecture, various CPUs are attached to an interconnection
network.
 In this, each CPU has its own memory and all of them have access to the same disk.
 Also, note that here the memory is not shared among CPUs; therefore, each node has
its own copy of the operating system and DBMS.
 Shared disk architecture is a loosely coupled architecture optimized for applications
that are inherently centralized. Such systems are also known as clusters.

Fig: Shared Disk Architecture

Advantages :
1. The interconnection network is no longer a bottleneck, since each CPU has its own memory.
2. Load-balancing is easier in shared disk architecture.
3. There is better fault tolerance.
Disadvantages :
1. If the number of CPUs increases, the problems of interference and memory contentions
also increase.
2. There’s also exists a scalability problem.
3. Shared Nothing Architecture :
 Shared-nothing architecture is a multiple-processor architecture in which each
processor has its own memory and disk storage.
 In this, multiple CPUs are attached to an interconnection network through a node.
Also, note that no two CPUs can access the same disk area.
 In this architecture, no sharing of memory or disk resources is done.
 It is also known as Massively parallel processing (MPP).

Fig: Shared Nothing Architecture


Advantages :
1. It has better scalability as no sharing of resources is done
2. Multiple CPUs can be added
Disadvantages:
1. The cost of communication is higher, as it involves sending data and software
interaction at both ends
2. The cost of non-local disk access is higher than the cost of shared disk architectures.

Distributed DBMS - Database Environments

Distributed databases can be classified into homogeneous and heterogeneous databases,
each having further divisions.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments, each with further sub-divisions.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating
systems. Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
 Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems,
DBMS products and data models. Its properties are −
 Different sites use dissimilar schemas and software.


 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object-oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in
processing user requests.
Types of Heterogeneous Distributed Databases
 Federated − The heterogeneous database systems are independent in nature and
integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −


 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models

Some of the common architectural models are −

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture
Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions mainly include the user interface; however, clients
also have some functions like consistency checking and transaction management.
The two different client-server architectures are −

 Single Server Multiple Client


 Multiple Server Multiple Client


Peer-to-Peer Architecture for DDBMS


In these systems, each peer acts both as a client and as a server for providing database services.
The peers share their resources with other peers and coordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.

Multi - DBMS Architectures


This is an integrated database system formed by a collection of two or more autonomous
database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views, each comprising a subset of
the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database, comprising
global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites
and multi-database to local data mapping.


 Local database View Level − Depicts public view of local data.


 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.


 Model without multi-database conceptual level.

Design Alternatives

The distribution design alternatives for the tables in a DDBMS are as follows −

 Non-replicated and non-fragmented


 Fully replicated
 Partially replicated
 Fragmented
 Mixed
Non-replicated & Non-fragmented
In this design alternative, different tables are placed at different sites. Data is placed in close
proximity to the site where it is used most. It is most suitable for database systems where the
percentage of queries needed to join information in tables placed at different sites is low. If an
appropriate distribution strategy is adopted, then this design alternative helps to reduce the
communication cost during data processing.
Fully Replicated
In this design alternative, a copy of all the database tables is stored at each site. Since each
site has its own copy of the entire database, queries are very fast, requiring negligible
communication cost. On the contrary, the massive redundancy in data requires a huge cost
during update operations. Hence, this is suitable for systems where a large number of queries
must be handled while the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration the fact
that the frequency of accessing the tables varies considerably from site to site. The number of
copies of the tables (or portions) depends on how frequently the access queries execute and
on the sites that generate the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions,
and each fragment can be stored at different sites. This considers the fact that it seldom
happens that all data stored in a table is required at a given site. Moreover, fragmentation
increases parallelism and provides better disaster recovery. Here, there is only one copy of
each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −

 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the tables are initially
fragmented in any form (horizontal or vertical), and these fragments are then partially
replicated across the different sites according to the frequency of access to the fragments.

Distributed Data Storage

In this section we discuss how data is stored at different sites in a distributed database
management system.

 There are two ways in which data can be stored at different sites. These are,
1. Replication.
2. Fragmentation.

Replication

 As the name suggests, the system stores copies of data at different sites. If an entire
database is available on multiple sites, it is a fully redundant database.
 The advantage of data replication is that it increases availability of data on different
sites. As the data is available at different sites, queries can be processed parallelly.


 However, data replication has some disadvantages as well. Data needs to
be constantly updated and kept synchronized with the other sites; if any site fails to
achieve this, it will lead to inconsistencies in the database. Availability of data
benefits greatly from replication.
 Constant updating complicates concurrency control and is also an overhead for the
servers.

Fragmentation

 In fragmentation, the relations are fragmented, which means they are split into
smaller parts. Each of the fragments is stored on a different site, where it is required.
Here, the data is not replicated, and no copies are created. Consistency of data
benefits greatly from fragmentation.
 The prerequisite for fragmentation is to make sure that the fragments can later be
reconstructed into the original relation without losing any data.
 Consistency is not a problem here, as each site holds a different piece of information.
 There are two types of fragmentation,
o Horizontal Fragmentation – Splitting by rows.
o Vertical fragmentation – Splitting by columns.

Horizontal Fragmentation (or Sharding)

 The relation schema is fragmented into groups of rows, and each group is then
assigned to one fragment.

Vertical Fragmentation

 The relation schema is fragmented into groups of columns, called smaller schemas.
These smaller schemas are then assigned to the fragments.
 Each fragment must contain a common candidate key to guarantee a lossless join.
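As a toy, hedged illustration of the two styles (the Employee relation, its attributes and its values are all made up), emp_id plays the candidate key that every vertical fragment keeps so the original relation can be rebuilt with a lossless join:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class FragmentationDemo {
        record Employee(int empId, String name, String city, int salary) {}

        public static void main(String[] args) {
            List<Employee> employee = List.of(
                    new Employee(1, "Asha", "Chennai", 12000),
                    new Employee(2, "Ravi", "Mumbai", 9000),
                    new Employee(3, "Meena", "Chennai", 15000));

            // Horizontal fragmentation (sharding): split by rows, here by city.
            List<Employee> chennaiFragment = employee.stream()
                    .filter(e -> e.city().equals("Chennai"))
                    .collect(Collectors.toList());

            // Vertical fragmentation: split by columns; the fragment keeps the key emp_id.
            List<Map<String, Object>> salaryFragment = employee.stream()
                    .map(e -> Map.<String, Object>of("emp_id", e.empId(), "salary", e.salary()))
                    .collect(Collectors.toList());

            System.out.println(chennaiFragment);
            System.out.println(salaryFragment);
        }
    }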

DDBMS - Transaction Processing Systems

Transactions

A transaction is a program comprising a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more
database operations like insert, delete, update or retrieve. It is an atomic process that is
either performed to completion entirely or not performed at all. A transaction involving
only data retrieval, without any data update, is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads a data item from storage into main memory.
 modify_item() − changes the value of the item in main memory.
 write_item() − writes the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.


Transaction Operations

The low level operations performed in a transaction are −


 begin_transaction − A marker that specifies start of transaction execution.
 read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully completed in
its entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
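These markers map directly onto a client API such as JDBC. The following hedged sketch (the server URL, credentials and account table are hypothetical) disables auto-commit to begin an explicit transaction, commits on success, and rolls back the temporary changes on failure:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class TransferDemo {
        public static void main(String[] args) {
            String url = "jdbc:mysql://db-server.example.com:3306/bank"; // hypothetical
            try (Connection conn = DriverManager.getConnection(url, "appUser", "secret")) {
                conn.setAutoCommit(false);                  // begin_transaction
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate("UPDATE account SET balance = balance - 100 WHERE id = 1");
                    stmt.executeUpdate("UPDATE account SET balance = balance + 100 WHERE id = 2");
                    conn.commit();                          // commit: changes will not be undone
                } catch (SQLException e) {
                    conn.rollback();                        // rollback: temporary changes undone
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }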

Transaction States

A transaction may go through a subset of five states: active, partially committed, committed,
failed and aborted.
 Active − The initial state where the transaction enters is the active state. The
transaction remains in this state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
 Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
 Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
 Aborted − This is the state after the transaction has been rolled back after failure and
the database has been restored to its state that was before the transaction began.
A state transition diagram can depict these states of a transaction and the low-level
transaction operations that cause the changes in state.

Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.


 Atomicity − This property states that a transaction is an atomic unit of processing, that
is, either it is performed in its entirety or not performed at all. No partial update should
exist.
 Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the system.
There should not be any interference from the other concurrent transactions that are
simultaneously running.
 Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.

DISTRIBUTED DBMS - COMMIT PROTOCOLS

The transaction manager in a local database system just needs to inform the recovery
manager of its decision to commit a transaction. In a distributed system, however, the
transaction manager should consistently enforce the decision to commit and communicate it
to all the servers at the various sites where the transaction is being executed. Processing at
each site ends when the site reaches the partially committed transaction state, where it waits
for all the other sites to reach their partially committed states. Once it receives the signal that
all the sites are prepared, it begins to commit. In a distributed system, either every site
commits, or none of them does.

To guarantee atomicity, the execution's ultimate result must be accepted by every site where
transaction T was executed. T must either commit at every location or abort at every location.
The transaction coordinator of T must carry out a commit protocol in order to guarantee this
property.

There is a total of three different Distributed DBMS Commit Protocols:

o One-phase Commit,
o Two-phase Commit, and
o Three-phase Commit.

1. One-Phase Commit

The distributed one-phase commit is the most straightforward commit protocol. Consider the
scenario where the transaction is being carried out at a controlling site and several slave sites.
These are the steps followed in the one-phase distributed commit protocol:

o Each slave sends a "DONE" notification to the controlling site once it has
successfully finished its transaction locally.
o The slaves await the controlling site's "Commit" or "Abort" message. This waiting
period is known as the window of vulnerability.


o After receiving the "DONE" message from each slave, the controlling site decides
whether to commit or abort; this point is the commit point. It then broadcasts its
decision to every slave.
o An acknowledgement message is sent to the controlling site by the slave once it
commits or aborts in response to this message.

2. Two-Phase Commit

The two-phase commit protocol (2PC), explained in this section, is one of the most
straightforward and popular commit protocols. Distributed two-phase commit reduces the
vulnerability of one-phase commit protocols. The following actions are taken in the
two phases:

Consider a transaction T that proceeds at site Si and is coordinated by the transaction
coordinator Ci.

When T has completed, that is, when all the sites at which T has run notify Ci that T has
finished, Ci initiates the 2PC protocol.

Phase 1 - Obtaining a Decision

o Ci inserts the record <prepare T> into the log and forces the log onto stable
storage. It then notifies every site where T was executed to prepare T.
o When such a message is received, the transaction manager at that site
decides whether it is willing to commit its share of T.
o If the response is negative, it adds a record <no T> to the log and sends Ci an
abort T message. If the response is affirmative, it adds a record <ready T> to the
log and forces the log (along with every log record of T) onto stable storage.
o The transaction manager then returns a ready T message to Ci.

Phase 2 - Recording the Decision


o When all of the sites have responded to the prepare T message, or after a
predetermined amount of time has passed since the prepare T message was
delivered, Ci can decide whether the transaction T is to be committed or aborted.
o Transaction T can be committed if Ci received a ready T message from every site
involved in the transaction. If not, transaction T must be aborted. Depending on the
outcome, a <commit T> or <abort T> record is added to the log, and the log is
forced onto stable storage.

o The transaction's outcome has been decided at this point. The coordinator then
sends one of two messages to all involved sites: a commit T message or an abort
T message. When a site receives the message, it enters it into its local log.
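The decision rule of phase 2 (commit only on a unanimous "ready" vote) can be sketched as a toy simulation; the Participant interface and the two site behaviours below are invented for illustration, and the sketch ignores logging, timeouts and recovery:

    import java.util.List;

    public class TwoPhaseCommitDemo {
        interface Participant {
            boolean prepare();      // phase 1: vote ready (true) or abort (false)
            void commit();          // phase 2 outcomes
            void abort();
        }

        static void runTwoPhaseCommit(List<Participant> participants) {
            boolean allReady = true;
            for (Participant p : participants) {    // Phase 1: collect the votes
                allReady &= p.prepare();
            }
            for (Participant p : participants) {    // Phase 2: broadcast the decision
                if (allReady) p.commit(); else p.abort();
            }
            System.out.println(allReady ? "transaction committed" : "transaction aborted");
        }

        public static void main(String[] args) {
            Participant site1 = new Participant() {
                public boolean prepare() { return true; }    // votes <ready T>
                public void commit() { System.out.println("site1 commits"); }
                public void abort()  { System.out.println("site1 aborts"); }
            };
            Participant site2 = new Participant() {
                public boolean prepare() { return false; }   // votes <no T>
                public void commit() { System.out.println("site2 commits"); }
                public void abort()  { System.out.println("site2 aborts"); }
            };
            runTwoPhaseCommit(List.of(site1, site2));        // one "no" aborts everywhere
        }
    }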

3. Three-Phase Commit

The two-phase commit protocol can be extended to overcome the blocking issue using the
three-phase commit (3PC) protocol, under particular assumptions.

It is specifically assumed that there are no network partitions and that no more than k sites
fail, where k is a predetermined number. Under these assumptions, the protocol avoids
blocking by adding a third phase that involves several sites in the commit decision.

Before recording the decision to commit in its persistent storage, the coordinator first makes
certain that at least k other sites know that it intended to commit the transaction. If the
coordinator fails, the surviving sites first elect a replacement.

The new coordinator checks the status of the protocol with the remaining sites. If the old
coordinator had made the decision to commit, at least one of the other k sites it had
notified would be alive and would make sure the commit decision is upheld. If some site
knew that the previous coordinator intended to commit the transaction, the new
coordinator starts over with the third phase of the protocol. Otherwise, the new coordinator
aborts the transaction.


The 3PC protocol has the advantage of not blocking unless k sites fail, but it also has the
disadvantage that a network partition might be mistaken for more than k sites failing,
which would result in blocking. In addition, the protocol must be carefully designed to
prevent inconsistent results, such as a transaction being committed in one partition but
aborted in another, in the event of a network partition (or of more than k sites failing). The
3PC protocol is not frequently used because of its overhead.

The three phases of the distributed three-phase commit protocol are as follows:

Phase one - Obtaining Preliminary Decision

o It is identical to Phase 1 of 2PC.
o Every site must be prepared to commit if directed to do so.

Phase 2 of 2PC is divided into Phase Two and Phase Three in 3PC.

Phase Two -

In Phase 2, the coordinator makes a decision as in 2PC (known as the pre-commit
decision) and records it at multiple (at least k) sites.

Phase Three -

In Phase 3, the coordinator notifies all participating sites whether to commit or abort.

Under 3PC, even if the coordinator fails, a commit decision can be reached using knowledge
of the pre-commit decisions.


Concurrency Control

Concurrency Control is the working concept that is required for controlling and managing the
concurrent execution of database operations and thus avoiding the inconsistencies in the
database. Thus, for maintaining the concurrency of the database, we have the concurrency
control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of database transactions.
Therefore, these protocols are categorized as:

o Lock-Based Concurrency Control Protocol
o Timestamp Concurrency Control Protocol
o Validation-Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it acquires an
appropriate lock on it. There are two types of lock:

1. Shared lock:

o It is also known as a read-only lock. With a shared lock, the data item can only be read
by the transaction.
o It can be shared between transactions, because a transaction holding a shared lock
cannot update the data item.

2. Exclusive lock:

o With an exclusive lock, the data item can be both read and written by the
transaction.
o This lock is exclusive: under it, multiple transactions cannot modify the same
data simultaneously.
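The shared/exclusive pair corresponds closely to Java's ReadWriteLock, which can serve as a hedged sketch of the idea (the dataItem variable stands in for a database item):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LockDemo {
        private static final ReadWriteLock lock = new ReentrantReadWriteLock();
        private static int dataItem = 0;    // stand-in for a database data item

        static int read() {
            lock.readLock().lock();         // shared lock: many readers may hold it at once
            try { return dataItem; }
            finally { lock.readLock().unlock(); }
        }

        static void write(int value) {
            lock.writeLock().lock();        // exclusive lock: no concurrent reader or writer
            try { dataItem = value; }
            finally { lock.writeLock().unlock(); }
        }

        public static void main(String[] args) {
            write(42);
            System.out.println(read());
        }
    }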

Timestamp Ordering Protocol

o The timestamp ordering protocol is used to order transactions based on their
timestamps. The order of the transactions is simply the ascending order of their
creation times.


o The priority of the older transaction is higher, which is why it executes first. To determine
the timestamp of a transaction, this protocol uses system time or a logical counter.
o The lock-based protocol manages the order between conflicting pairs of transactions
at execution time, whereas timestamp-based protocols start working as soon as a
transaction is created.
o Let's assume there are two transactions, T1 and T2. Suppose transaction T1
entered the system at time 007 and transaction T2 entered the system at time 009.
T1 has the higher priority, so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write'
operations on each data item.
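A hedged sketch of the basic timestamp-ordering test for a single data item follows. The rule applied (reject a read that arrives after a younger write; reject a write that arrives after a younger read or write) is the standard one, and the timestamps reuse the 007/009 example:

    public class TimestampOrderingDemo {
        // Timestamps of the last 'read' and 'write' of the single data item.
        static long rTS = 0, wTS = 0;

        static boolean read(long ts) {
            if (ts < wTS) return false;             // item already overwritten by a younger txn
            rTS = Math.max(rTS, ts);
            return true;
        }

        static boolean write(long ts) {
            if (ts < rTS || ts < wTS) return false; // a younger txn already read or wrote it
            wTS = ts;
            return true;
        }

        public static void main(String[] args) {
            System.out.println("T1(ts=007) writes: " + write(7));  // allowed
            System.out.println("T2(ts=009) reads:  " + read(9));   // allowed
            System.out.println("T1(ts=007) writes: " + write(7));  // rejected: T2 (ts=009) read it
        }
    }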

Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency control technique. In
the validation-based protocol, a transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T is read and executed. It reads
the values of the various data items and stores them in temporary local variables. All
write operations are performed on the temporary variables, without updating the
actual database.
2. Validation phase: In this phase, the temporary variable values are validated
against the actual data to see whether serializability would be violated.
3. Write phase: If the transaction passes validation, the temporary
results are written to the database or system; otherwise the transaction is rolled back.
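A minimal sketch of the three phases on a single-item store follows. The version counters, the item name X and the values are all invented for illustration, and a real implementation would validate against the read/write sets of concurrent transactions rather than simple version numbers:

    import java.util.HashMap;
    import java.util.Map;

    public class ValidationDemo {
        static Map<String, Integer> db = new HashMap<>(Map.of("X", 10));
        static Map<String, Integer> dbVersion = new HashMap<>(Map.of("X", 0));

        public static void main(String[] args) {
            // Read phase: remember the version of each item read; write only locally.
            Map<String, Integer> readVersions = new HashMap<>();
            Map<String, Integer> localWrites = new HashMap<>();
            readVersions.put("X", dbVersion.get("X"));
            localWrites.put("X", db.get("X") + 5);

            // (A concurrent transaction could update X here and bump its version.)

            // Validation phase: check that nothing the transaction read has changed.
            boolean valid = readVersions.entrySet().stream()
                    .allMatch(e -> dbVersion.get(e.getKey()).equals(e.getValue()));

            // Write phase: apply the buffered writes, or roll back.
            if (valid) {
                localWrites.forEach((k, v) -> { db.put(k, v); dbVersion.merge(k, 1, Integer::sum); });
                System.out.println("committed, X = " + db.get("X"));
            } else {
                System.out.println("validation failed, transaction rolled back");
            }
        }
    }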

Query Processing in DBMS

Query processing is the activity performed in extracting data from the database. Query
processing takes various steps to fetch the data from the database. The steps involved
are:

1. Parsing and translation
2. Optimization
3. Evaluation

The query processing works in the following way:

Parsing and Translation

 Query processing includes certain activities for data retrieval.


 A user query is initially given in a high-level database language such as
SQL. It then gets translated into expressions that can be used at the physical level
of the file system.
 After this, the actual evaluation of the query, along with a variety of query-optimizing
transformations, takes place.
 SQL, or Structured Query Language, is the most suitable choice for humans,
but it is not perfectly suitable for the internal representation of the query within the
system.
 Relational algebra is well suited for the internal representation of a query. The
translation process in query processing is similar to parsing a query.
 When a user executes a query, the parser in the system checks the syntax of the query,
verifies the names of the relations in the database, the tuples, and finally the required
attribute values, in order to generate the internal form of the query. The parser creates a
tree of the query, known as the 'parse tree'.
 Further, the parse tree is translated into relational algebra; in doing so, all
uses of views in the query are replaced.

Suppose a user executes a query. As we have learned, there are various methods of
extracting the data from the database. In SQL, suppose a user wants to fetch the records of the
employees whose salary is greater than 10000. For doing this, the following query
is undertaken:

select emp_name from Employee where salary>10000;

Thus, to make the system understand the user query, it needs to be translated in the form of
relational algebra. This query can be brought into relational algebra form as:

o πemp_name (σsalary>10000 (Employee))

In general, several equivalent relational algebra expressions can represent the same query,
and the system may choose among them.


After translating the given query, each relational algebra operation can be executed using
one of several different algorithms. So, in this way, query processing begins its working.
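As a hedged, in-memory illustration of evaluating the translated expression (the relation contents are made up), filter() plays the role of the selection σ and map() the projection π — the kind of primitive operations an evaluation plan is built from:

    import java.util.List;
    import java.util.stream.Collectors;

    public class AlgebraDemo {
        record Employee(String empName, int salary) {}

        public static void main(String[] args) {
            List<Employee> employee = List.of(
                    new Employee("Asha", 12000),
                    new Employee("Ravi", 9000),
                    new Employee("Meena", 15000));

            List<String> result = employee.stream()
                    .filter(e -> e.salary() > 10000)   // σ salary>10000
                    .map(Employee::empName)            // π emp_name
                    .collect(Collectors.toList());

            System.out.println(result);                // [Asha, Meena]
        }
    }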

Evaluation

In addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and
evaluating each operation. Thus, after translating the user query, the system executes a query
evaluation plan.

Query Evaluation Plan

o In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
o The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
o Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
o A query execution engine is responsible for generating the output of the given query.
It takes the query execution plan, executes it, and finally makes the output for the user
query.

Optimization

o The cost of query evaluation can vary for different types of queries. Although the
system is responsible for constructing the evaluation plan, the user does not need to
write the query efficiently.

o Usually, a database system generates an efficient query evaluation plan, which
minimizes its cost. This type of task, performed by the database system, is known
as query optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. This is because the overall operation cost depends on the memory
allocated to the various operations, execution costs, and so on.
