ADBMS Notes
Query processing is the process by which a declarative query is translated into low-level data
manipulation operations. SQL is the standard query language that is supported in current
DBMSs.
Query Processing steps:
Non-leaf nodes = operations of relational algebra (with parameters); Leaf nodes = relations
Query optimization refers to the process by which the best execution strategy for a given query is
found from among a set of alternatives.
The process typically involves two steps: query decomposition and query optimization.
Query decomposition takes an SQL query and translates it into one expressed in relational algebra. In the
process, the query is analyzed semantically so that incorrect queries are detected and rejected as early as
possible, and correct queries are simplified. Simplification involves the elimination of redundant
predicates which may be introduced as a result of query modification to deal with views, security
enforcement and semantic integrity control. The simplified query is then restructured as an algebraic
query.
For a given SQL query, there is more than one possible algebraic query. Some of these algebraic
queries are better than others. The quality of an algebraic query is defined in terms of expected
performance.
The traditional procedure is to obtain an initial algebraic query by translating the predicates and the target
statement into relational operations as they appear in the query. This initial algebraic query is then
transformed, using algebraic transformation rules, into other algebraic queries until the best one is
found.
The best algebraic query is determined according to a cost function which calculates the cost of
executing the query according to that algebraic specification. This is the process of query optimization.
Optimization typically takes one of two forms: Heuristic Optimization or Cost Based
Optimization.
In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering
the individual operations.
With Cost Based Optimization, the overall cost of executing the query is systematically
reduced by estimating the costs of executing several different execution plans.
Query Optimization
We divide the query optimization into two types: Heuristic (sometimes called Rule based) and
Systematic (Cost based).
A query can be represented as a tree data structure. Operations are at the interior nodes
and data items (tables, columns) are at the leaves.
For Example:
SELECT PNUMBER, DNUM, LNAME
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM = DNUMBER AND MGRSSN = SSN AND
PLOCATION = 'Stafford';
EMPLOYEE TABLE (only partly recoverable; missing cells are left blank):
FNAME   MI  LNAME     SSN        BDATE      ADDRESS
------  --  --------  ---------  ---------  ------------------------
        B   SMITH     123456789  09-JAN-55  731 FONDREN, HOUSTON, TX
        T   WONG      333445555  08-DEC-45  638 VOSS, HOUSTON TX
        J   ZELAYA    999887777  19-JUL-58  3321 CASTLE, SPRING, TX
        S   WALLACE   987654321  20-JUN-31  291 BERRY, BELLAIRE, TX
        K   NARAYAN   666884444  15-SEP-52  975 FIRE OAK, HUMBLE, TX
JOYCE   A
AHMAD   V
JAMES   E
Trailing columns recovered for the last three rows (headers not recoverable):
JOYCE   F  25000  333445555  5
AHMAD   M  25000  987654321  4
JAMES   M  55000             1

DEPARTMENT TABLE:
DNAME            DNUMBER  MGRSSN     MGRSTARTD
---------------  -------  ---------  ---------
HEADQUARTERS     1        888665555  19-JUN-71
ADMINISTRATION   4        987654321  01-JAN-85
RESEARCH         5        333445555  22-MAY-78

PROJECT TABLE:
PNAME            PNUMBER  PLOCATION  DNUM
---------------  -------  ---------  ----
ProductX         1        Bellaire   5
ProductY         2        Sugarland  5
ProductZ         3        Houston    5
Computerization  10       Stafford   4
Reorganization   20       Houston    1
NewBenefits      30       Stafford   4

WORKS_ON TABLE:
ESSN       PNO  HOURS
---------  ---  -----
123456789  1    32.5
123456789  2    7.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
333445555  2    10.0
333445555  3    10.0
333445555  10   10.0
333445555  20   10.0
999887777  30   30.0
999887777  10   10.0
987987987  10   35.0
987987987  30   5.0
987654321  30   20.0
987654321  20   15.0
888665555  20   null
Note the two cross product operations. These require lots of space and time (nested loops)
to build.
After the two cross products, we have a temporary table with 144 records (6 projects * 3
departments * 8 employees).
An overall rule for heuristic query optimization is to perform as many select and project
operations as possible before doing any joins.
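The effect of this rule on intermediate result size can be sketched in a few lines of Python. This is only an illustration: it uses the six PLOCATION values of the example PROJECT table and the row counts quoted above (3 departments, 8 employees); the function name is made up for the sketch.

```python
# PLOCATION values of the six PROJECT rows from the example data.
project_locations = ["Bellaire", "Sugarland", "Houston",
                     "Stafford", "Houston", "Stafford"]
NUM_DEPARTMENTS = 3
NUM_EMPLOYEES = 8


def cross_product_size(num_projects: int) -> int:
    """Rows produced by PROJECT x DEPARTMENT x EMPLOYEE."""
    return num_projects * NUM_DEPARTMENTS * NUM_EMPLOYEES


# Plan 1: build both cross products first, select afterwards.
naive_rows = cross_product_size(len(project_locations))

# Plan 2: apply PLOCATION = 'Stafford' to PROJECT before any cross product.
stafford = [loc for loc in project_locations if loc == "Stafford"]
pushed_rows = cross_product_size(len(stafford))

print(naive_rows, pushed_rows)  # 144 48
```

Pushing the single selection on PROJECT down already cuts the temporary table to a third of its size, before any join condition is considered.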
There are a number of transformation rules that can be used to transform a query:
1. Cascading selections. A list of conjunctive conditions can be broken up into
separate individual conditions.
σ_{c1 ∧ c2}(E) = σ_{c1}(σ_{c2}(E))
2. Commutativity of the selection operation.
3. Cascading projections. All but the last projection can be ignored.
Assume that attributes A1, . . . ,An are among B1, . . . ,Bm. Then
π_{A1,...,An}(π_{B1,...,Bm}(E)) = π_{A1,...,An}(E)
4. Commuting selection and projection. If a selection condition only involves
attributes contained in a projection clause, the two can be commuted.
5. Commutativity of Join and Cross Product.
6. Commuting selection with Join.
If c only involves attributes from E1, then σ_c(E1 ⋈ E2) = σ_c(E1) ⋈ E2.
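Two of these rules can be checked on toy data. The sketch below is illustrative only: the relations E1 and E2, their attributes and the helper functions are invented for the example. It verifies rule 1 (a conjunctive selection equals a cascade of selections) and rule 6 (a selection that only uses attributes of one join input can be applied before the join).

```python
# Tiny relations as lists of dicts; names and values are made up for illustration.
E1 = [{"a": 1, "k": 10}, {"a": 5, "k": 20}, {"a": 7, "k": 10}]
E2 = [{"k": 10, "b": "x"}, {"k": 20, "b": "y"}]


def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def join(r, s, key):
    """Equi-join of r and s on a shared attribute."""
    return [{**t, **u} for t in r for u in s if t[key] == u[key]]


def c1(t):
    return t["a"] > 2


def c2(t):
    return t["k"] == 10


# Rule 1: a conjunctive selection equals a cascade of selections.
both = select(lambda t: c1(t) and c2(t), E1)
cascade = select(c1, select(c2, E1))

# Rule 6: c1 only uses attributes of E1, so it can be applied before the join.
select_after = select(c1, join(E1, E2, "k"))
select_before = join(select(c1, E1), E2, "k")

print(both == cascade, select_after == select_before)
```

The second form of rule 6 joins fewer tuples, which is why the heuristic optimizer prefers it.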
Just looking at the Syntax of the query may not give the whole picture - need to look at
the data as well.
Of these, Access cost is the most crucial in a centralized DBMS. The more work we can
do with data in cache or in memory, the better.
Access Routines are algorithms that are used to access and aggregate data in a database.
An RDBMS may have a collection of general purpose access routines that can be
combined to implement a query execution plan.
We are interested in access routines for selection, projection, join and set operations such
as union, intersection, set difference, cartesian product, etc.
As with heuristic optimization, there can be many different plans that lead to the same
result.
There are many possible ways to estimate cost, e.g., based on disk accesses, CPU
time, or communication overhead.
Disk access is the predominant cost (in terms of time); relatively easy to estimate;
therefore, number of block transfers from/to disk is typically used as measure.
Simplifying assumption: each block transfer has the same cost.
Cost of algorithm (e.g., for join or selection) depends on database buffer size; more
memory for DB buffer reduces disk accesses. Thus DB buffer size is a parameter for
estimating cost.
We refer to the cost estimate of algorithm S as cost(S). We do not consider cost of
writing output to disk.
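As an illustration of how buffer size enters a cost estimate, here is a minimal sketch that counts block transfers for a block nested-loop join. The formula is the usual textbook one and is an assumption here, not something taken from these notes; the function and parameter names are hypothetical. As stated above, the cost of writing the output is not counted.

```python
import math


def block_nested_loop_cost(b_outer: int, b_inner: int, buffer_blocks: int) -> int:
    """Estimated block transfers for a block nested-loop join.

    Assumes buffer_blocks - 2 blocks hold a chunk of the outer relation,
    one block buffers the inner relation and one buffers output.
    Output writes are not counted.
    """
    if buffer_blocks < 3:
        raise ValueError("need at least 3 buffer blocks")
    chunks = math.ceil(b_outer / (buffer_blocks - 2))
    # Read the outer relation once, and the inner relation once per outer chunk.
    return b_outer + chunks * b_inner


small_buffer = block_nested_loop_cost(100, 400, 3)    # 100 + 100 * 400
large_buffer = block_nested_loop_cost(100, 400, 102)  # 100 + 1 * 400
print(small_buffer, large_buffer)  # 40100 500
```

With enough buffer space to hold the whole outer relation, the inner relation is scanned only once, which is the sense in which more DB buffer memory reduces disk accesses.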
σ_{F1}(σ_{F2}(E)) = σ_{F1 ∧ F2}(E)
UNIT-2
Disadvantages of RDBMS
RDBMSs are not suitable for applications with complex data structures or new data types
for large, unstructured objects, such as CAD/CAM, Geographic information systems,
multimedia databases, imaging and graphics.
The RDBMSs typically do not allow users to extend the type system by adding new data
types.
They also only support first-normal-form relations in which the type of every column
must be atomic, i.e., no sets, lists, or tables are allowed inside a column.
Recursive queries are difficult to write.
MOTIVATING EXAMPLE
As a specific example of the need for object-relational systems, we focus on a new business data
processing problem that is both harder and (in our view) more entertaining than the dollars and
cents bookkeeping of previous decades. Today, companies in industries such as entertainment are
in the business of selling bits; their basic corporate assets are not tangible products, but rather
software artifacts such as video and audio.
We consider the fictional Dinky Entertainment Company, a large Hollywood conglomerate
whose main assets are a collection of cartoon characters, especially the cuddly and
internationally beloved Herbert the Worm. Dinky has a number of Herbert the Worm films, many
of which are being shown in theaters around the world at any given time. Dinky also makes a
good deal of money licensing Herbert's image, voice, and video footage for various purposes:
action figures, video games, product endorsements, and so on. Dinky's database is used to
manage the sales and leasing records for the various Herbert-related products, as well as the
video and audio data that make up Herbert's many films.
Traditional database systems, such as RDBMS, have been quite successful in developing the
database technology required for many traditional business database applications. However, they
have certain shortcomings when more complex database applications must be designed and
implemented, for example, databases for engineering design and manufacturing (CAD/CAM),
scientific experiments, telecommunications, geographic information systems, and multimedia.
These newer applications have requirements and characteristics that differ from those of
traditional business applications, such as more complex structures for objects, longer-duration
transactions, new data types for storing images or large textual items, and the need to define
nonstandard application-specific operations.
Object-oriented databases were proposed to meet the needs of these more complex applications.
The object-oriented approach offers the flexibility to handle some of these requirements without
being limited by the data types and query languages available in traditional database systems. A
key feature of object-oriented databases is the power they give the designer to specify both the
structure of complex objects and the operations that can be applied to these objects.
Object database systems combine the classical capabilities of relational database management
systems (RDBMS), with new functionalities assumed by the object-orientedness. The traditional
capabilities include:
It is unique.
It is system generated.
It is invisible to the user. That is, it cannot be modified by the user.
It is immutable. That is, once generated, it is never regenerated.
It is a long integer value.
Encapsulation
Object-oriented models enforce encapsulation and information hiding. This means, the state of
objects can be manipulated and read only by invoking operations that are specified within the
type definition and made visible through the public clause.
In an object-oriented database system encapsulation is achieved if only the operations are
visible to the programmer and both the data and the implementation are hidden.
Support for types or classes
Type: in an object-oriented system, summarizes the common features of a set of objects
with the same characteristics. In programming languages types can be used at
compilation time to check the correctness of programs.
Class: The concept is similar to type but associated with run-time execution. The term
class refers to a collection of all objects with the same internal structure (attributes) and
methods. These objects are called instances of the class.
Both of these two features can be used to group similar objects together, but it is normal
for a system to support either classes or types and not both.
Class or type hierarchies
Any subclass or subtype will inherit attributes and methods from its superclass or supertype.
Overriding, Overloading and Late Binding
Overloading: A class modifies an existing method, by using the same name, but with a
different list, or type, of parameters.
Overriding: The implementation of the operation will depend on the type of the object it is
applied to.
Late binding: The implementation code cannot be referenced until run-time.
Computational Completeness
SQL does not have the full power of a conventional programming language. Languages such as
Pascal or C are said to be computationally complete because they can exploit the full
capabilities of a computer. SQL is only relationally complete, that is, it has the full power of
relational algebra. Whilst any SQL code could be rewritten as a C++ program, not all C++
programs could be rewritten in SQL.
Mandatory features of database systems
A database is a collection of data that is organized so that its contents can easily be accessed,
managed, and updated. Thus, a database system contains the five following features:
Persistence
As in a conventional database, data must remain after the process that created it has
terminated. For this purpose data has to be stored permanently on secondary storage.
Secondary Storage Management
Traditional databases employ techniques, which manage secondary storage in order to improve
the performance of the system. These are usually invisible to the user of the system.
Concurrency
The system should provide a concurrency mechanism, which is similar to the concurrency
mechanisms in conventional databases.
Recovery
The system should provide a recovery mechanism similar to recovery mechanisms in
conventional databases.
Ad hoc query facility
The database should provide a high-level, efficient, application independent query facility.
This need not necessarily be a query language but could instead be some type of graphical
interface.
A structured data type can be used as the type for a column in a regular table, the
type for an entire table (or view), or as an attribute of another structured type.
When used as the type for a table, the table is known as a typed table.
Structured data types exhibit a behavior known as inheritance. A structured type
can have subtypes, other structured types that reuse all of its attributes and contain
their own specific attributes. The type from which a subtype inherits attributes is
known as its supertype.
For Example:
We have to create a table employee with the following structure:
Name (FName, LName)
Age
Salary
Address (street, city, province, Postal_code)
For instance, the Employee and Department objects can be connected by a link
worksFor. In the data structure links are implemented as logical pointers (bidirectional or uni-directional).
Encapsulation and information hiding. The internal properties of an object
are subdivided into two parts: public (visible from the outside) and private
(invisible from the outside). The user of an object can refer to public properties
only.
Classes, types, interfaces. Each object is an instance of one or more classes.
The class is understood as a blueprint for objects; i.e. objects are instantiated
according to information presented in the class and the class contains the
properties that are common for some collection of objects (object invariants).
Each object is assigned a type. Objects are accessible through their interfaces,
which specify all the information that is necessary for using objects.
Abstract data types (ADTs): a kind of a class, which assumes that any access
operations (called methods). The object performs the operation after receiving a
message with the name of operation to be performed (and parameters of this
operation).
Inheritance. Classes are organized in a hierarchy reflecting the hierarchy of real
world concepts. For instance, the class Person is a super class of the classes
Employee and Student. Properties of more abstract classes are inherited by more
specific classes. Multi-inheritance means that a specific class inherits from
several independent classes.
Polymorphism, late binding, overriding. The operation to be executed on an
object is chosen dynamically, after the object receives the message with the
operation name. The same message sent to different objects can invoke different
operations.
Persistence. Database objects are persistent, i.e., they live as long as necessary.
Object Model
Object Specification Languages
Object Definition Language (ODL) for schema definition
Object Interchange Format (OIF) to exchange objects between databases
Object Query Language
declarative language to query and update database objects
Language Bindings (C++, Java, Smalltalk)
Object manipulation language
Mechanisms to invoke OQL from language
Procedures for operation on databases and transactions
Structured objects can also be large, but unlike ADT objects they often vary in size during the
lifetime of a database. For example, consider the stars attribute of the films table. As the years
pass, some of the bit actors in an old movie may become famous. When a bit actor becomes
famous, we might want to advertise his or her presence in the earlier films. This involves an
insertion into the stars attribute of an individual tuple in films. Because these bulk attributes can
grow arbitrarily, flexible disk layout mechanisms are required. An additional complication arises
with array types. Traditionally, array elements are stored sequentially on disk in a row-by-row
fashion, for example
A11,...,A1n, A21,...,A2n, ..., Am1,...,Amn
However, queries may often request sub arrays that are not stored contiguously on disk (e.g.,
A11,A21,...,Am1). Such requests can result in a very high I/O cost for retrieving the sub array. In
order to reduce the number of I/Os required in general, arrays are often broken into contiguous
chunks, which are then stored in some order on disk. Although each chunk is some contiguous
region of the array, chunks need not be row-by-row or column-by-column. For example, a chunk
of size 4 might be A11,A12,A21,A22, which is a square region if we think of the array as being
arranged row-by-row in two dimensions.
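The benefit of chunking can be seen by counting how many chunks a column request touches. The sketch below is an illustration under simple assumptions: a 4 x 4 array, chunks of 4 elements, 0-based indices (so A11 is element (0, 0)), and a column count divisible by the tile width. The function names are invented for the example.

```python
def row_major_chunk(i: int, j: int, n_cols: int, chunk_size: int) -> int:
    """Chunk holding element (i, j) when the array is laid out row by row."""
    return (i * n_cols + j) // chunk_size


def tile_chunk(i: int, j: int, n_cols: int, tile: int) -> int:
    """Chunk holding element (i, j) when the array is cut into tile x tile squares."""
    tiles_per_row = n_cols // tile
    return (i // tile) * tiles_per_row + (j // tile)


M, N = 4, 4
first_column = [(i, 0) for i in range(M)]  # A11, A21, ..., Am1

row_major_chunks = {row_major_chunk(i, j, N, 4) for i, j in first_column}
tiled_chunks = {tile_chunk(i, j, N, 2) for i, j in first_column}
print(len(row_major_chunks), len(tiled_chunks))  # 4 2
```

Reading the first column touches every chunk in the row-by-row layout but only half as many when the same 4-element chunks are 2 x 2 squares.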
Query Processing
ADTs and structured types call for new functionality in processing queries in ORDBMSs. They
also change a number of assumptions that affect the efficiency of queries. In this section we look
at two functionality issues (user-defined aggregates and security) and two efficiency issues
(method caching and pointer swizzling).
Since users are allowed to define new methods for their ADTs, it is not unreasonable to expect
them to want to define new aggregation functions for their ADTs as well. For example, the usual
SQL aggregates COUNT, SUM, MIN, MAX, AVG are not particularly appropriate for the
Image type schema.
Most ORDBMSs allow users to register new aggregation functions with the system. To register
an aggregation function, a user must implement three methods, which we will call initialize,
iterate, and terminate. The initialize method initializes the internal state for the aggregation. The
iterate method updates that state for every tuple seen, while the terminate method computes the
aggregation result based on the final state and then cleans up. As an example, consider an
aggregation function to compute the second-highest value in a field. The initialize call would
allocate storage for the top two values, the iterate call would compare the current tuple's value
with the top two and update the top two as necessary, and the terminate call would delete the
storage for the top two values, returning a copy of the second-highest value.
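The three-method protocol for the second-highest aggregate might look as follows in Python. This is a sketch, not the registration API of any particular ORDBMS: the class name and the driver function aggregate are hypothetical, and the sketch assumes that duplicates of the highest value count once and that NULLs (None) are skipped.

```python
class SecondHighest:
    """User-defined aggregate following the initialize / iterate / terminate protocol."""

    def initialize(self):
        # Allocate storage for the top two values seen so far.
        self.top = None
        self.second = None

    def iterate(self, value):
        # Update the state for one tuple's value; NULLs are ignored (assumption).
        if value is None:
            return
        if self.top is None or value > self.top:
            self.top, self.second = value, self.top
        elif value != self.top and (self.second is None or value > self.second):
            self.second = value

    def terminate(self):
        # Return a copy of the result and release the state.
        result = self.second
        self.top = self.second = None
        return result


def aggregate(agg, values):
    """Hypothetical driver: how a DBMS might call the three methods."""
    agg.initialize()
    for v in values:
        agg.iterate(v)
    return agg.terminate()


second = aggregate(SecondHighest(), [32.5, 7.5, 40.0, 20.0, None])
print(second)  # 32.5
```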
Method Security
ADTs give users the power to add code to the DBMS; this power can be abused. A buggy or
malicious ADT method can bring down the database server or even corrupt the database. The
DBMS must have mechanisms to prevent buggy or malicious user code from causing problems.
It may make sense to override these mechanisms for efficiency in production environments with
vendor-supplied methods. However, it is important for the mechanisms to exist, if only to
support debugging of ADT methods, otherwise method writers would have to write bug-free
code before registering their methods with the DBMS, which is not a very forgiving programming
environment. One mechanism to prevent problems is to have the user methods be interpreted
rather than compiled. The DBMS can check that the method is well behaved either by restricting
the power of the interpreted language or by ensuring that each step taken by a method is safe
before executing it. Typical interpreted languages for this purpose include Java and the
procedural portions of SQL:1999.
An alternative mechanism is to allow user methods to be compiled from a general-purpose
programming language such as C++, but to run those methods in a different address space than
the DBMS. In this case the DBMS sends explicit interprocess communications (IPCs) to the user
method, which sends IPCs back in return. This approach prevents bugs in the user methods (e.g.,
stray pointers) from corrupting the state of the DBMS or database and prevents malicious
methods from reading or modifying the DBMS state or database as well. Note that the user
writing the method need not know that the DBMS is running the method in a separate process:
The user code can be linked with a wrapper that turns method invocations and return values
into IPCs.
Method Caching
User-defined ADT methods can be very expensive to execute and can account for the bulk of the
time spent in processing a query. During query processing it may make sense to cache the results
of methods, in case they are invoked multiple times with the same argument. Within the scope of
a single query, one can avoid calling a method twice on duplicate values in a column by either
sorting the table on that column or using a hash-based scheme much like that used for
aggregation. An alternative is to maintain a cache of method inputs and matching outputs as a
table in the database. Then to find the value of a method on particular inputs, we essentially join
the input tuples with the cache table. These two approaches can also be combined.
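A hash-based cache of method results within a single query can be sketched as below. The method expensive_method is a hypothetical stand-in for a costly ADT method, and the dictionary plays the role of the cache of inputs and matching outputs; passing the same dictionary to later calls would approximate a cache that persists across queries.

```python
call_count = 0


def expensive_method(x: int) -> int:
    """Stand-in for a costly user-defined ADT method (hypothetical)."""
    global call_count
    call_count += 1
    return x * x


def apply_with_cache(column, cache=None):
    """Evaluate expensive_method on every value, computing each distinct input once."""
    cache = {} if cache is None else cache
    results = []
    for value in column:
        if value not in cache:
            cache[value] = expensive_method(value)
        results.append(cache[value])
    return results


results = apply_with_cache([3, 5, 3, 3, 5])
print(results, call_count)  # [9, 25, 9, 9, 25] 2
```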
Pointer Swizzling
In some applications, objects are retrieved into memory and accessed frequently through their
oids, so dereferencing must be implemented very efficiently. Some systems maintain a table of oids
of objects that are (currently) in memory. When an object O is brought into memory, they check
each oid contained in O and replace oids of in-memory objects by in-memory pointers to those
objects. This technique is called pointer swizzling and makes references to in-memory objects
very fast. The downside is that when an object is paged out, in-memory references to it must
somehow be invalidated and replaced with its oid.
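A toy version of swizzling and unswizzling is sketched below. It is only an illustration: oids are plain integers, Python object references stand in for in-memory pointers, and the names Obj, resident, load and evict are invented for the example.

```python
class Obj:
    def __init__(self, oid, refs):
        self.oid = oid
        self.refs = list(refs)  # each entry is an oid (int) or a swizzled Obj


resident = {}  # table of oids of objects currently in memory


def load(obj):
    """Bring obj into memory and swizzle the oids it contains."""
    resident[obj.oid] = obj
    # Replace oids of in-memory objects by direct references to them.
    obj.refs = [resident.get(r, r) if isinstance(r, int) else r for r in obj.refs]


def evict(obj):
    """Page obj out: in-memory references to it are replaced by its oid again."""
    del resident[obj.oid]
    for other in resident.values():
        other.refs = [r.oid if r is obj else r for r in other.refs]


a = Obj(1, [])
b = Obj(2, [1, 3])   # refers to object 1 and to object 3 (never loaded)
load(a)
load(b)
swizzled = b.refs[0] is a     # direct reference, no oid lookup needed
untouched = b.refs[1] == 3    # object 3 is not in memory, so the oid stays
evict(a)
after_evict = list(b.refs)    # reference to a is an oid again
```

The scan over all resident objects in evict shows the downside mentioned above: invalidating references is the expensive part of the scheme.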
Query Optimization
New indexes and query processing techniques widen the choices available to a query optimizer.
In order to handle the new query processing functionality, an optimizer must know about the new
functionality and use it appropriately. In this section we discuss two issues in exposing
information to the optimizer (new indexes and ADT method estimation) and an issue in query
planning that was ignored in relational systems (expensive selection optimization).
appears in a multi-table query, it may even make sense to postpone the selection until after performing joins. Note that this approach is the opposite of the heuristic for pushing selections. The
details of optimally placing expensive selections among joins are somewhat complicated, adding
to the complexity of optimization in ORDBMSs.
OODBMS and ORDBMS compared (flattened table; only fragments of the cells survive):
Aimed at designing management and finance systems, e.g., hotel management, shop management, etc.
ORDBMSs support an extended form of SQL, ...
An OODBMS is aimed at applications where an object-centric viewpoint is appropriate; that is, typical
user sessions consist of retrieving a few objects and working on them for long periods, with related
objects (e.g., objects referenced ...
RDBMS: ... server, ...
Examples of OODBMS: ObjectStore, Versant, GemStone, etc.
Examples of ORDBMS: Postgres, SQL 92
UNIT-3
Parallel and Distributed Databases
A parallel database system is one that seeks to improve performance through parallel
implementation of various operations such as loading data, building indexes, and evaluating
queries.
2. Shared disk (all processors share common disks and have private memories): each
CPU has a private memory and direct access to all disks through an interconnection
network.
Advantages: Each processor has its own local memory, so the memory bus is not a
bottleneck.
This architecture provides a higher degree of fault tolerance. (If a processor fails, the other
processors can take over its tasks.)
Disadvantage: The interconnection to the disk subsystem is now a bottleneck.
3. Shared nothing (each node of the machine consists of a processor, memory and one or
more disks): each CPU has local main memory and disk space, but no two CPUs
can access the same storage area; all communication between CPUs is through a
network connection.
Advantages: Instead of all I/O going through a single interconnection network, only
queries to non-local disks and result relations pass through the network.
These architectures are more scalable and can easily support a large number of
processors.
Transmission capacity increases as more nodes are added.
Disadvantage: The costs of communication and of non-local disk access are higher than in
the other architectures because transmitting data involves software interaction at both ends.
Each individual operator can also be executed in parallel by partitioning the input data
and then working on each partition in parallel and then combining the result of each
partition. This approach is called Data Partitioned parallel Evaluation.
Data Partitioning: Here large datasets are partitioned horizontally across several disks; this
enables us to exploit the I/O bandwidth of the disks by reading and writing them in parallel.
This can be done in the following ways:
a. Round Robin Partitioning
b. Hash Partitioning
c. Range Partitioning
a. Round Robin Partitioning: If there are n processors, the ith tuple is assigned to
processor i mod n.
b. Hash Partitioning: A hash function is applied to (selected fields of) a tuple to determine
its processor.
Hash partitioning has the additional virtue that it keeps data evenly distributed even if the
data grows and shrinks over time.
c. Range Partitioning: Tuples are sorted (conceptually), and n ranges are chosen for the
sort key values so that each range contains roughly the same number of tuples; tuples in
range i are assigned to processor i.
Range partitioning can lead to data skew, that is, partitions with widely varying numbers of
tuples. Skew causes processors dealing with large partitions to
become performance bottlenecks.
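The three schemes can be sketched as small Python functions. This is a minimal illustration with made-up data (a list of ages); tuples are numbered from 0, processors are list positions, and Python's built-in hash stands in for the hash function, so the spread it produces on this tiny input says nothing about a real hash function.

```python
import bisect


def round_robin(tuples, n):
    """The i-th tuple (counting from 0) goes to processor i mod n."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts


def hash_partition(tuples, n, key):
    """A hash of the partitioning field picks the processor."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts


def range_partition(tuples, upper_bounds, key):
    """Range i holds keys up to and including upper_bounds[i]; the last range is open-ended."""
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for t in tuples:
        parts[bisect.bisect_left(upper_bounds, key(t))].append(t)
    return parts


ages = [18, 25, 31, 40, 52, 67]
rr = round_robin(ages, 3)
hp = hash_partition(ages, 3, key=lambda a: a)
rp = range_partition(ages, [30, 50], key=lambda a: a)
skewed = range_partition(ages, [60, 65], key=lambda a: a)
```

With well-chosen bounds (30 and 50) each range gets two tuples; with badly chosen bounds (60 and 65) one processor receives five of the six tuples, which is the skew described above.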
For N processors, each processor gets the tuples that lie in the range assigned to it. For example,
processor 1 might hold all tuples in the range 10 to 20, and so on.
Each processor then has a sorted version of its tuples, and the sorted relation is obtained by
visiting the processors in range order and collecting their tuples.
The problem with range partitioning is data skew which limits the scalability of the
parallel sort. One good approach to range partitioning is to obtain a sample of the entire
relation by taking samples at each processor that initially contains part of the relation. The
(relatively small) sample is sorted and used to identify ranges with equal numbers of tuples.
This set of range values, called a splitting vector, is then distributed to all processors and
used to range partition the entire relation.
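The sampling approach can be sketched as follows. This is an illustration under assumed parameters (three processors, 200 random integer keys each, 20 samples per processor); the function names are hypothetical, and a real system would ship tuples over the network rather than append to local lists.

```python
import bisect
import random


def splitting_vector(sample, n):
    """Pick n - 1 split points that cut the sorted sample into n equal-sized ranges."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // n] for i in range(1, n)]


def range_partition(values, vector):
    """Partition i receives the values that fall between split points i - 1 and i."""
    parts = [[] for _ in range(len(vector) + 1)]
    for v in values:
        parts[bisect.bisect_right(vector, v)].append(v)
    return parts


rng = random.Random(0)
# Three processors, each initially holding an arbitrary part of the relation.
fragments = [[rng.randrange(1000) for _ in range(200)] for _ in range(3)]

# Each processor contributes a small local sample.
sample = [v for frag in fragments for v in rng.sample(frag, 20)]
vector = splitting_vector(sample, 3)

# Every processor range-partitions its fragment with the same vector;
# processor i then sorts the tuples it receives for range i.
received = [[], [], []]
for frag in fragments:
    for i, part in enumerate(range_partition(frag, vector)):
        received[i].extend(part)
sorted_relation = [v for part in received for v in sorted(part)]
```

Because the ranges are disjoint and ordered, concatenating the locally sorted partitions in processor order yields the fully sorted relation.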
Joins:
Here we consider how the join operation can be parallelized.
Consider two relations A and B to be joined on the age attribute. A and B are initially
distributed across several disks in a way that is not useful for the join operation.
So we decompose the join into a collection of k smaller joins by partitioning
both A and B into k logical partitions.
If the same partitioning function is used for both A and B, then the union of the k smaller joins
is the join of A and B.
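The claim that the union of the k smaller joins equals the full join can be checked on toy data. In the sketch below the relations, their contents and the partitioning function (age mod k) are made up for illustration.

```python
def join_on_age(a, b):
    """Nested-loop equi-join on the age field (second component of each tuple)."""
    return [(x, y) for x in a for y in b if x[1] == y[1]]


def partition(rel, k):
    """Assign each tuple to partition age mod k."""
    parts = [[] for _ in range(k)]
    for t in rel:
        parts[t[1] % k].append(t)
    return parts


A = [("a1", 20), ("a2", 31), ("a3", 45), ("a4", 31)]
B = [("b1", 31), ("b2", 20), ("b3", 52), ("b4", 45)]
K = 3

full_join = join_on_age(A, B)

# Partition both relations with the same function, join partition i of A
# with partition i of B only, and take the union of the k results.
partitioned_join = []
for part_a, part_b in zip(partition(A, K), partition(B, K)):
    partitioned_join.extend(join_on_age(part_a, part_b))
```

Tuples with equal age always land in the same partition number, so no matching pair is lost, and each of the k small joins can run on a different processor.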
DISTRIBUTED DATABASES
The idea of a distributed database is that the data should be physically stored at different
locations but its distribution and access should be transparent to the user.
Introduction to DBMS:
A Distributed Database should exhibit the following properties:
1) Distributed Data Independence: - The user should be able to access the database
without having the need to know the location of the data.
2) Distributed Transaction Atomicity: - A transaction that accesses and updates data at several
sites should remain atomic: either all of its actions at every site take effect or none do.
Types of Distributed Databases are:
a) Homogeneous Distributed Database is where the data stored across multiple sites is
managed by the same DBMS software at all the sites.
We need just one database server that is capable of managing queries and
transactions spanning multiple servers; the remaining servers only need to handle local
queries and transactions.
This can be done by using a global name server that can assign globally unique
names.
This can be implemented by using the following two fields:
1. Local name field: the name assigned locally by the site where the relation is created. Two
objects at different sites can have the same local name.
2. Birth site field: indicates the site at which the relation is created and where information
about its fragments and replicas is maintained.
Catalog Structure:
A centralized system catalog is used to maintain the information about all the
transactions in the distributed database but is vulnerable to the failure of the site containing
the catalog.
This could be avoided by maintaining a copy of the global system catalog but it involves
broadcast of every change done to a local catalog to all its replicas.
Another alternative is to maintain a local catalog at every site which keeps track of all
the replicas of the relation.
Distributed Data Independence:
It means that the user should be able to query the database without needing to specify
the location of the fragments or replicas of a relation; locating them is the job of the DBMS.
Users can be enabled to access relations without considering how the relations are
distributed as follows:
The local name of a relation in the system catalog is a combination of a user name and a
user-defined relation name.
When a query is fired the DBMS adds the user name to the relation name to get a local
name, then adds the user's site-id as the (default) birth site to obtain a global relation name.
By looking up the global relation name in the local catalog if it is cached there or in the
catalog at the birth site the DBMS can locate replicas of the relation.
Distributed query processing:
In a distributed system, several factors complicate query processing.
One of the factors is cost of transferring the data over network.
This data includes the intermediate files that are transferred to other sites for further
processing or the final result files that may have to be transferred to the site where the
query result is needed.
Although these costs may not be very high if the sites are connected via a high-speed local network,
they can become quite significant in other types of networks.
Hence, DDBMS query optimization algorithms consider the goal of reducing the
amount of data transfer as an optimization criterion in choosing a distributed query
execution strategy.
Consider an EMPLOYEE relation and a DEPARTMENT relation.
EMPLOYEE:
10,000 records
Each record is 100 bytes long
Fname field is 15 bytes long
SSN field is 9 bytes long
Lname field is 15 bytes long
Dnum field is 4 bytes long
The size of the EMPLOYEE relation is 100 * 10,000 = 10^6 bytes
DEPARTMENT:
100 records
Each record is 35 bytes long
Dnumber field is 4 bytes long
Dname field is 10 bytes long
MGRSSN field is 9 bytes long
The size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes
Now consider the following query:
For each employee, retrieve the employee name and the name of the department for which
the employee works.
Using relational algebra this query can be expressed as
π_{FNAME, LNAME, DNAME}(EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT)
If we assume that every employee is related to a department then the result of this
query will include 10,000 records.
Now suppose that each record in the query result is 40 bytes long and that the query is
submitted at a distinct site, site 3, which is the result site.
Then there are 3 strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to site 3, the
result site, and perform the join there. In this case a total of 1,000,000 + 3,500 =
1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2 (the site holding the DEPARTMENT relation), perform
the join there, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so
400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1 (the site holding the EMPLOYEE relation), perform
the join there, and send the result to site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
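The arithmetic behind the three strategies can be reproduced directly. The sketch below uses only the sizes given above; the dictionary keys are descriptive labels made up for the example.

```python
EMPLOYEE_BYTES = 10_000 * 100    # 10,000 records of 100 bytes
DEPARTMENT_BYTES = 100 * 35      # 100 records of 35 bytes
RESULT_BYTES = 10_000 * 40       # 10,000 result records of 40 bytes

# Bytes shipped over the network under each strategy.
strategies = {
    "1: ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "2: ship EMPLOYEE to site 2, result to site 3": EMPLOYEE_BYTES + RESULT_BYTES,
    "3: ship DEPARTMENT to site 1, result to site 3": DEPARTMENT_BYTES + RESULT_BYTES,
}

best = min(strategies, key=strategies.get)
for name, cost in strategies.items():
    print(f"{name}: {cost:,} bytes")
print("cheapest:", best)
```

If minimizing data transfer is the only criterion, strategy 3 is chosen, since it ships the small DEPARTMENT relation instead of EMPLOYEE.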
Nonjoin Queries in a Distributed DBMS:
Consider the following two relations:
Sailors (sid: integer, sname:string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Now consider the following query:
SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7
Now suppose that the Sailors relation is horizontally fragmented, with all the tuples having a
rating less than 5 stored at Shanghai and all the tuples having a rating greater than 5 stored at Tokyo.
The DBMS answers this query by evaluating it at both sites and then taking the union of the
answers.
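A minimal sketch of this evaluation follows. The sample tuples and the function names are made up for illustration, and each Python list stands in for a fragment stored at one site. Because the SQL query has no DISTINCT, the answers are combined as a bag (duplicates are kept).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sailor:
    sid: int
    sname: str
    rating: int
    age: float


# Hypothetical fragments: rating < 5 at Shanghai, rating > 5 at Tokyo.
shanghai = [Sailor(1, "Ann", 2, 30.0), Sailor(2, "Bo", 4, 41.5)]
tokyo = [Sailor(3, "Cy", 6, 25.0), Sailor(4, "Di", 9, 52.0)]


def local_query(fragment):
    """Evaluate SELECT S.age ... WHERE rating > 3 AND rating < 7 on one fragment."""
    return [s.age for s in fragment if 3 < s.rating < 7]


def distributed_query(fragments):
    """Run the query at every site and combine the partial answers."""
    ages = []
    for fragment in fragments:
        ages.extend(local_query(fragment))
    return ages


result = distributed_query([shanghai, tokyo])
print(result)
```

Both sites contribute here because the range 3 < rating < 7 overlaps both fragments. A query restricted to ratings above 5 would only need to be sent to Tokyo.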
Joins in a Distributed DBMS:
Joins of relations stored at different sites can be very expensive, so we now consider the
evaluation options available in a distributed environment.
Suppose that the Sailors relation is stored at London and the Reserves relation is stored at
Paris. We consider the following strategies for computing the join of Sailors and
Reserves.
In the next example the time taken to read one page from disk (or to write one page to
disk) is denoted as td and the time taken to ship one page (from any site to another site) as
ts.
Distributed Recovery
When a transaction commits, all its actions across all the sites at which it executes
must persist.
To detect such deadlocks, a distributed deadlock detection algorithm must be used and we
have three types of algorithms:
1. Centralized Algorithm:
It consists of periodically sending all local waits-for graphs to a single site that is
responsible for global deadlock detection.
At this site, the global waits-for graph is generated by combining all the local graphs: the
set of nodes is the union of the nodes in the local graphs, and there is an edge
from one node to another if there is such an edge in any of the local graphs.
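The centralized scheme can be sketched as follows, assuming each local waits-for graph is represented as a dictionary mapping a transaction to the set of transactions it waits for. The representation and the function names are assumptions made for this illustration.

```python
def merge_waits_for(local_graphs):
    """Union of nodes and edges of all local waits-for graphs."""
    global_graph = {}
    for graph in local_graphs:
        for txn, waits_on in graph.items():
            global_graph.setdefault(txn, set()).update(waits_on)
            for other in waits_on:
                global_graph.setdefault(other, set())
    return global_graph


def has_cycle(graph):
    """Depth-first search; a cycle in the waits-for graph means deadlock."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:       # back edge to a node on the stack
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


# Hypothetical local graphs: neither site sees a cycle on its own.
site_a = {"T1": {"T2"}}   # at site A, T1 waits for T2
site_b = {"T2": {"T1"}}   # at site B, T2 waits for T1

global_graph = merge_waits_for([site_a, site_b])
print(has_cycle(merge_waits_for([site_a])), has_cycle(global_graph))
```

The example shows why local detection is not enough: each local graph is acyclic, but the combined graph contains the cycle T1 → T2 → T1.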
2. Hierarchical Algorithm:
This algorithm groups the sites into a hierarchy. For example, the sites might be grouped by state,
then by country, and finally into a single group that contains all sites.
Every node in this hierarchy constructs a waits-for graph that reveals deadlocks involving
only sites contained in (the sub tree rooted at) this node.
Thus, all sites periodically (e.g., every 10 seconds) send their local waits-for graph to the
site constructing the waits-for graph for their country.
The sites constructing waits-for graphs at the country level periodically (e.g., every 10
minutes) send the country waits-for graph to the site constructing the global waits-for graph.
3. Simple Algorithm:
UNIT IV
INTRODUCTION TO DATABASE SECURITY
There are three main objectives to consider while designing a secure database application:
1. Secrecy: Information should not be disclosed to unauthorized users. For example, a
student should not be allowed to examine other students' grades.
2. Integrity: Only authorized users should be allowed to modify data. For example, students
may be allowed to see their grades, yet not allowed (obviously!) to modify them.
3. Availability: Authorized users should not be denied access. For example, an instructor
who wishes to change a grade should be allowed to do so.
A DBMS typically includes a database security and authorization subsystem that is
responsible for ensuring the security of portions of a database against unauthorized access.
It is now customary to refer to two types of database security mechanisms:
Discretionary security mechanisms: These are used to grant privileges to users, including the
capability to access specific data files, records, or fields in a specified mode (such as read,
insert, delete, or update).
Mandatory security mechanisms: These are used to enforce multilevel security by classifying
the data and users into various security classes (or levels) and then implementing the
appropriate security policy of the organization. For example, a typical policy is to permit
users at a certain classification level to see only data items classified at the user's own level.
An extension of this is role-based security, which enforces policies and privileges based on
the concept of roles.
ACCESS CONTROL
A DBMS should provide mechanisms to control access to data. A DBMS offers two main
approaches to access control.
Discretionary access control
Mandatory access control
Discretionary access control: It is based on the concept of access rights, or privileges,
and mechanisms for giving users such privileges. A privilege allows a user to access some data
object in a certain manner (e.g., to read or to modify). A user who creates a database object
such as a table or a view automatically gets all applicable privileges on that object. SQL-92
supports discretionary access control through the GRANT and REVOKE commands.
The GRANT command gives users privileges on base tables and views. The syntax of this command
is as follows:
GRANT privileges ON object TO users [WITH GRANT OPTION]
Here object is either a base table or a view.
Several privileges can be specified including:
SELECT: The right to access (read) all columns of the table specified as object, including
columns added later through ALTER TABLE commands.
INSERT(column-name): The right to insert rows with (non-null or non default) values in the
named column of the table named as object. The privileges UPDATE(column-name) and
UPDATE are similar to INSERT.
DELETE: The right to delete rows from the table named as object.
REFERENCES(column-name): The right to define foreign keys (in other tables) that refer
to the specified column of the table object. REFERENCES without a column name specified
denotes this right with respect to all columns.
For Example:
Suppose that user joe has created the tables BOATS, RESERVES, and SAILORS. Some
examples of GRANT command that joe can now execute are:
GRANT INSERT, DELETE ON RESERVES TO Yuppy WITH GRANT OPTION
GRANT SELECT ON RESERVES TO Michael
GRANT SELECT ON SAILORS TO Michael WITH GRANT OPTION
GRANT UPDATE (rating) ON SAILORS TO Leah
GRANT REFERENCES (bid) ON BOATS TO Bill
Adding WITH GRANT OPTION at the end of the GRANT command allows the user who has been
granted the privilege to pass that privilege on to other users.
In the above examples, Yuppy can insert or delete Reserves rows and can authorize
someone else to do the same. Michael can execute SELECT queries on Sailors and Reserves,
and he can pass this privilege to others for Sailors, but not for Reserves.
The REVOKE command is the complement of GRANT: it withdraws privileges.
The syntax of the REVOKE command is as follows:
REVOKE [GRANT OPTION FOR] privileges
ON object FROM users {RESTRICT|CASCADE}
The command can be used to revoke either a privilege or just the grant option on a
privilege (by using the optional GRANT OPTION FOR clause).
A user who has granted a privilege to another user may change his or her mind and want to
withdraw the granted privilege. The intuition behind exactly what effect a REVOKE command has is
complicated by the fact that a user may be granted the same privilege multiple times,
possibly by different users.
When a user executes a REVOKE command with the CASCADE keyword, the effect is to
withdraw the named privileges or grant option from all users who currently hold these
privileges solely through a GRANT command that was previously executed by the same user
who is now executing the REVOKE command. If these users received the privileges with the
grant option and passed them along, those recipients also lose their privileges as a
consequence of the REVOKE command, unless they received these privileges independently.
For Example:
GRANT SELECT ON Sailors TO Art WITH GRANT OPTION (executed by Joe)
GRANT SELECT ON Sailors TO Bob WITH GRANT OPTION (executed by Art)
REVOKE SELECT ON Sailors FROM Art CASCADE (executed by Joe)
Art loses the SELECT privilege on Sailors, of course. Then Bob, who received this privilege
from Art, and only Art, also loses this privilege.
If the RESTRICT keyword is specified in the REVOKE command, the command is rejected if
revoking the privileges just from the users specified in the command would result in other
privileges becoming abandoned.
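The effect of CASCADE can be illustrated with a small authorization graph. This is a simplified sketch, not how any particular DBMS implements REVOKE. It tracks a single privilege on a single object, and the class and method names are invented for the illustration. An edge (grantor, grantee, option) records one GRANT. After a revocation, a grant survives only if its grantor can still be reached from the owner through grants that carry the grant option.

```python
class AuthGraph:
    """Authorization graph for one privilege on one object (illustrative)."""

    def __init__(self, owner):
        self.owner = owner
        self.edges = set()   # (grantor, grantee, with_grant_option)

    def can_grant(self, user):
        return user == self.owner or any(
            grantee == user and option for _, grantee, option in self.edges)

    def holds(self, user):
        return user == self.owner or any(
            grantee == user for _, grantee, _ in self.edges)

    def grant(self, grantor, grantee, with_grant_option=False):
        if not self.can_grant(grantor):
            raise PermissionError(f"{grantor} cannot grant this privilege")
        self.edges.add((grantor, grantee, with_grant_option))

    def revoke_cascade(self, grantor, grantee):
        # Remove the grant(s) that this grantor gave to this grantee.
        self.edges = {e for e in self.edges if (e[0], e[1]) != (grantor, grantee)}
        # Users who can still grant: reachable from the owner via grant-option edges.
        grantors = {self.owner}
        changed = True
        while changed:
            changed = False
            for src, dst, option in self.edges:
                if src in grantors and option and dst not in grantors:
                    grantors.add(dst)
                    changed = True
        # Grants issued by anyone else are now abandoned and are dropped.
        self.edges = {e for e in self.edges if e[0] in grantors}


# The example from the text: Joe -> Art -> Bob, then Joe revokes from Art.
g = AuthGraph(owner="Joe")
g.grant("Joe", "Art", with_grant_option=True)
g.grant("Art", "Bob", with_grant_option=True)
g.revoke_cascade("Joe", "Art")
print(g.holds("Art"), g.holds("Bob"))

# Variant: Bob also received the privilege directly from Joe, so he keeps it.
g2 = AuthGraph(owner="Joe")
g2.grant("Joe", "Art", with_grant_option=True)
g2.grant("Joe", "Bob", with_grant_option=True)
g2.grant("Art", "Bob", with_grant_option=True)
g2.revoke_cascade("Joe", "Art")
print(g2.holds("Art"), g2.holds("Bob"))
```

In the first scenario both Art and Bob lose the privilege. In the second, Bob retains it because he also holds it independently of Art.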
Mandatory access control: It is based on system-wide policies that cannot be changed by
individual users. In this approach each database object is assigned a security class, each
user is assigned clearance for a security class, and rules are imposed on reading and writing
of database objects by users. The DBMS determines whether a given user can read or write
a given object based on certain rules that involve the security level of the object and the
clearance of the user.
The popular model for mandatory access control, called the Bell-LaPadula model, is
described in terms of objects (e.g., tables, views, rows, columns), subjects (e.g., users,
programs), security classes, and clearances. Each database object is assigned a security
class, and each subject is assigned clearance for a security class; we will denote the class of
an object or subject A as class(A). The security classes in a system are organized according
to a partial order, with a most secure class and a least secure class. For simplicity, we
will assume that there are four classes: top secret (TS), secret (S), confidential (C), and
unclassified (U). In this system, TS > S > C > U, where A > B means that class A data is
more sensitive than class B data.
The Bell-LaPadula model imposes two restrictions on all reads and writes of database
objects:
1. Simple Security Property: Subject S is allowed to read object O only if class(S) ≥
class(O). For example, a user with TS clearance can read a table with C classification, but a
user with C clearance is not allowed to read a table with TS classification.
2. *-Property: Subject S is allowed to write object O only if class(S) ≤ class(O). For example,
a user with S clearance can only write objects with S or TS classification.
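The two rules can be written as a pair of comparisons. The sketch below assumes the four totally ordered classes used in this section and encodes them as integers. The names LEVELS, can_read and can_write are illustrative.

```python
# Security classes from the text, ordered TS > S > C > U.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}


def can_read(subject_class, object_class):
    """Simple Security Property: no reading up."""
    return LEVELS[subject_class] >= LEVELS[object_class]


def can_write(subject_class, object_class):
    """*-Property: no writing down."""
    return LEVELS[subject_class] <= LEVELS[object_class]


print(can_read("TS", "C"), can_read("C", "TS"))    # TS may read C; C may not read TS
print(can_write("S", "TS"), can_write("S", "C"))   # S may write TS; S may not write C
```

Together the two rules ensure that information can only flow from lower classes to higher ones, never the other way round.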
To apply mandatory access control policies in a relational DBMS, a security class must be
assigned to each database object. The objects can be at the granularity of tables, rows, or
even individual column values. Let us assume that each row is assigned a security class.
This situation leads to the concept of a multilevel table, which is a table with the surprising
property that users with different security clearances will see a different collection of rows
when they access the same table.
Consider the instance of the Boats table shown in Figure below. Users with S and TS
clearance will get both rows in the answer when they ask to see all rows in Boats. A user
with C clearance will get only the second row, and a user with U clearance will get no rows.
bid    bname    color    Security class
101    Salsa    Red      S
102    Pinto    Brown    C
The Boats table is defined to have bid as the primary key. Suppose that a user with
clearance C wishes to enter the row <101, Picante, Scarlet, C>. We have a dilemma:
If the insertion is permitted, two distinct rows in the table will have key 101.
If the insertion is not permitted because the primary key constraint is violated, the user
trying to insert the new row, who has clearance C, can infer that there is a boat with
bid=101 whose security class is higher than C. This situation compromises the principle that
users should not be able to infer any information about objects that have a higher security
classification.
This dilemma is resolved by effectively treating the security classification as part of the key.
Thus, the insertion is allowed to continue, and the table instance is modified as shown in
Figure below.
bid    bname      color      Security class
101    Salsa      Red        S
101    Picante    Scarlet    C
102    Pinto      Brown      C
Users with clearance C see just the rows for Picante and Pinto (users with clearance U still
see no rows), but users with clearance S or TS see all three rows. The two rows with bid=101
can be interpreted in one of two ways:
only the row with the higher classification (Salsa, with classification S) actually exists, or
both exist and their presence is revealed to users according to their clearance level. The
choice of interpretation is up to application developers and users.
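A multilevel table can be sketched as a mapping keyed on (bid, security class), so that two rows with the same bid can coexist. The dictionary layout and the function name below are assumptions made for illustration. The rows are those of the modified Boats instance.

```python
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

# Key is (bid, security class): the classification acts as part of the key.
boats = {
    (101, "S"): ("Salsa", "Red"),
    (101, "C"): ("Picante", "Scarlet"),
    (102, "C"): ("Pinto", "Brown"),
}


def visible_rows(table, clearance):
    """Rows a user may read: those classified at or below the user's clearance."""
    return sorted(
        (bid, bname, color, cls)
        for (bid, cls), (bname, color) in table.items()
        if LEVELS[clearance] >= LEVELS[cls]
    )


for clearance in ("C", "S"):
    print(clearance, visible_rows(boats, clearance))
```

A user with clearance C sees only Picante and Pinto, and so cannot tell that a more highly classified boat with bid 101 exists.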
both processes have to agree to commit before the transaction can be committed. This
requirement can be exploited to pass information with an S classification to the process with
a C clearance: The transaction is repeatedly invoked, and the process with the C clearance
always agrees to commit, whereas the process with the S clearance agrees to commit if it
wants to transmit a 1 bit and does not agree if it wants to transmit a 0 bit.
In this manner, information with an S classification can be sent to a process with a C clearance
as a stream of bits. This covert channel is an indirect violation of the intent behind the
*-Property.
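The bit-by-bit leak described above can be simulated in a few lines. This is a toy model, not a real commit protocol: a transaction commits only if both processes vote to commit, and the low-clearance process simply watches the outcome. All names are illustrative.

```python
def run_transaction(high_vote, low_vote):
    """The transaction commits only if both processes agree to commit."""
    return high_vote and low_vote


def send_bits(secret_bits):
    """The S process encodes each bit in its vote; the C process reads the outcome."""
    observed = []
    for bit in secret_bits:
        # The C process always agrees; the S process agrees only to send a 1.
        committed = run_transaction(high_vote=(bit == 1), low_vote=True)
        observed.append(1 if committed else 0)
    return observed


secret = [1, 0, 1, 1, 0]
leaked = send_bits(secret)
print(leaked == secret)
```

No object is ever written at a lower level, so neither Bell-LaPadula rule is broken directly. The information travels through the commit/abort outcome alone.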
Encryption
A DBMS can use encryption to protect information in situations where its normal
security mechanisms are not adequate. For example, an intruder may steal
tapes containing some data or tap a communication line. By storing and transmitting data
in an encrypted form, the DBMS ensures that such stolen data is not intelligible to the
intruder.
The basic idea is to apply an encryption algorithm to the original data and an encryption key.
The output of the algorithm is the encrypted version of the data. There is also a decryption
algorithm, which takes the encrypted data and the encryption key as input and returns the
original data. A well-known example of this symmetric approach is the Data Encryption
Standard (DES). The main weakness of this approach is
that authorized users must be told the encryption key, and the mechanism for
communicating this information is vulnerable to clever intruders.
Another approach is called Public Key encryption. The encryption scheme proposed by
Rivest, Shamir, and Adleman, called RSA, is a well-known example of public-key encryption.
In this scheme each authorized user has a public encryption key, known to everyone, and a
private decryption key, chosen by the user and known only to him or her.
For example: Consider a user called Sam. Anyone can send Sam a secret message by
encrypting the message using Sam's publicly known encryption key. Only Sam can decrypt
this secret message because the decryption algorithm requires Sam's decryption key, known
only to Sam. Since users choose their own decryption keys, the weakness of DES is avoided.
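The public/private key idea can be illustrated with textbook RSA on deliberately tiny numbers. This is a sketch for intuition only: the primes, exponent and message below are toy values chosen for the example, there is no padding, and nothing here is secure. It relies on the three-argument pow with a negative exponent, available from Python 3.8.

```python
def make_keypair(p, q, e):
    """Build a toy RSA key pair from two primes p, q and a public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse of e (Python 3.8+)
    return (e, n), (d, n)        # (public key, private key)


def encrypt(message, public_key):
    e, n = public_key
    return pow(message, e, n)


def decrypt(cipher, private_key):
    d, n = private_key
    return pow(cipher, d, n)


# Sam publishes `public` and keeps `private` to himself.
public, private = make_keypair(p=61, q=53, e=17)
cipher = encrypt(65, public)           # anyone can do this with the public key
recovered = decrypt(cipher, private)   # only the private-key holder can do this
print(cipher != 65, recovered == 65)
```

The sender never needs the private key, so no secret key has to be communicated.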
UNIT V
What is Postgres?
Traditional relational database management systems (DBMSs) support a data model
consisting of a collection of named relations, containing attributes of a specific type. In
current commercial systems, possible types include floating point numbers, integers,
character strings, money, and dates. It is commonly recognized that this model is
inadequate for future data processing applications. The relational model successfully
replaced previous models in part because of its "Spartan simplicity". However, as
mentioned, this simplicity often makes the implementation of certain applications very
difficult. Postgres offers substantial additional power by incorporating the following four
additional basic concepts in such a way that users can easily extend the system:
classes
inheritance
types
functions
Other features provide additional power and flexibility:
constraints
triggers
rules
transaction integrity
These features put Postgres into the category of databases referred to as object-relational.
Postgres is a client/server application. As a user, you only need access to the client portions
of the installation.
POSTGRES ARCHITECTURE
Postgres uses a simple "process per-user" client/server model. A Postgres session consists of
the following cooperating UNIX processes (programs):
A supervisory daemon process (postmaster),
The user's frontend application (e.g., the psql program), and
One or more backend database servers (the postgres process itself).
A single postmaster manages a given collection of databases on a single host. Such a
collection of databases is called an installation or site. Frontend applications that wish to
access a given database within an installation make calls to the library. The library sends
user requests over the network to the postmaster (How a connection is established), which
in turn starts a new backend server process and connects the frontend process to the new
server. From that point on, the frontend process and the backend server communicate
without intervention by the postmaster. Hence, the postmaster is always running, waiting for
requests, whereas frontend and backend processes come and go.
Transactions in POSTGRES
Transactions are a fundamental concept of all database systems. The essential point of a
transaction is that it bundles multiple steps into a single, all-or-nothing operation. The
intermediate states between the steps are not visible to other concurrent transactions, and if some
failure occurs that prevents the transaction from completing, then none of the steps affect the
database at all.
For example, consider a bank database that contains balances for various customer accounts, as
well as total deposit balances for branches. Suppose that we want to record a payment of $100.00
from Alice's account to Bob's account. Simplifying outrageously, the SQL commands for this
might look like
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
The details of these commands are not important here; the important point is that there are
several separate updates involved to accomplish this rather simple operation. Our bank's officers
will want to be assured that either all these updates happen, or none of them happen. It would
certainly not do for a system failure to result in Bob receiving $100.00 that was not debited from
Alice. Nor would Alice long remain a happy customer if she was debited without Bob being
credited. We need a guarantee that if something goes wrong partway through the operation, none
of the steps executed so far will take effect. Grouping the updates into a transaction gives us this
guarantee. A transaction is said to be atomic: from the point of view of other transactions, it
either happens completely or not at all.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with
BEGIN and COMMIT commands. So our banking transaction would actually look like
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
-- etc etc
COMMIT;
If, partway through the transaction, we decide we do not want to commit (perhaps we just
noticed that Alice's balance went negative), we can issue the command ROLLBACK instead of
COMMIT, and all our updates so far will be canceled.
PostgreSQL actually treats every SQL statement as being executed within a transaction. If you
do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if
successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT
is sometimes called a transaction block.
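The all-or-nothing behaviour can be tried out without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption of this example rather than part of Postgres. The BEGIN, COMMIT and ROLLBACK statements play the same role as in the transaction block above. The accounts table is reduced to two columns, and the transfer function is illustrative.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction boundaries below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('Alice', 50.0), ('Bob', 0.0)")


def transfer(conn, src, dst, amount):
    """Move money inside one transaction; roll back if src would go negative."""
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
    if balance < 0:
        conn.execute("ROLLBACK")   # cancels both updates
        return False
    conn.execute("COMMIT")
    return True


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


first = transfer(conn, "Alice", "Bob", 100.0)   # would overdraw Alice
after_first = balances(conn)
second = transfer(conn, "Alice", "Bob", 30.0)   # succeeds
after_second = balances(conn)
print(first, after_first)
print(second, after_second)
```

After the failed transfer neither balance has changed, even though both UPDATE statements had already run inside the transaction.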
XML stands for eXtensible Markup Language. It is a markup language developed by
the W3C (World Wide Web Consortium).
Some of the areas where XML will be useful in the near-term include:
large Web site maintenance. XML would work behind the scene to simplify the creation of
HTML documents
exchange of information between organizations
off loading and reloading of databases
syndicated content, where content is being made available to different Web sites
electronic commerce applications where different organizations collaborate to serve a customer
scientific applications with new markup languages for mathematical and chemical formulas
electronic books with new markup languages to express rights and ownership
handheld devices and smart phones with new markup languages optimized for these
alternative devices
XML makes essentially two changes to HTML:
It predefines no tags.
It is stricter.
No Predefined Tags
Because there are no predefined tags in XML, you, the author, can create the tags that you need.
Example:
<price currency="usd">499.00</price>
<toc xlink:href="/newsletter">Pineapplesoft Link</toc>
Stricter
HTML has a very forgiving syntax. This is great for authors, who can be as lazy as they want, but
it also makes Web browsers more complex. According to some estimates, more than 50% of the
code in a browser handles errors or sloppiness on the author's part.
XML Example:
A List of Products in XML
<?xml version="1.0"?>
<products>
  <product id="p1">
    <name>XML Editor</name>
    <price>499.00</price>
  </product>
  <product id="p2">
    <name>DTD Editor</name>
    <price>199.00</price>
  </product>
  <product id="p3">
    <name>XML Book</name>
    <price>19.99</price>
  </product>
  <product id="p4">
    <name>XML Training</name>
    <price>699.00</price>
  </product>
</products>
In this context, XML is used to exchange information between organizations.
The XML Web is a large database that applications can tap into.
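To show how an application might consume such a product list, the sketch below parses it with Python's standard xml.etree.ElementTree module. The variable and function names are illustrative. The last lines also demonstrate the strictness discussed earlier: an unquoted attribute value makes the document ill-formed and the parser rejects it.

```python
import xml.etree.ElementTree as ET

PRODUCTS_XML = """<?xml version="1.0"?>
<products>
  <product id="p1"><name>XML Editor</name><price>499.00</price></product>
  <product id="p2"><name>DTD Editor</name><price>199.00</price></product>
  <product id="p3"><name>XML Book</name><price>19.99</price></product>
  <product id="p4"><name>XML Training</name><price>699.00</price></product>
</products>"""

root = ET.fromstring(PRODUCTS_XML)

# Map each product id to its (name, price) pair.
catalog = {
    product.get("id"): (product.findtext("name"), float(product.findtext("price")))
    for product in root.findall("product")
}
total = sum(price for _, price in catalog.values())


def is_well_formed(text):
    """True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(catalog["p3"])
print(is_well_formed('<product id="p1"/>'), is_well_formed("<product id=p1/>"))
```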
XML Schemas
The DTD is the original modeling language or schema for XML.
The syntax for DTDs is different from the syntax for XML documents.
The purpose of a DTD is to define the structure of an XML document. It defines the structure
with a list of legal elements:
Example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The DTD for the note document, written here as an internal DOCTYPE declaration:
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
XML Schema
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
XML NAMESPACES
XSL
XSL stands for EXtensible Stylesheet Language.
The World Wide Web Consortium (W3C) started to develop XSL because there was a need for
an XML-based Stylesheet Language.
What is XSLT?
XSLT is a language for transforming XML documents into XHTML documents or into
other XML documents.
XPath is a language for navigating through the elements and attributes of an XML
document. XSLT uses XPath to find information in an XML document.
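A few XPath-style expressions can be tried with Python's xml.etree.ElementTree, which implements only a limited subset of XPath. The sample document and variable names below are made up for the illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<products>"
    "<product id='p1'><name>XML Editor</name><price>499.00</price></product>"
    "<product id='p3'><name>XML Book</name><price>19.99</price></product>"
    "</products>"
)

# Child steps: every <name> inside a <product> directly under the root.
all_names = [n.text for n in doc.findall("./product/name")]

# Attribute predicate: the <name> of the product whose id attribute is 'p3'.
book = doc.find("./product[@id='p3']/name").text

# Child-element predicate: products that have a <price> child.
with_price = [p.get("id") for p in doc.findall("./product[price]")]

print(all_names, book, with_price)
```

A full XSLT processor evaluates the same kind of path expressions in its match and select attributes to decide which nodes to transform.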
What is XSL-FO?