17
Methodology – Physical Database Design for Relational Databases
Chapter Objectives
In this chapter you will learn:
- The purpose of physical database design.
- How to map the logical database design to a physical database design.
- How to design base relations for the target DBMS.
- How to design general constraints for the target DBMS.
- How to select appropriate file organizations based on analysis of transactions.
- When to use secondary indexes to improve performance.
- How to estimate the size of the database.
- How to design user views.
- How to design security mechanisms to satisfy user requirements.
In this chapter and the next we describe and illustrate by example a physical database
design methodology for relational databases.
The starting point for this chapter is the logical data model and the documentation
that describes the model created in the conceptual/logical database design methodology
described in Chapters 15 and 16. The methodology started by producing a conceptual
data model in Step 1 and then derived a set of relations to produce a logical data model
in Step 2. The derived relations were validated to ensure they were correctly structured
using the technique of normalization described in Chapters 13 and 14, and to ensure they
supported the transactions the users require.
In the third and final phase of the database design methodology, the designer must decide
how to translate the logical database design (that is, the entities, attributes, relationships,
and constraints) into a physical database design that can be implemented using the target
DBMS. As many parts of physical database design are highly dependent on the target
DBMS, there may be more than one way of implementing any given part of the database.
Consequently to do this work properly, the designer must be fully aware of the function-
ality of the target DBMS, and must understand the advantages and disadvantages of each
alternative approach for a particular implementation. For some systems the designer may
also need to select a suitable storage strategy that takes account of intended database usage.
Step 5 involves deciding how each user view should be implemented. Step 6 involves
designing the security measures necessary to protect the data from unauthorized access,
including the access controls that are required on the base relations.
Step 7 (described in Chapter 18) considers relaxing the normalization constraints imposed
on the logical data model to improve the overall performance of the system. This step
should be undertaken only if necessary, because of the inherent problems involved in intro-
ducing redundancy while still maintaining consistency. Step 8 (Chapter 18) is an ongoing
process of monitoring the operational system to identify and resolve any performance
problems resulting from the design, and to implement new or changing requirements.
Appendix G presents a summary of the methodology for those readers who are already
familiar with database design and simply require an overview of the main steps.
Objective To produce a relational database schema from the logical data model that
can be implemented in the target DBMS.
The first activity of physical database design involves the translation of the relations in the
logical data model into a form that can be implemented in the target relational DBMS. The
first part of this process entails collating the information gathered during logical database
design and documented in the data dictionary along with the information gathered during
the requirements collection and analysis stage and documented in the systems specifica-
tion. The second part of the process uses this information to produce the design of the base
relations. This process requires intimate knowledge of the functionality offered by the
target DBMS. For example, the designer will need to know:
- how to create base relations;
- whether the system supports the definition of primary keys, foreign keys, and alternate keys;
- whether the system supports the definition of required data (that is, whether the system allows attributes to be defined as NOT NULL);
- whether the system supports the definition of domains;
- whether the system supports relational integrity constraints;
- whether the system supports the definition of general constraints.
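For example, a minimal sketch of a base relation definition in ISO-standard SQL that exercises these facilities is shown below; the domain name, the default values, and the cut-down column list are illustrative rather than the full design produced later in this step:

CREATE DOMAIN PropertyNumber AS VARCHAR(5);        -- domain support varies between DBMSs

CREATE TABLE PropertyForRent (
    propertyNo  PropertyNumber NOT NULL,            -- required data
    rooms       SMALLINT       NOT NULL DEFAULT 4,  -- default value
    rent        DECIMAL(6,2)   NOT NULL DEFAULT 600,
    staffNo     VARCHAR(5),                         -- nulls allowed: no staff allocated yet
    branchNo    CHAR(4)        NOT NULL,
    PRIMARY KEY (propertyNo),                       -- entity integrity
    FOREIGN KEY (staffNo) REFERENCES Staff(staffNo)
        ON DELETE SET NULL,                         -- referential integrity actions
    FOREIGN KEY (branchNo) REFERENCES Branch(branchNo));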
Objective To decide how to represent the base relations identified in the logical data
model in the target DBMS.
To start the physical design process, we first collate and assimilate the information about
the relations produced during logical database design. The necessary information can be
obtained from the data dictionary and the definition of the relations described using the
Database Design Language (DBDL). For each relation identified in the logical data model,
we have a definition consisting of:
- the name of the relation;
- a list of simple attributes in brackets;
- the primary key and, where appropriate, alternate keys (AK) and foreign keys (FK);
- referential integrity constraints for any foreign keys identified.
From the data dictionary, we also have for each attribute:
- its domain, consisting of a data type, length, and any constraints on the domain;
- an optional default value for the attribute;
- whether the attribute can hold nulls;
- whether the attribute is derived and, if so, how it should be computed.
To represent the design of the base relations, we use an extended form of the DBDL to
define domains, default values, and null indicators. For example, for the PropertyForRent
relation of the DreamHome case study, we may produce the design shown in Figure 17.1.
Figure 17.1 DBDL for the PropertyForRent relation.
Objective To decide how to represent any derived data present in the logical data
model in the target DBMS.
Attributes whose value can be found by examining the values of other attributes are known
as derived or calculated attributes. For example, the following are all derived attributes:
- the number of staff who work in a particular branch;
- the total monthly salaries of all staff;
- the number of properties that a member of staff handles.
Often, derived attributes do not appear in the logical data model but are documented in the
data dictionary. If a derived attribute is displayed in the model, a ‘/’ is used to indicate that
it is derived (see Section 11.1.2). The first step is to examine the logical data model and
the data dictionary, and produce a list of all derived attributes. From a physical database
Figure 17.2 The PropertyForRent relation and a simplified Staff relation with the derived attribute noOfProperties.
design perspective, whether a derived attribute is stored in the database or calculated every
time it is needed is a tradeoff. The designer should calculate:
- the additional cost to store the derived data and keep it consistent with operational data from which it is derived;
- the cost to calculate it each time it is required.
The less expensive option is chosen subject to performance constraints. For the last ex-
ample cited above, we could store an additional attribute in the Staff relation representing
the number of properties that each member of staff currently manages. A simplified Staff
relation based on the sample instance of the DreamHome database shown in Figure 3.3
with the new derived attribute noOfProperties is shown in Figure 17.2.
The additional storage overhead for this new derived attribute would not be particularly
significant. The attribute would need to be updated every time a member of staff was
assigned to or deassigned from managing a property, or the property was removed from
the list of available properties. In each case, the noOfProperties attribute for the appropriate
member of staff would be incremented or decremented by 1. It would be necessary to
ensure that this change is made consistently to maintain the correct count, and thereby
ensure the integrity of the database. When a query accesses this attribute, the value would
be immediately available and would not have to be calculated. On the other hand, if the
attribute is not stored directly in the Staff relation it must be calculated each time it is
required. This involves a join of the Staff and PropertyForRent relations. Thus, if this type of
query is frequent or is considered to be critical for performance purposes, it may be more
appropriate to store the derived attribute rather than calculate it each time.
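If the attribute is not stored, a sketch of the query needed to derive it each time follows; the LEFT OUTER JOIN ensures that staff who currently manage no properties still appear with a count of zero:

SELECT s.staffNo, COUNT(p.propertyNo) AS noOfProperties
FROM Staff s LEFT OUTER JOIN PropertyForRent p ON s.staffNo = p.staffNo
GROUP BY s.staffNo;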
It may also be more appropriate to store derived attributes whenever the DBMS’s query
language cannot easily cope with the algorithm to calculate the derived attribute. For
example, SQL has a limited set of aggregate functions and cannot easily handle recursive
queries, as we discussed in Chapter 5.
Updates to relations may be constrained by integrity constraints governing the ‘real world’
transactions that are represented by the updates. In Step 3.1 we designed a number of
integrity constraints: required data, domain constraints, and entity and referential integrity.
In this step we have to consider the remaining general constraints. The design of such con-
straints is again dependent on the choice of DBMS; some systems provide more facilities
than others for defining general constraints. As in the previous step, if the system is com-
pliant with the SQL standard, some constraints may be easy to implement. For example,
DreamHome has a rule that prevents a member of staff from managing more than 100
properties at the same time. We could design this constraint into the SQL CREATE
TABLE statement for PropertyForRent using the following clause:
CONSTRAINT StaffNotHandlingTooMuch
CHECK (NOT EXISTS (SELECT staffNo
FROM PropertyForRent
GROUP BY staffNo
HAVING COUNT(*) > 100))
In Section 8.1.4 we demonstrated how to implement this constraint in Microsoft Office
Access using an event procedure in VBA (Visual Basic for Applications). Alternatively,
a trigger could be used to enforce some constraints as we illustrated in Section 8.2.7. In
some systems there will be no support for some or all of the general constraints and it will
be necessary to design the constraints into the application. For example, there are very few
relational DBMSs (if any) that would be able to handle a time constraint such as ‘at 17.30
on the last working day of each year, archive the records for all properties sold that year
and delete the associated records’.
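As a sketch of the trigger alternative mentioned above, an Oracle-style statement-level trigger could re-check the rule after each change to PropertyForRent; the trigger name and error number are illustrative, and the exact syntax depends on the target DBMS:

CREATE OR REPLACE TRIGGER StaffNotHandlingTooMuch
AFTER INSERT OR UPDATE OF staffNo ON PropertyForRent
DECLARE
    vViolations NUMBER;
BEGIN
    -- count staff members who would now manage more than 100 properties
    SELECT COUNT(*) INTO vViolations
    FROM (SELECT staffNo
          FROM PropertyForRent
          GROUP BY staffNo
          HAVING COUNT(*) > 100);
    IF vViolations > 0 THEN
        RAISE_APPLICATION_ERROR(-20000,
            'A member of staff cannot manage more than 100 properties');
    END IF;
END;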
Objective To determine the optimal file organizations to store the base relations and
the indexes that are required to achieve acceptable performance, that is,
the way in which relations and tuples will be held on secondary storage.
One of the main objectives of physical database design is to store and access data in
an efficient way (see Appendix C). While some storage structures are efficient for bulk
loading data into the database, they may be inefficient after that. Thus, we may have to
choose to use an efficient storage structure to set up the database and then choose another
for operational use.
Again, the types of file organization available are dependent on the target DBMS; some
systems provide more choice of storage structures than others. It is extremely important
that the physical database designer fully understands the storage structures that are avail-
able, and how the target system uses these structures. This may require the designer to
know how the system’s query optimizer functions. For example, there may be circum-
stances where the query optimizer would not use a secondary index, even if one were
available. Thus, adding a secondary index would not improve the performance of the
query, and the resultant overhead would be unjustified. We discuss query processing and
optimization in Chapter 21.
As with logical database design, physical database design must be guided by the nature
of the data and its intended use. In particular, the database designer must understand the
typical workload that the database must support. During the requirements collection and
analysis stage there may have been requirements specified about how fast certain transac-
tions must run or how many transactions must be processed per second. This information
forms the basis for a number of decisions that will be made during this step.
With these objectives in mind, we now discuss the activities in Step 4:
Step 4.1 Analyze transactions
Step 4.2 Choose file organizations
Step 4.3 Choose indexes
Step 4.4 Estimate disk space requirements
Objective To understand the functionality of the transactions that will run on the
database and to analyze the important transactions.
To carry out physical database design effectively, it is necessary to have knowledge of the
transactions or queries that will run on the database. This includes both qualitative and
quantitative information. In analyzing the transactions, we attempt to identify performance
criteria, such as:
- the transactions that run frequently and will have a significant impact on performance;
- the transactions that are critical to the operation of the business;
- the times during the day/week when there will be a high demand made on the database (called the peak load).
We use this information to identify the parts of the database that may cause performance
problems. At the same time, we need to identify the high-level functionality of the trans-
actions, such as the attributes that are updated in an update transaction or the criteria
used to restrict the tuples that are retrieved in a query. We use this information to select
appropriate file organizations and indexes.
In many situations, it is not possible to analyze all the expected transactions, so we
should at least investigate the most ‘important’ ones. It has been suggested that the most
active 20% of user queries account for 80% of the total data access (Wiederhold, 1983).
This 80/20 rule may be used as a guideline in carrying out the analysis. To help identify
which transactions to investigate, we can use a transaction/relation cross-reference matrix,
which shows the relations that each transaction accesses, and/or a transaction usage
map, which diagrammatically indicates which relations are potentially heavily used. To
focus on areas that may be problematic, one way to proceed is to:
(1) map all transaction paths to relations;
(2) determine which relations are most frequently accessed by transactions;
(3) analyze the data usage of selected transactions that involve these relations.
For example, the sample transactions cross-referenced against the relations include:
(D) List the property number, address, type, and rent of all properties in Glasgow, ordered by rent.
(E) List the details of properties for rent managed by a named member of staff.
(F) Identify the total number of properties assigned to each member of staff at a given branch.
The matrix indicates, for example, that transaction (A) reads the Staff table and also inserts
tuples into the PropertyForRent and PrivateOwner/BusinessOwner relations. To be more use-
ful, the matrix should indicate in each cell the number of accesses over some time interval
(for example, hourly, daily, or weekly). However, to keep the matrix simple, we do not
show this information. This matrix shows that both the Staff and PropertyForRent relations
[Transaction/relation cross-reference matrix (contents not reproduced): the relations Branch, Telephone, Staff, Manager, PrivateOwner, BusinessOwner, PropertyForRent, Viewing, Client, Registration, Lease, Newspaper, and Advert are cross-referenced against transactions (A)–(F).]
are accessed by five of the six transactions, and so efficient access to these relations may
be important to avoid performance problems. We therefore conclude that a closer inspection of these transactions and relations is necessary.
Figure 17.3 Transaction usage map for some sample transactions showing expected occurrences.
If the transactions that heavily access these relations run at different times, the risk of likely performance problems is reduced. However, if their operating patterns
conflict, potential problems may be alleviated by examining the transactions more closely
to determine whether changes can be made to the structure of the relations to improve
performance, as we discuss in Step 7 in the next chapter. Alternatively, it may be pos-
sible to reschedule some transactions so that their operating patterns do not conflict (for
example, it may be possible to leave some summary transactions until a quieter time in the
evening or overnight).
- The relations and attributes accessed by the transaction and the type of access; that is, whether it is an insert, update, delete, or retrieval (also known as a query) transaction. For an update transaction, note the attributes that are updated, as these attributes may be candidates for avoiding an access structure (such as a secondary index).
- The attributes used in any predicates (in SQL, the predicates are the conditions specified in the WHERE clause). Check whether the predicates involve:
  – pattern matching; for example: (name LIKE ‘%Smith%’);
  – range searches; for example: (salary BETWEEN 10000 AND 20000);
  – exact-match key retrieval; for example: (salary = 30000).
  This applies not only to queries but also to update and delete transactions, which can restrict the tuples to be updated/deleted in a relation. These attributes may be candidates for access structures.
- For a query, the attributes that are involved in the join of two or more relations. Again, these attributes may be candidates for access structures.
- The expected frequency at which the transaction will run; for example, the transaction will run approximately 50 times per day.
- The performance goals for the transaction; for example, the transaction must complete within 1 second.
The attributes used in any predicates for very frequent or critical transactions should have a higher priority for access structures.
Figure 17.4 shows an example of a transaction analysis form for transaction (D). This
form shows that the average frequency of this transaction is 50 times per hour, with a peak
loading of 100 times per hour daily between 17.00 and 19.00. In other words, typically half
the branches will run this transaction per hour and at peak time all branches will run this
transaction once per hour.
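For reference, a sketch of the SQL that transaction (D) might issue (relation and attribute names as used elsewhere in this chapter):

SELECT propertyNo, street, city, postcode, type, rent
FROM PropertyForRent
WHERE city = 'Glasgow'
ORDER BY rent;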
The form also shows the required SQL statement and the transaction usage map. At this
stage, the full SQL statement may be too detailed but the types of details that are shown
adjacent to the SQL statement should be identified, namely:
One of the main objectives of physical database design is to store and access data in an
efficient way. For example, if we want to retrieve staff tuples in alphabetical order of
name, sorting the file by staff name is a good file organization. However, if we want to
retrieve all staff whose salary is in a certain range, searching a file ordered by staff name
would not be particularly efficient. To complicate matters, some file organizations are
efficient for bulk loading data into the database but inefficient after that. In other words,
we may want to use an efficient storage structure to set up the database and then change it
for normal operational use.
The objective of this step therefore is to choose an optimal file organization for each
relation, if the target DBMS allows this. In many cases, a relational DBMS may give
little or no choice for choosing file organizations, although some may be established as
indexes are specified. However, as an aid to understanding file organizations and indexes
more fully, we provide guidelines in Appendix C.7 for selecting a file organization based
on the following types of file:
- Heap
- Hash
- Indexed Sequential Access Method (ISAM)
- B+-tree
- Clusters.
If the target DBMS does not allow the choice of file organizations, this step can be omitted.
Objective To determine whether adding indexes will improve the performance of the
system.
One approach to selecting an appropriate file organization for a relation is to keep the
tuples unordered and create as many secondary indexes as necessary. Another approach
is to order the tuples in the relation by specifying a primary or clustering index (see
Appendix C.5). In this case, choose the attribute for ordering or clustering the tuples as:
- the attribute that is used most often for join operations, as this makes the join operation more efficient, or
- the attribute that is used most often to access the tuples in a relation in order of that attribute.
If the ordering attribute chosen is a key of the relation, the index will be a primary index;
if the ordering attribute is not a key, the index will be a clustering index. Remember that
each relation can only have either a primary index or a clustering index.
Specifying indexes
We saw in Section 6.3.4 that an index can usually be created in SQL using the CREATE
INDEX statement. For example, to create a primary index on the PropertyForRent relation
based on the propertyNo attribute, we might use the following SQL statement:
CREATE UNIQUE INDEX PropertyNoInd ON PropertyForRent(propertyNo);
To create a clustering index on the PropertyForRent relation based on the staffNo attribute,
we might use the following SQL statement:
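-- illustrative: the index name and the CLUSTER qualifier are not part of standard SQL,
-- and the syntax for specifying a clustering index varies between DBMSs
CREATE INDEX StaffNoInd ON PropertyForRent(staffNo) CLUSTER;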
As we have already mentioned, in some systems the file organization is fixed. For ex-
ample, until recently Oracle has supported only B+-trees but has now added support for
clusters. On the other hand, INGRES offers a wide set of different index structures that can
be chosen using the following optional clause in the CREATE INDEX statement:
- adding an index record to every secondary index whenever a tuple is inserted into the relation;
- updating a secondary index when the corresponding tuple in the relation is updated;
- the increase in disk space needed to store the secondary index;
- possible performance degradation during query optimization, as the query optimizer may consider all secondary indexes before selecting an optimal execution strategy.
(1) Do not index small relations. It may be more efficient to search the relation in
memory than to store an additional index structure.
(2) In general, index the primary key of a relation if it is not a key of the file organiza-
tion. Although the SQL standard provides a clause for the specification of primary
keys as discussed in Section 6.2.3, it should be noted that this does not guarantee that
the primary key will be indexed.
(3) Add a secondary index to a foreign key if it is frequently accessed. For example, we
may frequently join the PropertyForRent relation and the PrivateOwner/BusinessOwner
relations on the attribute ownerNo, the owner number. Therefore, it may be more
efficient to add a secondary index to the PropertyForRent relation based on the attribute
ownerNo. Note, some DBMSs may automatically index foreign keys.
(4) Add a secondary index to any attribute that is heavily used as a secondary key (for
example, add a secondary index to the PropertyForRent relation based on the attribute
rent, as discussed above).
(5) Add a secondary index on attributes that are frequently involved in:
(a) selection or join criteria;
(b) ORDER BY;
(c) GROUP BY;
(d) other operations involving sorting (such as UNION or DISTINCT).
(6) Add a secondary index on attributes involved in built-in aggregate functions, along
with any attributes used for the built-in functions. For example, to find the average
staff salary at each branch, we could use the following SQL query:
SELECT branchNo, AVG(salary)
FROM Staff
GROUP BY branchNo;
From the previous guideline, we could consider adding an index to the branchNo
attribute by virtue of the GROUP BY clause. However, it may be more efficient
to consider an index on both the branchNo attribute and the salary attribute. This may
allow the DBMS to perform the entire query from data in the index alone, without
having to access the data file. This is sometimes called an index-only plan, as the required response can be produced using only data in the index (a possible index for this example is sketched after this list).
(7) As a more general case of the previous guideline, add a secondary index on attributes
that could result in an index-only plan.
(8) Avoid indexing an attribute or relation that is frequently updated.
(9) Avoid indexing an attribute if the query will retrieve a significant proportion (for
example 25%) of the tuples in the relation. In this case, it may be more efficient to
search the entire relation than to search using an index.
(10) Avoid indexing attributes that consist of long character strings.
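Returning to guideline (6), a possible composite index that would support an index-only plan for the salary query shown there (the index name is illustrative):

CREATE INDEX StaffBranchSalaryInd ON Staff(branchNo, salary);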
If the search criteria involve more than one predicate, and one of the terms contains an OR
clause, and the term has no index/sort order, then adding indexes for the other attributes is
not going to help improve the speed of the query, because a linear search of the relation
will still be required. For example, assume that only the type and rent attributes of the
PropertyForRent relation are indexed, and we need to use the following query:
SELECT *
FROM PropertyForRent
WHERE (type = ‘Flat’ OR rent > 500 OR rooms > 5);
Although the two indexes could be used to find the tuples where (type = ‘Flat’ OR rent > 500),
the fact that the rooms attribute is not indexed will mean that these indexes cannot be used
for the full WHERE clause. Thus, unless there are other queries that would benefit from
having the type and rent attributes indexed, there would be no benefit gained in indexing
them for this query.
On the other hand, if the predicates in the WHERE clause were AND’ed together, the
two indexes on the type and rent attributes could be used to optimize the query.
Office Access does, however, support indexes as we now briefly discuss. In this section
we use the terminology of Office Access, which refers to a relation as a table with fields
and records.
(1) Create the primary key for each table, which will cause Office Access to automatically
index this field.
(2) Ensure all relationships are created in the Relationships window, which will cause
Office Access to automatically index the foreign key fields.
As an illustration of which other indexes to create, we consider the query transactions listed in Appendix A for the Staff user views of DreamHome. We can produce a summary of the interactions between the base tables and these transactions, shown in Table 17.2. This table shows, for each table: the transaction(s) that operate on the table, the type of access (a search based on a predicate, a join together with the join field, any ordering field, and any grouping field), and the frequency with which the transaction runs.
Based on this information, we choose to create the additional indexes shown in Table 17.3. We leave it as an exercise for the reader to choose additional indexes to create in Microsoft Office Access for the transactions listed in Appendix A for the Branch view of DreamHome (see Exercise 17.5).
Oracle automatically adds an index for each primary key. In addition, Oracle recom-
mends that UNIQUE indexes are not explicitly defined on tables but instead UNIQUE
integrity constraints are defined on the desired columns. Oracle enforces UNIQUE integrity
constraints by automatically defining a unique index on the unique key. Exceptions to this
recommendation are usually performance related. For example, using a CREATE TABLE
. . . AS SELECT with a UNIQUE constraint is slower than creating the table without the
constraint and then manually creating a UNIQUE index.
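A sketch of the two-step approach that this exception describes (the table and index names are illustrative):

CREATE TABLE StaffCopy AS SELECT * FROM Staff;
CREATE UNIQUE INDEX StaffCopyStaffNoInd ON StaffCopy(staffNo);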
Assume that the tables are created with the identified primary, alternate, and foreign keys
specified. We now identify whether any clusters are required and whether any additional
indexes are required. To keep the design simple, we will assume that clusters are not
appropriate. Again, considering just the query transactions listed in Appendix A for the
Staff view of DreamHome, there may be performance benefits in adding the indexes
shown in Table 17.4. Again, we leave it as an exercise for the reader to choose additional
indexes to create in Oracle for the transactions listed in Appendix A for the Branch view
of DreamHome (see Exercise 17.6).
Objective To estimate the amount of disk space that will be required by the database.
be a maximum number, but it may also be worth considering how the relation will grow,
and modifying the resulting disk size by this growth factor to determine the potential size
of the database in the future. In Appendix H (see companion Web site) we illustrate the
process for estimating the size of relations created in Oracle.
Objective To design the user views that were identified during the requirements
collection and analysis stage of the database system development
lifecycle.
The first phase of the database design methodology presented in Chapter 15 involved
the production of a conceptual data model for either the single user view or a number of
combined user views identified during the requirements collection and analysis stage. In
Section 10.4.4 we identified four user views for DreamHome named Director, Manager,
Supervisor, and Assistant. Following an analysis of the data requirements for these user
views, we used the centralized approach to merge the requirements for the user views as
follows:
- Branch, consisting of the Director and Manager user views;
- Staff, consisting of the Supervisor and Assistant user views.
In Step 2 the conceptual data model was mapped to a logical data model based on the rela-
tional model. The objective of this step is to design the user views identified previously.
In a standalone DBMS on a PC, user views are usually a convenience, defined to simplify
database requests. However, in a multi-user DBMS, user views play a central role in
defining the structure of the database and enforcing security. In Section 6.4.7, we dis-
cussed the major advantages of user views, such as data independence, reduced complex-
ity, and customization. We previously discussed how to create views using the ISO SQL
standard (Section 6.4.10), and how to create views (stored queries) in Microsoft Office
Access (Chapter 7), and in Oracle (Section 8.2.5).
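As an illustration, a view for the Staff user views might restrict PropertyForRent to the rows and columns that a particular branch's staff need; the view name and the branch predicate are illustrative:

CREATE VIEW StaffPropertyForRent AS
SELECT propertyNo, street, city, postcode, type, rooms, rent, staffNo
FROM PropertyForRent
WHERE branchNo = 'B003';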
Objective To design the security mechanisms for the database as specified by the users during the requirements collection and analysis stage of the database system development lifecycle.
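The detail of this step depends on the security facilities of the target DBMS. As one illustration using the access control mechanisms of ISO SQL (the user/role names and the choice of privileges are illustrative, and the view referred to is the sketch given in the previous step):

GRANT SELECT ON StaffPropertyForRent TO Assistant;
GRANT SELECT, UPDATE (staffNo) ON PropertyForRent TO Supervisor;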
Chapter Summary
- Physical database design is the process of producing a description of the implementation of the database on secondary storage. It describes the base relations and the storage structures and access methods used to access the data effectively, along with any associated integrity constraints and security measures. The design of the base relations can be undertaken only once the designer is fully aware of the facilities offered by the target DBMS.
- The initial step (Step 3) of physical database design is the translation of the logical data model into a form that can be implemented in the target relational DBMS.
- The next step (Step 4) designs the file organizations and access methods that will be used to store the base relations. This involves analyzing the transactions that will run on the database, choosing suitable file organizations based on this analysis, choosing indexes and, finally, estimating the disk space that will be required by the implementation.
- Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used to retrieve data more efficiently. However, there is an overhead involved in the maintenance and use of secondary indexes that has to be balanced against the performance improvement gained when retrieving data.
- One approach to selecting an appropriate file organization for a relation is to keep the tuples unordered and create as many secondary indexes as necessary. Another approach is to order the tuples in the relation by specifying a primary or clustering index. One approach to determining which secondary indexes are needed is to produce a ‘wish-list’ of attributes that we consider are candidates for indexing, and then to examine the impact of maintaining each of these indexes.
- The objective of Step 5 is to design how to implement the user views identified during the requirements collection and analysis stage, such as using the mechanisms provided by SQL.
- A database represents an essential corporate resource and so security of this resource is extremely important. The objective of Step 6 is to design how the security mechanisms identified during the requirements collection and analysis stage will be realized.
Review Questions
17.1 Explain the difference between conceptual, logical, and physical database design. Why might these tasks be carried out by different people?
17.2 Describe the inputs and outputs of physical database design.
17.3 Describe the purpose of the main steps in the physical design methodology presented in this chapter.
17.4 Discuss when indexes may improve the efficiency of the system.
Exercises
18
Methodology – Monitoring and Tuning the Operational System
Chapter Objectives
In this chapter you will learn:
- The meaning of denormalization.
- When to denormalize to improve performance.
- The importance of monitoring and tuning the operational system.
- How to measure efficiency.
- How system resources affect performance.
In this chapter we describe and illustrate by example the final two steps of the physical
database design methodology for relational databases. We provide guidelines for deter-
mining when to denormalize the logical data model and introduce redundancy, and then
discuss the importance of monitoring the operational system and continuing to tune it. In
places, we show physical implementation details to clarify the discussion.
There may be circumstances where it may be necessary to accept the loss of some of the benefits of a fully normalized design in favor
of performance. This should be considered only when it is estimated that the system will
not be able to meet its performance requirements. We are not advocating that normaliza-
tion should be omitted from logical database design: normalization forces us to understand
completely each attribute that has to be represented in the database. This may be the most
important factor that contributes to the overall success of the system. In addition, the
following factors have to be considered:
- denormalization makes implementation more complex;
- denormalization often sacrifices flexibility;
- denormalization may speed up retrievals but it slows down updates.
Formally, the term denormalization refers to a refinement to the relational schema such
that the degree of normalization for a modified relation is less than the degree of at least
one of the original relations. We also use the term more loosely to refer to situations where
we combine two relations into one new relation, and the new relation is still normalized
but contains more nulls than the original relations. Some authors refer to denormalization
as usage refinement.
As a general rule of thumb, if performance is unsatisfactory and a relation has a low update
rate and a very high query rate, denormalization may be a viable option. The transaction
/relation cross-reference matrix that may have been produced in Step 4.1 provides useful
information for this step. The matrix summarizes, in a visual way, the access patterns of the
transactions that will run on the database. It can be used to highlight possible candidates
for denormalization, and to assess the effects this would have on the rest of the model.
More specifically, in this step we consider duplicating certain attributes or joining
relations together to reduce the number of joins required to perform a query. Indirectly,
we have encountered an implicit example of denormalization when dealing with address
attributes. For example, consider the definition of the Branch relation:
Branch (branchNo, street, city, postcode, mgrStaffNo)
Strictly speaking, this relation is not in third normal form: postcode (the post or zip code)
functionally determines city. In other words, we can determine the value of the city attribute
given a value for the postcode attribute. Hence, the Branch relation is in Second Normal
Form (2NF). To normalize the relation to Third Normal Form (3NF), it would be neces-
sary to split the relation into two, as follows:
Branch (branchNo, street, postcode, mgrStaffNo)
Postcode (postcode, city)
However, we rarely wish to access the branch address without the city attribute. This would
mean that we would have to perform a join whenever we want a complete address for a
branch. As a result, we settle for the second normal form and implement the original
Branch relation.
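For comparison, a sketch of the join that the fully normalized design would require whenever a complete branch address is needed:

SELECT b.branchNo, b.street, p.city, b.postcode
FROM Branch b, Postcode p
WHERE b.postcode = p.postcode;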
Unfortunately, there are no fixed rules for determining when to denormalize relations.
In this step we discuss some of the more common situations for considering denormaliza-
tion. For additional information, the interested reader is referred to Rogers (1989) and
Fleming and Von Halle (1989). In particular, we consider denormalization in the follow-
ing situations, specifically to speed up frequent or critical transactions:
To illustrate these steps, we use the relation diagram shown in Figure 18.1(a) and the
sample data shown in Figure 18.1(b).
Figure 18.1 (a) Sample relation diagram; (b) sample relations.
Figure 18.2 Combined Client and Interview: (a) revised extract from the relation diagram; (b) combined relation.
Figure 18.4 Lookup table for property type: (a) relation diagram; (b) sample relations.
If the lookup table is used in frequent or critical queries, and the description is unlikely
to change, consideration should be given to duplicating the description attribute in the
child relation, as shown in Figure 18.5. The original lookup table is not redundant – it
can still be used to validate user input. However, by duplicating the description in
the child relation, we have eliminated the need to join the child relation to the lookup
table.
Figure 18.6 Duplicating the foreign key branchNo in the PrivateOwner relation: (a) revised (simplified) relation diagram with branchNo included as a foreign key; (b) revised PrivateOwner relation.
If an owner could rent properties through many branches, the above change would
not work. In this case, it would be necessary to model a many-to-many (*:*) relation-
ship between Branch and PrivateOwner. Note also that the PropertyForRent relation has the
branchNo attribute because it is possible for a property not to have a member of staff allo-
cated to it, particularly at the start when the property is first taken on by the agency. If the
PropertyForRent relation did not have the branch number, it would be necessary to join the
PropertyForRent relation to the Staff relation based on the staffNo attribute to get the required
branch number. The original SQL query would then become:
SELECT o.lName
FROM Staff s, PropertyForRent p, PrivateOwner o
WHERE s.staffNo = p.staffNo AND p.ownerNo = o.ownerNo AND s.branchNo = ‘B003’;
Removing two joins from the query may provide greater justification for creating a
direct relationship between PrivateOwner and Branch and thereby duplicating the foreign
key branchNo in the PrivateOwner relation.
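Once branchNo is duplicated in PrivateOwner, the same information can be retrieved with no joins at all; a sketch:

SELECT lName
FROM PrivateOwner
WHERE branchNo = 'B003';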
Figure 18.7 Duplicating the street attribute from the PropertyForRent relation in the Viewing relation.
Figure 18.8 Branch incorporating repeating group: (a) revised relation diagram; (b) revised relation.
new relation, forming a 1:* relationship with the original (parent) relation. Occasionally,
reintroducing repeating groups is an effective way to improve system performance. For
example, each DreamHome branch office has a maximum of three telephone numbers,
although not all offices necessarily have the same number of lines. In the logical data
model, we created a Telephone entity with a three-to-one (3:1) relationship with Branch,
resulting in two relations, as shown in Figure 18.1.
If access to this information is important or frequent, it may be more efficient to com-
bine the relations and store the telephone details in the original Branch relation, with one
attribute for each telephone, as shown in Figure 18.8.
In general, this type of denormalization should be considered only in the following
circumstances:
- the absolute number of items in the repeating group is known (in this example there is a maximum of three telephone numbers);
- the number is static and will not change over time (the maximum number of telephone lines is fixed and is not expected to change);
- the number is not very large, typically not greater than 10, although this is not as important as the first two conditions.
Sometimes it may be only the most recent or current value in a repeating group, or just the
fact that there is a repeating group, that is needed most frequently. In the above example
we may choose to store one telephone number in the Branch relation and leave the remain-
ing numbers for the Telephone relation. This would remove the presence of nulls from the
Branch relation, as each branch must have at least one telephone number.
Figure 18.9 Horizontal and vertical partitioning.
Figure 18.10 Oracle SQL statement to create a hash partition.

CREATE TABLE ArchivedPropertyForRentPartition(
    propertyNo VARCHAR2(5) NOT NULL,
    street VARCHAR2(25) NOT NULL,
    city VARCHAR2(15) NOT NULL,
    postcode VARCHAR2(8),
    type CHAR NOT NULL,
    rooms SMALLINT NOT NULL,
    rent NUMBER(6, 2) NOT NULL,
    ownerNo VARCHAR2(5) NOT NULL,
    staffNo VARCHAR2(5),
    branchNo CHAR(4) NOT NULL,
    PRIMARY KEY (propertyNo),
    FOREIGN KEY (ownerNo) REFERENCES PrivateOwner(ownerNo),
    FOREIGN KEY (staffNo) REFERENCES Staff(staffNo),
    FOREIGN KEY (branchNo) REFERENCES Branch(branchNo))
PARTITION BY HASH (branchNo)
(PARTITION b1 TABLESPACE TB01,
 PARTITION b2 TABLESPACE TB02,
 PARTITION b3 TABLESPACE TB03,
 PARTITION b4 TABLESPACE TB04);
Partitions are particularly useful in applications that store and analyze large amounts of
data. For example, DreamHome maintains an ArchivedPropertyForRent relation with several
hundreds of thousands of tuples that are held indefinitely for analysis purposes. Searching
for a particular tuple at a branch could be quite time consuming; however, we could reduce
this time by horizontally partitioning the relation, with one partition for each branch. We
can create a (hash) partition for this scenario in Oracle using the SQL statement shown in
Figure 18.10.
As well as hash partitioning, other common types of partitioning are range (each parti-
tion is defined by a range of values for one or more attributes) and list (each partition is
defined by a list of values for an attribute). There are also composite partitions such as
range–hash and list–hash (each partition is defined by a range or a list of values and then
each partition is further subdivided based on a hash function).
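A sketch of range partitioning in Oracle on a cut-down version of the archived relation; the table name, column subset, and boundary values are illustrative:

CREATE TABLE ArchivedPropertyForRentByRent(
    propertyNo VARCHAR2(5) NOT NULL,
    rent NUMBER(6, 2) NOT NULL,
    branchNo CHAR(4) NOT NULL,
    PRIMARY KEY (propertyNo))
PARTITION BY RANGE (rent)
(PARTITION rentLow VALUES LESS THAN (500) TABLESPACE TB01,
 PARTITION rentMid VALUES LESS THAN (1000) TABLESPACE TB02,
 PARTITION rentHigh VALUES LESS THAN (MAXVALUE) TABLESPACE TB03);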
There may also be circumstances where we frequently examine particular attributes of
a very large relation and it may be appropriate to vertically partition the relation into
those attributes that are frequently accessed together and another vertical partition for the
remaining attributes (with the primary key replicated in each partition to allow the
original relation to be reconstructed using a join).
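A sketch of how such a vertical partitioning might be declared, with the primary key repeated in both relations so that the original relation can be rebuilt by joining them; the particular split of attributes is illustrative:

CREATE TABLE PropertyForRentFrequent(
    propertyNo VARCHAR2(5) NOT NULL,
    type CHAR NOT NULL,
    rent NUMBER(6, 2) NOT NULL,
    staffNo VARCHAR2(5),
    PRIMARY KEY (propertyNo));

CREATE TABLE PropertyForRentOther(
    propertyNo VARCHAR2(5) NOT NULL,
    street VARCHAR2(25) NOT NULL,
    city VARCHAR2(15) NOT NULL,
    postcode VARCHAR2(8),
    rooms SMALLINT NOT NULL,
    ownerNo VARCHAR2(5) NOT NULL,
    branchNo CHAR(4) NOT NULL,
    PRIMARY KEY (propertyNo),
    FOREIGN KEY (propertyNo) REFERENCES PropertyForRentFrequent(propertyNo));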
Partitioning has a number of advantages:
- Improved load balancing Partitions can be allocated to different areas of secondary storage thereby permitting parallel access while at the same time minimizing the contention for access to the same storage area if the relation was not partitioned.
- Improved performance By limiting the amount of data to be examined or processed, and by enabling parallel execution, performance can be enhanced.
- Increased availability If partitions are allocated to different storage areas and one storage area becomes unavailable, the other partitions would still be available.
- Improved recovery Smaller partitions can be recovered more efficiently (equally well, the DBA may find backing up smaller partitions easier than backing up very large relations).
- Security Data in a partition can be restricted to those users who require access to it, with different partitions having different access restrictions.
Partitioning can also have a number of disadvantages:
- Complexity Partitioning is not usually transparent to end-users and queries that utilize more than one partition become more complex to write.
- Reduced performance Queries that combine data from more than one partition may be slower than a non-partitioned approach.
- Duplication Vertical partitioning involves duplication of the primary key. This leads not only to increased storage requirements but also to potential inconsistencies arising.
Advantages:
- Can improve performance by: precomputing derived data; minimizing the need for joins; reducing the number of foreign keys in relations; reducing the number of indexes (thereby saving storage space); reducing the number of relations.

Disadvantages:
- May speed up retrievals but can slow down updates.
- Always application-specific and needs to be re-evaluated if the application changes.
- Can increase the size of relations.
- May simplify implementation in some cases but may make it more complex in others.
- Sacrifices flexibility.
For this activity we should remember that one of the main objectives of physical database
design is to store and access data in an efficient way (see Appendix C). There are a num-
ber of factors that we may use to measure efficiency:
- Transaction throughput This is the number of transactions that can be processed in a given time interval. In some systems, such as airline reservations, high transaction throughput is critical to the overall success of the system.
- Response time This is the elapsed time for the completion of a single transaction. From a user’s point of view, we want to minimize response time as much as possible. However, there are some factors that influence response time that the designer may have no control over, such as system loading or communication times. Response time can be shortened by:
  – reducing contention and wait times, particularly disk I/O wait times;
  – reducing the amount of time for which resources are required;
  – using faster components.
- Disk storage This is the amount of disk space required to store the database files. The designer may wish to minimize the amount of disk storage used.
However, there is no one factor that is always correct. Typically, the designer has to trade
one factor off against another to achieve a reasonable balance. For example, increasing
the amount of data stored may decrease the response time or transaction throughput. The
initial physical database design should not be regarded as static, but should be considered
as an estimate of how the operational system might perform. Once the initial design has
been implemented, it will be necessary to monitor the system and tune it as a result of
observed performance and changing requirements (see Step 8). Many DBMSs provide the
Database Administrator (DBA) with utilities to monitor the operation of the system and
tune it.
Main memory
Main memory accesses are significantly faster than secondary storage accesses, sometimes
tens or even hundreds of thousands of times faster. In general, the more main memory
available to the DBMS and the database applications, the faster the applications will run.
However, it is sensible always to have a minimum of 5% of main memory available. Equally
well, it is advisable not to have any more than 10% available otherwise main memory is
not being used optimally. When there is insufficient memory to accommodate all processes,
the operating system transfers pages of processes to disk to free up memory. When one of
these pages is next required, the operating system has to transfer it back from disk. Some-
times it is necessary to swap entire processes from memory to disk, and back again, to free up
memory. Problems occur with main memory when paging or swapping becomes excessive.
To ensure efficient usage of main memory, it is necessary to understand how the target
DBMS uses main memory, what buffers it keeps in main memory, what parameters exist
to allow the size of the buffers to be adjusted, and so on. For example, Oracle keeps a data
dictionary cache in main memory that ideally should be large enough to handle 90% of
data dictionary accesses without having to retrieve the information from disk. It is also
necessary to understand the access patterns of users: an increase in the number of concur-
rent users accessing the database will result in an increase in the amount of memory being
utilized.
CPU
The CPU controls the tasks of the other system resources and executes user processes, and
is the most costly resource in the system so needs to be correctly utilized. The main objec-
tive for this component is to prevent CPU contention in which processes are waiting for
the CPU. CPU bottlenecks occur when either the operating system or user processes make
too many demands on the CPU. This is often a result of excessive paging.
It is necessary to understand the typical workload through a 24-hour period and ensure
that sufficient resources are available for not only the normal workload but also the peak
Figure 18.11 Typical disk configuration.
workload (for example, if the system has 90% CPU utilization and 10% idle during the
normal workload then there may not be sufficient scope to handle the peak workload). One
option is to ensure that during peak load no unnecessary jobs are being run and that such
jobs are instead run in off-hours. Another option may be to consider multiple CPUs, which
allows the processing to be distributed and operations to be performed in parallel.
CPU MIPS (millions of instructions per second) can be used as a guide in comparing
platforms and determining their ability to meet the enterprise’s throughput requirements.
Disk I/O
With any large DBMS, there is a significant amount of disk I/O involved in storing and
retrieving data. Disks usually have a recommended I/O rate and, when this rate is
exceeded, I/O bottlenecks occur. While CPU clock speeds have increased dramatically in
recent years, I/O speeds have not increased proportionately. The way in which data is
organized on disk can have a major impact on the overall disk performance. One problem
that can arise is disk contention. This occurs when multiple processes try to access the
same disk simultaneously. Most disks have limits on both the number of accesses and the
amount of data they can transfer per second and, when these limits are reached, processes
may have to wait to access the disk. To avoid this, it is recommended that storage should be
evenly distributed across available drives to reduce the likelihood of performance problems
occurring. Figure 18.11 illustrates the basic principles of distributing the data across disks:
– the operating system files should be separated from the database files;
– the main database files should be separated from the index files;
– the recovery log file (see Section 20.3.3) should be separated from the rest of the
database.
If a disk still appears to be overloaded, one or more of its heavily accessed files can be
moved to a less active disk (this is known as distributing I/O). Load balancing can be
achieved by applying this principle to each of the disks until they all have approximately
the same amount of I/O. Once again, the physical database designer needs to understand how
the DBMS operates, the characteristics of the hardware, and the access patterns of the users.
Disk I/O has been revolutionized with the introduction of RAID (Redundant Array of
Independent Disks) technology. RAID works on having a large disk array comprising an
arrangement of several independent disks that are organized to increase performance and
at the same time improve reliability. We discuss RAID in Section 19.2.6.
Network
When the amount of traffic on the network is too great, or when the number of network
collisions is large, network bottlenecks occur.
Each of the above resources may affect other system resources. Equally well, an improvement in one resource may effect an improvement in other system resources. For example:
- procuring more main memory should result in less paging, which should help avoid CPU bottlenecks;
- more effective use of main memory may result in less disk I/O.
Summary
Tuning is an activity that is never complete. Throughout the life of the system, it will be
necessary to monitor performance, particularly to account for changes in the environment
and user requirements. However, making a change to one area of an operational system to
improve performance may have an adverse effect on another area. For example, adding an
index to a relation may improve the performance of one transaction, but it may adversely
affect another, perhaps more important, transaction. Therefore, care must be taken when
making changes to an operational system. If possible, test the changes either on a test
database, or alternatively, when the system is not being fully used (such as, out of work-
ing hours).
(1) Ability to hold pictures of the properties for rent, together with comments that
describe the main features of the property.
In Microsoft Office Access we are able to accommodate this request using OLE
(Object Linking and Embedding) fields, which are used to store data such as Microsoft
Word or Microsoft Excel documents, pictures, sound, and other types of binary data
created in other programs. OLE objects can be linked to, or embedded in, a field in a
Microsoft Office Access table and then displayed in a form or report.
To implement this new requirement, we restructure the PropertyForRent table to
include:
(a) a field called picture specified as an OLE data type; this field holds graphical
images of properties, created by scanning photographs of the properties for rent
and saving the images as BMP (Bit Mapped) graphic files;
(b) a field called comments specified as a Memo data type, capable of storing lengthy
text.
Figure 18.12 Form based on PropertyForRent table with new picture and comments fields.
A form based on some fields of the PropertyForRent table, including the new fields, is
shown in Figure 18.12. The main problem associated with the storage of graphic
images is the large amount of disk space required to store the image files. We would
therefore need to continue to monitor the performance of the DreamHome database
to ensure that satisfying this new requirement does not compromise the system’s
performance.
(2) Ability to publish a report describing properties available for rent on the Web.
This requirement can be accommodated in both Microsoft Office Access and Oracle
as both DBMSs provide many features for developing a Web application and publish-
ing on the Internet. However, to use these features, we require a Web browser, such
as Microsoft Internet Explorer or Netscape Navigator, and a modem or other network
connection to access the Internet. In Chapter 29, we describe in detail the technologies
used in the integration of databases and the Web.
Chapter Summary
- Formally, the term denormalization refers to a refinement to the relational schema such that the degree of normalization for a modified relation is less than the degree of at least one of the original relations. The term is also used more loosely to refer to situations where two relations are combined into one new relation, and the new relation is still normalized but contains more nulls than the original relations.
- Step 7 of physical database design considers denormalizing the relational schema to improve performance. There may be circumstances where it may be necessary to accept the loss of some of the benefits of a fully normalized design in favor of performance. This should be considered only when it is estimated that the system will not be able to meet its performance requirements. As a rule of thumb, if performance is unsatisfactory and a relation has a low update rate and a very high query rate, denormalization may be a viable option.
- The final step (Step 8) of physical database design is the ongoing process of monitoring and tuning the operational system to achieve maximum performance.
- One of the main objectives of physical database design is to store and access data in an efficient way. There are a number of factors that can be used to measure efficiency, including throughput, response time, and disk storage.
- To improve performance, it is necessary to be aware of how the following four basic hardware components interact and affect system performance: main memory, CPU, disk I/O, and network.
Review Questions
18.1 Describe the purpose of the main steps in the physical design methodology presented in this chapter.
18.2 Under what circumstances would you want to denormalize a logical data model? Use examples to illustrate your answer.
18.3 What factors can be used to measure efficiency?
18.4 Discuss how the four basic hardware components interact and affect system performance.
18.5 How should you distribute data across disks?
Exercise
18.6 Investigate whether your DBMS can accommodate the two new requirements for the DreamHome case study
given in Step 8 of this chapter. If feasible, produce a design for the two requirements and implement them in
your target DBMS.