Introduction to Database Management
Each generation has its unique needs and aspirations. When Charles Wiley first
opened his small printing shop in lower Manhattan in 1807, it was a generation
of boundless potential searching for an identity. And we were there, helping to
define a new American literary tradition. Over half a century later, in the midst
of the Second Industrial Revolution, it was a generation focused on building the
future. Once again, we were there, supplying the critical scientific, technical, and
engineering knowledge that helped frame the world. Throughout the 20th
Century, and into the new millennium, nations began to reach out beyond their
own borders and a new international community was born. Wiley was there,
expanding its operations around the world to enable a global exchange of ideas,
opinions, and know-how.
For 200 years, Wiley has been an integral part of each generation’s journey,
enabling the flow of information and understanding necessary to meet their needs
and fulfill their aspirations. Today, bold new technologies are changing the way
we live and learn. Wiley will be there, providing you the must-have knowledge
you need to imagine new worlds, new possibilities, and new opportunities.
Generations come and go, but you can always count on Wiley to provide you the
knowledge you need, when and where you need it!
WILEY 1807–2007 BICENTENNIAL
Credits
PUBLISHER: Anne Smith
ACQUISITIONS EDITOR: Lois Ann Freier
MARKETING MANAGER: Jennifer Slomack
SENIOR EDITORIAL ASSISTANT: Tiara Kelly
PRODUCTION MANAGER: Kelly Tavares
PRODUCTION ASSISTANT: Courtney Leshko
CREATIVE DIRECTOR: Harry Nolan
COVER DESIGNER: Hope Miller
COVER PHOTO: ©AP/Wide World Photos
This book was set in Times New Roman by Techbooks, printed and bound by R.R. Donnelley. The cover was
printed by R.R. Donnelley.
Microsoft product screen shot(s) reprinted with permission from Microsoft Corporation.
This book is printed on acid-free paper.
Copyright © 2008 John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States
Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of
the appropriate per-copy fee to the Copyright Clearance Center, Inc. 222 Rosewood Drive, Danvers, MA 01923,
website www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions
Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, (201) 748-6011,
fax (201) 748-6008, website http://www.wiley.com/go/permissions.
To order books or for customer service, please call 1-800-CALL WILEY (225-5945).
ISBN-13 978-0-470-10186-5
▲ C: Content
▲ A: Analysis
▲ S: Synthesis
▲ E: Evaluation
Instructor Package
Introduction to Database Management is available with the following teach-
ing and learning supplements. All supplements are available online at
the text’s Book Companion Web site, located at www.wiley.com/college/gillenson.
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
CONTENTS
1 Introducing Data and Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Understanding the Role of Data and Databases. . . . . . . . . . . . . . . 2
1.1.1 A Practical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Understanding Data Management . . . . . . . . . . . . . . . . 5
1.1.3 The Need for Data Management . . . . . . . . . . . . . . . . . . 5
Self-Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Understanding Data Sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Picking a Starting Point . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Identifying Primary Processes. . . . . . . . . . . . . . . . . . . . 8
1.2.3 Specific Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Self-Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Potential Data Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Managing Data Accuracy . . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Managing Data Security . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 Managing Data Organization . . . . . . . . . . . . . . . . . . . 16
1.3.4 Managing Data Access . . . . . . . . . . . . . . . . . . . . . . . . 16
Self-Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Assess Your Understanding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Summary Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Applying This Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
You Try It . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Introducing Databases and Database Management Systems. . . . . . . . . . . . . 24
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1. Introduction to Key Database Concepts . . . . . . . . . . . . . . . . . . . 25
2.1.1. Database Approach to Data . . . . . . . . . . . . . . . . . . . . 25
2.1.2. Understanding Basic Concepts . . . . . . . . . . . . . . . . . . 26
2.1.3. Database Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Self-Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Understanding Basic Database Models . . . . . . . . . . . . . . . . . . . . 31
2.2.1 The Hierarchical Database Model . . . . . . . . . . . . . . . . 32
2.2.2 The Network Database Model . . . . . . . . . . . . . . . . . . 33
2.2.3 The Relational Database Model. . . . . . . . . . . . . . . . . . 34
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of data and
data management.
Determine where you need to concentrate your effort.
INTRODUCTION
We live in a true information age where data is a critical resource. Today, per-
haps more than ever before, knowledge is power. This chapter provides an intro-
duction to data and some fundamental points about data management. We start
with the role that data has played and continues to play as a business resource.
We then look at how business practices help you understand data collection
requirements. Finally, we take a brief look at some potential data management
concerns.
Figure 1-1: Information needed for ticketing and seat reservation.
Figure 1-2: An airline's core business (providing air transportation) is carried out through its business processes, supported by a database system holding information about airports, airplanes, airplane types, seats, flights, flight legs, airfares, discount plans, customers, and reservations.
It soon became clear that a new kind of software was needed to help man-
age the data, as well as faster hardware to keep up with the increasing volume
of data and data access demands. In terms of personnel, data management spe-
cialists would have to be developed, educated, and given the responsibility for
managing the data as a corporate resource.
Out of this need was born a new kind of software, the database manage-
ment system (DBMS), and a new category of personnel, with titles like database
administrator and data management specialist. And, yes, hardware has progres-
sively gotten faster and cheaper for the degree of performance that it provides.
The integration of these advances adds up to much more than the simple sum
of their parts; they add up to the database environment.
Figure 1-3: A database of customer details (addresses, favorite music, favorite designers, favorite foods) used to generate sales announcements, restaurant suggestions, and recommended CDs.
are about to enter kindergarten, when they will graduate from high school, and
when they are engaged to be married. They then use the information generated
from the data stored in the database, along with sophisticated mining software, to
target sales efforts with a higher probability of success.
All this data is stored in databases. The databases are growing larger, not
only because more data is added to them on a daily basis, but because new kinds
of data are being captured and stored, based on the activities and transactions
in which you participate in the course of your daily living. The amount of data
being stored in databases every day, based on people’s actions and transactions,
is already huge, but will get even larger in the coming years.
FOR EXAMPLE
Amazon.com
Amazon.com, the largest retailer on the Web, has perfected the technique of
using databases to characterize its customers. By analyzing the kinds of
products you have bought or expressed interest in, Amazon.com can pre-
sent you with displays of similar products that you are likely to find inter-
esting. This sales strategy requires not only massive, well-structured data-
bases, but also sophisticated data-mining software that finds associations and
relationships in customers’ past behavior in order to predict what cus-
tomers are likely to do and want in the future.
SELF-CHECK
• Explain why businesses see data management as a critical
requirement.
• List companies that collect data about their customers as part of
their normal business activities.
Figure 1-4: Departmental activities. Fleet management covers car demand, tracking car location and use, and replacement orders; the service department covers regular maintenance, checks after each rental, and cleaning and repair.
retired soon. All these and many more activities within the fleet management
department are aimed at supporting the core business of the car rental company,
which is renting cars to customers. The servicing department’s activities are focused
on keeping each car in top condition. Each car must be thoroughly checked after
each rental and maintained. Again, the activities of the servicing department,
although different from those of fleet management, are also aimed at supporting
the core business of the car rental company, namely, renting cars to customers.
Here is a sample of the core businesses of a few types of organizations:
▲ Retail grocery store: buy groceries from vendors and sell to retail customers.
▲ Stock brokerage: buy and sell stocks and bonds for individuals and institu-
tions.
▲ Auction company: enable customers to sell and buy goods through auctions.
▲ Computer consulting: provide consulting services.
▲ Airlines: provide air transportation to customers.
▲ Car dealership: buy and sell cars.
▲ Department store: buy and sell consumer goods.
▲ University: provide higher learning to students.
After you identify what a business is and what it does, you can start to under-
stand the data involved.
are directed toward accomplishing the purpose of the core business. Many distinc-
tive processes carried out by the departments support the core business. These pri-
mary processes fulfill the purpose of the core business.
At this point, data collection can still look like an overwhelming process.
Even a simple business can easily have dozens, if not hundreds, of primary
processes needed to accomplish its core business goals. Trying to address all of
these processes at once is not only daunting, it’s nearly impossible and, in the
long run, a waste of time. It’s better to start by focusing on one process, like air-
line ticketing and seat assignment.
What tasks must be completed to complete the process? To carry out the
various processes, the company needs resources and assets such as buildings,
equipment, materials, people, and money. But that is not all. The company also
needs information to accomplish its processes. Information is a major asset, like
other tangible and intangible assets of the company, used for performing the mul-
titude of processes. Look for the information needed to complete the task, and
you have a start on collecting the company’s core information resources.
▲ Employee records: includes hiring and firing records, time sheets and other
payroll information, vacation records, and related records.
▲ Customer records: includes customer lists, contact information, purchase
history.
▲ “Hard” assets: long-term organizational assets such as land, buildings, and
equipment.
▲ Inventory records: includes inventory of items for sale as well as items for
internal consumption.
▲ Accounting records: information about the organization’s resources with a
focus on financial resources.
Some of these documents, such as accounting records, are likely already orga-
nized to some extent and in an electronic format. Though they often cannot be
used as is, they are typically easier to integrate with your database. Other docu-
ments may exist only as a hard-copy paper trail, leaving much of the organiza-
tion (and data entry) up to you.
Figure 1-5: Employees, their job requirements, and the task-specific information they need.
FOR EXAMPLE
Hands-on Collection
You are gathering data for a retail organization. How do you find out the
data requirements for supporting telephone sales? You do so by watching
the sales personnel in action. Watch for both standard requirements and
exceptions. How do sales personnel verify product information? How do
they check stocking levels? What do they record about the products sold?
What do they record about the customer? What is required to complete the
transaction? These questions all help to provide direction to your data col-
lection efforts.
SELF-CHECK
• Explain how primary processes can be used to identify data collec-
tion requirements.
• Explain why a company’s organization chart can be important in
determining the employees to interview and what questions to ask
during data collection.
• Take a business model with which you are familiar, such as retail
grocer, department store, or a school. List the primary processes
needed to accomplish core business goals and the types of data
needed by each.
user permission to view data, but not change it. Accessibility is also an issue
when discussing security. You want to protect the data, but you also need to
make it available to the appropriate users.
A vital part of this effort is reducing the surface area (a term used to refer
to how exposed a database is to access and manipulation) that is vulnerable to
attack. The wider your consumer audience (or data consumers—the people or
applications who need access to data and who use data in some manner), the
more difficult this task becomes. Consider a local database that contains all of
your company’s data. If your employees are the only people that need access to
the data, protecting the data is a relatively easy task. The issue of who needs
access, as well as the level of access required, is typically well defined. In many
cases, you can implement many of the security requirements as an extension to
existing network security.
As your potential consumer audience increases, so do your potential secu-
rity concerns. What if people outside of your company need access to the data,
possibly accessing your network resources remotely through the Internet, as
shown in Figure 1-6?
Figure 1-6: Remote users reaching the database server through the Internet.
Now you have less control over the user population. Though not shown in
Figure 1-6, additional tools such as network firewalls often come into play.
Rather than providing direct access to the data like you have in this example,
you might instead provide indirect access, such as through a Web-based appli-
cation. You will find that there are no one-size-fits-all solutions. Instead, it’s likely
that you will need to design a custom-access solution matched to the specific
access requirements, and be ready to modify the solution and access as needs
evolve.
The term commonly used to refer to data access is a data query. A query,
though it often implies simple data retrieval, can also be used to insert new
data or manipulate existing data. Most database systems use a standard set of
commands known as the Structured Query Language, or SQL, to query data-
base data. That is why you often hear modern databases referred to as SQL
databases.
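As a minimal sketch (assuming a hypothetical CUSTOMER table with NAME and CITY columns), the same query language covers both retrieval and manipulation:

-- Retrieval: list customers located in Memphis
SELECT NAME, CITY
FROM CUSTOMER
WHERE CITY = 'Memphis';

-- Manipulation: add a new customer, then correct a stored value
INSERT INTO CUSTOMER (NAME, CITY)
VALUES ('Acme Hardware', 'Memphis');

UPDATE CUSTOMER
SET CITY = 'Nashville'
WHERE NAME = 'Acme Hardware';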
When you set up your database, you must determine who needs access to
the data and what level of access they need. Not all users need access to the
same data. Access requirements are related to users’ responsibilities and the busi-
ness functions (primary practices) they perform (Figure 1-7). Think back to the
processes relating to airline ticketing and seat assignment. Ticket agents need
access to a great deal of information such as flights, available seats, and ticket
prices. There is even more information about the airline, though, to which the
ticket agent should not have access, such as confidential financial records.
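A small illustration of this kind of level-based access, sketched in SQL with a hypothetical ORDERS table and reporting_user account (exact syntax varies slightly by DBMS):

-- reporting_user may read order data but receives no INSERT, UPDATE,
-- or DELETE permissions, so attempts to change the data are refused
GRANT SELECT ON ORDERS TO reporting_user;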
In addition to determining who needs access, another design decision
you’ll need to make fairly early is exactly how to give users access to the data-
base. Typically, a very limited number of users have direct access to the data-
base. This is usually restricted to database administrators and, possibly, database
programmers. Nearly everyone else accesses the database indirectly through
another application. The specific application type depends on the access
requirements.
Users responsible for day-to-day activities typically access a database through
one or more applications designed to perform specific tasks. For example, ticket
Figure 1-7: Users at different locations carrying out business processes that share access to the database system.
FOR EXAMPLE
The Galileo Spacecraft and Its Data
The Galileo spacecraft has been studying the Jupiter system for several years.
It has sent back huge amounts of data. By studying a small fraction of that
data, scientists have inferred the existence of a global ocean under the ice
covering the surface of Jupiter’s moon Europa and the probable existence of
a similar ocean under the surface of another of Jupiter’s moons, Callisto.
However, the large majority of Galileo’s data remains unstudied, because in
its unorganized state, it is very difficult to extract useful information from
it, unless you already know what you are looking for. Organizing the data
into a database would be of tremendous benefit, but NASA has no funding
for such a massive effort.
agents for an airline would have an application designed to sell tickets and assign
seats. A user in a department store working a point-of-sale terminal would have
an application designed to post to sales and inventory records. People working
in the accounting department would have an application designed to work with
financial accounting records. Application requirements relate back to the core
business and primary processes.
Other access requirements could add another layer of isolation between the
data consumer and the database. Rather than communicating directly with the
database, an application might send and receive data through a Web service,
which, in turn, is the application that communicates with the database.
SELF-CHECK
• List key concerns relating to data management.
• Explain how your network environment can influence your access
and security design.
• Explain the need for different levels of data-access security based
on responsibilities and roles.
• Identify tools you might need to implement physical data security.
• Compare and contrast direct database access, indirect access by
local clients, and indirect access by remote clients.
SUMMARY
In this chapter, you learned fundamental concepts relating to data, its role as a
business resource, and requirements for data management. You compared data
requirements from both historic and modern perspectives, including a look at a
specific business process and related data requirements. You investigated the
need for data collection and how to identify data sources based on core busi-
ness activities and primary processes. You were also introduced to some of the
fundamentals of data management and potential concerns related to data man-
agement.
KEY TERMS
Access security
Authentication method
Consumer audience
Core business
Data
Data access
Data accuracy
Data consumer
Data query
Data security
Database
Fault-tolerant disk system
Information
Mirror image
Physical security
Surface area
Summary Questions
1. The terms data and information mean the same thing. True or false?
2. What is a database management system?
(a) an organization responsible for database maintenance and administra-
tion
(b) specialized software designed to implement, support, and maintain
databases
(c) a manual method of record-keeping phased out with the introduction
of computers
(d) an end-user application designed to give users access to data
3. Which statement best describes the role of data in business?
(a) Data collection and organization is a low-priority effort with few
benefits for a business.
(b) Data is important to business in the information technology sector only.
(c) All businesses freely share data about their customers in a
cooperative environment.
(d) Data can provide a competitive advantage in hotly contested
industries.
4. Which of the following statements best describes the use of business doc-
uments during initial data collection?
(a) You should collect all business documents except those relating to
personnel records.
(b) Limit collection efforts to documents in electronic format only as
hard-copy documents seldom contain useful information.
(c) You should collect all available business documents, including both
electronic and hard-copy documents.
(d) It is safe to assume that all business documents you locate are com-
plete, accurate, and applicable to the current business environment.
5. When interviewing employees to collect data, you should limit yourself
to executive staff, supervisors, and managers only. True or False?
6. A good place for identifying the starting point for data collection is an
organization’s core business. True or False?
2
INTRODUCING DATABASES
AND DATABASE
MANAGEMENT SYSTEMS
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of
databases and database management systems.
Determine where you need to concentrate your effort.
INTRODUCTION
No one can deny the importance of information in today’s data-driven world.
Vital to putting all of this information to use is the ability to organize, manipu-
late, and retrieve data on demand. Therein lies the importance of databases and
database management systems. In order to understand databases and how to get
the most out of them, you need to understand key terms and have a basic con-
cept of the components involved and how they fit together in a coherent archi-
tectural model. That is the goal of this chapter.
In this chapter, we lift away some of the protective covers to look into the
nuts and bolts of what makes up a database and a database management system
(DBMS). Keep in mind throughout this chapter that databases and DBMSs have
different roles and different components. The database, while a complicated entity
in itself, at its core is simply where we store the data. The DBMS contains all of
the other components that make the database a viable storage and retrieval tool.
Figure 2-1: Database overview. A database is an ordered collection of related data, with shared data access, that meets information needs.
▲ Data repository
▲ Data dictionary
▲ Database software
▲ Data abstraction
▲ Data access
▲ Transaction support
Data Repository
All data in the database resides in a data repository. This is the data storage unit
where the physical data files are kept and the central storage location for the data
content. The physical structure of the data repository is database specific.
Data Dictionary
The data repository contains the actual data. Let’s say that you want to keep data
about your customers. There are two aspects of this data. One aspect is the struc-
ture of the data consisting of the field names, field sizes, data types, and so on.
The other part is the actual data values stored in the fields.
This first part, relating to the structure, typically is stored separately from
the database values as the data dictionary or data catalog. A data dictionary
(data catalog) contains the data element structures and the relationships among
data elements. The data dictionary and the data repository work together to pro-
vide information to users.
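As a simple illustration (hypothetical field names), the structural information kept in the data dictionary corresponds to a table definition such as the one below, while the data repository holds the actual rows stored in that table:

CREATE TABLE CUSTOMER (
    CUSTOMER_ID    INTEGER      NOT NULL,   -- field name, data type, constraint
    CUSTOMER_NAME  VARCHAR(50)  NOT NULL,
    CITY           VARCHAR(30),
    CREDIT_STATUS  CHAR(1)
);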
Database Software
Though sometimes referred to as databases, products like Microsoft Access, SQL
Server, MySQL, Oracle, Informix, and similar products are technically the software
that manages data and databases. These are database software or database man-
agement systems (DBMS), which support storing, retrieving, and updating data in
a database. The database software, or database management system (DBMS), is not
the database itself, but lets you store, manage, and protect the data in a database.
Data Abstraction
Consider a situation where you want to store information about customers. Data
about each customer consists of several fields such as customer name, street
address, city, state, zip code, credit status, and so on. Depending on users’ require-
ments and roles in an organization, different users consider the data from differ-
ent levels of data abstraction. Data abstraction, in this context, is a way of look-
ing at data that breaks it down into its basic components and groups the
components to give you different ways of looking at the data. It gives you the abil-
ity to hide the complexities of data design at the levels where they are not required.
Data Access
The database approach includes the fundamental operations that can be applied
to data. Every database management system provides for the following data
access basic operations:
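These basic operations are often summarized as the CRUD activities (create, read, update, and delete). A rough SQL sketch against a hypothetical PRODUCT table:

-- Create: add a row
INSERT INTO PRODUCT (PRODUCT_ID, PRODUCT_NAME, UNIT_PRICE)
VALUES (21765, 'Drill', 92.99);

-- Read: retrieve rows
SELECT PRODUCT_NAME, UNIT_PRICE
FROM PRODUCT
WHERE UNIT_PRICE < 50.00;

-- Update: change existing values
UPDATE PRODUCT SET UNIT_PRICE = 94.99 WHERE PRODUCT_ID = 21765;

-- Delete: remove rows
DELETE FROM PRODUCT WHERE PRODUCT_ID = 21765;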
Transaction Support
Consider the process for entering an order from a customer into the computer
system. The order entry clerk types in the customer number, the product code,
and the quantity ordered. The order entry program retrieves the customer data
so the clerk can verify the customer data, retrieves and displays product infor-
mation and inventory data, and finally completes the order and updates inven-
tory or creates a back order. All these tasks perform a single primary task, enter-
ing a single order comprised of a single order entry transaction. A transaction is
a series of statements or commands that execute as a group, like the statements
represented in Figure 2-2. All of the statements must either run successfully or
all must be rolled back, backing out the results as if the statements were never
run. This compensates for errors that might occur during transaction execution.
When a transaction is initiated it should complete all the tasks and leave the
data in the database in a consistent state. For example, if the initial stock is 1000
units and the order is for 25 units, the stock value stored in the database after
the transaction is completed must be 975 units and there must be an order on file documenting the sale of the 25 units.
Figure 2-2: A transaction is a series of steps carried out as a group against the database server.
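A minimal sketch of such a transaction in SQL (hypothetical table and column names; transaction syntax varies slightly by DBMS, and BEGIN TRANSACTION is the SQL Server form):

BEGIN TRANSACTION;

-- Reduce stock: 1000 units on hand becomes 975
UPDATE PRODUCT
SET STOCK_QTY = STOCK_QTY - 25
WHERE PRODUCT_ID = 1234;

-- Record the order that documents the sale of the 25 units
INSERT INTO ORDER_ITEM (ORDER_ID, PRODUCT_ID, QUANTITY)
VALUES (5001, 1234, 25);

COMMIT;   -- if either statement fails, ROLLBACK instead, leaving the data unchanged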
Figure 2-3: Database types and their uses.
• Production databases: support business functions; online transaction processing; usage includes CRUD activities; features include concurrency, security, and transaction processing.
• Decision-support databases: used for analysis, querying, and reporting; generally read-only; features include query tools and custom applications; data warehouse and OLAP systems.
• Mass deployment databases: intended for single-user environments; workstation versions of database products; ease of use is important; features include report and application generation capabilities.
FOR EXAMPLE
Choosing an Appropriate Deployment Model
You are a database designer and administrator for a business with offices
throughout the United States. All of the database and other IT support per-
sonnel are located in the home office in Chicago, Illinois. You support a sin-
gle database server named “Ol_Faithful” with access provided to the remote
offices through remote networking.
The company recently acquired another company in the same industry,
primarily to get its customer list and other business intelligence. Your job
is to make the data available as quickly as possible with minimal interrup-
tion to current procedures and user access to current data. The new data-
base server, named “OurData,” is physically located in St. Louis, Missouri,
and runs a different DBMS than your existing database server. Also, you
determine that you can’t import data from the new server into the existing
server without first upgrading Ol_Faithful to a more powerful computer
with larger hard disks.
How does this situation fit into the implementation models? What you
need is to change your current centralized model into a distributed model,
but with all of the servers physically located in the same place. Bring the
OurData server to Chicago. That’s where all of your support staff is located,
and your infrastructure is designed to support centralized data access from
Chicago. Once you have the server physically in place, you can provide
access through a distributed model based on a heterogeneous data environ-
ment. This buys you time until you can upgrade Ol_Faithful and combine
the data onto one server.
databases are serving another important function. These decision support data-
bases are designed to support advanced data mining and provide support for strate-
gic decision making in an organization. Production databases and decision support
databases are typically large-scale databases designed to support several users’ needs
within an organization. When a company uses both, the production database has
the added role of acting as the data source for the decision support database.
Individuals and single departments might also use private databases (referred
to in the figure as mass deployment databases), but this is typically discouraged
unless required for security or other reasons. However, having multiple databases
does not necessarily mean implementing multiple database servers. Instead, you
can implement multiple databases through a single server.
Businesses implement databases using either a centralized model or a dis-
tributed model. In the centralized model, you have a single, centrally located
database server. This server and the database it hosts contain all of the business
data. This model is relatively easy to implement, but can become difficult to sup-
port if you have to provide data access for remote users.
The distributed model uses multiple database servers with the data spread
across the servers. Each server typically contains its own unique set of data.
Servers are usually deployed in different physical locations. The database servers
can all use the same DBMS, or use different database products, giving you a
heterogeneous data environment made of mixed database types, operating sys-
tems, and hardware platforms. Potential drawbacks include the need for reliable
communication between the database servers and the fact that different database
products don’t always interface well with each other.
SELF-CHECK
• Explain basic database concepts relating to data and transactions.
• Discuss how the basic building blocks of a database work together
to provide a data environment.
• List business factors that would determine whether it would be
better to deploy a centralized or distributed database model. Be
sure to include cost as part of the determining factors.
• Compare the role of the data repository with that of the data dictio-
nary and describe how the two interact.
Figure 2-4: The hierarchical database model. CUSTOMER is the root (parent) segment, ORDER is its child segment, and ORDER LINE ITEM is a child of ORDER; relationship links are maintained through physical pointers.
FOR EXAMPLE
Hierarchical Storage in the Real World
Though it isn’t based on the hierarchical database model, hierarchical stor-
age has seen rapidly growing interest recently with the use of extensible
markup language (XML) files. XML files are commonly used for local data
storage and for data transfer over the Internet. XML is not a database type,
but is instead a data storage and presentation format. XML is used as an inter-
nal format, or at least a formatting option, in several database systems. How-
ever, some new database systems, designed primarily for small, low-volume
databases, are based on XML data files and storage structures. Even though
they use hierarchical storage, they are not hierarchical model databases.
but a parent can have multiple children. You might want to separate orders into
phone orders and mail orders. In that case, CUSTOMER may have PHONE
ORDER and MAIL ORDER as two child segments. The parent/child links are
maintained through physical data pointers that are embedded in the data
records. The parent has a pointer to the child record and the child has a pointer
back to its parent.
Figure 2-5: The network database model. SALES TERRITORY, CUSTOMER, SALESPERSON, ORDER, PAYMENT, and ORDER LINE ITEM record types are connected through owner and member record-type relationships, with relationship links maintained through physical pointers.
Figure 2-6: The SALES TERRITORY, CUSTOMER, SALESPERSON, ORDER, PAYMENT, and ORDER LINE ITEM data structures and their relationships.
key in the referenced table, or it can also point to a column that has been specif-
ically configured as unique.
For example, the CUSTOMER table represents the data for all customers.
Columns in the CUSTOMER table describe the customer with information such
as the customer ID value, name, address, and so forth. The ORDER table rep-
resents the data content for all orders and the information describing each order.
In the relational model you define relationships between different data elements.
A relationship is an association defined and established between two data entities.
Consider the relationship between CUSTOMER and ORDER. Each customer might
have one or more orders. Each customer occurrence must be connected to all related
order occurrences. In the relational model, a foreign key field (column) is included
in the ORDER data structure to define that relationship. For each customer order
in ORDER, the foreign key identifies the related customer. To look for all the orders
for a particular customer, you search through the foreign key field of ORDER and
find those orders with that customer’s ID value in the foreign key field.
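In SQL terms, the relationship and the lookup might be sketched as follows (hypothetical column names; the table is called ORDER_TBL here because ORDER is a reserved word in SQL):

CREATE TABLE ORDER_TBL (
    ORDER_ID     INTEGER PRIMARY KEY,
    ORDER_DATE   DATE,
    CUSTOMER_ID  INTEGER,
    FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER (CUSTOMER_ID)
);

-- All orders for customer 1234, located through the foreign key column
SELECT ORDER_ID, ORDER_DATE
FROM ORDER_TBL
WHERE CUSTOMER_ID = 1234;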
The foreign key enforces and maintains the relationship between CUS-
TOMER and ORDER. This replaces the physical pointers used in the hierarchi-
cal and network models. Even though this model does not use physical data
pointers to define relationships, it does still use them. Physical data pointers are
used in many implementations of the relational model to identify where data
rows are physically stored to help optimize data retrieval.
Figure 2-7: A product table listing 21765 Drill 92.99, 24013 Saw 26.25, and 26722 Pliers 11.50, along with an image of each product.
database, is shown in Figure 2-7. Notice in this example that the data includes
large-object image data.
It is now common to see the Unified Modeling Language (UML) used
when creating data models for relational database management system (RDBMS)
based applications. UML was standardized in 1997 as a modeling language for object-oriented systems and is also used to describe data models for object-oriented databases.
SELF-CHECK
• Describe the basic database models.
• Explain why the relational and object-relational models are the most
commonly used database models in current database applications.
• Compare how relationships between data entities are represented in
the hierarchical, network, and relational models.
Figure 2-8: Overall database system architecture: hardware, systems software, the DBMS, application software, query/reporting front-end tools, application interfaces, procedures, practitioners, and users.
You’ve already been introduced to some of these, specifically the data repos-
itory, data dictionary, and database software, here shown as the DBMS. Let’s take
a brief look at each of these functional building blocks, which include the
following:
The components in the overall architecture fall into three major groups:
(1) hardware, (2) software, and (3) people and procedures.
Figure 2-9: Database system hardware infrastructure. A server machine (a minicomputer or mainframe) with RAID or hard disk array storage communicates with client machines/terminals over communication links.
depending on variable factors such as the size and type of database, the expected
number of users, details of the database application and its processing require-
ments, and whether or not the server must support other applications in addi-
tion to the DBMS. Most manufacturers provide guidelines for calculating your
actual server requirements.
In a database environment, servers and client workstations require power-
ful, fast processors. The processors in the client computers perform instructions
to present data to the users in many new and sophisticated ways. There may be
additional processing requirements, such as locally performed calculations,
depending on the specific client application software. The processors in the
application servers (database servers) must process complex requests for data
from the database and perform data manipulation. As with the client, requirements
vary depending on the application. A common option is to have a multiple-
tier configuration for server-side processing where the database server is separate
from an application server running additional application components.
Main memory provides temporary storage of data and programs. This is also
where data manipulation physically occurs. On a database server, much of the
memory is dedicated to memory buffers, memory set aside within the main
memory specifically for storing database data, as shown in Figure 2-10. When
data is initially requested, it is read from the hard disk and maintained in the
memory buffers.
Figure 2-10: Buffer memory use. Client data requests are served from buffer memory set aside within the database server's system memory, backed by the database on disk.
Data can be read from the memory buffers for data extraction
or processing as needed. In a database environment, you must move large vol-
umes of data through the memory buffers. The data remains in memory until
the space is needed and it is overwritten by new data coming into the buffers.
Because of this, when a request is made for data, the memory buffers are searched
first to see whether the requested data is already there. If so, then the data can
be accessed from memory and used much faster than when being read from the
hard disk.
Computer systems use I/O (input/output) mechanisms to move data into and
out of the computer’s main memory from secondary storage, a term used to
refer to nonvolatile storage media that does not lose the data it contains if the
computer is turned off, such as hard disks. Modern PCs improve transfer per-
formance through direct memory access (DMA) channels that let devices such as
hard disk drives, floppy drives, video adapters, and network adapters transfer
data directly to and from memory without going through the processor cache.
Another I/O concern is communication between computers. In earlier data-
base environments, most of the processing was done locally around a central-
ized database. Today’s database environments are different. Distributed databases
have become common. Data communications components move high volumes
of data between distributed databases at different locations. Communication links
with sufficient capacity and speed to handle the data movement are critical.
These communications links include both local communications through a local
network, and remote communications through a wide-area or enterprise net-
work. The Internet is used extensively as the communication path of choice for
remote communications.
Data storage devices such as disk, tape, and CD or DVD drives are critical.
Without nonvolatile data storage on secondary storage devices, there is no data-
base. How you use secondary storage and how you are able to retrieve data from
it affects the performance of the database environment. Often, storage device per-
formance is the most limiting factor when optimizing database server performance.
Two primary concerns are capacity and speed. You were introduced earlier to
the data repository as containing all of the database data. Your secondary storage
must have sufficient capacity to hold all of this data along with program files,
operating system files, and any other data stored locally on the computer. It must
also provide for high-speed data access so that data requests are processed quickly
and efficiently. Secondly, data storage should be fault tolerant. Data storage mech-
anisms should ensure that the database operations continue even when some mal-
function affects parts of the storage. RAID technology devices are a common solu-
tion and allow you to store data so that even if one disk fails, the database
operations continue. RAID, which stands for redundant array of inexpensive (or
independent, depending on who you ask) disks, provides high-performance disk
storage. Most RAID configurations also provide fault-tolerant storage.
FOR EXAMPLE
Comparing Data Storage Formats
A DBMS stores data in an operating system file. The total number of files
involved and how they are structured depends on the particular DBMS. You
can get an idea of common variations by comparing popular examples.
Our first example is Microsoft Access. Access uses a relatively simple
storage structure, creating a single file when you create the database. The
database objects are contained in this one file. This has its advantages, such
as making the database extremely portable, but it is also limited, with few
options for managing storage. For example, there is little you can do with
the database file to improve performance.
MySQL is an open source application with versions available that you
can download and use free of charge. There are also for-purchase versions
that include additional tools and utilities. It is a complete RDBMS product
and a popular choice for Web-based applications. When you create a MySQL
database, it creates a data folder containing an .opt file, which contains
option settings for the database. Separate data files are created in the data-
base folder for each table and view that you create in the database (a view
is a database object that supports custom data retrieval). One drawback of
this storage system is that you must deal with a large number of separate
files when managing a large, complicated database.
MySQL also shares log files between databases. Log files are used in most
database systems to store errors, informational messages, database activity,
and other database information. These shared log files include a log that
tracks statements executed against the database as a way of logging trans-
action activity.
Microsoft SQL Server is one of the most full-featured database products
on the market. It includes a wide range of features and supports something
of an object-relational database model. Different versions, called editions, are
available to support a wide range of database needs.
A SQL Server database is physically made up of at least two and pos-
sibly more files. One required file is the database file and contains the
database objects and data. The other is the transaction log that tracks any
activity that modifies the database. The data storage model is designed so
you can create additional data files as needed and specify the physical file
locations. You can then define the file in which a database object is cre-
ated. This gives you a way of spreading a database across multiple hard
disks. You have control over how the space is used and, by sharing the
work between multiple drives, a way of improving database performance
and I/O throughput.
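As a rough sketch of this (hypothetical database, file names, and drive paths; exact options depend on the SQL Server edition and version), a database can be created with its data files and transaction log placed on different drives:

CREATE DATABASE SalesDB
ON PRIMARY
    (NAME = SalesDB_data1, FILENAME = 'D:\SQLData\SalesDB_data1.mdf'),
    (NAME = SalesDB_data2, FILENAME = 'E:\SQLData\SalesDB_data2.ndf')
LOG ON
    (NAME = SalesDB_log,   FILENAME = 'F:\SQLLogs\SalesDB_log.ldf');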
The data storage component also relates to the storage of the structure defin-
itions in the data dictionary, the database’s metadata. Metadata is a term referring
to data about data. It’s the data that describes the database and database objects.
The storage format is specific to the database management system software used.
Structure definitions specify database and object schemas. In this context, schema
refers to the design and structure of database objects. For example, a table’s schema
describes, among other items, the columns that make up the table.
Speed is of the essence for the storage devices holding the data dictionary.
This is especially true because, with many database systems, the data dictionary
must be accessed first before the data is accessed.
You may also have removable storage media components, such as a tape
drive or writeable CD or DVD drive, to support database backups and data
archiving. In a well-managed design, databases are backed up regularly. The
backups are used for recovery if and when problems occur and data is lost. You
can run backups to a shared storage location on the network or to local remov-
able media. Also, periodically, you might archive old data to remove it from the
active database for storage in a separate location.
▲ Hardware management
▲ Process management
▲ Memory management
▲ File management
▲ I/O management
▲ Network control and communications
▲ Fault detection and recovery
Database Engine
The kernel or heart of the DBMS is the database engine, which coordinates the
tasks performed by all other DBMS components. This helps ensure that every
database operation gets completed correctly and completely. The other compo-
nents depend on the database engine to perform actions for them such as stor-
ing data and other system information, retrieving data, and updating data in the
database.
As an example, consider the database engine’s role in processing a query. A
query refers to an executable statement, such as a statement used to retrieve or
modify data. Suppose you create a query to retrieve production data from the
database. When your users run this query, the DBMS must coordinate the tasks
needed to determine where the data is located and how it is structured. Once the structures and data location are determined, the DBMS interacts with the operating system to retrieve the requested data.
Figure 2-11: DBMS components. The database engine coordinates the data dictionary, query processor, forms generator, report writer, application development tools, communications interface, and utility tools.
The database engine is also responsible for security, sometimes working
together with the operating system. Does the user executing the query have
authorization to access the data specified in the query? The database engine
component coordinates the services of the security module to verify the
authorization.
Some systems include an automated agent service or agent and alert system
that works in close cooperation with the database engine. This service is known
by various names, depending on the DBMS, but performs the same function in
each case. Some descriptions of DBMS components might not identify this as a
separate component, instead including it with database tools and utilities. How-
ever, its role in the DBMS and the fact that it is often implemented as a sepa-
rate service or application justifies listing it as a unique component in its own
right.
An agent and alert system performs background tasks that support the data-
base service. Key among these is managing periodic activities that are scheduled
for automatic execution. It ensures that these activities run when scheduled and
if not, reports the error and takes appropriate corrective actions. Other tasks for
which the service might take responsibility include background monitoring and
managing activity logs. Background monitoring takes different forms, but is most
often related to performance monitoring and detecting and reporting errors. It
usually falls on the database administrator to define and configure performance
thresholds and identify the error conditions the service should report. Log man-
agement includes writing error and status messages to database log files. Most
systems will have one or more log files, some dedicating different logs to different
components.
Nearly all systems provide some kind of support for error and activity log
maintenance. If the DBMS does not include an automated agent as a separate
component, you might find these functions integrated into the database engine.
Performance monitoring and scheduled activities are a different story, however.
If not supported by the database engine or a separate service, you will need to
either find a way to use operating system utilities or buy (or build) a separate
support application that handles them for you.
Data Dictionary
Another component is the data dictionary, which refers to the storage of struc-
ture definitions—the storage repository of structure definitions. In many rela-
tional databases, the data dictionary comprises a set of system tables like a direc-
tory system. The data dictionary interface software is sometimes considered a
distinct component within a DBMS. However, in other manufacturers’ products
the data dictionary is considered an integrated part of the database engine rather
than a separate component. Either way, the data dictionary is part of the data-
base’s metadata, and no matter how it is defined, its role is the same. Specifi-
cally, the data dictionary performs the following tasks:
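In many relational products these system tables can be queried directly. For example (using the INFORMATION_SCHEMA views that many relational DBMSs support; availability and view names are product-specific), a database administrator can list a table's column structure with:

SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'CUSTOMER';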
Query Processor
Another key component is the query processor, which may be called the
query optimizer, the DBMS component responsible for parsing, optimizing,
and compiling queries for execution. Each commercial vendor adopts a
standard query language or develops its own proprietary language for creating
queries. Structured Query Language or SQL (pronounced see-kwil) is the
standard language for relational DBMSs. Each relational database vendor
enhances the standard SQL command set and adds a few features here and
there, but the essential features of SQL are present in every commercial DBMS.
Now, for example, you can write a query in SQL to retrieve data from the
CUSTOMER table: This might require a query such as the following:
SELECT *
FROM CUSTOMER
WHERE CUSTOMER_ZONE = 'North' AND
      CUST_BALANCE > 10000
What happens in the DBMS when your query is presented for execution?
The database engine coordinates the services of the query processor to com-
plete the query and return the results. The query processor parses the query
and checks for obvious syntax errors. It determines the resources available to
run the query and makes selections to help ensure that the query is opti-
mized for processing. It then compiles the query for execution. It also watches
for potential errors specific to query processing and, in some cases, is even
able to abort queries that run too long. As shown in Figure 2-12, some
DBMSs give you a sneak peek into the specific steps necessary to process the
query.
Figure 2-12
These processing steps are used as an execution plan. After the query proces-
sor parses and optimizes the query, it compiles the execution plan and writes it to
memory. This execution plan is used to perform the query and return the requested
data. Figure 2-12 shows one possible execution plan for our sample query. The
actual execution plan can change as the database design and available optimization
components change, so the execution plan you see now might not be exactly the
same as the execution plan you would see on a later run of the same query.
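Many DBMSs let you request the plan directly. For instance (one common case; other products provide similar commands such as EXPLAIN PLAN in Oracle or SET SHOWPLAN_TEXT in SQL Server), prefixing the sample query with EXPLAIN in MySQL returns the planned processing steps instead of the data:

EXPLAIN
SELECT *
FROM CUSTOMER
WHERE CUSTOMER_ZONE = 'North' AND
      CUST_BALANCE > 10000;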
Forms Generator
The forms generator component, when included with the DBMS, lets you create
screen layouts or forms for data input and data display. For example, suppose you
want to display the details of a customer order to your user. You need a layout to
display the various data elements such as the order number, date, method of ship-
ment, products ordered, price, and so on. You would use the forms generator to
create a form to display this information in an easily readable format. You can usu-
ally include graphics and images to enhance the appearance of your forms, such as
displaying the picture of the employee on a form showing employee information.
A forms generator is not included as an integrated component with all DBMS
products. In fact, it is not a common component with higher-level database
products. Instead it is usually found in products like Microsoft Access. Higher-
level products, like Microsoft SQL Server, leave forms generation as a need to
be filled by a separate application or a separate development environment.
Report Writer
The report writer is another component that you might or might not see
included with your DBMS. Although most information access is online, a good
deal of information delivery still uses printed reports. Reports may be viewed on
screen before selecting and printing only the ones needed in hard copy. In some
cases, reports are generated and distributed in an electronic format only, though
the recipient usually has the option of printing a hard copy if desired.
Several DBMS products include a report writer component. The functional-
ity and features are product specific or even product version specific and vary
widely. Because of the additional overhead required to support this component,
many manufacturers include this as an optional component.
between the computers on which they are hosted. This can include DBMSs run-
ning on different computer systems and under multiple operating systems. The
computers could be in different physical locations. Communications interface
modules in the DBMS interact with the operating system and network software
to initiate and manage the connections. They manage the data flow between data-
bases and help to promote true global data sharing.
Another aspect of the communications interface is the link with traditional
programming languages and application development environments. This is sup-
ported both through interfaces integrated in the DBMS and those included in
the development environment.
Most commercial vendors have enhanced their DBMSs and equipped them
with rich sets of utility tools. Others have left this up to third-party devel-
opers who create the utilities as separate applications. In some cases, the only
support tools available are those defined as part of the SQL language
command set.
Either way, these tools and interface modules are commonly known as the
toolkit portion of database software. The toolkit is primarily intended for data-
base administrators to perform necessary functions. Common utilities include
tools to support loading data into the database and backup and restore utilities.
Tools and utilities are often provided in both GUI interface and command-line
interface versions. The command-line interface tools and utilities are often the
preferred choice when having to manage a database remotely, especially when
working over a low-speed connection.
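For instance (one common case, with hypothetical host, account, and database names), MySQL's command-line mysqldump utility can back up a database from a remote administrator's workstation over such a connection:

mysqldump --host=dbserver.example.com --user=admin -p sales_db > sales_db_backup.sql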
▲ Casual Users: This group uses the database occasionally. Usually, middle-
and upper-level executives fall into this group and need special consider-
ations for providing data access. You typically have to provide these users
with simple menu-driven applications.
▲ Regular Users: The database in a production environment serves as the
information repository for the regular users to perform their day-to-day
functions. This group of users retrieves data and updates the database,
typically more than any other group. They work with programs that
retrieve data, manipulate data in many ways, change data to current
values, and also typically run periodic reports.
Figure 2-13
Database users (casual, regular, power, specialized, and individual users) and
the IT roles that support them (business analyst, data modeler, database
designer, systems analyst, programmer, and database administrator), all
working with the database server.
▲ Power Users: These users do not require well-structured applications to
access the database. They can write their own queries and format their
own reports. For the power users, you have to provide general guidance
about the query and reporting tools and give them a road map of the
database contents.
▲ Specialized Users: Business analysts, scientists, researchers, and engi-
neers who are normally part of research and development departments
need the database for specialized purposes. These users know the power
and features of database systems. They may write their own applications.
▲ Individual Users: Some users in your organization may justify maintain-
ing standalone, private databases for individual departments. Departments
working with highly sensitive and confidential data may qualify for per-
sonal databases.
You can think of the entire group of IT personnel who create and maintain
databases as database practitioners. These people keep the database environment
functioning and enable the users to reap the benefits. Broadly, we can classify
database practitioners into three categories: those who design, those who main-
tain, and those who create the applications.
▲ Usage: clear and concise procedures on how to make use of the data-
base. For example, procedures might include plain and understandable
instructions on how to sign on and use the various functions provided
by the applications. These often consist of written procedures as well
as user aids, which are sometimes built into an application's help
system.
▲ Queries and Reports: a list of available predefined queries and reports,
along with instructions on how to supply the parameters and run them,
provided to users. These can include written documentation or be inte-
grated into a database application or support application such as a Web
service. Sometimes, programmers document these through the applica-
tion’s help system.
SELF-CHECK
• List and describe database architectural components.
• Explain the difference between required and recommended hardware.
• Compare common data storage methods, including benefits and
shortcomings of each.
• Describe common DBMS components and how they are related.
SUMMARY
You learned about databases and database management systems in this chapter.
You were introduced to basic database concepts and database types. You com-
pared the hierarchical, network, relational, and object-oriented data models. You
also learned about database system architectural components.
KEY TERMS
Attribute, Binary large object (BLOB) data type, Centralized model, Concurrency
control, CRUD, Data abstraction, Database engine, Database management system
(DBMS), Database practitioner, Database software, Data catalog, Data dictionary,
Data repository, Data type, Decision support database, Direct memory access
(DMA), Distributed model, Entity, Foreign key, Forms generator, Heterogeneous
data environment, Hierarchical database model, Main memory, Memory buffer,
Metadata, Network database model, Nonvolatile storage, Object-oriented database
model, Object-relational database model, Operating system software, Physical
data pointer, Primary key, Production database, Query, Query optimizer, Query
processor, RAID, Relational database, Relational database model, Relationship,
Report writer, Schema, Secondary storage, Structured Query Language (SQL),
Table, Transaction, Unified Modeling Language (UML), View
Summary Questions
1. Companies adopting the database approach to data typically use a combina-
tion of data-driven and process-driven development methods. True or False?
2. Which of the following relate to storing and managing data structure
information?
(a) data repository
(b) data abstraction
(c) data dictionary
(d) data access
3. In the distributed model of database implementation, all of the database
servers must use the same DBMS. True or False?
4. Which of the following statements correctly describes a transaction?
(a) All statements in a transaction must be completed or rolled back as a
group.
(b) A transaction must consist of a single executable statement.
(c) If an error occurs during processing, a transaction leaves data in an
inconsistent state.
5. Which of the following correctly lists the words making up the abbrevia-
tion CRUD?
(a) consistency, redundancy, utility, data store
(b) create, read, update, delete
(c) columns, rows, users, data
(d) create, reply, understand, distribute
6. Which of the following is a common hybrid database model?
(a) object-relational
(b) hierarchical-object
(c) network-relational
(d) hierarchical-network
7. In the relational model, a relation refers to which of the following?
(a) the association between two data tables
(b) a foreign key definition
2. You are comparing DBMS products as the platform for supporting a retail
sales application. You plan to use a standard product from an established
manufacturer. Everything should run on standard PC hardware. The
database must support inventory tracking, customer records, and cus-
tomer orders, including a complete sales history. The company plans to
eventually expand into e-commerce over the Internet. To support this,
you must be able to have both a description and a picture of each prod-
uct. Should you use a traditional model or hybrid model DBMS? Which
model best supports your data requirements?
3. You are comparing DBMS products as the platform for a new e-commerce
application. You have identified the tables that you need and the attributes
that each table needs to support, such as relationships between tables,
uniqueness, and so forth. Compare and contrast features supported by
different commercial DBMS products.
TIP: Use the manufacturer Web sites to research features and functionality.
4. You are deploying a database application based on Microsoft’s SQL Server
DBMS. You need to identify implementation requirements and product
features. You need to justify where additional software packages will be
needed. Why is a separate development platform needed for application
development? How can you determine the initial database server hardware
configuration? What hardware components are critical to optimum perfor-
mance? What two general categories of users must you support?
YOU TRY IT
3
DATA MODELING
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of data
modeling.
Determine where you need to concentrate your effort.
INTRODUCTION
This chapter focuses on the process of creating a data model. A data model rep-
resents real world business data in the form of a database design. The chapter
begins by introducing the design process, the main types of databases, and the
goals of data modeling. It then looks at the key components of a relational data-
base model: entities and relationships. The chapter concludes by comparing
some data models based on different businesses.
▲ Conceptual design
▲ Logical design
▲ Physical design
During the conceptual design (Figure 3-1), you lay the groundwork. This is
when you interview all of the stakeholders, which includes everyone with any
interest in the project, about their data requirements. This is when you identify
what the application needs to do. This is also where you identify the data
requirements and collect the information you need. You aren't concerned about
how things will get done or which of the data you collect you actually need;
you're just pulling together everything you can find.
Figure 3-1
Conceptual design (the first of the three design phases: conceptual, logical,
and physical).
Figure 3-2
Logical design (the second of the three design phases).
You start organizing the data during the logical design process (Figure 3-2).
This is where you sort through the data to identify what information applies to
the project and start putting together a basic logical framework. You identify the
basic type of database application that you need based on processing require-
ments. You start thinking about data services you’ll need to support the appli-
cation. You also go back to the stakeholders for verification, to ensure that your
vision meets their expectations.
Data modeling occurs during logical design. During the process of data
modeling, you identify entities and attributes, distinguish relationships between
the entities, and make initial decisions about database tables. The framework is
still somewhat tentative at this point. You may find roadblocks to implementing
your design that force you to rethink and rework your logical framework.
Implementation details come in during physical design (Figure 3-3). The
focus moves from logical modeling to physical modeling. This is where you make
final decisions about how the data will be represented in the database. You’ve
already identified your attributes and now you decide how to store them. You
decide on the physical database objects that your solution will require. You
make decisions that will help you settle on a hardware platform, such as phys-
ical object placement, to optimize both space use and database performance. At
the end of this process you should have a complete physical design that you
will in turn use to install and configure your database server and create the
physical database.
Throughout the design process, it’s important to realize that the conceptual,
logical, and physical design are all interrelated, and that you’ll move back and
forth between them during the design process.
Figure 3-3
Physical design (the third of the three design phases).
▲ Transactional
▲ Decision support system (DSS)
▲ Hybrid
This chapter focuses on the transactional database model as its primary exam-
ple, but we must also consider the other models and their core concerns. The
core data modeling concerns are much the same in each of these models.
every hour. A library has to track each book coming in and going out as an indi-
vidual change. The database for each of these organizations needs to be opti-
mized to process these changes.
The traditional model for a transactional database is a client-server database
environment where the database services users within a single company. A more
expansive version of this is the online transaction processing (OLTP) database.
The difference is one of scale. With this model, you must consider a much larger
network environment, possibly a global enterprise network, and larger user base.
A typical feature of this model is Internet-based access with Web services and
Web applications. Because of the larger user base, concurrency requirements for
OLTP database models can be critical. Concurrency relates to supporting
multiple users who need concurrent access, that is, access to the same data at
the same time. You are also concerned about throughput: how much data you can
process and how fast you can process it.
Figure 3-4
Bulk load (data is bulk loaded from the production server into the data
warehouse).
data warehouse often contains many years of historical data to provide effective
forecasting capabilities. An OLTP database is typically used as the source data-
base for the data warehouse. A data mart is essentially a small subset of a larger
data warehouse. The data mart is typically based on the same modeling technique
as its parent data warehouse. A reporting database is often a specialized
data-warehouse-type database containing only active (and not historical or
archived) data. Reporting databases are typically small in comparison to the data
warehouse, similar in size to the source OLTP database. In fact, they can be
smaller than the OLTP database because they often have more limited data
requirements and contain a selected subset of the data.
Consider the retail example again. In many cases, the company will also have
a DSS that relies on a data warehouse. Information collected by the transactional
database is periodically written to the data warehouse.
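As a rough sketch of this periodic transfer, the statement below assumes hypothetical
SalesTransaction (production) and SalesHistory (warehouse) tables and standard SQL
date arithmetic; in practice the load is usually performed by a dedicated bulk-load
or ETL utility rather than a single statement.

    -- Copy yesterday's completed sales from the production (OLTP) table
    -- into the data warehouse history table (table and column names assumed).
    INSERT INTO SalesHistory (SaleID, ProductNumber, Quantity, SaleDate)
    SELECT SaleID, ProductNumber, Quantity, SaleDate
    FROM SalesTransaction
    WHERE SaleDate = CURRENT_DATE - INTERVAL '1' DAY;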
How would a retail business use the information? One way is by determin-
ing ordering and stocking levels. It’s important to most businesses to keep inven-
tory levels to a minimum. Ever wonder how a business decides the best time to
discount summer items and put Halloween costumes out for sale, or when to
bring out Christmas trees? A DSS lets the retailer adjust inventory stocking lev-
els based on varying needs, such as seasonal requirements. It tells the retailer
when to rotate the stock and can even provide suggestions as to the best phys-
ical product placement inside the store.
Another example is using a DSS to help identify new products that the busi-
ness might want to stock. Data mining applications can make associations
between products or customer purchases you might otherwise overlook. A DSS
can help the business target advertising so that it better matches specific cus-
tomers’ purchase patterns. That’s one of the ways marketing efforts specifically
target different groups. In a chain store, this is important because purchase
habits vary by geographic location, and can even vary by neighborhood in the
same city.
Figure 3-5
Preliminary model (inventory, ordering, and customers, with open questions about
what customer information to keep, whether to group customers by age, sex, or
neighborhood/ZIP code, and how it is tracked).
Business rules cover a wide area of subjects, not all directly relating to your
database design. However, you must identify those that do impact the design
and ensure that they are included. Essentially, business rules cover:
▲ Any types of organizational policies of any form and at all levels of the
organization.
▲ Any types of calculations or formulas (such as loan amortization calcula-
tions for a mortgage lending company).
▲ Any types of rules and regulations (such as rules applied because of legal
requirements, self-imposed restrictions, or industry-standard requirements).
A relational database model cannot avoid defining some business rules. They
will impact what data is kept in the database, limits placed on that data, data
relationships, and how they are defined.
▲ High-value items are defined as items that cost us $1000 or more each.
▲ There should be no more than three of any high-value items in stock
and on order at any time.
In most cases, you could enforce this through the database system. Assume
that you have an inventory table that includes quantity on hand and quantity
on order. The logic used would be some variation of the following (expressed
below as pseudocode):
IF cost >= $1000 THEN
    IF ((quantity on hand) + (quantity on order)
            + (new order amount)) <= 3 THEN
        Allow the order
    ELSE
        Cancel the order
ELSE
    Allow the order
First, check to see if the cost is greater than or equal to $1000. If not, you
don’t do any additional checking. If so, then add up the quantity on hand,
already on order, and the new order amount. If this total is less than or equal
to 3, then you can place the order. If not, the order is canceled.
To take this one step further, you could have set stocking levels or order
points defined in the database. In that case, you would use similar logic in the
order point calculations to ensure that the order point is never set higher than
3 for items costing $1000 or more.
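As one possible sketch of enforcing the high-value rule in the database itself, a
table-level CHECK constraint could be declared on a hypothetical Inventory table
(column names are assumed here); many DBMS products would implement the same rule
with a trigger or stored procedure instead.

    CREATE TABLE Inventory (
        ItemNumber      INTEGER PRIMARY KEY,
        UnitCost        DECIMAL(10,2) NOT NULL,
        QuantityOnHand  INTEGER NOT NULL,
        QuantityOnOrder INTEGER NOT NULL,
        -- Business rule: no more than three of any high-value item
        -- (unit cost of $1000 or more) in stock and on order at any time.
        CHECK (UnitCost < 1000
               OR QuantityOnHand + QuantityOnOrder <= 3)
    );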
This is something that you would probably handle through the applica-
tion. When the customer tries to pay by credit card, the application must first
get an approval. This is usually done passing the credit card number and
amount to a credit card authorization service and then waiting for approval.
The approval is often accompanied by some kind of verification code for track-
ing and audit history, just in case there is any kind of question in the future.
What if there’s a problem, such as the phone lines being down, or the com-
pany providing authorization being unavailable for some reason? You might have
an additional business rule, such as:
▲ If credit card purchases cannot be automatically approved, then manager
override is required.
FOR EXAMPLE
Using Business Rules
Business rules will relate directly to the business for which you are design-
ing the database. The types of business rules you encounter for a retail sales
company will probably be quite different than the rules for a credit card
company. In either case, the rules will be a design factor.
Consider a situation where you’re designing a database for a company
that sells home improvement products. You might have a small number of
customers, such as contractors, who are allowed to charge their purchases
and pay later. When someone tries to charge an order, the person making
the sale must determine the following:
• Does the customer have an established account?
• Does the customer have credit available?
• Can the customer charge these types of items?
• Does the customer get the same discount on the order when charging?
Each of these would result in business rules. You would have a rule that
you can make charge sales only if the customer has an account and is not
over the credit balance. Any other attempts to charge an order would be
blocked. Also, you must have a way to keep track of the customer’s current
balance and credit limit. If the customer can charge only certain items, there
would have to be some way of identifying these items in the database and
associating them with customers as appropriate. If the customer normally gets
a discount, but the amount changes when the order is charged, the change
would need to apply automatically. Discount amounts would have to be
stored somewhere in the database and associated with the customer.
Once again, you would build this into the application. It would need to rec-
ognize that automatic approval didn’t occur because of a problem. Maybe you
would want to write the application so that the salesperson must manually report
the problem. The application would then need a way of accepting and verifying
the manager override and allowing the transaction.
SELF-CHECK
• List the steps in the design process and decisions that must be
made in each.
• Explain the importance of business rules and how they can be
identified.
• Compare the roles of transactional and decision support databases.
• Explain why a small business might use a hybrid database to sup-
port its database applications.
the information that you need to track about each of the entities. Together, these
give you your basic working set of data.
In the real world, entities never really stand alone; they are typically associ-
ated with each other, or intersect in some fashion. Parents are associated with
their children, automobile parts are associated with the finished automobile, fire-
fighters are associated with the fire engines to which they are assigned, and so
forth. Customers and the items they buy intersect at the point of sale. Recog-
nizing and recording the associations and intersections among entities provides
a far richer description of an environment than recording the entities alone. In
order to intelligently and usefully deal with the associations or relationships
between entities, you must be able to identify and describe both the entities and
the relationships.
Now, let’s stop for a moment, to look at pay rates closer. What’s the advan-
tage to tracking pay rate as a separate entity? Each employee will probably just
have one base pay rate? Consider the situation though, where you have set pay
rates and the employee’s pay rate must match up with one of these. You would
track pay rate as a separate entity then so that you could use those values to
validate the pay rate assigned to an employee.
Sometimes, something might initially look like a single entity, but as you
look closer, you need to break it down into two or more entities. Take purchase
orders placed with your vendors, for example. You issue a purchase order to a
specific vendor to order inventory items. There are two basic parts to the pur-
chase order. First, is the general information—things like who the order is for,
the date it is being placed, the purchase order number, and so forth. The sec-
ond part is the list of individual items that you are ordering. Each of these should
be treated as a separate entity. That’s because each is a separate, physical “thing,”
that you can see, touch, and describe. Each item is further described by its place
in the item list, the quantity you are ordering, the item cost, and so forth. These
are all pieces of data that can vary by item, the item attributes.
Attributes
As you define your entities, you also need to define attributes, which, as you
already learned, are the information you track about an entity. Attributes, for the
most part, will be entity-specific, because you don’t keep the same information
about every entity. For employees, you might track information such as:
▲ Hire date
▲ Pay code or rate
▲ Social security number (for United States employees)
▲ Employee number
▲ Home address and phone number
This isn’t meant as a complete list. The attributes that you need to track
about an entity vary not only by entity, but also by business. Some businesses
might, for security reasons, want to keep a recent employee picture or bio-
metric data (uniquely identifying information about a person, such as a fin-
gerprint, voice print, or retinal pattern) for employee verification. However, if
you don’t use biometric verification, there’s no reason to track it. Attributes
should include all information that is needed for business purposes, and only
information that is needed for business purposes. Any other information is, at
best, a waste of space and, when dealing with people, a possible invasion of
their privacy.
One attribute will be used as the entity’s identifying value, referred to as the
entity’s identifier or primary key. It must be something that distinguishes each
instance of that entity from every other instance of that entity. Examples you
might encounter include:
In the United States, a common choice for the employee primary key is the
employee’s social security number (SSN). Every employee is required to have one
by law, so you know that it will be present as an attribute. Also, each employee’s
SSN is unique, so it uniquely identifies each instance of the entity employee.
Why doesn’t everyone use the SSN as the primary key? You might want to
treat that information as sensitive, to help prevent identity theft, for example. In
that case, you need to “make up” a primary key. It’s common to have the data-
base system or application generate these values for you automatically. Many
database systems let you add an attribute that is commonly known as an iden-
tifier, based on integer values and often using a one-up numbering system to
identify each entity instance.
This is another case where it’s important that you understand the context in
which a term is used. The term identifier can refer to any value you choose as
the entity’s primary key. It can also refer, as we have just seen, to an automati-
cally generated value that is used as an identifier or primary key.
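For example, using the SQL standard identity syntax (the exact keywords vary by
product; some use AUTO_INCREMENT or a separate sequence object instead), a made-up
Employee table could generate its own primary key values:

    CREATE TABLE Employee (
        EmployeeID   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        EmployeeName VARCHAR(60) NOT NULL,
        HireDate     DATE,
        SSN          CHAR(11)   -- kept as sensitive data, not used as the key
    );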
Note that in the case of employees, you must uniquely identify each indi-
vidual employee. For inventory items, you would probably just want to identify
each unique type of item. For example, you would want to uniquely identify 12-
foot fiberglass ladders as a unique type of item, but if you have 100 ladders, you
wouldn’t want to keep track of each one of them separately. Instead, you would
manage them as a group, identifying the group and the number of items in that
group that you have on hand.
Be careful to select a value that doesn’t change as the primary key attribute
(identifier). Each item probably has a unique shelf location, for example, but
there’s also a possibility that the location could change in the future. Typically,
once an identifying attribute is assigned to an item, it rarely (if ever) changes.
Relationships
That brings us to relationships. We discuss relationships in detail later in this chap-
ter, but let’s look at a couple of simple examples now to introduce the concept.
Looking at employees and Figure 3-6, each employee must turn in a weekly time
card in order to get paid. This gives us a relationship between employees and time
cards, employees and pay checks, and possibly, depending on how you want to
track things in the database, between time cards and paychecks. You also have
Figure 3-6
Time Card and Accrued Vacation entities.
Figure 3-7
Time Card as the referencing entity and Employee as the referenced entity.
also have a second use. Let’s say that you plan to keep all line items for all purchase
orders in a single table. You need a way to uniquely identify each item. An obvious
choice is to use two columns together as the primary key, the purchase order num-
ber and the item’s line item number, with each combination creating a unique value.
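A hedged sketch of that design, using hypothetical table and column names, might
look like the following; the line-item table uses the purchase order number plus
the line number as a two-column primary key, and the purchase order number also
serves as a foreign key back to the header table.

    CREATE TABLE PurchaseOrder (
        PONumber   INTEGER PRIMARY KEY,
        VendorID   INTEGER NOT NULL,
        OrderDate  DATE NOT NULL
    );

    CREATE TABLE PurchaseOrderLine (
        PONumber   INTEGER NOT NULL,
        LineNumber INTEGER NOT NULL,
        ItemNumber INTEGER NOT NULL,
        Quantity   INTEGER NOT NULL,
        ItemCost   DECIMAL(10,2),
        PRIMARY KEY (PONumber, LineNumber),
        FOREIGN KEY (PONumber) REFERENCES PurchaseOrder (PONumber)
    );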
As the name implies, relationships are essential to a relational database. Iden-
tifying and defining relationships is a major part of the data modeling process.
You will find, as you go through the process, that more and more relationships
become apparent. A key part of the design is identifying which relationships are
important and including only those relationships in your design. Otherwise, you
risk the design becoming too complicated and confusing to be of any use. Keep
in mind that your database doesn’t sit alone in the universe as an isolated entity.
Figure 3-8
A Customer view based on the Customer table.
Views
Once you deploy your database and as you define user access requirements and
create your application, you’ll usually find that the tables and columns that you
define don’t necessary match up directly with users’ data requirements. What do
you do then? You can use another database object, known as a view, to further
organize the data for viewing and manipulation. You will typically create a view
because you want to limit user access to table data, want to provide custom data,
or want to combine the contents of two or more tables.
When you create a view, you define the base objects, the objects used to
create the view, which will be tables or (in some cases) other views. As part of
the process, you identify the columns displayed through the view. Users who
access the data through the view see only those columns included in the view
definition, thereby limiting the user’s access. You might do this as a security mea-
sure to protect sensitive data or simply to avoid confusing the user with too
much unnecessary data.
A simple example of this is shown in Figure 3-8. The Customer table has
three columns: Customer Number, Customer Name, and Postal Code. If you
want to limit users to just the Customer Number and Customer Name, you could
create the view shown. You would then give users access to the view, but not
the underlying table.
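Using the table from Figure 3-8, one sketch of such a view might look like this
(column names are assumed from the figure, and the exact permission syntax varies
by product):

    CREATE VIEW CustomerView AS
        SELECT CustomerNumber, CustomerName
        FROM Customer;

    -- Users are granted access to CustomerView rather than to Customer,
    -- so they never see the Postal Code column.
    -- GRANT SELECT ON CustomerView TO sales_staff;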
Views can provide a customized look at table information. You can reorder the
columns and have them appear in any order you want. You can also define columns
that are computed based on the contents of other columns. For example, you could
have a view based on employee records and time cards. The database stores the
employee’s rate of pay and hours worked, but you create a view that multiplies
these values and returns instead the amount the employee should be paid.
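A sketch of such a computed view, assuming hypothetical Employee and TimeCard
tables with PayRate and HoursWorked columns, might be:

    CREATE VIEW GrossPay AS
        SELECT e.EmployeeNumber,
               t.WeekEnding,
               t.HoursWorked * e.PayRate AS AmountDue
        FROM Employee e
             JOIN TimeCard t ON t.EmployeeNumber = e.EmployeeNumber;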
During database design and data modeling, you often end up breaking data
up into smaller, more specific entities and creating smaller tables. This is done
for various reasons, but it may be that many of these tables provide little useful
information when standing alone. Instead, to get any really useful information it
may be necessary to combine the contents of two or more tables. You can cre-
ate a view based on these tables, extracting the columns that you need and pre-
senting them in the order that best suits the users' requirements.
Indexes
Indexes are a key part of database organization and optimization. A database
index has a similar function to an index in a book: it organizes and sorts the
data and provides a pointer to the specific physical location of the data on the
storage media.
An index can be based on one or more table columns. The database engine
organizes the data based on both the columns selected and the order in which
they are selected. You can define multiple indexes on a table, each with a dif-
ferent set of key columns, thereby organizing the data in different ways. The
query processor selects the optimum index or indexes to use for data retrieval.
Why different indexes? Here’s an example. Your database includes a Cus-
tomer table. Most reports run against the table sort data by last name and then
first name. It makes sense to have an index that uses the columns containing
the last name and first name values to make it easier to retrieve the data in that
order. However, you also run reports by geographic region, sorted by postal code
and address. To support this, you might create a separate index based on the
columns containing address information.
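In SQL, the two indexes on the Customer table might be created as follows (index
and column names are illustrative only):

    -- Supports reports sorted by last name, then first name.
    CREATE INDEX IX_Customer_Name
        ON Customer (LastName, FirstName);

    -- Supports reports sorted by geographic region.
    CREATE INDEX IX_Customer_Region
        ON Customer (PostalCode, Address);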
Most database management systems automatically create an index for a table’s
primary key, sometimes called the primary index. In that case, additional
indexes are called secondary indexes. Indexes can also be referred to as clus-
tered and nonclustered. The primary index is usually a clustered index. With a
clustered index, the data is physically sorted in the index order. This is used
to organize the data in the order in which it is most often needed. Secondary
indexes are usually all nonclustered indexes. Nonclustered indexes organize the
data for retrieval, but through the index only, and have no effect on the table.
The most common type of index used in modern databases is the balanced
tree or B-tree index (sometimes also called a binary tree index, although a
B-tree node can have more than two children), based on nodes with pointer
values directing you, in the end, to the desired data. An example of a primary
index created as a B-Tree index, along with index node contents, is shown in
Figure 3-9, illustrating how pointers are used to search for a specific record.
Figure 3-9
B-tree index (index nodes contain left pointer, data pointer, key value, and
right pointer fields; leaf entries for key values 21043, 47374, and 92654
point into the data file).
FOR EXAMPLE
Entities, Attributes, and Objects
Suppose that you are creating a database for a private golf course. It includes
full members who have access to all facilities and limited members who are
allowed on the course and in the Pro Shop only. The Pro Shop keeps a small
inventory of items available for purchase and also keeps track of tee time
reservations and golf carts.
There are several entities that should come to mind immediately, such
as members, employees, inventory, and golf carts. These are all tangible enti-
ties. What about intangible entities? One already mentioned is tee times, and
others will probably come to mind as you think about activities around a
golf course and private clubhouse.
However, some decisions might not be as obvious as they would seem at
first glance. For example, how many tables will you need to track the member
entity? Should full and limited members be tracked in the same table or do you
need separate tables? This points out the need for gathering as much informa-
tion as possible before you make modeling decisions. Right now, you probably
don’t know enough about the data requirements to make that decision, so you
need to go back to your human data sources and ask more questions.
Each node contains two pointers to nodes on the left and right sides. Leaf
nodes, which are the nodes located at the bottom of the index, also contain a
pointer to where the data for the key value is physically stored. To see how this
works, trace the search for a data record with the key value 92654. You start
at the root node, which contains the key value 51247. The search value 92654 is greater
than 51247, so follow the right pointer from the root and go to the index record
with key value 82928. From here, you follow the right pointer which takes you
to the index record with key value 92654. This is the index record you are look-
ing for. The data pointer from this index record gives the physical location
needed to retrieve the data record.
SELF-CHECK
• Explain the role of entities, attributes, and relationships in data
modeling.
• Describe how logical entities relate to physical tables.
• Explain the term referential integrity and how it is maintained in a
relational database.
The majority of the relationships in most relational databases are binary rela-
tionships. An entity can have separate binary relationships with any number of
other entities in the database. Each binary and unary relationship can also be
described as one-to-one, one-to-many, or many-to-many. These terms are
described a little later in this chapter.
This section uses examples based on a fictitious company. The company has
a sales staff that sells directly to its customers. Salespersons are assigned specific
customer accounts and normally sell to those accounts only. Each customer also
has a backup salesperson who handles the customer's needs if its primary
salesperson is unavailable.
Figure 3-10
The many-to-many Sells relationship between SALESPERSON (Salesperson Number,
Salesperson Name, Commission Percentage, Year of Hire) and PRODUCT (Product
Number, Product Name, Unit Price).
Figure 3-11
Binary relationships and cardinalities: (a) a one-to-one relationship between
SALESPERSON and OFFICE, (b) a one-to-many relationship between SALESPERSON and
CUSTOMER (for example, Sam's Hardware), and (c) a many-to-many Sells relationship
between SALESPERSON and PRODUCT.
near the second entity, and then finally reach the other entity. Thus, Figure 3-11a,
reading from left to right, says, “A salesperson works in one (really at most one,
since it is a maximum) office.” The bar or one symbol involved in this statement
is the one just to the left of the office entity box. Conversely, reading from right
to left, “An office is worked in by (is assigned to) one salesperson.”
The symbols used here represent one standard way of documenting an
E-R model, but are not the only possibility. Your options will depend some-
what on the modeling tool you use. Some modeling tools show all relationships
as simply a line between the entities with no additional information provided.
Figure 3-12
Binary relationships and modalities: (a) the one-to-one Works in relationship
between SALESPERSON and OFFICE, (b) the one-to-many relationship between
SALESPERSON and CUSTOMER, and (c) the many-to-many Sells relationship between
SALESPERSON and PRODUCT, with modality symbols showing the minimum number of
occurrences (none or one) on each side.
Figure 3-13
Unary relationships: (a) the one-to-one Backup relationship among salespersons,
(b) the one-to-many Manages relationship among salespersons, and (c) the
many-to-many relationship among products, in which a product can be part of no
other products or of several other products.
and so on until ultimately they form the entire object. Each basic part and each
subassembly can be thought of as a “part” of the object. Then, the parts are in
a many-to-many unary relationship to each other. Any one particular part can
be made up of several other parts, while at the same time itself being a compo-
nent of several other parts.
In Figure 3-13c, think of the products sold in hardware and home improve-
ment stores. Basic items like hammers and wrenches can be combined and sold
as sets. Larger tool sets can be composed of smaller sets plus additional single
tools. All of these single tools and sets of all sizes can be classified as products.
Thus, as shown in Figure 3-13c, a product can be part of no other products or
part of several other products. Going in the reverse direction, a product can be
composed of no other products or can be composed of several other products.
Figure 3-14
Ternary relationship: the Sale relationship among CUSTOMER, SALESPERSON, and
PRODUCT, with Date and Quantity as attributes of the relationship.
simply and directly describe the entity, with no complications involving other
entities. Putting quantity in either the salesperson or the product entity box
just will not work.
Understanding Intersection Data
The quantity attribute doesn’t describe either the salesperson alone or the product
alone. It describes the combination of a particular occurrence of one entity type
and a particular occurrence of the other entity type. The quantity 170 doesn’t
make sense as a description or characteristic of salesperson number 137 alone.
She sold many different kinds of products. To which one does the quantity
170 refer? Similarly, the quantity 170 doesn’t make sense as a description or
characteristic of product number 24013 alone. It was sold by many different
salespersons.
In fact, the quantity 170 falls at the intersection of salesperson number 137
and product number 24013. It describes the combination of or the association
between that particular salesperson and that particular product, and it is known
as intersection data. This is similar to the mapping table described earlier, but
containing data in addition to the associations. Figure 3-15 shows the many-to-
many relationship between salespersons and products with the intersection data,
quantity, as represented in a special five-sided intersection data box. Notice that
the intersection data box is attached to the relationship diamond between the
two entity boxes. That is the natural place for it to be drawn. Pictorially, it looks
like it is at the intersection between the two entities, but there is more to it than
that. The intersection data describes the relationship between the two entities.
We know that an occurrence of the Sells relationship specifies that salesperson
137 has sold some of product 24013. The quantity 170 is an attribute of that
Figure 3-15
Intersection data: the many-to-many Sells relationship between SALESPERSON and
PRODUCT, with Quantity attached to the relationship as intersection data.
Figure 3-16
Associative entity.
Defining Uniqueness
Since, as we have just seen, a many-to-many relationship can appear to be a
kind of an entity, complete with attributes, it also follows that it should have a
unique identifier, like other entities. In its most basic form, the unique identi-
fier of the many-to-many relationship or the associative entity is the combina-
tion of the unique identifiers of the two entities in the many-to-many relation-
ship. So, the unique identifier of the many-to-many relationship of Figure 3-15
or the associative entity of Figure 3-16 is the combination of the Salesperson
Number and Product Number attributes.
Sometimes, an additional attribute or attributes must be added to this com-
bination to produce uniqueness. This often involves a time element. For exam-
ple, if we wanted to keep track of the sales on a weekly basis, we would have to
have a date attribute or a week number attribute as intersection data and the
unique identifier would be Salesperson Number, Product Number, and Date. If
we want to know how many units of each product were sold by each salesper-
son each week, the combination of Salesperson Number and Product Number
would not be unique because for a particular salesperson and a particular prod-
uct, the combination of those two values would be the same each week! Date
must be added to produce uniqueness, not to mention to make it clear which
week a particular value of the Quantity attribute applies to a particular sales-
person product combination.
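Expressed as a relational table, this weekly version of the associative entity
might be sketched as follows (table name and data types are assumed); the
three-column primary key provides the uniqueness just described.

    CREATE TABLE WeeklySales (
        SalespersonNumber INTEGER NOT NULL,
        ProductNumber     INTEGER NOT NULL,
        WeekEndingDate    DATE    NOT NULL,
        Quantity          INTEGER NOT NULL,
        PRIMARY KEY (SalespersonNumber, ProductNumber, WeekEndingDate)
    );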
The third and last possibility occurs when the nature of the associative entity
is such that it has its own unique identifier. For example, a company might
specify a unique serial number for each sales record, in which case the
combination of Salesperson Number, Product Number, and Date isn't needed. Another
example would be the many-to-many relationship between motorists and police
officers who give traffic tickets for moving violations. The unique identifier
could be the combination of police officer number and motorist driver's license
number plus perhaps date and time. But, typically, each traffic ticket has a
unique serial number, and this would serve as the unique identifier.
FOR EXAMPLE
Describing Relationships
It is common to have multiple relationships between one entity and other
entities. Consider this situation. Your company does telephone sales to estab-
lished customers. Each customer has a customer ID. Each employee is iden-
tified by an employee ID. Whenever an order is placed, it must include both
the employee ID of the employee creating the order and the customer ID of
the customer placing the order.
Here, we have two separate one-to-many relationships. One is
between customers and orders—each customer can have multiple orders.
The other is between employees and orders—each employee can create
multiple orders. Even though they are related, there is no need to explic-
itly identify a relationship between employees and customers. This is
taken care of for you, because the two are associated through the order
entity.
SELF-CHECK
• Compare binary and unary relationships.
• Give one example each of a one-to-one, one-to-many, and many-
to-many relationship.
• Discuss the role of using a mapping table when implementing a
many-to-many relationship.
• Explain the role of intersection data.
▲ Manual
▲ Generic draw program
▲ Custom modeling program
It could be that the only modeling tools you need are a pencil and a piece
of paper. When designing a relatively simple database with only a few rela-
tionships, this is often the fastest method. You could possibly have the model
finished before you could even get a modeling application up and running.
And, just in case, be sure to use a pencil so you can erase your mistakes.
A common option is to use a generic drawing or modeling program.
Some, like Microsoft Visio, include templates and symbols specific to data-
base modeling so that you can produce models similar to the ones you’ve
seen here. It also includes an object modeling template you can use to cre-
ate object-relation models (ORM) for object-relational databases. ORMs,
because of their flexibility, also are sometimes used when modeling relational
databases.
There are also applications available that are specifically designed for data-
base modeling. Some are model-specific while others support a variety of
standard modeling languages and templates. If you only plan to design and
implement one, or very few, databases in your career, there probably isn’t
much justification for buying an application specific to that purpose. How-
ever, if you plan to do this on a regular basis, you may find it worth your
while.
One advantage of both Visio and some drawing applications is that you have
some flexibility in what is shown in your model. For example, Figure 3-17
shows two different options for how Visio could document the same relation-
ship. The one on the left is the default—a simple line with the arrow pointing
to the referenced table. The one on the right looks more like the drawings
you’ve seen in this chapter, indicating that this is a one-to-many relationship.
You have the option of showing as many or as few details of the relationship
as you want.
Another advantage of drawing programs that include modeling templates
and modeling applications is that they do some of the checking for you to
help you avoid errors. For example, depending on the application, you might
Figure 3-17
Two ways Visio can document the same relationship between a SalesPerson table
(PK ID; FirstName, LastName, Department, Supervisor) and an OrderHead table
(PK OrderNumber; Customer, SalesPerson as FK1, Date, ID), labeled Creates/Is
Created By. The version on the left uses a simple connector; the version on the
right shows the one-to-many cardinality.
be forced to identify the key columns in the referenced and referencing enti-
ties when defining a relationship. You might also be forced to provide cardi-
nality and modality information for each relationship. This added functionality
can help you avoid mistakes when designing a very large or very complicated
database.
Figure 3-18
General Hardware E-R diagram: OFFICE (Office Number, Telephone, Size),
SALESPERSON (Salesperson Number, Salesperson Name, Commission Percentage, Year
of Hire), CUSTOMER (Customer Number, Customer Name, HQ City), CUSTOMER EMPLOYEE
(Employee Number, Employee Name, Title), and PRODUCT (Product Number, Product
Name, Unit Price), with the Works In and Sells to relationships shown.
modality symbol) or one salesperson (the one or bar cardinality symbol). Start-
ing again at the SALESPERSON entity box and moving to the right, a salesper-
son has no customers or many customers. Remember that the customers are
hardware or home improvement stores. The CUSTOMER entity has three attrib-
utes and Customer Number is the unique identifier. In the reverse direction, a
customer must have exactly one salesperson.
From the CUSTOMER entity downward is the CUSTOMER EMPLOYEE
entity. According to the figure, a customer must have at least one but can have
many employees. An employee works for exactly one customer. This is actually
a special situation. General Hardware has an interest only in maintaining data
about the people who are its customers’ employees as long as their employer
remains a customer of General Hardware. If a particular hardware store or home
improvement chain stops buying goods from General Hardware, then General
Hardware no longer cares about that store’s or chain’s employees. Furthermore,
while General Hardware assumes that each of its customers assigns their employ-
ees unique employee numbers, those numbers can only be assumed to be unique
within that customer store or chain. Thus the unique identifier for a customer
employee must be the combination of the Customer Number and the Employee
Number attributes. In this situation, CUSTOMER EMPLOYEE is called a depen-
dent entity. As shown in the CUSTOMER EMPLOYEE entity box in Figure 3-18,
a dependent entity is distinguished by a diagonal hash mark in each corner of
its attribute area.
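One possible relational rendering of this dependent entity (data types assumed,
and assuming a CUSTOMER table already exists) uses the two-part key and a foreign
key back to CUSTOMER:

    CREATE TABLE CustomerEmployee (
        CustomerNumber INTEGER NOT NULL,
        EmployeeNumber INTEGER NOT NULL,
        EmployeeName   VARCHAR(60),
        Title          VARCHAR(40),
        PRIMARY KEY (CustomerNumber, EmployeeNumber),
        FOREIGN KEY (CustomerNumber) REFERENCES Customer (CustomerNumber)
    );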
Returning to the SALESPERSON entity box and looking downward, there is
a many-to-many relationship between salespersons and products. A salesperson
is authorized to sell at least one and possibly (probably, in this case) many prod-
ucts. A product is sold by at least one and possibly many salespersons. The
PRODUCT entity has three attributes, with Product Number being the unique
identifier. The attribute Quantity is intersection data in the many-to-many rela-
tionship, meaning that the company is interested in keeping track of how many
units of each product each salesperson has sold.
Figure 3-19
A second example data diagram (partial): a CUSTOMER entity (Customer Number,
Customer Name, Street, City, State, Country) and a Bought relationship carrying
Date, Price, and Quantity attributes.
SELF-CHECK
• Describe the relationship in Figure 3-17, including all of the informa-
tion available in the diagram.
• Explain why it is helpful to study other data diagrams before creat-
ing your own data model.
SUMMARY
This chapter introduced data modeling and the use of data diagrams. We started
with a general look at the design process and some of the decisions you must
make about the type of database you need to design and your design goals. We
looked at relational database modeling and the E-R model in some detail, includ-
ing an introduction to some fundamental database objects. You were introduced
to the various types of relationships that might be included in your database
model. Finally, we talked about modeling tools and compared two example data
diagrams.
KEY TERMS
Associative entity, Balanced tree index, Base object, Binary relationship,
Binary tree index, B-tree index, Bulk loading, Business rules, Cardinality,
Clustered index, Conceptual design, Concurrency, Database object, Data diagram,
Data mart, Data model, Data warehouse, Decision support system (DSS),
Entity-Relationship diagram (ERD), Entity-Relationship (E-R) modeling, Field,
Hybrid database, Identifier, Index, Intersection data, Large object (LOB) data,
Leaf node, Logical design, Many-to-many (M-M) binary relationship
Summary Questions
1. Entities and attributes are identified during logical design. True or False?
2. A transactional database must support which of the following activities?
(a) adding new data
(b) changing existing data
(c) deleting data
(d) all of the above
3. Your design must ensure that any customer purchases at least $20.00 in
merchandise. This is an example of which of the following?
(a) entity definition
(b) business rule
(c) attribute
(d) relational integrity
4. Each table in your database typically represents which of the following?
(a) an entity
(b) an attribute
(c) a relationship
(d) a data diagram
5. You are defining an employee entity. Which of the following would you
likely use as the identifier?
(a) first name
(b) last name
(c) employee ID
(d) hire date
(e) none of the above
6. A foreign key is based on a unique value in the referenced entity. True or
False?
7. Which statement best describes attributes?
(a) You should have a table for each attribute defined in your model.
(b) Any attribute can be used as an entity’s identifier.
(c) Attributes describe entities.
(d) Attributes are used in object modeling only.
17. Which term refers to the maximum number of entities that can be
involved in a relationship?
(a) cardinality
(b) modality
(c) intersection
(d) association
18. What kind of relationship can be used when relating an entity to
instances within itself?
(a) one-to-one
(b) one-to-many
(c) many-to-many
(d) all of the above
19. Which of the following can you use to create E-R diagrams?
(a) pencil and paper
(b) a generic drawing program
(c) a custom modeling program
(d) all of the above
20. When reading an E-R diagram, how is a “one” relationship identified?
(a) by an arrow
(b) by a bar
(c) by a crow’s foot
(d) no special symbol is used to identify a “one” relationship
Figure 3-20
A CUSTOMER table (PK ID; Name, City, PostalCode, Phone) and an AccountBalance
table (PK AccountNum; CustomerID, Current, Due, PastDue).
home address, home telephone number, current grade, and age. With
regard to a student’s school assignment, the school system is only
interested in keeping track of which school a student currently attends.
Each school has several administrators, such as the principal and
assistant principals. Administrators are identified by an employee number
and also have a name, telephone number, and office number. What
attributes are described for the entity STUDENT? Which attribute should
you use as the entity’s identifier? What is the relationship between
student and school? What is the relationship between schools and
administrators?
YOU TRY IT
Figure 3-21
E-R diagram for an orchestra database: ORCHESTRA (Orchestra Name, City, Country,
Music Director), MUSICIAN (Musician Number, Musician Name, Instrument, Annual
Salary, plus Degree, University, and Year earned), COMPOSITION (Composition Name,
Year), and COMPOSER (Composer Name, Country, Date of Birth), with the Works for,
Recorded (Year, Price), and Wrote relationships.
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of database
design.
Determine where you need to concentrate your effort.
INTRODUCTION
As you’ve learned, several steps are involved in designing a database to support
a business and business applications. You must collect business data in the form
of business documents, employee interviews, and other data sources. You use
this data to generate an entity-relationship (E-R) diagram that describes the busi-
ness and its requirements. From this diagram you will create the logical and then
physical design of your database.
In this chapter, we’ll focus on the next step in the process: identifying the
relational tables you will create based on the entities in the E-R diagram and
then going through a process known as normalization, which will help you iden-
tify duplicate data and optimize data storage. Your final result is a finished rela-
tional design from which you can identify the database objects you need to
implement your physical database.
Figure 4-1
SALESPERSON entity (Salesperson Number, Salesperson Name, Commission Percentage,
Year of Hire).
Figure 4-2
SALESPERSON table.
the same name, with a column for each of its attributes. This is shown in Figure
4-2. Notice that Salesperson Number is underlined to indicate that it is the entity’s
unique identifier and will be used as the table’s primary key.
Figure 4-3
The SALESPERSON and OFFICE entities in a one-to-one relationship.
Figure 4-4
The combined SALESPERSON/OFFICE table.
OFFICE entities. Each Salesperson works in one Office. Office, in this case,
identifies a single office cubicle.
There are at least three options for designing tables to represent this data.
The first option is shown in Figure 4-4, with the two entities combined into one
relational table. This design is possible because the one-to-one relationship
means that for one salesperson, there can only be one office associated with the
salesperson and conversely, for one office there can be only one salesperson.
Because of this relationship, a particular salesperson and office combination
could be stored as one record, as shown.
There are three reasons why Figure 4-4 is not a good data design. The first
two can be determined from the diagram in Figure 4-3. First, the very fact that
salesperson and office were drawn in two different entity boxes in the E-R dia-
gram indicates that they are thought of separately in this business environ-
ment. This means that they should be kept separate in the database. Second
is the modality of zero at the SALESPERSON entity. Reading Figure 4-3 from
right to left, it says that an office might have no one assigned to it. In the table
shown in Figure 4-4, a few or possibly many record occurrences could have
values for the office number, telephone, and size attributes but have the four
attributes pertaining to salespersons empty or null. A null value is an undefined
value, used to indicate that no value has been provided for that attribute. Even
so, a null is still considered a valid value, and the fact that an attribute is
undefined can itself convey useful information. Storing many such half-empty
records, however, wastes storage space. It also means that the Sales-
person Number cannot be declared to be the primary key of the table, because
there would be records with no primary key values, which is not allowed.
Before going on, there are a couple of points about storage costs and the rela-
tionship between database design and data storage that you should be aware
of. Even though the cost per byte has dropped significantly over the years,
wasted space continues to be an issue deserving consideration in your data-
base design and implementation. Inefficient design and space use can lead to
inefficient indexes and could mean less than optimal performance. That’s not
the only performance issue. As database tables (and the database as a whole)
grow, access performance tends to suffer because of the increased volume of
data involved. Also, just because the price of storage has dropped, installing
additional storage when you run out of space can be time consuming and usu-
ally means down time for the database, which typically is very expensive.
The third reason is not visible from the partial E-R diagram in Figure 4-3.
However, when you look at the full E-R diagram in Figure 4-5, the reason
Figure 4-5
The full General Hardware E-R diagram: OFFICE (Office Number, Telephone, Size),
SALESPERSON (Salesperson Number, Salesperson Name, Commission Percentage, Year
of Hire), CUSTOMER (Customer Number, Customer Name, HQ City), CUSTOMER EMPLOYEE
(Employee Number, Employee Name, Title), and PRODUCT (Product Number, Product
Name, Unit Price), with the Works in and Sells to relationships shown.
Figure 4-6
SALESPERSON
OFFICE
Office
Number Telephone Size
Figure 4-7
SALESPERSON
OFFICE
Office Salesperson
Number Telephone Number Size
For example, perhaps lots of the salespersons travel most of the time and don't
need offices.
While we're in a "what if" mode, what if the modality were zero on both sides
and we didn't have the other relationships to consider? Then we would have to
make a judgment call between the designs of Figure 4-6 and Figure 4-7. If the
goal is to minimize the number of null values in the foreign key, then we have
to decide whether it is more likely that a salesperson is not assigned to an office
or that an office is empty.
Figure 4-8: The SALESPERSON and CUSTOMER entities with the Sells to relationship (SALESPERSON: Salesperson Number [identifier], Salesperson Name, Commission Percentage, Year of Hire; CUSTOMER: Customer Number [identifier], Customer Name, HQ City).
Figure 4-9: SALESPERSON and CUSTOMER tables.
The unique identifier of the entity on the "one side" of the one-to-many relationship is placed as a foreign key in the table representing the entity on the "many side." In this case, the
Salesperson Number attribute is placed in the CUSTOMER table as a foreign key.
Each salesperson has one record in the SALESPERSON table, as does each customer
in the CUSTOMER table. The Salesperson Number attribute in the CUSTOMER
table links the two, and since the E-R diagram tells us that every customer must
have a salesperson, there are no empty attributes in the CUSTOMER table records.
This solution also fits the full E-R diagram in Figure 4-5. The same Sales-
person Number attribute used to establish a relationship with the CUSTOMER
table can also be used in the relationship between SALESPERSON and OFFICE.
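Although SQL table definitions are not covered until Chapter 5, a minimal sketch of this foreign key placement might look like the following (the condensed column names and data types here are assumptions, not part of the General Hardware design):

CREATE TABLE SALESPERSON
(SalespersonNumber int NOT NULL PRIMARY KEY,
SalespersonName nvarchar(40),
CommissionPercentage int,
YearOfHire int)

CREATE TABLE CUSTOMER
(CustomerNumber int NOT NULL PRIMARY KEY,
CustomerName nvarchar(40),
HQCity nvarchar(40),
-- Foreign key: the "one side" identifier stored on the "many side"
SalespersonNumber int NOT NULL
    REFERENCES SALESPERSON (SalespersonNumber))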
(Figure: the SALESPERSON and PRODUCT entities related by the many-to-many Sells relationship, with intersection data Quantity. SALESPERSON: Salesperson Number [identifier], Salesperson Name, Commission Percentage, Year of Hire; PRODUCT: Product Number [identifier], Product Name, Unit Price.)
Figure 4-11
Figure 4-12: SALESPERSON, PRODUCT (Product Number, Product Name, Unit Price), and SALE (Salesperson Number, Product Number, Quantity) tables.
The key might also include another attribute, such as a date and time stamp, to
ensure unique values. A key that is defined by multiple columns is called a
composite key.
The table based on the associative entity can include additional attributes
representing intersection data. This is shown as the Quantity attribute in this
example. However, this is not a requirement. The associative table often
includes only the data values needed to establish and maintain the many-to-
many relationship.
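As a sketch, again using assumed SQL data types, the associative table and its composite key might be defined like this:

CREATE TABLE SALE
(SalespersonNumber int NOT NULL,   -- identifier of one related entity
ProductNumber int NOT NULL,        -- identifier of the other related entity
Quantity int,                      -- intersection data (optional)
PRIMARY KEY (SalespersonNumber, ProductNumber))  -- composite key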
Figure 4-13: The SALESPERSON entity (Salesperson Number [identifier], Salesperson Name, Commission Percentage, Year of Hire) with the unary Backup relationship.
Figure 4-14: SALESPERSON table.
Figure 4-15: The SALESPERSON entity with the unary Manages relationship.
Figure 4-16: SALESPERSON table.
Figure 4-17: The PRODUCT entity (Product Number [identifier], Product Name, Unit Price) with the unary Part of relationship and intersection data Quantity.
Figure 4-18: PRODUCT table (Product Number, Product Name, Unit Price) and COMPONENT table (Product Number, Subassembly Number, Quantity).
FOR EXAMPLE
Requirements for Converting Diagram Entities
The basic requirements for creating an E-R diagram and relational tables are
essentially the same. However, the reason behind these requirements
becomes more evident as you get closer to your final physical model.
Consider this situation. You are responsible for the initial interviews and
data gathering. You are expected to sketch out a simple E-R diagram of the
company, which is then handed off to an analyst who will create the rela-
tional table design. You sketch the model and hand it off, but you haven’t
determined any identifiers for the entities. Can you see the problem this is
going to cause?
The analyst can describe the tables, but not the relationships between
the tables. In the relational model, you need two things to establish a relationship: a primary key in the referenced table and a matching
foreign key in the referencing table. Because you did the initial research, you
are the best person to recommend identifiers for the initial entities, which
are then converted to primary keys in the relational table design.
SELF-CHECK
• Describe the information you need to convert related entities into
related tables and how you determine primary and foreign key
columns.
• Explain why an additional table is needed when converting a many-
to-many entity relationship into related tables.
• In a binary one-to-zero-or-one relationship with no other related
entities, explain how you can determine which should be the referenced and which should be the referencing table.
Figure 4-19: The General Hardware Company E-R diagram (OFFICE, SALESPERSON, CUSTOMER, CUSTOMER EMPLOYEE, and PRODUCT entities with the Works in and Sells to relationships).
To the right of the SALESPERSON entity box in the E-R diagram, we see a one-to-many
relationship ("sells to") between salespersons and customers. The database then
includes a CUSTOMER table with the Salesperson Number attribute as a foreign
key, because Salesperson is on the "one side" and Customer is on the "many side"
of the one-to-many relationship.
Figure 4-20: The General Hardware Company relational tables: SALESPERSON, CUSTOMER, CUSTOMER EMPLOYEE, PRODUCT (Product Number, Product Name, Unit Price), SALES (Salesperson Number, Product Number, Quantity), and OFFICE (Office Number, Telephone, Size).
The PRODUCT table contains the three Product entity attributes. The
many-to-many relationship between the SALESPERSON and PRODUCT
entities is represented by the SALES table in the database. Notice that the
combination of the unique identifiers (Salesperson Number and Product Num-
ber) of the two entities in the many-to-many relationship defines the primary
key of the SALES table. Finally, the OFFICE entity has its table in the data-
base with its three attributes, which brings us to the presence of the Office
Number attribute as a foreign key in the SALESPERSON table. This is needed
to maintain the one-to-one binary relationship between Salesperson and
Office. We put the foreign key in the SALESPERSON table rather than in the
OFFICE table because the modality adjacent to SALESPERSON is zero while
the modality adjacent to OFFICE is one. An office may or may not have a
salesperson assigned to it, but a salesperson must be assigned to an office.
The result is that every salesperson must have an associated office number;
the Office Number attribute in the SALESPERSON table can’t be null. This
requirement can be easily enforced, automatically helping prevent data errors
(missing Office Number). If we reversed it and put the Salesperson Number
attribute in the OFFICE table, many of the Salesperson Number attribute val-
ues could be null since the zero modality going from office to salesperson tells
us that an office can be empty. Because the value might or might not be null,
data integrity, ensuring that the data is entered and stored correctly, is harder
to enforce.
One last thought: Why did the PRODUCT table end up without having any
foreign keys? Simply, it is because there is no situation that requires insertion of
a foreign key. It is not the “target” (it is not on the “many side”) of any one-to-
many binary relationship. It is not involved in a one-to-one binary relationship
that requires a foreign key to ensure relational integrity, that is, to maintain the
relationship. (Maintaining the relationship between referencing and referenced
tables is also called referential integrity.) Finally, the PRODUCT table is not
involved in a unary relationship that requires the primary key to be repeated
in the table.
(Figure: the Bought relationship with intersection data Date, Price, and Quantity, and the CUSTOMER entity with attributes Customer Number [identifier], Customer Name, Street, City, State, and Country.)
FOR EXAMPLE
Following Up on Your Design
You might discover during the conversion process that your E-R design isn’t
as complete as it should be. When this happens, you may find it necessary
to go back to your original data to look for missing attributes.
Consider the following situation. You are designing a database for a cus-
tomer support call center. When customers call they are automatically
routed to the next available customer support agent. The agent with whom
the customer speaks is, in this case, completely random. Your first choice
as an associative entity might be the combination of the customer and
employee identities. You would use this as the primary key.
When a customer calls back, he or she might get a different customer
service agent or, by luck of the draw, the same customer service agent. This
would mean the same combination of customer and employee identities,
and would mean a duplicate value in the primary key. This is an illegal con-
dition. You need an additional value, such as the date and time of the call,
to produce a unique primary key value.
Figure 4-22: PUBLISHER (Publisher Name, City, Country, Telephone, Year Founded), AUTHOR, BOOK, CUSTOMER (Customer Number, Customer Name, Street, City, State, Country), WRITING (Book Number, Author Number), and SALE (Book Number, Customer Number, Date, Price, Quantity) tables.
SELF-CHECK
• Explain the role of associative tables when defining many-to-many
relational tables.
• Compare and contrast associative data and intersection data and
the role of each.
• Explain how relational integrity is assured in a one-to-one or one-
to-many binary table.
Each Salesperson Number is associated with exactly one Salesperson Name and, after all, a person can have only one last name. Informally, we might say
that Salesperson Number defines Salesperson Name. If I give you a Salesperson Number, you can give me back the one and only name that goes with it.
These defining associations are commonly written with a right-pointing arrow
like this:
Salesperson Number → Salesperson Name
In the more formal terms of functional dependencies, the attribute on the
left side is referred to as the determinant attribute. This is because its value
determines the value of the attribute on the right side. Conversely, we also
say that the attribute on the right is functionally dependent on the attribute
on the left.
Data normalization is best explained with an example. In order to demon-
strate the main points of the data normalization process, we will modify part
of the General Hardware Company business environment and focus on the
SALESPERSON and PRODUCT entities. Let’s assume that salespersons are
organized into departments and that each department has a manager who is
not herself a salesperson. Then the list of attributes that we will consider is
shown in Figure 4-23.
The list of defining associations or functional dependencies is shown in
Figure 4-24. Notice a couple of fine points about the list of defining associations
in Figure 4-24. The last association:
Salesperson Number, Product Number → Quantity
shows that the combination of two or more attributes may define another
attribute. That is, the combination of a particular Salesperson Number and a particular Product Number determines a specific Quantity value.
Figure 4-23: The attributes under consideration: Salesperson Number, Salesperson Name, Commission Percentage, Year of Hire, Department Number, Manager Name, Product Number, Product Name, Unit Price, and Quantity.
Figure 4-24: The list of defining associations (functional dependencies).
The data normalization process takes the data through a number of normal forms, which are rules for data normalization. First,
we will examine what unnormalized data looks like. Then, we will work through
the three main normal forms in order: first normal form, second normal form, and third normal form.
There are certain exception conditions that have also been described as nor-
mal forms. Exception conditions are nonstandard normal forms in addition to
the three accepted standard normal forms. These include Boyce-Codd Normal
Form, Fourth Normal Form, and Fifth Normal Form. They are relatively less
common in practice and will not be covered here. Here are three additional
points to remember:
1. Once the attributes are arranged in third normal form (and if none of the
exception conditions is present), the group of tables that they comprise
is, in fact, a well-structured relational database with no data redundancy.
2. A group of tables is said to be in a particular normal form if every table
in the group is in that normal form.
3. The data normalization process is progressive. If a group of tables is in
second normal form, it is also in first normal form. If the tables are in
third normal form, they are also in second normal form.
Figure 4-25: SALESPERSON/PRODUCT table.
Figure 4-25 shows an unnormalized arrangement of the data. The attributes under consideration have been listed in one table, and
a primary key has been established. In this definition of normal forms, the
requirement for a primary key is not listed as part of any normal form, but is
considered an assumed requirement of the initial E-R diagramming process.
As the sample data in Figure 4-27 shows, the number of records has increased
compared to the unnormalized representation. Every attribute of every record has
just one value. The multivalued attributes from Figure 4-25 are eliminated.
The combination of the Salesperson Number and Product Number attributes
constitutes the table’s primary key. The business context tells us that the combi-
nation of the two provides unique identifiers for the records of the table and that
there is no single attribute that will do the job. In terms of data normalization,
Figure 4-26: SALESPERSON/PRODUCT table.
Figure 4-27: SALESPERSON/PRODUCT table.
Figure 4-28: SALESPERSON, PRODUCT, and QUANTITY tables (QUANTITY columns: Salesperson Number, Product Number, Quantity).
Figure 4-29: SALESPERSON, PRODUCT (Product Number, Product Name, Unit Price), and QUANTITY (Salesperson Number, Product Number, Quantity) tables.
Figure 4-30: SALESPERSON, DEPARTMENT (Department Number, Manager Name), PRODUCT, and QUANTITY (Salesperson Number, Product Number, Quantity) tables.
Figure 4-31: SALESPERSON, DEPARTMENT (Department Number, Manager Name; rows include 59 Lopez and 73 Scott), PRODUCT (Product Number, Product Name, Unit Price), and QUANTITY (Salesperson Number, Product Number, Quantity) tables.
In the design shown in Figures 4-30 and 4-31, each manager's name appears only once, in the DEPARTMENT table. Notice that the
Department Number attribute in the SALESPERSON table continues to indicate
the salesperson's department.
One problem is that this process can be taken to the extreme. Consider cus-
tomer addresses for example. It’s likely that you will have several customers in
the same state. You could create a separate STATE table to contain this informa-
tion and create a relationship between the CUSTOMER and STATE tables, but it’s
probably less work and requires less storage space to just store the 2-character
state abbreviation with each customer.
The other problem is that a design that makes for efficient data entry does
not always make for efficient data retrieval. During normalization, you tend to
break a table down into smaller, related tables. There is a good possibility that
at least some of your queries will require you to recombine that data into a sin-
gle result. This is done through a query process known as joining, where you
combine the data from two tables based on a linking column or columns. Typ-
ically, you will combine two related tables based on the foreign key, but that’s
not the only possibility. This can be a resource-intensive process, becoming more
intensive as more tables are added.
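For instance, a query that recombines the normalized SALESPERSON and DEPARTMENT tables on their linking column might look something like this sketch (the condensed column names are assumptions; SQL queries are covered in a later chapter):

SELECT S.SalespersonName, D.ManagerName
FROM SALESPERSON AS S
    JOIN DEPARTMENT AS D
        ON S.DepartmentNumber = D.DepartmentNumber  -- join on the foreign key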
This can be an issue in any database, but tends especially to be a problem
in decision support databases where the majority of the database activity relates to data retrieval.
FOR EXAMPLE
Finding New Tables
It’s fairly common to “discover” new tables during the normalization process.
Your E-R diagram includes an ORDER entity. For each order, you have the
customer placing the order, the employee writing up the order, the order
date, order number, and other information that applies to the order as a
whole. You also have information about individual line items, such as the
item ID, quantity, and selling price. There’s no reason, usually, to store the
extended total (quantity times selling price) because that can be calculated
whenever you need it.
If you create an ORDER table, for it to be properly normalized, you will
need a row for each line item in the order. That means that the customer,
employee, order date, and any other general information about the order are
also repeated for each line item. This could result in a significant amount
of data and wasted space. A better solution is to have two tables. One, call
it ORDERHEAD, contains the information that applies to the order as a
whole. The other, call it ORDERITEM, contains the information for each line
item. You would use the Order Number as the identifier in ORDERHEAD,
and also use it as the foreign key in ORDERITEM to maintain the relation-
ship between the two tables.
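A minimal sketch of these two tables, with assumed column names and data types, might look like this:

CREATE TABLE ORDERHEAD
(OrderNumber int NOT NULL PRIMARY KEY,   -- order-level identifier
CustomerNumber int,
EmployeeNumber int,
OrderDate datetime)

CREATE TABLE ORDERITEM
(OrderNumber int NOT NULL
    REFERENCES ORDERHEAD (OrderNumber),  -- maintains the relationship
LineItemNumber int NOT NULL,
ItemID int,
Quantity int,
SellingPrice decimal(9,2),
PRIMARY KEY (OrderNumber, LineItemNumber))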
If you regularly need the joined data, you could find it more
efficient in the long run to denormalize the data, combining two or more normalized tables into one less normalized table. For example, you might need
to draw on data from three or four different tables to generate employee pay-
checks, including columns from an EMPLOYEE table, a TIMESHEET table, a
PAYRATE table, and other tables in a single report. You might find it better to
create a separate table named EMPLOYEEPAY that contains all of this informa-
tion. Keep in mind, however, that if you also keep all of the other tables, you
are introducing duplicate data into the database. Whether or not the perfor-
mance increase is worth the additional overhead will have to be evaluated on
a case-by-case basis.
Why the concern about performance? The more operations the database has
to perform, the greater the load on resources, which can result in performance
loss. If you create a new EMPLOYEEPAY table while also keeping the same data
in the EMPLOYEE, TIMESHEET, and PAYRATE tables, you are forcing the data-
base server to make additional updates anytime you add or modify data. If you
change a rate in the PAYRATE table or hours in the TIMESHEET table, for exam-
ple, you will also have to update records in the EMPLOYEEPAY table to reflect
these changes. Introduce too many situations where duplicate updates are needed
and performance eventually suffers.
SELF-CHECK
• List and describe the three normal forms.
• Explain how normalizing to the third normal form can result in addi-
tional relational tables.
• Explain the meaning of the term non-loss decomposition.
SUMMARY
This chapter discussed the process of creating a relational database design. You
saw how to convert simple entities, unary relationships, and binary relationships
to relational tables. This included choosing the foreign keys needed to establish
and maintain the relationships. You compared examples in which E-R diagrams
were converted to relational tables. You also learned about the normalization
process and how to apply the three normal forms.
KEY TERMS
Composite key
Data integrity
Data normalization
Decomposition process
Defining association
Determinant attribute
Exception conditions
First normal form
Functional dependency
Joining
Non-loss decomposition
Normal forms
Null value
Partial functional dependency
Referential integrity
Relational integrity
Second normal form
Third normal form (3NF)
Transitive dependency
Summary Questions
1. There is always a one-to-one relationship between entities in an E-R dia-
gram and the tables in a database design. True or False?
2. Which statement best describes the process of converting a one-to-one
binary relationship to related tables?
(a) You should convert the entities into a single related table.
(b) There will typically be only one way to define the tables.
(c) There will typically be more than one way to define the tables.
3. You are converting a many-to-many unary relationship. How many rela-
tional tables will result from this conversion?
(a) one
(b) two
(c) three
(d) four
4. There is a many-to-many relationship between the Customer and Issue
entities. Each combination of Customer and Issue occurrences is unique.
Which should you use as the associative table primary key?
(a) the Customer identifier
(b) the Issue identifier
(c) the combination of the Customer and Issue identifiers
(d) you don’t have enough information to determine the best primary key
5. Customer relates to Order as a one-to-many relationship. How should
you define the relationship between the CUSTOMER and ORDER tables?
(a) Add the CUSTOMER primary key to the ORDER table as a foreign key.
(b) Add the ORDER primary key to the CUSTOMER table as a foreign key.
(c) Create a new value to use as a foreign key in both the CUSTOMER
and ORDER tables.
(d) Create an associative table.
6. An associative table is needed when converting a one-to-many or many-
to-many relationship. True or False?
7. When converting related entities, it is necessary to consider all
relationships in which an entity is included. True or False?
(a) vendor name (primary key), alternate vendor, PO Box, city, state,
postal code
(b) employee number (primary key), last name, first name, department
number, department name, e-mail address, cubicle location, manager ID
(c) product SKU (primary key), description, warehouse location, quantity
on hand, quantity on order, vendor
(d) order number (primary key), line item number (primary key), prod-
uct SKU, quantity, selling price
Figure 4-32: The COMPOSITION (Composition Name [identifier], Year) and COMPOSER (Composer Name [identifier], Country, Date of Birth) entities related by Wrote, with intersection data Year Recorded and Price.
Creating and Normalizing a Relational Table Design
Figure 4-34 is the E-R diagram for Lucky Rent-A-Car.
1. Complete Table 4-1 with the information requested.
2. For each of the relational tables in Table 4-1, list the attributes based on the E-R diagram. Indicate any attributes that violate the first normal form.
Figure 4-34: The Lucky Rent-A-Car E-R diagram: MANUFACTURER (Manufacturer Name [identifier], Manufacturer Country, Sales Rep Name, Sales Rep Telephone), CAR, CUSTOMER, and MAINTENANCE EVENT (Repair Number [identifier], Date, Procedure, Mileage, Repair Time), with the Manufactured and Repaired relationships and rental intersection data (Rental Date, Return Date, Total Cost).
Non-loss Decomposition
Figure 4-35 is the E-R diagram for General Hardware
Company.
Figure 4-35: The General Hardware Company E-R diagram: OFFICE (Office Number [identifier], Telephone, Size), SALESPERSON (Salesperson Number [identifier], Salesperson Name, Commission Percentage, Year of Hire), CUSTOMER (Customer Number [identifier], Customer Name, HQ City), CUSTOMER EMPLOYEE (Employee Number [identifier], Employee Name, Title), and PRODUCT (Product Number [identifier], Product Name, Unit Price), with the Works in and Sells to relationships.
Figure 4-36 represents the attributes of the General Hardware Company taken to the first normal form. Continue the process and apply the second and third normal forms. Show the result below.
Figure 4-36: General Hardware Company attributes: Salesperson Number, Customer Number, Employee Number, Product Number, Office Number, Salesperson Name, Commission Percentage, Year of Hire, Department Number, Manager Name, Customer Name, HQ City, Employee Name, Title, Product Name, Unit Price, Quantity, Telephone, Size.
5 IMPLEMENTING A DATABASE
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of database
implementation.
Determine where you need to concentrate your effort.
INTRODUCTION
The better your database design, the better it will meet your business and appli-
cation requirements. This means collecting business data, identifying entities and
relationships, and then converting them to database tables. All this is done with
your final goal in mind, implementing your physical database.
Before you can begin creating database tables you need to finish the remaining
details of your database design. Review the physical design requirements and
how they impact your final design. These include not only data requirements,
but application requirements and requirements relating to the environment in
which the database will be deployed. It may be necessary to make additional
adjustments to your design to more accurately reflect the real world. Now you
must not only consider entities and attributes, but also the volume of data that
you must support and how it impacts your physical design requirements.
With the design process out of the way, you’re ready to implement the design
and deploy your database. Even here, there are additional decisions to be made.
These relate to how to create database objects that match your design. The
process starts with database tables and continues on through other database
objects as needed, such as indexes and views.
Figure 5-1: Inputs to physical database design: Throughput Requirements; Data Characteristics (Data Volume Assessment, Data Volatility); Application Characteristics (Application Data Requirements, Application Priorities); Operational Requirements (Data Security Concerns, Hardware Characteristics).
Figure 5-1 lists the inputs to physical database design, which are also fac-
tors important to the design process. These inputs fall into several subgroups.
We’ll take a look at each of these physical design inputs and factors, one-by-one.
The tables produced by the logical database design process (which for sim-
plicity we will now refer to as the logical design) form the starting point of the
physical database design process. These tables reflect all of the data in the busi-
ness environment. In theory, they have no data redundancy (or at least minimal
redundancy), and they have all of the foreign keys in place that are needed to
establish the relationships in the business environment.
This does not necessarily reflect an optimum performance environment. Keep
in mind that at this point, we have limited our design to database tables only. We
still need to design the objects, like indexes, that will help to optimize data access.
Even then, there may be additional performance issues. For example, it is possi-
ble that a particular query may require the joining of several tables, which may
cause an unacceptably slow response from the database. Clearly then, these tables,
in their current form, might be unacceptable from a performance point of view.
This is why we need to make additional modifications and physical design changes.
One important point is that none of these factors operates in a vacuum. Each
influences the others. Actions taken to better meet one set of requirements
could actually make it harder to meet another.
Figure 5-2: Data storage: one 8-KB page holding four 2-KB rows, and one 8-KB page holding a single 5-KB row.
Take Microsoft SQL Server for example. The fundamental storage unit is a
page, which is 8 KB of contiguous space. A table row cannot span pages, but a
page can hold multiple rows, assuming that they are small enough.
Now, look at Figure 5-2. The maximum row size for the Employee table in
this example is 2 KB. That means that four rows fit on each page with no wasted
space. If you have 1000 rows in the table, you need 1000 times 2 KB, which is
2000 KB, or 2 MB to store the data.
You have a very different situation in the Customer table. The maximum row
size is 5 KB. That means you can fit only one row per page. What happens to
the remaining 3 KB? It’s lost, wasted, and can’t be used for any other purpose.
That means that each row effectively requires 8 KB. If you have 1000 rows, you
need 8 MB of storage.
When processing data in the Employee table, you need to work with 2 MB of
data (in this example) for each 1000 rows. However, performing an operation on
the Customer table means having to move around 8 MB of data. As you add rows
to the Customer table, the space needed increases, as does the wasted space. If you
just based your calculations on row size and didn’t factor in page use, your space
requirement calculations would be off by nearly 40 percent for the Customer table.
Data volatility can directly impact response time and throughput. Some data,
such as active inventory records that reflect the changes in goods constantly being
put into and taken out of inventory, is updated frequently. Other data, such as his-
toric sales records, is seldom or never updated, except possibly for data from the
latest time period being added to the end of the table. Data volatility is often a jus-
tification for how far you take the normalization process, but it doesn’t stop there.
Data volatility is an important factor in physical design decisions. It will directly
impact database object design, placement, and hardware platform requirements.
The tables used to track sales in our retail example are used by the point-
of-sale application, but that’s probably not the only application needing access
to those tables. Suppose you also have an accounting application that collects
sales information and writes to the tables used to track financial accounting infor-
mation. The two applications have to compete for the same resources. In this
case, the higher-priority application should be the point-of-sale application,
because of the need to quickly complete customer transactions.
Sometimes priority decisions are based, at least partially, on which applica-
tion’s sponsor has greater political power in the company. No matter how the
determination is made, the changes you make and how you make them rely on
these relative priorities. The higher the application's relative priority, the more
important it is that the modifications do not adversely affect it.
Figure 5-3
r
esso
oc
Di
Pr
sk
Memory
Scaling Up
<ÿ=>
…
Scaling Out
The characteristics of the DBMS you plan to use must also be taken into account during database design. For example, different DBMSs
support different options for how data is stored and organized within the database tables. Microsoft SQL Server, for instance, supports a data type that directly
supports storage and manipulation of XML data. (Data types are data storage formats used when defining database tables.) Most other DBMSs, at least at this point
in their development, do not include a data type designed for this purpose, so storing XML data in them requires you to develop your own work-arounds based on more traditional data types.
Certain hardware characteristics, such as processor speeds and disk data
transfer rates, are associated with the physical database design process though
not directly part of it. Simply put, the faster the hardware, the more tolerant the
system can be of a less than optimal physical design. In fact, some database
designers might try to scale up or scale out the hardware platform rather than
correcting problems found in the physical design after the database has been
deployed. Options are shown in Figure 5-3. Scaling up refers to improving the
server hardware, such as by installing additional processors, more memory, or
faster disk subsystems. Scaling out refers to spreading the data across multiple
database servers in a distributed data environment. Both are valid responses to
changing conditions such as increased user support requirements or an increase
in the volume of data stored on the database. However, these responses can also
be used in an attempt to cover up an inefficient design.
Figure 5-4
At this point, the table is named Table_1, as you can see in the window tab.
You’ll be prompted to specify the table name when you save it. Additional graph-
ical utilities let you modify table columns for existing tables and manage prop-
erties for the table as a whole.
The disadvantages might not be immediately obvious. There are usually some
actions that you can take or properties that are accessible through the SQL com-
mand that are not available through the GUI-based command. The biggest dis-
advantage is the command environment itself. You must have access to SQL
Server Management Studio, either locally or through a remote virtual desktop,
a utility that lets you run a program on one computer while viewing its screens
and controlling it from another computer. This is possible with programs like
Windows Remote Desktop Connection. Because Management Studio is graphic-
and resource-intensive, it cannot be used in this way over a low-bandwidth
connection such as a dial-up connection.
Most DBMSs also provide an environment for running SQL commands. When you use GUI utilities, in some
environments they work by building SQL command strings and then executing
them in the background where you don't see them.
definition you prefer, you will see SQL commands divided into either two or
three types of commands. You will learn more about these types of commands
in Chapter 6. The most common categories you’ll see are data manipulation
language (DML) and data definition language (DDL) statements. DML state-
ments are, like the name implies, statements used to modify data, including
statements used to insert new data or delete data rows. DDL statements are used
to define server and database objects, like logins, users, tables, views, indexes,
and so forth.
You will sometimes see a third type of statement, data query language (DQL)
statements. A DQL statement refers specifically to SELECT command queries that
are used to retrieve data. Because the term query can be used to refer to state-
ments used to retrieve or modify data, the SELECT statement is often rolled in
as part of the DML command set.
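As an illustration, using a purely hypothetical table (the names below are not from the General Hardware design), the categories might look like this:

-- DDL: defines a database object
CREATE TABLE ProductTest
(ProductID int NOT NULL PRIMARY KEY,
ProductName nvarchar(40))

-- DML: modifies data
INSERT INTO ProductTest (ProductID, ProductName)
VALUES (1, 'Widget')

-- DQL (often grouped with DML): retrieves data
SELECT ProductID, ProductName
FROM ProductTest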
The full command syntax is beyond the scope of this chapter, but if you
take a close look at this command example, it’s relatively easy to see what the
command is doing. The command name tells you that you are creating a table.
The table name in this example is Car. The dbo is the relational schema and is
part of the fully qualified object name, which is the complete object name that
defines the object as globally unique. The period (.) is used as a delimiter to
identify the different parts of the fully qualified name. The object name syntax
is a SQL concept, but its specific implementation can vary somewhat by DBMS.
The column names, along with basic column parameters, are enclosed in
parentheses, along with any constraint definitions. You can see that the com-
mand creates four columns named SerialNumber, Model, Year, and Class, set-
ting the data type and nullability for each. “NOT NULL” says a value must be
provided during data entry or modification. “NULL” means that you can leave
the column blank during data entry and a null value will be entered as the
default. The primary key is identified as SerialNumber. The command is creat-
ing a clustered index for the primary key, so the table data will be physically
sorted in primary key order. The ON [PRIMARY] clause at the end specifies the
physical storage location, in this case, specifying the default storage location.
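The command listing being described is not reproduced here, but based on the description it would look something like the following sketch (the column data types are assumptions):

CREATE TABLE dbo.Car                     -- dbo is the relational schema
(SerialNumber int NOT NULL,              -- value required during data entry
Model nvarchar(30) NOT NULL,
Year int NOT NULL,
Class nvarchar(10) NULL,                 -- may be left blank; null stored by default
PRIMARY KEY CLUSTERED (SerialNumber))    -- table physically sorted by SerialNumber
ON [PRIMARY]                             -- default physical storage location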
As you learned in Chapter 4, a null value is a special type of value. In some
cases, you want to set the column value as unknown or undefined rather than
as what you might consider “nothing” values of zero or an empty string. A null
value represents “no value,” either unknown, undefined, or no value applied.
Even though it stands for “nothing,” a null can be significant.
Consider an inventory table. One of the columns you might have is the
QuantityOnHand column, the number of items that you currently have in stock.
Typically, you would expect to see an integer value. A value of 0 indicates that
you do not currently have any of that item in stock. But what about a null value?
You could have a null entered automatically if a record is added to the table
without specifying an amount for QuantityOnHand. You could then retrieve a
list of items that you have never had in stock by retrieving all rows with a null
value in QuantityOnHand.
One place null values can be problematic is in text fields. You use charac-
ter data types to store text data. If you don’t see a value, it could be a null, or
it could be a zero-length string, a valid character string that happens to contain
zero characters. The two might look the same to you, but a database server sees,
and sorts, the two differently. A query used to retrieve zero-length strings would
ignore null-valued fields. By the same token, a query retrieving records with a
null value would ignore zero-length strings.
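Assuming an inventory table along the lines described (the table and column names here are hypothetical), the difference shows up in how you query the data:

-- Items for which no QuantityOnHand value was ever supplied
SELECT ItemName FROM Inventory WHERE QuantityOnHand IS NULL

-- For character data, a zero-length string and a null are matched differently
SELECT ItemName FROM Inventory WHERE Notes = ''       -- zero-length strings only
SELECT ItemName FROM Inventory WHERE Notes IS NULL    -- nulls only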
There are several advantages to using SQL commands. They are typically
more powerful and more flexible than their GUI equivalents. They require min-
imal resources to run, so they can be used over a remote connection. You can
also use them inside of a batch file, which is a file that contains a set of exe-
cutable statements that can run as a group. In fact, you could create all of the
tables for a database, and the database itself, using a batch. You could even send
the batch to a user in another location as an email attachment and let that user
run the batch locally.
The biggest disadvantage is that some SQL commands can be complicated and
difficult to use. In order to use them effectively, you need to understand the state-
ment syntax and available options. The order in which options are specified in
SQL commands can even be an issue. Command documentation can sometimes
run on for several pages for a single command. In the case of creating tables, for
example, you need to know the available data types and how they are specified.
FOR EXAMPLE
Remote Updates
Consider this situation. You are a consultant and one of your clients has a
database that you created as part of a business solution. The client requested
some additional security features that will require you to create some new
tables in the database. The application developer has already created an
installation file that patches the application, but you need a way to modify
the database.
You don’t have remote access to the database server, can’t go on site at
the client’s location in time to create the tables, and don’t want to chance
talking a local user through the process of creating the tables. The client has
already loaded data, so you can’t easily replace the complete database. This
is a situation made for using CREATE TABLE.
You can create a SQL batch file that contains the appropriate CREATE
TABLE statements. You can then attach this batch file to an email message
and send it to a local user. Include instructions for running the batch file in
the body of the email. The user runs the batch and updates the database, the
user can now run the application setup file to patch the application, you’re
a hero in your client’s eyes, and everyone is happy (until the next crisis).
SELF-CHECK
• List business environment requirements that can directly impact
your database design.
• Create a table that compares and contrasts advantages and disad-
vantages of using GUI-based utilities and SQL commands to create
database objects.
• Discuss why it is necessary to consider design influences as a
group, rather than individually.
You need to ensure that data is accurate when it is entered and remains accu-
rate during updates and modifications. Two of the primary tools for ensuring
this have already been mentioned, primary keys and foreign keys. Another fac-
tor already mentioned is controlling nullability.
As fast as computers have become, their speeds are certainly not infinite and
the time it takes to find data stored on disks and bring it into primary memory
for processing is a crucial performance issue. Data storage, retrieval, and pro-
cessing speeds all matter. Regardless of how elegant an application and its
database structures are, the application must meet performance expectations to be a
success. Because of this, you may need to make changes to your table design that
go beyond converting entities and applying normalization rules.
▲ Domain integrity: ensures that the values entered into specified columns
are legal.
▲ Entity integrity: ensures that each row is uniquely identified.
▲ Referential integrity: ensures that references with other tables remain
valid.
▲ Policy integrity: ensures that values adhere to established business rules.
Domain, entity, and referential integrity issues are actually relatively straight-
forward. It’s a little more difficult to make universal statements about policy
integrity. Policy integrity requirements are business and business rule specific, as
are actions taken to ensure them.
Addressing Domain Integrity
The first step in ensuring domain integrity is specifying an appropriate data type
for each table column. This limits the type of data that you can store in the col-
umn and to some extent sets the data storage format. Data type also relates to
data volume assessment because it determines how much space is required for
storing column data.
You also control domain integrity through:
▲ Nullability
▲ Check constraints
▲ Unique constraints
▲ Default constraints
You must specify a data type for each table column, though some DBMSs
allow for a default data type. You can apply nullability, check constraints, unique
constraints, and default constraints to any or all table columns as appropriate.
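A sketch of these controls applied to a hypothetical table (not part of the General Hardware design) might look like this:

CREATE TABLE EmployeeHours
(EmployeeNumber int NOT NULL PRIMARY KEY,            -- nullability: value required
EmployeeEmail nvarchar(60) UNIQUE,                   -- unique constraint
HoursPerWeek int DEFAULT 40                          -- default constraint
    CHECK (HoursPerWeek BETWEEN 0 AND 60))           -- check constraint limits legal values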
A foreign key constraint relates one or more columns in the referencing table to one or more matching columns (with the same or compatible data types) in the
referenced table. The referenced column values must be unique, so the foreign key
must be related to the referenced table's primary key or a unique constraint.
Think about customer orders. Each order contains one or more inventory items.
You would use referential integrity between the orders table(s) and the inventory
table to ensure valid part numbers, and through these, valid inventory items.
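A sketch of such a foreign key constraint, using hypothetical order and inventory tables, might be:

ALTER TABLE OrderItem
ADD CONSTRAINT FK_OrderItem_Inventory
    FOREIGN KEY (PartNumber) REFERENCES Inventory (PartNumber)
-- Every PartNumber entered in OrderItem must now exist in Inventory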
In some cases, the requirements for ensuring referential integrity go beyond
what can be accomplished through a foreign key constraint. In that situation, it
might be necessary to use a trigger to enforce integrity. A trigger is a specialized
type of executable procedure, in short, a program.
▲ You require that each customer order must identify the employee who
wrote up the order. You could enforce this through relational integrity,
relating orders to employees.
▲ You require a phone number for each customer. This can be enforced
through nullability requirements and by applying a check constraint.
▲ More than 8 hours worked in a day or 40 hours worked in a week must
be identified as overtime. You could use a simple check constraint to
limit the hours, but to calculate the overtime hours and store them sepa-
rately would require more. In this case, you would likely create a trigger
to perform the calculation and save the overtime hours appropriately.
Introducing Triggers
Triggers run, or fire, in response to events. Events can be related to data or data
objects. For example, you can have triggers that fire when data is entered, updated,
or modified in a table, either through the actions of a GUI-based utility or a DML statement.
FOR EXAMPLE
Security through Triggers
Your database includes several tables that contain sensitive information. A
small number of employees are allowed to view the information, but no
employees are authorized to directly modify the information. You’ve set
access limits, but you need to audit attempts to change the data in the tables.
This is a situation where you could use triggers as a security tool. You
would create DML triggers on each of the tables that execute when a user
attempts to modify or delete table data. You configure the trigger to run
instead of the triggering statement. You could then have the trigger write
information about the attempt to a separate audit table. You could include
not only that the action occurred, but also what the employee attempted to
do, when, and even which employee made the attempt.
With a few DBMSs, including SQL Server 2005, you can also have trig-
gers that fire when database or database server objects are created or modified,
such as a trigger that fires any time you create a table; these are DDL triggers. As
you learned earlier, DDL statements are statements used to create and manage
server and database objects, such as tables, views, and even databases themselves.
However, most current DBMS products do not support DDL triggers.
Triggers are extremely flexible in what they can do. They can include nearly
any SQL language executable statement. They can be set up to run after the trig-
gering event, the statement that causes the trigger to fire. In the case of SQL
Server and some other DBMS products, you can have the trigger run in place of
the statement or statements in the triggering event. Triggers can be used to
enforce data integrity that cannot be enforced through normal constraints and
to set limits on what can be done to server and database objects. You can also
use triggers for security auditing, to block and track attempts to perform unau-
thorized actions.
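A sketch of the kind of auditing trigger described in the example above might look like this in SQL Server (the table names and audit columns are hypothetical):

CREATE TRIGGER trgBlockSalaryChanges
ON SalaryData
INSTEAD OF UPDATE, DELETE            -- fires in place of the triggering statement
AS
BEGIN
    -- Record the attempt instead of applying the change
    INSERT INTO AuditLog (EventTime, UserName, AttemptedAction)
    VALUES (GETDATE(), SUSER_SNAME(), 'Blocked UPDATE/DELETE on SalaryData');
END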
You will typically limit trigger use to situations where you cannot enforce
your requirements with a constraint, such as a check or foreign key constraint.
That is because executing a trigger is more resource-intensive than enforcing a
constraint and has a bigger potential adverse effect on database performance.
Joining tables whenever related data must be retrieved together works, but it can be unacceptably slow in many cases. Another common issue is the
need to calculate and retrieve the same totals of numeric data over and over
again. In a normalized database, you would normally calculate rather than store
these values to save space. Technically, you would consider these to be redun-
dant data. However, when performance becomes an issue, it might be better to
calculate the values once and then store them with the rest of the table data.
Also, the "correct" primary key and foreign key values, from a logical design
standpoint, aren’t always the best choices for physical implementation.
Data volume is another issue. Data is the lifeblood of an information system,
but when there is a lot of it, care must be taken to store and retrieve it efficiently
to maintain acceptable performance. Certain factors involving the structure of
the data, such as the amount of direct access provided and the presence of
clumsy, multi-attribute primary keys, can affect performance. If related data in
different tables that has to be retrieved together is physically dispersed on the
disk, retrieval performance will be slower than if the data is stored physically
close together on the disk.
Finally, the business environment often presents significant performance chal-
lenges. We want data to be shared and to be widely used for the benefit of the
business. But a very large number of access operations to the same data can cause
a bottleneck that can ruin the performance of an application environment. Also,
giving people access to more data than they need to see can be a security risk.
We need to spend a little time looking at additional design changes that you
might make that relate directly to managing performance issues. At this point
we are not limiting ourselves to table design only. We’re also looking at how we
might organize the data stored in the table to make our database more efficient.
Combining Tables
Most of what you do during normalization relates to non-loss decomposition,
breaking entities down into smaller and smaller related tables. While good in
theory, this doesn’t always result in the most efficient design. Sometimes, it’s best
to combine some of your design’s individual tables into larger tables. In some
cases, you will keep both the normalized and combined tables in your database,
trading increased storage requirements for improved performance. Let’s start with
a look at the tables created for a fictional company, General Hardware. This is
shown in Figure 5-5.
Consider a one-to-one relationship between salespersons and offices. Typically,
you would use this entity definition to create two tables, but it could be created
as a single table as shown in Figure 5-6. After all, if a salesperson can have only
one office and an office can have only one salesperson assigned to it, there can
be nothing wrong with combining the two tables. Since a salesperson can have
only one office, a salesperson can be associated with only one office number, one
(office) telephone, and one (office) size. A like argument can be made from the
perspective of an office. Office data can still be accessed on a direct basis by simply creating an index on the Office Number attribute in the combined table.
Figure 5-5: General Hardware Company tables: SALESPERSON, CUSTOMER, CUSTOMER EMPLOYEE, PRODUCT (Product Number, Product Name, Unit Price), SALES (Salesperson Number, Product Number, Quantity), and OFFICE (Office Number, Telephone, Size).
Figure 5-6: The combined SALESPERSON/OFFICE table.
The advantage is that if we ever have to retrieve detailed data about a sales-
person and his office in one query, it can now be done without joining. How-
ever, this is not a perfect solution and, in fact, has three definite negatives. First
is that the tables are no longer logically or physically independent. If we want
information just about offices, there is no longer an OFFICE table. The data is
still there, but we have to be aware that it is buried in the SALESPERSON/OFFICE
table. Second, retrievals of salesperson data alone or of office data alone could
be slower than before because the larger combined SALESPERSON/OFFICE
records would tend to spread the combined data over a larger area of the disk.
Third, the storage of data about unoccupied offices is problematic and may
require a reevaluation of which field should be the primary key. If you can jus-
tify combining the tables, it’s likely that you will also keep the original, separate
tables as well, in this situation.
Another way of adjusting data is to find alternatives for repeating groups.
Suppose we change the business environment so that every salesperson has
exactly two customers, identified respectively as their “large” customer and their
“small” customer, based on annual purchases. This would result in repeating
groups of customer attributes, with one “group” of attributes (Customer Num-
ber, Customer Name, etc.) for each customer. The attributes are so well con-
trolled they can be folded into the SALESPERSON table. What makes them so
well controlled is that there are exactly two for each salesperson and they can
even be distinguished from each other as “large” and “small.” This arrangement
is shown in Figure 5-7. With this structure, the foreign key attribute of Sales-
person Number from the CUSTOMER table is no longer needed.
This arrangement avoids joins when salesperson and customer data must
be retrieved together. However, retrievals of salesperson data alone or of cus-
tomer data alone could be slower than before because the longer combined
SALESPERSON/CUSTOMER records spread the combined data over a larger area
of the disk. Retrieving customer data alone is also now more difficult. As with
the last example, even if you can justify this combined table, it’s likely you would
also keep the two separate tables as is.
Figure 5-7: The combined SALESPERSON/CUSTOMERS table.
Figure 5-8: CUSTOMER table.
Figure 5-9: CUSTOMER table.
A CUSTOMER table that also carries the salesperson data (Figure 5-9) shows that salesperson number 137's name is Baker four times, his com-
mission percentage is 10 four times, and his year of hire was 1995 four times.
The added storage requirement would normally be avoided unless it can be jus-
tified by the performance improvement.
(Figure: CUSTOMER EMPLOYEE table with columns Customer Number, Customer Name, Employee Number, Employee Name, and Title.)
Splitting Tables
It shouldn’t surprise you that the larger the table, the greater the resources
required to sort and retrieve the data. The bigger the table, the bigger the poten-
tial performance hit when applications must reference that table. One way to
reduce the resource requirements is to split the table. There are two basic options
for this, horizontal partitioning and vertical partitioning. Horizontal partitioning
is partitioning by rows, storing different rows in different tables. Vertical partitioning is partitioning by columns, storing different columns in different tables.
Figure 5-11: SALESPERSON and CUSTOMER tables.
FOR EXAMPLE
Splitting Tables
Consider a situation where a company has offices in different geographic loca-
tions. Each location keeps an inventory specific to its warehouse’s needs. You
keep a master inventory list on a database server in the home office in San
Diego, California. You also have a database server at each of the warehouse
locations. At the warehouses, you want to minimize data storage require-
ments and optimize performance when performing inventory lookups.
This is a situation where you would split a table across multiple loca-
tions. The database server would store only those records appropriate to that
warehouse, keeping the size of the INVENTORY table to a minimum. Reduc-
ing the table size also helps improve performance when retrieving data. You
can have the database servers in each of the warehouses send updates to the
server in San Diego to keep the master inventory file up-to-date.
One reason for horizontal partitioning is that a particular group of records is accessed much more frequently than the rest of the records in the table.
Another is to manage the different groups of records separately for security or
backup and recovery purposes. For example, you might need to access the
records for sales managers in the CUSTOMER EMPLOYEE table more fre-
quently than the records of other customer employees. Separating out the more
frequently accessed group of records means that they can be stored near each
other on the disk, which will speed up their retrieval. The records can also be
stored on an otherwise infrequently used disk, so that the applications that use
them don’t have to compete with other applications that need data on the same
disk. The downside of horizontal partitioning is that a search of the entire table
or the retrieval of records from more than one partition can be more complex
and slower.
In vertical partitioning a table is subdivided by columns, producing the same
advantages as horizontal partitioning. In this case, the separate groups, each
made up of different columns of a table, are created because different users or
applications require different columns. For example, it might be beneficial to
split up the columns of the SALESPERSON table so that the Salesperson Name
and Year of Hire columns are stored separately from the others. But note that in
creating these vertical partitions, each partition must have a copy of the primary
key, Salesperson Number in this example. In all but one of the tables it is also
used as the foreign key, relating all of the vertically partitioned tables as a set. The
major disadvantage is one you’ve already seen. A query that involves the retrieval
of complete records, that is, data that is in more than one vertical partition,
requires that the vertical partitions be joined to reunite the different parts of the
original records.
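As a sketch, vertically partitioning the SALESPERSON table as described might look like this (the partition and condensed column names are assumptions):

CREATE TABLE SALESPERSON_MAIN
(SalespersonNumber int NOT NULL PRIMARY KEY,   -- primary key copied to each partition
CommissionPercentage int,
DepartmentNumber int)

CREATE TABLE SALESPERSON_HISTORY
(SalespersonNumber int NOT NULL PRIMARY KEY
    REFERENCES SALESPERSON_MAIN (SalespersonNumber),  -- also the foreign key relating the set
SalespersonName nvarchar(40),
YearOfHire int)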
In some situations, vertical partitioning may be physically required by the
DBMS. For example, when using SQL Server a table’s row size is limited to
approximately 8 KB. This is a physical limit that cannot be exceeded. If you have
too many attributes, or the attributes are too large to fit in a single physical row,
you must vertically partition the table into multiple tables.
SELF-CHECK
• List and describe key data integrity requirements.
• Explain how entity integrity can be enforced if an entity does not
have a uniquely-valued attribute.
• Describe changes you might make to table design to optimize
performance.
Data type selections can also impact performance and flexibility. DBMSs can
store, retrieve, and manipulate some data types more quickly than others. What
you can do with data, including the types of calculations in which the data can
be used, also varies by data type.
Data types fall into general categories, with most DBMSs supporting the same
basic categories. Typically, these will include numeric, character (string), date and time, and binary data types.
In addition to defining the data columns, you can also define table con-
straints. Depending on the DBMS, these can include primary key, foreign key,
check, unique, and default constraints. You can specify them when you create
the table or, in most cases, add them later.
The basic CREATE TABLE syntax requires you to specify the table name and
column list, along with the data type for each column. For example, to create
the General Hardware Company CUSTOMER table, you could use the following
command:
CREATE TABLE CUSTOMER
(CustomerNumber int,
CustomerName nvarchar(40),
SalespersonNumber int,
HQCity nvarchar(40))
This is similar to the statement example you saw earlier in this chapter. This
statement creates the table CUSTOMER with the following four columns:
▲ CustomerNumber: integer data
▲ CustomerName: variable-length character data, up to 40 characters
▲ SalespersonNumber: integer data
▲ HQCity: variable-length character data, up to 40 characters
Figure 5-12: General Hardware Company tables with index labels: SALESPERSON, CUSTOMER, CUSTOMER EMPLOYEE, PRODUCT (Product Number, Product Name, Unit Price; label D), SALES (Salesperson Number, Product Number, Quantity; label E), and OFFICE (Office Number, Telephone, Size; label F).
Because the Salesperson Number attribute serves as both a primary key and a
foreign key, it connects those two tables in the join. If we need to frequently
find salesperson records on a direct basis by Salesperson Name, then we would
want an index with that as a key column (index H).
Consider the SALES table. If we have an important, frequently run applica-
tion that has to find the total sales for all or a range of the products, then the
query would run more efficiently if the Product Number attribute was indexed
(as index I, in this example).
The SQL language statement for creating an index is the CREATE INDEX
command. In its most basic format you specify the table on which the index is
based and the index key columns, the columns used to organize the data.
Columns are applied in the order in which they are listed. An index is created
as nonclustered by default, which means that it does not affect the table’s phys-
ical storage order. You can include the keyword CLUSTERED when you cre-
ate the index if you want the table physically sorted in index order. Because of
the impact on the table sort order, a table can obviously have only one clustered
index.
Suppose you want to create an index on the CUSTOMER table. You might
use the following statement:
CREATE INDEX ix_CUSTOMER_NAME
ON CUSTOMER ([Customer Name])
This creates an index on the CUSTOMER table using Customer Name as the
only key column. This might be used, for example, to optimize an application
that prints customer lists.
FOR EXAMPLE
Choosing the Correct Index Keys
You have a table with a two-column identifier, PartNumber and Salesper-
sonNumber. You want to create a clustered index to optimize data retrieval,
but you need to know which column you should list first when you create
the index. What should you do?
The first column listed has the most effect on how the data is physically
sorted. Before you can make your decision, you need to know one thing:
When data is retrieved from the table, in what order is the data usually
needed? If you usually need the data in PartNumber order, that’s your first
column. If you need it in SalespersonNumber order, then that’s your choice.
If the data retrieval requirements are too evenly split to tell for sure, then
choose based on selectivity. If it’s still too close to call, the choice comes
down to your personal preference. If the order doesn’t matter, then you
should probably create a nonclustered index instead of a clustered index.
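A sketch of the kind of two-column clustered index described above, assuming PartNumber order is the more common retrieval order and using an illustrative table name of PARTSALES:
CREATE CLUSTERED INDEX ix_PARTSALES_PARTNUM_SPNUM
ON PARTSALES (PartNumber, SalespersonNumber)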
As you can see in the SELECT statement column list, the column names
immediately following the SELECT statement, the query returns the three
columns specified as making up the view. Salesperson Number and Salesperson
Name are enclosed in square brackets so that the command will recognize and
process the names correctly. This is needed because the names include embed-
ded spaces. Whenever a user accesses data through this view, the SELECT state-
ment executes and the specified columns are returned.
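As a minimal sketch of a view of this kind, assuming the General Hardware SALESPERSON table and treating the third column as Year of Hire (an assumption, since the text names only the first two):
CREATE VIEW SALESPERSON_LIST AS
SELECT [Salesperson Number], [Salesperson Name], [Year of Hire]
FROM SALESPERSON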
To improve performance when accessing data through a view, SQL Server
supports a type of view called an indexed view that persists the view data. In
other words, it stores the result of the SELECT query on disk for faster retrieval.
There is no reason to run the query unless the underlying table data changes.
SQL Server places several restrictions on the creation and use of indexed views,
so they cannot be used in all situations. Also, as we have already seen, there are
usually trade-offs when addressing a design issue like access performance. In this
case, the improved performance comes at the cost of the additional disk storage
needed to keep the query result.
SELF-CHECK
• Describe the basic information needed to create a table.
• Describe the basic information needed to create an index.
• Describe the basic information needed to create a view.
SUMMARY
This chapter took a close look at some final physical design issues and the basic
implementation process. It introduced additional design requirements. You were
also introduced to the types of utilities available for object creation. You learned
about data integrity and performance optimization needs. You also saw how to
create basic tables, indexes, and views.
KEY TERMS
Summary Questions
1. Data volatility refers to which of the following?
(a) how often data changes
(b) the size of data rows
(c) the number of rows in a table
(d) the number of constraints placed on a table
2. Which of the following can be used to determine relative application
priority?
(a) requirements to support primary processes
(b) impact on company profits
(c) cost savings provided
(d) all of the above
3. Which of the following is an advantage of using SQL language commands
to create database objects?
(a) The commands are easier to use than graphic utilities.
(b) The commands must run locally at the server.
(c) Commands can run as part of a SQL batch file.
(d) Commands prompt you for options and replaceable parameters.
4. Your database server hardware platform is not a factor in database
physical design. True or False?
5. When storing numeric data, a null value is the same as a zero. True or False?
6. Domain integrity ensures which of the following?
(a) that business rules are implemented
(b) that each row is uniquely defined
(c) that values entered into columns are legal
(d) that users are members of a Windows domain
7. Which of the following is used to enforce entity integrity?
(a) foreign key constraint
(b) check constraint
(c) default constraint
(d) primary key constraint
(c) string
(d) numeric
17. Unicode data is stored as one byte per character. True or False?
18. SQL Server automatically creates an index based on which of the following?
(a) primary key constraint
(b) foreign key constraint
(c) check constraint
(d) default constraint
19. Selectivity refers to the number of unique values present in a column.
True or False?
20. In which situation should you create a view?
(a) You want to vertically partition a table.
(b) You want to limit user access to table columns.
(c) You want to audit attempts made to modify table data.
(d) You want to physically combine tables.
Figure 5-13
Lucky Rent-a-Car logical database design: MANUFACTURER, CAR, MAINTENANCE,
CUSTOMER, and RENTAL tables.
3. You are preparing to deploy the database tables shown in Figure 5-13.
This is the logical design for the Lucky Rent-a-Car company’s database.
How can you ensure that a legal value is used for the Manufacturer
Name in the CAR table? How can you have Customer Number gener-
ated automatically as a one-up count of customers in the CUSTOMER
table with minimal resource requirements? In the MANUFACTURER
table, how can you ensure that Sales Rep Telephone is formatted
properly?
YOU TRY IT
The CALL CONTACT table will be used with a screen form that has the employee
collect initial information about the customer and purpose of the call. The
ONLINE CONTACT table will store a complete transcript of customers who contact
the service center through an Internet chat program. Both CALL CONTACT and
ONLINE CONTACT must give you a way of identifying the customer, time and date
of the contact, the employee, the general call category, and whether or not the
problem was resolved.
1. How can you avoid the need to store detailed information about a customer
in the CALL CONTACT and ONLINE CONTACT tables?
2. Chat sessions can sometimes run on for an hour or more. What category of
data type should you use to store the transcript and why?
3. What can you do to optimize performance when looking up entries in the
CALL COMMENT table?
4. You may need to split the CALL CONTACT and ONLINE CONTACT tables if they
become too large (contain too many rows). What should you use to filter the
rows? Justify your answer.
5. At minimum, what information in the ONLINE CONTACT table would be needed
to enforce entity integrity? How can you minimize the space needed to relate
other tables to the ONLINE CONTACT table?
6. You want to restrict each manager to seeing information relating to employees
he or she manages in the ONLINE CONTACT, CALL CONTACT, and CALL COMMENT
tables. How can you enforce this, considering the volatility of the information?
What could you add to these tables to optimize the data retrieval?
6
UNDERSTANDING THE SQL
LANGUAGE
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of the SQL
language.
Determine where you need to concentrate your effort.
INTRODUCTION
Most modern database management systems are based on the SQL language stan-
dard. You’ll find, however, that most are based somewhat loosely on that stan-
dard, choosing instead to employ their own variations and additions to the SQL
language. There is also some confusion as to what is considered part of the SQL
language and what has been added as part of manufacturers’ DBMS products.
In this chapter, you’ll get a formal introduction to the SQL language stan-
dard and to the types of variations that have been added by different DBMSs.
You will learn about basic language components through simple command exam-
ples. The chapter introduces representative DDL and DML command statements
through the standard command syntax and gives some insight into how some
DBMS providers have modified these statements to meet their particular needs
and design visions.
Most DBMSs offer the standard features and support most, if not all, of the standard
command set, but provide their own extensions to the language, giving them
added functionality. These extensions usually take the form of additional com-
mands or additional command options. There are also differences in how fea-
tures that are not fully defined in the SQL standard are implemented.
For example, the SQL standard defines basic database objects such as tables,
views, and indexes. It does not define how these features are physically imple-
mented. Microsoft SQL Server stores database objects in one or more database
files, with each file containing one or more database objects. MySQL, on the
other hand, creates separate files for database objects. Each table in MySQL can
effectively be thought of as stored in its own operating system file.
Basic features of SQL fall within general categories such as data definition (DDL),
data manipulation (DML), data query, and transaction control statements.
What this means is that some things that you might have considered part of
SQL because they are common to relational database systems are not. Instead
they are features and command extensions implemented first by one DBMS
provider and sometimes copied by others. For example, the SQL language spec-
ifies nothing about scheduled command execution, but this is a feature sup-
ported by Microsoft SQL Server and many others. The SQL language defines
procedures, which are compiled sets of executable statements, in the context of
the statements used to create and modify them. It does not include any specific
management procedures to be included with a DBMS, but most do support a
wide array of predefined management procedures that install with the database
server and are treated as part of the vendor’s language set.
Interactive SQL
All DBMSs provide some sort of command interface or command prompt for
running SQL commands interactively. Microsoft SQL Server, for example, sup-
ports two basic options. It provides a character-based command prompt where
statements can be typed in one at a time. Statements can be executed individu-
ally or as a set of executable statements known as a batch. You can also load
and run scripts, groups of SQL Server commands stored as a file. Microsoft has
released various versions of its command interface over the years, retaining older
versions to provide backward compatibility as new ones are released. The pre-
ferred command interface for SQL Server 2005, sqlcmd, is shown in Figure 6-1.
One of the biggest problems with a character-based interface is that it can
be confusing and difficult to use. Sqlcmd, for example, supports a large num-
ber of startup options to let you configure the command environment. The
options let you log in to a specific server, choose a working database, specify
a script file to load and run automatically, and set options such as communication
Figure 6-1
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
D:\Documents and Settings\FMiller>sqlcmd -S "webpro" -E
1> use adventureworks
2> go
Changed database context to 'AdventureWorks'.
1> select lastname from person.contact where contactid < 5
2> go
lastname
Achong
Abel
Abercrombie
Acevedo
(4 rows affected)
1>
Sqlcmd command interface.
time-out values and network packet sizes for network communication. You can
set default options for how data returned by queries is displayed. Set wrong,
they can impair your performance and even keep you from doing what you
need to do.
Using a character-based interface also requires either a thorough under-
standing of SQL language commands or a readily available reference. The inter-
face is unforgiving and the error messages returned by most DBMSs are usually
brief and not always helpful. Editing command strings in the interface can be
difficult and even small typographic errors could lead to major problems.
The biggest advantage of using a character-based interface is that it provides
a way of executing SQL language commands through operating system batch
files. This can be significant with a DBMS that does not internally support
scheduled command execution. In that case, you can still run periodic activi-
ties, such as data backups, through command scheduling supported by the oper-
ating system.
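For example, an operating system batch file could launch sqlcmd with a script file; the server name and script file name below are illustrative:
sqlcmd -S MyServer -E -i nightly_backup.sql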
SQL Server, as well as many other popular DBMS options, also provides a
windowed command environment using a graphic user interface (GUI). In the
case of SQL Server, it is included as part of SQL Server Management Studio,
shown in Figure 6-2. The interface has menu-driven option support, making it
somewhat easy to manage the command environment. You also have windowed
help available with access to database and database object structures, and easy
access to documentation. You can check commands for syntax errors before run-
ning them and easily edit your commands using a tool called Query Analyzer.
The biggest problem with using Query Analyzer is that it is resource inten-
sive and is not always available. If you want to run it from a client computer,
which is what you would normally want to do for security reasons, then you
must install the SQL Server client tools on that computer. The client tools include
SQL Server Management Studio.
Embedded SQL
Embedded SQL is best suited to activities that must be performed periodically
and that you want performed in the same way each time. Embedded SQL is also
a critical part of any database application. Embedded SQL uses the same SQL
language commands as when you are running interactive SQL statements. The
difference is that they are included as part of an executable program.
SQL Server, for example, has two basic programmable objects that provide
embedded SQL support. These are stored procedures and user-defined functions.
Both are similar in that they are sets of executable statements, they can accept
parameters to help control execution, and they can perform actions and return
results. The primary difference is how they are used. Stored procedures are most
often used to automate periodic procedures or to hide the details of those pro-
cedures from the users. User-defined functions are used when you want to return
either a scalar value to a user or a result formatted as a table.
For example, you might create a stored procedure to process customer pay-
ments. The user would need to provide the stored procedure name, the customer
ID, and the payment amount. Statements inside the stored procedure would han-
dle the details of exactly what changes need to be made to the table or tables involved
and could detect and respond to errors that might occur during the process. Not
only do you ensure that the payments are posted properly, this also helps to ensure
data security. Users do not need to know how the tables involved in the process are
structured. Because they don’t know how the data is structured, it’s harder for some-
one to retrieve information from the database without proper authorization.
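A minimal sketch of such a procedure in SQL Server syntax follows; the procedure name, parameters, and the PAYMENT and CUSTOMER_BALANCE tables are illustrative assumptions rather than objects defined in the text:
CREATE PROCEDURE ProcessCustomerPayment
    @CustomerID INT,
    @PaymentAmount DECIMAL(10,2)
AS
BEGIN
    -- Record the payment (illustrative table).
    INSERT INTO PAYMENT (CustomerID, PaymentAmount, PaymentDate)
    VALUES (@CustomerID, @PaymentAmount, GETDATE());
    -- Reduce the customer's outstanding balance (illustrative table).
    UPDATE CUSTOMER_BALANCE
    SET Balance = Balance - @PaymentAmount
    WHERE CustomerID = @CustomerID;
END
The user supplies only the procedure name and the two parameter values; the table structure stays hidden inside the procedure.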
Embedded SQL is also used in application programs. The programming envi-
ronment provides the connectivity tools to communicate with the database server
and execute SQL commands. This can be done by passing literal command
strings to the database server for execution or through application programming
interfaces (APIs) that provide the necessary functionality. For example, SQL
Server 2005 provides a set of .NET Framework management objects that make it
easy to build management applications. Also, the data objects provided with
ADO.NET support direct manipulation of database tables and other database
objects. In fact, ADO.NET is able to mirror the database and table structure,
down to column names and table constraints, in memory.
You need to understand that most commands, with the exception of some
DDL commands, run in the context of a database. When you connect to a SQL
Server database server, for example, through a command-line or graphical inter-
face, your connection is associated with a default database, also known as the
working database. When you execute a command that depends on a database,
the default database is assumed. For example, when you specify a table by name
or by schema and name only, SQL Server assumes the table is in the default
database. Some commands, mostly maintenance commands, can run against the
default database only. Other database products have similar defaults for connec-
tions, but the terminology used can vary between different DBMS products.
Each connection will also have an associated user—the user specified when
you connected to the database server. The actions you can take depend on the
permissions assigned to that user. In some cases, you can temporarily override this
by specifying to run a command using the security context of a different user,
but this feature is not supported by all DBMSs.
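In SQL Server, for example, this can be done with EXECUTE AS and REVERT; the user name here is illustrative:
EXECUTE AS USER = 'ReportingUser';
SELECT CUSTNAME FROM CUSTOMER;
REVERT;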
The result returned by the command after successful completion is somewhat
command specific. When you run a SELECT command to retrieve data, it returns
the requested columns and qualifying rows, known as the result set. You might
also see this referred to as a relational result. It will also return a message stating
FOR EXAMPLE
About SQL Compatibility
If all modern DBMSs are based on the SQL language standard, then why is
there such a concern about selecting a DBMS? The difference relates pri-
marily to the version of the SQL standard on which the DBMS is based and
how close it is to that standard. You will find that the closer a DBMS is to
the SQL standard, the fewer features and less functionality it supports. Mov-
ing away from the standard brings more power and flexibility, but also
means that you have to deal with more proprietary command syntaxes and
language components. The learning curve before you can successfully deploy
a database solution increases.
Microsoft SQL Server is one of the most popular modern DBMS prod-
ucts. It is also one of the fastest, if not the fastest, depending on whose
benchmark tests you prefer. It is also one of the most difficult to learn. An
understanding of the SQL language standards will be enough to get you
started so you can employ basic features, but to make use of its unique ben-
efits, you need to learn new commands and learn about features that aren’t
part of the SQL standard. It also means that scripts containing executable
command statements that are written for SQL Server may or may not run
on another DBMS, depending on what statements you’ve used, and how.
how many rows were returned. DML statements usually just return a count of
rows affected. You receive even less with most DDL statements. In most cases,
they only return a statement that the command completed successfully.
If the command didn’t complete successfully, the DBMS returns an error.
Often, the errors returned give little information other than the fact that an error
has occurred. One reason for this is security. One possible source of errors is
that someone is trying to break into a database without authorization. The less
information you provide them about why a command didn’t work, the more dif-
ficult it is for them to figure out a solution without knowing the database and
database object structures.
SELF-CHECK
• Briefly describe the importance of the SQL language standard.
• Compare interactive and embedded SQL.
• Explain why DBMS providers vary from the SQL standard.
Another point is that SQL SELECT commands can be run in either an inter-
active query or an embedded mode. In the query mode, the user types the com-
mand at a workstation and presses the Enter key. In the embedded mode, the
SELECT command is embedded within the lines of a higher-level language pro-
gram and functions as an input or “read” statement for the program. When the
program is run and the program logic reaches the SELECT command, the pro-
gram executes the SELECT. The SELECT command is sent to the DBMS where,
as in the query mode case, the DBMS processes it against the database and
returns the results, this time to the program that issued it. The program can then
use and further process the returned data. The only tricky part to this is that
some programming environments are designed to retrieve one record at a time.
In the embedded mode, the program that issued the SQL SELECT command
and receives the result set must process the rows returned one at a time. How-
ever, many newer languages and programming environments are designed to rec-
ognize and process result sets. Microsoft .NET Framework, for example, includes
data objects that let you access rows one at a time, in the order received, or store
the entire result set in memory and randomly access its contents.
Figure 6-3
SALESPERSON table.
containing SQL functions. The syntax for using SELECT to perform a calcu-
lation is:
SELECT expression
You can’t get much simpler than that. The expression is any mathematical or log-
ical expression that returns a result. Here’s an easy example:
SELECT 5 + 7
When you execute the command, the database server returns a result of 12. The
expression must resolve to a legal value or an error is returned. For example,
the following will return a divide by 0 error:
SELECT 5/0
The syntax is similar for resolving a function. In this case, the syntax is:
SELECT function [(parameter_list)]
The parameter_list is in square brackets because some functions do not include
any parameters. Others can accept multiple parameters, depending on what the
FOR EXAMPLE
The SELECT Statement
The SELECT statement is the basic workhorse of most database applications.
To understand this, think about how a database might meet the needs of a
business, such as an online sales business. A customer logs onto your Web
site. When this happens, it’s likely that your application will attempt to
retrieve information from the customer’s computer and compare it with
information it has on file, using the information to recognize returning cus-
tomers. How will it do this? By running a SELECT statement filtered by this
identifying information. The customer wants to view information about
products you have for sale. This information is often spread across multiple
tables, requiring the use of one or more SELECT statements used to retrieve
the information and organize the result.
The checkout process relies on a series of SELECT statements, retriev-
ing shipping options and formulas for calculating shipping costs, the cus-
tomer’s preferred shipping address and possibly one or more alternate
addresses. Some online vendors will also use queries against secure database
tables to retrieve credit card information you’ve left on file. Even when using
embedded SQL features that don’t necessarily execute an explicit SELECT
statement, they are still based on SELECT statement functionality.
function is designed to do. This syntax for evaluating expressions and functions
is not supported by all DBMS products.
SELF-CHECK
• Explain the basic SELECT statement syntax.
• Explain the importance of the WHERE clause when retrieving
SELECT results.
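Expressions can also be given a column alias. A sketch of the kind of query that produces the result shown next (the exact original statement is an assumption) is:
SELECT 5 + 5 AS total_value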
This gives you the following result, with “total_value” used as the column name.
total_value
10
To concatenate (join together) numbers in SQL Server, you would need to
use a different expression. Instead of numeric data, you would use numbers rep-
resented as strings. Use the + operator with values that are explicitly identified
as strings, as in the following:
SELECT '5' + '5' total_value
This gives you the result:
total_value
55
decision statements that take an action depending on the result of a logical oper-
ation, such as: “If A AND B are true, then perform an action.” The most com-
mon logical operators are shown in Table 6-3.
In addition to these operators, there are special operators that are used with
subqueries. A subquery is a special way of retrieving information where one
query is dependent on another query. A discussion of subqueries is beyond the
scope of this chapter.
We should, however, look at a couple of examples of using common logical
operators. We’re going to use them in the SELECT statement WHERE clause.
This time, we’ll be querying the CUSTOMER table shown in Figure 6-4.
Figure 6-4
CUSTOMER table.
First, let’s look at an example where we have two conditions and both must
be met for a row to qualify.
List the customer numbers, customer names, and headquarter cities of the
customers that are headquartered in New York and that have a customer
number higher than 1500.
In this case, you would run:
SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER
WHERE HQCITY=’New York’ AND CUSTNUM>1500
This gives us the result:
CUSTNUM CUSTNAME HQCITY
1826 City Hardware New York
2198 Western Hardware New York
2267 Central Stores New York
Notice that customer number 0121, which is headquartered in New York, was
not included in the results because it failed to satisfy the condition of having a
customer number greater than 1500. With the AND operator, it had to satisfy
both conditions to be included in the result.
To look at the OR operator, let’s change the last query to:
List the customer numbers, customer names, and headquarter cities of the
customers that are headquartered in New York or that have a customer
number higher than 1500.
In this case, you would run:
SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER
WHERE HQCITY=’New York’ OR CUSTNUM>1500
All functions could be divided into two broad categories: deterministic func-
tions and nondeterministic functions. Deterministic functions always return the
same result if you pass in the same arguments; nondeterministic functions
might return different results, even if they are called with exactly the same argu-
ments. For example, ABS, which returns the absolute value of a number passed
to it as an argument, is a deterministic function. No matter how many times
you call it with, say argument –5, it will always return 5 as a result. The
Microsoft SQL Server function GETDATE() accepts no arguments and returns
only the current date and time, and so is an example of a nondeterministic
function. Each time you call it a new date and time is returned, even if the
difference is less than one second.
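Both of the following are valid SQL Server statements; the first returns the same value every time, while the second returns a different value on each call:
SELECT ABS(-5)
SELECT GETDATE()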
One reason this is important is that some DBMSs restrict use of the nonde-
terministic functions in database objects such as indexes or views. For example,
SQL Server disallows use of such functions for indexed computed columns and
indexed views.
The list of functions changes slightly with each new release of the ANSI
SQL standards. However, the functions supported by the various DBMS
providers vary widely from the standard function list and from each other. As
an example, the functions specified in the SQL-99 standard (as well as earlier
standard versions) are described in Table 6-5. The SQL-99 standard refers to a
version of the SQL standard released in 1999. New SQL standard versions are
released every few years.
Rather than expecting a standard implementation of any of these functions,
you should refer to the documentation specific to your DBMS for available func-
tions. Limit use of functions to statements, procedures, and other executables to
be used with a specific SQL implementation.
Function categories provide a convenient way of organizing the functions you
are likely to use. Table 6-6 lists commonly accepted function cat-
egory descriptions.
To give you an idea of the functions supported by a specific DBMS, let’s take
a quick look at the aggregate functions supported by SQL Server. An aggregate
function operates on a set of values, returning a single result based on these val-
ues. These are listed in Table 6-7.
In the case of SQL Server, these functions are also part of the set of functions
referred to as built-in functions. Built-in functions are simply the predefined func-
tions that install with SQL Server. They are considered part of the general function-
ality of the SELECT statement as implemented by SQL Server. Aggregate functions
can be used on a result set as a whole or to provide summary or intermediate
values based on groups of rows within a result as specified by another SELECT
statement clause, the GROUP BY clause. Let’s take a look at a couple of exam-
ples based on the SALES table shown in Figure 6-5.
The SALES table shows the lifetime quantity of particular products sold by
particular salespersons. For example, the first row indicates that Salesperson 137
has sold 473 units of Product Number 19440 dating back to when she joined
the company or when the product was introduced.
Figure 6-5
SALES table.
When the rows are grouped by salesperson number, all of the rows for
Salesperson Number 137 will form one group, all of the rows
for Salesperson Number 186 will form another group, and so on. The Quantity
attribute values in each group will then be summed—SUM(QUANTITY)—and
the results returned to the user.
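A minimal sketch of this kind of grouped aggregate query is:
SELECT SPNUM, SUM(QUANTITY) AS TOTAL_UNITS
FROM SALES
GROUP BY SPNUM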
FOR EXAMPLE
The Importance of Parentheses
Parentheses and their placement can play an important role in SQL expres-
sions. Most people don’t take the time to memorize operator precedence,
the order in which operators are evaluated. That means that they don’t nec-
essarily know how an expression containing multiple operators, especially
one containing different types of operators, is going to be evaluated. Take,
for example, the following statement:
SELECT 2 * 2 + 6
This statement returns a result of 10. When evaluating the expression, the
database server first multiplies 2 times 2 (for 4), then adds 6 to the prod-
uct, giving you 10.
What if we turn it around, as in the following:
SELECT 2+6*2
This time, the result is 14. The multiplication operator has a higher prece-
dence than the addition operator, so it is evaluated first as 6 times 2 (for
12), with 2 then added to the product for a result of 14.
Typically, you will want to use parentheses so you can explicitly control
the order in which operators are evaluated. What if you want the database
server to first add 2 and 6 then multiply the sum by 2? Here’s the statement
you need:
SELECT 2*(2+6)
The addition operator is evaluated first, because it is enclosed in parenthe-
ses, then the multiplication operator. As expressions become more compli-
cated, especially when they depend on function results or scalar values
returned by queries, parentheses become more important in ensuring the
expressions are evaluated properly. Errors caused by operator precedence
can easily go unnoticed during application design, can lead to costly mis-
takes, and can be very difficult to locate and correct.
SELF-CHECK
• List and describe available operator categories.
• Explain the significance of operator precedence.
• List and describe function categories.
You must specify the table or view to which you are inserting the new row
or rows. If you specify a view, it must either be based on a single table or you
must create a trigger to handle the details of the INSERT process. The column_list
is specified only if you are not providing values for all columns. In that case,
you would list the columns for which values are provided.
You have several options for providing values. You can use literal values,
such as numeric values or character strings, as long as the value is appropriate
to the column’s data type. Character strings must be enclosed in quotes to be rec-
ognized as such. You can use expressions that evaluate as an appropriate value.
For columns that allow NULL values, you can specify NULL. In some cases, you
can use a SELECT statement to provide the values.
When inserting a row into a table, you must provide a value for every col-
umn unless the column:
▲ Is an identity column (in those DBMSs where identity columns are sup-
ported).
▲ Has a value calculated from the values of other columns.
▲ Allows NULLs (in which case, a value of NULL is inserted if none is
provided).
▲ Has a defined default value (which can be overridden by a supplied value).
SQL Server is one of the few DBMSs that currently support the DEFAULT
VALUES keyword. It specifies to use all default values when inserting the row.
These can include the next value of an identity column, automatically gener-
ated values such as GUID and timestamp values, default values based on lit-
erals or expressions, and, for nullable columns that don’t otherwise have a
default, NULL.
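As a sketch, the following inserts a row made up entirely of default values into a hypothetical AUDIT_LOG table whose columns all have defaults, are identity columns, or allow NULLs:
INSERT INTO AUDIT_LOG DEFAULT VALUES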
As an example, think back to the SALESPERSON table in Figure 6-3. Assume
that the COMMPERCT column has a default value of 10. You might use the fol-
lowing statement to add another salesperson to the table.
INSERT INTO SALESPERSON (SPNUM, SPNAME, YEARHIRE,
OFFNUM)
VALUES (427, ’Smythe’, 2005, 1247)
Notice that in this example, we didn’t include COMMPERCT in the column list,
and therefore, it will receive the default value when the row is added. The other
columns take their values from the literal values we’ve provided.
There are two typical reasons for needing to change data after it has been entered:
▲ You did not have all of the information for the row when you inserted it
into the table.
▲ You need to change database information to reflect changes in the real
world.
When data becomes available, or when it changes, you need to update table
data as appropriate, using the UPDATE statement. The ANSI SQL-99 standard
UPDATE syntax is:
UPDATE table_or_view_name
SET column_name = literal | expression |
(single_row_select_statement) |
NULL | DEFAULT,...
[WHERE search_condition]
In general, the rules and restrictions are the same as for using the INSERT
statement. You must specify the destination table or view. If the view is based on
more than one table, you must have a trigger that makes the update for you. You
can update directly through a view only if the view is based on just one table.
Use the SET keyword, followed by the column that you want to modify and
the new value. You can also specify a column list, with values provided for each
column in the list. As with INSERT, you can specify a literal value, an expres-
sion, the NULL keyword, or the DEFAULT keyword to set a column to its default
value (if any). If you specify an illegal value, an error is generated. This would
include an invalid value, that is, one not supported by the column data type, a
NULL value for a column that does not accept NULLs, or the DEFAULT key-
word for a column that does not have a defined default.
If you want to modify one row or a specific set of rows, you must include
the WHERE clause with a search condition to filter the rows affected by the com-
mand. Consider this situation. The salesperson you previously entered, Smythe,
married and took her husband’s name (Watson). She wants all company records
to reflect her new last name. To make this change, run:
UPDATE SALESPERSON SET SPNAME = ’WATSON’
WHERE SPNUM = 427
One of the most common problems when using the UPDATE statement is
either forgetting to include the WHERE clause (so that all rows are updated), or
using inappropriate values in the WHERE clause. Either can result in the wrong
rows being updated.
In this example, you didn’t have to use SPNUM as the qualifying column.
However, assuming that it is the primary key, it’s the value you would want to
use because it uniquely identifies the row. What if you had multiple salesper-
sons with the name Smythe and ran the following:
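The statement implied by the question would look something like this sketch:
UPDATE SALESPERSON SET SPNAME = 'WATSON'
WHERE SPNAME = 'Smythe'
Every row whose SPNAME value is Smythe would be changed, not just the one salesperson you intended, which is why filtering on the primary key is the safer choice.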
Using DELETE
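The general form of the statement, sketched from the standard syntax, is:
DELETE FROM table_or_view_name
[WHERE search_condition]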
Beyond this basic form, the command does not support additional keywords or para-
meters, making it very simple to use. Keep in mind, however, that constraints
or other restrictions placed on a table or individual columns could still prevent
you from deleting rows from the table.
The table or view name is required. The FROM keyword is optional with
some DBMSs, including SQL Server, but required by others. Some DBMSs
require a WHERE clause and search condition. However, SQL Server does not,
which could lead to a problem. If you run DELETE without a search condi-
tion, all rows are deleted from the table. For example, consider the following
statement:
DELETE SALESPERSON
That statement would delete all rows from the SALESPERSON table. Once
the change is committed, permanently written to the table, the only way to
retrieve the rows would be either to recover from backup or manually reenter
the information. However, reentering won’t work if the table has identity
columns because deleting all rows does not revert the identity back to its seed
value. That means that all rows would have new identity values.
It’s more likely that you would want to delete one row, or a subset of the
rows, from the table. In an earlier example, you added a salesperson with a
SPNUM value of 427 to SALESPERSON. To delete this person, you could
run:
DELETE SALESPERSON WHERE SPNUM=427
Microsoft SQL Server does support a statement that will both delete all rows
and reset the table’s identity column, if it has one, to its seed value. This is a
nonstandard command and is not always included as a DML statement. You
might find it classified as a DDL statement, or as neither. This is the TRUNCATE
TABLE statement, using the syntax:
TRUNCATE TABLE table_name
The potential problem is that this runs as a non-logged operation. Trans-
action logging is beyond the scope of this chapter, but in brief, a non-logged
operation provides a way of rolling back changes as if they were never run. That
means that the change isn’t written to the transaction log, but instead is made
directly to the database. SQL Server 2005 changed the description to say that
the command is minimally logged, which means that the fact that the operation
ran is logged, but not its effect on the database or its data. Even if the command
is run from within a transaction, which normally lets you roll back your actions
if caught early enough in the process, there is no way to completely reverse the
result other than by restoring the table from a recent backup. TRUNCATE TABLE
is not a standard ANSI SQL command.
FOR EXAMPLE
Embedding DML Statements
As you can imagine, DML statements play a major role in most database
applications. One of the justifications for normalizing database tables is to
optimize performance when entering and modifying data. To use the
INSERT, UPDATE, and DELETE statements, you need to understand the
table structure. However, knowing too much about how tables are struc-
tured could make it easier for someone looking to steal, or maliciously
change, database data.
You can avoid this by using embedded DML statements. You run the
statements in the context of custom procedures or your application. You
design the embedded statements so that users only need to provide the data
values as parameters without necessarily knowing anything about table
structure, the number of columns and column names, data types, or much
of anything else about the tables. You are able to ensure that updates are
made accurately and consistently while protecting data against unauthorized
access.
SELF-CHECK
• Explain why vendor command syntax is often different than SQL
standard syntax.
• Describe the purpose of the INSERT, UPDATE, and DELETE com-
mands.
DDL commands are used to create and manage database objects such as tables.
There are three basic commands that relate to object management: CREATE, which
builds a new object; ALTER, which modifies an existing object; and DROP, which
removes an object.
Each command includes both required and optional keywords and para-
meters. Required parameters must be specified when you run the command.
For example, when you create a table, the table name is a required parameter.
So is specifying at least one table column. Optional parameters are parameters
that you can include, but aren’t required to. Optional parameters sometimes
have a default value that is used if none is specified. For example, a table’s pri-
mary key is an optional parameter that does not have a default value. If you
don’t specify a primary key, the table is created without one. The table’s stor-
age location is also an optional parameter, but there is a default storage loca-
tion if none is specified.
The ANSI SQL-99 statement syntax for the CREATE TABLE statement, for
example is:
CREATE [{GLOBAL | LOCAL} TEMPORARY] TABLE table_name
(column_name [domain_name |
datatype [size1[,size2]]
[column_constraint,...] [DEFAULT default_value]
[COLLATE collation_name],...)
[table_constraints]
[ON COMMIT {DELETE | PRESERVE} ROWS]
The ANSI SQL standard uses the GLOBAL and LOCAL keywords to define tem-
porary tables, tables that are often created in memory only and having a lim-
ited scope. SQL Server does not support these keywords, but instead identifies
temporary tables by the table name. A # is used as the first character in a local
temporary table name and ## is used to identify a global temporary table.
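For example, the following sketch creates a local temporary table in SQL Server; the table and column names are illustrative:
CREATE TABLE #WorkingTotals
(SPNUM INT,
TOTAL_UNITS INT)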
The basic column definition is provided by the column name and data type,
both of which are required. The SQL standard lets you specify a domain as an
alternate to a data type. A domain is an object definition that includes a stan-
dard data type, but can also include a default value, collation, and constraints.
Domains are not supported by SQL Server, but SQL Server does support a sim-
ilar concept, known as a user-defined data type, which you can use instead of
a standard SQL data type. Constraints, as well as a default value for the column
and the column collation, are optional. You can also specify optional
table constraints which include the primary key, foreign keys, and unique con-
straints.
The ON COMMIT clause is also part of the SQL standard that is not sup-
ported by SQL Server. It is used with some DBMSs to control the actions of tem-
porary tables.
However, you shouldn’t get the idea that you are shortchanged in func-
tionality if you are using SQL Server. Microsoft supports several options that
are not supported in the SQL standard; these include options to control phys-
ical storage.
SELF-CHECK
• Describe the purpose of CREATE, ALTER, and DROP statements.
• Explain the role of DDL statements in the SQL language.
SUMMARY
This chapter introduced the SQL language. We began with a general introduc-
tion to the SQL language and a comparison between interactive and embedded
SQL. We spent some time working with the most basic version of the SELECT
statement to retrieve values from a single table and to evaluate expressions. The
chapter also introduced the concept of operators and SQL language functions.
We also looked at function categories used in vendor implementations of SQL
and took a relatively close look at the SQL Server aggregate functions. We ended
with a discussion of DDL and DML statements, including the standard SQL syn-
tax for selected commands.
KEY TERMS
Access path
Aggregate function
Batch
Binary operator
Built-in functions
Clause
Command operators
Command syntax
Data definition language (DDL)
Data manipulation language (DML)
Data query language (DQL)
Declarative statement
Default database
Deterministic function
Domain
Dynamic SQL
Embedded SQL
Explicit conversion
Function
Implicit conversion
Interactive SQL
Keyword
Minimally logged operation
Non-logged operation
Nondeterministic function
Operator precedence
Parameters
Parsing
Procedure
Qualifying conditions
Query mode
Relational result
Result set
Script
Search argument
Sqlcmd
Subquery
Temporary table
Transaction control
User-defined data type
Unary operator
Working database
Summary Questions
1. The ANSI SQL standard defines other commands in addition to DDL and
DML command statements. True or False?
2. What is the advantage of a command-line based interface for interactive
SQL?
(a) A command-line interface is easier to use than a graphic-based inter-
face.
(b) A command-line interface lets you run commands from an operating
system batch.
(c) A command-line interface does NOT require you to provide connec-
tion information.
(d) A command-line interface is more forgiving of user error than other
interfaces.
3. Embedded SQL is supported by .NET Framework development languages
only. True or False?
4. Which of the following statements accurately describes running SQL
commands?
(a) You must specify the database context when running any SQL com-
mand.
(b) All SQL commands require you to specify a WHERE clause.
(c) Unless otherwise specified, commands run in the security context of
the user establishing the database connection.
5. SELECT commands are considered to be declarative in nature. True or
False?
6. You must always specify the FROM clause when executing the SELECT
command. True or False?
7. What is the purpose of the SELECT command WHERE clause?
(a) to provide search conditions to filter the result set
(b) to identify the source database
(c) to identify the source tables or views
(d) to specify the destination for reporting the command result
7
DATA ACCESS AND MANIPULATION
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of SQL
language use.
Determine where you need to concentrate your effort.
INTRODUCTION
The primary reason for designing and deploying a database is to support a busi-
ness. A production database provides support for day-to-day business activities.
A decision support database provides the information needed to make strategic
business decisions. With both types of databases, a large percentage of the activ-
ity is related to retrieving data. Because of this, it is important to have a good
understanding of what you can do with the SELECT statement, keeping in mind
that the features available to you will depend on the DBMS you select for your
database solution. It also helps to understand the role that batches and scripts
can play in automating recurring procedures.
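The discussion that follows refers to the basic form of the statement, which is essentially:
SELECT column_list
FROM source
[WHERE search_condition]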
The column_list specifies the columns (fields) you want to return and can
include calculated values. Use an asterisk (*) as the column_list to have the state-
ment return all columns from a table. The source can include tables, views, and
table-valued user-defined functions, which are special functions supported by
some DBMSs that you can design and create to meet specific data retrieval needs.
The optional search_condition is used to filter the result so that you only get the
rows (records) that you actually want in your result.
This, however, is just the starting point for the SELECT statement. In order
to learn more about its features and functionality, you need to see more of the
statement syntax.
This is still not the complete SELECT statement syntax. There are some fea-
tures and functionality supported by the SELECT statement that are beyond the
scope of this text. A discussion of some of the remaining keywords and options,
at this point, might do more to confuse than to enlighten.
Because of this, you might see uses of the SELECT statement that include
keywords not presented here, such as the FOR XML clause that is used to return
database table data formatted as an XML document fragment, a portion of an
XML document containing data organized in a hierarchical fashion. Refer to the
Books Online that install with SQL Server and the Microsoft Developer Network
(MSDN) Web site for more complete information on advanced SELECT statement
syntax (in SQL Server) and keywords not covered as part of this text.
Table 7-1 has a brief explanation of the additional keywords introduced here.
You will get a chance to see these keywords used in this chapter.
The easiest way to understand how these keywords work is to see them
in use. For this purpose, we’ll use a database based on the fictitious General
Hardware Company business. Figure 7-1 is a SQL Server database diagram of
General Hardware showing you the tables contained in the database.
The key icon identifies each table’s primary key. The links between tables
identify foreign key relationships established between the tables. These relation-
ships are used to maintain relational integrity between the tables. If you want to
retrieve data based on these relationships, you must use joining statements,
which are statements that combine results from multiple tables.
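Consider a query that mixes AND and OR without parentheses. Because AND is evaluated before OR, the following statement (a sketch of the query under discussion) behaves differently than a casual reading might suggest:
SELECT CUSTNUM, CUSTNAME FROM CUSTOMER
WHERE HQCITY='New York' OR CUSTNUM>1500 AND HQCITY='Atlanta'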
This query captures customers with an HQCITY value of New York, as well as
those who meet both of the remaining conditions, CUSTNUM over 1500 and
an HQCITY of Atlanta. This results in:
CUSTNUM CUSTNAME
0121 Main St. Hardware
1525 Fred’s Tool Stores
1826 City Hardware
2198 Western Hardware
2267 Central Stores
If you want the OR operator evaluated first, you would use the following
query:
SELECT CUSTNUM, CUSTNAME FROM CUSTOMER
WHERE (HQCITY=’New York’ OR CUSTNUM>1500) AND
HQCITY=’Atlanta’
Figure 7-1
The result changes significantly. In this case, only one row qualifies:
CUSTNUM CUSTNAME
1525 Fred’s Tool Stores
This would mean that, with the AND outside of the parentheses, both of the two
conditions have to be met for a row to qualify for the results. One condition is
that the headquarters city is New York or the customer number is greater than
1500. The other condition is that the headquarters city is Atlanta. Since for a
given row the headquarters city can’t be both Atlanta and New York, the result
is extremely limited.
Suppose that you want to find the customer records for those customers
headquartered in Atlanta, Chicago, or Washington. Rather than using the OR
operator, you can use the IN keyword with a list of qualifying values, as in the
following query:
SELECT CUSTNUM, HQCITY FROM CUSTOMER
WHERE HQCITY IN ('Atlanta', 'Chicago', 'Washington')
This returns:
CUSTNUM HQCITY
0839 Chicago
1525 Atlanta
1700 Washington
Using LIKE
You get even more flexibility when you use the LIKE operator and its supported
wildcard characters for pattern matching, so you can filter by results that match
any string of characters (%), any single character (__), characters in a range or
set ([]), or that don’t match a specific character ([^]). Common uses of the LIKE
keyword are to retrieve customers by name or geographically by the first few
numbers in the customer’s ZIP code. For example, you could look for all cus-
tomers with a name beginning with the letter C using:
SELECT * FROM CUSTOMER WHERE CUSTNAME LIKE ’C%’
Or, you can reverse the logic and return those customers whose names don’t
begin with the letter C:
SELECT * FROM CUSTOMER WHERE CUSTNAME LIKE ’[^C]%’
You can combine the wildcard characters in different ways to give you con-
trol over the result. For example, consider this query:
SELECT CUSTNAME FROM CUSTOMER
WHERE CUSTNAME LIKE ’C_t%’
What will this return? You’ll get customer names that have the first letter C, any
character (including a space) in the second character position, the letter t, and
then any other character string. The result would be:
CUSTNAME
City Hardware
This gives you the calculated value based on the quantity on hand times the unit
price. The resulting column is named VALUE, as specified in the query.
To limit this to returning just the first three rows, you could run the
following:
SELECT TOP 3 PRODNUM, (QOH * UNITPRICE) AS [VALUE]
FROM PRODUCT
Limiting the returned rows usually makes more sense if you are also ordering
the result. You’ll learn how to do that a little later in this chapter.
What if you want a more permanent record of the result? You can use the
INTO keyword to write the result to a table. For example:
SELECT PRODNUM, (QOH * UNITPRICE) AS [VALUE]
INTO PRODVAL FROM PRODUCT
Keep in mind that error text and error message numbers are DBMS-specific.
Using SQL Server 2005, for example, you would see an error if you specify a table
name that already exists. If you do want to recreate the same table, you must
drop the table before running the query.
Next, let’s take a quick look at the DISTINCT keyword. Suppose you want
to find out which salespersons currently have one or more customers assigned.
To do so, you might run a statement like the following:
SELECT SPNUM FROM CUSTOMER
Using the data in our sample database, the result you get is:
SPNUM
137
186
137
137
361
361
137
204
186
Notice that most of the salesperson numbers are listed more than once in the
result. Suppose you don’t need to know how many times the salesperson is
listed, but whether or not the number appears in the list. To limit the result to
unique values, you would run the query as:
SELECT DISTINCT SPNUM FROM CUSTOMER
This time, you only get four rows back. Here’s the result:
SPNUM
137
186
204
361
As you can see in this example, use the DISTINCT keyword when you want to
specifically limit the result to unique values only.
Here’s a slightly different pair of examples:
SELECT SPNUM, HQCITY FROM CUSTOMER
SELECT DISTINCT SPNUM, HQCITY FROM CUSTOMER
Now, let’s compare the results side-by-side in Table 7-2.
The entire result must be duplicated for the row to be dropped. In this case,
it means the combination of SPNUM and HQCITY. This time, only two dupli-
cates are removed from the list: one for SPNUM 137 and New York and the
other for 137 and Los Angeles. The rest are all unique.
However, it’s easier to see the highest cost products if you list them that way,
using a query like:
SELECT PRODNUM, PRODNAME, UNITPRICE
FROM PRODUCT ORDER BY UNITPRICE DESC
You must include the DESC keyword to sort the results in descending order
(from highest to lowest). The default is to sort the result in ascending order.
You can also use the ASC keyword to explicitly specify an ascending order
sort.
What if you only want the top three items? This is where you would use
the TOP keyword, as in the following:
SELECT TOP 3 PRODNUM, PRODNAME, UNITPRICE
FROM PRODUCT ORDER BY UNITPRICE DESC
The result is still in the same order, but is limited to the first three rows only.
PRODNUM PRODUCTNAME UNITPRICE
21766 Drill 98.27
21765 Drill 32.99
24013 Saw 26.25
You can specify multiple columns in the ORDER BY clause. The result is
sorted by all of the columns, with the first columns listed having the highest
precedence. For example, you might want to do the following:
List the contents of the sales table sorted by PRODNUM. When there are
multiple rows with the same PRODNUM value, sort them by SPNUM.
The query for this might be:
SELECT * FROM SALES
ORDER BY PRODNUM, SPNUM
If you change the column order in the ORDER BY clause, the sort order of the
result also changes. For example, consider the query:
SELECT * FROM SALES
ORDER BY SPNUM, PRODNUM
The query still returns the same rows, but returns them in a different order.
It’s important to note that the column or columns used in the ORDER BY
list don’t necessarily have to be part of the SELECT clause column list. For exam-
ple, the following query would run without generating an error:
SELECT PRODNUM, PRODUCTNAME FROM PRODUCT
WHERE QOO = 0 ORDER BY QOH
Let’s break down exactly what this is requesting. The query should return the
PRODNUM and PRODUCTNAME columns, limited to those rows in which
the QOO (quantity on order) column has a value of 0. Sort the result in ascend-
ing order by the value in the QOH (quantity on hand) column. This gives you
the following result:
PRODNUM PRODUCTNAME
26722 Pliers
21765 Drill
24013 Saw
19441 Hammer
16386 Wrench
Notice in the SELECT statement that neither QOO nor QOH is included in the
SELECT column list.
You can group rows and summarize them with the GROUP BY clause. For
instance, suppose you wanted to find the total num-
ber of units of all products that each salesperson has sold. That is, you want to
group together the rows of the SALES table that belong to each salesperson and
calculate a value—the sum of the Quantity attribute values in this case—for each
such group. Here is the way such a query might be stated:
Find the total number of units of all products sold by each salesperson.
The SQL statement, using the GROUP BY clause, would look like this:
SELECT SPNUM, SUM(QUANTITY) AS [SUM]
FROM SALES GROUP BY SPNUM
Notice that GROUP BY SPNUM specifies that the rows of the table are to
be grouped together based on having the same value in their SPNUM attribute.
All of the rows for Salesperson Number 137 will form one group, all of the rows
for Salesperson Number 186 will form another group, and so on. The Quantity
attribute values in each group will then be summed—SUM(QUANTITY)—
and the results returned to the user. But it is not enough to provide a list of
sums:
1331
9307
1543
9577
These are the sums of the quantities for each salesperson, but because they
don’t identify which salesperson goes with which sum, they are meaningless!
That’s why the SELECT clause includes both SPNUM and SUM(QUANTITY).
Including the attribute(s) specified in the GROUP BY clause in the SELECT
clause allows you to properly identify the sums calculated for each group. A SQL
statement with a GROUP BY clause may also include a WHERE clause. Thus the
query
Find the total number of units of all products sold by each salesperson
whose salesperson number is at least 150.
would look like:
SELECT SPNUM, SUM(QUANTITY) AS [SUM]
FROM SALES WHERE SPNUM >= 150 GROUP BY SPNUM
In addition to the SUM aggregate function, you could use GROUP BY with
other aggregate functions such as AVG and COUNT. You cannot, however, use one
function to evaluate the result of another function. Consider the following query:
Find the average number of units of all products sold by each salesperson.
You might want to use a query like the following:
SELECT SPNUM, AVG(SUM(QUANTITY)) AS [AVG]
FROM SALES GROUP BY SPNUM
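SQL Server will reject this statement because one aggregate function cannot be
applied directly to the result of another. If the intent is the average quantity per
sales record for each salesperson, a simpler form will do; the following is a sketch
based on that assumption:
SELECT SPNUM, AVG(QUANTITY) AS [AVG]
FROM SALES GROUP BY SPNUM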
Another common requirement is to limit the groups included in the result based
on a group calculation. Suppose, for example, that you only want salespersons
whose total is at least 5000 units. The query can accomplish this by adding a
HAVING clause to the end of the SELECT statement, as follows:
SELECT SPNUM, SUM(QUANTITY) AS [SUM]
FROM SALES WHERE SPNUM>=150 GROUP BY SPNUM
HAVING SUM(QUANTITY)>=5000
Salesperson Number 204, whose total is only 1543 units sold, is dropped from
the results.
Notice that in this last SELECT statement, there are two limits specified. One
limit, that the Salesperson Number must be at least 150, appears in the WHERE
clause as filter logic. The other limit, that the sum of the number of units sold
must be at least 5000, appears in the HAVING clause. It is important to under-
stand why this is so. If the limit is based on individual attribute values that appear
in the database, then the condition goes in the WHERE clause, in this example,
the Salesperson Number value. If the limit is based on the group calculation per-
formed with a built-in function, then the condition goes in the HAVING clause.
In our example, this is the case with the sum of the number of product units sold.
In queries like the one used here, the order in which keywords appear in
the statement is also critical. HAVING can be used only with the GROUP BY
clause. The GROUP BY clause must always precede HAVING. You have the same
restriction when using both GROUP BY and ORDER BY. The GROUP BY clause
must always come before the ORDER BY clause.
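Putting these rules together, a query that filters rows, groups them, filters the
groups, and sorts the result places its clauses in this order. The statement below
simply combines the earlier examples for illustration:
SELECT SPNUM, SUM(QUANTITY) AS [SUM]
FROM SALES
WHERE SPNUM >= 150
GROUP BY SPNUM
HAVING SUM(QUANTITY) >= 5000
ORDER BY SPNUM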
The UNION operator combines the results of two queries into one. The final
result is a list of unique rows unless you also include the ALL keyword, with
the syntax UNION ALL. UNION ALL returns all rows, including duplicate rows.
First, let’s look at a sample query:
SELECT HQCITY FROM CUSTOMER WHERE SPNUM = 137
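To illustrate the syntax, this query could be combined with a second SELECT;
the second query below is hypothetical and is shown only to complete the form:
SELECT HQCITY FROM CUSTOMER WHERE SPNUM = 137
UNION
SELECT HQCITY FROM CUSTOMER WHERE SPNUM = 186
The combined result lists each HQCITY value only once, even if it appears in
both individual results.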
FOR EXAMPLE
Combining Keywords
One of the real strengths of the SELECT queries lies in what you can do by
combining keywords in different ways to get the exact results that you want.
GROUP BY, for example, is useful in generating all types of summary result
reports. For example, you might want to generate a report that provides a
daily sales report by salesperson over a month. You would use the GROUP
BY clause with two grouping conditions, grouping first by day and then by
salesperson. Suppose you have an ORDER_HEAD table from which you can
get this information. You might use:
SELECT SPNUM, SPNAME, SUM(ORDER_TOT)
FROM ORDER_HEAD
GROUP BY DAY_DATE, SPNUM, SPNAME
This would return a result grouped first by day, and then within that group,
by salesperson. The SPNUM and SPNAME groupings are identical. The rea-
son you need to have both in the GROUP BY clause is so that you can have
both in the SELECT list. Otherwise, either one could be used by itself to
define the grouping. This particular query, however, might not give you the
information in the format in which you need it. It would probably be eas-
ier to review the result if you return the result in SPNUM order, by adding
an ORDER BY clause, as in:
SELECT SPNUM, SPNAME, SUM(ORDER_TOT)
FROM ORDER_HEAD
GROUP BY DAY_DATE, SPNUM, SPNAME
ORDER BY SPNUM
The rule to remember here is that any column in the ORDER BY list must either
be an aggregate or appear in the GROUP BY clause. Because SPNAME appears
in the GROUP BY clause, you could also sort by SPNAME by changing the
query to read:
SELECT SPNUM, SPNAME, SUM(ORDER_TOT)
FROM ORDER_HEAD
GROUP BY DAY_DATE, SPNAME, SPNUM
ORDER BY SPNAME
The result only tells you how many rows were updated. To see the effect of the
change, you would need to retrieve the row data from the SALESPERSON table
and be sure to include the COMMPERCT column.
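A quick check might look like the following; the column list shown is illustrative:
SELECT SPNUM, SPNAME, COMMPERCT FROM SALESPERSON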
SELF-CHECK
• Describe the purpose of the TOP, DISTINCT, GROUP BY, ORDER
BY, and HAVING keywords.
• Describe the limits on the ORDER BY clause when used with the
GROUP BY clause.
• Compare and contrast the use of UNION, EXCEPT, and INTERSECT.
To join tables, the tables involved must be listed in the FROM clause, and the
join attributes in the tables being joined must be declared and matched
to each other in the WHERE clause. Since two or more tables are involved in a
join, the same column (field) name may appear in more than one of the tables.
When this happens, the column names must be qualified with a table name
when used in the SELECT statement. All of this is best illustrated by an exam-
ple. Consider the following request:
Find the name of the salesperson responsible for Customer Number 1525.
A SELECT statement that will satisfy this query is:
SELECT SPNAME FROM SALESPERSON, CUSTOMER
WHERE SALESPERSON.SPNUM=CUSTOMER.SPNUM
AND CUSTNUM=1525
In brief, the query uses filtering logic to limit the result to customer 1525 in the
CUSTOMER table. The tables are joined on the SPNUM column to locate the
correct salesperson name. We’ll take a closer look at this in a moment.
It’s common to use a table alias, where a different value is used to repre-
sent the table name, to reduce your typing when using joins. For example, you
might rewrite the previous example as:
SELECT SPNAME FROM SALESPERSON s, CUSTOMER c
WHERE s.SPNUM=c.SPNUM AND CUSTNUM=1525
The result would be the same because you haven’t made any changes to what
the SELECT statement is doing. You’ve just changed how you identify the tables
so you don’t have to retype the full table name each time.
Let’s break that SELECT statement down so that we can see exactly what’s
going on. Notice that the two tables involved in the join, SALESPERSON and
CUSTOMER, are listed in the FROM clause. Also notice that the first line of
the WHERE clause links the two join columns: SPNUM in the SALESPERSON
table (SALESPERSON.SPNUM) and SPNUM in the CUSTOMER table (CUS-
TOMER.SPNUM).
SALESPERSON.SPNUM=CUSTOMER.SPNUM
The notational device of having the table name, then a period, and then the
column name is known as qualifying the column name. This qualification is nec-
essary when the same name is used in two or more tables in a SELECT state-
ment; without it, the column reference would be ambiguous. SPNAME and CUSTNUM
don’t have to be qualified because each appears in only one of the tables included
in the SELECT statement, so there is no question as to which column you want.
Here is an example of a join involving three tables, assuming for the moment
that salesperson names are unique:
List the names of the products of which salesperson Adams has sold more
than 2,000 units.
The salesperson name data appears only in the SALESPERSON table, and
the product name data appears only in the PRODUCT table. The SALES table
shows the linkage between the two, including the quantities sold. And so the
SELECT statement is:
SELECT PRODNAME FROM SALESPERSON, PRODUCT, SALES
WHERE SALESPERSON.SPNUM=SALES.SPNUM
AND SALES.PRODNUM=PRODUCT.PRODNUM
AND SPNAME=’Adams’ AND QUANTITY>2000
As described earlier, the SELECT statement joins two tables, SALESPERSON and
SALES, then joins this result to a third table, PRODUCT. The result is:
PRODNAME
Hammer
Saw
Both of these are examples of inner joins. In an inner join, only qualifying
results are returned from the joined tables. There are two other types of joins
you can run: outer joins and cross joins. An outer join returns qualifying rows
from one table and all rows from the other (outer) table. A cross join returns
all possible rows, whether or not they meet the qualifying join logic, in every
possible combination. This is known as a Cartesian product. However, before
we show these, we need to introduce a different join syntax.
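Using the comma-separated syntax shown earlier, a query that lists each
salesperson together with the customers assigned to them might be written as
follows (a reconstruction consistent with the result shown next):
SELECT SPNAME, CUSTNAME
FROM SALESPERSON, CUSTOMER
WHERE SALESPERSON.SPNUM=CUSTOMER.SPNUM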
This is an inner join. The result is limited to rows where the SPNUM meets the
qualifying logic in both tables, giving you the following result:
SPNAME CUSTNAME
Baker Main St. Hardware
Adams Jane’s Stores
Baker ABC Home Stores
Baker Acme Hardware Store
Carlyle Fred’s Tool Stores
Carlyle XYZ Stores
Baker City Hardware
The syntax used here, though supported by many DBMSs (including SQL
Server), is not the ANSI standard syntax for specifying a join. For that, you
would need to run:
SELECT SPNAME, CUSTNAME
FROM SALESPERSON JOIN CUSTOMER
ON (SALESPERSON.SPNUM=CUSTOMER.SPNUM)
The JOIN keyword identifies the joined table with the joining logic going in
the ON clause. You could have specified INNER JOIN instead of just JOIN, but
the INNER keyword is assumed. If you wanted to include additional qualifying
logic, you would add a WHERE clause and include the search argument there,
such as:
SELECT SPNAME, CUSTNAME
FROM SALESPERSON JOIN CUSTOMER
ON (SALESPERSON.SPNUM=CUSTOMER.SPNUM)
WHERE CUSTNUM=1525
To turn our initial example into an outer join, you might write it as:
SELECT SPNAME, CUSTNAME
FROM SALESPERSON LEFT OUTER JOIN CUSTOMER
ON (SALESPERSON.SPNUM=CUSTOMER.SPNUM)
In this query, rows from the SALESPERSON table are returned whether or not
they qualify. We know this because we specified a LEFT OUTER join. The LEFT
refers to the table on the left side of the JOIN keyword, which here is the
SALESPERSON table. Salespersons with no matching customer are returned with
NULL values for the columns from the right (CUSTOMER) table. This SELECT
statement returns:
SPNAME CUSTNAME
Smith NULL
Potter NULL
Baker Main St. Hardware
Baker ABC Home Stores
Baker Acme Hardware Store
Baker City Hardware
Adams Jane’s Stores
Adams Central Stores
Dickens Western Hardware
Carlyle Fred’s Tool Stores
Carlyle XYZ Stores
This returns two additional SPNAME values, Smith and Potter. These salespersons
are not assigned any customers, so they are not included in the CUSTOMER
table. If you ran this as a RIGHT OUTER JOIN, the difference is that unquali-
fied rows in the RIGHT table (if any) would be returned by the query.
To return all rows and every possible column combination from both tables,
use a cross join. In this case, you would write the SELECT statement as:
SELECT SPNAME, CUSTNAME FROM SALESPERSON CROSS JOIN
CUSTOMER
Notice that there is no qualifying logic when you run a cross join. Rather
than qualifying the rows, you are returning all rows. Because the SALESPERSON
table has 6 rows and the CUSTOMER table has 9 rows, this query’s result set
would return a total of 54 rows, every possible combination. Let’s limit the result
a bit so we can see this a little easier. We’re going to limit our query to the first
three customers and the first three salespersons. The query becomes:
SELECT SPNAME, CUSTNAME
FROM SALESPERSON CROSS JOIN CUSTOMER
WHERE SALESPERSON.SPNUM < 140 AND CUSTNUM < 1000
From what you just learned, you would expect the result to return nine rows,
every combination of three customers with three salespersons. When you check
the result, you find that to be true:
SPNAME CUSTNAME
Smith Main St. Hardware
Smith Jane’s Stores
Smith ABC Home Stores
Potter Main St. Hardware
Potter Jane’s Stores
Potter ABC Home Stores
Baker Main St. Hardware
Baker Jane’s Stores
Baker ABC Home Stores
There are seldom many valid opportunities to use a cross join, especially in
a production database, though it is sometimes used as part of your analysis in
decision support databases.
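An alternative to a join is a subquery, in which one SELECT statement is nested
inside another. The query described in the following paragraphs, which finds the
salesperson responsible for Customer Number 1525, would look something like
this (a reconstruction based on that description):
SELECT SPNAME
FROM SALESPERSON
WHERE SPNUM=
   (SELECT SPNUM
    FROM CUSTOMER
    WHERE CUSTNUM=1525)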
Just as when you used a join to retrieve this information, the query returns a
single row, as follows:
SPNAME
Carlyle
Let’s break down this query statement and take a closer look at what it’s
doing. Because the innermost SELECT (the indented one), which constitutes the
subquery, is considered first, the CUSTOMER table is queried first. The record
for Customer Number 1525 is found and 361 is returned as the SPNUM result.
How do you know that only one salesperson number will be found as the result
of the query? Because CUSTNUM contains unique values, Customer Number
1525 can only appear in one record, and that one record only has room for one
salesperson number! Moving along, Salesperson Number 361 is then fed to the
outer SELECT statement, which, in effect, makes the main query, that is, the outer
SELECT, look like:
SELECT SPNAME
FROM SALESPERSON
WHERE SPNUM=361
The result, as you already know from the previous result, is:
SPNAME
Carlyle
The problem is that neither of these will give you the result you want. It’s
like asking SQL to perform two separate operations and somehow apply one to
the other in the correct sequence. This turns out to be asking too much. But
there is a way to accomplish the query, and it involves subqueries. You need to
ask the system to determine the minimum commission percentage first, in a sub-
query, and then use that information in the main query to determine which sales-
persons have that value in the COMMPERCT column:
SELECT SPNUM FROM SALESPERSON
WHERE SPNUM>200 AND COMMPERCT=
(SELECT MIN(COMMPERCT)
FROM SALESPERSON WHERE SPNUM>200)
FOR EXAMPLE
Practical Joins
Most database applications make extensive use of joins. In fact, the closer
you come to the 3NF goal, the greater the need for joins when retrieving
data. Sometimes, joins can become quite complicated. Consider this
situation. You structure customer orders so that they use two tables,
ORDER_HEAD and ORDER_DETAIL. When you print a copy of the order
for the customer, you want it to include the order number, customer name,
the employee who wrote the order, the date, and the detail line items with
descriptions. In a properly normalized database, you won’t store the customer
or employee names in the ORDER_HEAD table, just the identifying values
so you can retrieve the information from the appropriate related tables.
Internally, your application would use a SELECT statement to retrieve
the data. Actually, it would probably use two SELECT statements: one to
retrieve the header information and a second to get the detail line items. In
the application, you would probably use a replaceable parameter for the
order number, so we’ll do the same in our example. If you use @ordnum
for the order number, the first SELECT statement might look like:
SELECT C.COMPANYNAME, E.FIRSTNAME + ’ ’ + E.LASTNAME,
O.ORDERNUM, O.ORDERDATE
FROM ORDER_HEAD O, CUSTOMER C, EMPLOYEE E
WHERE C.CUSTID = O.CUSTID AND E.EMPID = O.EMPID
AND O.ORDERNUM = @ORDNUM
In this statement, you join the ORDER_HEAD table to the CUSTOMER table
on CustID and then join the result to EMPLOYEE on EmpID.
Now, to get the detail line item information, you might use:
SELECT D.ITEMNUM, P.DESCRIPTION, D.QUANTITY,
D.SELLPRICE, (D.QUANTITY*D.SELLPRICE)
FROM ORDER_DETAIL D, PRODUCT P
WHERE P.PRODUCTID = D.SKUNUM AND D.ORDERNUM = @ORDNUM
There are a couple of additional things you should notice about these state-
ments. In each case, the joining column is not part of the result set. It’s used
to retrieve the result, but doesn’t need to appear anywhere in the result.
Also, in the second query, notice that the linking column has different names
in the ORDER_DETAIL and PRODUCT tables. The columns don’t have to
have the same names, just contain the same data. Also, as a final note, you
will typically use all of the join columns as index key columns in the source
tables if performance during data retrieval is a concern.
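For instance, an index supporting the join in the second query might be created
as follows; the index name is illustrative:
CREATE INDEX ix_detail_ordnum ON ORDER_DETAIL (ORDERNUM)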
This yields the result of salesperson number 204. Actually, this is a very inter-
esting example of a required subquery. What makes it really interesting is why
the predicate, SPNUM>200, appears in both the main query and the subquery.
Clearly it has to be in the subquery because you must first find the lowest com-
mission percentage among the salespersons with salesperson numbers greater
than 200. But then why does it have to be in the main query, too? The answer
is that the only thing that the subquery returns to the main query is a single
number, specifically a commission percentage. There is no memory passed on
to the main query of how the subquery arrived at that value. If you remove
SPNUM>200 from the main query so that it now looks like:
SELECT SPNUM FROM SALESPERSON
WHERE COMMPERCT=
(SELECT MIN(COMMPERCT) FROM SALESPERSON
WHERE SPNUM>200)
this query returns every salesperson (without regard for salesperson number)
whose commission percentage is equal to the lowest commission percentage of the
salespersons with salesperson numbers greater than 200. Of course, if for some
reason you do want to find all of the salespersons, regardless of their salesperson
number, who have the same commission percentage as the salesperson who has
the lowest commission percentage of the salespersons with salesperson numbers
greater than 200, then this last SELECT statement is exactly what you should write.
SELF-CHECK
• Compare and contrast joins and subqueries.
• Explain why it is typically preferred to use a join instead of a
subquery when either will work.
• Describe the three basic join types.
One way to run a set of SQL statements together as a unit is to use batches and
scripts. A batch is a set of SQL commands that run
as a group. A script is one or more batches saved as a file.
Scripts are a commonly used tool for automating periodic procedures. For
example, you might have a script you run each month that uses current sales
information to recalculate stocking levels and order points. A video rental com-
pany could use a script to send reminder emails to customers who have over-
due videos. One drawback is the fact that scripts don’t give you a way to pass
parameters when the script runs, so you can’t control execution through replace-
able parameters. Because of this, scripts are well suited to situations where you
need to run multiple steps the same way each time.
You can write a batch ad hoc in a query window or using any text editor,
such as Windows Notepad. When you save the batch to an operating system
file, it becomes a script. When saving a script for SQL Server, you should use
the extension .sql so that SQL Server will recognize it as a script file. Because the
dividing line between the two is so fine, the terms batch and script are often
used interchangeably.
A script can contain multiple batches. In SQL Server, the keyword GO is
used to identify when one batch ends and the next one starts. For example:
USE GENERALHARDWARE
GO
SELECT * INTO #SALESTEMP FROM SALESPERSON
UPDATE #SALESTEMP SET COMMPERCT = 10
GO
The query processor executes all of the statements up to the GO as one batch.
The query processor parses all of the statements in the batch as a set, checking
for syntax and other errors. The statements are also optimized and compiled
for execution as a group. This is more efficient than running individual state-
ments because they must each be parsed, optimized, compiled, and then executed
individually.
A result is returned when the entire batch (up to the GO) finishes executing.
If an error occurs anywhere in the batch, batch execution stops at that point and
the remaining statements are not executed. Unless running in the context of a
transaction, any statements completed before the error are not affected. If you
are running a script that contains multiple batches, execution will continue (if
possible) with the start of the next batch.
There are a few things you should keep in mind when using scripts. You
don’t know the context under which the script will run, so you should run a
USE statement as the first statement in the script to set the default database. You
can’t add a column to a table as part of a batch and then reference that column
in the same batch (but you can reference it in the same script). Also, if an object
(such as a table) is dropped in the context of a batch, it cannot be recreated in
the same batch.
Let’s look at a script that creates a table and then adds data to that table:
USE GeneralHardware
GO
CREATE TABLE EmpTest
(SPNUM CHAR(3), SPNAME VARCHAR(20))
GO
INSERT EmpTest SELECT SPNUM, SPNAME
FROM SALESPERSON
This script contains three batches. The first sets the default database. The sec-
ond creates a table with two columns. The third batch inserts rows from the
SALESPERSON table into the newly created EmpTest table.
Using Variables
There are two things common to every variable. Each variable has an identifier,
which is a name unique within the scope in which it is used, and each variable
has a data type. Data types can be standard system data types or user-defined
data types.
To create a local variable, which exists only in the context of the batch in
which it is defined, you use a commercial at symbol (@) as the first character
in the variable name. The syntax for defining, or declaring, a local variable is:
DECLARE @name data_type
Now, let’s take a look at a batch that declares a variable, sets its value, then
uses the variable in a DML statement:
DECLARE @avgpct REAL
SELECT @avgpct = (SELECT AVG(COMMPERCT) FROM SALESPERSON)
UPDATE SALESPERSON SET COMMPERCT = @avgpct
This is functionally the same as an example you saw earlier. You declare a REAL
type variable named @avgpct and set it equal to the average of all commission
percentages. Then, you use the variable in an UPDATE statement to set COMM-
PERCT to that value for all salespersons.
There are several situations where your batch might need to make a deci-
sion on what to do next based on current conditions or values. You do this with
IF . . . ELSE. The basic syntax is:
IF boolean_expression
[BEGIN]
statement_block
[END]
[ELSE
[BEGIN]
statement_block
[END]]
This is one of the places where you use statement blocks to identify sets
of SQL statements, so you can identify which statements are associated with IF
and which are associated with ELSE. It’s easiest to understand how this works
by looking at an example. Consider the following:
If there are no items currently on order, run ORDERPROC to generate restock-
ing orders. If there are items shown as on order, first run RECPROC to receive
any pending orders that have been posted and then run ORDERPROC.
There is a decision to be made based on whether or not there are any items
on order. This is the basis of your Boolean expression, which must evaluate as
True or False. Here’s what the batch might look like:
IF (SELECT COUNT(*) FROM PRODUCT WHERE QOO > 0) = 0
    BEGIN
        EXEC ORDERPROC
    END
ELSE
    BEGIN
        EXEC RECPROC
        EXEC ORDERPROC
    END
The statements are indented to make the batch easier to read. The indenta-
tions have no effect on execution. If the count of items with a QOO (quantity
on order) value greater than zero is itself zero, then nothing is currently on order
and you only need to run ORDERPROC. If you get any value other than zero,
execution skips to the statements following the ELSE clause and runs RECPROC
and then ORDERPROC. Technically, you are not required to use BEGIN and END when
executing a single statement, like the statement after the IF clause, but it’s good
to get in the habit of using them in case you need to go back and make changes
later, such as adding more statements.
It’s a good idea to document everything you do when writing scripts. It acts
as a reminder of what you did and why in case there are problems or a need to
modify the script later. You can embed nonexecuting statements in a batch or
script as comments.
FOR EXAMPLE
Using Scripts
Here’s a common situation. You have several procedures that have to run at
the end of each month to do periodic cleanup and to prepare data for trans-
fer to a decision support database. You want to ensure that it is run the
same way each time, but you also want to minimize the time and effort
required. This is a situation made for scripting.
You can include all of the procedures that you want to run as batches
inside a script. You might want to use multiple batches. Group procedures
that are dependent on each other to run in the same batch so that if one
fails, the other doesn’t try to run. Also, if one part of the process fails because
of an error, it doesn’t prevent procedures that aren’t dependent on that oper-
ation from running. You can use control language if you have any situations
where processing decisions have to be made based on current conditions.
For example, the way that you process records for transfer could depend on
the total number of rows in a table. If they exceed a set threshold level, you
could segment processing to minimize the impact on other activities.
SQL Server also has the ability to create scripts based on any and all
objects in a database. You can automatically generate scripts that can be used
to create duplicate objects in another database or at another location, drop
objects, and (in some cases) modify objects. You can even generate table and
view scripts for running standardized INSERT, UPDATE, and DELETE state-
ments. That way, you let SQL Server do a big part of the work for you and
you then modify the scripts generated to meet your particular needs.
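To see how comments look in practice, consider a batch like the following. It is
a reconstruction that reuses the earlier @avgpct example:
/* Monthly commission adjustment.
   Sets every salesperson’s commission to the current average. */
DECLARE @avgpct REAL -- holds the average commission percentage
-- Calculate the average commission percentage
SELECT @avgpct = (SELECT AVG(COMMPERCT) FROM SALESPERSON)
/* Apply the average to all salespersons */
UPDATE SALESPERSON SET COMMPERCT = @avgpct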
In the preceding example, you see both syntaxes used for adding comments.
The batch begins with a multi-line comment enclosed in /* */. The DECLARE
statement is followed by a short comment on the same line; execution of that
line stops at the --, and everything after it on the line is ignored. The remaining
two comments are both single-line comments, one using -- and the other
using /* */.
SELF-CHECK
• Describe the purpose of batches and scripts.
• Explain the role of variables.
• Explain the basic role of control language.
Summary
In this chapter, you learned about data access and data manipulation. You learned
how to use the SELECT statement to run data retrieval queries and how to build
queries with SELECT statement keywords. You also learned about various methods
for filtering your result set and organizing the data returned by a query. You learned
how to use joins and subqueries to combine the results from multiple tables into a
single result. You also learned the basics of using batches and scripts to execute sets
of statements as a group and to save these so that they can be retrieved and exe-
cuted later, including the use of control statements that enable the query processor
to control execution by making decisions based on current data and conditions.
KEY TERMS
AND
Bitwise operations
Boolean expression
Cartesian product
Control statement
Control-of-flow statement
Correlated subquery
Cross join
Identifier
Inner join
Join
Local variable
Noncorrelated subquery
OR
Outer join
Subquery
Table alias
Table order
User-defined function
Variable
XML document fragment
Summary Questions
1. You can only use tables as your data source in the FROM clause of a
SELECT statement. True or False?
2. The HAVING keyword must be used with what other keyword?
(a) GROUP BY
(b) ORDER BY
(c) JOIN
(d) DISTINCT
3. When using UNION to combine the results of two SELECT statements,
the column lists must have columns with identical data types. True or
False?
4. What keyword is used to return only those rows with identical values
from two different SELECT statements?
(a) UNION
(b) EXCEPT
(c) INTERSECT
(d) DISTINCT
5. A SELECT statement includes a GROUP BY clause. In what clause is sort
order for the groups specified?
(a) HAVING
(b) GROUP BY
(c) WHERE
(d) ORDER BY
6. Which statement will return only the first three rows from the
PRODUCT table?
(a) SELECT * FROM PRODUCT WHERE TOP 3
(b) SELECT TOP 3 * FROM PRODUCT
(c) SELECT * FROM PRODUCT TOP 3
(d) SELECT * TOP 3 FROM PRODUCT
7. When would you use the INTO keyword in a SELECT statement?
(a) when creating a view based on the result.
(b) when creating a table based on the result.
2. You want a list of cities in the CUSTOMER table and the number of cus-
tomers in each headquarters city (HQCITY). What query should you use?
3. Modify the query in question 2 to order the list by count, lowest to
highest. What query should you use?
4. You want a count of customers assigned to salespersons, by salesperson,
in salesperson number order. The query should return the salesperson
number, name, and customer count.
5. You need a list of customers in the CUSTOMER EMPLOYEE table and
the number of employees on file for each. Label the columns Customer
and Employee Count.
Tip: Enclose the table name in square brackets.
Figure 7-2
Sample tables.
7. Write a script that creates a table named EmpCount. The table will have
two columns, one with data type CHAR(4) and the other with data type
INT. Name the columns Customer and Count. After creating the table,
insert rows into the table using the query in Question 5 as your source.
Before doing anything, the script should make sure that GeneralHardware
is the default database.
8. How can you verify that the batch in Question 7 worked correctly? What
specific statements should you use?
8
IMPROVING DATA ACCESS
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of ways to
improve data access.
Determine where you need to concentrate your effort.
INTRODUCTION
Data access is a broad subject, so any discussion of ways to improve data access
performance will range across various topics. This chapter covers some of the
issues that you are most likely to encounter, including a look at several database
objects and general guidelines for their use. You will
learn about performance bottlenecks, including what they are, the types of
symptoms they are likely to cause, and the most likely solutions to correct
performance problems. You will learn how to design indexes and views to support
data access and help improve query response. You will also learn about the
potential role of user-stored procedures and user-defined functions in your data-
base design.
▲ Hardware platform: primarily the database server, but in some cases, can
also include the network.
▲ Database: performance during data reads and writes directly impacts
application performance.
▲ Application: a poorly written, inefficient application can result in poor
performance even if the hardware and database server are working at
optimum levels.
For example, a single computer might act as both a database and application
server. This puts an added load on all resources, including the hard disk.
Disk fragmentation can impair disk performance, though it usually isn’t con-
sidered a bottleneck in itself and often isn’t a serious issue in database systems.
Fragmentation occurs when files are broken up into smaller pieces to use
available disk space. The more you delete and create files, the faster the drive
fragments. Most computers create and delete more files than you would suspect,
many of them temporary files used by applications and deleted when no longer
needed.
Disk fragmentation is less of an issue for database servers than for comput-
ers operating in other roles. In a well-designed database system, the database
files are kept somewhat segregated from other files, often on their own drive.
Because you are protecting the database files, they are more likely to remain con-
tiguous, minimizing the potential impact if other files become fragmented.
Microsoft Windows ships with Defrag.exe, a disk defragmentation utility,
which physically moves the file segments so that they form contiguous files. You
can also launch the disk defragmenter via the Computer Management console,
as shown in Figure 8-2.
Figure 8-2
Disk defragmenter.
Figure 8-3
Cache memory.
There is often little you can do to reduce memory requirements. That leaves increasing mem-
ory, typically installing more system memory.
Before installing memory, however, check the database and operating system
configuration to see how memory is being used. With some DBMSs, such as SQL
Server, you can specify the amount of memory made available to database oper-
ations. This could place an artificial limit on memory, causing a memory bottle-
neck, even though there is more physical memory available.
As mentioned before, memory directly impacts disk use. Modern operating
systems increase available memory through use of a virtual memory paging file.
This is hard disk space that is set aside and used like system memory. Paging is
the process of moving data between system memory and the paging file, as
shown in Figure 8-4. When physical RAM memory doesn’t meet your memory
requirements, paging increases, interfering with other disk operations and mak-
ing the hard disk look like your bottleneck.
Figure 8-4
Paging.
To determine if the processor is the bottleneck, look at the processor free and busy
time. The higher the percentage of busy time, the greater the load on the proces-
sor. When this averages over 90 percent of total time, then processor perfor-
mance is likely a bottleneck.
The only ways to correct a processor bottleneck are to reduce the processor
load, install a faster processor, or install one or more additional processors. You
need to be aware that there are a couple of possible problems with using a mul-
tiple processor computer. Different DBMSs have different support options, some
limiting the number of processors supported. A few lower-end database products
cannot recognize or use multiple processors, which means that an additional
processor would be a waste of money.
Many DBMSs, including SQL Server, let you configure processor affinity.
This controls how, or even if, the database server makes use of multiple proces-
sors. The problem might not be with the processors, but that the database has
been configured to ignore any additional processors.
Figure 8-5
Figure 8-6
Performance counters.
The Windows XP and Windows Server 2003 Performance utility also lets
you create alerts. An alert monitors one or more specified performance coun-
ters. When you create an alert, you specify the counters that it monitors and the
threshold value at which the alert is activated, or fires. A threshold value is a
specified minimum or maximum value that, when exceeded, indicates a poten-
tial problem. For example, you might suspect an intermittent processor bottle-
neck. You could create an alert configured to fire when the %Processor Time,
which monitors processor activity, exceeds 95 percent. You can have the alert
notify you each time this threshold is exceeded.
SQL Server also provides tools for monitoring database server activity. A
complete discussion of the available tools and utilities is beyond the scope of
this chapter, but it’s worth the time to take a brief look at some examples.
There are two utilities specifically identified by SQL Server 2005 as perfor-
mance tools. These are SQL Server Profiler and Database Engine Tuning Advi-
sor. Profiler lets you capture database activity for analysis. You can even capture
sample activity to use for automated testing and then replay the activity on a
development server to observe the effect of configuration changes. Database
Engine Tuning Advisor, which was introduced with SQL Server 2005, analyzes
system activity and reports on index use. It can suggest improvements to table
indexes and even automatically create the indexes for you.
Another way of monitoring database activity, specifically query activity, is by
reviewing execution plans. SQL Server provides options for both textual and
graphical execution plans. An example of a graphical execution plan is shown
in Figure 8-7.
The execution plan identifies how the query processor resolves the query,
the specific steps involved and resources required. One way that this can help
monitor performance is that it shows you not only what indexes are used by the
query, but how they are used by the query. SQL Server Books Online provides
Figure 8-7
Execution plan.
detailed documentation about how to collect and read information about query
execution plans.
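One way to see a textual plan in SQL Server is to turn on the SHOWPLAN_TEXT
session option before running the query; the query shown here is simply the
earlier join example:
SET SHOWPLAN_TEXT ON
GO
SELECT SPNAME FROM SALESPERSON, CUSTOMER
WHERE SALESPERSON.SPNUM=CUSTOMER.SPNUM AND CUSTNUM=1525
GO
SET SHOWPLAN_TEXT OFF
GO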
FOR EXAMPLE
Database Optimization
Over time, users have started complaining about database performance. You
tried to write it off with the idea that no application is ever fast enough to
please all users, but the complaints become louder and more frequent. You
finally have to face facts—you have a problem.
You go through all the right steps. You monitor performance and com-
pare it against your performance baseline. You identify the most likely causes
of the performance bottleneck and take corrective action. You document
what you did and you’re finished, right?
Wrong! Any time you are trying to correct a performance problem, you
have to test and verify that you’ve corrected it. Otherwise, how can you be
sure that the changes you made had any impact at all, let alone a positive
impact? There’s a possibility that you’ve actually made the problem worse,
especially if your corrective action included changing database or server con-
figuration settings.
Monitoring and evaluation is at least as important during resolution and
correction as it is when diagnosing the problem. Don’t forget that everything
is closely related to everything else. This includes hardware resources, operat-
ing system configuration, and database server configuration. Changes to any
one almost always have some impact on others, and not always for the better.
After you’ve taken your corrective action, run the same tests as you did
before when you were trying to diagnose the problem. Compare these both
to that data and to your baseline data to see where you stand. Only then,
when you have evidence that you’ve corrected the situation, should you ten-
tatively consider the problem solved. But you still are not completely fin-
ished. Spot check performance over time to make sure the problem doesn’t
reappear. Also, go back to the users who originally reported the problem to
see if they think it’s fixed. Only then should you consider yourself to be in
the clear.
Until the next problem, that is.
Hardware performance monitoring is a function of the operating system, so you
would use operating system utilities to monitor system hard-
ware and to look for bottlenecks. Database performance monitoring is a func-
tion of the DBMS, so to monitor database performance you would use database
utilities, either supplied with the DBMS or through a third-party developer. Keep
in mind that performance tuning isn’t an either/or proposition. You need to tune
both the database server and the hardware platform (and operating system) on
which it is deployed.
The tools you have available will depend on your operating system and
DBMS. Even though we have used Windows and SQL Server utilities as exam-
ples, the specific utilities we’ve discussed are vendor-specific and apply to these
products only. With any vendor’s products, refer to the documentation provided
with the product or on the manufacturer’s Web site for information about avail-
able utilities. With that in mind, let’s take another quick look at your options.
The primary Windows utilities for anything to do with performance are the
Performance utility and the various hardware property dialog boxes, which are
where the majority of the hardware configuration settings are found. The data-
base tuning process starts with two utilities, SQL Server Profiler and Database
Engine Tuning Advisor. Together, they can point out weak points in your data-
base design. Once located, how you fix them is up to you.
SELF-CHECK
• Explain how server performance relates to improving data access.
• List the most common hardware bottlenecks.
• Describe the potential impact of indexes on both read and write
performance.
You will often want to limit direct access to the tables in your database because
of potential security concerns. Also, nearly any database
is going to have sensitive data that must be protected against unauthorized
access. This is where views come into the picture as one of your primary tools
for managing data access.
We’re going to take a little closer look at both indexes and views in this
chapter. We’ll look at some additional guidelines and options for creating each,
and give you some ideas for how they might fit into your database design.
Unfortunately, indexes are not something that you can fix once and ignore.
As the table sizes change, so does the effectiveness of some of your indexes.
Application requirements also can change over time, so that some indexes are
seldom used and others that would be used by application queries don’t exist.
You need to periodically monitor index use to determine their effectiveness, espe-
cially if users report degraded performance.
Considering their effect on data retrieval, you might be tempted to just cre-
ate indexes for all table columns. There are two potential problems with this.
One is disk space. Each index row requires space based on its key columns. The
more indexes you create, and the more columns you include, the more space
required. Some DBMSs also limit the number and size of index key columns,
restricting your index design. You might also be limited in the types of columns
you can include in an index. SQL Server, for example, doesn’t let you use large
object (LOB) data types as index key columns.
You also have to consider the effect of indexes on write performance. Each
time you add a row to a table, you add a row to each table index. If you change
a key column value, the same change must be made to all indexes that use that
key. As you add more indexes, you increase write requirements.
There is one thing you can do to help improve performance when work-
ing with very large tables in SQL Server. You should organize your physical
storage so that the table is located on a different physical hard disk than its
nonclustered indexes. This allows the database to split read and write activity
across multiple hard disks, reducing the load on any one. Clustered indexes,
by design, must be stored with the table because they are physically part of the
table structure.
Creating Indexes
The CREATE INDEX command is used to create table indexes. The basic syn-
tax for this command in SQL Server Transact-SQL is as follows:
CREATE [UNIQUE] [CLUSTERED | NONCLUSTERED]
INDEX name
ON object (column_list)
[INCLUDE (column_list)]
As you can see here, you can create an index as clustered or nonclustered,
defaulting to nonclustered indexes. Specify UNIQUE if you want the index to
act as a unique constraint on the key columns. This requires that each combination
of values for the key columns be unique. For example, take a look at
Figure 8-8.
You could not create a unique index on the CUSTNUM column in CUS-
TOMER EMPLOYEE because the column contains duplicate values. However,
you could create a unique index that includes both CUSTNUM and EMPNUM
as key columns.
Figure 8-8
The object on which the index is based is typically a table. In some DBMSs,
and with some restrictions, you can also create an index on a view. The
INCLUDE keyword is specific to SQL Server 2005. Columns specified as
included columns through the INCLUDE keyword benefit from improved
access through indexes, but do not impact index sort order like the key
columns.
The CREATE INDEX statement supports additional options that are beyond
the scope of this chapter, but are fully documented in SQL Server Books
Online. These include options that let you manage how much free space is left
for inserting new rows when you create an index and whether or not users
will be able to access the table on which the index is based during index creation.
There is also an option that lets you drop and then recreate an index in the
same operation.
To create the index described earlier for the CUSTOMER EMPLOYEE table,
you could run:
CREATE UNIQUE INDEX ix_CE_NUMS
ON [CUSTOMER EMPLOYEE] (CUSTNUM, EMPNUM)
You should try to give your indexes descriptive names. This will be helpful in
identifying the index you want if you need to modify indexes in the future.
Additional commands are also available to manage existing indexes. ALTER
INDEX lets you manage an index and DROP INDEX lets you delete indexes.
However, you cannot use ALTER INDEX to change the index key and nonkey
columns. If you want to change the index columns, you must drop and then
recreate the index.
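For example, to change the key columns of the index created earlier, you might
drop it and then recreate it with the new column order; this is a sketch using
the ix_CE_NUMS index from above:
DROP INDEX ix_CE_NUMS ON [CUSTOMER EMPLOYEE]
GO
CREATE UNIQUE INDEX ix_CE_NUMS
ON [CUSTOMER EMPLOYEE] (EMPNUM, CUSTNUM)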
The query optimizer finds this information in the database metadata. The
query optimizer uses the information about the tables, together with the various
components of the SELECT statement itself, to try to find an efficient way to
retrieve the data required by the query. For example, in the General Hardware
Company SELECT statement:
SELECT SPNUM, SPNAME
FROM SALESPERSON
WHERE COMMPERCT=10
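For a simple query like this, the optimizer’s main decision is whether an index
on COMMPERCT exists and is worth using, or whether to scan the SALESPERSON
table. Now consider again the earlier join query that finds the name of the
salesperson responsible for Customer Number 1525:
SELECT SPNAME FROM SALESPERSON, CUSTOMER
WHERE SALESPERSON.SPNUM=CUSTOMER.SPNUM
AND CUSTNUM=1525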
In this case, the query optimizer should be able to recognize that since CUSTNUM
is the CUSTOMER table primary key and only one customer number is specified
in the SELECT statement, only a single record from the CUSTOMER table, the
one for customer number 1525, will be involved in the join. Once it finds this
CUSTOMER record (in an index), it can match the SPNUM value found in it
against the SPNUM values in the SALESPERSON records. If you’ve identified
SPNUM unique through a unique index, all it has to do is find the single SALES-
PERSON record (preferably through an index) that has that salesperson number
and pull the salesperson name (SPNAME) out of it to satisfy the query. Thus, in
this case, an exhaustive join can be completely avoided.
When a more extensive join operation can’t be avoided, the query optimizer
can choose from one of several join algorithms. The most basic way of resolv-
ing a join, which runs using a Cartesian product of the tables involved, is known
Creating Views
The basic syntax for creating a view in SQL Server is:
CREATE VIEW name AS select_statement
SQL Server also supports additional options that control how the view is
created. These options are beyond the scope of this chapter, but are documented
in SQL Server Books Online.
Because the view is based on a SELECT statement, you can join tables
through a view. This provides easy access to denormalized data without chang-
ing the structure of the database. You can also use views to limit access by fil-
tering tables. You can filter the base tables vertically, limiting the columns
included, or horizontally, limiting the rows included. You can also include com-
puted columns in your view to support queries that need values based on val-
ues computed from table columns or function results.
SQL Server does put some limits on the SELECT statement. For example,
you can’t use the INTO keyword. Also, you can’t have an ORDER BY clause in
the statement unless you also include the TOP keyword. You can’t base the
SELECT statement on temporary objects, like a temporary table, because they
might not be there when you try to retrieve data from the view. That’s because
SQL Server, in most cases, doesn’t persist the contents to the view. It generates
the view result set each time you call the view. That way, changes to the view’s
base objects are automatically included.
You also have the ALTER VIEW and DROP VIEW commands available to
manage existing views. ALTER VIEW lets you modify the view definition, which
includes letting you specify a different SELECT statement for the view. Use DROP
VIEW to delete existing views.
Using Views
Views are most often used to provide restricted access to the underlying base
objects. In SQL Server, views can also simplify security management because you
can give users (or an application) access to a view without granting them access
to the underlying objects. This helps prevent unauthorized access or modifica-
tions to the base data.
Let’s look at an example using the CUSTOMER table in Figure 8-9. There is
a requirement for operations that are performed on a city-by-city basis.
You’ve been asked to create a view based on CUSTOMER with the data fil-
tered by city. Here’s the statement you might use to create the view for New York
customers:
CREATE VIEW v_ny_cust AS SELECT * FROM CUSTOMER
WHERE HQCITY = ’New York’
To retrieve data from this view, you could use
SELECT * FROM v_ny_cust
Here’s a view based on a join that lets you retrieve customer employee infor-
mation with the customer name rather than CUSTNUM value:
CREATE VIEW v_custemp_name AS
SELECT CUSTNAME, EMPNUM, EMPNAME
FROM CUSTOMER C, [CUSTOMER EMPLOYEE] E
WHERE C.CUSTNUM=E.CUSTNUM
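Retrieving data through this view works the same way as with the first example,
for instance:
SELECT * FROM v_custemp_name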
Figure 8-9
CUSTOMER table
CUSTNUM CUSTNAME SPNUM HQCITY
0121 Main St. Hardware 137 New York
0839 Jane’s Stores 186 Chicago
0933 ABC Home Stores 137 Los Angeles
1047 Acme Hardware Store 137 Los Angeles
1525 Fred’s Tool Stores 361 Atlanta
1700 XYZ Stores 361 Washington
1826 City Hardware 137 New York
2198 Western Hardware 204 New York
2267 Central Stores 186 New York
CUSTOMER table.
FOR EXAMPLE
To View, or Not to View
Any business database is going to contain sensitive data, data to which you
need to limit access. In some cases, you are required by law to protect the
data. The more direct a user’s access is to the data, the greater the possibil-
ity that something can go wrong and that the wrong person will see or even
modify what should be secure data.
One solution is to manage table security at the column level, defining
column by column who can view what data. The biggest potential problem
with this solution is that it is easy to make mistakes. A better solution is to
avoid direct access to the data completely and instead give users indirect
access through views. You can create views as necessary to meet different
users’ access requirements.
Consider the types of data that you might have in employee records and
related tables. Database administrators will likely have access to all of that
data, simply because someone has to have complete access for management
and maintenance purposes. If you can’t trust your database administrators,
you likely have a lot more serious problems than someone taking an unau-
thorized peek at employee records.
What about other departments? Human resource managers would prob-
ably need access to most, if not all, of the employee data. Keeping with our
plan to use indirect access, you would create a view that includes most, or
all, of the employee data. The person doing your payroll, however, needs
much more limited access. You might create a view for them, and your pay-
roll application, that gives access to information needed to create a paycheck,
such as time cards, pay rate, tax ID, and similar information. The benefits
advisor needs to know about things like which health plan the employee
has chosen, but the access needed is limited and easily defined.
What about the employee’s manager? Most of the time, there is very little
information to which the manager needs access, which is the guideline to use
when designing any views. Access is probably limited to things like contact
information, accrued and used vacation, performance reviews, and the infor-
mation you need to write performance reviews. As to the rest of the informa-
tion available, if you can’t justify a need for it, leave it out. Other managers, if
given any access at all, should have even less, which means another view.
Throughout the design process, keep this in mind: if you don’t give peo-
ple the information they need to do their jobs, they will let you know. If
you give them more access than they need, you may not find out until after
the damage has been done.
Does creating all of these views take time and effort? Obviously, but the
benefits in simplifying access and managing data security far outweigh the
costs. Also, most views are objects that you can create once and forget. Unless
access needs change or you have to restructure the objects on which they are
based, there is seldom any reason to modify views after you create them.
The view returns the customer name, employee number, and employee
name. Keep in mind, however, that by default the database server must execute
the join each time the view is called, with the associated overhead. Access is sim-
plified though, because the user or application programmer only needs to know
what view to specify in the SELECT statement’s FROM clause, not the full syn-
tax for the join. While not that significant in this example, it can make a big
difference when working with complicated joins and subqueries.
You can also specify a view as your destination when running INSERT or
UPDATE, but with restrictions. When inserting rows through a view, any
columns not included in the view must either have default values or be gener-
ated automatically by the database server, or an error will be returned. These
include columns such as identity columns, columns with defined default values,
and columns that allow NULL values.
When modifying tables through a view, the view must either be based on a
single table or you must define a trigger that executes in place of your modifi-
cation statement. In that case, the trigger will take the data passed by your state-
ment and use it to modify the underlying tables.
Before leaving the subject of views, we need to mention one special type of
view supported by SQL Server, the indexed view. An indexed view is a view
for which you have created a clustered index. When you do this, the view result
is persisted through the index structure. The advantages are easy to see. The
view acts like a denormalized table, giving you easier access to the joined table
data. Also, the persisted index improves access performance because SQL Server
doesn’t have to execute the SELECT statement and recreate the result set each
time you access the view.
There are three major disadvantages to using indexed views that tend to limit
their use in database solutions. First is that there are several restrictions on the
SELECT statement and base tables used to define the view, so indexed views
aren’t always a viable solution. Second is that, as part of the process of creating
the view, you bind the schema on the underlying tables. Schema binding pre-
vents you from making any changes to the structure of the base tables, such as
adding columns or changing column data types. Finally, indexed views come
with measurable resource overhead. When you add or modify data in any of the
view’s base tables, SQL Server must also update the view’s index.
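As a sketch of the two steps involved (the view and index names here are
illustrative, and indexed views carry additional requirements not shown), you
first define the view with schema binding and then create a unique clustered
index on it:
CREATE VIEW v_custemp_idx WITH SCHEMABINDING AS
SELECT C.CUSTNUM, C.CUSTNAME, E.EMPNUM, E.EMPNAME
FROM dbo.CUSTOMER C JOIN dbo.[CUSTOMER EMPLOYEE] E
ON C.CUSTNUM = E.CUSTNUM
GO
CREATE UNIQUE CLUSTERED INDEX ix_custemp_idx
ON v_custemp_idx (CUSTNUM, EMPNUM)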
SELF-CHECK
• Describe the role of indexes in improving data access.
• Describe the role of views in improving data access.
Figure 8-10
Results of running the sp_helpdb system stored procedure in SQL Server Management Studio.
SQL Server documentation tells you how to run the procedure and the infor-
mation it returns. You can see the results. However, you have no idea what SQL
Server went through to collect that information. Users will run procedures you
create the same way. They’ll know the parameters required and what (if any-
thing) it returns, but not the details of how the result is accomplished. This lets
you hide the details of the database and database objects.
Designing Procedures
Most DBMSs let you create custom procedures, sometimes referred to as user
stored procedures. Procedures let you automate periodic or complex activities
and ensure that operations are run consistently. It doesn’t take long to find sev-
eral candidates for procedures. In addition, because of how they are compiled
and run, procedures are more efficient than running ad hoc queries that perform
the same operations. The need for user intervention is reduced and direct access
to database tables is avoided.
One common use of procedures is data entry. Rather than having the user
(or application) run an INSERT command to enter data rows, you create a proce-
dure that runs INSERT for you, passing the column values as input parameters.
The INSERT is performed the same way each time, and the only information
you’ve revealed about table structure is the column values needed to add a row.
More complicated processes are also good candidates for procedures. Because
you can include control-of-flow commands, the procedure can make decisions
while it is running based on the values and other conditions it finds. However,
you should design your procedures so that each procedure accomplishes a sin-
gle task. You can also include error handling code that will let your procedure
detect and respond appropriately to errors, rather than simply failing and report-
ing the problem.
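As one hedged illustration of such error handling (SQL Server 2005 supports TRY...CATCH blocks; the procedure name below is invented, and the CUSTOMER table comes from the examples that follow), a data-entry procedure might respond to a failed insert rather than simply failing:
CREATE PROC usp_enter_cust_safe
@CUSTNUM CHAR(4), @CUSTNAME VARCHAR(40),
@SPNUM CHAR(3), @HQCITY VARCHAR(20)
AS
BEGIN TRY
INSERT CUSTOMER VALUES (@CUSTNUM, @CUSTNAME, @SPNUM, @HQCITY)
END TRY
BEGIN CATCH
-- Respond to the error rather than letting the batch fail;
-- here we simply report which customer could not be added.
PRINT 'Unable to add customer ' + @CUSTNUM
END CATCH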
Creating Procedures
You run CREATE PROCEDURE (or CREATE PROC) to create a user stored pro-
cedure. The basic syntax for this command is as follows:
CREATE PROC[EDURE] procedure_name
[parameter_list]
AS
Sql_statements
The parameter list includes input and output parameters, both of which are
optional. When defining a parameter, you must supply a parameter name and
data type. Optionally, you can specify a default value to be used if the user
doesn’t specify a value for the parameter, and you can identify the parameter
as an OUTPUT parameter. The SQL statement list can include most SQL com-
mands and control-of-flow statements. You can also call other procedures as
nested procedures, which is a procedure called and executed by another pro-
cedure. When the nested procedure finishes running, control is returned to the
calling procedure.
Let’s look at a simple example with four input parameters:
CREATE PROC proc_enter_cust
@CUSTNUM CHAR(4), @CUSTNAME VARCHAR(40),
@SPNUM CHAR(3), @HQCITY VARCHAR(20)
AS
INSERT CUSTOMER VALUES (@CUSTNUM, @CUSTNAME,
@SPNUM, @HQCITY)
When you run the procedure, specify values for each of the input parameters.
If you wanted to add Home Town Supply as a new customer, you might run:
proc_enter_cust ’4554’, ’Home Town Supply’, ’361’, ’Chicago’
Notice that the parameters are entered in the same order as they are specified.
If you also include the parameter names, you can pass the parameters in any
order, as in the next example:
proc_enter_cust @CUSTNUM=’4554’, @SPNUM=’361’,
@HQCITY=’Chicago’, @CUSTNAME=’Home Town Supply’
If you want to have the procedure return a value, you need to specify an
output parameter. Here’s an example that returns a count of customers for a spec-
ified salesperson number:
CREATE PROC usp_count_cust
@SPNUM CHAR(3), @CUSTCOUNT INT OUTPUT
AS
SET @CUSTCOUNT = (SELECT COUNT(*) FROM CUSTOMER WHERE
SPNUM=@SPNUM)
RETURN @CUSTCOUNT
When you run the stored procedure, you have to first declare a variable
to accept the output parameter value from the stored procedure. When the
stored procedure includes an output parameter, you must use the EXECUTE
command to run the procedure. In the next example, we use two variables,
one for the input parameter and one to receive the value from the output
parameter.
DECLARE @NUM CHAR(3)
DECLARE @RETCOUNT INT
SET @NUM = '137'
EXECUTE usp_count_cust @NUM, @RETCOUNT OUTPUT
SELECT 'The count for salesperson ' + @NUM + ' is '
+ CAST(@RETCOUNT AS CHAR(2))
Let's talk through this batch so you can see what's going on. We start by declaring two local variables: @NUM, which will hold the salesperson number, and @RETCOUNT, which will receive the value from the output parameter. The variable receiving the output parameter value must be of a compatible data type.
We set @NUM equal to 137, then use it to run the procedure. We pass @NUM
as the input parameter and specify @RETCOUNT as the output parameter with
the OUTPUT keyword. Finally, we use SELECT to return the value as a string
result. The CAST function is used to cast the output parameter integer value as
a character type to support concatenation.
The syntax for ALTER PROC, which is used to modify a procedure, is effec-
tively the same as the CREATE PROC statement. You must include all of the
same information, including parameters and the defining SQL statements, when
altering a procedure as you do to create a new procedure. Use DROP PROC to
delete existing procedures.
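For instance, to change the data-entry procedure created earlier you would restate the full definition with ALTER PROC, and you would remove it with DROP PROC. A brief sketch (the revised body, which simply names the target columns explicitly, is only an example):
ALTER PROC proc_enter_cust
@CUSTNUM CHAR(4), @CUSTNAME VARCHAR(40),
@SPNUM CHAR(3), @HQCITY VARCHAR(20)
AS
INSERT CUSTOMER (CUSTNUM, CUSTNAME, SPNUM, HQCITY)
VALUES (@CUSTNUM, @CUSTNAME, @SPNUM, @HQCITY)

DROP PROC proc_enter_cust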
A user-defined function (UDF) is similar in many ways to a stored procedure. As you saw earlier, you have the option of deciding whether or not a stored procedure returns a value. A UDF, by design, must return a value.
SQL Server 2005 supports three types of functions:
▲ Scalar functions, which return a single value.
▲ In-line table-valued functions, which return a table produced by a single SELECT statement.
▲ Multistatement table-valued functions, which build and return a table variable.
Designing Functions
Situations where you will use UDFs are similar to those where you would use
stored procedures. You use functions where you need to automate operations,
simplify them, or improve security by providing access to table data indirectly
through a function. You can also use a table-valued function in most of the same
situations where you might use a view. The main difference is that you can control the table returned through the value passed as an input parameter.
When you create a function, you will use one or more SQL statements to
generate the value returned. Most SQL statements are allowed and can include
both other UDFs and system functions.
Creating Functions
The syntax for creating a function depends on the type of function you want to
create. The Transact-SQL CREATE FUNCTION command supports three slightly
different syntax versions, one for each function type. The basic syntax for creat-
ing a scalar function is:
CREATE FUNCTION name
([parameter_list])
RETURNS data_type
[AS]
BEGIN
Sql_statements
RETURN scalar_value
END
The parameter list is for input parameters only. The syntax for specifying an input parameter is the same as for a user stored procedure. If you don't need any input parameters, then an empty pair of parentheses must follow the function name. The return data type is
required. Place the statements used to generate the value in the BEGIN…END
block. The last statement must be a RETURN statement followed by the return
value. Let’s rewrite the earlier procedure used to return a customer count as
a UDF:
CREATE FUNCTION fn_CountCust
(@SPNUM CHAR(3))
RETURNS INT AS
BEGIN
DECLARE @CUST INT
SET @CUST = (SELECT COUNT(*) FROM CUSTOMER WHERE
SPNUM=@SPNUM)
RETURN @CUST
END
Whether or not you specify the relational schema when you create a func-
tion, as with any other database object, the relational schema is part of the fully
qualified function name. With most objects, if you don’t specify the schema,
the default schema (typically dbo) is assumed. However, when you call a UDF,
you must specify the schema. So, to use the function we just created, you might
run:
SELECT dbo.fn_CountCust('137')
The results would be an integer value of 4. To get an output like the one you
saw in the earlier example, you could use:
DECLARE @NUM CHAR(3)
DECLARE @CUSTCOUNT INT
SET @NUM = '137'
SET @CUSTCOUNT = (SELECT dbo.fn_CountCust(@NUM))
SELECT 'The count for salesperson ' + @NUM + ' is '
+ CAST(@CUSTCOUNT AS CHAR(2))
The syntax for an in-line table-valued function is similar, except that the
return type is always TABLE. Here’s the basic syntax:
CREATE FUNCTION name
([parameter_list])
RETURNS TABLE
[AS]
RETURN (select_statement)
This time, let’s create a UDF that doesn’t require an input parameter. Func-
tionally, this will be equivalent to a view. We want to join the CUSTOMER and
SALESPERSON tables to return the customer name, salesperson name, and head-
quarters city. Here’s the statement to create the function:
CREATE FUNCTION fn_GetCust ()
RETURNS TABLE
RETURN (SELECT SPNAME, CUSTNAME, HQCITY
FROM SALESPERSON JOIN CUSTOMER
ON (CUSTOMER.SPNUM = SALESPERSON.SPNUM))
Notice the empty parentheses, identifying that the function does not have any input
parameters. We don’t need to qualify the column names in the SELECT clause
because each of the column names is unique. To use this function, we could use:
SELECT * FROM dbo.fn_GetCust()
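The third type of UDF is the multistatement table-valued function, which builds its result in a table variable that you declare in the RETURNS clause. A rough sketch of the basic syntax, patterned on the example that follows:
CREATE FUNCTION name
([parameter_list])
RETURNS @table_variable TABLE (column_definitions)
[AS]
BEGIN
Sql_statements
RETURN
END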
Notice that, in this syntax, the RETURN keyword is specified on a line by itself, with no return value following it.
For our example, we’re going to create a function that returns either a list of
customers for a specific salesperson or, if no salesperson is specified, a list of all
customers:
CREATE FUNCTION fn_GetBySP
(@NUM CHAR(3)=NULL)
RETURNS @custcopy TABLE (CustNum CHAR(4),
CustName VARCHAR(40), SPNum CHAR(3), HQCity VARCHAR(20))
BEGIN
IF @NUM IS NULL
INSERT @custcopy SELECT custnum, custname, spnum, hqcity
FROM customer
ELSE
INSERT @custcopy SELECT custnum, custname, spnum, hqcity
FROM customer WHERE SPNUM = @NUM
RETURN
END
This time, we'd better step through the function so you can see what's going on inside it. We declare an input parameter named @NUM and set its default value to NULL. In the RETURNS clause we declare a table variable named @custcopy, along with the table definition. If @NUM has a value of NULL, then we insert all rows from CUSTOMER into the table variable. Otherwise, we filter
FOR EXAMPLE
Replacing Views?
Remember the “For Example” box in the last section? You want to be able to
retrieve data about employees, but filter the data returned by department and
job requirements. The solution was to create one view for the Human Resources
manager, another for payroll, another for the employee’s manager, and so on.
You might think that another solution would be to create a multistatement
table-valued function to retrieve the data. When calling the function, you pass
the appropriate user type, such as HR or manager, to filter the result.
The problem is that this isn’t really a workable solution in this situa-
tion. Why not? There are two major flaws. One is that a UDF lets you fil-
ter the result by row, but the same columns are always returned. The only
way around this would be to have the function return every column you
might possibly need, but return NULL values for inappropriate or unneces-
sary data. The second flaw has to do with security. You would need to embed some type of security check inside the function; otherwise, what is there
to prevent the employee’s manager from running the function and passing
the parameter to return data for the HR manager?
There are, however, any number of possible situations for which a UDF
would be well-suited. Say, for example, that you need to run sales reports
by customer and salesperson each month so you can calculate commission
payments. As long as the salesperson number is associated with the sales
records, you could create a UDF to simplify this activity. Create a UDF that
lets you pass the salesperson number and then retrieves data for just that
salesperson. You could even take it a step further with a scalar function. You
could have the function use the salesperson number to retrieve the com-
mission percentage for you, calculate the percentage, and return the com-
mission amount as a scalar value.
The same general rule of thumb applies for determining access through
functions as it does for views. If you don’t give people the information they
need to do their jobs, they will let you know. If you give them more than
they need, you may not find out until after the damage has been done.
the rows we insert into the table variable by the input parameter value. To
retrieve a list of customers for salesperson 137, you could run:
SELECT * FROM dbo.fn_getbysp('137')
If you want to use the default value to retrieve all of the customers from the
table, you would run:
SELECT * FROM dbo.fn_getbysp(DEFAULT)
Notice that if you want to use the default value of an input parameter, you must
pass the keyword DEFAULT as the parameter value.
The syntax for ALTER FUNCTION, which is used to modify a function, is
effectively the same as the CREATE FUNCTION statement. You must specify all
parameters, just as if you were creating a new function. Use DROP FUNCTION
to delete existing functions.
SELF-CHECK
• Compare and contrast user stored procedures and UDFs.
• List and describe the types of UDFs supported by SQL Server 2005.
SUMMARY
In this chapter, you learned about improving data access, which included
improving access performance, simplifying access, and protecting table data from
unauthorized access. You were introduced to the concept of bottlenecks and
shown some of the types of bottlenecks you are most likely to encounter. You
learned about the role of indexes in improving read access, as well as their pos-
sible impact on write access. You learned how to design and create views to con-
trol user access to data. You also learned about using procedures and functions
both to automate database operations (including data access queries) and to con-
trol access to database data.
KEY TERMS
Alert
Bottleneck
Cache
Dedicated server
Disk queue
Function
Summary Questions
1. A perceived hardware performance problem can sometimes mask a
problem with a different hardware component. True or False?
2. Normalization has no impact on data access or database performance.
True or False?
3. Which of the following would you use to capture SQL Server activity?
(a) SQL Server Profiler
(b) System Monitor
(c) Performance Monitor
(d) Database Engine Tuning Advisor
4. Excessive virtual memory paging is typically caused by which of the
following?
(a) a hard disk that is too slow
(b) an overworked hard disk
(c) too little system memory
(d) too slow of a processor
5. What is usually the best solution for memory bottlenecks?
(a) install a faster hard disk
(b) increase available disk space
(c) install a faster processor
(d) install more RAM memory
6. On a Windows XP computer running SQL Server 2005, which of the fol-
lowing would you use to observe real-time hardware performance?
(a) SQL Profiler
(b) Database Engine Tuning Advisor
(c) System Monitor
(d) alerts
7. When is the best time to collect baseline performance data?
(a) before installing SQL Server
(b) before the database application goes live
Figure 8-11
The book database tables: PUBLISHER, AUTHOR, BOOK, CUSTOMER, WRITING (with its BOOKNUM and AUTHORNUM columns shown), and SALE. Primary key columns are indicated by a solid line under the column name.
7. How do the table primary keys, indicated by a solid line under the
column name, relate to default table indexes?
8. Several queries needed by regular operations depend on joining multiple
tables. How can you simplify these queries for your users?
YOU TRY IT

Video Rental Company
You design a database for a video rental company. The database is normalized to 3NF. Whenever a customer rents a title, a rental invoice is generated. This is stored in two tables, RHead and RTail. RHead contains information on the invoice as a whole, including the CustNum to identify the customer, and RTail contains information for each detail item in the invoice.
Three tables are used to track customer information. CustInfo contains general information about the customer. CustSec contains more sensitive information, such as telephone number and address. CustRentals has information about current and past rentals; the table has two rental columns, one for VHS and one for DVD. The CustNum column is used to establish the relationship between the tables. It is also used as the primary key in each of the customer tables.
There are two tables for videos. One contains titles on VHS tape and the other DVDs. You are preparing to add a third table for games, which you plan to call Games. Each title has a unique tracking number that is generated internally. The data type for the tracking number is the same in both tables, NCHAR(10). Each value is unique, so that none of the numbers are duplicated between the VHS table and the DVD table. The tracking number is used as the primary key in the VHS and DVD tables.
The database is deployed on a computer running Windows XP and SQL Server 2005. The computer is also running Internet Information Services (IIS). The database application is written as a Web application, which is hosted on the same server, so employees launch a Web browser on their computers and connect to the application through the local network.
1. Users complain about poor performance when posting rentals. This has been an ongoing complaint and you suspect the server hardware. Assume you suspect memory as the problem. Go through the general steps you would use to verify and correct a memory problem. Other than SQL Server, what additional loads are there on system resources? How could you reduce these?
2. Currently, the only indexes on the database are those that SQL Server created by default. You want to optimize reports run on customer rentals. The report will include basic customer information and the titles of current rentals. How could you retrieve this information (what tables are required)? What indexes should you create to optimize performance (identify them by table and column)? Justify your answer.
3. You want to be able to limit the report generated in Question 2 to a single customer. You want to ensure a consistent result, with the same columns returned the same way each time. How should you do this and keep the number of additional database objects required to a minimum? Describe the object or objects that you would create.
4. How would the solution change if you wanted to retrieve either one customer or all customers?

Periodic Activities
You have several activities that must be completed at the end of each month and at the end of each quarter. These include reporting and data manipulation operations. You need to automate as many of these operations as possible. You also want to simplify them so that you can delegate some of the periodic requirements to someone else.
There are two databases involved, both of which are hosted on the SQL Server 2005 database server OURDATA. The databases are named SalesData and OpsData. End-of-period activities include both databases, but all of the activities run separately on each. There are no requirements for combining data from the databases or transferring data between the databases.
The databases are normalized to 3NF. In some cases, you have included denormalized tables to support read requests. The databases are optimized for read performance. You evaluate index use and update the indexes twice a year. You document the indexes by generating creation script files for each of the indexes any time an index changes. These scripts are kept in a directory on the database server to keep them readily accessible.
Periodic updates are run over weekends to keep them from interfering with user activities.
1. You plan to create objects as necessary separately on each database. What type of database object would you create to perform the periodic activities? Why should you create separate objects for each activity?
2. Rather than running monthly updates, you decide to run them each quarter. When you do, you need to identify the month and year for which each monthly update is being run. How can you do this?
3. You want to be able to return the number of rows updated by the monthly update. How could you do this? How would this change your requirements for running the update?
4. You need to ensure that periodic updates run as quickly as possible so that they are finished over one weekend. What change can you make to the databases to optimize write performance instead of read performance?
5. What kind of database object can you create that will reverse this change with minimal effort on your part? Describe what it would need to do.
6. What periodic changes might you need to make to this object?
9
DATABASE ADMINISTRATION
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of database
administration requirements.
Determine where you need to concentrate your effort.
INTRODUCTION
Administration includes a set of critical responsibilities that begin with data-
base planning and continue as long as the database is in use. Considering the
importance of the data contained in corporate databases, administration is def-
initely not an area to shortchange. In this chapter, we’ll look at justifying
administration roles, identify the two basic administration roles and responsi-
bilities related to each, and talk a little about guidelines for performing admin-
istration tasks.
Figure 9-1
The database application life cycle: Design (data analysis, database design, and application design and testing), Implementation (database deployment and baseline monitoring), Production (day-to-day use and ongoing maintenance), and Retirement (replacement or migration and archiving).
Early in the life cycle, during database design, the tasks focus more on data
requirements. As you move through the life cycle, day-to-day activities and the
database server take precedence. Toward the end of the application’s functional
life, the focus turns back to the data, with possible data migration or data archiv-
ing requirements, merging into a new design cycle. It’s important to understand
administration requirements and roles played by administrators.
Why dedicate staff to managing data? Why fund cost centers that don't produce revenue? At one time or another, most companies
have struggled with these questions, but in today’s heavily data-intensive, infor-
mation-dependent business environment, these functions are recognized as being
more important than ever.
DBMS manufacturers have recognized the need not only for administrators,
but for qualified administrators. Toward this end, most manufacturers offer cer-
tification programs. Certification candidates must take and pass one or more
exams to earn certification. Certifications offered and certification exam objec-
tives vary between different DBMS certification programs. One ongoing problem
with these programs is to design them so that they test for real-world-based
skills and knowledge. A parallel industry has grown up alongside the certifica-
tions, offering certification training with some giving money-back guarantees of
passing.
Justifying Administration
Data as a corporate resource has taken its rightful place alongside other corporate
resources. Virtually all aspects of business have become dependent on their infor-
mation systems and the data flowing through them. Today’s organizations could
not function without their vast stores of personnel data, customer data, product
data, supplier data, and so forth. Indeed, data may well be the most important
corporate resource because it describes all of the others. Furthermore, the effec-
tive use of its data can give a company a significant competitive advantage.
The problem is that available personnel and other resources within a company
tend to be scarce (is there ever enough money to go around?), and there is typ-
ically internal competition for them. Data is no exception. As more and more
corporate functions seek the same data for their work, bottlenecks can form and
the speed of accessing the data can slow. Arguments arise over who “owns” the
data, who should have access to it, and, especially, who is responsible when
problems arise. Without someone clearly responsible for data issues, companies
respond in a variety of ways without completely understanding the problem,
such as bringing in faster computers, but these solutions have their limits and
often come with inherent drawbacks.
Management cannot be left to chance or performance optimization left as an
afterthought. Any shared corporate resource should have a dedicated department
to manage it. It makes little sense to have an important resource either not man-
aged at all or managed part-time and half-heartedly by some group that has other
responsibilities. It also makes little sense to have any one of the groups com-
peting for the shared resource also managing it—the resource manager must
be impartial when a dispute arises. The answer is to have a department, or
personnel within a department, specifically dedicated to data and database man-
agement. These should be personnel whose sole purpose is taking care of cor-
porate data.
FOR EXAMPLE
Recognizing the Need for Administration
Lower-level DBMS products, like Microsoft Access, are specifically designed
to let users design and develop their own databases and simple database
applications. One reason for their popularity is that traditional corporate IS
departments are often slow to respond to users’ requests for changes, even
changes that seem as minor as changing report options. Part of the prob-
lem is that “simple” changes often aren’t that simple when dealing with older
mainframe databases and applications.
The result is that you have “pools” of data throughout the company. This
often results in both duplicated data and duplication of efforts, but this isn’t
the only reason for wanting to maintain more centralized control over cor-
porate data. In fact, there are a wide variety of potential issues.
With individual databases, one person is typically responsible for the data
and its administration, such as it is. Data inaccuracies are common, result-
ing in different users generating different results from the “same” data. Data
updates occur in a haphazard manner, if at all, leading to business decisions
being made on out-of-date information. Backups are often an afterthought,
meaning that data loss is common. Also, most users don’t give any thought
to security, opening the door to data theft and other industrial espionage.
One answer to these potential problems is to put all data, both central-
ized and decentralized, under one common management umbrella. This gives
you a way of setting standards for issues such as data backup and data secu-
rity. It lets you identify and consolidate duplicate data, reducing the effort
required to manage and maintain the data. With modern DBMSs like SQL
Server, it’s even possible to integrate public and private databases. You can
keep a master copy of the data on a database server where it is properly man-
aged and maintained. Individuals can still have their own copies in local
Access databases, kept up-to-date from the master copy. This lets them have
the impression that they are in control over their own data and lets them
meet some of their own custom requirements, such as individualized reports.
SELF-CHECK
• Compare and contrast data administration and database adminis-
tration roles.
• List key justifications for dedicated administration personnel.
applications that will support those efforts. Data planning may be limited to data
generated and used internally within the company. However, today it often
means coordinating with other companies in a supply chain or acquiring exter-
nal customer data for use in marketing. In either case, there is the need to plan
for integrating the new data with the company’s existing data. A number of
methodologies have been developed to aid in data planning. These methodolo-
gies take into account the business processes that the company performs as part
of its normal operations and add the data needed to support these operations.
While the methodologies generally operate at a high strategic level and may not
get into the details of individual attributes, they do provide a broad roadmap
from which to work.
Related to strategic data planning, your long-term data plans and require-
ments, is the matter of what hardware and software will be needed to support
the company’s information systems operations in the future. The questions
involved range from such relatively straightforward matters as how many disk
drives will be needed to contain the data to broader issues of how much pro-
cessing power will be needed to support the overall IS environment. While this
is specifically the database administrator’s responsibility, data administration must
be involved so that the database administrator is not working blindly. Informa-
tion like projected growth based on historical requirements or upcoming additional
needs of which the data administrator might be aware provide the database
administrator with critical information for planning hardware requirements.
Another data planning issue is how metadata should be put to use. With some
DBMSs, metadata and the data dictionary is managed separately from the oper-
ational data. This management involves what data should be stored in the data
dictionary, what uses the data dictionary should be put to, who should interact
with the data dictionary and how, and on what kind of schedule all of this should
take place. Yet another data planning issue many companies must face is the
migration of old, pre-database data and applications into the company’s database
environment. Related to this is the problem of migrating data from one DBMS
to another as the company’s software infrastructure changes.
could be possible, though, because each would be used in a different table and
would result in a unique column name. Similarly, you add unnecessary confusion
if you use Serial Number in some tables and Employee Number in others to refer
to the same value. An even bigger problem arises if you use different identifying
values in different tables, making it difficult (if not impossible) to establish rela-
tionships between tables.
Another issue of data standards relates to data access. It’s important to insist
on consistency in the way that programs access the database. These include
issues such as how connections are established, how data is retrieved, and how
data manipulation commands are passed. Here, the data administrator must
work with application programmers and the database administrator to develop
standards that help minimize problems relating to performance and security.
Also, well-established standards help to make application programs easier to
maintain when it comes time to make changes or add functionality.
Data standards also come into play in the IS interactions between companies
in supply chains, as shown in Figure 9-2. When data is exchanged using electronic data interchange (EDI), a technology and set of related data standards for electronic data transfer between companies, adjustments have to be made to account for
data structure and other differences in the information systems of the two com-
panies involved.
Many companies use non-relational data formats for passing information
such as purchase orders. It’s common to find XML documents based on prede-
fined schemas as the transfer standard. The need to be able to generate messages
in those formats can impact how database data is structured.
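As a hedged illustration of that impact, SQL Server can generate XML directly from relational data with the FOR XML clause. The query below is invented for illustration; it simply exposes customer rows as XML elements rather than modeling a real purchase-order schema:
SELECT CUSTNUM AS '@Number',
CUSTNAME AS 'Name',
HQCITY AS 'City'
FROM CUSTOMER
FOR XML PATH('Customer'), ROOT('Customers')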
In Figure 9-2, you see a retailer sending a purchase order to the warehouse
using EDI technology. The order is received automatically by the computer at
Figure 9-2
A retailer sends a purchase order to a warehouse using EDI.
the warehouse, the order is filled, and the items shipped to the retailer. EDI helps
automate the process, with data administrators on each end managing the data
standards needed.
Managing Training
In some companies, data administrators are responsible for training all those in
the company who have a reason to understand the company’s data. In some cases,
that training can also include the DBMS environment, though that specific need
is usually better met by database administrators. Management personnel need to
understand why the database approach is good for the company and for their
individual functions specifically. Users must understand why the shared data
must remain secure and private. Application developers must be given substan-
tial training in how to work in the database environment, including training in
database concepts, database standards, and how to write DBMS calls in their pro-
grams. These requirements might also include how to do database design, how
to use the data dictionary to their advantage, and in general, what services they
can expect data and database administrators to provide.
Monitoring Performance
One of the key functions performed by database administration is performance
monitoring. Using utility programs, the database administrators can gauge the per-
formance of the running DBMS environment. This activity has a number of impli-
cations. It is important to know how fast the various applications are executing
as part of ensuring that response time requirements are being met. Also, this type
of performance information is pertinent to future hardware and software acqui-
sition plans. Depending on the characteristics of the DBMS and the operating sys-
tem under which it is running, the performance information may be used to redis-
tribute the database application load among different CPUs or among different
memory regions within a system. Finally, performance information can be used
to ferret out inefficient applications or queries that may be candidates for redesign.
The database administrators must also interface with the IS organization’s
systems programming staff, who maintain the mainframe operating systems
(if present). The systems programmers will also have performance and trou-
bleshooting responsibilities, which may overlap with those of the database
administrators. The net of this is that it greatly facilitates matters if the two
groups get along well with each other and can work together effectively. This is
less an issue than it has been in the past as PC-based databases rival (or exceed)
mainframe performance and mainframe-hosted databases become the exception
rather than the rule.
Keep in mind that performance depends not only on the DBMS and data-
base, but the hardware platform, operating system, and even the network. It may
be necessary to work with network administration or PC administration and help
staff to complete some monitoring tasks. This is especially true if you plan to
run tests that will place a heavy load on network infrastructure that could inter-
fere with the performance of other network applications. There also may be
issues relating to database administration security access. As a database admin-
istrator, you could have unlimited access to the DBMS and any hosted databases,
but still have limited access to operating system resources and utilities.
Monitoring Security
Database administrators keep track of which applications are running in the
database environment. They can track who is accessing the data in the database
at any moment, either directly or through an application. Again, there are soft-
ware utilities that enable them to perform these functions.
Monitoring the database users is done from several perspectives. One is the mat-
ter of access security, making sure that only authorized personnel access the data.
This includes managing database users and authorizing user access to the data-
base, as ordered by data administration personnel in conjunction with the data
owners. Another perspective is the need to maintain records on the amount of
use the various users make of the database. This can have implications in future
load balancing and performance optimization work. This information can also
be used in allocating system costs among the various users and applications.
A related concern is database auditing. Even assuming that only authorized
users have accessed the database, reasons involving accounting and error cor-
rection require that a record be kept of who has accessed and who has modi-
fied which data items. This is also critical if you suspect unauthorized access,
someone hacking into the database. This audit trail, the ongoing log of user activ-
ity, is a necessary tool when attempting to detect and hopefully block unautho-
rized access. The level of auditing supported is DBMS-specific. Some DBMSs are
limited to tracking user connections only, or simply to the fact that the user
accessed database data. Others can record detailed activity, including specific data
modifications.
Managing Metadata
There are multiple concerns relating to managing metadata. In most cases, data-
base administrators are the only individuals who should have direct access to
metadata. This means that, except in some specific, limited instances, database
administrators are the only ones who should be making changes to the meta-
data information. When permission to make metadata changes is delegated, it
needs to be both carefully tracked and tightly controlled. You can configure SQL
Server, for example, to not only block metadata changes, but to also log the
attempt, including the user or application trying to make metadata changes.
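One way to set this up in SQL Server 2005 is with a database-level DDL trigger. The sketch below uses an invented trigger name and a hypothetical logging table named SchemaChangeLog; it is only one possible arrangement:
CREATE TRIGGER trg_block_schema_changes
ON DATABASE
FOR ALTER_TABLE, DROP_TABLE
AS
-- Undo the attempted change, then record who tried to make it.
-- dbo.SchemaChangeLog is a hypothetical table
-- (EventInfo XML, LoginName SYSNAME, EventTime DATETIME).
ROLLBACK
INSERT INTO dbo.SchemaChangeLog (EventInfo, LoginName, EventTime)
VALUES (EVENTDATA(), SUSER_SNAME(), GETDATE())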
The database administrator, working with the data administrator, may also
need to publish some of the database structure. Users and application pro-
grammers often need to understand, at some level, how database objects are
structured to enable database access. However, DBMSs give you ways of iso-
lating the database tables from direct access, including views and user stored
procedures.
The database administrator is also responsible for both documenting
and protecting the database metadata. This means backing up the metadata
in case of problems and documenting database and database object structure.
One way of doing this is by scripting the databases, creating scripts that can
be run to recreate database objects or, if necessary, to completely recreate the
database.
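For instance, a generated creation script for one of the sample tables might look something like the sketch below; the column types are taken from earlier examples in this book, while the constraint choices are assumed for illustration:
CREATE TABLE dbo.CUSTOMER (
CUSTNUM CHAR(4) NOT NULL PRIMARY KEY,
CUSTNAME VARCHAR(40) NOT NULL,
SPNUM CHAR(3) NOT NULL,
HQCITY VARCHAR(20) NULL
)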
Managing Software
Database administration personnel will be involved with a wide range of data
and software maintenance activities. The degree of involvement depends on how
the IS department is organized. These activities include installing new versions
of the DBMS, installing fixes or patches (corrections) to the DBMS, performing
backup and recovery operations, and performing any other tasks related to
repairing or upgrading the DBMS or the database. One potential concern, which
must be coordinated with network and system administrators, is permissions.
The database administrator might not have sufficient permissions to install fixes.
It may be necessary to either give the database administrator additional permis-
sions or have a network administrator perform the installations.
Operating system patches and service packs are also an issue, but tradition-
ally not a database administration responsibility. Operating system updates are
more often the responsibility of computer support personnel or network admin-
istrators. Network administrators and database administrators usually work
together to determine which updates need to be applied and scheduling the activ-
ities. Current Windows operating system versions are often configured so that
most updates are applied automatically, without any administrator interaction.
One particular data maintenance activity is modifying the database struc-
tures as tables are modified and new tables are inevitably added. This is fun-
damentally an issue of database design, which means it also involves data
administration.
Managing Troubleshooting
Inevitably, failures will occur for reasons ranging from a bug in the application
code to a hardware or system software failure. The question is, who do the users
FOR EXAMPLE
Determining the Bottom Line
One possible point of contention when dividing work between data adminis-
tration and database administration is determining not only who is responsible
for what, but also who has the final word. The responsibilities listed here are
guidelines based on how the roles are typically defined, but each company has
to fill in the details based on its particular situation and administration needs.
The general rule of thumb is that data administrators set the standards,
while database administrators apply the standards. Data administration is more
of a thinking role, analyzing needs and proposing solutions. Database adminis-
tration is more of a doing role, finding ways to apply solutions in the real world.
The assumption is that each is a specialist in his or her area of responsibility.
Disagreements can arise. Each department addresses issues from its own
frame of reference, which can result in competing solutions. Turf battles
erupt, as each group wants what’s best for “their” database. Though they
shouldn’t, the issues can become personal with the final solution becoming
a matter of personal pride.
Sometimes the deciding factor is simple and direct. Can you or can’t
you physically implement the solution in the chosen DBMS? If not, then
you have to find a different solution. If you can, then things become more
complicated. Before you can settle on the best solution, you have to define
best. What are the overriding project goals? Which is more important, read
performance or write performance? Is it more important to optimize server
performance or to minimize space requirements?
Before these issues arise, or at least before bridges are irreparably burned,
it’s important to determine who within the company is the final arbitrator
between data and database administration. The final decision should go back
at some point to the data consumers, or more likely, the manager responsi-
ble for the data consumers. Early in the design process, you need to think
about design issues that might arise and make decisions then about priori-
ties. Then, identify one person (or a very small group) that both data and
database administration can agree on as having the final say. This makes it
possible to resolve these issues more quickly when they arise (and they will)
and to get back to the job of getting and keeping the database running.
call when a failure occurs? In most organizations, the database administrators are
the direct contact for troubleshooting. The key to the troubleshooting operation
is to make an assessment of what went wrong and to coordinate the appropri-
ate personnel needed to fix it, including systems programmers, application pro-
grammers, and the data administrators themselves.
SELF-CHECK
• List the responsibilities usually delegated to data administration.
• List the responsibilities usually delegated to database administration.
In all cases, pick the tool or utility that is able to complete the task you need
done. Take SQL Server for example. You can perform some common tasks, like
backup and restore, using either a command-line command or SQL Server Man-
agement Studio. Other tasks, such as setting up a data protection mechanism
known as database mirroring, are supported through command-line utilities only,
so there’s no choice to make.
After considering your requirements, if you still have multiple options avail-
able, and other factors are more-or-less equally balanced, consider ease of use.
There is no reason to make your job more difficult than necessary. Consider,
again, running a data backup. Most database administrators find it easier to use
SQL Server Management Studio to set up backups, especially when different
types of backups running under different schedules must be included, because
of the graphical scheduling interface.
Deciding on Automation
How do you determine whether or not you should automate a task? There are
a few simple guidelines you should consider. First, how often do you perform
the task and under what circumstances? If it’s something that you’ll be doing
once, like creating a specific database table or adding a specific user to the data-
base, then you will usually want to do the job and be done with it. The same
is true for procedures that are rarely performed, or performed intermittently with
no set schedule.
The opposite is true for periodic procedures that are performed on a regu-
lar basis. Once again, let’s consider database backups. Backups are an ongoing,
periodic activity, usually occurring on a set schedule. How often you back up
your database depends on factors such as how quickly the data changes, how
much data you can afford to lose (or repost), and what other mechanisms you
have in place to protect your data. At least part of the data in most databases is
backed up at least once a day and, in many cases, even more often than that.
Trying to run those backups manually would be a waste of effort, especially when
you consider that backups are usually run after hours whenever possible.
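As a simple hedged sketch, the statement an automated nightly backup job might execute could be as plain as the following; the database name comes from earlier examples, and the file path is a placeholder:
BACKUP DATABASE GeneralHardware
TO DISK = 'D:\Backups\GeneralHardware_full.bak'
WITH INIT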
As is usually the case, there is an important exception to these automation
guidelines. It’s sometimes better to automate the execution of a procedure, even
when setting up the automation takes longer than it would to just run it. This
is most often the case when running procedures that either interfere with user
access to the database or severely degrade server performance while they are run-
ning. Another possibility is that you are supporting multiple locations and you
want to ensure that a change is made to all of the locations, the same way, and
at the same time. If one database is updated and the others are not, you have a
consistency problem. You can avoid this by scheduling the change to happen
the same way and at the same time, in all locations.
As an example, think about table indexes. They can become fragmented over
time. Fragmentation is a condition in which the data is randomly spaced in
small pieces all over the hard disk rather than being located in one place. This
happens over time as data is added to and removed from a table. When frag-
mentation becomes severe enough, performance starts to degrade and you have
to rebuild the indexes to correct the problem. However, rebuilding indexes is a
resource-intensive process and can result in tables becoming temporarily unavail-
able. Often, the best solution is to schedule the rebuild to occur automatically
after hours when the impact on other operations is minimized.
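In SQL Server 2005, for example, you might check fragmentation and then rebuild the affected indexes with statements like these (the table name is a placeholder); a scheduled job could run the rebuild after hours:
-- Report fragmentation for every index in the current database.
SELECT OBJECT_NAME(object_id) AS TableName, index_id,
avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED')

-- Rebuild all indexes on a badly fragmented table.
ALTER INDEX ALL ON dbo.CUSTOMER REBUILD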
One advantage of running such work through a procedure is that it lets you pass parameters to control the process. With SQL
Server 2005, you can also create maintenance plans that define the tasks to per-
form, task options, and even the execution schedule.
Figure 9-3
Maintenance plan.
Once created, either manually or through the Maintenance Plan Wizard, you
can modify your maintenance plan. You can change, for example, the tasks to
run, task options, precedence constraints, and the task schedule. You can also
delete a maintenance plan when no longer needed.
This ability to create scheduled maintenance plans is not unique to SQL
Server, but SQL Server has automated the process more than most other DBMSs.
This is also not your only option for automating process execution in SQL Server.
You can also define SQL Server Agent jobs, which can include operating system
commands as well as SQL command statements, with periodic schedules. You
can define alerts, both through SQL Server and through the Windows operating
system, that execute in response to different conditions. However, these options
for automated execution are beyond the scope of this chapter.
Figure 9-4
Available tasks.
FOR EXAMPLE
Simplifying Your Life
Here’s the situation. A company has just implemented a database that will
contain all of the corporate data. You’ve been hired to fill both the data and
database administration roles. Any time the company is open for business,
you’re on call. How can you possibly do this job and have any hope of
maintaining a life outside of work?
The key is proper use of automation. Identify all of the tasks that need
to be performed on a periodic basis. This includes what tasks you need to
perform, when, and how they should run (what databases, tables, and so
forth are involved). Create and schedule automated procedures to do these
for you, and have the database server log the results. That way, as long as
there aren’t any problems, you just need to check the logs every so often
and make sure that everything ran right.
Want to make your life even easier? Few things are more frustrating than
an unexpected page at an inopportune moment. What if you could head off
some of these in advance? By carefully reviewing the database design and
monitoring database activities, it is often possible to identify potential prob-
lem areas and put an automated solution in place. For example, some mod-
ules within the database application create working tables in the database.
It is supposed to drop those tables when finished, but that part doesn’t
always happen. The application programmer is (supposedly) working on a
solution, but for now, it’s your problem. The tables build up and eventually
you run out of hard disk space. When that happens, no new data can be
entered into the database.
The fix is easy. The tables are named so that they are easy to identify
and isolate, so you just need to drop the tables and shrink the database. In
fact, you’ve even already created a procedure that does that for you. Now you
need to take the next logical step. You can create an alert that monitors avail-
able hard disk space. Before available space reaches a critical point, say when
10 percent remains, you have the alert run the procedure for you.
Fixing only the symptom means the problem is likely to come up again when least expected or convenient. Fixing the underlying problem also
includes verifying the correction. This doesn’t mean that you will always find the
cause, especially when dealing with data errors or corrupted records. These are
sometimes caused by transient problems, by one-time glitches like a momentary
power surge. In that case, it’s unlikely that you will find a root cause, but that
doesn’t mean you shouldn’t try.
Here’s an example. A user complains that the application reports a data error
when trying to access a specific data row. You verify that you can duplicate the
problem, find that the problem is consistently with that one table, and correct
by restoring the data from backups. Problem fixed, right? Not necessarily,
because the underlying cause could be that the hard disk is beginning to fail and
that the data error is an early warning sign. In this example, you need to make
your best effort to try to discover the cause of the error, which in this case might
include running disk utilities or diagnostic programs to check the hard disk.
SELF-CHECK
• Explain, in general terms, how application life cycle phases relate to
administration responsibilities.
• List and describe the factors to consider when choosing a tool or
utility to perform a management task.
SUMMARY
In this chapter, you learned about administration roles and responsibilities. You
learned how you can justify the need for dedicated administrators. Next, you
learned about the responsibilities of data administrators and database adminis-
trators. Finally, you learned about issues relating to administration tasks, such as
choosing the proper utilities and using automation appropriately.
KEY TERMS
Arbitration
Audit trail
Automated tasks
Business problem
Data administrator
Data analyst
Database administrator
Data consumer
Data flow
Data planning
Electronic data interchange (EDI)
Enterprise resource planning (ERP) software
Fragmentation
Maintenance plan
Manual task
Precedence constraint
Publicity
Strategic data planning
Systems analyst
Summary Questions
1. Which of the following is another term for data administrator?
(a) data analyst
(b) system analyst
(c) database administrator
(d) systems programmer
2. Why is it important to have an administrator dedicated to day-to-day
operational tasks?
(a) to ensure that the tasks are done properly and on time
(b) because of the level of technical expertise required to perform
management tasks
(c) to ensure that decisions are made based on requirements of the busi-
ness as a whole
(d) all of the above
3. Database administration is not needed in a decentralized data environment.
True or false?
4. Data administrators and database administrators perform identical job
functions. True or false?
5. Which of the following is not a phase of the database application life cycle?
(a) design
(b) evaluation
(c) implementation
(d) production
6. Which of the following is primarily a database administrator responsibility?
(a) arbitration
(b) security monitoring
(c) setting data standards
(d) data coordination
7. Which of the following is primarily a data administrator responsibility?
(a) physical database design
(b) performance monitoring
(c) publicity
(d) security monitoring
8. The database administrator should be the primary point of contact when
database failures occur. True or False?
9. Which of the following statements best describe data standards
management?
(a) The database administrator has sole responsibility for setting data
standards.
(b) How connections are made for data access should be left up to the
individual application programmers.
(c) Data standards include naming and data access standards.
(d) none of the above
10. Who should have unlimited access to database metadata?
(a) users
(b) application programmers
(c) data administrators
(d) database administrators
11. The database administrator should be the only person involved when
troubleshooting database problems. True or False?
12. Training should be primarily delegated to data administration. True or
False?
13. Automated tasks should be limited only to tasks running on a periodic
schedule. True or False?
14. During which phase of a database application’s life cycle do most of the
responsibilities fall on the data administrator?
(a) design
(b) implementation
(c) production
(d) none of the above
15. A manual task is one that requires operator intervention. True or
False?
16. GUI-based utilities can be used with manual tasks only. True or False?
17. Which of the following would most likely be set up as a recurring task
executing on a set periodic schedule?
(a) database table creation
(b) index rebuild
(c) database backup
(d) adding login accounts
10
TRANSACTIONS AND LOCKING
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of
transaction and locking support.
Determine where you need to concentrate your effort.
INTRODUCTION
Transactions are a key part of nearly any database application. Transactions
provide a means of helping to ensure that you maintain database integrity and
consistency by requiring that either all dependent changes are made to the
database data or none of the changes are made. In this chapter, you will learn
about transaction processing, including transaction properties and transaction
scope. You will also learn about potential concurrency errors that can occur
and methods different DBMSs use to avoid these. You’ll also see how to apply
transactions in a real-world environment, using examples based on SQL Server
2005.
Figure 10-1
Order processing: the Inventory, Sales, OrderHead, OrderTail, and CustomerBalance tables.
they’re no longer consistent with each other. Their balances are no longer based
on the same information. The same is true with the order tables and Cus-
tomerBalance. One of the goals of transactions is to prevent such inconsistencies
from happening.
Some DBMSs keep completed transactions in the transaction log until a backup of the log is run, and then delete the completed transactions at that time.
What if a hardware error occurs before the changes can be written perma-
nently to the hard disk? That is where the transaction log becomes a critical part
of transaction management, through a process known as recovery. After you
restart the DBMS after a failure, the DBMS checks the transaction log for com-
pleted, which is to say committed, transactions. These transactions are rolled
forward, which means that the database table updates are made from the changes
recorded in the transaction log. Otherwise, the change might be lost. Uncommit-
ted transactions are treated as if they don’t exist, with no changes made to the
database.
Some DBMSs support additional statements that let you name a transac-
tion or let you set points within the transaction that let you perform a partial
rollback, which is when changes are reversed back to a specified point within
the transaction, but we’re limiting our discussion to these basic transaction
commands.
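For reference, here is a hedged sketch of what a named transaction with a savepoint-based partial rollback looks like in Transact-SQL; the transaction name, savepoint name, and data values are all invented for illustration:
BEGIN TRAN OrderEntry
UPDATE CUSTOMER SET HQCITY = 'Chicago' WHERE CUSTNUM = '4554'
SAVE TRAN AfterCustomer
UPDATE SALESPERSON SET SPNAME = 'Smith' WHERE SPNUM = '361'
-- Reverse only the changes made after the savepoint;
-- the transaction itself remains open until committed.
ROLLBACK TRAN AfterCustomer
COMMIT TRAN OrderEntry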
Many programming languages provide direct support for transaction man-
agement. For those that don’t, you control transactions by passing the transac-
tion commands to the database server. Either way, it is strongly suggested that
database applications be designed so that all modifications to the database are
performed in the context of a transaction.
When run in a batch or as part of an application, logic must be included to
test for or respond to errors, so that the program can correctly commit or roll
back the transaction. When using transactions from the command line, you
need to manually commit or roll back the transaction. One way of doing this,
demonstrated in Figure 10-2, is having a decision point after each statement is
executed to determine whether or not an error occurred. Figure 10-2 represents
this approach as a flowchart.
Figure 10-2
Transaction flow. (The flowchart begins with BEGIN TRAN; each update of the CUSTOMER table is followed by an error check, and the flow ends with either COMMIT TRAN or ROLLBACK TRAN.)
Using Transactions
Exactly when, then, do you use transactions? The short answer is that you should
use transactions any time you could need to roll back your changes in case of
an error. This includes all critical or sensitive database activities, or activities that
need to reference multiple data sources.
Think about what happens when a warehouse sale is posted as a customer
order. When the order is posted, it needs to:
▲ Update the inventory records for each of the order line items.
▲ Update the customer’s outstanding balance with the order amount.
▲ Create a customer order record documenting the sale.
The order would then be forwarded to the warehouse for filling and ship-
ment. Now what happens if there is a problem with one of the steps, such as a
problem with creating the customer order record? Both the inventory and cus-
tomer records would be updated, but because the order isn’t created, it doesn’t
go to the warehouse for fulfillment. The items never get pulled and shipped. The
data in the database is inconsistent. The inventory records show fewer items on
hand than are actually there and the customer is charged for items never
received. A problem like this could be very difficult to isolate and correct.
If the order were run in the context of a transaction, when there was an
error creating the customer order, the changes to the customer records and
inventory records would be rolled back. The customer still doesn’t receive the
order, but the inventory and customer records remain consistent. The customer
is never charged for the order.
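As a hedged sketch, the order posting might be wrapped in a single transaction like the following. The table names follow Figure 10-1, but the column names and values are hypothetical, and a single error check decides the outcome, much like the flow shown in Figure 10-3:

DECLARE @Err int
SET @Err = 0
BEGIN TRAN
   UPDATE Inventory
      SET OnHand = OnHand - 5
      WHERE ItemID = 120                       -- order line item
   SET @Err = @Err + @@ERROR
   UPDATE CustomerBalance
      SET Balance = Balance + 250.00
      WHERE CustomerID = 1001                  -- customer's outstanding balance
   SET @Err = @Err + @@ERROR
   INSERT INTO OrderHead (OrderID, CustomerID, OrderTotal)
      VALUES (5001, 1001, 250.00)              -- customer order record
   SET @Err = @Err + @@ERROR
IF @Err = 0
   COMMIT TRAN       -- all three changes are kept together
ELSE
   ROLLBACK TRAN     -- any error reverses all of them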
Figure 10-3
Transaction flow with a statement block. (The updates of the CUSTOMER table run as a single statement block after BEGIN TRAN; a single error check follows the block, and the flow ends with either COMMIT TRAN or ROLLBACK TRAN.)
FOR EXAMPLE
Automatic Recovery
Let’s take a moment to consider the importance of the transaction log in
maintaining database consistency and integrity. During normal operations,
at any point, you will likely have several updated data pages in the cache
(in memory) waiting to be written to the hard disk. This helps improve
overall performance because the writes can be delayed until the disk and
processor are less active. But what happens if you have an unexpected fail-
ure? A UPS fails suddenly, shutting off the server without a proper shutdown.
Someone accidentally (or maliciously) unplugs the server. In short, some-
thing happens.
What does that mean from a data standpoint? It means that you likely
have inconsistencies in the database. Some, but not all, of your updates
have been written to the hard disk. Even worse, you don’t know which
updates have been made, so you don’t even have a starting point for
recovery.
Maybe you don’t have a starting point, but the database server does.
The database server has the transaction log. Remember that when a check-
point is issued, it causes all dirty pages (pages that have been modified
in memory) to write to the hard disk immediately. When the server starts
back up, it knows where to start the recovery process, from the most
recent checkpoint. Starting at that point, it applies all of the committed
changes it finds. The committed change is saved as a data page, so even
if the update had already been made, the server is just overwriting the
page with the same current data. The important point is that any dirty
pages that weren’t written to the hard disk before the failure are written
during the recovery.
Unfortunately, there’s nothing the server can do about recovering
uncommitted transactions because it has no way of knowing whether or not
they completed successfully. Any pages relating to uncommitted transactions
are not written to the hard disk and are, in fact, removed from the transac-
tion log because they are no longer needed.
This doesn’t mean that there isn’t some level of manual recovery
required. You still need to check any data that was recently posted to the
database to make sure that these changes were made. If not, you will have
to repost them manually. Why? Even though you might have completed the
operation, depending on when the error occurred, the application might not
have committed the transaction before the failure.
10.1.4 Understanding Transaction Scope
Explicit Transactions
We’ll start with explicit transactions because they’re the easiest to understand.
With an explicit transaction, the entire transaction life cycle is controlled by
explicit statements: you start the transaction with BEGIN TRAN and end it with
either COMMIT TRAN or ROLLBACK TRAN.
Implicit Transactions
With implicit transactions, the DBMS starts a new transaction automatically the
first time one of the following statements executes, and the transaction stays
open until you explicitly commit or roll it back:
▲ ALTER TABLE
▲ Any CREATE statement
▲ DELETE
▲ Any DROP statement
▲ FETCH
▲ GRANT
▲ INSERT
▲ OPEN
▲ REVOKE
▲ SELECT
▲ TRUNCATE TABLE
▲ UPDATE
Some of these statements might not be familiar to you. The FETCH and
OPEN statements are used with server-side cursors, which provide a way of pro-
cessing individual rows within a result set. GRANT and REVOKE are used with
security management to allow and remove permissions. Cursor use and security
management are beyond the scope of this chapter.
The biggest drawback of using implicit transactions is that you don’t control
the start of the transaction, but you must identify the end of the transaction.
This can lead to very long running transactions which, in turn, can result in
access problems and impair performance.
The main reason for enabling implicit transactions is that they are required
by some legacy applications. You should not develop new applications to use
implicit transactions. The ODBC API even includes commands to place the data-
base in implicit transaction mode. The OLE DB API can be used with implicit
transactions, but does not include a way of putting the database into implicit
transaction mode. It can, however, issue commands that disable implicit trans-
actions. The ADO and ADO.NET APIs do not support implicit transactions.
Autocommit Transactions
With autocommit transactions, which is the default state for SQL Server and
many other DBMSs, each statement is treated as an individual transaction. If you
want a statement to run and, if no errors occur, commit—there’s no need to do
anything special when you execute the statement.
When operating in an autocommit mode, explicit transactions are still sup-
ported. It’s not uncommon to have some statements, like data retrieval queries,
run as stand-alone transactions in autocommit mode, while using explicit trans-
actions when multiple dependent statements are used.
Nested Transactions
Most DBMSs support nested transactions. A nested transaction occurs when
you explicitly start a new transaction while already operating within the scope
of a transaction. Nested transactions can be used with both explicit and implicit
transactions.
Here’s an example. The line numbers are included for reference only. The
indents are included to make the nested transaction easier to see.
1  BEGIN TRAN
2     INSERT CUSTOMER VALUES ('1442', 'Get it here', '137', 'Memphis')
3     BEGIN TRAN
4        INSERT CUSTOMEREMPLOYEE VALUES
5           ('1442', '3221', 'Thomas', 'Jane', 'Owner')
6        IF @@ERROR = 0
7           COMMIT TRAN
8        ELSE
9           ROLLBACK TRAN
10    -- More processing statements here
11    -- Commit or roll back the transaction
The BEGIN TRAN in statement 1 is the start of the outer transaction. You add
a row to the CUSTOMER table. The BEGIN TRAN statement in line 3 starts the
inner transaction, which adds a row to the CUSTOMEREMPLOYEE table.
Depending on the value of @@ERROR, which identifies whether or not an error
occurred when the most recent statement executed, you either commit or roll-
back the inner transaction. This does not commit the statement in line 2. The
ROLLBACK TRAN statement acts differently, however. If you execute the ROLL-
BACK TRAN in line 9, everything rolls back to the initial BEGIN TRAN statement.
Even though you run the COMMIT TRAN statement in line 7, it doesn’t
mean that the change will write to disk. This change is still controlled in the
context of the outer transaction. If, after the statements shown here, you run
COMMIT TRAN, both added rows would write to their respective tables. If you
run ROLLBACK TRAN, all changes, including those in the inner transaction, are
rolled back.
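One way to watch this behavior from a query window is @@TRANCOUNT, which reports how many BEGIN TRAN statements are currently open on the connection:

BEGIN TRAN              -- outer transaction; @@TRANCOUNT becomes 1
BEGIN TRAN              -- inner transaction; @@TRANCOUNT becomes 2
SELECT @@TRANCOUNT      -- returns 2
COMMIT TRAN             -- only decrements the count; nothing is written yet
SELECT @@TRANCOUNT      -- returns 1
COMMIT TRAN             -- outermost commit; the work is now permanent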
You should try to avoid the use of explicit nested transactions like the pre-
vious example. That’s because a transaction should represent a single unit of
work, that is, it should accomplish a single task. Also, as you can see in that
example, there really isn’t much benefit to including a nested transaction. So why
even bring them up? Because you might end up using a nested transaction with-
out ever realizing it. You can end up with a nested transaction when calling pro-
cedures that include transactions. For example:
BEGIN TRAN
DECLARE @TestVal int
EXEC p_GenSummary @TestVal OUTPUT
IF @TestVal = 0
COMMIT TRAN
ELSE
ROLLBACK TRAN
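The nesting happens because the procedure contains its own transaction. The body shown below is purely an illustration of what a procedure such as p_GenSummary might contain, not its actual definition, and SummaryLog is a hypothetical table:

CREATE PROCEDURE p_GenSummary @Result int OUTPUT
AS
BEGIN TRAN                                 -- nested if the caller already opened one
   INSERT SummaryLog (RunDate) VALUES (GETDATE())
   SET @Result = @@ERROR
COMMIT TRAN                                -- decrements @@TRANCOUNT; the caller's
                                           -- final COMMIT or ROLLBACK still decides
                                           -- whether this insert is kept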
a query window in SQL Server Management Studio, for example. If you close a
connection that contains open transactions, those transactions are rolled back.
Okay, so how do you handle these? Let’s take the last item first, because it’s
potentially the trickiest one of the set. This isn’t something that the database can
easily check through table constraints, so it’s an issue better resolved by the appli-
cation. The application needs to verify that there is at least one valid detail item
before committing the transaction. An even better solution would be using that
as a decision point for whether or not it even starts the transaction. If you never
start it, then you don’t have to roll it back because the order doesn’t qualify.
The other two conditions are easier to enforce because they can be han-
dled by table constraints. You could use a unique constraint to enforce unique-
ness on the order number and a foreign key constraint to enforce the rela-
tionship. If either is violated, the database server generates an error. However,
instead of just having the transaction fail and roll back, you could include logic
FOR EXAMPLE
Transactions from the Command Line
Let’s talk about explicit transactions from the command line. When using SQL
Server 2005, considering that it defaults to autocommit transactions, you might
wonder why you would need, or even want, to use explicit transactions.
One possible reason is simple. We all sometimes make mistakes. When
connected to a database server with administrator permissions and making
direct changes to table data, the consequences of our mistakes could be far-
reaching. Say you’re doing some cleanup work in the database and you’re
deleting some old customer records. You run:
delete customer where custid = 'AJA1244'
delete customer where custid = 'BX54T4'
delete customer
At some point, immediately after you press F5, you realize your mis-
take. You immediately run:
ROLLBACK TRAN
The query window responds with:
Msg 3903, Level 16, State 1, Line 1
The ROLLBACK TRANSACTION request has no corresponding BEGIN
TRANSACTION.
You’ve lost all of the data in the customer table. The situation isn’t hope-
less. You have backups (hopefully) and you can restore the customer table
from the most recent backup. That means an interruption to user opera-
tions, lost time, lost money, and possibly a lost job (yours).
What if you ran the following, instead:
begin tran
delete customer where custid = 'AJA1244'
delete customer where custid = 'BX54T4'
delete customer
This would give you a chance to verify your changes before committing the
transaction and if you do have a mistake, like the typo in the last DELETE
statement, you can roll the changes back and start again. Now, if you still
end up committing the changes without first verifying, you’re on your own.
in the application to correct the error and resubmit the rows for insertion into
the appropriate tables.
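A sketch of those two constraints in SQL Server syntax might look like this; the table and column names are hypothetical stand-ins for the order tables:

ALTER TABLE OrderHead
   ADD CONSTRAINT UQ_OrderHead_OrderNumber UNIQUE (OrderNumber)

ALTER TABLE OrderTail
   ADD CONSTRAINT FK_OrderTail_OrderHead
      FOREIGN KEY (OrderID) REFERENCES OrderHead (OrderID)

If either constraint is violated while the order is being posted, the server raises an error that the application can trap before deciding whether to commit or roll back.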
Another potential problem is concurrent access of the same data. What if
another transaction is updating the same tables at the same time? Here’s what might
happen. Both transactions read the balance from CustomerBalance at the same time.
Each transaction calculates a new balance based on the order being processed. One
transaction updates CustomerBalance, then the other. The change made by the sec-
ond transaction overwrites the change made by the first, so that the first change is
lost. CustomerBalance no longer matches up with the customer orders. This is an
issue of concurrency, which we’ll discuss in more detail later in this chapter.
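One simple design defense against this particular lost update, assuming the balance lives in the CustomerBalance table from Figure 10-1 (the column names are hypothetical), is to let a single UPDATE statement apply the change rather than reading the balance into the application, recalculating it there, and writing it back:

UPDATE CustomerBalance
   SET Balance = Balance + 250.00    -- the DBMS reads and writes the row under
   WHERE CustomerID = 1001           -- one lock, so neither order's change
                                     -- overwrites the other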
SELF-CHECK
• Describe the properties of an ACID transaction.
• Compare and contrast implicit and explicit transactions.
Figure 10-4
Sample transactions. (Transactions T1 and T2 run against the warehouse database's TextBookInventory and Customer tables. Each checks the on-hand quantity for item 120, checks credit for its customer, 1001 and 1040 respectively, reduces the on-hand quantity, generates an order, and commits.)
Figure 10-5
(Transactions T1 and T2 against the warehouse database: T1 reduces the on-hand quantity for item 120 by 1000 and T2 reduces it by 600; each then checks customer credit, generates an order, and commits.)
Item 120 has 1100 units on hand before transactions T1 and T2 start pro-
cessing. Textbook quantities are reduced by the amount ordered by customer
1001. For item 120, it is reduced by 1000 copies, leaving 100 on hand. In
T2, the application checks inventory levels for customer 1040’s order and
finds that there are insufficient copies to fill the order. This is a dirty read
because the change made by T1 has not been committed. The line item for
item 120 is canceled out due to insufficient quantity to fill the order. After
this happens, in T1, it is determined that customer 1001 has insufficient
credit, so all of the changes to TextBookInventory made by T1 are rolled back.
However, this happens too late to fill those items for customer 1040. The cus-
tomer doesn’t get the books, and the company possibly loses a sale, because
of a dirty read.
A nonrepeatable read is also referred to as inconsistent analysis. This
occurs when one transaction is reading a table multiple times while another
transaction is modifying the same table. Take a look at Figure 10-6.
Figure 10-6
(T1 reads all of TextBookInventory, checks processing criteria, identifies and rereads the qualifying rows, processes and modifies values, and commits, while T2 orders more copies of items 120 and 200 and commits.)
Figure 10-7
(T1 reads TextBookInventory rows 100 through 500, calculates row changes, rereads and modifies each row, and commits, while T2 deletes item 200 and commits.)
FOR EXAMPLE
Why Concurrency Matters
Most businesses have the same basic goal, to make money. That means that
they are looking for ways to either make more money, or reduce their
expenses, the cost of making money. Some expenses are obvious, such as
building, inventory, and salaries. Others are hidden and not always noticed.
That leads us to concurrency errors.
Many experts agree that cutting costs pays off better in profits, dollar-
for-dollar, than increasing revenues. Errors cost, including concurrency
errors. Think about the examples in this chapter, and you can start to see
some of the hidden costs, such as the following:
Administrators can lose a lot of time, and possibly have to cause addi-
tional server down time, looking for the source of intermittent problems.
Some of these problems could be caused by transient concurrency errors.
Because of the brief error duration and mixed symptoms, problems like this
can be especially hard to isolate and correct.
The cost of any one of these (except possibly the lost order) is minimal
when taken by itself. Over time, and over multiple occurrences, they can
add up to a measurable expense.
SELF-CHECK
• What are the benefits of concurrent processing?
• List design considerations that can help minimize concurrency
errors.
• List methods used by various DBMSs to minimize concurrency
errors.
issue checkpoints. There are also database server events and conditions that will
cause a checkpoint to be issued, including the following:
▲ The active portion of the log is larger than can be efficiently recovered
after an error.
▲ An ALTER DATABASE statement is executed.
▲ The database is backed up.
▲ The server is stopped or shut down.
This is a partial list, but it gives you a good idea of the types of events that
cause a checkpoint to be issued. Transact-SQL also includes a CHECKPOINT
command that manually issues a checkpoint when run. When the checkpoint is
issued, it forces a write of all committed dirty pages to the hard disk.
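Running it manually from a query window is as simple as:

CHECKPOINT    -- forces committed dirty pages in the current database to be written to disk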
The SQL-99 standard defines four transaction isolation levels:
▲ Read uncommitted
▲ Read committed
▲ Repeatable read
▲ Serializable
SQL Server supports one additional level that is not part of the SQL-99 stan-
dard, snapshot isolation. The level is set for a connection using the SET TRANS-
ACTION ISOLATION LEVEL statement.
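For example, to run a connection under a stricter level and then return to the default (TextBookInventory is the sample warehouse table used earlier; the column names are hypothetical):

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

BEGIN TRAN
   SELECT OnHand FROM TextBookInventory WHERE ItemID = 120
   -- rereading the row later in this transaction returns the same value
COMMIT TRAN

SET TRANSACTION ISOLATION LEVEL READ COMMITTED   -- back to the default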
The read uncommitted isolation level is the least restrictive, but also pro-
vides the least protection against possible concurrency errors. When running
under this isolation level, transactions do not acquire a shared lock when read-
ing data. Transactions also ignore exclusive locks, meaning that they can read
data that has been locked for modification. Because of this, your transactions can
be affected by dirty reads, nonrepeatable reads, and phantoms.
The read committed transaction isolation level is the default database level.
This level prevents a transaction from reading uncommitted changes made by other
transactions. This means that dirty reads are prevented, but nonrepeatable reads
and phantoms are both still possible. That is because it is possible for another
transaction to make changes to the data and commit the changes between reads.
The repeatable read transaction isolation level prevents both dirty reads and
nonrepeatable reads. This is because it prevents the current transaction from
reading uncommitted data and also prevents other transactions from modifying
data being read by the current transaction until after the transaction completes.
Figure 10-8
Deadlocked transactions. (T1 has TextBookInventory open and is waiting on OrderTail, while T2 has OrderTail open and is waiting on TextBookInventory.)
FOR EXAMPLE
Lower Level—Better Performance?
SQL Server 2005 defaults to the read committed transaction isolation level,
but is that always the best choice? The fact that SQL Server offers choices
for setting the transaction isolation level in itself tells you that this is not
always the case. Not only can you set the transaction isolation level when
you open a command line connection, like a query window, but you can
also specify the transaction isolation level when establishing a connection
from a database application. But how do you choose?
Choosing the right transaction isolation level depends on several factors.
The time you take to make the right choices can pay off, though, in better
data integrity and possibly in better performance. The data integrity is easy
to see. The higher the isolation level, the better protected the transaction
from interference from other transactions. Performance issues might not be
as obvious.
In the simplest terms, the less restrictive the isolation level, the less likely
that your application might need to wait for resources. Does this mean that
a lower level always means better performance, though? Not necessarily. Per-
formance includes both database and application performance. Any perfor-
mance gains from lowering the transaction isolation level could be lost if the
application has to do additional verification testing or has to correct errors
that it has detected.
and respond appropriately. The most common error, as with the situation in Fig-
ure 10-8, is that one transaction eventually times out. As a response, this means
recognizing that an update has timed out, rolling back the transaction, and then
resubmitting the changes in the context of a new transaction.
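SQL Server 2005's TRY...CATCH blocks give you one way to build that kind of retry logic. This is only a sketch; the UPDATE is a stand-in for the real work, and a deadlock victim reports error number 1205:

DECLARE @Tries int, @Done bit
SELECT @Tries = 0, @Done = 0
WHILE @Done = 0 AND @Tries < 3
BEGIN
   SET @Tries = @Tries + 1
   BEGIN TRY
      BEGIN TRAN
         UPDATE Inventory SET OnHand = OnHand - 1 WHERE ItemID = 120
      COMMIT TRAN
      SET @Done = 1                              -- success; stop retrying
   END TRY
   BEGIN CATCH
      IF @@TRANCOUNT > 0 ROLLBACK TRAN           -- undo the failed attempt
      IF ERROR_NUMBER() <> 1205 SET @Done = 1    -- only resubmit deadlock victims
   END CATCH
END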
Figure 10-9
Suspended transaction. (A SQL Server dialog for the DATADEV\Administrator connection on server DATADEV, showing the operation's progress as done.)
When you kill a process, you exit it without completing any open transac-
tions. You can do this by running the KILL command followed by the process ID of the connection you want to end.
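For example, you might look up the process with sp_who (or Activity Monitor) and then end it; the process ID 53 here is purely hypothetical:

EXEC sp_who      -- lists current connections and their process IDs
KILL 53          -- ends that connection and rolls back its open transactions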
FOR EXAMPLE
Could It Be—Deadlocks?
One area where many database applications fall short is in error reporting.
The application receives errors from the database server, but either isn’t
designed to evaluate the error’s cause or it doesn’t pass the errors along to
the user. Instead the applications just report some type of generic database
error.
If database activity increases with time, it’s not unusual that database
errors will also increase. It’s not that the rate of errors is increasing, but
as the overall work volume increases, the number of errors increases. In
some cases it could be that you’re pushing the limits of the hardware, or
that there is some inherent flaw in the database design that doesn’t
become significant until activity reaches a threshold point. However, a
likely culprit for increased errors as database activity increases is dead-
locks. Because deadlocks are caused by competition for database
resources, as you have more requests for those same resources, deadlocks
will also increase, even in a well-designed database. The problem is, if
your application isn’t telling you that deadlocks are occurring, how will
you know?
If you start experiencing increased problems with time-outs or with
applications hanging without good reason, especially if the server load has
increased at the same time, you should suspect deadlocks as a likely culprit.
SQL Server gives you the tools you need to verify whether or not they
are the problem. Start with the SQL Server error logs. Deadlocks and trans-
action time-outs are reported to the error log, and you can easily view the
error log through SQL Server Management Studio. Once you know that
deadlocks are occurring, you can use SQL Server Profiler to help you pin-
point the transactions that are the most common culprits.
Your next step from there depends on what you find. If the problem
is poorly designed transactions, then your first step is setting them right.
Use the tips already recommended in this chapter, like keeping the trans-
actions short, making the transaction locks more granular, and accessing
resources in a consistent order. You might also consider lowering the trans-
action isolation level. Most transactions don’t need to be run under the
serializable, or even repeatable read, isolation levels. However, you prob-
ably don’t want to drop to read uncommitted or use an optimistic isola-
tion method. Why not? You already know that transactions are interfering
with each other. Why take a chance on replacing errors that you know are
occurring with the possibility of introducing data errors that are even
harder to detect and resolve?
Figure 10-10
SELF-CHECK
• Describe when SQL Server issues a checkpoint.
• List the transaction isolation levels supported by SQL Server.
• Describe the circumstances under which a deadlock occurs.
SUMMARY
Transactions are a key part of nearly all database applications. During this chap-
ter, you learned about transaction basics, including ACID properties and trans-
action commands. You also learned the differences between implicit and explicit
KEY TERMS
ACID Nested transaction
Atomicity Nonrepeatable read
Autocommit transaction Open transaction
Blocked transaction Optimistic processing
Checkpoint Partial rollback
Commit Partial update
Concurrent transactions Phantom
Consistency Phantom read
Cursor Read committed
Data API Read uncommitted
Deadlock Recovery
Dirty page Repeatable read
Dirty read Rollback
Durability Roll forward
Exclusive lock Serializable
Explicit transaction Serialization
Implicit transaction Shared lock
Inconsistent analysis Snapshot
Isolation Time-out
Kill Timestamp ordering
Lock Transaction isolation
Lock level Transaction isolation level
Lock scope Transaction scheduling
Lost update Uncommitted dependency
Summary Questions
1. Explain the potential impact of the usage changes to blocked transactions.
2. How does having the transactions used by applications access resources in the same order help to prevent deadlocks? What is the possible impact on blocked transactions?
3. Why is concurrent transaction processing important in this environment?
4. What one action can you take to increase the relative isolation between transactions? What are the potential benefits of this, if any? Potential disadvantages?
Concurrent Transactions
A small Midwestern savings and loan provides three basic types of services to its customers—various types of savings accounts, various checking account options, and loans. Most of the internal account functions
1. Why are transactions necessary when transferring money between accounts?
2. You need to ensure maximum transaction isolation during automated loan payments. How can you do this? Under what conditions might automatic payments become blocked?
3. Some of the savings and loan's reporting requirements built into the Web application are summary reports based on the specific moment in time when the report is run. How can you ensure this?
4. The reporting module of the third-party accounting application defaults to the read committed transaction isolation level. Reporting is considered a low-priority function. Some reports must make multiple reads of the source tables to collect the information needed. How can you justify not using a higher isolation level?
11
DATA ACCESS
AND SECURITY
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of data
access and security.
Determine where you need to concentrate your effort.
INTRODUCTION
Data access is critical to any database design. This means ensuring that users can
get to the data they need and that the data is protected against mistakes, malicious
actions, and equipment failures. This chapter begins by introducing data server and
database connectivity concepts and requirements. From there, it moves on to includ-
ing server and database access requirements in your database design and imple-
mentation. Finally, it introduces access permissions and data protection methods.
It is important to note that you cannot guarantee a completely secure data-
base or server. No matter what safeguards you implement, given enough time,
patience, and technical expertise, someone can eventually find a way to com-
promise your database server and data. Because of this, it is important that your
security plans focus not just on protection, but also on detecting and recover-
ing from security breaches.
Figure 11-1
Figure 11-2
(Client/server configuration: PCs running application programs connect to a server hosting the DBMS and its database.)
Figure 11-3
(Web configuration: PCs running browsers connect across the Internet to a company Web server.)
Now that you see what a connection is and how it relates to the client/server
model, we can focus on the components that make up that connection.
Figure 11-4
(PC clients connect across the Internet to a Web server, which in turn connects to a database server and its database.)
With WiFi, the connection path is the radio frequency transmissions between
the computers.
The other part of this hardware connection is the physical network adapter
your computer uses to communicate. For traditional wired networks, this
means a network interface card (NIC), sometimes called a network adapter.
This provides the computer’s physical connection to the network and handles
the low-level communication functions. Wireless networks use a network
adapter that is functionally a wireless transmitter/receiver that supports WiFi
connectivity.
The connection could also be made through a dial-up modem, cable modem,
or other connection device for remote connectivity. A modem is a device that con-
verts digital data to an analog signal for transmission and then back to digital
before passing the data to the computer. Currently, the most commonly used
path for remote connectivity is the Internet. Many companies support connec-
tions over the Internet through a virtual private network (VPN). A VPN
provides a secure, reliable communication path. VPNs can be used over the
Internet, or over a private LAN or WAN. In either implementation, the VPN
acts as a network inside a network, with a protected communication path
deployed through a public pathway. Other than the details of how it is imple-
mented, a VPN is operationally like any other network. Technically, a VPN is a
network built out of “soft” components. It uses special software at each of the
nodes (connection points) and special protocols known as tunneling protocols.
The name of the protocols comes from the fact that the VPN “tunnels” through
the public media pathway, effectively isolating VPN traffic from the rest of the
network traffic.
Think about that last point. It’s more complicated than you might think.
A database might have hundreds, even thousands, of concurrent connections
it must manage and maintain. It has to keep track of each of these, including
the client computer, user, and any data associated with the connection. Even
at the client end there’s a lot happening. Not only does the data have to arrive
Figure 11-5
Sample IM conversation.
at the correct PC, once there, it must be routed to the right application inside
the PC.
A detailed discussion of all of the ins and outs of PC communications would
be a course in itself and is far beyond the scope of this chapter. It’s enough that
you know the basics of what’s going on and what needs to be managed at the PC
and server ends. We do, however, need to talk a little about the high-level soft-
ware components required and the database and server objects you need to design,
create, and manage in order to enable communication with the database server.
FOR EXAMPLE
What about ADO.NET?
You may be familiar with ADO.NET, objects providing the functionality that
lets you connect to a data source and manipulate data in the .NET Frame-
work. If so, you might be wondering where it fits in as part of the access
model we’ve been discussing.
ADO.NET is a collection of data object libraries that let you connect to
various data sources, send requests, and receive and manipulate data. For
example, there are object libraries that contain objects designed for work-
ing directly with operating systems files, such as XML document files. More
closely related to our discussion are the object libraries that let you work
with database servers. These include generic libraries based on ODBC and
OLE DB, as well as product-specific libraries designed for use with prod-
ucts such as Oracle and SQL Server.
When using a .NET Framework language, the question becomes not
what technology to use in your database application, but which data object
library to use. The choices are somewhat language dependent, but the most
commonly used languages (Visual Basic and C#) support the same object
libraries. The differences between the two are in details such as command
syntax, but other than that, the choices are essentially the same.
The choice comes down to picking the object library best suited to your
requirements. When writing a custom application for a specific DBMS, the best
choice is usually to pick a server-specific interface option, like OLE DB for
Oracle or the SQL Client library. This is only possible if the decision has been
made about the database server platform, but realistically, that decision should
be made relatively early in the design process. That’s because several data design
decisions, such as data types, are also somewhat DBMS-specific. Along with
your data object recommendation, you might have other responsibilities in the
application development process. You could, and probably will, be called on to
provide information about the database and connection requirements, such as
the minimum requirements for connecting to the server, the database object
structure, and the availability of custom functions and procedures.
11.1.3 Understanding Multitier Connectivity
Figure 11-6
(PC browsers connect across the Internet to a Web server; the Web server uses CGI or an API to reach middleware, which connects through ODBC to the database server and its database.)
Middleware products on the market include Cold Fusion, Oracle Application Server,
Microsoft Active Server Pages (ASP), ASP.NET, and others.
In order for the application software running on the Web server to con-
nect with software outside of the Web server, there must be agreed upon inter-
faces. Web applications use Common Gateway Interface (CGI) or various
APIs. CGI is a standard way for passing requests and data between a Web
server and an application. These interfaces have software scripts associated with
them that allow them to exchange data between the application and the data-
base server. The connection to the database server could be made directly at
this point, but more likely the connection is made using a standard interface,
such as ODBC.
FOR EXAMPLE
Multitier Application Design
Designing a multitier database application doesn’t necessarily mean designing
a Web application for use on the Internet. Larger companies often find it appro-
priate to design enterprise applications that use a multi-tier design, but the con-
cepts involved are similar to those already discussed for a Web application.
Suppose you have completed the database design and implemented a
preliminary version on a development server. Several applications will use
this database, including the one in this example, a data entry application.
Let’s start with a little background and a few requirements. The database will
be used in a nationwide medical study. Study candidates fill out a ten-page
questionnaire. The answers are entered into a tracking database that will also
track other medical information.
The application you’re designing is a data entry application that will enter
the survey results. Because of security requirements, end-user client comput-
ers cannot connect directly with the database server. The client computer will
do initial data validation, with more advanced checks handled by a server appli-
cation and, where appropriate, table and row constraints used as a final line
of defense. After normalization, you have four tables involved, each holding
part of the data. To understand your responsibilities, look at the application as
having three distinct, but related components. These components are the client
running on end-user computers, a middle component running on an applica-
tion server, and a server component on the database server.
The client component performs initial validation before sending the data
to the middle component and receives results or error information back from
the middle component. You need to provide data requirements such as the
types of data needed by the database and any limits such as field length or
maximum and minimum values.
The middle component accepts data from the client and uses it to pass
commands to the database server. In this example, assume that you’ve cre-
ated user stored procedures on the server to perform the data entry, one pro-
cedure for each of the four tables. The middle component connects to the
database server and calls these procedures, passing the data from the client
as input parameters. It will receive back any response, such as possible errors,
from the database server. As you get closer to the server, your responsibili-
ties increase. You probably need to provide guidelines for connecting to the
database server, connection parameters, the procedure names, and procedure
syntax which is probably limited to the order in which you must pass the
input parameters. You may also need to provide assistance and guidelines
relating to decoding and responding to server errors.
The server component is primarily the application data but also includes
other database objects you create to support the application. The responsibili-
ties at this level fall almost exclusively on you and include providing an appro-
priate data design and database tables. You’ll design, create, and (hopefully)
document the procedures used for data entry. You’ll also configure other
requirements such as primary key constraints, foreign key constraints used to
relate the four tables, and constraints enforcing value limits.
SELF-CHECK
• What does the term mutual authentication refer to?
• When do you need to use SQL Native Client?
• How does a multitier database application design differ from a
client/server database application design?
When you create a server login, you typically specify:
▲ Login name
▲ Password
▲ Default database
▲ Server-level permissions
For each database user, you typically specify:
▲ User name
▲ Associated login
▲ Database access permissions
▲ Object-level permissions
It’s common to have several data users associated with a single login, one
user for each database to which you want to give the login access. Be careful to
create only those logins and users you actually need in order to help minimize
the possibility of unauthorized access.
In a single-tier system, you have one account acting as the security principal
for both server and database access. In this type of system, the account will have
an associated password, server-level permissions, and database-level permissions,
including permissions assigned at the database object level.
Because of the differences you will find between the various manufactur-
ers’ solutions to security requirements, we will focus on SQL Server 2005 as
a representative example of one manufacturer’s choices. Throughout this dis-
cussion, keep in mind that SQL Server security is based on both logins and
users, and that logins and users are distinctly different security objects with
different security management roles.
Figure 11-7
Authentication options.
account name, default database, and server access permissions. User accounts,
passwords, and group membership are managed through Windows and Active
Directory management utilities.
In SQL Server, members of the Windows Administrators group are granted login
privileges during installation and serve as the default administrator accounts. Any additional
login accounts must be configured manually. When you attempt to connect
using Windows authentication, the user is passed to Windows for authentica-
tion and the status (whether or not you are authenticated as a valid user) and
group membership are passed back to SQL Server.
If you choose mixed authentication, you can also create SQL Server logins.
These are managed separately from Windows, with the accounts and passwords
authenticated by the database server. The only SQL Server login account cre-
ated by default is the sa account, which is a system administrator account with
unlimited access permissions to the server and its databases. It should go with-
out saying that you want to protect access to the sa account with a strong pass-
word, one that includes mixed letters, numbers, and characters that would be
difficult to guess.
SQL Server provides command-line commands for login management, as well
as lets you create, manage, and delete logins through the SQL Server Management
Studio Logins folder, shown in Figure 11-8.
Figure 11-8
(The Logins folder in SQL Server Management Studio.)
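As a sketch of the command-line route in SQL Server 2005 (the login, password, database, and user names below are all hypothetical):

CREATE LOGIN WebAppLogin WITH PASSWORD = 'Ch00se-a-Strong1!'

USE SalesDB
CREATE USER WebAppUser FOR LOGIN WebAppLogin   -- the database user tied to the login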
What you can do after logging in depends on the permissions assigned to
your login account. SQL Server primarily assigns permissions through roles,
which are roughly analogous to groups and are assigned different rights and
permissions. Other DBMSs use a similar management concept, but use different
terms to refer to groups of logins. You assign permissions to a login by grant-
ing the login membership in the role containing the permissions you want
that login to have. There is also a limited set of permissions that can be man-
aged explicitly by user and server object. Most DBMSs have similar methods
for managing permissions. As you will see later in this chapter, database per-
missions are handled in a similar manner, but with users instead of logins as
the security principal.
Figure 11-9
You can also manage permissions through explicit assignments to individual users,
but this is discouraged because of the added management overhead involved in
handling permissions at that level. Permission management is discussed in more
detail later in this chapter.
Figure 11-10
Notice that when connecting as a trusted connection you don’t need a pass-
word. SQL Server 2005 also supports other commands that open a command-
line interface, but Sqlcmd is the only one that supports SQL Server 2005-specific
features. All of these commands have the same potential shortcoming though,
in that passwords are passed in clear text. That makes it easy for someone to
intercept them or, if run from a batch, read the passwords from the batch file.
The process for establishing the connection is the same as discussed earlier for
the connection dialog box.
The same connection information is needed when using a data API to connect
to a database server from an application. For example, Microsoft Visual Studio
Figure 11-12
Activity Monitor.
FOR EXAMPLE
Implementing SQL Server Authentication
Microsoft recommends using Windows authentication with SQL Server
whenever possible. From an administration standpoint, it’s typically easier
to set up and manage. From a database administrator’s standpoint, it means
you get to pass part of the management duties off to a network adminis-
trator. All this brings us to the question, why ever use SQL Server authen-
tication? The answer, quite simply, is that it depends.
Like so many other design decisions, you have to consider your network
environment and application requirements as a whole when deciding on an
authentication method. If all of the database users are part of the same net-
work, then Windows authentication probably is the only authentication method
that you need. However, there are other possibilities that must be considered.
The biggest advantage to Windows authentication is also its biggest
weakness, that users are authenticated based on the currently logged on user.
This assumes that the client computer is running some version of Windows
and that it is part of your network. These aren’t always good assumptions.
Many established networks, and even some newer networks, are het-
erogeneous network environments supporting a mix of operating systems.
That means you have computers running operating systems other than Win-
dows and possibly not fully compatible with a Windows networking envi-
ronment. This is especially true if you are migrating from a legacy database
to SQL Server. Those unsupported clients may still need access to the data
hosted on the SQL Server database, which often means providing access
through SQL Server authentication.
In a world where electronic communication is often the key to a suc-
cessful business and where an immeasurable amount of buying and selling
occurs across the Internet, some clients who need access to your data aren’t
necessarily members of your network, nor do you want them to be, even if
they’re accessing your servers. A concern in this interconnected environment
is the potential theft or destruction of data. Companies spend a lot of time,
effort, and money isolating their networks from the Internet. Though some
exposure is often necessary, the less exposure, the better.
Consider a situation where you’re deploying Web servers running an e-
commerce application so you can sell on the Internet. The Web servers need
access to the data stored on your network database servers, but you don’t
want to expose anything more about your internal network than absolutely
necessary. Because of this, the Web servers are not members of your Active
Directory domain, but they still need access to the data. In many cases, the
easiest way to configure that access is to configure the database servers to
support SQL Server authentication.
One final point on that last example: you will rarely, if ever, want to directly
expose your database servers (and data) to the Internet. That means burying
the servers, along with the rest of the network, behind a firewall, and maybe
even behind multiple layers of firewalls. Then, you configure the firewalls to
support the minimum possible amount of traffic into and out of the network.
One trick that some network and database administrators use is to configure
SQL Server, which is inside the firewall, and its clients outside of the firewall
so that they use nonstandard configuration settings. This makes it harder for
someone to slip in through the hole you’ve opened for that purpose.
SELF-CHECK
• Are the terms login and user interchangeable in a multitier access
configuration?
• When using SQL Server configured for Windows authentication,
how do you create and manage users and passwords?
• Are database users managed separately for each database?
• When connecting to a database server using a data API, what does
the connection string specify?
• Does SQL Server authentication give you a way of configuring
connectivity for non-Windows client computers?
• Can you monitor active connections to a SQL Server database
through SQL Server Management Studio?
Instead, we’ll discuss a few key topics relating to data. We’re focusing on two
areas: securing data against unauthorized access and protecting data against loss.
You’ve already learned something about access security and a little about
permission management, at least management through roles on SQL Server.
Why is this important when learning about databases and database design?
You might think that security is a more advanced topic, as aspects of it are, but
security concerns must be considered during database design and implementa-
tion. Basic security needs directly impact database design and database design
impacts how you can meet security needs.
One reason for concern about data security is simple: it’s your data, and you
want to protect it. Data loss or corruption could interfere with business opera-
tions. Errors resulting from corrupted data and the time and effort required to
correct the data result in real-world costs to your business. Less obvious, but just
as damaging, is the possibility that you could lose credibility with your customers.
There are also potential legal concerns about data security. If your database
contains sensitive data, such as personal identifying information (PII), credit card
numbers, medical records, and so forth, you may be required by law to protect
that data. Accidental or malicious disclosure of the data could result in fines and
other penalties. You could also be found responsible for collateral problems. For
example, if data stolen from your database is used for identity theft, you could
be held at least partially responsible for the results of that theft if it is found that
you didn’t do everything you could to protect the data.
Your legal responsibilities relating to security vary with your geographic loca-
tion, the industry in which your business operates, and the type of data involved.
If you do maintain sensitive data in your database, it is well worth investigating
the legal requirements to protect that data, and the actions you are required to
take by law if you discover that security has been compromised, such as noti-
fying the persons to whom the data relates.
Objects over which you can control permissions typically include:
▲ Tables
▲ Views
▲ Functions
▲ Stored procedures
▲ Users
▲ Roles
The complete list of securables, the server and database objects over which
you have control of permissions, depends on the DBMS and the types of data-
base objects it supports. The types of permissions depend on the type of object.
For example, at the database level, you can control whether or not a user can
manage database properties or create, modify, and delete database objects. With
tables, you can determine if a user is limited to retrieving data from a table or
if the user can also insert and modify table data.
The detail of control varies by DBMS. The requirements vary by database
and by user. Even with all of these possibilities, there are some general guide-
lines about security that you should keep in mind in all instances, including
assigning permissions through roles, minimizing permissions, and auditing
changes, each discussed below.
Figure 11-13
Sample roles. (Database users such as Sam, Alice, George, Ed, Marta, Jack, and Andy are assigned as members of roles; Marta appears in more than one role.)
Notice that Marta is a member of the Sales and SalesManagers roles. The
permissions granted through roles are additive, so that means she can view and
modify CUSTOMER, ORDERHEAD, ORDERDETAIL, and SALES. She would
have the same permission if she belonged to SalesManagers only.
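A sketch of how assignments like these might be made in SQL Server 2005 follows; the role and table names are hypothetical, patterned on Figure 11-13, and Marta is assumed to already exist as a database user:

CREATE ROLE Sales
CREATE ROLE SalesManagers

GRANT SELECT ON dbo.CUSTOMER TO Sales                 -- view only
GRANT SELECT, UPDATE ON dbo.SALES TO SalesManagers    -- view and modify

EXEC sp_addrolemember 'Sales', 'Marta'
EXEC sp_addrolemember 'SalesManagers', 'Marta'        -- permissions are additive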
Not only can you allow access, you can block access by explicitly denying
permissions to a user or role. This overrides any permissions allowed. The use
of denied permissions should be limited because they can make it difficult to
identify and correct access problems. When possible, simply do not allow access
as a way of preventing access to a resource.
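When you do need one, a denied permission is expressed much like a grant; the role name here is hypothetical:

CREATE ROLE TempContractors
DENY SELECT ON dbo.SALES TO TempContractors   -- overrides any SELECT granted elsewhere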
When setting object permissions, you can set the permissions by individual
object or by object type. Whenever possible, keep assignments as general as pos-
sible. For example, if a role needs to be able to view the contents of all tables,
assign the permission through the type (table) rather than individually for each
table. SQL Server 2005 also organizes objects as belonging to an organizational
schema. Each object belongs to one and only one schema. However, a schema
can contain any number of objects. You have the option of assigning permissions
at the schema level and having the permissions apply to all objects contained in
that schema.
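A schema-level grant in SQL Server 2005 might look like this, with the schema and role names hypothetical:

CREATE ROLE ReportReaders
GRANT SELECT ON SCHEMA::Sales TO ReportReaders   -- applies to every object in the schema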
Minimizing Permissions
This guideline’s relatively easy to understand. Don’t give users a greater level of
permission than they need to do their jobs. Users who need to only view data
should be allowed to run SELECT statements, and only SELECT statements,
against data sources. Data entry personnel will need to be able to insert, and
possibly modify, as well as view data.
You need to identify database users and their access permission requirements
during the design phase. Use this to make permission assignments during imple-
mentation. If you don’t give the users, or applications, the permission they need
you will find out almost immediately. If you give them more permission than
they need, it’s likely that someone will eventually figure that out. You, however,
might not find out until after the user has done damage to the data.
Auditing Changes
It’s a good idea to log changes made to users and permissions through auditing.
This gives you a trail to follow should you have problems in the future. It also gives
you a way of finding out if someone is granting permissions without authorization.
Consider this possibility. A worker in the warehouse slips items out the back
without anyone knowing and sells them. In order to hide this activity, the worker
needs to be able to directly change inventory quantities so the thefts are less
likely to be noticed during a physical inventory. The worker finds an administrator
willing to work with him or her for a percentage of the profits. The administrator,
looking to cover his or her own tracks, gives the worker permission to make the
changes. The administrator is trying to keep as far out of the direct loop as pos-
sible. If you are auditing permission changes, you’re more likely to discover the
scheme through the permission change, or at least discover it more quickly, than
through the unexplained changes in inventory quantities.
11.3.4 Understanding RAID Configurations
Figure 11-14
Disk mirroring. (The same data is written to both Disk 0 and Disk 1.)
Hardware RAID is more expensive than configuring RAID through the oper-
ating system, but can provide better disk read and write performance. Hard-
ware RAID systems, depending on their design, can either connect as a local
hard disk or can be accessed as network storage through a high-speed, high-
bandwidth network connection.
Disk Mirroring
Figure 11-14 shows two hard disks configured for disk mirroring. Disk 0 and
Disk 1 contain the same data. During each write, the same data is written to
Disk 0 and Disk 1. If either hard disk fails, you still have a full set of the data
on the other hard disk. Write performance can suffer slightly because of the
duplicate writes, but read performance improves because the configuration sup-
ports split reads, reading from either hard disk.
A variation of disk mirroring is disk duplexing, which uses hard disks con-
nected to different disk controllers. In this configuration, you are protected
against failure of either hard disk or disk controller.
Figure 11-15
(Data is striped, along with parity information, across Disk 0, Disk 1, and Disk 2.)
Figure 11-15 shows disk striping with parity, which can be implemented in hardware or through the operating system. In this configuration, data is spread across the hard disks along with parity
information. Because read and write operations split across multiple drives, you
will typically see a performance improvement in both, though some of this
improvement is lost in software configurations because of the overhead required
to generate parity information. This isn’t a problem with hardware RAID systems.
If any one hard disk is lost, the parity information is used to regenerate the
missing data. If more than one hard disk is lost, then the data is lost. In a soft-
ware RAID configuration, read performance falls off sharply when one of the
hard disks fails. Hardware-based systems don’t suffer from this performance loss
because the hardware is optimized for data recovery.
Figure 11-16
(Client computers connect to a primary server; updates are passed from the primary server to a backup server.)
Some configurations include support for duplicate live servers so that operations can be split between
the servers, improving performance.
Why not use server mirroring all the time? After all, it protects against a wider
range of potential problems. There are two potential problems with server mirror-
ing. The first is that not all DBMSs support a backup server or failover option. The
other is that these solutions are relatively expensive, requiring duplicate hardware
and, in many configurations, multiple software licenses (one for each server).
If the solutions are so expensive, what’s the justification for using them?
Backup servers, especially configurations that provide for automatic failover, give
you a way of designing a database system with a guarantee of near-zero down
time. Imagine, for a moment, a database system that supports a stock broker-
age. Clients need the ability to post trades and have them recorded and executed
immediately any time, day or night. Any outage, no matter how brief, means
delayed or lost trades. Not only does this mean lost revenue, it would probably
mean losing any customers that were affected. If word gets out, it could mean
losing even more customers, and possibly your business.
backup takes longer to run, but after restoring the base backup, you only need
to restore from the most recent differential backup.
DBMSs that use a transaction log for tracking changes to the database support
transaction log backups. When you back up the transaction log, you back up the
inactive portion of the log only. That means that you back up only those transac-
tions that have been completed, either committed or rolled back. The inactive por-
tion is then deleted, freeing up the space and helping to keep the transaction log
a manageable size. Because of their small size and because they usually run quickly
and with minimal interference to user operations, transaction log backups are typ-
ically used to support backup plans that require frequent (such as hourly) back-
ups. During recovery, you first restore the base data and any change (incremental
and differential) backups. Then, you restore all transaction log backups taken since that point, in the order in which they were taken.
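On a DBMS that provides them, such as SQL Server, a transaction log backup is a single statement. The following is only a sketch; the database name and file path are invented for illustration.

-- Back up the inactive portion of the transaction log; the inactive portion
-- is then removed from the log, keeping the log file a manageable size.
BACKUP LOG GeneralHardware
   TO DISK = 'E:\Backups\GeneralHardware_Log.bak';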
Usually, if your DBMS supports multiple backup options, it also lets you mix
and match those options to meet your needs. Let’s say that General Hardware
Supply, a fictitious company, determines that they can run backups after hours
only. This means running backups between 9:00 p.m. and 6:00 a.m. Monday
through Saturday and any time on Sunday. Because there are other automated
operations occurring at night, they want the backups run during the week to
run as quickly as possible. They also want a full backup to run at least once a
week. You could meet this requirement by configuring the backups to run auto-
matically, scheduling a full backup to run on Sunday and incremental backups
to run each of the remaining evenings.
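As a sketch of what those two scheduled jobs might execute, the statements below use SQL Server syntax; note that SQL Server implements change backups as differential backups rather than true incremental backups, and the database name and file paths here are hypothetical.

-- Sunday: full backup of the database.
BACKUP DATABASE GeneralHardware
   TO DISK = 'E:\Backups\GeneralHardware_Full.bak'
   WITH INIT;

-- Monday through Saturday evenings: back up only the changes
-- made since the last full backup.
BACKUP DATABASE GeneralHardware
   TO DISK = 'E:\Backups\GeneralHardware_Diff.bak'
   WITH DIFFERENTIAL;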
Looking back at the brokerage example again, you would run infrequent full
backups, no more than once a week and possibly once a month or less. If you
were to cut full backups back to once a month, you would probably want to
run differential or incremental backups at least once a week, and more often if
your schedule and server activity allows. For the very frequent backups, such as
those that run as often as every minute, you would probably use transaction log
backups. Where this solution becomes a problem is in recovery after a failure.
You could find yourself having to recover from the full backup, the most recent
differential backup, and potentially thousands of transaction log backups. This
is one situation where you might strongly consider the added expense of a
duplicate server as an appropriate investment.
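The recovery sequence just described might look something like the following SQL Server sketch, again with hypothetical names. Every restore except the last one uses NORECOVERY so that additional backups can still be applied.

-- 1. Restore the most recent full backup.
RESTORE DATABASE Brokerage
   FROM DISK = 'E:\Backups\Brokerage_Full.bak' WITH NORECOVERY;

-- 2. Restore the most recent differential backup.
RESTORE DATABASE Brokerage
   FROM DISK = 'E:\Backups\Brokerage_Diff.bak' WITH NORECOVERY;

-- 3. Restore each transaction log backup taken since then, in order,
--    recovering the database with the last one.
RESTORE LOG Brokerage
   FROM DISK = 'E:\Backups\Brokerage_Log_0900.bak' WITH NORECOVERY;
RESTORE LOG Brokerage
   FROM DISK = 'E:\Backups\Brokerage_Log_0915.bak' WITH RECOVERY;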
Figure 11-17
Backup data written to a network destination.
One option for a backup destination is a file location on another server on the network, as shown in Figure 11-17.
One of the biggest advantages of this method is its ease of use. Backups can
run fully automated, as long as the network remains available and the destina-
tion server doesn’t run out of space. Backups can run very fast, depending on
your network bandwidth. It also leaves the data readily available for recovery
should there be a problem. Some companies, after moving most of their pro-
duction data to PC-based platforms, now use some of the extra storage space on
their mainframes or minicomputers as a backup destination.
The disadvantage is that the destination computer is a potential point of fail-
ure. Some events, such as fire or flood, could take out both the database server and
the backup file server unless they are kept well separated in different locations.
There is also a potential security concern. You need to ensure that the destination
server is protected against unauthorized access.
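In practice, backing up to a network server can be as simple as pointing the backup at a shared folder. The SQL Server statement below is a sketch only; the server and share names are invented, and the account running the backup needs write permission on the share.

-- Write the backup across the network to a file server share.
BACKUP DATABASE GeneralHardware
   TO DISK = '\\BackupSrv\SQLBackups\GeneralHardware_Full.bak';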
The other option, which used to be more common, is backing up to removable
media, storage media that can be physically removed from the computer. Most
often, this means backing up to some kind of magnetic tape, though optical
media such as DVD are sometimes used. One problem with this method
is that the media capacity is often smaller than the size of the database,
meaning that you need a backup device with a data carousel, or someone has
to physically change the media. This is also why media selections are usually
limited to magnetic tape or DVD. The storage capacity of diskettes and even
writeable CDs is too small for them to be considered for use except in some very
limited situations requiring selected backup of a subset of the database data. A
data carousel is a backup device with multiple live media bays that can, for
example, write to several tapes at once or change the media for you automati-
cally. Another concern is that magnetic media has a limited lifespan and must
be replaced after a set number of uses. Magnetic tapes are also susceptible to
magnetic fields that can degrade or even erase the data.
The biggest advantage of using removable media is that it makes it easy to keep
a copy off-site and protected. Companies that use removable media often have mul-
tiple sets, rotating one set off-site on a regular schedule, such as once a week.
You can also combine methods. You might have some data backed up to a
network server and other data backed up to removable media. Another option
is to run most backups to a network location, but periodically back up to remov-
able media to create a copy for off-site storage. Some companies combine the
methods by running all backups to a network server and periodically backing
up the network server to removable media. Because these backups don’t impact
the database in any way, they can be run at a convenient time without worry-
ing about the impact on performance.
Taking a final look at our brokerage example, it’s likely that you would run
all of the backups to a network location because this would be the easiest config-
uration to fully automate. With as many backups as are required, automation
would be a critical concern. Then, as added security, you would back up the server
used as the backup destination on a regular basis. A daily backup schedule would
probably be appropriate, backing up all of the database backup files each time.
Another common protection against power problems is an uninterruptible power supply (UPS). A UPS has an internal battery. AC power is used to charge
the battery, and electricity from the battery is then converted back to AC to
power the computer. This isolates the computer from the incoming AC power,
protecting it from power surges and sags (high and low power events). It also
powers the computer in case AC power is lost. Most UPS systems are designed
to work with the operating system to initiate a proper shutdown before its bat-
tery completely discharges, helping prevent data loss or corruption.
FOR EXAMPLE
The Need for Backups
The importance of backups cannot be overemphasized. Good, reliable back-
ups can mean the difference between an annoying interruption and a cata-
strophic and unrecoverable loss. Backups are your best, last line of defense
against problems that threaten your data.
When making your design decisions, keep in mind that recovery after a
hardware or software failure isn’t the only possible use of database backups.
They may also be needed to recover from other possible problems. Consider
this situation. Everyone is gone and you’re working late, cleaning up some
problems that resulted from using some inexperienced temporary employees
to do data entry. You know what you’re doing, so you work directly from a
command line in a query window. A few minutes into your cleanup, you acci-
dentally delete the ORDERS table, six months’ worth of customer orders that
include all of the open orders waiting to be filled and shipped. Rather than
writing up your letter of resignation and leaving quietly through a side door,
you can recover the lost table, and just the lost table, from backups.
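How you recover just the one table depends on your DBMS. Many products, SQL Server among them, do not restore individual tables directly, so a common approach is to restore the backup under a different database name and copy the table back. The sketch below assumes a hypothetical Sales database and made-up logical file names; it is illustrative only.

-- Restore the most recent backup as a separate recovery database.
RESTORE DATABASE SalesRecovery
   FROM DISK = 'E:\Backups\Sales_Full.bak'
   WITH MOVE 'Sales_Data' TO 'E:\Recover\SalesRecovery.mdf',
        MOVE 'Sales_Log'  TO 'E:\Recover\SalesRecovery.ldf',
        RECOVERY;

-- Copy the lost table back into the production database.
SELECT *
   INTO Sales.dbo.ORDERS
   FROM SalesRecovery.dbo.ORDERS;

Keep in mind that SELECT INTO recreates the data but not the table's constraints or indexes, which would have to be re-created separately.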
SELF-CHECK
• How many hard disks does a disk mirroring configuration require?
• What guideline should you use when assigning user permissions in
SQL Server 2005?
• What is the preferred method for limiting user access to database
tables?
• You want to minimize the time required to run daily backups and
have backups run completely unattended. Which is the best solution?
• What is a benefit of running backups to removable media?
SUMMARY
This chapter looked at concepts and activities relating to data access and data
access security. The chapter first looked at server connectivity and designing to
meet connectivity requirements. It then looked at security principals, focusing on
a two-tier security system with separate server-level and database-level principals,
including their role in the connection process. Finally, the chapter discussed man-
aging data access through access security and methods for protecting against and
recovering from data loss.
KEY TERMS
Active Directory
ADO.NET
Authentication
Automatic failover
Common gateway interface (CGI)
Connection
Connection path
Data carousel
Differential backup
Disk duplexing
Disk mirroring
Disk striping with parity
Domain
Fault-tolerant storage
Full backup
Guest
Hypertext Transfer Protocol (HTTP)
Instant messaging (IM)
Inactive portion
Incremental backup
Integrated security
Login
Login credentials
Manual failover
Middleware
Mixed authentication
Modem
Mutual authentication
Open Database Connectivity (ODBC)
OLE DB
Packet size
Protocol
Public
Removable media
Role
Sa account
Securable
Security context
Security principal
Split read
SQL Native Client
SQL Server authentication
Strong password
Transmission Control Protocol/Internet Protocol (TCP/IP)
Transaction log backup
Trusted connection
User
Virtual private network (VPN)
WiFi
Windows authentication
Summary Questions
1. You are designing a client/server database application that will run on
your local network. The network is a heterogeneous environment that
includes Windows and non-Windows client computers that will require
direct access to the database server. The network is not configured as
an Active Directory domain. How should you configure authentication?
(a) Windows authentication only
(b) SQL Server authentication only
(c) mixed Windows and SQL Server authentication
(d) none of the above
2. Your company’s programmers are developing a point-of-sale application.
The customer will have the option of choosing from various data platform
options. What is the most appropriate data interface to use in this situation?
(a) ODBC
(b) SQL Native Client
(c) OLE DB for SQL Server
(d) you don’t have enough information to choose
3. During database design and implementation, the database administrator
is directly responsible for which of the following?
(a) client and server communication path
(b) server security principals
(c) application data interface API
(d) all of the above
4. In a multitier application environment, middleware refers to the special-
ized software used to integrate the various application components. True
or False?
5. You deploy a database server running SQL Server 2005. Which of the
following client configurations would require SQL Native Client?
(a) a multitier application client that needs to access a web application
through a PC browser
(b) a client/server application client that needs to access the server
through a local Windows application that performs data retrieval and
modification
(c) a client computer used to update data after hours that runs scripts that
call user stored procedures to modify data through a Sqlcmd interface
(d) a client computer that is used by the database administrator to
remotely manage the server
6. You are configuring logins on a database server configured for Windows
authentication. What information do you provide?
(a) account name
(b) password
(c) default database
(d) both a and b
(e) both a and c
7. When using mixed authentication in SQL Server, after creating a SQL
Server login, what must you do to enable access to databases hosted
on the server other than the login’s default database?
(a) configure login access permissions
(b) configure database access permissions
(c) add the login to the Public role
(d) create database users associated with the login
8. What is the biggest potential security concern when using Sqlcmd to
connect to a database with a SQL Server login?
(a) The login name might be compromised.
(b) The password might be compromised.
(c) The server name might be compromised.
(d) The initial database name might be compromised.
9. You run the following to open a connection from a Windows client
computer:
sqlcmd -SDataSrc -E
(a) Who has responsibility for maintaining the connection path and
physical connection components?
(b) Describe the software requirements on the client side for connectivity.
(c) Explain how security principals for connecting to the database server
should be configured and managed. Include specific requirements
and responsible administrative roles.
(d) Describe the role of database users in this configuration.
(e) You want to keep the configuration for data backup and recovery as
simple as possible and allow for quick recovery after a failure. Full
backups can be run no more than once a week. Periodic backups
should run no more than twice a day. Describe and justify the
configuration including backup types and destinations.
2. You are designing a multi-tier database application that provides its user
interface through a Web application. The Web server connects to the
database server through a separate application server. Clients connect to
the Web server using any current Web browser. The Web server allows
for public access. The SQL Server database server used as the data source
is configured for Windows authentication. The network on which the
database server is deployed is configured as an Active Directory domain
with the database server configured as a domain member.
(a) Except for the database server, identify which computers, if any, must
be domain members. For each computer type, explain why or why not.
(b) Briefly describe how the application server would connect to the
database server.
(c) Summarize the database access permission requirements to support
the application.
3. A database server is deployed in a highly secure environment. Data
access must be tightly controlled. Data must be protected against loss
or accidental disclosure through unauthorized access.
(a) Explain at least two ways your design could allow for off-site storage
of backups.
(b) Describe hard disk configuration options that would meet the data
security requirements.
(c) Assuming that the database server uses a two-tier security system,
briefly compare and contrast the roles of the different security
principals.
YOU TRY IT
Planning a Client/Server Data Environment
You are planning the database server for a client/server database solution. The solution includes two production databases and one decision support database, all running on the same database server host. The databases will support multiple client applications, each running on the appropriate client machine, as well as direct access through scripts, procedures, and ad hoc queries run from command line interfaces. The database server and all clients are members of an Active Directory domain. Server and database access must be managed separately.
The design should be as easy to maintain as possible, with minimal database administrator requirements for managing access and security. Data access requirements are determined by users' job requirements. Users often make lateral shifts, changing jobs within the company based on seasonal requirements.
The design must minimize downtime and ensure that you have the server back up and running as soon as possible after a failure. The wording in the requirements document calls for "near zero downtime" for the database server. A copy of all data must be physically stored in a secure location, specifically a safe deposit box leased for that purpose, with the copy refreshed on a weekly basis. At any time, there must be at least one copy at that location.
In planning the client connection requirements, you need to identify the features that the DBMS must support and design the server side of the access requirements. Be as specific as possible.
1. Plan the data access requirements and how they will be met, including the server features necessary to meet the specifications.
2. Develop a plan for managing database access security, based on user access requirements, that minimizes management overhead. Database security can allow for direct access of database tables.
3. Develop a plan that minimizes downtime, and identify the features required for DBMS support. Explain how the backup storage requirements will be met.

Planning a Secure Multitier Environment
You are designing the database solution portion of a multitier application. The solution must support both direct client access through the local network and indirect access through an application server. There will be two management clients that need server access only. The application server requires encrypted communications with both the database server and its clients. You need to know who has been connected to the database server, and when, at any time.
Backups must be designed to minimize the amount of time required to back up data during the day for backups run between 7:00 a.m. and 6:00 p.m., when the last of these periodic backups are run. An additional backup should run each night at midnight and a full backup each weekend. All backups must run without any operator interaction. The backup system must be set up so that no more than 30 minutes' worth of work is at risk at any time, and it must be based around full, differential, and transaction log backups. Recovery should not require transaction log backups taken over multiple days. Backups should be immediately available for recovery.
1. Develop a plan to meet the access requirements. Identify the security principals involved for direct access to data and the software requirements to facilitate connectivity. Compare and contrast how the data access design will vary depending on whether the database server supports a one-tier or two-tier security system.
2. Design a backup system that meets all of the specified requirements. Describe what backups will run, when, and to what destination(s). Compare design requirements when using permanent or removable storage media.
12
SUPPORTING DATABASE
APPLICATIONS
Starting Point
Go to www.wiley.com/college/gillenson to assess your knowledge of data
application support.
Determine where you need to concentrate your effort.
INTRODUCTION
One of the strengths of PC-based database management systems (DBMSs) is
their flexibility. You can deploy them in various configurations, supporting all
sorts of applications, as part of a small network or to be published on the Inter-
net. This chapter looks at the three most common database configurations for
application support. We start with a look at centralized database configurations
and variations on the traditional client/server model. Then, we’ll look at distrib-
uted database configurations that you might use to support enterprise applica-
tions. We finish with a quick look at issues related to supporting Internet-based
applications.
A typical local area network (LAN) connects PCs and shared devices such as printers and lets users share data files. Another component that you will commonly see is a database server.
Figure 12-1
Two-tier data (a LAN with several PCs, a printer, and a database server; the database query is issued from one of the PCs).
How these multiple data sources are handled depends on your DBMS and
client application. Some DBMSs support a flexible data dictionary that can make
the process transparent to the database application. With other DBMSs, includ-
ing SQL Server, the DBMS tracks data hosted in its own databases only. Multi-
tiered data environments like those discussed here are treated as distributed
data environments because of the heterogeneous data sources, even though all
data sources are physically located on the local network.
As is often the case, you must be careful how you use terms relating to data-
base applications and make sure that you understand the context in which they
are used. When discussing a three-tier configuration, you need to specify whether
you are talking about the data locations or the database application configuration.
The same term, three-tier, is sometimes used when discussing a multitiered appli-
cation, referring to the database server, application servers, and client computers
as the three application tiers. The middle tier could involve multiple computers,
such as an application server and a Web server. You can easily have a single-tiered
approach to data storage, with all of the data stored on the database server, sup-
porting a multi-tier (or three-tier) database application.
Figure 12-2
Three-tier data (a LAN with PCs and a printer connected through a database server acting as a gateway to a mainframe computer; the database query is issued from one of the PCs).
These can be different instances of the same DBMS, or could even be different
DBMSs, such as having SQL Server and MySQL running on the same computer.
Each instance, though running on the same hardware, is treated as a completely
separate database server. Each is managed separately, with separate configura-
tions, and using computer resources separately.
Using separate server instances should almost always be limited to a devel-
opment environment as a way of minimizing hardware when testing different
database servers or different server configuration parameters. However, there are
rare instances when you might use this configuration in a production environ-
ment. For example, you might need multiple database servers with different
security configurations, but the resource requirements are such that they can
share a hardware platform without interfering with each other. As with running
different server applications, this configuration needs to be carefully monitored.
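With SQL Server, for example, each additional instance on a computer is installed as a named instance, and clients address each one by name. The server and instance names below are hypothetical.

sqlcmd -S DevServer -E
sqlcmd -S DevServer\TestInstance -E

The first command connects to the default instance; the second connects to a separately managed named instance running on the same hardware.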
FOR EXAMPLE
Data in Transition
You might wonder about the justification for two- and three-tier data when
a centralized environment with a single database and a single server is obvi-
ously easier to manage and maintain. The decision to use a multitiered data
environment is usually one made more out of necessity than out of choice.
It’s easy to think of the transition from PCs hosting their own data, data
stashed away in different applications such as spreadsheets, or from a tra-
ditional mainframe, to a PC-based DBMS as something almost surgical in
nature. One day, you’re operating with all of these different sources and man-
ually compiling the results, and the next, everything is sitting in the data-
base server ready for use. The reality is often very different. The transition
is often more of a gradual migration than a sudden switch.
There are several issues that must be considered and real-world factors
that must be included when migrating to a new data environment. Even
though you are moving data from PCs to the database, the data is contin-
ually changing. It may be necessary to run different systems in parallel until
the changeover is complete. The transition could be made in phases,
putting the most critical data or data needed by the largest number of users
onto the database first and then gradually moving the rest of the data
as time and resources allow.
There are special issues that can arise when dealing with an existing
mainframe or minicomputer. A company’s options are often limited by long-
term lease or service agreements. Maybe when you acquired another com-
pany and its assets, the mainframe contract came along as an unavoidable
part of the deal. You might not need the mainframe as your primary data
source after you’ve extracted the data from it, but you have to keep paying
for it, so you might as well get some use out of it. In other cases there are
practical reasons why you can’t move the data to a PC-based database server.
The transition might be too expensive to make it cost effective, or the
mainframe data is a required source for a proprietary or third-party appli-
cation that runs on the mainframe only. It might be that communication
with customers or clients is mainframe-based and it is either too difficult
to move to a different method or there are licensing, patent, or copyright
issues that prevent it.
The reason really doesn’t matter. The bottom line is that there are often
valid, unavoidable justifications for maintaining data in a two-tier or three-tier
configuration, even if just temporarily “during the migration.” It’s important
to understand, however, that “during the migration” has become a permanent
way of life for some companies.
SELF-CHECK
• Can a traditional centralized data environment have multiple data-
base servers as long as they are all deployed on the same LAN?
• What does a three-tier data environment include?
• What is a data warehouse used to support?
• What is a multi-purposed server?
▲ Reduce costs.
▲ Make the data universally available.
Figure 12-3 (figure: a world map showing company sites in Los Angeles, Memphis, New York, Paris, and Tokyo, with all six tables, A through F, located in New York).
Figure 12-4
Distributed databases (the same map with Tables A and B in New York, Table C in Memphis, Tables D and E in Tokyo, and Table F in Paris).
A clear understanding of your needs will help you balance the costs and drawbacks against the
benefits. Sometimes the best solution is to keep your centralized design and
improve your communication infrastructure rather than changing your data
design.
The simplest solution, in most cases, is to locate the data near its primary con-
sumer. In our example, that means deploying databases and placing the tables
physically near the offices that most frequently use them. Figure 12-4 shows how
you might do this: Tables A and B are located in New York, while Table C is moved
to Memphis, Tables D and E to Tokyo, and Table F to Paris. Let’s take a look at
what this means from a support and management standpoint. With Table F in
Paris, for example, the people there can use it as much as they want without run-
ning up any telecommunications costs. Furthermore, the Paris employees can exer-
cise local autonomy over the data, taking responsibility for its security, backup
and recovery, and concurrency control.
Is this a workable solution? Possibly, but you don’t know enough about data
needs yet to know definitely. Does this solution have possible drawbacks? Defi-
nitely, including some drawbacks that carry over from the centralized database
and some new ones introduced by the distributed environment.
The main problem that is carried over from the centralized approach is avail-
ability. If New York goes down in the original configuration, Table F (as well as
the rest of them) is unavailable. In our new design, if the Paris site goes down,
Table F is equally unavailable to the other sites.
One new problem relates to data access. What if an office issues a query that
joins data from multiple tables? In the dispersed approach, a join might require
tables located at different sites! Though not an insurmountable problem, this
would obviously add some major complexity. Another possible issue is security.
Although you can make the argument that local autonomy is good for issues like
security control, an argument can also be made that security for the overall data-
base can better be handled at a single, central location.
Moving the data to be near its primary consumer is one possible solution,
but it is far from the only solution, nor is it often the best one. It's doubt-
ful that any one location will need access to only one database table. How-
ever, you do have several different configuration options available, though
they may be limited somewhat by the functionality and features provided by
your DBMS.
Another possibility is to keep a full copy of the database at every loca-
tion, as shown in Figure 12-5. As you can see, this involves maintaining a
database with all six tables physically located in each office location. An obvi-
ous advantage to this configuration is availability. If a table is replicated at
two or more sites and one of those sites goes down, everyone everywhere else
on the network can still access the table at the other site(s). Telecommunica-
tion requirements to support client access are minimized. Joins are run using
local tables rather than distributed copies, making them easier to manage as
well. Before you create this distribution environment, however, you need to
make a couple of key decisions. You need to determine which copies of the
database will support data updates and how those updates will get distrib-
uted to the other copies. You also need to consider possible problems inher-
ent in this design.
One of the biggest potential concerns with a distributed data design that
duplicates data between locations is disk space. You need sufficient disk space
at each location to support the database copies. Security is another possible con-
cern. If a table is replicated at several sites, it becomes more of a security risk
by the mere fact that there are more copies in more locations and more oppor-
tunities for a dedicated hacker or data thief. Security management overhead,
though possibly (but not necessarily) distributed, increases because there are
more servers and databases to secure.
Figure 12-5 (figure: the same map with a full copy of all six tables, A through F, at each of the five sites).
Possibly the biggest problem that data replication introduces is that of data
updates and concurrency control. If you allow updates to only one master copy
of the database that contains a consolidated data set, say the copy in New York,
you cut down significantly on the possibility of concurrency errors, but you lose
any benefits relating to telecommunication and improved access any time you
need to make a data update. If you allow updates to all of the databases, the
possibility of concurrency errors increases significantly. With either approach, you have the
issue of getting the updates made to all of the database copies.
Figure 12-6 (figure: the same map with a consolidated copy of all six tables in New York and selected individual tables replicated at the other sites).
Figure 12-7 (figure: the same map with two copies each of Tables A, B, E, and F, three copies of Table D, and a single copy of Table C in Los Angeles).
The principle behind making this concept work is flexibility in placing repli-
cated tables where they will do the most good. You want to
▲ Place copies of tables at the sites that use the tables most heavily in
order to minimize telecommunications costs.
▲ Ensure that there are at least two copies of important or frequently used
tables to realize the gains in availability.
▲ Limit the number of copies of any one table to control the security and
concurrency issues.
▲ Avoid any one site becoming a bottleneck.
One drawback of such an arrangement is that table copies are scattered around the world with no one consolidated database. Some database adminis-
trators don’t feel comfortable with a configuration like this and prefer to have
one central consolidated copy to ensure that all database data is secured and
backed up on a regular basis.
Figure 12-7 shows an arrangement of replicated tables based on the princi-
ples mentioned. There are two copies each of Tables A, B, E, and F, and three
copies of Table D. Apparently, Table C is relatively unimportant or infrequently
used, and it is located solely in Los Angeles.
You can take this partitioning of table data down to an even lower level. Most
manufacturers’ replication schemes support horizontal and vertical data parti-
tioning. With horizontal partitioning, you filter the rows so that a subset of the
available rows is replicated to the remote site. This is used when you want a local
copy of some, but not all, of the rows. With vertical partitioning, you filter the
columns so that only selected columns are replicated. This is typically done as a
security measure, giving the remote locations only the data they need to do their
jobs without generating multiple copies of more sensitive data. You can combine
the two types of partitioning and filter a table both horizontally and vertically.
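In SQL terms, the two filters are easy to picture. The actual filters are defined through the DBMS's replication tools rather than run as ad hoc queries, but their effect is equivalent to the following sketch, which uses hypothetical table and column names.

-- Horizontal partitioning: replicate only the rows for one sales office.
SELECT *
   FROM CUSTOMER
   WHERE SalesOfficeCity = 'Paris';

-- Vertical partitioning: replicate only the less sensitive columns.
SELECT CustomerNumber, CustomerName, SalesOfficeCity
   FROM CUSTOMER;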
You can see an example of partitioning, which is sometimes referred to as
table fragmentation, in Figure 12-8. This figure shows the same network we’ve
been using, but with Table G added. Table G might be, for example, the com-
pany’s employee table: the records of the employees who work in a given city
are stored in that city’s computer. G1 is the subset of records of Table G with
the records of the employees who work in Memphis, G2 is the subset of records
for the employees who work in Los Angeles, and so forth. This makes sense
when one considers that most of the query and access activity on a particular
employee’s record will take place at his or her work location.
The drawback of partitioning is that when one of the sites, say the New York
headquarters location, occasionally needs to run an application that requires
accessing the employee records of everyone in the company, it must collect the
records from every one of the five sites. One way to minimize this problem would
be to keep a consolidated copy of Table G with all employee records in New York.
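If each site's fragment were reachable as a linked server (the server and database names below are invented), the consolidated headquarters query would conceptually be a union of the pieces, written with fully qualified table names.

-- Combine the horizontal fragments of Table G into one result set.
SELECT * FROM MemphisSrv.Personnel.dbo.EMPLOYEE      -- fragment G1
UNION ALL
SELECT * FROM LosAngelesSrv.Personnel.dbo.EMPLOYEE   -- fragment G2
UNION ALL
SELECT * FROM NewYorkSrv.Personnel.dbo.EMPLOYEE      -- fragment G3
UNION ALL
SELECT * FROM ParisSrv.Personnel.dbo.EMPLOYEE        -- fragment G4
UNION ALL
SELECT * FROM TokyoSrv.Personnel.dbo.EMPLOYEE;       -- fragment G5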
Figure 12-8 (figure: the same map as Figure 12-7 with Table G added and partitioned horizontally; fragments G1 through G5 are stored at the five sites, with G1 in Memphis, G2 in Los Angeles, and so on).
1. The transaction processor starts the prepare phase and locates the data-
bases hosting the affected tables.
2. The changes are written to the database transaction logs.
FOR EXAMPLE
The Need for Distributed Operations
Let’s revisit our nationwide auto parts chain store. Previously, it had set up the
local stores with the data each store needs for point-of-sale operations. Now
we’re going to throw in another likely design requirement as a business rule.
Each customer must be listed once, and only once, in the consolidated database.
There are other requirements you might add, such as checking other
nearby locations if you don’t have something in stock or letting customers
return items to any location, but this is enough for now. You can think about
how to handle those problems on your own, later.
Life has definitely gotten more complicated, possibly more complicated
than you realize at first glance. Each customer is tracked separately in the sys-
tem. You want to make sure that all sales are tracked for sales analysis and mar-
keting purposes (all this data feeds a data warehouse at the home office). Let’s
say that you track customers by telephone number, and, by definition in your
data design document, each telephone number represents a unique customer.
The customer steps up to check out with a purchase, gives the employee
a telephone number, and the application does a search of the local customer
table. The number comes up as not on file, so you need to enter a new customer
into the system, right? Not necessarily. What if this person purchased items
from a different location before? The application would do a second search,
as a distributed query, just to make sure. It could issue a query against all
of the other locations, but this is where the master copy comes in. The most
efficient way to check is to run a query against the master customer table
in the home office.
Now, let’s assume that the customer was found in the master table. All
updates must be made to the local databases, so now you have another prob-
lem, updating the customer record with this sale. We mentioned embedding
the location as part of the primary key, and now that information becomes
important. The application would use that location code to identify the
remote database. It would have to create the order as a distributed transac-
tion that updates the local inventory, employee (for sales bonus), and order
tables while also updating a remote customer table.
Would it be easier to code the application so that updates that affect
other locations are made directly to the master table? Possibly, but this
would add two additional problems. One, the design requirement is that all
updates be made to the local database. The other is that you set up repli-
cation so that the remote copies update the master copy, not the other way
around. If you start allowing updates at both, the possibility of concurrency
errors (such as lost updates) and replication conflicts increases.
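A rough sketch of how the application might express such an order on SQL Server follows; the linked server, table, and column names are all hypothetical, and MS DTC coordinates the two-phase commit behind the scenes.

-- Start a transaction that updates both local and remote tables.
BEGIN DISTRIBUTED TRANSACTION;

-- Local updates at the store where the sale takes place.
UPDATE INVENTORY
   SET QtyOnHand = QtyOnHand - 1
   WHERE PartNumber = '10042';
INSERT INTO ORDERS (OrderNumber, PartNumber, CustomerPhone)
   VALUES ('LA-55501', '10042', '901-555-0188');

-- Remote update to the customer record at the location that owns it.
UPDATE MemphisSrv.Sales.dbo.CUSTOMER
   SET LastPurchaseDate = GETDATE()
   WHERE CustomerPhone = '901-555-0188';

COMMIT TRANSACTION;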
With centralized management, one central administrative group has the final say as to the rights and responsibilities to make changes to the servers and databases.
Though not an absolute requirement, this model works best when all of the data-
base servers are well-connected.
Centralized management is not always appropriate or even possible. Local-
ized management with central oversight delegates some or all of the admin-
istrative duties to local administrators. The general idea is that a local adminis-
trator would know his or her server, network, and user requirements better than
someone off in a remote location. Unfortunately, there are several potential prob-
lems inherent in this model. There is a possibility of inconsistency in how servers
are managed or in how well administrators adhere to company guidelines. It can
be difficult to coordinate changes that have to be made to multiple servers at
the same time. Some local administrators might resist controls imposed by com-
pany guidelines or changes made from a central location.
There are some special requirements at the central management point in either
of these models. The requirements aren’t special in terms of what is expected of
administrators so much as they are a matter of scope. For example, administra-
tors working in the central role need to understand server configurations and data
stores at each of the locations. They also need a thorough understanding of data
requirements at each location and data flows between the locations. They need
to understand not only company guidelines regarding data management, but also how
they apply to each location.
SELF-CHECK
• What are some reasons for implementing a distributed data envi-
ronment?
• Which replication type supports the lowest level of granularity for
incremental updates?
• What is latency?
• What replication options are available in most DBMSs?
• What is the advantage of permitting local autonomy when determin-
ing how a distributed data environment is managed?
that is specifically not connected to the Internet. In the Internet database envi-
ronment, the general public potentially has access (planned or unplanned access
by hackers) to the company’s company’s databases. The public responses to the
applications that involve the Internet are often unpredictable, meaning that the
load on the system and access to the databases can change rapidly. The environ-
ment requires constant monitoring and management, with administrators having
to react quickly when the situation changes.
These spikes (some huge) in Internet traffic require predictive capacity plan-
ning. Companies want to be able to maintain reasonable response times during
spikes without spending large amounts of money to buy a lot of extra computer
equipment that will sit idle much or most of the time. Accomplishing this takes
planning and significant expertise.
The challenge is to make the information systems and their databases avail-
able “24/7” without going overboard in terms of cost. How do you handle sys-
tem failures, electrical outages, or planned maintenance time? You can meet these
needs with redundant computer hardware and such accessories as electrical gen-
erators, UPSs, and batteries, but are these all cost-effective solutions? The trick,
and the concern during planning and design, is to find ways to prepare for such
eventualities at a reasonable cost. Excessive traffic is another issue. It can be
caused by legitimate traffic spikes, which can certainly reduce availability, but
can be planned and accounted for. Unexpected problems like computer viruses
that reproduce many copies of themselves, automated “robots” searching Web
sites for information, or dedicated attacks can clog systems, too.
Attacks that target Web sites, such as Denial of Service (DoS) attacks that
flood the Web site with traffic, can have multiple effects. Not only do these
attacks tie up the Web site so that legitimate traffic can’t get through, but the
added load could crash any Web applications running on the server. If the appli-
cation happened to be accessing the database server at the same time, you could
end up with lost or corrupted data.
Virus attacks and malicious attacks must be prevented whenever possible.
You can reduce the attack surface and minimize their effects through your design
and the security measures that you implement. These measures alone, however,
are usually not enough. Another key is constant monitoring by software that
watches for such conditions.
Other actions you take depend on your level of concern, the possibility of attack,
planned user loads, predictable traffic spikes, and so forth. Balanced against these
possible actions is your budget. You can only take the actions that you can afford
to take. Some of the most effective configurations for ensuring availability are also
some of the most expensive. Solutions such as clustering, fault-tolerant storage sub-
systems, and mirrored or standby servers, require duplicate hardware and software.
Growth also impacts availability, though it’s somewhat easier to project and
plan for its effects. Some electronic commerce efforts, in both “pure” e-commerce
startup companies and established companies, have experienced rapid growth.
In one case, the growth rate of traffic to the Web site was estimated at 1,000 to
4,000 percent per year in the early years. This is certainly good news for any
company that experiences it! However, this means that the information system
that supports this Web site must be scalable; that is, it must be capable of grow-
ing in size without adversely affecting the operations of the site. It is thus imper-
ative that hardware and software be chosen that is capable of rapid and major
expansion. Check that your chosen DBMS, as well as other components, have
the ability to scale up and scale out as necessary.
▲ Separating the different parts of the information system so that they run
on different computers. The Web server and the database server should be
different computers. Furthermore, these servers should be separated from
the rest of the company’s information system by being on a separate LAN.
▲ Making use of firewalls. Firewalls provide a layer of protection between
your network and the Internet. Firewalls can be hardware-based, software-
based or both. They can provide different types of protection such as
limiting open data paths or checking incoming messages for viruses and
other suspicious code. Firewalls can be dedicated hardware devices
designed specifically for that purpose or can also be implemented
through specialized server software. Hardware firewalls are usually more
efficient and a better choice in high volume environments. Software-
based firewalls are often more cost-effective and can frequently be
deployed on multi-purposed servers.
FOR EXAMPLE
What Are We Getting Into?
Good Reading Bookstores, a fictitious bookseller, decides to move into the
e-commerce age with a Web site. You, working in the dual role of data and
database administrator, are responsible for the data tier of this new multi-
tiered application. That means getting the database ready, as well as being
there to advise and assist the programmers, network administrators, and
newly hired Webmaster.
Let’s start with your direct responsibilities. For a data administrator, data is
always a good place to begin. The good news is that most of the basic frame-
work required, things like inventory, customer, and order tracking components,
are already in place. You still need to go through the whole data design process,
though, to identify what new data requirements you have, if any. For example,
if you aren’t already tracking email addresses for your customers, you need to
start doing so now. You might need to add a large object data column to the
inventory table so you can include pictures of book covers and other items that
you stock. If the bookstore has not done any shipping in the past, you need to
add appropriate columns and tables to track shippers and shipping information.
Switch over to your database administrator hat now. Your security con-
cerns and the possibility of attack have just increased significantly. You need
to review database security to see what changes you need to make there. You
also need to provide for connectivity for the application that will feed data to
and process transactions for the Web server. This means working with the pro-
grammers, both to determine their requirements and to provide information
about data structures. You need to work with network administrators to see
what protections they are putting in place and what impact they might have
on how you configured communication parameters at the database server.
You also have a new set of performance concerns. You need to see what
is being done from the context of the Web server and network support com-
ponents to optimize performance, but you may need to make some changes
at your end, too. The server load is going to increase, but how much? That
is going to be difficult to predict. It depends on how long it takes people to
find the Web site and what kind of interest it generates. That doesn’t mean
you can’t do anything to prepare. Specific actions you can take include:
• Updating the baseline performance statistics.
• Reviewing current queries and query performance.
• Reviewing any new query requirements, watching for potential problems
like transactions likely to deadlock.
• Testing and tuning the database using queries needed by the new Web
application and creating additional indexes as necessary (see the sketch below).
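For instance, if testing showed that the new Web application searches heavily by book title, one of those indexes might look like the sketch below (the table and column names are hypothetical).

-- Support title searches issued by the new Web application.
CREATE NONCLUSTERED INDEX IX_Book_Title
   ON BOOK (Title);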
SELF-CHECK
• What does response time refer to?
• What are some of the factors that affect Web site performance?
• What can cause problems with availability of a Web site?
• What is a justification for placing a firewall between a Web server
and database server?
SUMMARY
This chapter looked at different database configuration options and how they
can be used to support database applications. The chapter began with a close
look at a centralized data environment, including the possibility of two- and
three-tier data requirements in that environment. It looked at several options for
implementing a distributed data environment and the role of replication in main-
taining distributed data. Finally, it looked at some key issues related to sup-
porting Internet-based applications.
KEY TERMS
Authoritative
Centralized data environment
Centralized management
Change replication
Commit phase
Database persistence
Distributed database servers
Distributed data environment
Distributed join
Distributed query
Firewall
Fragmentation
Gateway computer
Horizontal partitioning
Local area network (LAN)
Latency
Local autonomy
Localized management with central oversight
Merge replication
Microsoft Distributed Transaction Coordinator (MS DTC)
Multi-purposed server
Prepare phase
Proxy server
Query cache
Replication
Response time
Server instances
Snapshot replication
Three-tier approach
Transaction replication
Two-phase commit
Two-tier approach
Vertical partitioning
Well-connected
Summary Questions
1. A collection of PCs connected by communication lines in a relatively
small geographic area is known as a local area network (LAN). True or
False?
2. Which of the following is an example of a three-tier centralized data
approach?
(a) the application includes a database server, application server, and
Web server, with clients accessing data through a browser
(b) source data located on a database server, on the local PCs running
the client application, and on a mainframe computer
(c) source data is spread across three database servers deployed on the
same network
(d) source data is deployed on three database servers deployed in differ-
ent geographic locations
3. Which statement best describes the data stored in a data warehouse?
(a) operational data needed to support day-to-day activities
(b) archive data that is no longer applicable but kept available in case it
is needed in the future
(c) accumulated current and historical data used by decision support
applications
(d) external data waiting to be evaluated and integrated with production
data
4. What is the biggest concern when deploying a database server instance
on a multi-purposed server?
(a) The requirement to share resources with other server applications
could lead to less than optimal performance.
(b) Security configuration settings must be the same for all servers
hosted on the same computer.
(c) Other server applications might overwrite data stored in the database
server’s databases.
(d) It makes it more difficult for database clients to recognize the
computer as a database server.
(c) Allow changes to any copy and allow any copy to directly update
any other copy.
(d) You don’t have enough information to determine the correct answer.
11. A distributed join is a query that consolidates data from tables located in
different databases. True or false?
12. You initiate a distributed query from a database server running SQL
Server. There are multiple copies of the data table located on different
database servers, but with the same table name. How do you ensure that
the correct copy of the data is updated?
(a) List the affected database servers when you initiate the transaction.
(b) Let SQL Server choose the most appropriate table copy.
(c) Let MS DTC choose the most appropriate table copy.
(d) Identify each table by its fully qualified name.
13. A firewall can be used to filter traffic between a network and the Internet
to monitor for malicious activity. True or False?
14. How is database persistence implemented?
(a) through a query cache associated with the Web server
(b) by caching the query execution plan at the database server
(c) through a query cache on the firewall
(d) none of the above
15. What actions can you take to prevent accidental disclosure of data stored
on a SQL Server 2005 database?
(a) Encrypt communications into and out of the database server.
(b) Encrypt data stored in database tables.
(c) Allow access permissions on an as-needed basis only.
(d) all of the above
Planning a Centralized Environment
You have a heterogeneous network environment with data stored on local PCs and on a mainframe computer. The network links to the mainframe through a gateway computer. You want to consolidate the data on a centralized SQL database server. The transition will be accomplished in phases, first consolidating the PC data and then including the mainframe data.
The database server will support specialized custom database applications. The applications are being developed in-house by staff programmers. You provide assistance to the programmers as necessary. Most access to the database server will be limited to access via the database applications, but a small number of clients will have direct access to the server.
At the same time, you are designing a separate database that will support an online research and reference application. Most of the tables needed for the application will rarely need updating. Access to this server will be limited to a .NET Web application running on a Web server.
Your database server does not support locating external data sources unless they are explicitly identified.
1. Describe the requirements for supporting data access during the initial deployment phases.
2. You need to select the hardware platform needed to support the SQL databases. The databases will be hosted on different DBMS products. Discuss the considerations for consolidating the data requirements on a single computer. Justify your answers.

Planning a Distributed Environment
You are designing and deploying a distributed data environment that will support for-profit medical clinics. You have completed the initial data design and the data load on a database server that consolidates the records for all of the clinics. The consolidated server does not directly support any one clinic. Instead, each clinic will have its own local database server. Each office will maintain its own separate staff, equipment, and inventory records. Every clinic must have full access to all patient and treatment records. There is also a full set of medical references online.
The primary concerns are accuracy, consistency, security, and how quickly patient and treatment information can be retrieved at the local clinics. All clients need access to the medical references, but speed of access to this information is less critical. Update latency must be kept to a minimum. All communication between locations and all patient records are encrypted.
Treatment information is related to each patient and includes information about any staff involved with the treatment, medications given during the visit to the patient, and prescriptions written for the patient. The clinics do not fill prescriptions, but do sometimes give out samples they have received from drug representatives. These are noted on the treatment record, but not tracked in any detail.
You want to minimize storage requirements, but must meet the performance expectations for data retrieval.
1. Describe the data requirements at each clinic. Include which tables, if any, would need to be partitioned. Identify data, if any, that would be stored on the consolidated data server only.
2. Describe the replication requirements needed to perform an initial load of the remote databases and to maintain data integrity and consistency.
3. Identify situations that would call for distributed transactions. Be as specific as possible. Include the specific steps necessary to execute a distributed transaction using a two-phase commit process.
GLOSSARY
Access path A plan for what steps to take to respond to a query.
Access security Controlling user and application access to data by allowing or disal-
lowing different activities.
ACID Transaction acronym standing for atomicity, consistency, isolation, and durability.
Active Directory A Microsoft directory service that supports a full featured network
security environment organized around logical entities known as domains.
ADO.NET A collection of data object libraries that let you connect to various data
sources, send requests, and receive and manipulate data.
Aggregate function A function that operates on a set of values, returning a single
result based on these values.
Alert Performance threshold monitor.
AND Boolean operator used to test conditions in which both must be true for the
result to be true.
Application data requirements Information that the application needs and how it
will be used by the application.
Arbitration Act of resolving disputes.
ASCII American Standard Code for Information Interchange. An encoding standard
for encoding English language characters and control characters, supporting 127 print-
able characters.
Associative entity Entity designed to associate key values from two entities between
which a many-to-many relationship is defined.
Atomicity Referring to the fact that either all of the statements in the transaction are
executed, or none of the statements are executed.
Attribute Information describing a data entity.
Audit trail Ongoing log of user access and activity.
Authentication The process by which a security principal is identified and allowed
access.
Data normalization A methodology for organizing attributes into tables so that redun-
dancy among the nonkey attributes is eliminated.
Data planning Identifying data requirements, analyzing available data, and designing
ways to meet data needs.
Data query Term commonly used to refer to data access.
Data repository Refers to the data storage unit where physical data files are kept.
Data security Processes and procedures implemented with the goal of keeping data
safe. Basic data security categories are access security and physical security.
Data type A way of describing the kind of data that can be stored in a column or a
variable.
Data type Data storage format used to specify storage characteristics for data columns
and variables.
Data volatility Referring to how often data is added to the database and how often
the data changes.
Data volume assessment The process of determining the storage space required for
each database table.
Data warehouse A decision support database used to store a large amount of (typically
historical) data.
Database An ordered collection of related data elements intended to meet the infor-
mation needs of an organization and designed to be shared by multiple users.
Database administrator Role responsible for operationally-oriented administration
activities.
Database engine Core DBMS component responsible for data retrieval and modifica-
tion, coordinating other DBMS component actions, providing an interface with the data
user, and interfacing with the platform operating system.
Database object Something created as part of a database, such as a table or view.
Database persistence Act of caching queries in an Internet-based application where
data is cached at the Web server or a proxy server.
Database practitioner Information Technology personnel responsible for designing,
creating, and maintaining databases.
DDL statement Statement used to create and manage server and database objects.
DDL trigger Trigger that fires when database or database server objects are created or
modified.
Deadlock A condition where two transactions mutually block each other.
Decision support database A database designed to support advanced data mining
activities and provide support for strategic decision making.
Declarative statement Statement in which you specify the data for which you are
looking and let the DBMS determine the procedure for accessing that data.
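A SELECT statement is the usual example: you state what data you want and the DBMS decides how to retrieve it. A minimal sketch (the Customer table and its columns are hypothetical):
    SELECT LastName, FirstName
    FROM Customer
    WHERE City = 'Memphis';   -- describes the result, not the access procedure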
Decomposition process Another term for the data normalization process. Also called
“non-loss decomposition.”
Dynamic SQL Statements executed directly through an interactive user interface, such
as a command prompt, for immediate execution on the database server.
Electronic data interchange (EDI) Technologies and standards for electronic data
transfer.
Embedded SQL Statements executed in the context of another programming language.
Enterprise Resource Planning (ERP) software Multi-function, integrated software
based on multiple software modules used to manage common business tasks and based
on a shared database.
Entity Refers to something having a distinct existence. In the context of database
technologies, refers to the items that can be described and tracked in the database.
Entity integrity Relating to ensuring that each row is uniquely identified.
Entity-Relationship (E-R) Modeling The process of identifying entities, attributes, and
relationships for a relational database model.
Entity-Relationship Diagram (ERD) Data diagram based on entities and relationships.
Exception conditions Less commonly used normal forms.
Exclusive lock A lock that, when acquired by one transaction, prevents other transac-
tions from accessing the object.
Explicit conversion The process of manually converting data.
Explicit transaction Transaction in which the entire transaction is controlled through
BEGIN TRAN, COMMIT TRAN, and ROLLBACK TRAN.
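A minimal Transact-SQL sketch (the Account table is hypothetical):
    BEGIN TRAN;                          -- start the explicit transaction
    UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRAN;                         -- or ROLLBACK TRAN to undo both updates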
Fault-tolerant storage Physical data storage method designed to avoid data loss or
errors in case of hardware failures.
Fault-tolerant system System designed to protect data in the event of a hard disk failure.
Firewall Provides a layer of protection between your network and the Internet. Firewalls can be hardware-based, software-based, or both.
First normal form Each attribute value is atomic, that is, no attribute is multivalued.
Foreign key Relational database object used to define and maintain relationships
between tables.
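A minimal sketch, assuming hypothetical Customer and OrderHeader tables (Customer.CustomerID is the referenced primary key):
    CREATE TABLE OrderHeader (
        OrderID    INT PRIMARY KEY,
        CustomerID INT NOT NULL
            REFERENCES Customer (CustomerID)   -- foreign key: each order must point to an existing customer
    );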
Forms generator DBMS component used to design forms. Most commonly used to
design data input and data presentation forms.
Fragmentation A condition where the data is randomly spaced in small pieces all over
the hard disk rather than being located in one place.
Fragmentation Term sometimes used to describe table partitioning.
Full backup A backup of all data, including the transaction log, if any.
Fully qualified object name Complete object name that defines the object as globally
unique.
Function Named set of executable steps that support input parameters and return a
value.
Function Special executable designed to operate on data and return a result.
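For example, a minimal sketch of a scalar user-defined function in Transact-SQL (the name and tax calculation are hypothetical):
    CREATE FUNCTION dbo.AddTax (@Amount MONEY, @Rate DECIMAL(5,2))
    RETURNS MONEY
    AS
    BEGIN
        RETURN @Amount * (1 + @Rate / 100);   -- accepts input parameters, returns a single value
    END;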
Nondeterministic function A function that might return different results, even if called
with exactly the same arguments.
Non-logged operation An operation whose changes are not written to the transaction log but are instead made directly to the database.
Non-loss decomposition Normalization process in which neither data nor relation-
ships are lost.
Nonrepeatable read A condition where data changes between reads of the same
data set.
Nonvolatile storage Storage media that continues to hold the data it contains when
power is lost or the computer is turned off. The most common example is disk drives.
Normal forms Defined rules for data normalization.
Null value An undefined value, usually used to identify that no value is provided for that attribute.
Nullability Column definition as to whether or not a column can accept null values.
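For example (hypothetical Employee table):
    CREATE TABLE Employee (
        EmployeeID INT         NOT NULL,   -- null values not accepted
        MiddleName VARCHAR(30) NULL        -- null values accepted
    );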
Object-oriented database model Database model where entities are treated as objects,
which are individual items that can be defined and described more completely than in a
relational model.
Object-relation model Data model based on both relational and object modeling
concepts.
Object-relational database model Hybrid database model based on the relational
model, but integrating features and functionality from the object-oriented model.
OLE DB A Microsoft data access API.
One-to-many relationship A relationship in which one referenced entity instance can be related to any number of referencing entity instances.
One-to-one relationship A relationship in which one referenced entity instance can be related to one referencing entity instance.
Online transaction processing Database type characterized by the need to support adding, changing, and deleting data, the need to optimize for concurrency and throughput, and the requirement to scale to meet the needs of a large number of users.
Open Database Connectivity (ODBC) An industry-standard API supporting data connectivity.
Open transaction A pending transaction that has not been committed or rolled back.
Operating system software Software that controls the computer hardware, provides
an interface to the user, and provides an environment in which applications can run.
Operator precedence Operator ranking used to determine the order in which opera-
tors are processed.
Optimistic processing Transaction control method that assumes that conflicts are
unlikely and does not check for conflicts before processing the transaction.
OR Boolean operator used to test conditions in which the result is true if either condi-
tion is true.
Outer join A query returning qualifying rows from one table and all rows from the
second “outer” table.
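A minimal sketch, assuming hypothetical Customer and OrderHeader tables:
    SELECT c.CustomerName, o.OrderID
    FROM Customer AS c
    LEFT OUTER JOIN OrderHeader AS o
        ON o.CustomerID = c.CustomerID;   -- every customer is returned, even those with no matching orders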
Output parameter A value returned by a procedure.
Packet size Network communication parameter setting the maximum amount of data
that can be sent at one time.
Paging file An area of hard disk storage set aside for use as system memory.
Parameters Values provided that help determine how a command runs.
Partial functional dependency A dependency where data is dependent on part of the
primary key.
Partial rollback A process by which part, but not all, of the statements in a transaction
are reversed.
Partial update A condition in which only some of the tables involved in an operation are updated rather than all of them.
Performance baseline Performance data you collect to use for comparison later when
performance problems are suspected.
Performance counter Module designed to monitor a specific aspect of system or appli-
cation performance.
Performance object A group of related performance counters.
Phantom read A condition where a row appears or disappears in subsequent reads of
the same data.
Phantom A data row that has appeared or disappeared in a subsequent read of the
same data.
Physical data pointer Value used to link and associate data in the hierarchical and network database models. Also used to link data values to physical storage locations in many implementations of the relational database model.
Physical design Design process that includes identifying and implementing the physical
database and database objects.
Physical security Protecting data against physical corruption or loss; typically involves protecting the server, securing the storage media, and duplicating data.
Precedence constraint Decision point within a maintenance plan used to control
execution.
Prepare phase Initial phase of the two-phase commit process where the databases are
prepared for update.
Primary index Index for a table’s primary key.
Primary key Database object used to enforce uniqueness in a table.
Procedure Compiled set of executable statements.
Processor affinity Configuration settings relating to multiple processor use.
Production database Database that provides support for day-to-day business activities.
Repeatable read Transaction isolation level that prevents dirty reads and nonrepeatable
reads by not reading uncommitted data and not allowing changes to data read by the
transaction.
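In SQL Server the isolation level is requested on the connection before the transaction starts; a minimal sketch (the Account table is hypothetical):
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRAN;
    SELECT Balance FROM Account WHERE AccountID = 1;   -- rows read here cannot be changed by other transactions
    -- reading the same row again within this transaction returns the same value
    COMMIT TRAN;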
Replication Means by which a database server updates one or more destination data-
base servers.
Report writer DBMS component responsible for designing and processing reports.
Reporting database A decision support database designed to support report generation.
Response time The delay from the time that the request is made (such as when the
Enter key is pressed) to execute a query until the result appears on the screen.
Response time The elapsed time required for a Web server’s response to display in
the client’s browser.
Result set Rows and columns returned by a SELECT command.
Role In SQL Server, login or data user groups with defined permissions.
Roll forward A recovery process that updates database tables based on committed
transactions in the transaction log.
Rollback The process of backing out the changes made by an unsuccessful transaction
so that they are not written to the hard disk.
Sa account SQL Server administrator account.
Scalable Term describing a solution that can grow to meet business needs.
Scaling out The process of partitioning data and a database application across multiple
distributed database servers.
Scaling up The process of upgrading the database hardware platform by improving
the processor, memory, disk subsystem, or other hardware resources.
Schema Metadata used to describe database objects.
Schema binding Prevents you from making any changes to the structure of the base
tables, such as adding columns or changing column data types.
Script A set of executable statements saved together as a file.
Search argument Logical and conditional operators used to filter the rows processed
by a command.
Search attribute A value used to retrieve particular records.
Second normal form Every nonkey attribute must be fully functionally dependent on
the entire key.
Secondary index Additional database indexes (other than the primary index).
Secondary storage Nonvolatile data storage such as disk and tape drives.
Securable Server or database object over which you control permission assignments.
Security context The security default for a connection based on the security princi-
pal’s assigned rights and permissions.
Security principal Refers to an entity that can be uniquely identified through authen-
tication and sets a connection’s security context.
Selectivity Referring to the number of different values in a table column.
Serializable Most restrictive transaction isolation level designed to prevent dirty reads,
nonrepeatable reads, and phantoms.
Serialization Non-concurrent transaction processing (serial processing).
Server instances Multiple, separately-managed database servers running on a single
computer.
Shared lock A lock that allows other transactions to access the same data for read.
Snapshot replication Configuration where an entire table is copied to the destina-
tion server.
Snapshot SQL Server-specific transaction isolation level in which the transaction uses
a snapshot of database data taken at the start of the transaction.
Split read Read request serviced by multiple hard disks.
SQL Native Client OLE DB-based data access technology that supports the full feature set of SQL Server 2005.
SQL Server authentication Authentication method based on SQL Server logins and
database server authentication.
Sqlcmd The command used to launch the preferred character-based command line
interface for Microsoft SQL Server 2005.
Stakeholder Any person with a personal interest in, or who is affected by, database design and implementation.
Strategic data planning Data planning relating to long-term data use issues.
Strong password Password that includes mixed case letters, numbers, and “special”
characters.
Subquery A way of retrieving information where one query is dependent on another
query.
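A minimal sketch, assuming hypothetical Customer and OrderHeader tables:
    SELECT CustomerName
    FROM Customer
    WHERE CustomerID IN (SELECT CustomerID        -- inner query supplies the list of values
                         FROM OrderHeader
                         WHERE OrderTotal > 1000);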
Surface area A term referring to how exposed a database is to access and manipulation.
System analyst IT specialist responsible for analyzing complex systems and how
they interact, in order to determine how computers can best be used to meet business
requirements.
System stored procedures Predefined procedures that install with SQL Server.
Table Fundamental database object in the relational database model where entities
are described as rows and columns.
Table alias A value used to replace and represent a table name when writing queries.
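For example (hypothetical Customer table):
    SELECT c.CustomerName
    FROM Customer AS c          -- "c" stands in for Customer in the rest of the query
    WHERE c.City = 'Memphis';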
Table order The order in which rows are physically stored in a table.
Temporary table A table created in memory only and having a limited scope.
M
Mainframe, 40
Main memory, 40, 41
Maintenance plans, 52–53, 322–24
Managed objects, 394–96
Management, see Database management; Data management
Manual failover, 399
Manual modeling, 92
Manual tasks, 320
Many-to-many (M-M) binary relationship, 81–85, 87–90, 111–13
Many-to-many (M-M) relationships, 87–91
Many-to-many (M-M) unary relationships, 84, 85–86, 115–16
Mass deployment databases, 29
Mathematical SQL function category, 204
Member records, network database models, 33–34
Memory and performance, 266, 268–69
Memory buffers, 41–42
Memory management by operating system, 45
Merge replication, 424
Merge-scan join, 280
Metadata, 43, 47, 315
Microsoft Access, 44, 306
Microsoft Developer Network (MSDN), 225, 253
Microsoft Distributed Transaction Coordinator (MS DTC), 431
Microsoft Maintenance Plan Wizard, 322–23
Microsoft SQL Server 2005
  Activity Monitor, 360, 362, 390, 392
  automation, 321–22, 325
  command context, 191
  configurations, 414–17
  connectivity interfaces, 376–77
  database performance, 266–67
  distributed data, 415–17, 431–32, 439
  functions, 204, 288
  hardware performance, 265–70
  indexed view, 176–77
  interactive SQL, 187
  maintenance, 322–23
  Management Studio, 188–89, 320, 386, 390
  object-relational database model, 36
  performance monitoring, 270–75
  programmable objects, 284–92
  query mode, 187
  security, 382–83, 439
  SELECT, 185–87, 191–92, 208, 213, 224, 225
  server security and access, 383–92
  transaction management, 355–62
  triggers, 162
  troubleshooting, 324–25
  utility decisions, 320
  Windows authentication, 383–84, 387, 391–92
Microsoft Visio, 92–93
Microsoft Windows
  access authentication, 383–84, 387, 391–92
  disk fragmentation utility, 267
  as PC operating system, 43
  performance monitors, 271–72, 275
  security, 387
Middleware, 379
Mini-computer, 40
Minimally logged operation, 212
Minus sign operators, 197
Mirror image, 15
Mixed authentication, 383
M-M (many-to-many) binary relationship, 81–85, 87–90, 111–13
M-M (many-to-many) relationships, 87–91
M-M (many-to-many) unary relationships, 84, 85–86, 115–16
Modalities, 82–83
Models, see Database models; Data models
Modems, 373–74
Monitoring
  active connections, 390
  activity logs, 46–47, 390–91
  Activity Monitor, Microsoft SQL Server 2005, 360, 362, 390, 392
  background, 46
  performance, 270–75, 314
  security, 314–15
MSDN (Microsoft Developer Network), 225, 253
MS DTC (Microsoft Distributed Transaction Coordinator), 431
Multiplication operators (*), 197, 207
Multi-purposed servers, 416, 438
Multi-statement table-valued functions, 288
Multi-tier approach, data configuration, 414, 415, 416, 418, 438
Multi-tier connectivity, 379–82
Mutual authentication, 376–77
MySQL software, 185, 186, 417

N
Navigational approaches to databases, 31
Nested-loop join, 280
Nested procedures, 286
Nested subqueries, 242, 246–47
Nested transactions, 343–45
.NET Framework, 378, 390
Network adapter, 373
Network interface card (NIC), 373
Networks
  database models of, 33–34
  defined, 39
  local area networks (LANs), 372–73, 413–14
  operating system controls, 45
  shared for data backup, 402–4