DATABASE SYSTEMS

Database architecture is the set of specifications, rules, and processes that dictate how data is stored in a database and how data is accessed by components of a system. It includes data types, relationships, and naming conventions. The database architecture describes the organization of all database objects and how they work together. It affects integrity, reliability, scalability, and performance. The database architecture involves anything that defines the nature of the data, the structure of the data, or how the data flows.
This document is intended to be a fairly comprehensive description of a database architecture
proposal. The specific database architecture being suggested may not be a perfect fit for every
environment. However, even if it does not meet all the needs of a particular situation, it should
provide some valuable ideas and important points of consideration.
The database architecture proposed here is the result of much research and practical experience.
The advice of many industry experts was gathered from several different sources. That advice
was merged with day-to-day experience building the database architecture for a new company
from the ground up. The company, a growing data processing services firm (totally separate
from Wingenious), is currently using a large custom software package that is based upon a
database architecture very much like this one.
This database architecture has served the business mentioned above very well since it was
adopted there in 2001. As of late 2005, the company maintains roughly 30 databases on three
separate servers. The databases contain roughly 1500 tables with roughly 250 million records.
The main database contains about 200 tables with about 50 million records. It’s used mainly for
OLTP and large batch processing chores, but it also handles some OLAP tasks. DBA duties are
greatly simplified and extremely efficient largely due to dynamic routines that are made possible
by the consistency of the database architecture.
This database architecture is intended to be generic and applicable to any business. It addresses
only the back-end of a system, leaving the front-end choices for others to debate. The various
options for presenting data to users are beyond the scope of this document. This document
discusses the database itself primarily, but it also touches on getting data to a middle tier of a
multi-tier system. The information should be beneficial to a DBA or a software developer, and
especially a person whose job includes aspects of both positions.
DBMS
A DBMS is a collection of programs that manages the database structure and controls access to the data stored in the database. The DBMS serves as the intermediary between the end user and the database by translating user requests into complex computer code. The end user interacts with the DBMS through an application program, which is written by a programmer with the help of DBMS utility programs. A database is a collection of data, typically describing the activities of one or more related organizations.
Database Management Systems (DBMSs) are complex, mission-critical software systems. Today's DBMSs embody decades of academic and industrial research and intense corporate software development. Database systems were among the earliest widely deployed online server systems and, as such, have pioneered design solutions spanning not only data management, but also applications, operating systems, and networked services. The early DBMSs are among the most influential software systems in computer science, and the ideas and implementation issues pioneered for DBMSs are widely copied and reinvented.
For a number of reasons, the lessons of database systems architecture are not as broadly known as they should be. First, the applied database systems community is fairly small. Since market forces only support a few competitors at the high end, only a handful of successful DBMS implementations exist. The community of people involved in designing and implementing database systems is tight: many attended the same schools, worked on the same influential research projects, and collaborated on the same commercial products. Second, academic treatment of database systems often ignores architectural issues. Textbook presentations of database systems traditionally focus on algorithmic and theoretical issues rather than on the architecture of full implementations.
DATABASE SYSTEM
A database management system, or DBMS, is software designed to assist in maintaining and utilizing large collections of data, and the need for such systems, as well as their use, is growing rapidly. The alternative to using a DBMS is to use ad-hoc approaches that do not carry over from one application to another; for example, to store the data in files and write application-specific code to manage it. The use of a DBMS has several important advantages.
The area of database management systems is a microcosm of computer science in general. The issues addressed and the techniques used span a wide spectrum, including languages, object-orientation and other programming paradigms, compilation, operating systems, concurrent programming, data structures, algorithms, theory, parallel and distributed systems, user interfaces, expert systems and artificial intelligence, statistical techniques, and dynamic programming. We will not be able to go into all these aspects of database management in this book, but it should be clear that this is a rich and vibrant discipline.
The problems in the use of computer file systems make the database system very desirable. The current generation DBMS provides the following functions:
 Stores the definitions of data relationships (metadata) in a data dictionary. In turn, all programs that access the database work through the DBMS. The DBMS uses the data in the data dictionary to look up the required data-component structures and relationships. Any changes made in a database file are automatically recorded in the data dictionary, thus freeing us from having to modify all the programs that access a changed file. The DBMS removes structural and data dependency from the system (see the query sketch after this list).
 Creates the complex structures required for data storage, thus relieving us of the difficult task of defining and programming the physical data characteristics.
 Transforms entered data to conform to the data structures, relieving us of the chore of distinguishing between the logical format and the physical format of the data.
 Creates a security system and enforces security and privacy within the database.
 Creates complex structures that allow multiple-user access to the data.
 Provides backup and data recovery procedures to ensure data safety and integrity.
 Promotes and enforces integrity rules to eliminate data integrity problems, thus allowing us to minimize data redundancy and maximize data consistency.
 Provides data access via a query language (a nonprocedural language) and via procedural (3GL) languages.
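
As a small illustration of the data dictionary mentioned in the first point above, most relational products expose this metadata through catalog views. The sketch below uses the standard INFORMATION_SCHEMA views (available, with minor variations, in SQL Server, MySQL, PostgreSQL, and others); the table name Students is only an example.

-- Ask the data dictionary which columns a table has and what their types are.
SELECT table_name,
       column_name,
       data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'Students'
ORDER BY ordinal_position;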
For example, a university database might contain information about the following:
1. Entities such as students, faculty, courses, and classrooms.
2. Relationships between entities, such as students' enrollment in courses, faculty teaching courses, and the use of rooms for courses.
ADVANTAGES OF A DBMS
Using a DBMS to manage data has many advantages:

1. Data independence: Application programs should be as independent as possible from details of data representation and storage. The DBMS can provide an abstract view of the data to insulate application code from such details.
2. Efficient data access: A DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently. This feature is especially important if the data is stored on external storage devices.
3. Data integrity and security: If data is always accessed through the DBMS, the DBMS can enforce integrity constraints on the data. For example, before inserting salary information for an employee, the DBMS can check that the department budget is not exceeded. Also, the DBMS can enforce access controls that govern what data is visible to different classes of users (see the sketch after this list).
4. Data administration: When several users share the data, centralizing the administration of data can offer significant improvements. Experienced professionals who understand the nature of the data being managed, and how different groups of users use it, can be responsible for organizing the data representation to minimize redundancy and for fine-tuning the storage of the data to make retrieval efficient.
5. Concurrent access and crash recovery: A DBMS schedules concurrent accesses to the data in such a manner that users can think of the data as being accessed by only one user at a time. Further, the DBMS protects users from the effects of system failures.
6. Reduced application development time: Clearly, the DBMS supports many important functions that are common to many applications accessing data stored in the DBMS. This, in conjunction with the high-level interface to the data, facilitates quick development of applications. Such applications are also likely to be more robust than applications developed from scratch because many important tasks are handled by the DBMS instead of being implemented by the application.
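
To make points 3 and 4 concrete, the following is a minimal sketch of declarative integrity constraints and access control in SQL. The table, column, and role names (Department, Employee, reporting_role) are invented for illustration, and a rule such as "the department budget must not be exceeded" would in practice require a trigger or an application-level check rather than a simple column constraint.

CREATE TABLE Department (
    dept_id INTEGER PRIMARY KEY,
    budget  DECIMAL(12,2) NOT NULL CHECK (budget >= 0)
);

CREATE TABLE Employee (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES Department(dept_id), -- referential integrity
    salary  DECIMAL(10,2) CHECK (salary > 0)                 -- simple domain constraint
);

-- Access control: this role may read employees but not the salary column
-- (column-level GRANT syntax varies slightly between products).
GRANT SELECT (emp_id, dept_id) ON Employee TO reporting_role;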
A DBMS is a complex piece of software, optimized for certain kinds of workloads (e.g., answering complex
queries or handling many concurrent requests), and its performance may not be adequate for certain specialized
applications. Examples include applications with tight real-time constraints or applications with just a few well-defined critical operations for which efficient custom code must be written. Another reason for not using a
DBMS is that an application may need to manipulate the data in ways not supported by the query language. In
such a situation, the abstract view of the data presented by the DBMS does not match the application’s needs,
and actually gets in the way.
Traditional File Systems versus a DBMS / Architecture / Schemas
To understand the need for a DBMS, let us consider a motivating scenario: A company has a large collection (say, 500 GB) of data on employees, departments, products, sales, and so on. This data is accessed concurrently by several
employees. Questions about the data must be answered quickly, changes made to the data by different users must be
applied consistently, and access to certain parts of the data (e.g., salaries) must be restricted.

We can try to deal with this data management problem by storing the data in a collection of operating system
files. This approach has many drawbacks, including the following:

We probably do not have 500 GB of main memory to hold all the data. We must therefore store data in a storage
device such as a disk or tape and bring relevant parts into main memory for processing as needed.

Even if we have 500 GB of main memory, on computer systems with 32-bit addressing, we cannot refer directly to more than about 4 GB of data! We have to program some method of identifying all data items.

We have to write special programs to answer each question that users may want to ask about the data. These
programs are likely to be complex because of the large volume of data to be searched.
We must protect the data from inconsistent changes made by different users accessing the data concurrently.
If programs that access the data are written with such concurrent access in mind, this adds greatly to their
complexity.
We must ensure that data is restored to a consistent state if the system crashes while changes are being made.
Operating systems provide only a password mechanism for security. This is not sufficiently flexible to enforce security policies in which different users have permission to access different subsets of the data.

A DBMS is a piece of software that is designed to make the preceding tasks easier. By storing data in a DBMS,
rather than as a collection of operating system files, we can use the DBMS’s features to manage the data in a
robust and efficient manner. As the volume of data and the number of users grow—hundreds of gigabytes of data
and thousands of users are common in current corporate databases—DBMS support becomes indispensable.
Conceptual Schema
The conceptual schema (sometimes called the logical schema) describes the stored data in terms of the data model of the DBMS. In a relational DBMS, the conceptual schema describes all relations that are stored in the database. In our sample university database, these relations contain information about entities, such as students and faculty, and about relationships, such as students' enrollment in courses. All student entities can be described using records in a Students relation, as we saw earlier. In fact, each collection of entities and each collection of relationships can be described as a relation, leading to the following conceptual schema:

Students(sid: string, name: string, login: string, age: integer, gpa: real)
Faculty(fid: string, fname: string, sal: real)
Courses(cid: string, cname: string, credits: integer)
Rooms(rno: integer, address: string, capacity: integer)
Enrolled(sid: string, cid: string, grade: string)
Teaches(fid: string, cid: string)
Meets In(cid: string, rno: integer, time: string)

The choice of relations, and the choice of fields for each relation, is not always obvious, and the process of
arriving at a good conceptual schema is called conceptual database design.
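
As a sketch, the first two relations above could be declared in SQL as follows. The character-type lengths are arbitrary choices, since the conceptual schema specifies only abstract domains such as string, integer, and real.

CREATE TABLE Students (
    sid   CHAR(20) PRIMARY KEY,
    name  CHAR(30),
    login CHAR(20),
    age   INTEGER,
    gpa   REAL
);

CREATE TABLE Enrolled (
    sid   CHAR(20) REFERENCES Students(sid),
    cid   CHAR(20),
    grade CHAR(2)
);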
Physical Schema
The physical schema specifies additional storage details. Essentially, the physical schema summarizes how the
relations described in the conceptual schema are actually stored on secondary storage devices such as disks and tapes.
We must decide what file organizations to use to store the relations, and create auxiliary data structures called indexes to
speed up data retrieval operations. A sample physical schema for the university database follows:

Store all relations as unsorted files of records. (A file in a DBMS is either a collection of records or a
collection of pages, rather than a string of characters as in an operating system.)
Create indexes on the first column of the Students, Faculty, and Courses relations, the sal column of Faculty, and
the capacity column of Rooms.

Decisions about the physical schema are based on an understanding of how the data is typically accessed. The process of arriving at a good physical schema is called physical database design. We discuss physical database design in Chapter 16.
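
In SQL, the index portion of the sample physical schema above might be realized with statements like the following; the index names are invented for illustration, and the choice of underlying file organization is product-specific and omitted here.

CREATE INDEX idx_students_sid   ON Students(sid);
CREATE INDEX idx_faculty_fid    ON Faculty(fid);
CREATE INDEX idx_courses_cid    ON Courses(cid);
CREATE INDEX idx_faculty_sal    ON Faculty(sal);
CREATE INDEX idx_rooms_capacity ON Rooms(capacity);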
External Schema
External schemas, which usually are also in terms of the data model of the DBMS, allow data access to be
customized (and authorized) at the level of individual users or groups of users. Any given database has exactly
one conceptual schema and one physical schema because it has just one set of stored relations, but it may have
several external schemas, each tailored to a particular group of users. Each external schema consists of a collection of
one or more views and relations from the conceptual schema. A view is conceptually a relation, but the records in a
view are not stored in the DBMS. Rather, they are computed using a definition for the view, in terms of relations
stored in the DBMS.

The external schema design is guided by end user requirements. For example, we might want to allow students to find
out the names of faculty members teaching courses, as well as course enrollments. This can be done by defining the
following view:
Courseinfo(cid: string, fname: string, enrollment: integer)
A user can treat a view just like a relation and ask questions about the records in the view. Even though the records in
the view are not stored explicitly, they are computed as needed. We did not include Courseinfo in the conceptual
schema because we can compute Courseinfo from the relations in the conceptual schema, and to store it in addition
would be redundant. Such redundancy, in addition to the wasted space, could
lead to inconsistencies. For example, a tuple may be inserted into the Enrolled relation, indicating that a particular
student has enrolled in some course, without incrementing the value in the enrollment field of the corresponding record
of Courseinfo (if the latter also is part of the conceptual schema and its tuples are stored in the DBMS).
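
One possible SQL definition of the Courseinfo view, assuming the Courses, Teaches, Faculty, and Enrolled relations from the conceptual schema above, is sketched here; the exact join conditions depend on how the schema was actually declared.

CREATE VIEW Courseinfo (cid, fname, enrollment) AS
SELECT C.cid, F.fname, COUNT(E.sid)
FROM Courses C
JOIN Teaches T ON T.cid = C.cid
JOIN Faculty F ON F.fid = T.fid
LEFT JOIN Enrolled E ON E.cid = C.cid
GROUP BY C.cid, F.fname;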
Database vs. Data Warehouse
So how is a data warehouse different from your regular database? After all, both are databases, and both have some tables containing data. If you look deeper, you'd find that both have indexes, keys, views, and the regular jing-bang. So is that 'data warehouse' really different from the tables in your application? And if the two aren't really different, maybe you can just run your queries and reports directly from your application databases!
Well, to be fair, that may be just what you are doing right now, running some EOD (end-of-day) reports as complex
SQL queries and shipping them off to those who need them. And this scheme might just be serving you fine right
now. Nothing wrong with that if it works for you.
But before you start patting yourself on the back for having avoided a data warehouse altogether, do spend a moment
to understand the differences, and to appreciate the pros and cons of either approach.

The primary difference between your application database and a data warehouse is that while the former is designed (and optimized) to record, the latter has to be designed (and optimized) to respond to analysis questions that are critical for your business.
Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be recorded, and super-fast at that. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank records. If this started happening frequently, the bank wouldn't stay in business for too long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand before the ATM. This system is write-optimized, and you shouldn't crib if your analysis query (read operation) takes a lot of time on such a system.

A Data Warehouse (DW), on the other hand, is a database (yes, you are right, it's a database) that is designed for facilitating querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analysed far more efficiently than your regular OLTP application databases. In this sense an OLAP system is designed to be read-optimized.
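
The contrast shows up in the kind of SQL each system spends its time on. The statements below are only a sketch with invented table and column names (atm_transaction for the OLTP side, withdrawal_fact for the warehouse side).

-- OLTP: record one ATM withdrawal as quickly as possible
INSERT INTO atm_transaction (account_id, amount, txn_time)
VALUES (1001, 200.00, CURRENT_TIMESTAMP);

-- OLAP: an analysis question over read-only historical data
SELECT branch_id,
       EXTRACT(YEAR FROM txn_time)  AS yr,
       EXTRACT(MONTH FROM txn_time) AS mon,
       SUM(amount)                  AS total_withdrawn
FROM withdrawal_fact
GROUP BY branch_id, EXTRACT(YEAR FROM txn_time), EXTRACT(MONTH FROM txn_time);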

The Relational Model


In this section we provide a brief introduction to the relational model. The central data description construct in this
model is a relation, which can be thought of as a set of records.
A description of data in terms of a data model is called a schema. In the relational model, the schema for a relation
specifies its name, the name of each field (or attribute or column), and the type of each field. As an example,
student information in a university database may be stored in a relation with the following schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
The preceding schema says that each record in the Students relation has five fields, with field names and types as indicated. An example instance of the Students relation appears in Figure 1.1.

sid    name     login           age  gpa
53666  Jones    jones@cs        18   3.4
53688  Smith    smith@ee        18   3.2
53650  Smith    smith@math      19   3.8
53831  Madayan  madayan@music   11   1.8
53832  Guldu    guldu@music     12   2.0

Each row in the Students relation is a record that describes a student. The description is not complete—for example,
the student’s height is not included—but is presumably adequate for the intended applications in the university
database. Every row follows the schema of the Students relation. The schema can therefore be regarded as a
template for describing a student.

We can make the description of a collection of students more precise by specifying integrity constraints, which
are conditions that the records in a relation must satisfy. For example, we could specify that every student has a unique
sid value. Observe that we cannot capture this information by simply adding another field to the Students schema.
Thus, the ability to specify uniqueness of the values in a field increases the accuracy with which we can describe our
data. The expressiveness of the constructs available for specifying integrity constraints is an important aspect of a
data model.
VIEWS
A view is a table whose rows are not explicitly stored in the database but are computed as needed from a view definition. Consider the Students and Enrolled relations. Suppose that we are often interested in finding the names and student identifiers of students who got a grade of B in some course, together with the cid for the course. We can define a view for this purpose. Using SQL-92 notation:
CREATE VIEW B-Students (name, sid, course)
AS SELECT S.sname, S.sid, E.cid
FROM Students S, Enrolled E
WHERE S.sid = E.sid AND E.grade = 'B'

The view B-Students has three fields called name, sid, and course with the same domains as the
fields sname and sid in Students and cid in Enrolled. (If the optional
arguments name, sid, and course are omitted from the CREATE VIEW statement, the
column names sname, sid, and cid are inherited.)

This view can be used just like a base table, or explicitly stored table, in defining new queries or views. Given the instances of Enrolled and Students shown in Figure 3.4, B-Students contains the tuples shown in Figure 3.18. Conceptually, whenever B-Students is used in a query, the view definition is first evaluated to obtain the corresponding instance of B-Students, and then the rest of the query is evaluated treating B-Students like any other relation referred to in the query.

In most of this paper, our focus is on architectural fundamentals supporting core database functionality. We do not attempt to provide a comprehensive review of database algorithmics, which have been extensively documented in the literature. We also provide only minimal discussion of many extensions present in modern DBMSs, most of which provide features beyond core data management but do not significantly alter the system architecture. However, within the various sections of this paper we note topics of interest that are beyond the scope of the paper, and where possible we provide pointers to additional reading.
We begin our discussion with an investigation of the overall architecture of database systems. The first topic in any server system architecture is its overall process structure, and we explore a variety of viable alternatives on this front, first for uniprocessor machines and then for the variety of parallel architectures available today. This discussion of core server system architecture is applicable to a variety of systems, but was to a large degree pioneered in DBMS design. Following this, we begin on the more domain-specific components of a DBMS. We start with a single query's view of the system, focusing on the relational query processor. Following that, we move into the storage architecture and transactional storage management design. Finally, we present some of the shared components and utilities that exist in most DBMSs.
HSAM (Hierarchical Sequential Access Method)
HSAM stands for Hierarchical Sequential Access Method. In this technique the records of the database are accessed sequentially, segment by segment, one after another. HSAM provides sequential access to root segments and dependent segments, so we can read data in an HSAM database, but we cannot update any of the data. This technique nevertheless provides faster access than SHSAM (Simple Hierarchical Sequential Access Method), the first technique introduced by IBM as part of IMS (Information Management System).

HISAM (Hierarchical Indexed Sequential Access Method)
HISAM stands for Hierarchical Indexed Sequential Access Method, also introduced by IBM as part of IMS. The major drawback of HSAM is that it allows only data access, not updating; HISAM provides not only data access but also data processing. It offers an indexing method for reaching a data record directly, and the use of an index introduces a trade-off between access time and memory space. Different types of indices can be created depending on the records in the file: if the file contains a unique key, a primary index can be created; otherwise a clustering or secondary index may be used.

HDAM (Hierarchical Direct Access Method)
HDAM stands for Hierarchical Direct Access Method. In this technique a mathematical function, usually referred to as a hash function, is used to compute the address of each record, while the segments of the database are stored in hierarchical sequence. Any data record can be accessed or updated directly on the basis of the address computed by the hash function. Sometimes the addresses computed for two different records are the same; this phenomenon is called a collision. In that case we either place the colliding record in the next free space, or store it in an overflow area and keep a pointer that records the address of the overflowed record. If the file becomes very large and the number of collisions grows, a bucket scheme is used rather than the overflow-area approach. A good hash function is therefore essential for reducing access time and keeping memory use reasonable.

HIDAM (Hierarchical Indexed Direct Access Method)
HIDAM stands for Hierarchical Indexed Direct Access Method. A hierarchical database of this kind maintains the hierarchical sequence of its segments by means of pointers. The database is stored on a direct-access device, and in most cases each segment carries one or more direct-access pointers, so database records and segments can be stored anywhere in the available space and freed space can be reused. A hierarchical database can access its root segments in more than one way:

• By using Randomizing Module


• By using a primary index

Difference between HDAM and HIDAM

A hierarchical database that uses a randomizing module is called an HDAM database, whereas one that uses a primary index is called an HIDAM database. The storage organization in the two is basically the same; the primary difference is how their root segments are accessed. In an HDAM database the randomizing module examines the root key to determine the address of a pointer to the root segment, whereas in an HIDAM database each root segment's storage location is found through the primary index, which is itself a database that IMS loads and maintains.

Query processor
A relational database consists of many parts, but at its heart are two major components: the storage engine and the query processor. The storage engine writes data to and reads data from the disk. It manages records, controls concurrency, and maintains log files. The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan. The user or program interacts with the query processor, and the query processor in turn interacts with the storage engine. The query processor isolates the user from the details of execution: the user specifies the result, and the query processor determines how this result is obtained.

Basic Steps in Query Processing


1) The scanning, parsing, and validating module produces an internal representation of the query.
2) The query optimizer module devises an execution plan, which is the execution strategy to retrieve the result of the query from the database files. A query typically has many possible execution strategies differing in performance, and the process of choosing a reasonably efficient one is known as query optimization. Query optimization is beyond the scope of this course.
3) The code generator generates the code to execute the plan.
4) The runtime database processor runs the generated code to produce the query result.
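
Most products let you inspect the plan the optimizer chose. As a sketch, in Oracle the plan for a query can be displayed as follows (other systems use EXPLAIN or SET SHOWPLAN variants); the Students table and predicate are just an example.

EXPLAIN PLAN FOR
SELECT * FROM Students WHERE gpa > 3.5;

-- display the plan that was just stored in the plan table
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);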

External Sorting
• External sorting refers to sorting algorithms that are suitable for large files of records on disk that do not fit entirely in main memory, such as most database files.
• The sort process is at the heart of many relational operations:
• ORDER BY.
• Sort-merge algorithms for JOIN and other operations (UNION, INTERSECTION).
• Duplicate elimination algorithms for the PROJECT operation (DISTINCT).
• A typical external sorting algorithm uses a sort-merge strategy:
• Sort phase: Create small sorted sub-files (sorted sub-files are called runs).
• Merge phase: Merge the sorted runs. An N-way merge uses N memory buffers to buffer input runs and one block to buffer output. Select the first record (in the sort order) among the input buffers, write it to the output buffer, and delete it from its input buffer. If the output buffer is full, write it to disk. If an input buffer is empty, read the next block from the corresponding run.
Algorithms for implementing the SELECT operation
• These algorithms depend on the file having specific access paths and may apply only to certain types of selection conditions.
• We will use the following examples of SELECT operations (SQL equivalents of OP1 and OP4 are sketched after this list):
– (OP1): σ SSN='123456789' (EMPLOYEE)
– (OP2): σ DNUMBER>5 (DEPARTMENT)
– (OP3): σ DNO=5 (EMPLOYEE)
– (OP4): σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
– (OP5): σ ESSN='123456789' AND PNO=10 (WORKS_ON)
• Many search methods can be used for simple selection: S1 through S6.
• S1: Linear search (brute force) – a full scan in Oracle's terminology.
– Retrieves every record in the file and tests whether its attribute values satisfy the selection condition: an expensive approach.
– Cost: b/2 on average if the selection is on a key attribute, and b otherwise (b = number of file blocks).
• S2: Binary search
– Applicable if the selection condition involves an equality comparison on a key attribute on which the file is ordered.
– Example: σ SSN='123456789' (EMPLOYEE), where SSN is the ordering attribute.
– Cost: log2(b) if the selection is on the key.
• S3: Using a primary index (or hash key)
– An equality comparison on a key attribute with a primary index (or hash key).
– This condition retrieves a single record (at most).
– Cost: with a primary index, b_ind/2 + 1 (b_ind = number of index blocks); with a hash key, 1 bucket access if there is no collision.
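
For reference, OP1 and OP4 above correspond to the following SQL; creating an index on SSN is what makes strategy S3, rather than the linear scan S1, available to the optimizer. Table and column names follow the examples above.

-- OP1: equality selection on a key attribute
SELECT * FROM EMPLOYEE WHERE SSN = '123456789';

-- OP4: a conjunctive selection
SELECT * FROM EMPLOYEE
WHERE DNO = 5 AND SALARY > 30000 AND SEX = 'F';

-- an index that enables S3 for OP1
CREATE UNIQUE INDEX idx_employee_ssn ON EMPLOYEE(SSN);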

Backup and Recovery


In general, backup and recovery refers to the various strategies and procedures involved in protecting your database against data loss and reconstructing the database after any kind of data loss. A backup is a copy of data from your database that can be used to reconstruct that data. Backups can be divided into physical backups and logical backups. Physical backups are backups of the physical files used in storing and recovering your database, such as datafiles, control files, and archived redo logs. Ultimately, every physical backup is a copy of files storing database information to some other location, whether on disk or some offline storage such as tape. Logical backups contain logical data (for example, tables or stored procedures) exported from a database with an Oracle export utility and stored in a binary file, for later re-importing into a database using the corresponding Oracle import utility. Physical backups are the foundation of any sound backup and recovery strategy. Logical backups are a useful supplement to physical backups in many circumstances but are not sufficient protection against data loss without physical backups. Unless otherwise specified, the term "backup" as used in the backup and recovery documentation refers to physical backups, and to back up part or all of your database is to take some kind of physical backup. The focus in the backup and recovery documentation set will be almost exclusively on physical backups.
The files and other structures that make up an Oracle database store data and
safeguard it against possible failures. This discussion introduces each of the physical
structures that make up an Oracle database and their role in the reconstruction of a
database from backup. This section contains these topics:
■ Datafiles and Data Blocks
■ Redo Logs

■ Undo Segments

■ Control Files

Datafiles and Data Blocks


An Oracle database consists of one or more logical storage units called tablespaces. Each tablespace in an Oracle database consists of one or more files called datafiles, physical files under the host operating system which collectively contain the data stored in the tablespace. The simplest Oracle database would have one tablespace, stored in one datafile.
The database manages the storage space in the datafiles of a database in units called data blocks. Data blocks are the smallest units of storage that the database can use or allocate.
Modified or new data is not written to datafiles immediately. Updates are buffered in memory and written to
datafiles at intervals. If a database has not gone through a
normal shutdown (that is, if it is open, or exited abnormally, as in an instance failure or
a SHUTDOWN ABORT) then there are typically changes in memory that have not been
written to the datafiles. Datafiles that were restored from backup, or were not closed
during a consistent shutdown, are typically not completely up to date.
Copies of the datafiles of a database are a critical part of any backup.
Redo Logs
Redo logs record all changes made to a database's data files. Each time data is
changed in the database, that change is recorded in the online redo log first, before it
is applied to the datafiles.
An Oracle database requires at least two online redo log groups, and in each group
there is at least one online redo log member, an individual redo log file where the
changes are recorded.
At intervals, the database rotates through the online redo log groups, storing changes in the current online redo log.
Because the redo log contains a record of all changes to the datafiles, if a backup copy of a datafile from some point in time and a complete set of redo logs from that time forward are available, the database can reapply the changes recorded in the redo logs, in order to reconstruct the datafile contents at any point between the backup time and the end of the last redo log. However, this is only possible if the redo log has been
preserved.
Therefore, preserving the redo logs is a major part of most backup strategies. The first
level of preserving the redo log is through a process called archiving. The database
can copy online redo log groups that are not currently in use to one or more archive
locations on disk, where they are collectively called the archived redo log. Individual
files are referred to as archived redo log files. After a redo log file is archived, it can
be backed up to other locations on disk or on tape, for long term storage and use in
future recovery operations.
Without archived redo logs, your database backup and recovery options are severely
limited. Your database must be taken offline before it can be backed up, and if you
must restore your database from backup, the database contents are only available as of
the time of the backup. Reconstructing the state of the database at a point in time
between backups is impossible without the archived log.
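
As a quick check, and assuming an Oracle database, the current logging configuration can be inspected from SQL; ARCHIVELOG mode must be enabled for the archiving described above to take place.

-- is the database archiving its redo? (ARCHIVELOG or NOARCHIVELOG)
SELECT log_mode FROM v$database;

-- status of the online redo log groups and whether each has been archived
SELECT group#, status, archived FROM v$log;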
Control Files
The control file contains the record of the physical structures of the database and their
status. Several types of information stored in the control file are related to backup and
recovery:
■ Database information (RESETLOGS SCN and time stamp)

■ Tablespace and datafile records (filenames, datafile checkpoints, read/write status,

offline ranges)
■ Information about redo threads (current online redo log)

■ Log records (log sequence numbers, SCN range in each log)

■ A record of past RMAN backups

■ Information about corrupt datafile blocks

The recovery process for datafiles is in part guided by status information in the control
file, such as the database checkpoints, current online redo log file, and the datafile
header checkpoints for the datafiles. Loss of the control file makes recovery from a
data loss much more difficult.

Undo Segments
In general, when data in a datafile is updated, "before images" of that data are written
into undo segments. If a transaction is rolled back, this undo information can be used
to restore the original datafile contents.
In the context of recovery, the undo information is used to undo the effects of
uncommitted transactions, once all the datafile changes from the redo logs have been
applied to the datafiles. The database is actually opened before the undo is applied.
You should not have to concern yourself with undo segments or manage them directly
as part of your backup and recovery process.
Oracle's Recovery Manager (RMAN), mentioned above, also supports features such as:
■ Incremental backups, which provide more compact backups (storing only changed blocks) and faster datafile media recovery (reducing the need to apply redo during datafile media recovery)
■ Block media recovery, in which a datafile with only a small number of corrupt data blocks can be repaired without being taken offline or restored from backup
■ Unused block compression, where RMAN can in some cases skip unused datafile blocks during backups
■ Binary compression, which uses a compression mechanism integrated into the Oracle database server to reduce the size of backups
■ Encrypted backups, which use encryption capabilities integrated into the Oracle database to store backups in an encrypted format

Relationships:
Database relationships create a hierarchy for the tables. A properly normalized database has a
well-organized hierarchy. For each relationship between two tables, one table is the parent and one table is
the child. For example, a Customer table may be the parent of an Order table, and in turn, an Order table
may be the parent of an OrderDetail table. In this example, the Order table is both a child (of Customer) and
a parent (to OrderDetail).
Database relationships are embodied through primary keys in parent tables and foreign keys in child tables.
There are many ways this can be implemented and the various options are another source of heated
discussion. There is no universally accepted approach, but there is an approach that appears to be the most
common. This document describes that approach and provides sound justification for it with technical
reasoning. The justification is based on practical considerations rather than blind adherence to theory. One
of the most important keys (pun intended) to this database architecture is the handling of primary keys and
foreign keys.
Primary Keys:
A primary key is a field in a table whose value uniquely identifies each record. There are many
qualities required of a field in this position of honor. It must not contain any duplicate values
and the values must not change. It should also be compact for the best performance. It’s often
very difficult to find such characteristics in naturally occurring data. Therefore, this database
architecture uses surrogate primary keys.
A surrogate primary key field contains artificially derived values that have no intrinsic meaning.
The sole purpose for the values is to uniquely identify each record. The values are derived in a
way that prevents duplication, they are never changed, and they are compact (usually integers).
The use of surrogate primary keys avoids the inevitable hassles of trying to use natural data for
primary keys. It’s also a very important factor in the advantages of this database architecture.
SQL Server provides a convenient field property that supports the use of surrogate primary keys.
One field in a table can be assigned the IDENTITY property. When a record is inserted into the
table that field is automatically given a value one greater than the previously inserted record. A
field with the IDENTITY property does not (normally) allow values to be explicitly provided
with an insert or changed with an update.
In this database architecture, the primary key field is the first field in every table. The field is a
4-byte integer (SQL Server data type int). The field uses the IDENTITY property (starting at 1
and incrementing by 1). The field is named with a prefix (lng), the table name (without a prefix),
and a suffix (ID). The primary key field for a Customer table would be named lngCustomerID.
The rules for the foundation of this database architecture are quite simple. Here they are in a
loose order of importance (with 1 being the most important)…
1) Every table has a primary key.
2) The primary key is a single field.
3) The primary key is the first field.
4) The primary key field is named to correspond with the table name.
5) The primary key migrates to child tables as a foreign key with the same characteristics.
6) The primary key field is numeric.
7) The primary key field is a 4-byte integer data type.
8) The primary key field uses the IDENTITY property (starting at 1 and incrementing by 1).
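
A minimal SQL Server sketch of a table that follows these rules; the non-key columns are invented for illustration, and rule 5 (key migration) is illustrated under Foreign Keys below.

CREATE TABLE Customer (
    lngCustomerID int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- rules 1-4 and 6-8
    strName       varchar(100) NOT NULL,
    strPhone      varchar(30)  NULL
);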
Foreign Keys:
A foreign key is a field in a table that refers to parent records in another table. The references
are represented by the primary key values of the corresponding parent records. A foreign key
field is the same data type as the primary key field of the referenced table. Usually, a foreign
key field is named the same as the primary key field of the parent table. This very beneficial
convention is called key migration in data modeling terminology.
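
Continuing the sketch from the Primary Keys section, a child table of Customer carries the migrated key as a foreign key. Order is a reserved word in SQL Server, so the name is bracketed here; the other columns are illustrative, and the primary key is declared NONCLUSTERED so that the clustered index can be reserved for query-heavy columns, as discussed under Indexes below.

CREATE TABLE [Order] (
    lngOrderID    int IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    lngCustomerID int NOT NULL
        REFERENCES Customer(lngCustomerID), -- migrated key: same name and type as the parent's primary key
    dtmOrderDate  datetime NOT NULL
);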
Child table foreign key references to parent table primary keys embody database relationships. Some DBAs
vehemently argue that primary keys should be composed of natural data. They do
not like surrogate primary keys. This opinion seems to be based on a desire to worship theory
rather than a rational examination of the issues involved. Natural data is almost never compact
and almost never stable (not subject to change). A single field of natural data is often not even
unique. The uniqueness requirement is sometimes met by using multiple fields as a composite
primary key. However, that worsens the compactness.
The use of natural data for a primary key also has major implications for child tables. In order to
ensure uniqueness in their primary keys, all child tables are forced to include the primary key of
their parent table along with an additional field (or more) of their own. This approach becomes
extremely unwieldy very quickly in a relational database of any depth.
Indexes:
Indexes in a database are much like indexes in a book. They are used to quickly locate specific
content. The proper use of indexes is probably the single most important factor in achieving the
maximum level of performance from queries. Alas, there is no magic formula for doing proper
indexing. It’s heavily dependent on the data itself and how the data is being used.
This database architecture suggests a starting point for indexing where an index is created for
every foreign key in every table. Such indexes will improve the performance of queries that
select based on foreign keys. It also improves the performance of queries that include joins
between tables, and a properly normalized database means lots of joins.
Once indexes have been created for all foreign keys, further indexing requires an understanding
of the data itself and how the data is being used. Indexes are best applied to fields that are small,
highly selective (many unique values), and often used in queries. In SQL Server, indexes can be clustered or nonclustered. Only one index per table can be clustered, and that should be reserved
for data that is frequently used in queries for aggregation (GROUP BY), sorting (ORDER BY),
or filtering (WHERE). Indexes can also be applied to combinations of fields, in which case the
most frequently referenced field(s) should be listed first. A query that references only the fields
included in an index is especially efficient.
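
Following that starting point, the foreign key in the [Order] sketch from earlier would be indexed as shown below. The clustered and composite examples illustrate the syntax only, since the right choices depend on the actual data and workload (and the clustered index assumes the table's primary key was declared NONCLUSTERED, because SQL Server allows only one clustered index per table).

-- index every foreign key
CREATE INDEX IX_Order_lngCustomerID ON [Order](lngCustomerID);

-- one clustered index per table, reserved for columns frequently used in GROUP BY, ORDER BY, or WHERE
CREATE CLUSTERED INDEX IX_Order_dtmOrderDate ON [Order](dtmOrderDate);

-- a composite index: the most frequently referenced column listed first
CREATE INDEX IX_Order_Cust_Date ON [Order](lngCustomerID, dtmOrderDate);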
Be careful to avoid creating too many indexes. Indexes must be maintained by the system and
that involves processing overhead during inserts and updates.
When experimenting with different indexes it’s very important to carefully test the performance of existing
queries. Do not be misled by the effects of data caching at the server.

Concurrency control:
Any multi-user database application has to have some method of dealing with concurrent access
to data. Concurrent access means that more than one user is accessing the same data at the same
time. A problem occurs when user X reads a record for editing, user Y reads the same record for
editing, user Y saves changes, then user X saves changes. The changes made by user Y are lost
unless something prevents user X from blindly overwriting the record. This problem may not be
handled by the database alone. The solution usually involves the application as well.
One way of dealing with concurrent access is to use a counter field that is incremented with each
modification to the record. In the example above, user X would be alerted by the application that
the record had been modified between the time of reading and the time of attempting to save
changes. This situation is noticed by the database and/or by the application because the counter
had changed. The application should allow user X to start over with the current record contents,
or overwrite the changes made by user Y. No changes would be unknowingly lost.
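
A sketch of the counter approach in SQL Server syntax; the intEditCount column and the variable values are invented for illustration. The application reads the counter along with the record and then makes its UPDATE conditional on the counter not having changed.

DECLARE @CustomerID int = 42,
        @NewName varchar(100) = 'New name',
        @EditCountWhenRead int = 7; -- value read together with the record

UPDATE Customer
SET    strName      = @NewName,
       intEditCount = intEditCount + 1
WHERE  lngCustomerID = @CustomerID
  AND  intEditCount  = @EditCountWhenRead;

-- zero rows affected means another user saved first; reload or offer to overwrite
IF @@ROWCOUNT = 0
    RAISERROR('Record was changed by another user.', 16, 1);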
Pessimistic Locking: This concurrency control strategy involves keeping an entity in a database locked the
entire time it exists in the database's memory. This limits or prevents users from altering the data entity that
is locked. There are two types of locks that fall under the category of pessimistic locking: write lock and
read lock.

With write lock, everyone but the holder of the lock is prevented from reading, updating, or deleting the
entity. With read lock, other users can read the entity, but no one except for the lock holder can update or
delete it.
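
A sketch of a pessimistic write lock using SQL Server locking hints (Oracle and several other products use SELECT ... FOR UPDATE instead); the lock is held until the transaction ends, so no other session can change the row in the meantime.

BEGIN TRANSACTION;

SELECT strName
FROM   Customer WITH (UPDLOCK, HOLDLOCK) -- take and hold an update lock on the selected row
WHERE  lngCustomerID = 42;

-- ... the user edits the record here ...

UPDATE Customer
SET    strName = 'New name'
WHERE  lngCustomerID = 42;

COMMIT TRANSACTION;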

Optimistic Locking: This strategy can be used when instances of simultaneous transactions, or collisions, are
expected to be infrequent. In contrast with pessimistic locking, optimistic locking doesn't try to prevent the
collisions from occurring. Instead, it aims to detect these collisions and resolve them on the chance
occasions when they occur

Pessimistic locking provides a guarantee that database changes are made safely. However, it becomes less viable as the number of simultaneous users or the number of entities involved in a transaction increases, because the potential for having to wait for a lock to release increases as well.

Optimistic locking can alleviate the problem of waiting for locks to release, but then users have the potential
to experience collisions when attempting to update the database.
Entity-relationship (ER) Data model allows us to describe the data involved in a real-world enterprise in
terms of objects and their relationships and is widely used to develop an initial database design. In this chapter, we
introduce the ER model and discuss how its features allow us to model a wide range of data faithfully.

The ER model is important primarily for its role in database design. It provides useful concepts that allow us to move
from an informal description of what users want from their database to a more detailed, and precise, description
that can be implemented in a DBMS. We begin with an overview of database design in Section 2.1 in order to
motivate our discussion of the ER model. Within the larger context of the overall design process, the ER model is
used in a phase called conceptual database design. We then introduce the ER model in Sections 2.2, 2.3, and 2.4. In
Section 2.5, we discuss database design issues involving the ER model. We conclude with a brief discussion of
conceptual database design for large enterprises.

We note that many variations of ER diagrams are in use, and no widely accepted standards prevail. The
presentation in this chapter is representative of the family of ER models and includes a selection of the most popular
features.

(1) Requirements Analysis: The very first step in designing a database application is to understand what data is to be stored in the database, what applications must be built on top of it, and what operations are most frequent and subject to performance requirements. In other words, we must find out what the users want from the database. This is usually an informal process that involves discussions with user groups, a study of the current operating environment and how it is expected to change, analysis of any available documentation on existing applications that are expected to be replaced or complemented by the database, and so on. Several methodologies have been proposed for organizing and presenting the information gathered in this step, and some automated tools have been developed to support this process.

(2) Conceptual Database Design: The information gathered in the requirements analysis step is used to develop
a high-level description of the data to be stored in the database, along with the constraints that are known to hold
over this data. This step is often carried out using the ER model, or a similar high-level data model, and is
discussed in the rest of this chapter.

(3) Logical Database Design: We must choose a DBMS to implement our database design, and convert the
conceptual database design into a database schema in the data model of the chosen DBMS. We will only consider
relational DBMSs, and therefore, the task in the logical design step is to convert an ER schema into a relational
database schema. We discuss this step in detail in Chapter 3; the result is a conceptual schema, sometimes called the
logical schema, in the relational data model.

(4) Schema Refinement: The fourth step in database design is to analyze the collection of relations in our
relational database schema to identify potential problems, and to refine it. In contrast to the requirements analysis and
conceptual design steps, which are essentially subjective, schema refinement can be guided by some elegant and
powerful theory. We discuss the theory of normalizing relations—restructuring them to ensure some desirable
properties—in Chapter 15.

(5) Physical Database Design: In this step we must consider typical expected workloads that our database must support and further refine the database design to ensure that it meets desired performance criteria. This step may simply involve building indexes on some tables and clustering some tables, or it may involve a substantial redesign of parts of the database schema obtained from the earlier design steps. We discuss physical design and database tuning in Chapter 16.

(6) Security Design: In this step, we identify different user groups and different roles played by various users
(e.g., the development team for a product, the customer support representatives, the product manager). For each role
and user group, we must identify the parts of the database that they must be able to access and the parts of the
database that they should not be allowed to access, and take steps to ensure that they can access only the necessary
parts.
Hierarchical Model

The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child data segments. This structure implies that a record can have repeating information, generally in the child data segments. Data is stored in a series of records, each of which has a set of field values attached to it. The model collects all the instances of a specific record together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent-child relationships. These are 1:N mappings between record types, implemented by using trees which, like the set theory used in the relational model, are "borrowed" from mathematics. For example,
department, salary. The organization might also store information about an employee's children,
such as name and date of birth. The employee and children data forms a hierarchy, where the
employee data represents the parent segment and the children data represents the child segment.
If an employee has three children, then there would be three child segments associated with one
employee segment. In a hierarchical database the parent-child relationship is one to many. This
restricts a child segment to having only one parent segment. Hierarchical DBMSs were popular
from the late 1960s, with the introduction of IBM's Information Management System (IMS)
DBMS, through the 1970s.
Network Model
The popularity of the network data model coincided with the popularity of the hierarchical data
model. Some data were more naturally modeled with more than one parent per child. So, the
network model permitted the modeling of many-to-many relationships in data. In 1971, the
Conference on Data Systems Languages (CODASYL) formally defined the network model. The
basic data modeling construct in the network model is the set construct. A set consists of an
owner record type, a set name, and a member record type. A member record type can have that
role in more than one set, hence the multiparent concept is supported. An owner record type can
also be a member or owner in another set. The data model is a simple network, and link and intersection record types (called junction records by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pairwise sets; in each set some (one) record type is the owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1:M
relationship, although 1:1 is permitted. The CODASYL network model is based on
mathematical set theory.
Relational Model
(RDBMS - relational database management system) A database based on the relational model
developed by E.F. Codd. A relational database allows the definition of data structures, storage
and retrieval operations and integrity constraints. In such a database the data and relations
between them are organised in tables. A table is a collection of records and each record in a
table contains the same fields.

Properties of Relational Tables:


• Values Are Atomic
• Each Row is Unique
• Column Values Are of the Same Kind
• The Sequence of Columns is Insignificant
• The Sequence of Rows is Insignificant
• Each Column Has a Unique Name

Certain fields may be designated as keys, which means that searches for specific values of that field will use
indexing to speed them up. Where fields in two different tables take values from the same set, a join
operation can be performed to select related records in the two tables by matching values in those fields.
Often, but not always, the fields will have the same name in both tables. For example, an "orders" table
might contain (customer-ID, product-code) pairs and a "products" table might contain (product-code, price)
pairs so to calculate a given customer's bill you would sum the prices of all products ordered by that
customer by joining on the product-code fields of the two tables. This can be extended to joining multiple tables on multiple fields. Because these relationships are only specified at retrieval time, relational databases are classed as dynamic database management systems. The RELATIONAL database model is based on the Relational Algebra.
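
The customer bill example can be written as the following SQL join; the table and column names are taken from the description above and are illustrative.

SELECT o.customer_id,
       SUM(p.price) AS bill_total
FROM orders o
JOIN products p ON p.product_code = o.product_code
GROUP BY o.customer_id;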

Object/Relational Model
Object/relational database management systems (ORDBMSs) add new object storage capabilities to the
relational systems at the core of modern information systems. These new facilities integrate management of
traditional fielded data, complex objects such as time-series and geospatial data, and diverse binary media
such as audio, video, images, and applets. By encapsulating methods with data structures, an ORDBMS
server can execute complex analytical and data manipulation operations to search and transform multimedia
and other complex objects.

As an evolutionary technology, the object/relational (OR) approach has inherited the robust transaction- and
performance-management features of its relational ancestor and the flexibility of its object-oriented cousin.
Database designers can work with familiar tabular structures and data definition languages (DDLs) while
assimilating new object-management possibilities. Query and procedural languages and call interfaces in
ORDBMSs are familiar: SQL3, vendor procedural languages, and ODBC, JDBC, and proprietary call
interfaces are all extensions of RDBMS languages and interfaces. And the leading vendors are, of course,
quite well known: IBM, Informix, and Oracle.

Object-Oriented Model
Object DBMSs add database functionality to object programming languages. They bring much more than
persistent storage of programming language objects. Object DBMSs extend the semantics of the C++,
Smalltalk and Java object programming languages to provide full-featured database programming
capability, while retaining native language compatibility. A major benefit of this approach is the unification
of the application and database development into a seamless data model and language environment. As a
result, applications require less code, use more natural data modeling, and code bases are easier to maintain.
Object developers can write complete database applications with a modest amount of additional effort.

According to Rao (1994), "The object-oriented database (OODB) paradigm is the combination of object-
oriented programming language (OOPL) systems and persistent systems. The power of the OODB comes
from the seamless treatment of both persistent data, as found in databases, and transient data, as found in
executing programs."

In contrast to a relational DBMS where a complex data structure must be flattened out to fit into tables or
joined together from those tables to form the in-memory structure, object DBMSs have no performance
overhead to store or retrieve a web or hierarchy of interrelated objects. This one-to-one mapping of object
programming language objects to database objects has two benefits over other storage approaches: it
provides higher performance management of objects, and it enables better management of the complex
interrelationships between objects. This makes object DBMSs better suited to support applications such as
financial portfolio risk analysis systems, telecommunications service applications, world wide web
document structures, design and manufacturing systems, and hospital patient record systems, which have
complex relationships between data.
Weak Entities
Thus far, we have assumed that the attributes associated with an entity set include a key. This assumption
does not always hold. For example, suppose that employees can purchase insurance policies to cover their
dependents. We wish to record information about policies, including who is covered by each policy, but this
information is really our only interest in the dependents of an employee. If an employee quits, any policy
owned by the employee is terminated and we want to delete all the relevant policy and dependent
information from the database. We might choose to identify a dependent by name alone in this situation,
since it is reasonable to expect that the dependents of a given employee have different names. Thus
the attributes of the Dependents entity set might be pname and age.
The attribute pname does not identify a dependent uniquely. Recall that the key for Employees is ssn; thus
we might have two employees called Smethurst, and each might have a son called Joe. Dependents is an
example of a weak entity set. A weak entity can be identified uniquely only by considering some of its
attributes in conjunction with the primary key of another entity, which is called the identifying owner.
The following restrictions must hold: the owner entity set and the weak entity set must participate in a one-
to-many relationship set (one owner entity is associated with one or more weak entities, but each weak entity
has a single owner). This relationship set is called the identifying relationship set of the weak entity set.
The weak entity set must have total participation in the identifying relationship set. A Dependents entity can
be identified uniquely only if we take the key of the owning Employees entity and the pname of the
Dependents entity. The set of attributes of a weak entity set that uniquely identify a weak entity for a given
owner entity is called a partial key of the weak entity set. In our example pname is a partial
key for Dependents.
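
A minimal SQL sketch of this weak entity set, assuming Oracle-style types like those used elsewhere in this document (column sizes are invented for illustration):

CREATE TABLE Employees (
    ssn     CHAR(11)       PRIMARY KEY,
    ename   VARCHAR2(40)
);

CREATE TABLE Dependents (
    pname   VARCHAR2(40),
    age     NUMBER(3),
    ssn     CHAR(11),
    PRIMARY KEY (pname, ssn),                -- partial key plus the owner's key
    FOREIGN KEY (ssn) REFERENCES Employees
        ON DELETE CASCADE                    -- no owner, no dependent (total participation)
);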
Data Dictionary Definition- Database management systems provide a facility known as the data definition
language (DDL), which can be used to define the conceptual scheme and also to give some details about how
to implement this scheme in the physical devices used to store the data. The conceptual scheme definition includes:

• All entity sets and their associated attributes, as well as the relationships among the entity sets.
• Any constraints that have to be maintained, including the constraints on the values that can be
assigned to a given attribute and the constraints on the values assigned to different attributes in the
same or different records.

These definitions, which can be described as meta data about the data in the database, are expressed in the
DDL of the DBMS and maintained in a compiled form (usually as a set of tables). The compiled form of the
definition is known as a data dictionary, directory, or system catalog. The data dictionary contains
information on the data stored in the database and is consulted by the DBMS before any data manipulation
operation.
Thus information pertaining to the structure and usage of the data contained in the database, the metadata, is
maintained in the data dictionary. The data dictionary is, in effect, a database itself that documents the data. Each
database user can consult the data dictionary to learn what each piece of data and the various synonyms of
the data fields mean.
In an integrated system (i.e., in a system where the data dictionary is a part of the DBMS) the data dictionary
stores information concerning the external, conceptual, and internal levels of the database. It contains the
source of each data-field value, the frequency of its use, and an audit trail concerning updates, including the
"who" and "when" for each update.
Currently, data dictionary systems are available as add-ons to the DBMS. Standards have yet to evolve for
integrating the data dictionary facility with the DBMS so that the two databases, one for metadata and the other for
data, can be manipulated using a unified DDL/DML.

A data dictionary is a DBMS component that stores metadata. It contains the data definitions and their characteristics and
entity relationships. This may include the names and descriptions of the various tables and fields within the database,
as well as the data types and lengths of data items. Recording primary and foreign keys also adds to
the consistency of the database being built. Overall, having a well-designed data dictionary makes it easier to
build and maintain your database.
To summarize, a data dictionary is a centralized repository of information about data, such as meaning, relationships to
other data, origin, usage, and format. (wikipedia.org)

There are two other types of data dictionaries:

Active Data Dictionary- A data dictionary that is automatically updated by the DBMS every time the database is
accessed.

Passive Data Dictionary- Similar to an active data dictionary; however, it is not automatically updated and usually
requires a batch process to be run.

There is a third style of data dictionary known as a middleware data dictionary. Middleware is computer software that
connects software components or applications. The software consists of a set of services that allows multiple
processes running on one or more machines to interact. Traditional data dictionaries provide structure and basic
functions to the database. Middleware data dictionaries are located within the DBMS itself and operate at a higher
level. Middleware data dictionaries can provide alternate entity relationship structures that can be tailored to fit
different users who interact with the same database. Middleware data dictionaries can also assist in query optimization
as well as in distributed databases.

Middleware also helps database designers by reducing the amount of time it takes to create forms, queries, reports,
menus and many other database components. It does this by automatically generating SQL code for common items
such as forms and views. Some middleware data dictionaries can also help with data security as well as database
optimization. It is a growing field with many new companies entering the market.

Example of a Data Dictionary and Oracle SQL

A simple data dictionary lists each table's columns together with their data types, lengths, and key designations.
Creating a data dictionary first makes creating tables in SQL easier. There are many variations of the data dictionary;
however, create one that includes everything you plan on coding later.

Creating the STUDENT table in SQL is quite simple using the data dictionary as a template:

CREATE TABLE STUDENT (
    STU_ID     NUMBER(9),
    STU_AGE    NUMBER(2),
    STU_GPA    NUMBER(3,2),
    STU_LNAME  CHAR(30),
    PRIMARY KEY (STU_ID),
    FOREIGN KEY (STU_GPA) REFERENCES GRADES
);

After the table has been created it can be modified or used in a query or report by connecting the SQL database
to Access or some other front-end database management tool. For more information on connecting a SQL database
to Access or another DBMS, see the wiki entry for linking Access to Oracle.

Normalization
Normalization is a mechanism through which we systematically arrange data in a database. It helps in
reducing redundancy of data and makes sure that the tables depend on one another properly. With
normalization we can reduce the unnecessary amount of space used in the database.

We use the normalization process to make our database more efficient and to allow us to access data more
quickly. If the database is kept in normal form, it is easy to add or delete data without
affecting the rest of the database. In normalization we simplify data by reducing redundancy.

Normalization is the process through which data is stored in different tables to improve the efficiency
of the database. If we properly normalize the data, each fact is stored in only one place, so there is one
table responsible for each group of related fields. Through normalization we can reduce
redundancies and make the database easier to maintain. One disadvantage of normalization is that it can reduce
query performance, because data is spread across many tables that must be joined back together.

STAGES: 1NF -> 2NF -> 3NF -> BCNF -> 4NF -> 5NF


First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database:

• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
• Queries become easier.
• Each cell must hold a single value, with no repeating groups.

Employee (unnormalized)
emp_no name dept_no dept_name skills
1 Kevin Jacobs 201 R&D C, Perl, Java
2 Barbara Jones 224 IT Linux, Mac
3 Jake Rivera 201 R&D DB2, Oracle, Java
Employee (1NF)
emp_no name dept_no dept_name skills
1 Kevin Jacobs 201 R&D C
1 Kevin Jacobs 201 R&D Perl
1 Kevin Jacobs 201 R&D Java
2 Barbara Jones 224 IT Linux
2 Barbara Jones 224 IT Mac
3 Jake Rivera 201 R&D DB2
3 Jake Rivera 201 R&D Oracle
3 Jake Rivera 201 R&D Java

Second Normal Form (2NF)

Second normal form (2NF) further addresses the concept of removing duplicative data:

• Meet all the requirements of the first normal form.
• Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
• Create relationships between these new tables and their predecessors through the use of foreign keys.
• Prevents update, insert, and delete anomalies.
• Every non-primary-key attribute must be fully functionally dependent on the primary key, i.e. remove partial dependencies.

Employee (1NF)
emp_no name dept_no dept_name skills
1 Kevin Jacobs 201 R&D C
1 Kevin Jacobs 201 R&D Perl
1 Kevin Jacobs 201 R&D Java
2 Barbara Jones 224 IT Linux
2 Barbara Jones 224 IT Mac
3 Jake Rivera 201 R&D DB2
3 Jake Rivera 201 R&D Oracle
3 Jake Rivera 201 R&D Java
Employee (2NF)
emp_no name dept_no dept_name
1 Kevin Jacobs 201 R&D
2 Barbara Jones 224 IT
3 Jake Rivera 201 R&D

Skills (2NF)
emp_no skills
1 C
1 Perl
1 Java
2 Linux
2 Mac
3 DB2
3 Oracle
3 Java

Third Normal Form (3NF)

Third normal form (3NF) goes one large step further:

• Meet all the requirements of the second normal form.
• Remove columns that are not dependent upon the primary key.
• Prevents update, insert, and delete anomalies.
• Every non-key attribute must depend directly on the primary key, i.e. remove transitive dependencies.

Employee (2NF)
emp_no name dept_no dept_name
1 Kevin Jacobs 201 R&D
2 Barbara Jones 224 IT
3 Jake Rivera 201 R&D

Employee (3NF)
emp_no name dept_no
1 Kevin Jacobs 201
2 Barbara Jones 224
3 Jake Rivera 201

Department (3NF)
dept_no dept_name
201 R&D
224 IT
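
As a rough sketch of this 3NF decomposition in SQL (column types and sizes are assumptions chosen for illustration, in the Oracle style of the earlier STUDENT example):

CREATE TABLE DEPARTMENT (
    dept_no    NUMBER(4)     PRIMARY KEY,
    dept_name  VARCHAR2(30)
);

CREATE TABLE EMPLOYEE (
    emp_no     NUMBER(6)     PRIMARY KEY,
    name       VARCHAR2(40),
    dept_no    NUMBER(4)     REFERENCES DEPARTMENT   -- dept_name no longer repeated here
);

CREATE TABLE SKILLS (
    emp_no     NUMBER(6)     REFERENCES EMPLOYEE,
    skill      VARCHAR2(30),
    PRIMARY KEY (emp_no, skill)                      -- one row per employee/skill pair
);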

Boyce-Codd Normal Form (BCNF or 3.5NF)

The Boyce-Codd Normal Form, also referred to as the "third and a half (3.5) normal form", adds
one more requirement:

• Meet all the requirements of the third normal form.
• Every determinant must be a candidate key.
• BCNF strengthens 3NF by requiring the determinants in functional dependencies to be superkeys (a column or set of columns that uniquely identifies a row).

Fourth Normal Form (4NF)

Finally, fourth normal form (4NF) has one additional requirement:

• Meet all the requirements of Boyce-Codd normal form.
• A relation is in 4NF if it has no non-trivial multi-valued dependencies other than those implied by its candidate keys.

Fifth Normal Form (5NF)

Eliminate join dependencies that are not implied by the candidate keys.

Functional Dependency is the starting point for the process of normalization. Functional dependency exists
when a relationship between two attributes allows you to uniquely determine the corresponding attribute’s value. If ‘X’
is known, and as a result you are able to uniquely identify ‘Y’, there is functional dependency. Combined with keys,
normal forms are defined for relations.
Employee (1NF)
emp_no name dept_no dept_name skills
1 Kevin Jacobs 201 R&D C
1 Kevin Jacobs 201 R&D Perl
1 Kevin Jacobs 201 R&D Java
2 Barbara Jones 224 IT Linux
2 Barbara Jones 224 IT Mac
3 Jake Rivera 201 R&D DB2
3 Jake Rivera 201 R&D Oracle
3 Jake Rivera 201 R&D Java
Name, dept_no, and dept_name are functionally dependent on emp_no. (emp_no -> name,
dept_no, dept_name)

Skills is not functionally dependent on emp_no since it is not unique to each emp_no.
A functional dependency (FD) is a constraint between two sets of attributes in a relation from a database. Given a
relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) if,
and only if, each X value is associated with precisely one Y value. Customarily we call X the determinant
set and Y the dependent attribute. Thus, given a tuple and the values of the attributes in X, one can determine the
corresponding value of the Y attribute. In simple words, if the X value is known, the Y value is certainly known. For the
purposes of simplicity, given that X and Y are sets of attributes in R, X → Y denotes that X functionally determines
each of the members of Y—in this case Y is known as the dependent set. Thus, a candidate key is a minimal set of
attributes that functionally determine all of the attributes in a relation.

(Note: the "function" being discussed in "functional dependency" is the function of identification.)
A functional dependency FD: X → Y is called trivial if Y is a subset of X.

The determination of functional dependencies is an important part of designing databases in the relational model,
and in database normalization and denormalization. The functional dependencies, along with the attribute
domains, are selected so as to generate constraints that would exclude as much data inappropriate to the user
domain from the system as possible.

For example, suppose one is designing a system to track vehicles and the capacity of their engines. Each vehicle
has a unique vehicle identification number (VIN). One would write VIN → EngineCapacity because it would be
inappropriate for a vehicle's engine to have more than one capacity. (Assuming, in this case, that vehicles only
have one engine.) However, EngineCapacity → VIN is incorrect because there could be many vehicles with the
same engine capacity.

This functional dependency may suggest that the attribute EngineCapacity be placed in a relation with candidate
key VIN. However, that may not always be appropriate. For example, if that functional dependency occurs as a
result of the transitive functional dependencies VIN → VehicleModel and VehicleModel → EngineCapacity then
that would not result in a normalized relation.
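
To make this concrete, here is a hedged sketch (table and column names are invented for illustration) of storing the two functional dependencies in separate relations instead of placing EngineCapacity directly beside VIN:

CREATE TABLE VEHICLE_MODEL (
    vehicle_model    VARCHAR2(30)  PRIMARY KEY,
    engine_capacity  NUMBER(4,1)                          -- VehicleModel -> EngineCapacity
);

CREATE TABLE VEHICLE (
    vin            CHAR(17)       PRIMARY KEY,
    vehicle_model  VARCHAR2(30)   REFERENCES VEHICLE_MODEL -- VIN -> VehicleModel
);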

DBA & ITS FUNCTIONALITY

Database management is among the fundamental processes in the software field of computing. Databases are
used in almost every enterprise, and the bigger the enterprise is, the bigger and more complicated its databases
have to be. Although facilitated to the greatest extent possible by database management system solutions,
database management still needs continuous human intervention. And while a system administrator can, in most
cases, take care of the database, there are many times when a specialist is needed, someone dedicated to database
management. The people responsible for managing databases are called database administrators. Each database
administrator, dubbed DBA for the sake of brevity, may be engaged in performing various database manipulation
tasks such as archiving, testing, running, security control, etc., all related to the environmental side of the
databases. Their job is very important, since in today's world almost all of the information a company uses is kept
in databases.

Because most databases use a different approach to storing and handling data, there are different types of DBAs.
Most of the major vendors who provide database solutions also offer courses to certify the DBA. The exact set of
database administration duties of each DBA depends on his/her job profile, the IT policies applied by the company
he/she works for and, last but not least, the concrete parameters of the database management system in use. A
DBA must be able to think logically to solve problems and to work easily in a team with both DBA colleagues and
staff with no computer training.

Some of the basic database management tasks of a DBA supplement the work of a system administrator and
include hardware/software configuration, as well as installation of new DBMS software versions. These are very
important tasks, since the proper installation and configuration of the database management software and its
regular updates are crucial for the optimal functioning of the DBMS, and hence of the databases, because new
releases often contain bug fixes and security updates.

Database security administration and data analysis are among the major duties of a DBA. He/she is responsible
for controlling DBMS security by adding and removing users, managing database quotas and checking for security
issues. The DBA is also engaged in analyzing database contents and improving data storage efficiency by
optimizing the use of indexes, enabling 'Parallel Query' execution, etc.

Apart from the tasks related to the logical and physical sides of the database management process, DBAs may
also take part in database design operations. Their main role there is to give developers recommendations about
the specifics of the DBMS, thus helping them avoid potential database performance issues. Other important tasks
of DBAs are related to data modeling aimed at optimizing the system layout, as well as to the analysis and creation
of new databases.
Role of DBA
1) Deciding the storage structure and access strategy
2) Liaising with users
3) Defining authorization checks and validation procedures
4) Defining a strategy for back-up and recovery
5) Monitoring performance and responding to changes in requirements

DDL
Data Definition Language (DDL) statements are used to define the database structure or schema. Some examples (a short illustration follows the list):

o CREATE - create objects in the database
o ALTER - alter the structure of the database
o DROP - delete objects from the database
o TRUNCATE - remove all records from a table, along with all space allocated for those records
o COMMENT - add comments to the data dictionary
o RENAME - rename an object
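
A brief, hedged illustration of these DDL statements (the PROJECT table and its columns are invented for the example; the syntax follows the Oracle style used earlier in this document):

CREATE TABLE PROJECT (
    proj_id    NUMBER(6)     PRIMARY KEY,
    proj_name  VARCHAR2(40)
);

ALTER TABLE PROJECT ADD (start_date DATE);       -- alter the structure
COMMENT ON TABLE PROJECT IS 'Company projects';  -- annotate the data dictionary
RENAME PROJECT TO PROJECTS;                      -- rename the object
TRUNCATE TABLE PROJECTS;                         -- remove all rows and their allocated space
DROP TABLE PROJECTS;                             -- delete the object itself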

DML
Data Manipulation Language (DML) statements are used for managing data within schema objects. Some examples:

o SELECT - retrieve data from a database
o INSERT - insert data into a table
o UPDATE - update existing data within a table
o DELETE - delete records from a table; the space allocated for the records remains
o MERGE - UPSERT operation (insert or update)
o CALL - call a PL/SQL or Java subprogram
o EXPLAIN PLAN - explain the access path to data
o LOCK TABLE - control concurrency

DCL
Data Control Language (DCL) statements. Some examples:

o GRANT - give users access privileges to the database
o REVOKE - withdraw access privileges given with the GRANT command
TCL
Transaction Control Language (TCL) statements are used to manage the changes made by DML statements. They allow
statements to be grouped together into logical transactions. Some examples (a combined DML/TCL sketch follows the list):

o COMMIT - save the work done
o SAVEPOINT - identify a point in a transaction to which you can later roll back
o ROLLBACK - restore the database to its state as of the last COMMIT
o SET TRANSACTION - change transaction options such as the isolation level and which rollback segment to use
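
A small, hedged sketch combining DML and TCL statements (the PROJECTS table and the values are assumptions used only to show the statement flow):

SET TRANSACTION READ WRITE;                    -- begin an update transaction

INSERT INTO PROJECTS (proj_id, proj_name) VALUES (10, 'Data Warehouse');
SAVEPOINT after_insert;                        -- a point we can roll back to

UPDATE PROJECTS SET proj_name = 'DW Rollout' WHERE proj_id = 10;
ROLLBACK TO SAVEPOINT after_insert;            -- undo only the UPDATE

COMMIT;                                        -- make the INSERT permanent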

Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. It usually contains historical data derived from transaction data, but it
can include data from other sources. It separates analysis workload from transaction workload
and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing
(OLAP) engine, client analysis tools, and other applications that manage the process of
gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon:

• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your
company's sales data, you can build a warehouse that concentrates on sales. Using this
warehouse, you can answer questions like "Who was our best customer for this item last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from
disparate sources into a consistent format. They must resolve such problems as naming conflicts
and inconsistencies among units of measure. When they achieve this, they are said to be
integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical
because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements
demand that historical data be moved to an archive. A data warehouse's focus on change over
time is what is meant by the term time variant.

Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses
offered by data mining move beyond the analyses of past events provided by retrospective tools
typical of decision support systems. Data mining tools can answer business questions that
traditionally were too time consuming to resolve. They scour databases for hidden patterns,
finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value of
existing information resources, and can be integrated with new products and systems as they are
brought on-line. When implemented on high performance client/server or parallel processing
computers, data mining tools can analyze massive databases to deliver answers to questions
such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This section provides an introduction to the basic technologies of data mining, examples of
profitable applications that illustrate its relevance to today's business environment, and a basic
description of how data warehouse architectures can evolve to deliver the value of data mining
to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product development.
This evolution began when business data was first stored on computers, continued with
improvements in data access, and more recently, generated technologies that allow users to
navigate through their data in real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and proactive information delivery.
Data mining is ready for application in the business community because it is supported by
three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of
data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while
59% expect to be there by the second quarter of 1996. In some industries, such as retail, these
numbers can be much larger. The accompanying need for improved computational engines can
now be met in a cost-effective manner with parallel multiprocessor computer technology. Data
mining algorithms embody techniques that have existed for at least 10 years, but have only
recently been implemented as mature, reliable, understandable tools that consistently outperform
older statistical methods.
In the evolution from business data to business information, each new step has built upon the
previous one. For example, dynamic data access is critical for drill-through in data navigation
applications, and the ability to store large databases is critical to data mining. From the user’s
point of view, the four steps listed in Table 1 were revolutionary because they allowed new
business questions to be answered accurately and quickly.

Denormalization

Abstract

Denormalization is a technique to move from higher to lower normal forms of database
modeling in order to speed up database access. You may apply denormalization in the process
of deriving a physical data model from a logical form.

Context

The analysis you have started, concentrating on important views only, shows that an Order View
is one of the most heavily used views in the database. Reading an Order and
the OrderItems costs a join statement and several physical page accesses. Writing it costs several
update statements. Further statistical analysis, again concentrating on the important parts of
orders, yields that 90 percent of Orders have no more than 5 positions.

Problem

How can you manage to read and write most physical views with a single page database access
when you have a parent/child (Order/OrderItem) relation?

Forces

• Time vs. space: Database optimization is mostly a question of time versus space
tradeoffs. Normalized logical data models are optimized for minimum redundancy and
avoidance of update anomalies; they are not optimized for minimum access time, and time
does not play a role in the normalization process. A data model normalized to 3NF or higher
can be accessed with code of minimal complexity if the domain reflects the relational
calculus and the logical data model based on it. Normalized data models are usually easier
to understand than data models that reflect considerations of physical optimization.

Solution

Fill up the Order entity's database page with OrderItems until you reach the next physical page
limit. Physical pages are usually chunks of 2 KB or 4 KB, depending on the database implementation.

The number of embedded OrderItems should be near or greater than the number of OrderItems that reflects
the order size in 80-95% of cases.

Consequences

Time: You will now be able to access most Orders with a single database access.
Space: You need more disk space in the database, depending on alignment rules. If you are lucky
and the database system begins a new page with every new Order, the space is effectively free.
Because you still have to deal with the 5-20% of orders that do not fit into
the combined Order/OrderItem record, the code will become more complex.

Queries: The database will also become less amenable to ad hoc queries, as the physical
structure no longer exactly reflects the logical data model.
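
A hedged sketch of the resulting physical design (table and column names are invented; five embedded item slots reflect the 90 percent figure above, with an overflow table for the remaining larger orders):

-- Denormalized order: most orders fit entirely in one row, and so in one page.
CREATE TABLE ORDERS_DENORM (
    order_id     NUMBER(9)   PRIMARY KEY,
    customer_id  NUMBER(9),
    item1_code   NUMBER(9),  item1_qty  NUMBER(4),
    item2_code   NUMBER(9),  item2_qty  NUMBER(4),
    item3_code   NUMBER(9),  item3_qty  NUMBER(4),
    item4_code   NUMBER(9),  item4_qty  NUMBER(4),
    item5_code   NUMBER(9),  item5_qty  NUMBER(4)
);

-- Overflow for the 5-20% of orders with more than five items.
CREATE TABLE ORDER_ITEMS_OVERFLOW (
    order_id   NUMBER(9)  REFERENCES ORDERS_DENORM,
    item_code  NUMBER(9),
    qty        NUMBER(4),
    PRIMARY KEY (order_id, item_code)
);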

TIMESTAMP BASED PROTOCOL

A timestamp is a unique identifier created by the DBMS to identify a transaction. Typically,
timestamp values are assigned in the order in which the transactions are submitted to the system,
so a timestamp can be thought of as the transaction start time. We will refer to the timestamp of
transaction T as TS(T). Concurrency control techniques based on timestamp ordering do not use
locks; hence, deadlocks cannot occur.
Generation of Timestamp:
Timestamps can be generated in several ways. One possibility is to use a counter that is
incremented each time its value is assigned to a transaction. The transaction timestamps are
numbered 1, 2, 3, . . . in this scheme. A computer counter has a finite maximum value, so the
system must periodically reset the counter to zero when no transactions are executing for some
short period of time. Another way to implement timestamps is to use the current date/time value
of the system clock and ensure that no two timestamp values are generated during the same tick
of the clock.

Timestamp Ordering Algorithm

• Basic Timestamp Ordering
• Strict Timestamp Ordering
• Thomas's Write Rule
The idea for this scheme is to order the transactions based on their timestamps. A schedule in
which the transactions participate is then serializable, and the equivalent serial schedule has the
transactions in order of their timestamp values. This is called timestamp ordering (TO). Notice
how this differs from 2PL, where a schedule is serializable by being equivalent to some serial
schedule allowed by the locking protocols. In timestamp ordering, however, the schedule is
equivalent to the particular serial order corresponding to the order of the transaction timestamps.
The algorithm must ensure that, for each item accessed by conflicting operations in the schedule,
the order in which the item is accessed does not violate the serializability order. To do this, the
algorithm associates with each database item X two timestamp (TS) values:

1. Read_TS(X): The read timestamp of item X; this is the largest timestamp among all the
timestamps of transactions that have successfully read item X—that is, read_TS(X) = TS(T),
where T is the youngest transaction that has read X successfully.
2. Write_TS(X): The write timestamp of item X; this is the largest of all the timestamps of
transactions that have successfully written item X—that is, write_TS(X) = TS(T), where T is
the youngest transaction that has written X successfully.

Integrity constraint

The notion of an integrity constraint is strongly related to the notion of a transaction. Recall
that a transaction is a sequence of database statements that needs to execute atomically, as a
whole, in order to maintain database consistency. Some integrity constraints are validated
during the execution of a transaction and some at the time when the transaction completes its
execution. We call the time at the end of a transaction at which the remaining integrity constraints
are validated the commit or commitment time. A transaction is committed if it satisfies all the integrity
constraints. If a transaction violates an integrity constraint or cannot complete successfully,
the DBMS aborts and rolls back the transaction. That is, the system reverses the transaction
whenever necessary.

A DBMS enforces integrity constraints, in that it permits only legal instances to be stored in the
database. Integrity constraints are specified and enforced at different times:
1. When the DBA or end user defines a database schema, he or she specifies the ICs
that must hold on any instance of this database.
2. When a database application is run, the DBMS checks for violations and disallows
changes to the data that violate the specified ICs. (In some situations, rather than
disallow the change, the DBMS might instead make some compensating changes
to the data to ensure that the database instance satisfies all ICs. In any case,
changes to the database are not allowed to create an instance that violates any
IC.)
Many kinds of integrity constraints can be specified in the relational model. We have
already seen one example of an integrity constraint in the domain constraints associated
with a relation schema (Section 3.1). In general, other kinds of constraints can be
specified as well; for example, no two students have the same sid value. In this section
we discuss the integrity constraints, other than domain constraints, that a DBA or
user can specify in the relational model.

In order for the system to know the boundaries of the transaction, we must declare each
transaction by using:

• SET TRANSACTION READ WRITE (in some implementations, DECLARE TRANSACTION READ WRITE).
• SET TRANSACTION READ ONLY (or DECLARE TRANSACTION READ ONLY).

The directives READ WRITE and READ ONLY are used to provide a hint to the system that the
transaction is an update or a read-only transaction, respectively. The system uses such hints to
optimize the execution strategy for the transaction. The default is READ WRITE, i.e., an update
transaction.
If a transaction is not explicitly declared, the system implicitly declares an update transaction at
the beginning of an SQL statement and expects the user to specify its termination condition. A
termination condition can be specified using the COMMIT and ROLLBACK statements.

• COMMIT is used to request that the updates of a transaction be made permanent to the
database.
• ROLLBACK is used to discard the updates of a transaction. Any changes to the database
are reversed back to the last transaction commit.

Specifying Syntactic Integrity Constraints


When creating a table we specified the domain for each column, the primary and foreign keys,
and some alternate keys. All these are examples of syntactic constraints. SQL allows us to
specify additional syntactic constraints as well as the time of their evaluation and enforcement.

The time of constraint enforcement can be:

• DEFERRABLE until commit time.
• NOT DEFERRABLE, or immediate.

Here is a list of the main constraints on columns (a combined example follows the list).

• NOT NULL specifies the entity constraint: a column cannot accept a NULL value.
• DEFAULT allows the specification of a default value; without a DEFAULT clause, the default value is NULL.
• PRIMARY KEY (column-list) specifies the primary key.
• UNIQUE (column-list) allows the specification of an alternate key.
• FOREIGN KEY (column-list) REFERENCES table (key) allows the specification of
referential integrity. It also allows the specification of actions to be taken if referential
integrity is violated during a delete or an update.
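
A hedged sketch pulling these column constraints together (the ENROLLMENT table and its columns are invented; it references the STUDENT table from the earlier data dictionary example, and the DEFERRABLE clause follows standard SQL and may vary by DBMS):

CREATE TABLE ENROLLMENT (
    enroll_no   NUMBER(9)   PRIMARY KEY,              -- primary key
    student_id  NUMBER(9)   NOT NULL,                 -- entity constraint: no NULLs allowed
    course_id   NUMBER(6)   NOT NULL,
    grade       CHAR(2)     DEFAULT 'IP',             -- default value instead of NULL
    UNIQUE (student_id, course_id),                   -- alternate key
    FOREIGN KEY (student_id) REFERENCES STUDENT (STU_ID)
        ON DELETE CASCADE                             -- action when referential integrity would be violated
        DEFERRABLE INITIALLY DEFERRED                 -- enforcement deferred until commit time
);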

Serializability is the well-defined textbook notion of correctness
for concurrent transactions. It dictates that a sequence of interleaved
actions for multiple committing transactions must correspond to some
serial execution of the transactions — as though there were no parallel
execution at all. Serializability is a way of describing the desired behavior
of a set of transactions. Isolation is the same idea from the point of view
of a single transaction. A transaction is said to execute in isolation if it
does not see any concurrency anomalies—the “I” of ACID.
Serializability is enforced by the DBMS concurrency control model.
There are three broad techniques of concurrency control enforcement.
These are well-described in textbooks and early survey papers [7], but
we very briefly review them here:
1. Strict two-phase locking (2PL): Transactions acquire a shared lock on every data record
before reading it, and an exclusive lock on every data item before writing it. All
locks are held until the end of the transaction, at which time
they are all released atomically. A transaction blocks on a
wait-queue while waiting to acquire a lock.
2. Multi-Version Concurrency Control (MVCC): Transactions
do not hold locks, but instead are guaranteed a consistent
view of the database state at some time in the past, even if
rows have changed since that fixed point in time.
3. Optimistic Concurrency Control (OCC): Multiple transactions
are allowed to read and update an item without blocking.
Instead, transactions maintain histories of their reads
and writes, and before committing, a transaction checks those histories
for isolation conflicts that may have occurred; if any are
found, one of the conflicting transactions is rolled back.
Most commercial relational DBMS implement full serializability via
2PL. The lock manager is the code module responsible for providing
the facilities for 2PL.
In order to reduce locking and lock conflicts some DBMSs support
MVCC or OCC, typically as an add-on to 2PL. In an MVCC
model, read locks are not needed, but this is often implemented at the
expense of not providing full serializability, as we will discuss shortly
in Section 4.2.1. To avoid blocking writes behind reads, the write is
allowed to proceed after the previous version of the row is either saved,
or guaranteed to be quickly obtainable otherwise. The in-flight read
transactions continue to use the previous row value as though it were
locked and prevented from being changed. In commercial MVCC implementations,
this stable read value is defined to be either the value at
the start of the read transaction or the value at the start of that transaction’s
most recent SQL statement.
While OCC avoids waiting on locks, it can result in higher penalties
during true conflicts between transactions. In dealing with conflicts
across transactions, OCC is like 2PL except that it converts
what would be lock-waits in 2PL into transaction rollbacks. In scenarios
where conflicts are uncommon, OCC performs very well, avoiding
overly conservative wait time. With frequent conflicts, however, excessive rollbacks and retries
negatively impact performance and make it a poor choice.

Keys: Links between Tables


What makes the relational database work is controlled redundancy. Tables within the
database share common attributes that enable the tables to be hooked together.
A key is a device that helps to define entity relationships. The link is created by having two
tables share a common attribute (the primary key of one table appears again as the link (or
foreign key) in another table). The foreign key must contain values that match the other table's
primary key value, or it must contain a null "value" in order for the table to exhibit referential
integrity.
The primary key plays an important role in the relational environment and deserves closer
examination. There are several other kinds of keys that warrant attention, such as superkeys,
candidate keys, and secondary keys.
The key's role is made possible because one or more attributes within a table display a
relationship known as functional dependence:
• The attribute (B) is functionally dependent on attribute (A) if each value in column (A)
determines one and only one value in column (B).
In relation to the concept of functional dependence, a key is an attribute that determines
the values of other attributes within the entity and it may take more than a single attribute to
define functional dependence; that is, a key may be composed of more than one attribute.
Such a
multi-attribute key is known as a composite key.
Given the possible existence of a composite key, it is useful to refine the notion of
functional dependence further by specifying full functional dependence:
• If the attribute (B) is functionally dependent on a composite key (A) but not on any subset of
that composite key, the attribute (B) is fully functionally dependent on (A).
A superkey is any key that identifies each entity uniquely (the superkey functionally
determines all of the entity's attributes).
A candidate key may be described as a superkey without the redundancies (it is a
superkey that does not contain a subset of attributes that contain a superkey).
A secondary key is defined as a key that is used strictly for data-retrieval purposes (for
instance, the primary key is the customer number and the secondary key is the customer
last name and the street name).
Relational Database Keys
Superkey: An attribute (or combination of attributes) that uniquely identifies each entity in a table.
Candidate Key: A minimal superkey; a superkey that does not contain a subset of attributes that is itself a superkey.
Primary Key: A candidate key selected to uniquely identify all other attribute values in any given row. Cannot contain null entries.
Secondary Key: An attribute (or combination of attributes) used strictly for data-retrieval purposes.
Foreign Key: An attribute (or combination of attributes) in one table whose values must either match the primary key in another table or be null.

Indexes
Indexes in the relational database environment work for search and retrieval purposes.
Data retrieval speed may be increased dramatically by using indexes. From a conceptual point of
view, an index is composed of an index key and a set of pointers. The index key is the index's reference
point. Indexes may be created very easily with the help of SQL commands.

Human needs have increased tremendously, and people now perform far more composite tasks than ever
before. Society has become very complex, and a person has to work with a huge amount of information every
day. In order to work with this enormous information, we must have a system in which we can store, manipulate
and share information all over the world. This is one of the core reasons for introducing Database
Management Systems (DBMS) as well as Relational Database Management Systems (RDBMS) nowadays.

So one thing is clear: we store and manipulate data in a database, and the database contains various
types of tables for storing various types of information.

Scanning a table on its own is not good enough for getting the desired data very quickly or for sorting the
data in a specific order. What we actually need is some sort of cross-reference facility so that, for
certain columns of information within a table, it is possible to retrieve whole records of information
quickly. Especially when a table holds a huge amount of data, we need some sort of cross-reference to
get to the data very quickly.

The use of SQL Server indexes provides many facilities, such as:

• Rapid access of information
• Efficient access of information
• Enforcement of uniqueness constraints

Correct use of indexes can make the difference between a top performing database with high
customer satisfaction and a non-performing database with low customer satisfaction.

There are two major types of indexes:

1. Clustered
2. Non-Clustered

Clustered

An index defined as clustered determines the physical order in which the data in a table is stored. Only
one clustered index can be defined per table. So it can be defined as:

• Clustered indexes sort and store the data rows in the table or view based on their key values.
These are the columns included in the index definition. There can be only one clustered index
per table, because the data rows themselves can be sorted in only one order.
• The only time the data rows in a table are stored in sorted order is when the table contains a
clustered index. When a table has a clustered index, the table is called a clustered table. If a
table has no clustered index, its data rows are stored in an unordered structure called a heap.

Non-Clustered

As a non-clustered index is stored in a separate structure to the base table, it is possible to create the
non-clustered index on a different file group to the base table. So it can be defined as:

• Non-Clustered indexes have a structure separate from the data rows. A non-clustered index
contains the non-clustered index key values and each key value entry has a pointer to the data
row that contains the key value.
• The pointer from an index row in a non-clustered index to a data row is called a row locator.
The structure of the row locator depends on whether the data pages are stored in a heap or a
clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row
locator is the clustered index key.
• You can add nonkey columns to the leaf level of the non-clustered index to bypass the existing
index key limits (900 bytes and 16 key columns) and execute fully covered, indexed queries (see the sketch below).
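
A hedged sketch of creating both kinds of index, in SQL Server (T-SQL) syntax since that is the product this section discusses (the ORDERS table and its columns are assumptions for illustration):

-- Clustered index: defines the physical order of the rows in ORDERS.
CREATE CLUSTERED INDEX ix_orders_order_id
    ON ORDERS (order_id);

-- Non-clustered index: a separate structure with row locators back to the table.
-- INCLUDE adds nonkey columns to the leaf level so queries can be fully covered.
CREATE NONCLUSTERED INDEX ix_orders_customer
    ON ORDERS (customer_id)
    INCLUDE (order_date, total_amount);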

Deadlock in database
A transaction is a unit of work, so a database management system will have many transactions. There
may be situations when two or more transactions are put into a wait state simultaneously, each waiting
for the other transaction to release its locks. Suppose we have two transactions, one and two, both
executing simultaneously. In transaction one we update the student table and then update the course
table. In transaction two we update the course table and then update the student table. We know that
when a table is updated it is locked, which prevents other transactions from updating it. So in transaction
one the student table is updated and locked, and in transaction two the course table is updated and
locked. Because both transactions execute simultaneously, both the student table and the course table
are locked, and each transaction waits for the other to release its lock. This is the concept of deadlock
in a DBMS.
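
A hedged illustration of the interleaving just described, written as two SQL sessions (table and column names are assumptions; the comments mark the order in which the statements run):

-- Time 1, Session A:
UPDATE student SET gpa = 3.5 WHERE stu_id = 1;        -- A locks the student row
-- Time 2, Session B:
UPDATE course  SET credits = 4 WHERE course_id = 10;  -- B locks the course row
-- Time 3, Session A:
UPDATE course  SET credits = 3 WHERE course_id = 10;  -- A now waits for B's lock
-- Time 4, Session B:
UPDATE student SET gpa = 3.9 WHERE stu_id = 1;        -- B now waits for A's lock
-- Deadlock: each transaction waits for the other; the DBMS must roll one of them back.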

TRANSACTION MGMT
The term transaction refers to a collection of operations that form a single logical unit of work. For
instance, transfer of money from one account to another is a transaction consisting of two updates,
one to each account. It is important that either all actions of a transaction be executed completely, or,
in case of some failure, partial effects of a transaction be undone. This property is called atomicity.
Further, once a transaction is successfully executed, its effects must persist in the database — a
system failure should not result in the database forgetting about a transaction that successfully
completed. This property is called durability. In a database system where multiple transactions are
executing concurrently, if updates to shared data are not controlled, there is potential for transactions
to see inconsistent intermediate states created by updates of other transactions. Such a situation can
result in erroneous updates to data stored in the database. Thus, database systems must provide
mechanisms to isolate transactions from the effects of other concurrently executing transactions. This
property is called isolation.

A transaction is a unit of program execution that accesses and possibly updates various data items.
Usually, a transaction is initiated by a user program written in a high-level data-manipulation
language or programming language (for example, SQL, COBOL, C, C++, or Java), where it is
delimited by statements (or function calls) of the form begin transaction and end transaction. The
transaction consists of all operations executed between the begin transaction and end transaction.
To ensure integrity of the data, we require that the database system maintain the following
properties of the transactions.
• Atomicity. Either all operations of the transaction are reflected properly in the database, or
none are.
• Consistency. Execution of a transaction in isolation (that is, with no other transaction
executing concurrently) preserves the consistency of the database.
• Isolation. Even though multiple transactions may execute concurrently, the system
guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj
finished execution before Ti started, or Tj started execution after Ti finished. Thus, each
transaction is unaware of other transactions executing concurrently in the system.
• Durability. After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures. These properties are often called the ACID
properties; the acronym is derived from the first letter of each of the four properties.

Failure Classification
There are various types of failure that may occur in a system, each of which needs to be dealt
with in a different manner. The simplest type of failure is one that does not result in the loss of
information in the system. The failures that are more difficult to deal with are those that result in
loss of information. In this chapter, we shall consider only the following types of failure:
• Transaction failure. There are two types of errors that may cause a transaction to fail:
Logical error. The transaction can no longer continue with its normal execution because of some
internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
System error. The system has entered an undesirable state (for example, deadlock), as a result of
which a transaction cannot continue with its normal execution. The transaction, however, can
be reexecuted at a later time.
• System crash. There is a hardware malfunction, or a bug in the database software or the
operating system, that causes the loss of the content of volatile storage, and brings transaction
processing to a halt. The content of nonvolatile storage remains intact, and is not corrupted.
The assumption that hardware errors and bugs in the software bring the system to a halt, but do not
corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed
systems have numerous internal checks, at the hardware and the software level, that bring the
system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.
• Disk failure. A disk block loses its content as a result of either a head crash or failure during a data
transfer operation. Copies of the data on other disks, or archival backups on tertiary media, such as
tapes, are used to recover from the failure.

To determine how the system should recover from failures, we need to identify the failure modes of
those devices used for storing data. Next, we must consider how these failure modes affect the
contents of the database. We can then propose algorithms to ensure database consistency and
transaction atomicity despite failures. These algorithms, known as recovery algorithms, have two
parts:
1. Actions taken during normal transaction processing to ensure that enough information exists to
allow recovery from failures.

2. Actions taken after a failure to recover the database contents to a state that ensures database
consistency, transaction atomicity, and durability.
