Data Storage, Retrieval and DBMS
Data
Double Precision: Real data values are commonly called single precision data because each
real constant is stored in a single memory location. This usually gives seven significant digits
for each real value. In many calculations, particularly those involving iteration or long
sequences of calculations, single precision is not adequate to express the precision required.
To overcome this limitation, many programming languages provide the double precision data
type. Each double precision value is stored in two memory locations, thus providing twice as many
significant digits.
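A quick way to see the difference is to compare the same fraction held in single precision (32-bit) and double precision (64-bit). This is a minimal sketch assuming NumPy is available; the exact digits printed may vary slightly by platform.

    import numpy as np

    # Single precision: roughly seven significant digits survive.
    one_third_single = np.float32(1) / np.float32(3)
    # Double precision: roughly fifteen to sixteen significant digits.
    one_third_double = np.float64(1) / np.float64(3)

    print(one_third_single)   # 0.33333334 -- error creeps in after ~7 digits
    print(one_third_double)   # 0.3333333333333333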
Logical Data Type: Use the Logical data type when you want an efficient way to store data that
has only two values. Logical data is stored as true (.T.) or false (.F.).
Characters: Choose the Character data type when you want to include letters, numbers,
spaces, symbols, and punctuation. Character fields or variables store text information such as
names, addresses, and numbers that are not used in mathematical calculations. For example,
phone numbers or zip codes, though they include mostly numbers, are actually best used as
Character values.
Strings: A data type consisting of a sequence of contiguous characters that represent the
characters themselves rather than their numeric values. A String can include letters, numbers,
spaces, and punctuation. The String data type can store fixed-length strings ranging in length
from 0 to approximately 63K characters and dynamic strings ranging in length from 0 to
approximately 2 billion characters. The dollar sign ($) type-declaration character represents a
String.
A variable is something that may change in value, e.g. the number of words on different pages of a book.
Key: A key is the relational means of specifying uniqueness. A database key is an attribute utilized to sort and/or identify data in some manner. Each table has a primary key, which uniquely identifies records. Foreign keys are utilized to cross-reference data between relational tables.
The primary key of a relational table uniquely identifies each record in the table. It can either
be a normal attribute that is guaranteed to be unique (such as Social Security Number in a
table with no more than one record per person) or it can be generated by the DBMS (such as a
globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a
single attribute or multiple attributes in combination.
Examples:
Imagine we have a STUDENTS table that contains a record for each student at a university. The
student's unique student ID number would be a good choice for a primary key in the STUDENTS
table. The student's first and last name would not be a good choice, as there is always the chance
that more than one student might have the same name.
Foreign Key and Referential Integrity: Referential integrity ensures that a foreign key value always refers to an existing record in the referenced table. For example, suppose Table B has a foreign key that points to a field in Table A. Referential
integrity would prevent you from adding a record to Table B that cannot be linked to Table A. In
addition, the referential integrity rules might also specify that whenever you delete a record from
Table A, any records in Table B that are linked to the deleted record will also be deleted. This is
called cascading delete. Finally, the referential integrity rules could specify that whenever you
modify the value of a linked field in Table A, all records in Table B that are linked to it will also be
modified accordingly. This is called cascading update.
Consider the situation where we have two tables: Employees and Managers. The Employees table has a foreign key attribute entitled Managed By, which points to the record for that employee's manager in the Managers table. Referential integrity enforces the following three rules (a minimal sketch follows the list):
1. We may not add a record to the Employees table unless the Managed By attribute points to a valid record in the Managers table.
2. If the primary key for a record in the Managers table changes, all corresponding records in the Employees table must be modified using a cascading update.
3. If a record in the Managers table is deleted, all corresponding records in the Employees table must be deleted using a cascading delete.
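Here is a minimal sketch of these three rules using SQLite from Python (the sqlite3 module is part of the standard library); the table and column names are illustrative, not taken from the text.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

    conn.execute("""CREATE TABLE Managers (
        manager_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL)""")

    conn.execute("""CREATE TABLE Employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        managed_by  INTEGER REFERENCES Managers(manager_id)
                    ON UPDATE CASCADE   -- rule 2: cascading update
                    ON DELETE CASCADE   -- rule 3: cascading delete
        )""")

    conn.execute("INSERT INTO Managers VALUES (1, 'Asha')")
    conn.execute("INSERT INTO Employees VALUES (10, 'Ravi', 1)")

    # Rule 1: this would raise sqlite3.IntegrityError -- no manager 99 exists.
    # conn.execute("INSERT INTO Employees VALUES (11, 'Mani', 99)")

    # Rule 3 in action: deleting the manager also deletes Ravi's record.
    conn.execute("DELETE FROM Managers WHERE manager_id = 1")
    print(conn.execute("SELECT COUNT(*) FROM Employees").fetchone())   # (0,)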
Alternate Key: The alternate keys of any table are simply those candidate keys which are not currently selected as the primary key; that is, the set of alternate keys is the set of all candidate keys minus the primary key.
Secondary Key: Secondary keys can be defined for each table to optimize the data access.
They can refer to any column combination and they help to prevent sequential scans over the
table. Like the primary key, the secondary key can consist of multiple columns. A candidate
key which is not selected as a primary key is also known as a secondary key.
Index Fields: Index fields are used to store relevant information along with a document.
Date Fields: The date field accepts data entered in date format.
Text Fields: The text field accepts data as an alphanumeric text string.
Information
It is the data that has been converted into a meaningful and useful context for specific end
users.
To obtain information, data is aggregated, manipulated and organized; its content is analysed and evaluated and placed in the proper context for human use.
Information exists as reports, in a systematic textual format or as graphics in an organized
manner.
Information must be relevant, timely, accurate, concise and complete and should apply to
the current situation.
It should be condensed into usable length.
Physical Record vs. Logical Record:
Meaning: A physical record is the unit of data actually read from or written to a storage device in one input/output operation, whereas a logical record is a collection of related fields treated as a unit by an application program.
Independence: Physical records depend on the characteristics of the storage device; logical records are independent of how the data is physically stored.
Example: A disk block is a physical record; a single employee's record (employee number, salary, and so on) is a logical record.
File: A file is a number of related records that are treated as a unit. For
example, a collection of employee records for one company would be
an employee file.
[Figure: the data hierarchy - a FILE contains records (Employee 1, Employee 2, ...); each record contains fields (Employee No, Salary); each field is made up of characters.]
Master File vs. Transaction File:
Data Life and Content: A master file contains relatively permanent records for identification and for summarizing statistical information. A transaction file contains temporary data which is to be processed in combination with the master file.
[The original table also compared the two file types on data size, examples, access method and redundancy.]
File Organization
I. Serial File Organization
Records are arranged one after the other in no particular order other than the chronological order in which records are added to the file. This type of organization is commonly found with transaction data, where records are created in a file in the order in which transactions take place.
II. Sequential File Organization
1. In a sequential file, records are stored one after another in an ascending or descending order determined by the key field of the records.
2. In the payroll example, the records of the employee file may be organized sequentially by employee code sequence.
[Figure: methods of direct file addressing - under direct sequential access: (A) the self-addressing method and (B) the indexed sequential addressing method; under random access: the address generation method and the indexed random method.]
A- Self-Addressing Method: A record key is used as its relative address. Therefore, we can compute the record's address directly from the record key and the physical address of the first record in the file.
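As an illustration, with fixed-length records the relative address can be computed with simple arithmetic. This is a minimal sketch; the record size and base address are hypothetical values, not taken from the text.

    RECORD_SIZE = 64      # bytes per fixed-length record (assumed)
    BASE_ADDRESS = 4096   # physical address of the first record (assumed)

    def record_address(key: int) -> int:
        """Self-addressing: the key itself determines the record's address."""
        return BASE_ADDRESS + (key - 1) * RECORD_SIZE

    print(record_address(1))    # 4096 -- the first record sits at the base address
    print(record_address(10))   # 4672 -- nine records (9 * 64 bytes) further on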
B- Indexed Sequential File Organization:
1. A computer provides a better way to store information like the card catalogue; indeed,
most public libraries today keep their card catalogues on a computer. For each book in the
library, a data record is created that contains information gathered from the various card
catalogues. For example, the title of the book, the author's name, the physical location of
the book, and any other relevant information. A record is generally composed of several
fields, with each field used to store a particular piece of information. For example, we
might store the author's last name in one field and the first name in a separate field. All
the records (one for each book) are collected and stored in a file. The file containing the
records is typically called the data file.
2. Indexes are created so that a particular record in the data file can be located quickly. For
example, we could create an author index, a title index, and a subject index. The indexes
are typically stored in a separate file called the index file.
3. An index is a collection of "keys", one key for each record in the data file. A key is a subset
of the information stored in a record. When an index is created, the key values are
extracted from one or more fields of each record. The value of each key determines its
order in the index (i.e., the keys are sorted alphabetically or numerically). Each key has an
associated pointer that indicates the location in the data file of the corresponding
complete record. To find a particular record, a matching key is quickly located in the index,
and then the associated pointer is used to locate the complete record (a small sketch of this follows the numbered list).
4. Consider the problem of locating a particular book in a library containing thousands of
books. Public libraries long ago developed the card catalogue as a means to efficiently
locate a particular book. Usually there were at least three card catalogues, one with cards
arranged in order by the name of the author, another arranged by the title of the book,
and a third arranged by subject heading. Each card contained information about the book,
most importantly its location in the library. Therefore, by knowing the name of the author,
the title of the book, or the appropriate subject heading, you could use the card catalogues
to quickly determine the location of a particular book. The card catalogues can be thought of as indexes.
5. Consider the author index. There is a filing cabinet containing a card for each book in the
library, filed in alphabetical order by the author's name. Each drawer in the cabinet is
labelled, perhaps "A-E", "F-J", and so on. There are two broad kinds of searches that you
might want to perform on the author index.
6. First, you might want to make a list containing the name of every book in the library. To do
this you would start in the first drawer with the first card, and look at each card in order
until you reached the last card in the last drawer. This is called a "sequential" search
because you look at each card in the catalogue in sequential order.
7. Second, you might want to know the names of the books in the library that were written
by Thomas Jefferson. Instead of examining every card in the catalogue, you are first guided
by the labels on the drawers to the second drawer, the "F-J" drawer. You are then guided
by the tabs inside the drawer to the names that start with the letter "J". This is called a
"random" search. For any particular card, you can use the labels (or indexes) to go almost
directly to the desired card.
8. Actually locating the Thomas Jefferson card(s) involves both a random and sequential
search. We use random access to go directly to the correct drawer and correct tab inside
the drawer. The labels (or indexes) allow us to very quickly get close to the card of interest.
After locating the "J" tab inside the "F-J" drawer, we then use sequential access to locate
the particular Thomas Jefferson card(s) of interest.
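Here is a minimal sketch of the index idea described in point 3: sorted keys held separately from the data records, each key carrying a pointer (here, a list position) to the complete record. The record layout is hypothetical.

    import bisect

    # Data file: complete records, stored in arbitrary (arrival) order.
    data_file = [
        {"title": "A Brief History of Time", "author": "Hawking"},
        {"title": "Notes on Virginia",       "author": "Jefferson"},
        {"title": "Principia",               "author": "Newton"},
    ]

    # Index file: (key, pointer) pairs kept sorted by key.
    author_index = sorted((rec["author"], pos) for pos, rec in enumerate(data_file))
    keys = [k for k, _ in author_index]

    def find_by_author(author):
        """Random search: locate the key in the index, then follow its pointer."""
        i = bisect.bisect_left(keys, author)
        if i < len(keys) and keys[i] == author:
            return data_file[author_index[i][1]]
        return None

    print(find_by_author("Jefferson"))   # the complete Jefferson record
    # A sequential search would instead walk data_file from start to end.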
Merits:
Allows efficient and economical use of sequential processing techniques when the activity rate is high.
Permits quick access to records in a relatively efficient way when this activity is a small fraction of the total workload.
Demerits:
Less efficient in the use of storage space than some other organizations.
Access to records is slower because of the index lookups, and relatively expensive hardware and software resources are required.
Application:
Inventory control, where both sequential access and inquiry are required.
Student registration systems.
III. Direct (Random) File Organization
Transactions can be processed in any order and written at any location throughout the stored file. The desired records can be directly accessed using a randomizing procedure, without accessing all other records in the file.
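The randomizing procedure is typically a hash function applied to the record key; the division-remainder method below is one classic choice. A minimal sketch with a hypothetical bucket count:

    NUM_BUCKETS = 97   # number of storage buckets (a prime, chosen arbitrarily here)

    def bucket_address(record_key: int) -> int:
        """Division-remainder randomizing: map a key straight to a bucket."""
        return record_key % NUM_BUCKETS

    # Records can be reached directly, in any order, without scanning the file.
    for key in (10482, 20993, 10579):
        print(key, "->", bucket_address(key))
    # 10482 and 10579 both map to bucket 6: such collisions are handled
    # by overflow areas or by probing for the next free slot.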
Merits:
Access to records for inquiry and updating is possible immediately.
Immediate updating of several files as a result of a single transaction is possible.
No need for sorting.
Demerits:
Records in the online file are exposed to risks such as loss of accuracy and breaches of security, so special backup and reconstruction procedures must be established.
Less efficient in the use of storage space than a sequentially organized file.
Relatively expensive software and hardware resources are required.
Application:
Any type of inquiry system, such as railway or airline reservation systems.
The Best File Organization
File management involves the logical organization of data supplied to a computer in a predetermined way. Data stored in a particular place is called a FILE. The file is created using a set of instructions called a PROGRAM. The data created in the file depends on the following factors:
1. Data Dependence
2. Data Redundancy
3. Data Integrity
File Management Software
It is a software package that helps users to organize data into files, process them and retrieve the information.
Users can create report formats, enter data into records, search records, sort them and prepare reports.
These packages are designed for microcomputers and are menu-driven, allowing end users to create files by giving easy-to-use instructions.
Following are the criteria for choosing a file organisation method:
1. File Volatility
(i) File volatility is the number of additions to and deletions from the file in a given period of time. E.g. the payroll file of a company, where the employee register is constantly changing, is a highly volatile file, and therefore the direct access method is better.
2. File Activity
(i) File activity refers to the proportion of records accessed on a run to the number of records in a file.
(ii) In case of real time files where each transaction is processed immediately only one
master record is accessed at a time, direct access method is appropriate.
(iii) In cases where almost every record is accessed for processing, a sequentially ordered file is appropriate.
3. File Interrogation
(i) File interrogation refers to the retrieval of information from a file.
(ii) If the retrieval of individual records must be fast to support a real time
operation such as Airline reservation then some kind of direct
organization is required.
(iii) If on the other hand, requirements for data can be delayed, then all the
individuals requests of information can be batched and run in a single
processing run with a sequential file organization.
4. File Size
(i) Large files which require many individual references to records with immediate
response must be organized under the direct access method.
(ii) In the case of small files, it is better to search the entire file sequentially, or with a more efficient binary search, to find an individual record than to maintain complex indexes or complex direct addressing schemes (a short sketch of binary search follows).
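As a small illustration of that binary search, the function below repeatedly halves the search interval over a sorted list of record keys (the keys are hypothetical):

    record_keys = [101, 105, 212, 340, 417, 502]   # a small file, sorted by key

    def binary_search(keys, target):
        """Return the position of target in keys, or None if absent."""
        low, high = 0, len(keys) - 1
        while low <= high:
            mid = (low + high) // 2        # halve the interval each pass
            if keys[mid] == target:
                return mid
            elif keys[mid] < target:
                low = mid + 1
            else:
                high = mid - 1
        return None

    print(binary_search(record_keys, 340))   # 3
    print(binary_search(record_keys, 999))   # None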
Problems of the File Processing Systems:
i. Data Redundancy: The same data is stored in different files, since the data files are independent. This results in a lot of duplicated data, and a separate file maintenance program is necessary to update each file.
ii. Data Dependence: The components of a file processing system depend on one another; therefore, when changes are made in the format and structure of data in a file, changes have to be made in all programs that use this file.
iii. Data Integrity: The same data is found in different forms in different files. Checking the validity of data could not be uniformly implemented, with the result that data in one file may be correct and in another file wrong. Special computer programs have to be written to retrieve data from such independent files, which is time consuming and expensive.
iv. Data Availability: Since data is scattered in many files, it would be necessary to look into many files before relying on a particular piece of data. Due to non-uniformity in the file design, the data may have different identification numbers in different files, and obtaining the necessary data will be difficult.
v. Management Control: Uniform policies and standards cannot be set, since the data is scattered in different files. It is difficult to relate such files and difficult to implement a decision due to non-uniform coding of the data files.
Advantages:
Since many small computers are used, the system is not dependent on one large computer whose failure could shut down the entire network.
Microcomputers tend to be less complex than large systems; therefore, the system is more useful to local users.
5. End user databases: These databases consist of the various data files, such as word-processing documents, Excel sheets and databases, which the end user has generated.
6. External Databases: These are also known as online databases, provided by various data banks or organizations for a nominal fee.
7. Text Databases: These are informative databases normally available on CD-ROM disks for a certain price.
8. Image databases: These databases contain image information. These are available either on the Internet or on CD at a certain price.
9. Object oriented databases: This is a type of database structure developed to suit changing application needs. When integrated database structures were developed, the need for OODB was felt. Databases with relational qualities that are capable of manipulating text, data, objects, images and audio/video clips are used by organisations. Along with OODB, OOP has been developed. In OOP (object oriented programming), every object is described as a set of attributes describing what the object is. The behaviour of the object is also included in the program. Objects with similar qualities and behaviour can be grouped together. OOP is more useful in decision making.
10. Partitioned Database (Partial Distribution): Some databases are centrally managed and some are managed in a decentralised manner. This approach is called a partitioned database. For example, financial, marketing and administrative data can be maintained at headquarters, whereas production data may be maintained at decentralised locations.
Factors to be addressed in maintaining a database:
1. Installation of Database:
2. Memory usage:
4. CPU usage:
Mark jobs that can be processed in the run-off period to unload the machine during peak working hours.
Users of a database system include:
Naive users, who are not aware of the presence of the database system supporting their usage.
Online users, who may communicate with the database either directly through an online terminal or indirectly through a user interface or application programs. Usually they acquire some skill and experience in communicating with the database.
The DBA, who can execute centralized control and is responsible for maintaining the database.
The user interaction with the DBMS includes the definition of the logical relationships in the database; the input and maintenance of data; and the changing, deletion and manipulation of data.
4. A host interface system: This is the part of the DBMS which communicates with the application programs. The host language interface interprets instructions in high-level-language application programs, such as COBOL and BASIC programs that request data from files, so that the data needed can be retrieved. During this period the OS interacts with the DBMS. Application programs do not contain information about the files; thus the programs are independent of the database system.
5. The application programs: These programs perform the same functions as they do in
a conventional system, but they are independent of the data files and use standard data
definitions. This independence and standardisation make rapid special purpose program
development easier and faster.
6. A Natural Language Interface System: The query language permits online update and inquiry by users who are relatively unsophisticated about computer systems. This language is often termed English-like because its instructions are usually in the form of simple commands in English, which are used to accomplish an inquiry task. A query language also permits online programming of simple routines by managers who wish to interact with the data. The natural language may also help managers to generate special reports.
7. The data dictionary: The data dictionary is a centralized depository of information, in computerized form, about the data in the database. It also contains the schema of the database, i.e. the name of each item in the database and a description and definition of its attributes, along with the names of the programs that use them and who is responsible for the data, and authorization tables that specify users and the data and programs authorized for their use. These descriptions and definitions are referred to as the data standards. Maintenance of the data dictionary is the responsibility of the DBA.
8. Online access and update terminals: These may be adjacent to the computer or even
thousands of miles away. They may be dumb terminals, smart terminals or
microcomputers.
9. The output system or report generators: This provides routine job reports, documents and
special reports. It allows programmers, managers and other users to design output reports
without writing an application program in a programming language.
10. File Pointer: A file pointer is a pointer placed in the last field of a record; it contains the address of another related record, thus establishing a link between records. It directs the computer system to move to that related record.
11. Linked List: A linked list is a group of data records arranged in an order which is based on embedded pointers. An embedded pointer is a special data field that links one record to another by referring to the other record. The field is embedded in the first record, i.e. it is a data element within the record (a minimal sketch follows).
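A minimal sketch of records linked through an embedded pointer field; the record layout and key values are hypothetical:

    # Each record carries an embedded pointer: the key of the next related record.
    records = {
        101: {"name": "Anil",   "next": 205},
        205: {"name": "Bala",   "next": 317},
        317: {"name": "Chitra", "next": None},   # end of the linked list
    }

    def traverse(start_key):
        """Follow the embedded pointers from record to record."""
        key = start_key
        while key is not None:
            record = records[key]
            print(key, record["name"])
            key = record["next"]   # the pointer directs us to the next record

    traverse(101)   # visits Anil, Bala, Chitra in linked order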
Factors contributing to the Architecture of a Database:
1. External View
It is also known as user view.
As the name suggests, it includes only that portion of the database with which a particular user or application program is concerned.
It is described by users/ programmers by means of external schema.
2. Conceptual View
It is also known as global view.
It represents the entire database and includes all database entries.
[Figure: the three-level architecture - several external schemas (Ext. Schema 2, Ext. Schema 3, ...) map onto a single conceptual schema, which in turn maps onto the physical schema.]
Data Independence
1. Data independence is the ability to modify a schema definition at one level without affecting the schema at the next higher level.
2. It facilitates logical data independence
3. It assures physical data independence.
Structure of Database
The logical organizational approach of the database is called the database structure. There are three basic structures available, viz. hierarchical, network and relational database structures.
Hierarchical Database Structure
Drawbacks:
i. Hierarchically structured databases are less flexible than other database structures because the hierarchy of records must be determined and implemented before a search can be conducted; in other words, the relationships between records are relatively fixed by the structure.
ii. Managerial use of a query language to solve a problem may require multiple searches, which is very time consuming. Thus, analysis and planning activities, which frequently involve ad-hoc management queries of the database, may not be supported as effectively by a hierarchical DBMS as they are by other database structures.
iii. Ad-hoc queries made by managers that require relationships other than those already implemented in the database may be difficult or time consuming to accomplish.
[Figure: the example data below shown as hierarchies, with MOVIE, THEATRE and ACTOR records arranged under one another in different orders.]

ACTOR          MOVIE              THEATRE
Kamalhaasan    Manmadhan Ambu     Satyam
Dhanush        Aadukalam          PVR
Karthi         Siruthai           INOX
Trisha         Manmadhan Ambu     Satyam
Tammanna       Siruthai           PVR
Relational Database Structure
More than one file is compared at a time with the help of a common key field.
Each file is converted into a table, and the analysis is done on the tables with the help of the common key field.
The rows of the table represent the list of records and the columns represent the data fields.
It is not necessary to maintain the entire file in a single physical location; it can be maintained geographically at any place.
This is more suitable for wider analysis of data from different locations.
Queries are easily possible because the software interacts with different records at the same time (a small join sketch follows this list).
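A minimal sketch of comparing two tables through a common key field; the tables and values are hypothetical:

    # Two tables linked by the common key field "emp_no".
    employees = [
        {"emp_no": 1, "name": "Asha"},
        {"emp_no": 2, "name": "Ravi"},
    ]
    salaries = [
        {"emp_no": 1, "salary": 52000},
        {"emp_no": 2, "salary": 48000},
    ]

    # Join the rows on the common key field.
    salary_by_key = {row["emp_no"]: row["salary"] for row in salaries}
    for emp in employees:
        print(emp["name"], salary_by_key[emp["emp_no"]])   # Asha 52000, Ravi 48000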
Network Database Structure
This structure is more useful when data is to be related in one-to-many or many-to-many mode. This type of structure is found in organizations where online data processing is carried out.
DBMS (Language)
I. Data Definition Language:
DDL defines the conceptual schema, providing a link between the logical and physical structures of the database. The logical structure of a database is called the schema. A subschema is the way a specific application views the data from the database.
Following are the functions of DDL:
i. They define the physical characteristics of each record: the fields in the record, each field's type and length, and each field's logical name; they also specify relationships among the records.
They enable the user and application program to be independent of the physical data structure and of database structure maintenance, by allowing data to be processed on a logical and symbolic basis rather than on a physical location basis.
STRUCTURE OF DBMS
I. DDL Compiler
The DDL compiler converts data definition statements into tables that contain metadata (data about the data) concerning the database.
V. Query manager
It interprets the user's online query.
It converts the query into an efficient series of operations in a form capable of being sent to the data manager.
It uses the data dictionary to find the structure of the relevant portion of the database.
It uses this information to modify the query.
It prepares an optimal plan to access the database for efficient data retrieval.
VI. Data Dictionary
It maintains information pertaining to the structure and usage of data and metadata.
It is consulted by database users to learn what each piece of data and the various synonyms of a data field mean.
DATABASE ADMINISTRATOR
A DBA is a person who actually creates and maintains the database and also carries out the policies developed by the DA (data administrator). The job of the DBA is a technical one. He is responsible for defining the internal layout of the database and for ensuring that the internal layout optimizes system performance, especially in the main business processing areas.
Main functions of a DBA are:
1. Determining the physical design of the database and specifying the hardware resource requirements for the purpose. This can be done by determining the data requirement schedule and accuracy requirements, the way and frequency of data access, search strategies, the physical storage requirements of data, the level of security needed and the response time requirements.
2. Define the contents of the database.
3. Use of the data definition language (DDL) to describe the formats of and relationships among various data elements and their usage.
4. Maintain standards and controls for the database.
5. Specify various rules, which must be adhered to while describing data for a database.
6. Allow only specified users to access the database by using access controls, thus preventing unauthorised access.
7. DBA also prepares documentation which includes recording the procedures, standard
guidelines and data descriptions necessary for the efficient and continuous use of
database environment.
8. The DBA ensures that the operating staff performs its database processing related responsibilities, which include loading the database, following maintenance and security procedures, taking backups, scheduling the database for use, and following restart and recovery procedures after a hardware or software failure, in a proper way.
9. The DBA monitors the database environment.
10. DBA incorporates any enhancements into the database environment, which may include
new utility program or new system releases.
Structured Query Language
SQL is a query language that enables users to create relational databases, which are sets of related information stored in tables.
It is a set of commands for creating, updating and accessing data from a database.
It allows programmers, managers and other users to ask ad-hoc questions of the database interactively without the aid of programmers. It is a set of about 30 English-like commands, such as SELECT ... FROM ... WHERE.
SQL has following features:
a. Simple English-like commands.
b. Command syntax is easy.
c. Can be used by non-programmers.
d. Can be used with different types of DBMS.
e. Allows users to create and update databases.
f. Allows retrieving data from a database without having detailed information about the structure of the records and without being concerned about the processes the DBMS uses to retrieve the data.
g. Has become standard practice for DBMS.
Since SQL is used in many DBMS, managers who understand SQL are able to use the same set of
commands regardless of the DBMS software that they may use.
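A short illustration of an ad-hoc SELECT ... FROM ... WHERE query, using the sqlite3 module from Python's standard library; the table and data are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
    conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                     [(1, "Asha", 82), (2, "Ravi", 67), (3, "Mani", 91)])

    # Ad-hoc query: no knowledge of record structure or access paths is needed.
    for row in conn.execute("SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks"):
        print(row)   # ('Asha', 82) then ('Mani', 91)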
PROGRAM LIBRARY MANAGEMENT SYSTEM
A program library management system provides several functional capabilities to facilitate effective and efficient management of the data centre software inventory. The inventory may include application and system software program code, job control statements that identify resources used and processes to be performed, and processing parameters which direct processing.
Some of the capabilities are as follows:
54
a. Integrity- Each source program is assigned a modification number and version number, and each source statement is associated with a creation date. Security for program libraries, job control language sets and parameter files is provided through the use of passwords, encryption, data compression facilities and automatic backup creation.
b. Update- Library management systems facilitate the addition, deletion, re-sequencing, and
editing of library members.
c. Reporting- With use of its facilities a list of additions, deletions and modifications along
with library catalogue and library member attributes can be prepared for management
and auditor review.
d. Interface- Library software packages may interface with the operating system, job
scheduling, access control system and online program management.
Need for Documentation:
DATA WAREHOUSE
A data warehouse is a computer database that collects, integrates and stores an organisation's data with the aim of producing accurate and timely management information and supporting data analysis. It provides tools to satisfy the information needs of employees at all organizational levels, not just for complex data queries. It makes it possible to extract archived operational data and overcome inconsistencies between different legacy data formats.
A data mart is a subset of a data warehouse. Most organizations start by designing a data mart to attend to immediate needs. To keep it simple, consider a data mart as a data reserve that satisfies a certain aspect of the business or just one application (or a process). A data warehouse is a superset that engulfs all such mini data marts to form one big reservoir of information.
Data Sources
Data is loaded from the operational source systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses.
Data Transformation
The Data Transformation layer receives data from the data sources, cleans and standardises it,
and loads it into the data repository. This is often called "staging" data as data often passes
through a temporary database whilst it is being transformed. This activity of transforming data
can be performed either by manually created code or by a specific type of software called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation (two of them are sketched after the list):
Comparing data from different systems to improve data quality (e.g. Date of birth for a
customer may be blank in one system but contain valid data in a second system. In this
instance, the data warehouse would retain the date of birth field from the second system)
standardising data and codes (e.g. If one system refers to "Male" and "Female", but a
second refers to only "M" and "F", these codes sets would need to be standardised)
integrating data from different systems (e.g. if one system keeps orders and another stores
customers, these data elements need to be linked)
performing other system housekeeping functions such as determining change (or "delta")
files to reduce data load times, generating or finding surrogate keys for data etc.
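Here is a minimal sketch of two of these activities (standardising gender codes, and filling a blank date of birth from a second system); the field names are hypothetical:

    # Standardise codes: map "Male"/"Female" and "M"/"F" onto one code set.
    GENDER_CODES = {"Male": "M", "Female": "F", "M": "M", "F": "F"}

    def transform(rec_system1, rec_system2):
        """Merge one customer's records from two source systems."""
        merged = dict(rec_system1)
        merged["gender"] = GENDER_CODES[merged["gender"]]
        # Improve quality: take the date of birth from system 2 if system 1 is blank.
        if not merged.get("date_of_birth"):
            merged["date_of_birth"] = rec_system2.get("date_of_birth")
        return merged

    print(transform({"id": 7, "gender": "Male", "date_of_birth": ""},
                    {"id": 7, "gender": "M",    "date_of_birth": "1980-04-12"}))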
Data Warehouse
The data warehouse is a relational database organised to hold information in a structure that best
supports reporting and analysis. Most data warehouses hold information for at least one year, and retention can sometimes reach half a century, depending on the business or operations data retention requirement. As a result, these databases can become very large.
Reporting
The data in the data warehouse must be available to the organisation's staff if the data warehouse
is to be useful. There are a very large number of software applications that perform this function,
or reporting can be custom-developed. Examples of types of reporting tools include:
Business intelligence tools: These are software applications that simplify the
process of development and production of business reports based on data
warehouse data.
Executive information systems: These are software applications that are used to
display complex business metrics and information in a graphical way to allow rapid
understanding.
OLAP Tools: OLAP tools form data into logical multi-dimensional structures and
allow users to select which dimensions to view data by.
Data Mining: Data mining tools are software that allows users to perform detailed
mathematical and statistical calculations on detailed data warehouse data to
detect trends, identify patterns and analyse data.
Metadata
Metadata, or "data about data", is used to inform operators and users of the data warehouse
about its status and the information held within the data warehouse. Examples of data warehouse
metadata include the most recent data load date, the business meaning of a data item and the
number of users that are logged in currently.
Operations
Data warehouse operations comprise the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions.
Optional Components
In addition, the following components also exist in some data warehouses:
1. Dependent Data Marts: A dependent data mart is a physical database (either on the same
hardware as the data warehouse or on a separate hardware platform) that receives all its
information from the data warehouse. The purpose of a Data Mart is to provide a sub-set
of the data warehouse's data for a specific purpose or to a specific sub-group of the
organisation.
2. Logical Data Marts: A logical data mart is a filtered view of the main data warehouse that does not physically exist as a separate data copy. This approach delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and of always being as current as the main data warehouse.
3. Operational Data Store: An ODS is an integrated database of operational data. Its sources
include legacy systems and it contains current or near term data. An ODS may contain 30
to 60 days of information, while a data warehouse typically contains years of data. ODS's
are used in some data warehouse architectures to provide near real time reporting
capability in the event that the Data Warehouse's loading time or architecture prevents it
being able to provide near real time reporting capability.
Different methods of storing data in a data warehouse
All data warehouses store their data grouped together by subject areas that reflect the general
usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of
data warehouses is that data is stored at its most elemental level for use in reporting and
information analysis.
Within this generic intent, there are two primary approaches to organising the data in a data
warehouse.
The first is using a "dimensional" approach. In this style, information is stored as "facts" which are
numeric or text data that capture specific data about a single transaction or event, and
"dimensions" which contain reference information that allows each transaction or event to be
classified in various ways. As an example, a sales transaction would be broken up into facts such
as the number of products ordered, and the price paid, and dimensions such as date, customer,
product, geographical location and sales person. The main advantages of a dimensional approach
are that the Data Warehouse is easy for business staff with limited information technology
experience to understand and use. Also, because the data is pre-processed into the dimensional
form, the Data Warehouse tends to operate very quickly. The main disadvantage of the
dimensional approach is that it is quite difficult to add or change later if the company changes the
way in which it does business.
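A minimal sketch of the dimensional idea: one fact row per sales transaction, holding numeric measures plus keys into dimension tables (all names and values are hypothetical):

    # Dimension tables: reference information used to classify each fact.
    dim_product  = {1: "Laptop", 2: "Mouse"}
    dim_customer = {10: "Asha", 11: "Ravi"}

    # Fact table: specific data about single transactions,
    # with keys pointing into the dimensions.
    fact_sales = [
        {"date": "2011-01-05", "product_key": 1, "customer_key": 10,
         "quantity": 2, "price_paid": 900.0},
        {"date": "2011-01-06", "product_key": 2, "customer_key": 11,
         "quantity": 1, "price_paid": 15.0},
    ]

    # Classify each transaction through its dimensions.
    for fact in fact_sales:
        print(dim_customer[fact["customer_key"]],
              dim_product[fact["product_key"]], fact["price_paid"])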
The second approach uses database normalisation. In this style, the data in the data warehouse is
stored in third normal form. The main advantage of this approach is that it is quite straightforward
to add new information into the database, whilst the primary disadvantage of this approach is
that it can be quite slow to produce information and reports.
The Advantages of using a Data Warehouse are:
1. Enhanced end-user access to a wide variety of data.
2. Increased Data consistency
3. Increased productivity and decreased computational cost.
4. It is able to combine data from different sources, in one place.
5. It provides an infrastructure that could support change to data and replication of the
changed data back into the operational systems.
Concerns in using data warehouse
Integrated data warehouses are data warehouses that other systems can access for operational purposes. Some integrated data warehouses are used by other data warehouses, which access them to process reports as well as to look up current data.
BACKUP AND RECOVERY
'Disaster recovery' differs from a database recovery scenario because the operating system
and all related software must be recovered before any database recovery can begin.
Database files that make up a database: Databases consist of disk files that store data. When you create a database using any database software's command-line utility, a main database file or root file is created. This main database file contains database tables, system tables, and indexes. Additional database files expand the size of the database and are called dbspaces.
A mirror log is an optional file and has a file extension of .mlg. It is a copy of a transaction
log and provides additional protection against the loss of data in the event the transaction
log becomes unusable.
Online backup, offline backup, and live backup: Database backups can be performed while the database is being actively accessed (online) or when the database is shut down (offline). When a database goes through a normal shutdown process (the process is not being cancelled), the database engine commits the data to the database files. An online database backup is performed by executing the command-line utility or from the 'Backup Database' utility. When an online backup process begins, the database engine externalizes all cached data pages kept in memory to the database file(s) on disk. This process is called a checkpoint. The database engine continues recording activity in the transaction log file while the database is being backed up. The log file is backed up after the backup utility finishes backing up the database. The log file contains all of the transactions recorded since the last database backup. For this reason the log file from an online full backup must be 'applied' to the database during recovery. The log file from an offline backup does not have to participate in recovery, but it may be used in recovery if a prior database backup is used.
A Live backup is carried out by using the backup utility with the command-line option. A
live backup provides a redundant copy of the transaction log for restart of your system on
a secondary machine in the event the primary database server machine becomes
unusable.
Full and Incremental database backup: Full backup is the starting point for all other types
of backup and contains all the data in the folders and files that are selected to be backed
up. Because full backup stores all files and folders, frequent full backups result in faster
and simpler restore operations.
Incremental backup stores all files that have changed since the last FULL, DIFFERENTIAL
OR INCREMENTAL backup. The advantage of an incremental backup is that it takes the
least time to complete.
For example, suppose you run a backup on Friday: this first backup will always be a full backup by default. Then, when you work with these files on Monday, Leo Backup performs an incremental backup: this backup will transfer only those files that changed since Friday. A Tuesday backup will carry only those files that changed since Monday, and so on for the following days.
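A minimal sketch of how an incremental backup decides what to copy: compare each file's modification time with the time of the last backup (the folder path and timing are hypothetical):

    import os
    import time

    last_backup_time = time.time() - 3 * 24 * 3600   # e.g. the Friday run, three days ago

    def files_to_backup(folder):
        """Return the files changed since the last full or incremental backup."""
        changed = []
        for root, _dirs, names in os.walk(folder):
            for name in names:
                path = os.path.join(root, name)
                if os.path.getmtime(path) > last_backup_time:
                    changed.append(path)
        return changed

    print(files_to_backup("/home/user/documents"))   # hypothetical folder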
3. Document the backup commands and create procedures outlining the backups, which are kept in a file. Also identify the naming convention used, as well as the kind of backups performed.
4. Incorporate health checks into the backup procedures to ensure that the database is not corrupt. A database health check can be performed prior to backing up a database, or on a copy of the database restored from the backup.
5. Deployment of backup and recovery consists of setting up backup procedures on the production server. Verify that the necessary hardware is in place, along with any other supporting software required to perform these tasks. Modify procedures to reflect changes in development.
6. Monitor backup procedures to avoid unexpected errors. Make sure that any changes in
the process are reflected in the documentation.
Data Centre and the challenges faced by the management of a data
centre:
i. A data centre is a centralized repository for the storage, management and dissemination of data and information.
ii. A data centre is a facility used for housing a large amount of electronic equipment, typically computers and communication equipment.
iii. The purpose of a data centre is to provide space and bandwidth connectivity for servers in a reliable, secure and scalable environment.
iv. It also provides facilities like housing websites, providing data serving and other services for companies. Such a data centre may contain a network operations centre (NOC), which is a restricted access area containing automated systems that constantly monitor server activity, web traffic and network performance, and report even slight irregularities to engineers so that they can stop potential problems before they occur.
Challenges:
Maintaining Infrastructure - A data centre needs to set up an infrastructure comprising a number of pieces of electronic equipment, typically computers, and bandwidth connectivity for servers in a reliable, secure and scalable environment.
Skilled Human Resources - A data centre needs skilled staff, expert at network management and having software and hardware operating skills.
Selection of Technology - A data centre also faces the challenge of proper selection of the technology crucial to its operation.
Maintaining System Performance - A data centre has to maintain maximum uptime and system performance, while establishing sufficient redundancy and maintaining security.
DATA MINING
Data mining is the extraction of implicit, previously unknown and potentially useful information from data. It searches for relationships and global patterns that exist in large databases but are hidden among the vast amounts of data. These relationships represent valuable knowledge about the database and the objects in the database, which can be put to use in areas such as decision support, prediction, forecasting and estimation.
In other words, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns, by identifying the underlying rules and features in the data.
Stages in data mining
1. Selection: Selecting or segmenting the data according to some criteria so that subsets of the data can be determined.
2. Pre-processing: This is the data cleansing stage, where certain information that is deemed unnecessary and may slow down queries is removed. The data is also reconfigured to ensure a consistent format, as there is a possibility of inconsistent formats because the data is drawn from several sources.
3. Transformation: The data is not merely transferred across but transformed, in that overlays may be added. For example, demographic overlays are commonly used in market research. The data is made usable and navigable.
4. Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as a given set of facts. One popular example of data mining is using past behaviour to rank customers (a toy sketch follows this list). Such tactics have been employed by financial companies for years as a means of deciding whether or not to approve loans and credit cards.
5. Integration and Evaluation: The patterns identified by the system are interpreted into knowledge, which can then be used to support human decision making: for example, prediction and classification tasks, summarising the contents of a database, or explaining observed phenomena.
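As a toy illustration of the ranking idea in stage 4, the sketch below scores customers from simple features of their past behaviour; the features and weights are entirely hypothetical:

    # Hypothetical past-behaviour features for each customer.
    customers = {
        "C1": {"late_payments": 0, "years_as_customer": 6},
        "C2": {"late_payments": 4, "years_as_customer": 1},
    }

    def score(features):
        """Higher score = better past behaviour (weights are made up)."""
        return 2.0 * features["years_as_customer"] - 5.0 * features["late_payments"]

    # Rank customers, best first, e.g. to decide whether to approve a loan.
    ranking = sorted(customers, key=lambda c: score(customers[c]), reverse=True)
    print(ranking)   # ['C1', 'C2']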