CSC 901

DATA MODEL

A data model is a collection of conceptual tools for describing data, data relationships, data
semantics, and consistency constraints. It is a notation for describing data or information. The
description generally consists of three parts: Structure of the data, Operations on the data and
Constraints on the data.

A relational database is based on the relational model and uses a collection of tables to represent
both data and the relationships among those data. It also includes a Data Manipulation Language
(DML) and Data Definition Language (DDL). The relational model is today the primary data
model for commercial data processing applications - especially for storing financial records,
manufacturing and logistical information, personnel data and much more. It attained its primary
position because of its simplicity, which eases the job of the programmer, compared to earlier
data models such as the network model or the hierarchical model.

RELATIONAL DATABASE

For most of the last 40 years, businesses relied on relational database management systems
(RDBMSs), which use Structured Query Language (SQL) as their query language. ScaleGrid
reports that 60.5% of the commonly used databases are SQL-based RDBMSs. Applications in
domains such as multimedia, geographical information systems, and digital libraries demand a
completely different set of requirements in terms of the underlying database models. The
conventional relational database model is no longer appropriate for these types of data.
Furthermore, the volume of data is typically significantly larger than in classical database
systems.

Relational database management systems (RDBMSs) use SQL, a database management language
that offers a highly organized and structured approach to information management. Similar to the
way a phone book has different categories of information (name, number, address, etc.) for each
line of data, relational databases apply strict, categorical parameters that allow database users to
easily organize, access, and maintain information within those parameters. The primary reasons
why SQL-based RDBMSs continue to dominate are that they are highly stable and reliable,
adhere to a standard that integrates seamlessly with popular software stacks, and have been used
for more than 40 years. Popular examples of SQL database engines include Oracle Database,
MySQL and Microsoft SQL Server.

STRUCTURE OF RELATIONAL DATABASES

A relational database consists of a collection of tables, each of which is assigned a unique name.
For example, consider the Facilitator table (Table 1), which stores information about facilitators.
The table has four columns: ID, name, department and rank. Each row of this table records
information about a facilitator, consisting of the facilitator's ID, name, department, and rank.
Similarly, the Course table (Table 2) stores information about courses, consisting of a course
code, title, department, and credit hours (CH) for each course. Note that each facilitator is
identified by the value of the column ID, while each course is identified by the value of the
column course code.

ID     Name      Department          Rank
0010   Pat       Computer science    Reader
0011   Effiong   Chemistry           Professor
0015   Gabriel   Physics             Lecturer 1
0025   Daniel    History             Professor
Table 1: Facilitator table

Course code   Title                     Department          CH
CIT 751       Database                  Computer science    3
CHM 432       Organic chemistry         Chemistry           2
PHY 234       Introduction to Physics   Physics             4
HIS 121       History education         History             4
Table 2: Course table
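As a rough sketch of how these two tables might be declared with SQL data definition statements
(the column types and sizes are assumptions, since the text does not specify them):

CREATE TABLE Facilitator (
    ID          CHAR(4) PRIMARY KEY,
    Name        VARCHAR(50),
    Department  VARCHAR(50),
    "Rank"      VARCHAR(20)   -- quoted because RANK is a reserved word in some dialects
);

CREATE TABLE Course (
    Course_code VARCHAR(10) PRIMARY KEY,
    Title       VARCHAR(100),
    Department  VARCHAR(50),
    CH          INT           -- credit hours
);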

In general, a row in a table represents a relationship among a set of values. Since a table is a
collection of such relationships, there is a close correspondence between the concept of table and
the mathematical concept of relation, from which the relational data model takes its name. In
mathematical terminology, a tuple is simply a sequence (or list) of values. A relationship
between n values is represented mathematically by an n-tuple of values, i.e., a tuple with n
values, which corresponds to a row in a table. Thus, in the relational model the term relation is
used to refer to a table, while the term tuple is used to refer to a row.

Similarly, the term attribute refers to a column header of a table, while the data type describing
the kinds of values that can appear in each column is called a domain. For example, suppose the
table facilitator had an attribute phone number, which can store a set of phone numbers
associated with the facilitator. Then the domain of phone number would not be atomic, since an
element of the domain is a set of phone numbers (possibly more than one phone number), and it
has subparts, namely the individual phone numbers in the set.

The important issue is not what the domain itself is, but rather how we use domain elements in
our database. Suppose now that the phone number attribute stores a single phone number. Even
then, if we split the value of the phone number attribute (for example +234-873-424-1626)
into a country code, network provider, an area code and a local number, then it is considered a
non-atomic value. If we view each phone number as a single indivisible unit, then the attribute
phone number has an atomic domain.
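The two designs discussed above can be sketched in SQL as follows; the column names and types
are assumptions, and the exact ALTER TABLE syntax varies slightly between database engines:

-- Atomic domain: the whole phone number is one indivisible value
ALTER TABLE Facilitator ADD Phone_number VARCHAR(20);

-- Non-atomic alternative: the same information split into subparts
ALTER TABLE Facilitator ADD Country_code VARCHAR(5);
ALTER TABLE Facilitator ADD Network_provider VARCHAR(20);
ALTER TABLE Facilitator ADD Area_code VARCHAR(5);
ALTER TABLE Facilitator ADD Local_number VARCHAR(12);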

ADVANTAGES OF RDBMS

ACID compliance: If a database system is "ACID compliant," it satisfies a set of properties that
measure the atomicity, consistency, isolation, and durability of database transactions. The more
ACID-compliant a database is, the more it serves to guarantee the validity of database
transactions, reduce anomalies, safeguard data integrity, and create stable database systems.
Generally, SQL-based RDBMSs achieve a high level of ACID compliance, but NoSQL
databases give up this distinction to gain speed and flexibility when dealing with unstructured
data.
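As a small illustration of why this matters, an ACID-compliant engine guarantees that the two
updates below either both take effect or neither does; the Account table and its columns are
assumptions, and the transaction-start keyword varies between engines (BEGIN, BEGIN
TRANSACTION, START TRANSACTION):

BEGIN TRANSACTION;
UPDATE Account SET Balance = Balance - 100 WHERE Account_id = 'A-01';
UPDATE Account SET Balance = Balance + 100 WHERE Account_id = 'B-02';
COMMIT;   -- if anything fails before this point, ROLLBACK restores the original state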

Ideal for consistent data systems: With an SQL-based RDBMS, information remains in its
original structure. These systems offer good speed and stability, especially when they do not
have to handle massive amounts of data.

Better support options: Because RDBMS databases have been around for over 40 years, it's
easier to get support, add-on products, and integrate data from other systems.

DISADVANTAGES OF RDBMS

Scalability challenges and difficulties with sharding: RDBMSs have a more difficult time scaling
up in response to massive growth compared to NoSQL databases. These databases also present
challenges when it comes to sharding (splitting data across many servers). On the other hand, a
non-relational (NoSQL-based) database system handles scaling and growth better than relational
databases.

Less efficient with NoSQL formats: Most RDBMSs are now compatible with NoSQL data
formats, but they don't work with them as efficiently as non-relational databases.

Another characteristic of conventional databases is that there are hardly any international
standards available or used for the content of the databases, that is, the data entered by their
users. This typically means that local conventions are applied to limit the diversity of data that
may be entered in those databases. As local conventions usually differ from one another, this
has the disadvantage that data entered in one database cannot be compared or integrated
with data in other databases, even if the database structures are the same and even if the
application domain of the databases is the same.

NON-RELATIONAL DATABASE SYSTEMS

Non-relational (NoSQL-based) database systems are used when managing large amounts of
unstructured data, such as text from emails and customer surveys, data collected by a network of
mobile apps, or random social media information. Such information is disorganized and has no
clearly defined schema of the kind you would find in an RDBMS, so you cannot store it
efficiently in an RDBMS. But you can with a non-relational (or NoSQL) database system.

ADVANTAGES OF NON-RELATIONAL DATABASE SYSTEMS

Excellent for handling "big data" analytics: The main reason why NoSQL databases are
becoming more popular is that they remove the bottleneck of needing to categorize and apply
strict structures to massive amounts of information. NoSQL databases like HBase, Cassandra,
and CouchDB support the speed and efficiency of server operations while offering the capacity
to work with large amounts of data.

No limits on types of data you can store: NoSQL databases give you unlimited freedom to store
diverse types of data in the same place. This offers the flexibility to add new and different types
of data to your database at any time.

Easier to scale: NoSQL databases are easier to scale. They are designed to be fragmented across
multiple data centers without much difficulty.

No data preparation required: When there isn't time to design a complex model, and you need to
get a database running fast, non-relational databases save a lot of time.

DISADVANTAGES OF NON-RELATIONAL DATABASE SYSTEMS

More difficult to find support: Because the NoSQL community doesn't have 40 years of history
and development behind it, it can be more difficult to find experienced users when you need
support.

Lack of tools: Another disadvantage relating to newness is that, compared to SQL-based RDBMS
solutions, there aren't as many tools to assist with performance testing and analysis.

Compatibility and standardization challenges: Newer NoSQL database systems also lack the
high degree of compatibility and standardization offered by SQL-based alternatives.

DISTRIBUTED PROCESSING

Distributed processing makes use of two or more (usually, many more) computers that are
networked together and all working on a single task in a well-coordinated fashion. The
individual computers involved can be ordinary desktop or laptop machines, high-end machines,
or specialized servers that carry out specific tasks like storage and retrieval of datasets. In a
complex distributed system, sub-components of the system (a subgroup of networked
computers) can be devoted to a specific task while other groups concentrate on separate tasks.

With proper communications links and instructions to the machines, a series of distributed
computers can do the work of much more powerful stand-alone systems, and can even reach
processing power and speeds of the fastest supercomputers. Many gaming systems rely on
distributed processing setups, where gamers' individual machines carry out some of the
processing in addition to more central servers providing the gaming backbone.

DATABASE DESIGN

This section describes the design issues regarding relational databases. In general, the goal of
relational database design is to generate a set of relation schemas that allow us to store
information without unnecessary redundancy, yet allow us to retrieve information easily. A
well-structured and efficient database has the following advantages: it saves disk space by
eliminating redundant data, maintains data accuracy and integrity, and provides access to the
data in useful ways.

Designing an efficient, useful database is a matter of following the proper process, including
these phases:

Strategy and planning: the cycle typically starts with the strategy and planning phase, which
identifies the need for and scope of a new system.

Requirements analysis: a more detailed requirements analysis is then carried out, which includes
identifying what the users require of the system; this involves conceptual analysis.

Design phase: this involves producing a conceptual, logical and physical design. To undertake
these processes it is important to be able to understand and apply the data modeling techniques
which are covered in this book. When a suitable logical design has been obtained, the
development phase can begin.

Development phase: this involves creating the database structure using an appropriate database
management system.

Deployment/implementation: when the system has been developed it is tested and then deployed
ready for use.

Operations and maintenance: following the system's release for use, it is maintained until it
reaches the end of its useful life, at which stage the development lifecycle may restart.

DATABASE DESIGN PHASE

The requirements gathering and specification provides you with a high-level understanding of
the organization, its data, and the processes that you must model in the database. Database
design involves constructing a suitable model of this information. Since the design process is
complicated, especially for large databases, database design is divided into three phases:

Conceptual database design

Logical database design

Physical database design

CONCEPTUAL SCHEMA MODELLING

Once all the requirements have been collected and analyzed, the next step is to create a
conceptual schema for the database, using a high-level conceptual data model; that is, to develop
a layout or visual representation of the proposed database. In many environments, modelling is
used to ensure that a product will satisfy the user's requirements before it is produced. For
example, an architect may use a scale model of a building so the client can see what it will look
like before it is built. This allows any changes to be made to the design following feedback
and before any expensive building work takes place. Similarly, a modelling approach is needed
when designing a database system so that interested parties can check that the design will satisfy
the requirements.

In order to design an effective database system you need to be able to understand an
organization's information needs and, in particular, identify the data needed to satisfy these
needs. Entity Relationship modelling is an important top-down analysis technique which is used to show
the structure of the data used by a system. Initially, a conceptual model is produced which is
independent of any hardware or DBMS system; this is achieved by using an Entity Relationship
Diagram (ERD) or alternatively a UML Class Diagram (CD). This modelling technique will be
used to determine how this business data is structured and show the relationships between the
different data entities. The model forms the basis for the design of the database system that will
be built.

SPECIFICATION REQUIREMENTS GATHERING

The most critical aspect of specification is the gathering and compilation of system and user
requirements. This process is normally done in conjunction with managers and users. The initial
phase of database design is to characterize fully the data needs of the prospective database users.
The database designer needs to interact extensively with domain experts and users to carry out
this task. The outcome of this phase is a specification of user requirements. While there are
techniques for diagrammatically representing user requirements, in this unit we restrict ourselves
to textual descriptions of user requirements.

The major goals in requirements gathering are to collect the data used by the organization, identify
relationships in the data, identify future data needs, and determine how the data is used and
generated. The starting place for data collection is gathering existing forms and reviewing
policies and systems. Then, ask users what the data means and determine their daily processes.
These things are especially critical: identification of unique fields (keys); data dependencies,
relationships, and constraints (at a high level); and the data sizes and their growth rates.

Fact-finding is the use of interviews and questionnaires to collect facts about systems,
requirements, and preferences. Five fact-finding techniques include examining documentation
(for example invoices, timesheets, surveys, etc.), combing through any existing data systems
(including physical and digital files), interviewing, observing the enterprise in operation, and
research and questionnaires.

Start by gathering any existing data that will be included in the database. Then list the types of
data you want to store and the entities, or people, things, locations, and events, that those data
describe. This information will later become part of the data dictionary, which outlines the tables
and fields within the database. Be sure to break down the information into the smallest useful
pieces. For instance, consider separating the street address from the country so that you can later
filter individuals by their country of residence. Also, avoid placing the same data point in more
than one table, which adds unnecessary complexity. The result of this step is concisely written as
a set of users' requirements. These consist of the user-defined operations (or transactions) that
will be applied to the database, and they include both retrievals and updates. In software design, it
is common to use data flow diagrams, sequence diagrams, scenarios, and other techniques for
specifying functional requirements.

DATABASE MANAGEMENT SYSTEM 

A database management system (DBMS) is a powerful computer program that stores and
manages information in a digital repository deployed on a server or mainframe system. This
program lets users keep, sort, update, retrieve and modify their records in a single database; for
example, they can keep and update profiles in a client base. DBMS applications are widely used in
business to model and manage business objects within corporate databases. Such applications
provide a number of advantages that enable organizations to keep their business records secure,
consistent and relevant.

To interact with a database, a DBMS package generally uses SQL queries. It receives a
command from a database administrator (DBA) and prompts the system to perform the necessary
action. These instructions can be about loading, retrieving, or modifying existing data in the
system. A database can store different data in several ways. Some of the types of data that can be
stored in a database are textual data, numerical data, binary data, and date and time data.

Database management software features data independence, as the storage mechanism and
formats can be changed without altering the entire application that uses the database. The list of
most popular DBMS software or free database management tools includes MySQL, Microsoft
SQL Server, Microsoft Access, Oracle, IBM DB2, and FoxPro. For example, a common
DBMS tool, MySQL, a free business database system, is high-performance database software
that helps enterprise users build scalable database applications. Similarly, the features of FoxPro
include creating, adding, editing, and removing information from a database.

FEATURES OF A DBMS

In a database, the chances of data duplication are quite high as several users use one database. A
DBMS reduces data repetition and redundancy by creating a single data repository that can be
accessed by multiple users, even allowing easy data mapping while performing ETL. Most
organizational data is stored in large databases. A DBMS helps maintain these databases by
enforcing user-defined validation and integrity constraints, such as user-based access.

When handling large amounts of data, security becomes the top-most concern for all businesses.
Database management software doesn't allow full access to anyone except the database
administrator or the departmental head. Only they can modify the database and control user
access, making the database more secure. All other users are restricted, depending on their access
level.

By implementing a database management system, organizations can create a standardized way to
use files and ensure consistency of data with other systems and applications. Manipulating and
streamlining advanced data management systems is essential. The application of an advanced
database system allows the same rules to be applied to all the data throughout the organization.

DBMS LANGUAGE

To communicate database updates and queries, DBMS language is used. Different types of
database languages are explained below:

Data Definition Language (DDL): It is used to define and save information regarding table
schemas, indexes, columns, constraints, etc.

Data Manipulation Language (DML): It is used for accessing and manipulating databases.

Data Control Language (DCL): It is used to control access to the saved data. It allows access to
be granted to or revoked from a user.

Transaction Control Language (TCL): It is used to run or process the modifications made by the
DML.
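A few representative statements of each kind are sketched below; the table, column and user
names are illustrative assumptions rather than part of the text:

-- DDL: define structure
CREATE TABLE Staff (ID CHAR(4) PRIMARY KEY, Name VARCHAR(50));

-- DML: access and manipulate data
INSERT INTO Staff (ID, Name) VALUES ('0010', 'Pat');
UPDATE Staff SET Name = 'Patricia' WHERE ID = '0010';
SELECT Name FROM Staff WHERE ID = '0010';

-- DCL: give or revoke access
GRANT SELECT ON Staff TO report_user;
REVOKE SELECT ON Staff FROM report_user;

-- TCL: run or undo the modifications made by DML
COMMIT;
ROLLBACK;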

DATABASE MANAGEMENT SYSTEMS SOFTWARE

There are different database management systems, which can be broadly classified into four
types. The most popular types of DBMS software include:

HIERARCHICAL

A hierarchical DBMS organizes data in a tree-like arrangement, in the form of a hierarchy, either
in a top-down or bottom-up design. The hierarchy is defined by a parent-child relationship,
where a parent may have numerous children, but children can only have a single parent. This
type of DBMS commonly includes one-to-one and one-to-many relationships. A one-to-one
relationship exists when a parent has a single child, whereas in a one-to-many relationship a
parent has multiple children. Because the data is hierarchical, it becomes a complicated network
if one-to-many relationships are disrupted.

FIGURE 1: HIERARCHICAL DATABASE MODEL

NETWORK

A network DBMS is a slightly complex extension of hierarchical DBMS in which data has
many-to-many relationships that appear in the form of a network. The advantages of the network
database model are that records are arranged in a graph that can be accessed via numerous data
paths. In this database structure, a child can have multiple parents. Therefore, it allows you to
model more intricate relationships. The ability to build more relationships among different data
types makes these databases more efficient.

FIGURE 2: NETWORK DBMS

OBJECT-ORIENTED

The object-oriented model describes a database as a group of objects, which stores both values
and operations/methods. Objects with similar values and operations are grouped as classes. As
this type of database integrates with object-oriented programming languages and utilizes an
identical representation model, programmers can leverage the uniformity of a single
programming environment. Object-oriented databases are compatible with various programming
languages, such as Delphi, JavaScript, Python, Java, C++, Perl, Scala, and Visual Basic .NET.

STRUCTURED QUERY LANGUAGE (SQL)

Structured Query Language (SQL) is the most widely used commercial relational database
language. It was designed for managing data in Relational Database Management System
(RDBMS). It was originally developed at IBM in the SEQUEL XRM and System-R projects
(1974-1977). Almost immediately, other vendors introduced DBMS products based on SQL.
SQL continues to evolve in response to changing needs in the database area. This unit explains
how to use SQL to access and manipulate data from database systems like MySQL, SQL Server,
MS Access, Oracle, Sybase, DB2, and others.

BASICS CONCEPTS OF SQL

Structured Query Language is a standard language for accessing and manipulating databases.
It is used for defining tables and integrity constraints and for accessing and manipulating data.
Application programs may allow users to access a database without directly using SQL, but
these applications themselves must use SQL to access the database. Although SQL is an ANSI
(American National Standards Institute) standard, there are many different versions of the SQL
language. However, to be compliant with the ANSI standard, they all support at least the major
commands (such as SELECT, UPDATE, DELETE, INSERT, WHERE) in a similar manner.
Most SQL database programs also have their own proprietary extensions in addition to the SQL
standard. The SQL language has several aspects to it.

DATA MANIPULATION LANGUAGE (DML)

This subset of SQL allows users to pose queries and to insert, delete, and modify rows. Queries
are the main focus of this unit, which also covers the DML commands to insert, delete, and
modify rows.

DATA DEFINITION LANGUAGE (DDL)

This subset of SQL supports the creation, deletion, and modification of definitions for tables and
views. Integrity constraints can be defined on tables, either when the table is created or later.
Although the standard does not discuss indexes, commercial implementations also provide
commands for creating and deleting indexes.

Triggers and Advanced Integrity Constraints: The new SQL:1999 standard includes support for
triggers, which are actions executed by the DBMS whenever changes to the database meet
conditions specified in the trigger.
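The statements below sketch these DDL facilities, building on the Course table of Table 2; the
constraint, index and view names are assumptions:

-- an integrity constraint added after the table has been created
ALTER TABLE Course ADD CONSTRAINT ch_positive CHECK (CH > 0);

-- index commands are offered by commercial systems, although the standard does not discuss them
CREATE INDEX idx_course_dept ON Course (Department);

-- a view defined over the table
CREATE VIEW Physics_courses AS
    SELECT Course_code, Title FROM Course WHERE Department = 'Physics';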

EMBEDDED AND DYNAMIC SQL

Embedded SQL features allow SQL code to be called from a host language such as C or
COBOL. Dynamic SQL features allow a query to be constructed (and executed) at run-time.
Client-Server Execution and Remote Database Access: These commands control how a client
application program can connect to an SQL database server, or access data from a database over
a network. The SQL:1999 standard includes object-oriented features, recursive queries, decision
support queries, and also addresses emerging areas such as data mining, spatial data, and text and
XML data management.

HISTORY OF SQL

SQL was developed by IBM Research in the mid-1970s and standardized by ANSI and later by
ISO. Most database management systems implement a majority of one of these standards and
add their own proprietary extensions. SQL allows the retrieval, insertion, updating, and deletion of
data. A database management system also includes management and administrative functions.
Most, if not all, implementations also include a command-line interface that allows
for the entry and execution of the language commands, as opposed to only providing an
application programming interface (API) intended for access from a graphical user interface
(GUI). The first version of SQL was developed at IBM by Donald D. Chamberlin and
Raymond F. Boyce in the early 1970s.

This version, initially called SEQUEL, was designed to manipulate and retrieve data stored in
IBM's original relational database product, System R. IBM patented their version of SQL in
1985, while the SQL language was not formally standardized until 1986 by the American
National Standards Institute (ANSI) as SQL-86. Subsequent versions of the SQL standard have
been released as ANSI and International Organization for Standardization (ISO) standards.
Originally designed as a declarative query and data manipulation language, variations of SQL
have been created by SQL database management system (DBMS) vendors that add procedural
constructs, flow-of-control statements, user-defined data types, and various other language
extensions. With the release of the SQL:1999 standard, many such extensions were formally
adopted as part of the SQL language via the SQL Persistent Stored Modules (SQL/PSM) portion
of the standard. SQL was adopted as a standard by ANSI in 1986 and by ISO in 1987. In a
nutshell, SQL can execute queries against a database, retrieve data from a database, and insert
records into a database.

THE FORM OF A BASIC SQL QUERY

The basic form of an SQL query is as follows:

SELECT [DISTINCT] select-list

FROM from-list

WHERE qualification

Every query must have a SELECT clause, which specifies columns to be retained in the result,
and a FROM clause, which specifies a cross-product of tables. The optional WHERE clause
specifies selection conditions on the tables mentioned in the FROM clause.

THE SYNTAX OF A BASIC SQL QUERY

The from-list in the FROM clause is a list of table names. A table name can be followed by a
range variable; a range variable is particularly useful when the same table name appears more
than once in the from-list.

The select-list is a list of (expressions involving) column names of tables named in the from-list.
Column names can be prefixed by a range variable.

The qualification in the WHERE clause is a Boolean combination (i.e., an expression using the
logical connectives AND, OR, and NOT) of conditions of the form expression op expression,
where op is one of the comparison operators {<, <=,=, <>, >=, >}. An expression is a column
name, a constant, or an (arithmetic or string) expression.

The DISTINCT keyword is optional. It indicates that the table computed as an answer to this
query should not contain duplicates, that is, two copies of the same row. The default is that
duplicates are not eliminated.
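As a concrete instance of this form, the following query over the Facilitator table of Table 1
(with F as a range variable) returns each distinct name in the Chemistry or Physics departments:

SELECT DISTINCT F.Name
FROM Facilitator F
WHERE F.Department = 'Chemistry' OR F.Department = 'Physics'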

SQL STATEMENTS

Most of the actions you need to perform on a database are done with SQL statements. Some
database systems require a semicolon at the end of each SQL statement. The semicolon is the
standard way to separate SQL statements in database systems that allow more than one SQL
statement to be executed in the same call to the server. We are using MS Access and SQL Server
2000, which do not require a semicolon after each SQL statement, but some database programs
force you to use it. SQL statements are basically divided into four groups: Data Manipulation
Language (DML), Data Definition Language (DDL), Data Control Language (DCL) and
Transaction Control Language (TCL).

FORMAT OF SQL STATEMENTS

The following SQL statement will select all the records in the "Persons" table:

SELECT * FROM Persons
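Where a system does allow several statements in one call, the semicolon is what separates them;
a small sketch (the second table name is an assumption):

SELECT * FROM Persons;
SELECT * FROM Orders;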

DISTRIBUTED DATABASE

A database is primarily used for storing and manipulating data for an organization or for a
particular requirement. It contains data and its related architectures. In an ideal setting, any
database will have its server and more than one user; hence, when a database is designed, its
server is kept in one place (or on one computer) and users access this system remotely. A
distributed database is a type of database in which storage devices are not all attached to a
common central processing unit. The data may be stored in multiple computers located in the
same physical location, or may be dispersed over a network of interconnected computers.

Collections of data can be distributed across multiple physical locations. A distributed database
can reside on network servers on the internet, corporate intranets or other company networks.
The replication and distribution of databases improves database performance at end-user
locations.

Initially, when a database is created, it is like a skeleton. As users start accessing the database, its
size grows or shrinks; usually it grows far more than it shrinks. Similarly, the number of users
may also increase, and these users may not all be at one single location.

FIGURE 3: STRUCTURE OF DISTRIBUTED DATABASE SYSTEM

However, there are important differences in structure and functionality, and these characterize a
distributed database system:

Distributed file systems simply allow users to access files that are located on machines other than
their own. These files have no explicit structure (i.e., they are flat) and the relationships among
data in different files (if there are any) are not managed by the system and are the users'
responsibility. A DDB, on the other hand, is organized according to a schema that defines both
the structure of the distributed data and the relationships among the data. The schema is defined
according to some data model, which is usually relational or object-oriented (see Distributed
Database Schemas).

A distributed file system provides a simple interface to users which allows them to open,
read/write (records or bytes), and close files. A distributed DBMS system has the full
functionality of a DBMS. It provides high-level, declarative query capability, transaction
management (both concurrency control and recovery), and integrity enforcement. In this regard,
distributed DBMSs are different from transaction processing systems as well, since the latter
provide only some of these functions.

A distributed DBMS provides transparent access to data, while in a distributed file system the
user has to know (to some extent) the location of the data. A DDB may be partitioned (called
fragmentation) and replicated in addition to being distributed across multiple sites. All of this is
not visible to the users. In this sense, distributed database technology extends the concept of
data independence, which is a central notion of database management, to environments where
data are distributed and replicated over a number of machines connected by a network. Thus,
from a user's perspective, a DDB is logically a single database even if physically it is distributed.

Users may be located around the world, so the number of transactions against the database also
increases. This creates a heavy network load, because the users are at different locations while
the server is at some other remote location. All these factors reduce the performance of the
database. In systems like trading platforms or bank accounts, such slow performance leads to
issues with concurrency, redundancy, security and so on. Moreover, users have to wait longer for
their transactions to be executed; a user cannot sit in front of a monitor for a long time just to see
the balance of an account.

In order to overcome these issues, a new way of allocating users and DB servers was introduced,
known as a Distributed Database System. In this method, database servers are kept at different
remote locations; that is, several database servers are created and placed at different locations
rather than at a single location. These servers are kept in sync with each other to maintain
consistency. Users accessing the DB can access any of these DB servers over the network as if
they were accessing the DB from a single location, and they are able to access a server without
knowing its location. This in turn reduces the access time for the user: when a user issues a
query, the system locates the server nearest to that user and access is provided through that
nearest server. Hence it reduces both the access time and the network load.

To ensure that distributed databases remain up to date and current, there are two processes:
replication and duplication. Replication uses specialized software that looks for changes in the
distributed database; once the changes have been identified, the replication process makes all
copies of the database look the same. The replication process can be complex and can require a
lot of time and computing resources, depending on the size and number of the distributed
databases. Duplication, on the other hand, identifies one database as a master and then duplicates
that database. A distributed database does not share main memory.

A database user accesses the distributed database through local applications, which do not
require data from other sites, and global applications, which do require data from other sites.

TYPES OF DISTRIBUTED DATABASE SYSTEMS

Homogeneous DDB: This type of distributed database system has identical database systems
distributed over the network. In a homogeneous distributed database system, the data is
distributed but all servers run the same Database Management System (DBMS) software. When
we say identical database systems, this includes the software, hardware, operating systems and so
on; in short, all the components that are essential for having a DB. For example, a database
system with Oracle alone distributed over the network, or with DB2 alone distributed over the
network. This type of DDBMS does not give the feeling that the databases are located at different
locations; users access them as if they were accessing a single system.

Heterogeneous DDB: This is in contrast to the above concept. In heterogeneous distributed
databases, different sites run under the control of different DBMSs, and these databases are
connected somehow to enable access to data from multiple sites. Here we will have different DBs
distributed over the network. For example, the DB at one location can be Oracle, and at another
location it can be Sybase, DB2 or SQL Server. In other words, in this type of DDB, at least one of
the DBs is different from the other DBs. In addition, the operating systems they run on can also
be different: one DB may be on a Windows system while another is on Linux. Irrespective of the
type of DDBMS used, the user accesses these DBs as if they were accessing them locally.

ADVANTAGES AND DISADVANTAGES OF DISTRIBUTED DATABASES

Distributed databases help with performance, security and recovery. Their advantages include the
following.

Transparency Levels: In these systems, the physical locations of the different DBs and of data
objects such as files and tables are not known to the users. Users have the illusion that they are
accessing a single database at one location; thus this method gives distribution transparency over
the databases. In addition, the records of the tables can also be distributed over the databases,
either wholly or partially, by fragmenting them. This type of system provides location
transparency by allowing the user to query any database or table from any location, network
transparency by allowing any DB to be accessed over the network, naming transparency by
allowing named objects such as tables and views to be accessed, replication transparency by
allowing copies of the records to be kept at different DBs, and fragmentation transparency by
allowing the records in a table to be divided horizontally or vertically.

Availability and Reliability: Distribution of data among different DBs allows the user to access
the data without being aware of the failure of any one of the systems. If any system fails or
crashes, data will be provided from another system. For example, if DB-IN fails, the user will be
given data from DB-ALL, or vice versa. The user will not be told whether the data is from DB-IN
or DB-ALL; all the DBs are kept in sync with each other to survive failures. Hence data is
available to the user all the time, which in turn guarantees the reliability of the system.

Performance: Since users access the data in the DBs that are nearest to them, the network load
and network time are reduced. This also reduces data management time. Hence this type of
system gives high performance.

Modularity: Suppose a new DB has to be added to this system. This does not require any
complex changes to the existing system: new DBs can easily be added without much change to
what already exists (because the configuration for multiple DBs already exists in the system).
Similarly, if any DB has to be modified or removed, this can also be done without much effort.

DISADVANTAGES

This system also has disadvantages, which include the following. Increased Complexity: This is
the main drawback of this system. Since it has many DBs, it has to keep all of them working
together. This needs extra design and work to keep them in sync, coordinated and working
efficiently. These extra changes to the architecture make a DDBMS more complex than a DB
with a single server.

Very Expensive: Since the complexity is increased, the cost of maintaining that complexity also
increases. The cost of running and managing multiple DBs is higher than for a single DB.

Difficult to maintain Integrity: Extra effort is needed to maintain the integrity among the DBs
in the network. It may need extra network resources to make it possible.

Security: Since data is distributed over the DBs and the network, extra caution must be taken to
secure the data. Managing access levels, as well as preventing unauthorized access over the
network, needs some extra effort.

Fragmentation of data and its distribution pose extra challenges to the developer as well as to the
database designer. This in turn increases the complexity of the database design needed to meet
the requirements for transparency, reliability, integrity and redundancy.

COMPONENTS OF DISTRIBUTED DATABASE SYSTEMS

The Database Manager is one of the major components of a distributed database system. It is
software responsible for handling a segment of the distributed database.

The User Request Interface is another important component of distributed database systems. It is
usually a client program which acts as an interface to the Distributed Transaction Manager.

The Distributed Transaction Manager is a program that translates user requests and converts
them into the format required by the database managers, which are typically distributed. A
distributed database system is made up of both the distributed transaction manager and the
database managers.

Current trends in distributed data management are centered on the Internet, in which
petabytes of data can be managed in a scalable, dynamic, and reliable fashion. Two important
areas in this direction are cloud computing and peer-to-peer databases.

FRAGMENTATION AND REPLICATION

In order to access data stored at remote locations with less message-passing cost, the data should
be distributed appropriately. Distribution of data is done through fragmentation or replication.
Fragmentation consists of breaking a relation into smaller relations or fragments and storing the
fragments, possibly at different sites. Fragmentation of data in a distributed database has the
following major advantages:

Efficiency: Data are stored close to where they are used and separate from other data used by
other users or applications.

Local optimization: Data can be stored to optimize performance for local access.

Ease of querying: Combining data across horizontal partitions is easy because rows are simply
merged by unions across the partitions.
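A sketch of how rows from two horizontal fragments might be recombined by such a union; the
fragment names are assumptions (UNION ALL keeps every row, since the fragments hold
disjoint sets of rows):

SELECT * FROM Facilitator_site1
UNION ALL
SELECT * FROM Facilitator_site2;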

Replication: In replication, several copies of a relation are stored at different sites. Replication
helps to increase reliability, locality and performance. The various advantages of replication are
as follows:

Reliability: If one of the sites containing the relation (or database) fails, a copy can always be
found at another site without network traffic delays.

Fast response: Each site that has a full copy can process queries locally, so queries can be
processed rapidly.

Possible avoidance of complicated distributed transaction integrity routines: Replicated
databases are usually refreshed at scheduled intervals, so most forms of replication are used
when some relaxing of synchronization across database copies is acceptable.

Node decoupling: Each transaction may proceed without coordination across the network. Thus,
if nodes are down, busy, or disconnected (e.g., in the case of mobile personal computers), a
transaction is handled when the user desires.

There are two types of replication, which are as follows:

Synchronous Replication: All copies of a modified relation (fragment) must be updated before
the transaction commits. Here, the most up-to-date value of an item is guaranteed to the end user.
There are two different methods of synchronous replication.

Read-Any, Write-All: This method works well when reads are much more frequent than writes.
When reading an item, any of the replicas may be accessed; when writing an item, all of the
replicas must be updated.

Voting: When writing, update some fraction of the replicas. When reading, read enough copies to
ensure that you get at least one copy of the most recent value, using a version number to
determine which value is most recent; in effect, the copies "vote" on the value of the item.

PROBLEMS IN DISTRIBUTED DATABASE

One of the major problems in distributed systems is deadlock. A deadlock is a state in which a set
of processes request resources that are held by other processes in the set, and none of the
processes can be completed. A process can request and acquire resources in any order without
knowing the locks acquired by other processes, and if the sequence of allocations of resources to
processes is not controlled, deadlocks can occur. Hence we focus on deadlock detection and
removal. In order to detect deadlocks in distributed systems, a deadlock detection algorithm must
be used. Each site maintains a local wait-for graph; if there is a cycle in the graph, there is a
deadlock in the system. However, even when there is no cycle in any local wait-for graph, there
can still be a deadlock, because resources are acquired globally. In order to find such global
deadlocks, a global wait-for graph is maintained. This is known as the centralized approach to
deadlock detection. The centralized approach, while straightforward to implement, has two main
drawbacks: first, the global coordinator becomes a performance bottleneck as well as a single
point of failure; secondly, it is prone to detecting non-existent deadlocks, referred to as phantom
deadlocks.

DISTRIBUTED TRANSPARENCY

Distribution transparency is the property of distributed databases by virtue of which the
internal details of the distribution are hidden from the users. The DDBMS designer may choose
to fragment tables, replicate the fragments and store them at different sites. However, since users
are oblivious to these details, they find the distributed database as easy to use as any centralized
database. The three dimensions of distribution transparency are:

LOCATION TRANSPARENCY

Location transparency ensures that the user can query any table(s) or fragment(s) of a table as
if they were stored locally at the user's site. The fact that the table or its fragments are stored at a
remote site in the distributed database system should be completely hidden from the end user.
The addresses of the remote site(s) and the access mechanisms are completely hidden. In order to
incorporate location transparency, the DDBMS should have access to an updated and accurate
data dictionary and DDBMS directory which contains the details of the locations of data.

FRAGMENTATION TRANSPARENCY

Fragmentation transparency enables users to query upon any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying on is actually a fragment or union of
some fragments. It also conceals the fact that the fragments are located at diverse sites. This is
somewhat similar to users of SQL views, where the user may not know that they are using a
view of a table instead of the table itself.
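A minimal sketch of how a union of fragments can be hidden behind an SQL view so that users
query one apparent table; the fragment and view names are assumptions:

CREATE VIEW Facilitator_all AS
    SELECT * FROM Facilitator_site1
    UNION ALL
    SELECT * FROM Facilitator_site2;

-- the user queries the view as if the table were unfragmented
SELECT Name FROM Facilitator_all WHERE Department = 'Physics';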

REPLICATION TRANSPARENCY

Replication transparency ensures that the replication of databases is hidden from the users. It
enables users to query a table as if only a single copy of the table exists. Replication
transparency is associated with concurrency transparency and failure transparency. Whenever a
user updates a data item, the update is reflected in all the copies of the table; however, this
operation should not be visible to the user. This is concurrency transparency. Also, in case of
failure of a site, the user can still proceed with his or her queries using replicated copies without
any knowledge of the failure. This is failure transparency.

COMBINATION OF TRANSPARENCIES

In any distributed database system, the designer should ensure that all the stated transparencies
are maintained to a considerable extent. The designer may choose to fragment tables, replicate
them and store them at different sites, all hidden from the end user. However, complete
distribution transparency is a tough task and requires considerable design effort.

DEADLOCK RECOVERY

A deadlock always involves a cycle of alternating process and resource nodes in the resource
graph. The general approach for deadlock recovery is process termination, in which nodes
and edges of the resource graph are eliminated. The simplest algorithm is to terminate all
processes involved in the deadlock. This approach is unnecessarily wasteful, since in most cases
eliminating a single process is sufficient to break the deadlock. Thus, it is better to terminate
processes one at a time, release their resources, and check at each step whether the deadlock still
persists. Before terminating a process, the following parameters need to be checked: the priority
of the process, the cost of restarting the process, and the current state of the process.

TWO-PHASE COMMIT PROTOCOL

The two-phase commit protocol is a distributed algorithm which lets all sites in a distributed
system agree to commit a transaction. The protocol results in either all nodes committing the
transaction or all nodes aborting it, even in the case of site failures and message losses. However,
as shown by the work of Skeen and Stonebraker, the protocol cannot handle more than one
random site failure at a time. The two phases of the algorithm are the COMMIT-REQUEST
phase, where the COORDINATOR attempts to prepare all the COHORTS, and the COMMIT
phase, where the COORDINATOR completes the transactions at all COHORTS.

If a portion of a transaction operation cannot be committed, all changes made at the other sites
participating in the transaction will be undone to maintain a consistent database state. The
protocol requires that each individual DP's transaction log entry, which is maintained by that DP,
be written before the dataset fragment is actually updated. Consequently, the DO-UNDO-REDO
protocol and the write-ahead protocol are necessary.

The DO-UNDO-REDO protocol rolls back and/or rolls forward transactions with the help of the
system's transaction log entries. It has three operations:

DO – performs the operation and records the “before” and “after” values in the transaction log.
UNDO – reverses an operation, using the log entries written by the DO portion of the sequence.
REDO – redoes an operation, using the log entries written by the DO portion of the sequence.

The write-ahead protocol forces the log entry to be written to permanent storage before the actual
operation takes place. The two-phase commit protocol defines the operations between two kinds
of nodes: the coordinator and the subordinates (or cohorts).

The protocol works in the following manner: One node is designated the coordinator, which is
the master site, and the rest of the nodes in the network are called cohorts. Other assumptions of
the protocol include stable storage at each site and use of a write ahead log by each node. Also,
the protocol assumes that no node crashes forever, and eventually any two nodes can
communicate with each other. The latter is not a big deal since network communication can
typically be rerouted. The former is a much stronger assumption; suppose the machine blows up!

During phase 1, initially the coordinator sends a query to commit message to all cohorts. Then it
waits for all cohorts to report back with the agreement message. The cohorts, if the transaction
was successful, write an entry to the undo log and an entry to the redo log. Then the cohorts
reply with an agree message, or an abort if the transaction failed at a cohort node.

Coordinator sends message to all subordinates

Subordinates receive the message, write the transaction log using the write-ahead protocol, and
send an acknowledgement (YES/PREPARED TO COMMIT or NO/NOT PREPARED) message
to the coordinator.

The coordinator then confirms that all are ready to commit or abort.

During phase 2, if the coordinator receives an agree message from all cohorts, then it writes a
commit record into its log and sends a commit message to all the cohorts. If not all agreement
messages come back, the coordinator sends an abort message. Next the coordinator waits
for the acknowledgement from the cohorts. When acks are received from all cohorts the
coordinator writes a complete record to its log. Note the coordinator will wait forever for all the
acknowledgements to come back. If the cohort receives a commit message, it releases all the
locks and resources held during the transaction and sends an acknowledgement to the
coordinator. If the message is abort, then the cohort undoes the transaction with the undo log and
releases the resources and locks held during the transaction. Then it sends an acknowledgement.

The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.
Each subordinate receives the COMMIT message, then updates the database using the DO
protocol.

The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.

The greatest disadvantage of the two-phase commit protocol is the fact that it is a blocking
protocol. A node will block while it is waiting for a message. This means that other processes
competing for resource locks held by the blocked processes will have to wait for the locks to be
released. A single node will continue to wait even if all other sites have failed. If the coordinator
fails permanently, some cohorts will never resolve their transactions, which has the effect that
resources are tied up forever.

Another disadvantage is that the protocol is conservative: it is biased towards the abort case
rather than the complete case.

OBJECT ORIENTED DATABASE

In the mid-1980s, RDBMSs were very popular, but because of some limitations of the relational
model and the lack of RDBMS support for some advanced applications, the Object Oriented
Database (OODB) came into the picture. At that time the object-oriented programming paradigm
was very popular, which led researchers to combine the capabilities of databases with the
object-based programming paradigm. In object databases, data is stored in the form of objects.
These database management systems did not become very popular because of the lack of
standards.

An object refers to an abstract concept that generally represents an entity of interest in the
enterprise to be modelled by a database application. An object has a state and some behaviour.
The object's state is its internal structure, and the internal structure consists of the properties of
the object. We can view a student as an object: the state of the object contains descriptive
information such as an identifier, a name, and an address.

The behaviour of an object is the set of methods that are used to create, access, and manipulate
the object. A student object, for example, may have methods to create the object, to modify the
object state, and to delete the object. The object may also have methods to relate the object to
other objects, such as enrolling a student in a course or assigning a grade to a student in a course.
Objects having the same state and behaviour are described by a class.

The history of data processing has gone through many changes with different technologies over
time. In recent decades there has been a huge increase in the volume of data that needs to be
processed, and sometimes older technology no longer works, so new technology must be
developed to process the data. The history of database technology has included unit records and
punched cards, punched-card proliferation, paper data reels and data drums, file systems,
database systems, and NoSQL and NewSQL databases. For the last five decades, the most widely
used technology has been database management systems.

After encountering the limitations of file systems, researchers came up with a new technology
known as database management systems, which are collections of software or programs to
maintain data records. Initially, two models were proposed, the hierarchical and network models,
but these models did not gain much popularity because of their complex nature. Then the
researcher E. F. Codd came up with a new data model known as the relational model, in which
data items are stored in tables. Many DBMSs have been developed on the basis of this model. It
is still the most popular model because it has a conceptual foundation in relational mathematics.

An Object Oriented Database is a database in which information is represented in the form of objects, as used in object-oriented programming. OODBs differ from relational databases, which are table-oriented. An object-oriented database management system (OODBMS), sometimes shortened to ODBMS (object database management system), is a database management system (DBMS) that supports the modelling and creation of data as objects. This includes some kind of support for classes of objects and for the inheritance of class properties and methods by subclasses and their objects.

There is currently no widely agreed-upon standard for what constitutes an OODBMS, and OODBMS products are considered to be still in their infancy. In the meantime, the object-relational database management system (ORDBMS), in which object-oriented concepts are superimposed on relational databases, is more commonly encountered in available products. An object-oriented database interface standard is being developed by an industry group, the Object Data Management Group (ODMG), and the Object Management Group (OMG) has already standardized an object-oriented data brokering interface between systems in a network. The requirements for a database to be object oriented include the following:

It should be a Database Management System (DBMS): the OODBMS should provide the basic features of any database system, namely persistence, concurrency, data recovery, secondary storage management and an ad hoc query facility.

It should support polymorphism and inheritance: the database system should support all the requisite features of an object-oriented system, such as encapsulation, complex objects, inheritance, polymorphism and extensibility.

BASIC CONCEPTS OF OO PROGRAMMING

A conceptual entity is anything that exists and can be distinctly identified, for example a person, an employee, a car or a part. In an OO system, all conceptual entities are modelled as objects. An object has structural properties, defined by a finite set of attributes, and behavioural properties. The only difference between an entity and an object is that an entity has only state and no behaviour, whereas an object has both state and behaviour. Each object is associated with a logical, non-reusable and unique object identifier (OID). The OID of an object is independent of the values of its attributes. All objects with the same set of attributes and methods are grouped into a class and form instances of that class. An OID has the following characteristics: it is generated by the system; it is unique to that object in the entire system; it is used only by the system, not by the user; and it is independent of the state of the object.
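
The following short Python sketch mimics system-generated OIDs using the standard uuid module; this is only an analogy, since a real OODBMS generates and manages OIDs internally and never exposes them for user manipulation.

import uuid

class PersistentObject:
    def __init__(self):
        # Generated by the system, unique in the whole system,
        # and independent of the object's attribute values.
        self._oid = uuid.uuid4()

    @property
    def oid(self):
        return self._oid


class Person(PersistentObject):
    def __init__(self, name):
        super().__init__()
        self.name = name


p1, p2 = Person("Ada"), Person("Ada")
print(p1.oid != p2.oid)   # True: same attribute values, different identities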

Classes are classified as lexical and non-lexical classes. A lexical class contains objects that can be directly represented by their values, while a non-lexical class contains objects, each of which is represented by a set of attributes and methods. Instances of a non-lexical class are referred to by their OIDs. For example, PERSON, EMPLOYEE and PART are non-lexical classes.

OO DATA MODEL VS HIERARCHICAL DATA MODEL

The nested structure of objects and the nested structure of records in hierarchical databases are similar. The essential difference is that the OO data model uses logical, non-reusable OIDs to link related objects, while the hierarchical model uses physical, reusable pointers to physically link related records. The hierarchical model has no object or OID concepts.

The OO data model allows cyclic definitions within object structures. For example, a course can refer to other courses as its prerequisite courses. To support cyclic definitions in the hierarchical data model, dummy record types (e.g. a prerequisite record) are needed.

OO DATA MODEL VS NESTED RELATIONS

In the nested relation approach, an attribute of a relation can itself be a relation. The nested relation is stored physically within the base relation. This approach does not allow the nested relation to be shared among relations, so there may be redundant storage of data, which can lead to update anomalies. In the OO approach, nested relations are simulated by using the OIDs of the tuples of a relation that are to be nested within a base relation. Because OIDs are used, sharing of the tuples of a nested relation is possible and there is less redundancy.

ACHIEVEMENTS AND WEAKNESSES OF OODBMS

Support for user-defined data types: OODBs provide facilities for defining new user-defined data types and for maintaining them.

New types of relationships: OODBs allow a new type of relationship between objects, called an inverse relationship (a binary relationship).

No need for keys for identification: unlike the relational model, the object data model uses object identity (OID) to identify objects in the system.

Equality predicates: in OODBs there are four types of equality predicates: identity equality of objects, value equality of objects, value equality of properties, and identity equality of properties (see the sketch after this list).

No need for joins: OODBs have the ability to reduce the need for joins.
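
The following minimal Python sketch illustrates the difference between identity equality and value equality, using the language's built-in identity comparison (is) and a user-defined __eq__ method for value comparison; it is only an analogy for the OODB equality predicates listed above, not an OODBMS feature.

class Part:
    def __init__(self, number, description):
        self.number = number
        self.description = description

    def __eq__(self, other):
        # Value equality: the objects have the same attribute values.
        return (self.number, self.description) == (other.number, other.description)


a = Part(101, "bolt")
b = Part(101, "bolt")

print(a == b)    # True:  value equality
print(a is b)    # False: not identity equality (two distinct objects)
print(a is a)    # True:  identity equality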

WEAKNESSES

Coherency between the relational and object models: relational databases are found in every organization. For object databases to displace them, they must provide coherent services that let users migrate from a relational database to an object database, and the architectures of the relational and object models must provide some coherency between them.

DATA MINING

In our day-to-day activities we generate a great deal of data that was previously difficult to store, but with the advent of computers the problem of storage is largely eliminated. These data are stored on disparate structures and keep increasing by the day, an issue that led to the creation of structured databases and database management systems. Managing these data efficiently and effectively requires an effective management system, and database management systems are used to manage large corpora of data in terms of storage and fast retrieval. Today, data from business transactions, science, medicine, personal activity, surveillance video, satellite sensing, games, digital media, virtual worlds, software engineering, World Wide Web repositories, and text reports and memos have all proliferated.

There is a need for automatic summarization of data, extraction of the essence of the information stored, and the discovery of patterns in raw data. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming, and this initial chaos led to the creation of structured databases and DBMSs. Efficient database management systems have been very important assets for managing a large corpus of data, and especially for effective and efficient retrieval of particular information from a large collection whenever it is needed. The proliferation of database management systems has also contributed to the recent massive gathering of all sorts of information. Today we have far more information than we can handle: from business transactions and scientific data to satellite pictures, text reports and military intelligence. Information retrieval alone is simply not enough anymore for decision-making. Confronted with huge collections of data, we now have new needs to help us make better managerial choices: automatic summarization of data, extraction of the essence of the information stored, and the discovery of patterns in raw data.

Data mining is a process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. Data mining is an
interdisciplinary subfield of computer science and statistics with an overall goal to extract
information (with intelligent methods) from a data set and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves
database and data management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered structures,
visualization, and online updating.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection and data preparation nor the result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.

KDD STAGES

The Knowledge Discovery in Databases (KDD) process comprises a series of steps leading from raw data collections to some form of new knowledge. It is an iterative process consisting of the following steps:

Data cleaning: also known as data cleansing, this is a phase in which noisy and irrelevant data are removed from the collection.

Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in
a common source.

Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the
data collection.

Data transformation: also known as data consolidation, it is a phase in which the selected data
is transformed into forms appropriate for the mining procedure.

Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.

Pattern evaluation: in this step, strictly interesting patterns representing knowledge are
identified based on given measures.

Knowledge representation: is the final phase in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users understand
and interpret the data mining results.

It is common to combine some of these steps together. For instance, data cleaning and data
integration can be performed together as a pre-processing phase to generate a data warehouse.
Data selection and data transformation can also be combined where the consolidation of the data
is the result of the selection, or, as for the case of data warehouses, the selection is done on
transformed data.
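
As a rough end-to-end illustration of these stages, the following Python sketch chains cleaning, selection, transformation and mining using the pandas and scikit-learn libraries (assumed to be installed); the file name, the column names and the choice of k-means clustering are hypothetical.

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical raw data source; the file and columns are illustrative only.
raw = pd.read_csv("transactions.csv")

# Data cleaning: remove missing, invalid and duplicate records.
clean = raw.dropna().drop_duplicates()
clean = clean[clean["amount"] > 0]

# Data selection: keep only the attributes relevant to the analysis.
selected = clean[["customer_id", "amount", "items"]]

# Data transformation: consolidate to one row per customer.
per_customer = selected.groupby("customer_id").agg(
    total_spent=("amount", "sum"),
    n_items=("items", "sum"),
)

# Data mining: discover groups of customers (cluster analysis).
model = KMeans(n_clusters=3, n_init=10, random_state=0)
per_customer["cluster"] = model.fit_predict(per_customer)

# Pattern evaluation / knowledge representation: summarize each cluster.
print(per_customer.groupby("cluster").mean())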

Data mining derives its name from the similarity between searching for valuable information in a large database and mining rock for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing it to pinpoint exactly where the value resides. The term is, however, a misnomer, since mining for gold in rocks is usually called "gold mining" and not "rock mining"; by analogy, data mining should really have been called "knowledge mining". Nevertheless, data mining became the accepted customary term, and very rapidly became a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.

MINING SYSTEMS

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are the following:

Classification according to the type of data source mined: this classification categorizes data
mining systems according to the type of data handled such as spatial data, multimedia data, time-
series data, text data, World Wide Web, etc.

Classification according to the data model drawn on: this classification categorizes data mining
systems based on the data model involved such as relational database, object-oriented database,
data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered: this classification categorizes data
mining systems based on the kind of knowledge discovered or data mining functionalities, such
as characterization, discrimination, association, classification, clustering, etc. Some systems tend
to be comprehensive systems offering several data mining functionalities together.

Classification according to mining techniques used: Data mining systems employ and provide
different techniques. This classification categorizes data mining systems according to the data
analysis approach used such as machine learning, neural networks, genetic algorithms, statistics,
visualization, database oriented or data warehouse-oriented, etc. The classification can also take
into account the degree of user interaction involved in the data mining process such as query-
driven systems, interactive exploratory systems, or autonomous systems.

In principle, data mining is not specific to one type of media or data. Data mining should be
applicable to any kind of information repository. However, algorithms and approaches may
differ when applied to different types of data. Indeed, the challenges presented by different types
of data vary significantly. Data mining is being put into use and studied for databases, including
relational databases, object-relational databases and object oriented databases, data warehouses,
transactional databases, unstructured and semistructured repositories such as the World Wide
Web, advanced databases such as spatial databases, multimedia databases, time-series databases
and textual databases, and even flat files.

DATA MINING TASKS AND TECHNIQUES

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining task: descriptive tasks, which describe the general properties of the existing data, and predictive tasks, which attempt to make predictions based on inference from the available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

Discrimination: Data discrimination produces what are called discriminant rules and is basically
the comparison of the general features of objects between two classes referred to as the target
class and the contrasting class. The techniques used for data discrimination are very similar to
the techniques used for data characterization with the exception that data discrimination results
include comparative measures.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis (see the sketch below).
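
The support and confidence thresholds can be illustrated with a small pure-Python sketch over a toy set of market-basket transactions; real association-rule miners (e.g. Apriori) search the frequent item sets systematically, but the two measures behave exactly as computed here.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Conditional probability that the consequent items appear in a
    # transaction given that the antecedent items appear.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # about 0.67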

Classification: Classification analysis is the organization of data in given classes. Also known as
supervised classification, the classification uses given class labels to order the objects in the data
collection. Classification approaches normally use a training set where all objects are already
associated with known class labels. The classification algorithm learns from the training set and
builds a model. The model is used to classify new objects.
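
A minimal sketch of supervised classification with scikit-learn (assumed to be available) is shown below; the toy attribute values, the class labels and the choice of a decision tree are illustrative assumptions only.

from sklearn.tree import DecisionTreeClassifier

# Toy training set: every object already carries a known class label.
X_train = [[3.2, 1], [2.8, 0], [1.1, 1], [0.9, 0]]   # attribute values
y_train = ["pass", "pass", "fail", "fail"]           # class labels

# The algorithm learns from the training set and builds a model ...
model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)

# ... which is then used to classify new, unlabelled objects.
print(model.predict([[3.0, 1], [1.0, 0]]))   # expected: ['pass' 'fail']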

Prediction: Prediction has attracted considerable attention given the potential implications of
successful forecasting in a business context. There are two major types of predictions: one can
either try to predict some unavailable data values or pending trends, or predict a class label for
some data. The latter is tied to classification. Once a classification model is built based on a
training set, the class label of an object can be foreseen based on the attribute values of the object
and the attribute values of the classes.

Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, the class labels are unknown in advance and it is up to the clustering algorithm to discover acceptable classes, which is why clustering is also called unsupervised classification.

METADATA

Metadata is data that provides information about other data; in other words, it is "data about data". Metadata has various purposes. It helps users find relevant information and discover resources. It also helps organize electronic resources, provide digital identification, and archive and preserve resources. Metadata allows users to access resources by "allowing resources to be found by relevant criteria, identifying resources, bringing similar resources together, distinguishing dissimilar resources, and giving location information". Metadata about telecommunication activities, including Internet traffic, is very widely collected by various national governmental organizations; this data is used for traffic analysis and can be used for mass surveillance.

Metadata refers to the descriptive bits of information on your website hidden in the codes. It’s all
the things you don’t see when visiting a page, including link descriptions, titles of photos, upload
dates, and so much more. The reality is that when it comes to optimum marketing performance,
what we do not see is just as important as the beautiful imagery on our websites that we can see.
If your brand is active on social media or if you want to be present in search engine results, you
should care about how your metadata is representing and working for your brand. Metadata tells
search engines what your web page has to offer. By using metadata correctly, you can boost your
relevancy in search results. 

Metadata provides search engines with the most important information about your web pages,
including titles and descriptions. When someone searches Google for an image, your metadata is
what tells the search engine you have the photo in need. When optimizing your content for
search engines, be sure to pay attention to these details. Use titles that are accurate
representations of the content you’re sharing and add descriptions that you’d want a search
engine to know you’re providing.

Unique metadata standards exist for different disciplines (e.g., museum collections, digital audio files, websites, etc.). Describing the contents and context of data or data files increases their
usefulness. For example, a web page may include metadata specifying what software language
the page is written in (e.g., HTML), what tools were used to create it, what subjects the page is
about, and where to find more information about the subject. This metadata can automatically
improve the reader's experience and make it easier for users to find the web page online.

Metadata was traditionally used in the card catalogs of libraries until the 1980s, when libraries
converted their catalog data to digital databases. In the 2000s, as data and information were
increasingly stored digitally, this digital data was described using metadata standards. While the
metadata application is manifold, covering a large variety of fields, there are specialized and
well-accepted models to specify types of metadata. Many distinct types of metadata exist,
including descriptive metadata, structural metadata, administrative metadata, reference metadata
and statistical metadata.

Descriptive metadata is descriptive information about a resource, used for discovery and identification. It includes elements such as title, abstract, author and keywords, that is, the information used to search for and locate an object, such as its title, author, subjects, keywords and publisher.

Structural metadata is metadata about containers of data and indicates how compound objects
are put together, for example, how pages are ordered to form chapters. It describes the types,
versions, relationships and other characteristics of digital materials. Structural metadata describes
how the components of an object are organized.

Administrative metadata is information that helps manage a resource, such as the resource type, permissions, and when and how it was created. It covers technical information, including the file type and when and how the file was created. Two sub-types of administrative metadata are rights management metadata and preservation metadata: rights management metadata explains intellectual property rights, while preservation metadata contains the information needed to preserve and save a resource.

Reference metadata is information about the contents and quality of statistical data.

Statistical metadata, also called process data, may describe processes that collect, process, or
produce statistical data. Statistical data repositories have their own requirements for metadata in
order to describe not only the source and quality of the data but also what statistical processes
were used to create the data, which is of particular importance to the statistical community in
order to both validate and improve the process of statistical data production.

DATA DICTIONARY

The terms data dictionary and data repository indicate a more general software utility than a catalogue. A catalogue is closely coupled with the DBMS software. It provides the information stored in it to the user and the DBA, but it is mainly accessed by the various software modules of the DBMS itself, such as the DDL and DML compilers, the query optimiser, the transaction processor, report generators, and the constraint enforcer. A data dictionary, on the other hand, is a data structure that stores metadata, i.e., (structured) data about information. The software package for a stand-alone data dictionary or data repository may interact with the software modules of the DBMS, but it is mainly used by the designers, users and administrators of a computer system for information resource management.
The term can have one of several closely related meanings pertaining to databases and database management systems (DBMSs): a document describing a database or collection of databases; an integral component of a DBMS that is required to determine its structure; or a piece of middleware that extends or supplants the native data dictionary of a DBMS. In general, the data dictionary contains information about the names of all the database tables and their schemas; details about all the tables in the database, such as their owners, their security constraints and when they were created; physical information about the tables, such as where and how they are stored; and table constraints, such as primary key attributes and foreign key information.

Field Name      | Data Type | Field Size for Display | Description                  | Example
Employee Number | Integer   | 10                     | Unique ID of each employee   | 1645000001
Name            | Text      | 20                     | Name of the employee         | David Heston
Date of Birth   | Date/Time | 10                     | DOB of the employee          | 08/03/1995
Phone Number    | Integer   | 10                     | Phone number of the employee | 6583648648

TABLE 3: DATA DICTIONARY
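
For illustration only, the entries of Table 3 could be represented programmatically as plain data, for example as the following Python dictionary; a real active data dictionary is maintained by the DBMS itself (typically through its catalogue), not hand-written like this.

data_dictionary = {
    "Employee Number": {"data_type": "Integer",   "field_size": 10,
                        "description": "Unique ID of each employee",
                        "example": 1645000001},
    "Name":            {"data_type": "Text",      "field_size": 20,
                        "description": "Name of the employee",
                        "example": "David Heston"},
    "Date of Birth":   {"data_type": "Date/Time", "field_size": 10,
                        "description": "DOB of the employee",
                        "example": "08/03/1995"},
    "Phone Number":    {"data_type": "Integer",   "field_size": 10,
                        "description": "Phone number of the employee",
                        "example": 6583648648},
}

# Print a compact, human-readable view of each field definition.
for field, meta in data_dictionary.items():
    print(f"{field}: {meta['data_type']}({meta['field_size']}) - {meta['description']}")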


The different types of data dictionary are as follows:
ACTIVE DATA DICTIONARY
If the structure of the database or its specifications change at any point of time, it should be
reflected in the data dictionary. This is the responsibility of the database management system in
which the data dictionary resides.
So, the data dictionary is automatically updated by the database management system when any
changes are made in the database. This is known as an active data dictionary as it is self-
updating.
PASSIVE DATA DICTIONARY
A passive data dictionary is not as useful or easy to handle as an active data dictionary. It is maintained separately from the database whose contents it describes, which means that if the database is modified, the data dictionary is not automatically updated as it is with an active data dictionary.
So the passive data dictionary has to be updated manually to match the database. This needs careful handling, or else the database and the data dictionary go out of sync.
DATA CLEANING

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors
and inconsistencies from data in order to improve the quality of data. Data quality problems are
present in single data collections, such as files and databases, e.g., due to misspellings during
data entry, missing information or other invalid data. When multiple data sources need to be
integrated, e.g., in data warehouses, federated database systems or global web-based information
systems, the need for data cleaning increases significantly. This is because the sources often
contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations and elimination of duplicate information become necessary.

A data cleaning approach should satisfy several requirements. First of all, it should detect and
remove all major errors and inconsistencies both in individual data sources and when integrating
multiple sources. The approach should be supported by tools to limit manual inspection and
programming effort and be extensible to easily cover additional sources. Furthermore, data
cleaning should not be performed in isolation but together with schema-related data
transformations based on comprehensive metadata. Mapping functions for data cleaning and
other data transformations should be specified in a declarative way and be reusable for other data
sources as well as for query processing. Especially for data warehouses, a workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.

DATA CLEANING APPROACHES

In general, data cleaning involves several phases:

Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a
detailed data analysis is required. In addition to a manual inspection of the data or data samples,
analysis programs should be used to gain metadata about the data properties and detect data
quality problems.

Definition of transformation workflow and mapping rules: Depending on the number of data sources, their degree of heterogeneity and the "dirtiness" of the data, a large number of data transformation and cleaning steps may have to be executed. Sometimes a schema translation is used to map sources to a common data model; for data warehouses, a relational representation is typically used. Early data cleaning steps can correct single-source instance problems and prepare the data for integration. Later steps deal with schema/data integration and with cleaning multi-source instance problems, e.g., duplicates. For data warehousing, the control and data flow for these transformation and cleaning steps should be specified within a workflow that defines the ETL process.
specified by a declarative query and mapping language as far as possible, to enable automatic
generation of the transformation code. In addition, it should be possible to invoke user-written
cleaning code and special-purpose tools during a data transformation workflow. The
transformation steps may request user feedback on data instances for which they have no built-in
cleaning logic.

Verification: The correctness and effectiveness of a transformation workflow and the transformation definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the definitions if necessary. Multiple iterations of the analysis, design and verification steps may be needed, e.g., since some errors only become apparent after applying some transformations.

Transformation: Execution of the transformation steps, either by running the ETL workflow for loading and refreshing a data warehouse or while answering queries over multiple sources.

Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace the dirty data in the original sources, in order to give legacy applications the improved data too and to avoid redoing the cleaning work for future data extractions. For data warehousing, the cleaned data is available from the data staging area.
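
As a small illustration of single-source cleaning steps (missing values, formatting inconsistencies, invalid placeholders and duplicates), here is a hedged sketch using pandas (assumed to be installed); the column names and data are invented for the example.

import pandas as pd

# Hypothetical dirty single-source data; the columns are illustrative only.
dirty = pd.DataFrame({
    "name":  ["Ada ", "ada", "Grace", None],
    "dept":  ["CS", "CS", "Maths", "CS"],
    "email": ["ada@x.org", "ada@x.org", "grace@x.org", "none"],
})

cleaned = (
    dirty
    .dropna(subset=["name"])                                    # drop records missing key fields
    .assign(name=lambda d: d["name"].str.strip().str.title())   # fix formatting inconsistencies
    .replace({"email": {"none": None}})                         # map invalid placeholders to NULL
    .drop_duplicates(subset=["name", "email"])                  # eliminate duplicate records
)

print(cleaned)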

DATA FUSION

Data fusion is the process of integrating multiple data sources to produce more consistent,
accurate, and useful information than that provided by any individual data source. Data fusion
processes are often categorized as low, intermediate, or high, depending on the processing stage
at which fusion takes place. Low-level data fusion combines several sources of raw data to produce new raw data, with the expectation that the fused data is more informative and synthetic than the original inputs. Sensor fusion, for example, is also known as (multi-sensor) data fusion and is a subset of information fusion.

Data fusion is a multidisciplinary area that involves several fields, and it is difficult to establish a clear and strict classification. The concept of data fusion has its origins in the evolved capacity of humans and animals to incorporate information from multiple senses to improve their ability to survive. For example, a combination of sight, touch, smell and taste may indicate whether a substance is edible.
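
A very small example of low-level fusion is combining two noisy readings of the same quantity into a single, more reliable estimate. The inverse-variance weighting used in the Python sketch below is one common simple rule, chosen here purely for illustration.

def fuse_readings(readings):
    # readings: list of (value, variance) pairs from different sensors.
    weights = [1.0 / var for _, var in readings]
    fused_value = sum(w * v for (v, _), w in zip(readings, weights)) / sum(weights)
    fused_variance = 1.0 / sum(weights)
    return fused_value, fused_variance

# Two sensors measuring the same temperature with different noise levels;
# the fused estimate leans towards the more reliable (low-variance) sensor.
print(fuse_readings([(20.4, 0.5), (21.0, 2.0)]))   # roughly (20.52, 0.4)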

DATA WAREHOUSING

A data warehouse (DW) is a repository of an organization's electronically stored data. Data warehouses are designed to manage and store data, whereas business intelligence (BI) systems are designed to use that data to create reports and analyze information in order to provide strategic guidance to management. Metadata is an important tool in how data is stored in data warehouses. The purpose of a data warehouse is to house standardized, structured, consistent, integrated, correct, "cleaned" and timely data, extracted from various operational systems in an organization.

The extracted data are integrated in the data warehouse environment to provide an enterprise-
wide perspective. Data are structured in a way to serve the reporting and analytic requirements.
The design of structural metadata commonality, using a data modelling method such as entity-relationship diagramming, is important in any data warehouse development effort; these models detail the metadata for each piece of data in the data warehouse. An essential component of a data
warehouse/business intelligence system is the metadata and tools to manage and retrieve the
metadata.

INTELLIGENT AGENTS (IA)

Intelligent Agents (IA) are software programs that represent a new technology with the potential to become one of the most important tools of information technology in the twenty-first century. IAs can alleviate the most critical limitation of the Internet, information overload, and can facilitate electronic commerce. An intelligent agent can also be described as a software entity that conducts operations in the place of users or programs after sensing the environment, and that uses actuators to initiate actions in that environment.

Several names are used to describe intelligent agents: software agents, wizards, knowbots and softbots. The names tend to reflect the nature of the agent; the term agent is derived from the concept of agency, which means employing someone to act on behalf of the user. A computerised agent represents a person and interacts with others to accomplish a predefined task.

CHARACTERISTICS OF INTELLIGENT AGENTS

Intelligent agents have the following distinguishing characteristics: they have some level of autonomy that allows them to perform certain tasks on their own; they have a learning ability that enables them to learn even as tasks are carried out; and they can interact with other entities such as agents, humans, and systems.

In addition, intelligent agents can accommodate new rules incrementally, they exhibit goal-oriented behaviour, and they are knowledge-based: they use knowledge regarding communications, processes, and entities.

STRUCTURE OF INTELLIGENT AGENTS

The IA structure consists of three main parts: architecture, agent function, and agent program.

Architecture: This refers to the machinery or devices, consisting of actuators and sensors, on which the intelligent agent executes. Examples include a personal computer, a car, or a camera.

Agent function: This is a function in which actions are mapped from a certain percept sequence.
Percept sequence refers to a history of what the intelligent agent has perceived.

Agent program: This is an implementation or execution of the agent function. The agent function is produced through the agent program’s execution on the physical architecture.
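
The three parts fit together roughly as in the following Python sketch of a simple reflex thermostat agent; the scenario, class name and rules are hypothetical, and a more capable agent would also learn from its percept history.

class ThermostatAgent:
    """A minimal agent program implementing the agent function."""

    def __init__(self, target=22.0):
        self.target = target
        self.percepts = []   # the percept sequence: everything perceived so far

    def agent_function(self, percept):
        # Sense the environment (the architecture supplies the percept) ...
        self.percepts.append(percept)
        # ... and map the percept sequence to an action for the actuators.
        if percept < self.target - 1:
            return "turn_heater_on"
        if percept > self.target + 1:
            return "turn_heater_off"
        return "do_nothing"


agent = ThermostatAgent()
for temperature in (19.5, 21.8, 23.4):
    print(agent.agent_function(temperature))
# turn_heater_on, do_nothing, turn_heater_off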

NETWORK MANAGEMENT
Network management is the process of administering, managing, and operating a data network,
using a network management system. Modern network management systems use software and
hardware to constantly collect and analyze data and push out configuration changes for
improving performance, reliability, and security. Network management is complex and so
network administrators need all the help they can get.
A network management solution is their best bet to streamline network management. With a
myriad of network management solutions available in the market, it becomes even more
important to zero in on the right one. Comprehensive network management solutions are to be
preferred as they help reduce the dependency on multiple tools to manage networks.
NETWORK MANAGEMENT FUNCTIONS
Classically, network management consists of several functions, all of which are important to the
operation of the network:
Performance management deals with monitoring and managing the various parameters that
measure the performance of the network. Performance management is an essential function that
enables a service provider to provide quality-of-service guarantees to their clients and to ensure
that clients comply with the requirements imposed by the service provider. It is also needed to
provide input to other network management functions, in particular, fault management, when
anomalous conditions are detected in the network.
Fault management is the function responsible for detecting failures when they happen and
isolating the failed component. The network also needs to restore traffic that may be disrupted
due to the failure, but this is usually considered a separate function.
Configuration management deals with the set of functions associated with managing orderly
changes in a network. The basic function of managing the equipment in the network belongs to
this category. This includes tracking the equipment in the network and managing the
addition/removal of equipment, including any rerouting of traffic this may involve and the
management of software versions on the equipment.
Security management includes administrative functions such as authenticating users and setting
attributes such as read and write permissions on a per-user basis. From a security perspective, the
network is usually partitioned into domains, both horizontally and vertically. Vertical
partitioning implies that some users may be allowed to access only certain network elements and
not other network elements.
Accounting management is the function responsible for billing and for developing lifetime histories of the network components; the function is essentially the same in optical networks as in other networks.
NETWORK UTILITIES 
Network utilities are software utilities designed to analyze and configure various aspects
of computer networks. The majority of them originated on Unix systems, but several later ports
to other operating systems exist. The most common tools (found on most operating systems)
include:

ping, ping a host to check connectivity (reports packet loss and latency, uses ICMP).

traceroute shows the series of successive systems a packet goes through en route to its
destination on a network. It works by sending packets with sequential TTLs which generate
ICMP TTL-exceeded messages from the hosts the packet passes through.

nslookup, used to query a DNS server for DNS data (deprecated on Unix systems in favour
of host and dig; still the preferred tool on Microsoft Windows systems).

vnStat, a console tool for monitoring network traffic. vnStat keeps the traffic information in a log so that it can be analyzed by third-party tools.

netstat, displays network connections (both incoming and outgoing), routing tables, and a number of network interface and network protocol statistics. It is used for finding problems in the network and to determine the amount of traffic on the network as a performance measurement.

spray, which sprays numerous packets in the direction of a host and reports results

netsh, which allows local or remote configuration of network devices on Microsoft Windows.
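
Such utilities are often driven from scripts. The following hedged Python sketch simply invokes the system ping command through the subprocess module; it assumes a Unix-like ping that accepts -c for the packet count (Windows uses -n instead).

import subprocess

def ping(host, count=3):
    # Run the system ping utility and capture its output.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    # A zero return code conventionally means the host replied.
    return result.returncode == 0, result.stdout

reachable, output = ping("example.org")
print("reachable" if reachable else "unreachable")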

REFERENCES

Atzeni, P., Ceri, S., Paraboschi, S., & Torlone, R. (1999). Database systems: concepts, languages
& architectures (Vol. 1). London: McGraw-Hill.

Atkinson, M., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D. and Zdonik, S. (1989). The Object-Oriented Database System Manifesto.

Bagley, Philip (November 1968). "Extension of programming language concepts" (PDF). Philadelphia: University City Science Center. Archived (PDF) from the original on 30 November 2012.

Batini, C., Ceri, S., and Navathe, S. (1992). Database Design: An Entity-Relationship Approach.

Bernstein, P. (1976). "Synthesizing Third Normal Form Relations from Functional Dependencies", TODS, 1:4, December 1976.

Bancilhon, F., (1988): Object Oriented database systems, in Proc. 7th ACM SIGART/SIGMOD
Conf.

Bertino, E., Negri, M., Pelagatti, G., and Sbattella, L., (1992): Object-Oriented Query
Languages: The Notion and the Issues, IEEE Transactions on Knowledge and Data Engineering,
Vol. 4, No. 3.

Blasch, E., Steinberg, A., Das, S., Llinas, J., Chong, C.-Y., Kessler, O., Waltz, E., White, F. (2013). Revisiting the JDL Model for Information Exploitation. International Conference on Information Fusion.

Ciuonzo, D.; Salvo Rossi, P. (2014-02-01). "Decision Fusion With Unknown Sensor Detection
Probability". IEEE Signal Processing Letters. 21 (2): 208–212. arXiv:1312.2227.
Bibcode:2014ISPL...21..208C. doi:10.1109/LSP.2013.2295054. ISSN 1070-9908. S2CID
8761982.

Ciuonzo, D.; De Maio, A.; Salvo Rossi, P. (2015-09-01). "A Systematic Framework for
Composite Hypothesis Testing of Independent Bernoulli Trials". IEEE Signal Processing Letters.
22 (9): 1249–1253. Bibcode:2015ISPL...22.1249C. doi:10.1109/LSP.2015.2395811. ISSN 1070-
9908. S2CID 15503268.

Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks", CACM, 13:6, June 1970.

Kroenke, David M. and Auer, David J. (2008). Database Concepts. New Jersey: Prentice Hall.

Elmasri, R. and Navathe, S. (2003). Fundamentals of Database Systems. England.

Frawley, W. J., Piatetsky-Shapiro, G. and Matheus, C. J. (1991). Knowledge Discovery in Databases: An Overview. In G. Piatetsky-Shapiro et al. (eds.), Knowledge Discovery in Databases. AAAI/MIT Press.

Gray, J. N. (1979). "Notes on Data Base Operating Systems", in Operating Systems: An Advanced Course, Springer-Verlag, New York, pp. 393-481.

Guiry, John J.; van de Ven, Pepijn; Nelson, John (2014-03-21). "Multi-Sensor Fusion for
Enhanced Contextual Awareness of Everyday Activities with Ubiquitous Devices". Sensors. 14
(3): 5687–5701. doi:10.3390/s140305687. PMC 4004015. PMID 24662406.

Hardeep Singh Damesha (2015). Object Oriented Database Management Systems: Concepts, Advantages, Limitations and Comparative Study with Relational Database Management Systems.

Joshi, V., Rajamani, N., Takayuki, K., Prathapaneni, Subramaniam, L. V. (2013). Information
Fusion Based Learning for Frugal Traffic State Sensing. Proceedings of the Twenty-Third
International Joint Conference on Artificial Intelligence.

Kim, W. (1988). "A Foundation for Object-Oriented Databases", MCC Technical Report No. ACA-ST-248-88.

Klein, Lawrence A. (2004). Sensor and data fusion: A tool for information assessment and
decision making. SPIE Press. p. 51. ISBN 978-0-8194-5435-5.

Liggins, Martin E.; Hall, David L.; Llinas, James (2008). Multisensor Data Fusion, Second
Edition: Theory and Practice (Multisensor Data Fusion). CRC. ISBN 978-1-4200-5308-1.

Moss, Elliot, Nested Transactions : An Approach to Reliable Distributed Computing, The MIT
Press, Cambridge, Massachusetts, 1985, pp.31-38.

Piateski, G., & Frawley, W. (1991). Knowledge Discovery in Databases. MIT Press.

Piatetsky-Shapiro, G. (1996). Advances in Knowledge Discovery and Data Mining (Vol. 21). U. M. Fayyad, P. Smyth, & R. Uthurusamy (Eds.). Menlo Park: AAAI Press.

Rouse, Margaret (July 2014). "Metadata". WhatIs. TechTarget. Archived from the original on 29
October 2015.

Singhal, M. and Shivaratri, N., Advanced Concepts in Operating Systems, McGraw-Hill, 1994,
pp. 302-303, pp. 334-335, p. 337

Snidaro, Lauro, et al. (2016). Context-Enhanced Information Fusion: Boosting Real-World Performance with Domain Knowledge. Switzerland: Springer. ISBN 978-3-319-28971-7.

Zaïane, O. R. (1999). Principles of Knowledge Discovery in Databases. Department of Computing Science, University of Alberta, 20.

Zeng, Marcia (2004). "Metadata Types and Functions". NISO. Archived from the original on 7
October 2016. Retrieved 5 October 2016.

