4. Database Management Systems
Database Management System or DBMS in short refers to the technology of storing and retrieving
users' data with utmost efficiency along with appropriate security measures.
Database is a collection of related data and data is a collection of facts and figures that can be
processed to produce information.
Mostly data represents recordable facts. Data aids in producing information, which is based on
facts. For example, if we have data about marks obtained by all students, we can then conclude
about toppers and average marks.
A database management system stores data in such a way that it becomes easier to retrieve,
manipulate, and produce information.
Characteristics
Traditionally, data was organized in file formats. DBMS was a new concept then, and all the
research was done to make it overcome the deficiencies in traditional style of data management.
A modern DBMS has the following characteristics −
• Real-world entity − A modern DBMS is more realistic and uses real-world entities to design its
architecture. It uses the behavior and attributes too. For example, a school database may use
students as an entity and their age as an attribute.
• Relation-based tables − DBMS allows entities and relations among them to form tables. A user
can understand the architecture of a database just by looking at the table names.
• Isolation of data and application − A database system is entirely different from its data. A
database is an active entity, whereas data is said to be passive, on which the database works and
organizes. DBMS also stores metadata, which is data about data, to ease its own process.
• Less redundancy − DBMS follows the rules of normalization, which splits a relation when any of
its attributes has redundant values. Normalization is a mathematically rich and scientific
process that reduces data redundancy.
• Consistency − Consistency is a state where every relation in a database remains consistent.
There exist methods and techniques that can detect any attempt to leave the database in an
inconsistent state. A DBMS can provide greater consistency as compared to earlier forms of data storing
applications like file-processing systems.
• Query Language − DBMS is equipped with query language, which makes it more efficient to
retrieve and manipulate data. A user can apply as many and as varied filtering options as
required to retrieve a set of data. Traditionally this was not possible where a file-processing system
was used.
• ACID Properties − DBMS follows the concepts
of Atomicity, Consistency, Isolation, and Durability (normally shortened as ACID). These concepts
are applied on transactions, which manipulate data in a database. ACID properties help the
database stay healthy in multi-transactional environments and in case of failure.
• Multiuser and Concurrent Access − A DBMS supports a multi-user environment and allows users to
access and manipulate data in parallel. Though there are restrictions on transactions when users
attempt to handle the same data item, users are always unaware of them.
• Multiple views − DBMS offers multiple views for different users. A user who is in the Sales
department will have a different view of database than a
person working in the Production department. This feature enables the users to have a concentrated
view of the database according to their requirements.
• Security − Features like multiple views offer security to some extent where users are unable to
access data of other users and departments. DBMS offers methods to impose constraints while
entering data into the database and retrieving the same at a later stage. DBMS offers many
different levels of security features, which enables multiple users to have different views with
different features. For example, a user in the Sales department cannot see the data that belongs to
the Purchase department. Additionally, how much data of the Sales department should be displayed to
the user can also be managed. Since a DBMS is not saved on the disk in the same way as traditional
file systems, it is very hard for miscreants to break into it. A sketch of this view-based security follows.
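The view-plus-permission idea above can be sketched in SQL. The table, view, and user names here (ORDERS, SALES_ORDERS, sales_user) are illustrative only, not taken from the text:

-- Hypothetical base table holding rows for several departments
CREATE TABLE ORDERS (
    ORDER_ID   INT PRIMARY KEY,
    DEPARTMENT VARCHAR(20),
    AMOUNT     DECIMAL(10,2)
);

-- A view exposing only the Sales department's rows
CREATE VIEW SALES_ORDERS AS
    SELECT ORDER_ID, AMOUNT
    FROM ORDERS
    WHERE DEPARTMENT = 'Sales';

-- The Sales user may query the view but receives no privilege on the base table
GRANT SELECT ON SALES_ORDERS TO sales_user;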
Users
A typical DBMS has users with different rights and permissions who use it for different purposes.
Some users retrieve data and some back it up. The users of a DBMS can be broadly categorized as
follows −
• Administrators − Administrators maintain the DBMS and are responsible for administrating the
database. They are responsible for looking after its usage and by whom it should be used. They create
access profiles for users and apply limitations to maintain isolation and enforce security.
Administrators
also look after DBMS resources like system license, required tools, and other software and
hardware related maintenance.
• Designers − Designers are the group of people who actually work on the designing part of the
database. They keep a close watch on what data should be kept and in what format. They identify
and design the whole set of entities, relations, constraints, and views.
• End Users − End users are those who actually reap the benefits of having a DBMS. End users can
range from simple viewers who pay attention to the logs or market rates to sophisticated users
such as business analysts.
DBMS - Architecture
The design of a DBMS depends on its architecture. It can be centralized or decentralized or
hierarchical. The architecture of a DBMS can be seen as either single tier or multi-tier. An n-tier
architecture divides the whole system into related but independent n modules, which can be
independently modified, altered, changed, or replaced.
In 1-tier architecture, the DBMS is the only entity where the user directly sits on the DBMS and uses
it. Any changes done here will directly be done on the DBMS itself. It does not provide handy tools
for end-users. Database designers and programmers normally prefer to use single-tier
architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which the DBMS can
be accessed. Programmers use 2-tier architecture where they access the DBMS by means of an
application. Here the application tier is entirely independent of the database in terms of operation,
design, and programming.
3-tier Architecture
A 3-tier architecture separates its tiers from each other based on the complexity of the users and
how they use the data present in the database. It is the most widely used architecture to design a
DBMS.
• Database (Data) Tier − At this tier, the database resides along with its query processing
languages. We also have the relations that define the data and their constraints at this level.
• Application (Middle) Tier − At this tier reside the application server and the programs that
access the database. For a user, this application tier presents an abstracted view of the database.
End-users are unaware of any existence of the database beyond the application. At the other end,
the database tier is not aware of any other user beyond the application tier. Hence, the application
layer sits in the middle and acts as a mediator between the end-user and the database.
• User (Presentation) Tier − End-users operate on this tier and they know nothing about any
existence of the database beyond this layer. At this layer, multiple views of the database can be
provided by the application. All views are generated by
applications that reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
Data Models
Data models define how the logical structure of a database is modeled. Data Models are
fundamental entities to introduce abstraction in a DBMS. Data models define how data is
connected to each other and how they are processed and stored inside the system.
The earliest data models were flat data models, where all the data used was kept in the same
plane. Earlier data models were not very scientific; hence they were prone to introducing lots of
duplication and update anomalies.
Entity-Relationship Model
Entity-Relationship (ER) Model is based on the notion of real-world entities and relationships
among them. While formulating real-world scenario into the database model, the ER Model
creates entity set, relationship set, general attributes and constraints.
ER Model is best used for the conceptual design of a database.
ER Model is based on −
o Entities and their attributes.
o Relationships among entities. These concepts are explained below.
o Entity − An entity in an ER Model is a real-world entity having properties called
attributes. Every attribute is defined by its set of values, called a domain. For example, in a
school database, a student is considered as an entity. A student has various attributes like
name, age, class, etc.
o Relationship − The logical association among entities is called relationship.
Relationships are mapped with entities in various ways. Mapping cardinalities define
the number of associations between two entities.
Mapping cardinalities −
▪ one to one
▪ one to many
▪ many to one
▪ many to many
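As a concrete illustration (not part of the original text), the student entity and its attributes described above could later be realized at the relational level as a table such as:

CREATE TABLE STUDENT (
    STUDENT_ID INT PRIMARY KEY,        -- identifying attribute
    NAME       VARCHAR(50),            -- simple attribute with an alphabetic domain
    AGE        INT CHECK (AGE >= 0),   -- the domain excludes negative ages
    CLASS      VARCHAR(10)
);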
Relational Model
The most popular data model in DBMS is the Relational Model. It is a more scientific model than
the others. This model is based on first-order predicate logic and defines a table as an n-ary relation.
Data Schemas
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are associated.
It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a descriptive
detail of the database, which can be depicted by means of schema diagrams. It’s the database
designers who design the schema to help programmers understand the database and make it
useful.
It is important that we distinguish these two terms individually. A database schema is the skeleton of the
database. It is designed when the database doesn't exist at all. Once the database is operational,
it is very difficult to make any changes to it. A database schema does not contain any data or
information.
A database instance is a state of operational database with data at any given time. It contains a
snapshot of the database. Database instances tend to change with time. A DBMS ensures that its
every instance (state) is in a valid state, by diligently
following all the validations, constraints, and conditions that the database designers have imposed.
• Mapping is used to transform the request and response between various database levels of
architecture.
• Mapping is not good for small DBMSs because it takes more time.
1. Internal Level
o The internal level has an internal schema which describes the physical storage structure of
the database.
o The internal schema is also known as a physical schema.
o It uses the physical data model. It is used to define how the data will be stored in a block.
o The physical level is used to describe complex low-level data structures in detail.
2. Conceptual Level
o The conceptual schema describes the design of a database at the conceptual level.
Conceptual level is also known as logical level.
o The conceptual schema describes the structure of the whole database.
o The conceptual level describes what data are to be stored in the database and also
describes what relationship exists among those data.
o In the conceptual level, internal details such as an implementation of the data structure are
hidden.
o Programmers and database administrators work at this level.
3. External Level
o At the external level, a database contains several schemas, sometimes called
subschemas. A subschema is used to describe a different view of the database.
o An external schema is also known as view schema.
o Each view schema describes the database part that a particular user group is interested in and
hides the remaining database from that user group.
o The view schema describes the end user interaction with database systems.
Data Independence
If a database system is not multi-layered, then it becomes difficult to make any changes in the
database system. Database systems are designed in multiple layers, as we learnt earlier.
A database system normally contains a lot of data in addition to users' data. For example, it stores
data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to
modify or update a set of metadata once it is stored in the database. But as a DBMS expands, it
needs to change over time to satisfy the requirements of the users. If the entire data were dependent,
it would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one layer, it does
not affect the data at another level. The data at each layer is independent but mapped to the other layers.
Logical Data Independence
Logical data is data about the database; that is, it stores information about how data is managed
inside, for example, a table (relation) stored in the database and all the constraints applied on that
relation.
Logical data independence is a mechanism that keeps the logical structure liberated from the actual data
stored on the disk. If we make some changes to the table format, it should not change the data residing
on the disk.
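A minimal sketch of logical data independence, using hypothetical table and view names: adding a column changes the logical schema, but an existing view and the applications that use it keep working unchanged.

CREATE TABLE EMPLOYEE (
    EMP_ID   INT PRIMARY KEY,
    EMP_NAME VARCHAR(50),
    SALARY   DECIMAL(10,2)
);

CREATE VIEW EMP_NAMES AS
    SELECT EMP_ID, EMP_NAME FROM EMPLOYEE;

-- A change to the logical schema: EMP_NAMES and its users are unaffected
ALTER TABLE EMPLOYEE ADD DEPT_NO INT;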
Database Language
• A DBMS has appropriate languages and interfaces to express database queries and updates.
• Database languages can be used to read, store and update the data in the database.
DBMS Interface
A database management system (DBMS) interface is a user interface which allows for the ability
to input queries to a database without using the query language
itself. A DBMS interface could be a web client, a local client that runs on a desktop computer, or
even a mobile app.
A database management system stores data and responds to queries using a query language,
such as SQL. A DBMS interface provides a way to query data without having to use the query
language, which can be complicated.
The typical way to do this is to create some kind of form that shows what kinds of queries users
can make. Web-based forms are increasingly common with the popularity of MySQL, but the
traditional way to do it has been local desktop apps. It is also possible to create mobile
applications. These interfaces provide a friendlier way of accessing data rather than just using the
command line.
User-friendly interfaces provided by a DBMS may include the following:
1. Menu-Based Interfaces for Web Clients or Browsing –
These interfaces present the user with lists of options (called menus) that lead the user through
the formation of a request. The basic advantage of using menus is that they remove the burden of
remembering specific commands and the syntax of any query language; instead, the query is
composed step by step by picking options from menus shown by the system. Pull-down menus are a
very popular technique in Web-based interfaces. They are also often used in browsing interfaces,
which allow a user to look through the contents of a database in an exploratory and unstructured
manner.
2. Forms-Based Interfaces –
A forms-based interface displays a form to each user. Users can fill out all of the form entries to
insert new data, or they can fill out only certain entries, in which case the DBMS will retrieve
matching data for the remaining entries. These forms are usually designed, created, and
programmed for users who have little technical expertise.
Many DBMSs have forms specification languages which are special languages that help specify
such forms.
Example: SQL*Forms is a forms-based language that specifies queries using a form designed in
conjunction with the relational database schema.
3. Graphical User Interface –
A GUI typically displays a schema to the user in diagrammatic form. The user then can specify a
query by manipulating the diagram. In many cases, GUIs utilize both menus and forms. Most GUIs
use a pointing device, such as a mouse, to pick certain parts of the displayed schema diagram.
4. Natural language Interfaces –
These interfaces accept requests written in English or some other language and attempt to
understand them. A Natural language interface has its own schema, which is similar to the
database conceptual schema as well as a dictionary of important words.
The natural language interface refers to the words in its schema as well as to the set of standard
words in a dictionary to interpret the request. If the interpretation is successful, the interface
generates a high-level query corresponding to the natural language and submits it to the DBMS for
processing; otherwise, a dialogue is started with the user to clarify any provided condition or
request. The main disadvantage is that the capabilities of this type of interface are not very
advanced.
5. Speech Input and Output –
There is limited use of speech, whether as an input query or as an answer to a question or the result of a
request, but it is becoming commonplace. Applications with limited vocabularies, such as inquiries for
telephone directories, flight arrival/departure, and bank account information, allow speech for
input and output to enable ordinary folks to access this information.
The speech input is detected using a set of predefined words and used to set up the parameters that are
supplied to the queries. For output, a similar conversion from text or numbers into speech takes
place.
In a client/server DBMS architecture, specialized servers with specialized functions may include:
➢ Print server
➢ File server
➢ DBMS server
➢ Web server
➢ Email server
➢ Clients are able to access the specialized servers as needed
➢ A client program may perhaps connect to several DBMSs sometimes called the data
sources.
➢ In general, data sources can be files or other non-DBMS software that manages data.
Other variations of clients are possible; for example, in some object DBMSs, more functionality is
transferred to clients, including data dictionary functions, optimization, and recovery across
multiple servers.
e) Three-tier architecture can enhance security:
i. The database server is only accessible via the middle tier.
ii. Clients cannot directly access the database server.
Classification of DBMSs:
• Based on the data model used
• Traditional- Network, Relational, Hierarchical.
• Emerging- Object-oriented and Object-relational.
• Other classifications
• Single-user (typically utilized with personal computers) vs. multi-user (most DBMSs).
• Centralized (utilizes a single computer with one database) vs. distributed (uses multiple
computers and multiple databases).
Data Modelling
Data modeling (data modelling) is the process of creating a data model for the data to be stored
in a Database. This data model is a conceptual representation of Data objects, the associations
between different data objects and the rules. Data modeling helps in the visual representation of
data and enforces business rules, regulatory compliances, and government policies on the data.
Data models ensure consistency in naming conventions, default values, semantics, and security, while
ensuring the quality of the data.
Data Model
Data model is defined as an abstract model that organizes data description, data semantics and
consistency constraints of data. A data model emphasizes what data is needed and how it
should be organized, rather than what operations will be performed on the data. A data model is like an
architect's building plan: it helps to build conceptual models and set relationships between
data items.
The two types of Data Models techniques are
1. Entity Relationship (E-R) Model
2. UML (Unified Modelling Language)
Why use Data Model?
The primary goals of using a data model are:
• Ensures that all data objects required by the database are accurately represented.
Omission of data will lead to the creation of faulty reports and produce incorrect results.
• A data model helps design the database at the conceptual, physical and logical levels.
• Data Model structure helps to define the relational tables, primary and foreign keys and
stored procedures.
• It provides a clear picture of the base data and can be used by database developers to
create a physical database.
• It is also helpful to identify missing and redundant data.
• Though the initial creation of a data model is labor- and time-consuming, in the long run it
makes your IT infrastructure upgrades and maintenance cheaper and faster.
Data model example:
1. Customer and Product are two entities. Customer number and name are attributes of the
Customer entity
2. Product name and price are attributes of product entity
3. Sale is the relationship between the customer and product
Physical Data Model
• The physical data model describes the data needed for a single project or application, though it
may be integrated with other physical data models based on project scope.
• The data model contains relationships between tables, addressing the cardinality and
nullability of the relationships.
• Developed for a specific version of a DBMS, location, data storage or technology to be used
in the project.
• Columns should have exact datatypes, lengths assigned and default values.
• Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are
defined.
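A physical-level sketch of the Customer-Product-Sale example above. The column names, datatypes, keys, and the index are assumptions added for illustration, following the bullet points:

CREATE TABLE CUSTOMER (
    CUSTOMER_NO   INT PRIMARY KEY,               -- primary key
    CUSTOMER_NAME VARCHAR(50) NOT NULL
);

CREATE TABLE PRODUCT (
    PRODUCT_ID   INT PRIMARY KEY,
    PRODUCT_NAME VARCHAR(50) NOT NULL,
    PRICE        DECIMAL(10,2) DEFAULT 0.00      -- exact datatype, length, default value
);

-- SALE implements the relationship between CUSTOMER and PRODUCT
CREATE TABLE SALE (
    SALE_ID     INT PRIMARY KEY,
    CUSTOMER_NO INT NOT NULL REFERENCES CUSTOMER(CUSTOMER_NO),  -- foreign key
    PRODUCT_ID  INT NOT NULL REFERENCES PRODUCT(PRODUCT_ID),    -- foreign key
    SALE_DATE   DATE
);

CREATE INDEX IDX_SALE_CUSTOMER ON SALE(CUSTOMER_NO);            -- index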
Advantages and Disadvantages of Data Model:
Advantages of Data model:
1. The main goal of designing a data model is to make certain that data objects offered by the
functional team are represented accurately.
2. The data model should be detailed enough to be used for building the physical database.
3. The information in the data model can be used for defining the relationship between tables,
primary and foreign keys, and stored procedures.
4. The data model helps the business to communicate within and across organizations.
5. The data model helps to document data mappings in the ETL process.
6. It helps to recognize correct sources of data to populate the model.
Component of ER Diagram
ER Diagram
ER Model is represented by means of an ER diagram. Any object, for example, entities, attributes
of an entity, relationship sets, and attributes of relationship sets, can be represented with the help
of an ER diagram.
Entity
An entity can be a real-world object, either animate or inanimate, that can be easily identifiable.
For example, in a school database, students, teachers, classes, and courses offered can be
considered as entities. All these entities have some attributes or properties that give them their
identity.
An entity set is a collection of similar types of entities. An entity set may contain entities with
attributes sharing similar values. For example, a Students set may contain all the students of a
school; likewise a Teachers set may contain all the teachers of a school from all faculties. Entity
sets need not be disjoint.
An entity may be any object, class, person or place. In the ER diagram, an entity can be represented
as a rectangle.
Consider an organization as an example: manager, product, employee, department, etc. can be
taken as entities.
a. Weak Entity
An entity that depends on another entity is called a weak entity. The weak entity doesn't contain any
key attribute of its own. The weak entity is represented by a double rectangle.
Attributes
Entities are represented by means of their properties, called attributes. All attributes have values.
For example, a student entity may have name, class, and age as attributes.
There exists a domain or range of values that can be assigned to attributes. For example, a
student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be
negative, etc.
Attributes are the properties of entities. Attributes are represented by means of ellipses. Every
ellipse represents one attribute and is directly connected to its entity (rectangle).
If the attributes are composite, they are further divided in a tree-like structure. Every node is then
connected to its attribute. That is, composite attributes are represented by ellipses that are
connected with an ellipse.
Types of Attributes
1. Simple attribute − Simple attributes are atomic values, which cannot be divided further. For
example, a student's phone number is an atomic value of 10 digits.
2. Composite attribute − Composite attributes are made of more than one simple attribute.
For example, a student's complete name may have first name and last name.
3. Derived attribute − Derived attributes are the attributes that do not exist in the physical
database, but their values are derived from other attributes present in the database. For
example, the average salary in a department should not be saved directly in the database;
instead, it can be derived. For another example, age can be derived from date_of_birth.
4. Single-value attribute − Single-value attributes contain a single value. For example −
Social_Security_Number.
5. Multi-value attribute − Multi-value attributes may contain more than one value. For
example, a person can have more than one phone number, email address, etc.
These attribute types can come together in a way like −
o One to one
o One to many
o Many to many
An entity-relationship diagram can be used to depict the entities, their attributes and the
relationship between the entities in a diagrammatic way.
• Normalization: This is the process of optimizing the database structure. Normalization
simplifies the database design to avoid redundancy and confusion. The different normal forms
are as follows:
• First normal form
• Second normal form
• Third normal form
• Boyce-Codd normal form
• Fifth normal form
By applying a set of rules, a table is normalized into the above normal forms in a linearly
progressive fashion. The efficiency of the design gets better with each higher degree of
normalization.
Relationship
The association among entities is called a relationship. For example, an employee works at a
department, a student enrolls in a course. Here, Works_at and Enrolls are called relationships.
A relationship is used to describe the relation between entities. Diamond or rhombus is used to
represent the relationship.
Relationship Set
A set of relationships of similar type is called a relationship set. Like entities, a relationship too can
have attributes. These attributes are called descriptive attributes.
Degree of Relationship
The number of participating entities in a relationship defines the degree of the relationship.
• Binary = degree 2
• Ternary = degree 3
• n-ary = degree n
Mapping Cardinalities
Cardinality defines the number of entities in one entity set, which can be associated with the
number of entities of other set via relationship set.
• One-to-one − One entity from entity set A can be associated with at most one entity of
entity set B and vice versa.
• One-to-many − One entity from entity set A can be associated with more than one entity of
entity set B; however, an entity from entity set B can be associated with at most one entity.
• Many-to-one − More than one entity from entity set A can be associated with at most one
entity of entity set B, however an entity from entity set B can be associated with more than
one entity from entity set A.
• Many-to-many − One entity from A can be associated with more than one entity from B and
vice versa.
Notation of ER diagram
A database can be represented using notations. In an ER diagram, many notations are used to
express the cardinality.
Relational Model Concepts
Relational instance: In the relational database system, the relational instance is represented by a
finite set of tuples. Relation instances do not have duplicate tuples.
Relational schema: A relational schema contains the name of the relation and the names of all columns
or attributes.
Relational key: A relational key is a set of one or more attributes that can identify a row in
the relation uniquely.
➢ In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the attributes.
➢ The instance of schema STUDENT has 5 tuples.
➢ t3 = <Laxman, 33289, 8583287182, Gurugram, 20>
Constraints on Relational database model
On modeling the design of the relational database, we can put some restrictions like what values
are allowed to be inserted in the relation, what kind of modifications and deletions are allowed in
the relation. These are the restrictions we impose on the relational database.
In models like the ER model, we did not have such features.
Constraints in the databases can be categorized into 3 main categories:
1. Constraints that are applied in the data model are called implicit constraints.
2. Constraints that are directly applied in the schemas of the data model, by specifying them in
the DDL (Data Definition Language), are called schema-based constraints or
explicit constraints.
3. Constraints that cannot be directly applied in the schemas of the data model are called
application-based or semantic constraints.
Here we will deal with implicit constraints.
Key Constraints
Explanation:
In the above table, EID is the primary key, and the first and the last tuples have the same value in EID, i.e.,
01, so the key constraint is violated.
6. Entity Integrity Constraints:
1. The entity integrity constraint says that no primary key can take a NULL value, since using the primary
key we identify each tuple uniquely in a relation.
Explanation:
In the above relation, EID is made the primary key, and the primary key can't take NULL values, but in
the third tuple the primary key is NULL, so it is violating the entity integrity constraint.
7. Referential Integrity Constraints:
1. The referential integrity constraint is specified between two relations or tables and is used
to maintain the consistency among the tuples of the two relations.
2. This constraint is enforced through a foreign key: when an attribute in the foreign key of
relation R1 has the same domain(s) as the primary key of relation R2, then the foreign
key of R1 is said to reference or refer to the primary key of relation R2.
3. The values of the foreign key in a tuple of relation R1 can either take the values of the
primary key for some tuple in relation R2, or can take NULL values, but cannot be anything else.
Explanation:
In the above example, DNO of the first relation is the foreign key, and DNO in the second relation is the
primary key. DNO = 22 in the foreign key of the first table is not allowed, since DNO = 22
is not defined in the primary key of the second relation. Therefore, the referential integrity
constraint is violated here.
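The three kinds of violation discussed above can be demonstrated with a pair of hypothetical tables; the last three INSERT statements would each be rejected by the DBMS for the reason noted in the comment:

CREATE TABLE DEPT (
    DNO   INT PRIMARY KEY,
    DNAME VARCHAR(30)
);

CREATE TABLE EMP (
    EID  INT PRIMARY KEY,          -- key constraint and entity integrity: unique, NOT NULL
    NAME VARCHAR(30),
    DNO  INT REFERENCES DEPT(DNO)  -- referential integrity: must match a DEPT row or be NULL
);

INSERT INTO DEPT VALUES (10, 'HR');
INSERT INTO EMP  VALUES (1, 'Asha', 10);

INSERT INTO EMP VALUES (1, 'Ravi', 10);     -- rejected: duplicate EID violates the key constraint
INSERT INTO EMP VALUES (NULL, 'Mina', 10);  -- rejected: NULL primary key violates entity integrity
INSERT INTO EMP VALUES (2, 'John', 22);     -- rejected: DNO 22 is not in DEPT, violating referential integrity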
Relational Language
Relational language is a type of programming language in which the programming logic is
composed of relations and the output is computed based on the query applied. Relational
language works on relations among data and entities to compute a result. Relational language
includes features from and is similar to functional programming language.
Relational language is primarily based on the relational data model, which governs relational
database software and systems. In the relational model’s programming context, the procedures
are replaced by the relations among values. These relations are applied over the processed
arguments or values to
construct an output. The resulting output is mainly in the form of an argument or property. The
side effects emerging from this programming logic are also handled by the procedures or
relations.
1. A specific characteristic that bears the same real-world concept may appear in more than
one relation with the same or a different name. For example, the Employee Id (EmpId) of the
Employees relation is represented in Vouchers as AuthBy and PrepBy.
2. A specific real-world concept that appears more than once in a relation should be
represented by different names. For example, an employee is represented as a subordinate
or junior by using EmpId and as a superior or senior by using SuperId in the employees
relation.
3. The integrity constraints that are specified on a database schema shall apply to every
database state of that schema.
A referencing attribute value that causes a violation can be set to NULL or changed to reference another
valid tuple. Notice that if a referencing attribute that causes a violation is part of the primary key, it
cannot be set to NULL; otherwise, it would violate entity integrity.
Combinations of these three options are also possible. For example, to avoid having operation 3
cause a violation, the DBMS may automatically delete all tuples from WORKS_ON and DEPENDENT
with Essn = '333445555'. Tuples in EMPLOYEE with Super_ssn = '333445555' and the tuple in
DEPARTMENT with Mgr_ssn = '333445555' can have their Super_ssn and Mgr_ssn values changed to
other valid values or to NULL.
Although it may make sense to automatically delete the WORKS_ON and DEPENDENT tuples that
refer to an EMPLOYEE tuple, it may not make sense to delete other EMPLOYEE tuples or a
DEPARTMENT tuple.
In general, when a referential integrity constraint is specified in the DDL, the DBMS will allow the
database designer to specify which of the options applies in case of a violation of the constraint.
We discuss how to specify these options in the SQL DDL in Chapter 4.
The Transaction Concept
A database application program running against a relational database typically executes one or
more transactions. A transaction is an executing program that includes some database
operations, such as reading from the database, or applying insertions, deletions, or updates to the
database. At the end of the transaction, it must leave the database in a valid or consistent state
that satisfies all the constraints spec-ified on the database schema. A single transaction may
involve any number of retrieval operations (to be discussed as part of relational algebra and
calculus in Chapter 6, and as a part of the language SQL in Chapters 4 and 5), and any number of
update operations. These retrievals and updates will together form an atomic unit of work against
the database. For example, a transaction to apply a bank with-drawal will typically read the user
account record, check if there is a sufficient bal-ance, and then update the record by the
withdrawal amount.
A large number of commercial applications running against relational databases in online
transaction processing (OLTP) systems are executing transactions at rates that reach several
hundred per second.
Relational Algebra
Relational algebra is a procedural query language. It gives a step-by-step process to obtain the
result of the query. It uses operators to perform queries.
1. Select Operation:
1. Notation: σ p(r)
Where:
σ is the selection operator,
r is the relation, and
p is a propositional logic formula (the selection predicate), which may use connectors like AND, OR and NOT
and relational operators such as =, ≠, ≥, <, >, ≤.
For example, consider a LOAN relation with attributes BRANCH_NAME, LOAN_NO, and AMOUNT
(for instance, a tuple such as Perryride, L-15, 1500).
Input:
1. σ BRANCH_NAME="Perryride" (LOAN)
Output: the tuples of LOAN whose BRANCH_NAME is Perryride.
2. Project Operation:
o This operation shows the list of those attributes that we wish to appear in the result. The rest of the
attributes are eliminated from the table.
o It is denoted by ∏.
1. Notation: ∏ A1, A2, ..., An (r)
Where
A1, A2, ..., An are attribute names of relation r.
Example: a CUSTOMER relation with attributes NAME, STREET, and CITY (for instance, a tuple
such as Hays, Main, Harrison).
Input:
1. ∏ NAME, CITY (CUSTOMER)
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
3. Union Operation:
• Suppose there are two relations R and S. The union operation contains all the
tuples that are either in R or S or both in R and S.
• It eliminates duplicate tuples. It is denoted by ∪.
1. Notation: R ∪ S
DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input:
1. ∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
• Suppose there are two relations R and S. The set intersection operation
contains all tuples that are in both R and S.
• It is denoted by ∩.
1. Notation: R ∩ S
Example: Using the above DEPOSITOR table and BORROW table
Input:
1. ∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Smith
Jones
5. Set Difference:
• Suppose there are two relations R and S. The set difference operation contains
all tuples that are in R but not in S.
• It is denoted by the minus sign (−).
1. Notation: R − S
Example: Using the above DEPOSITOR table and BORROW table
Input:
1. ∏ CUSTOMER_NAME (BORROW) − ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
6. Cartesian product
• The Cartesian product is used to combine each row in one table with each row in the other
table. It is also known as a cross product.
• It is denoted by X.
1. Notation: E X D
Example:
EMPLOYEE
EMP_ID EMP_NAME DEPT_NO
1 Smith A
2 Harry C
3 John B
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
1. EMPLOYEE X DEPARTMENT
Output:
1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal
7. Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename the STUDENT relation to STUDENT1.
1. ρ (STUDENT1, STUDENT)
Note: Apart from these common operations, relational algebra can also be used to express join operations.
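Each of the relational algebra operations above has a close SQL counterpart. A rough correspondence, reusing the relations from the examples (column lists as assumed there; EXCEPT is called MINUS in Oracle):

-- Select (σ BRANCH_NAME="Perryride" (LOAN))
SELECT * FROM LOAN WHERE BRANCH_NAME = 'Perryride';

-- Project (∏ NAME, CITY (CUSTOMER)); DISTINCT mirrors duplicate elimination
SELECT DISTINCT NAME, CITY FROM CUSTOMER;

-- Union, Intersection, Difference
SELECT CUSTOMER_NAME FROM BORROW UNION     SELECT CUSTOMER_NAME FROM DEPOSITOR;
SELECT CUSTOMER_NAME FROM BORROW INTERSECT SELECT CUSTOMER_NAME FROM DEPOSITOR;
SELECT CUSTOMER_NAME FROM BORROW EXCEPT    SELECT CUSTOMER_NAME FROM DEPOSITOR;

-- Cartesian product (EMPLOYEE X DEPARTMENT)
SELECT * FROM EMPLOYEE CROSS JOIN DEPARTMENT;

-- Rename (ρ (STUDENT1, STUDENT))
SELECT * FROM STUDENT AS STUDENT1;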
Relational Calculus
• Relational calculus is a non-procedural query language. In a non-procedural query
language, the user specifies what data to obtain without giving the details of how to obtain it.
• The relational calculus tells what to do but never explains how to do it.
Types of relational calculus:
1. Tuple Relational Calculus (TRC)
2. Domain Relational Calculus (DRC)
Tuple relational calculus filters tuples based on a given condition.
For example:
1. {T.name | Author(T) AND Article = 'database' }
OUTPUT: This query selects the tuples from the AUTHOR relation. It returns a tuple with 'name'
from Author who has written an article on 'database'.
TRC (tuple relation calculus) can be quantified. In TRC, we can use Existential (∃) and Universal
Quantifiers (∀).
For example:
1. The second form of relational calculus is known as domain relational calculus. In domain relational
calculus, the filtering variable uses the domain of attributes.
2. Domain relational calculus uses the same operators as tuple calculus. It uses logical connectives
∧ (and), ∨ (or) and ¬ (not).
3. It uses Existential (∃) and Universal Quantifiers (∀) to bind the variable.
Notation:
1. { a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}
where a1, a2, ..., an are attributes and
P stands for a formula built using inner attributes.
Rule 5: Comprehensive Data Sub-Language Rule
A database can only be accessed using a language having linear syntax that supports data
definition, data manipulation, and transaction management operations. This language can be
used directly or by means of some application. If the database allows access to data without any
help of this language, then it is considered as a violation.
Rule 10: Integrity Independence Rule
A database must be independent of the application that uses it. All its integrity constraints can be
independently modified without the need of any change in the application. This rule makes a
database independent of the front-end applicationand its interface.
SQL
SQL is a programming language for Relational Databases. It is designed over relational algebra
and tuple relational calculus. SQL comes as a package with all major distributions of RDBMS.
SQL comprises both data definition and data manipulation languages. Using the data definition
properties of SQL, one can design and modify a database schema, whereas the data manipulation
properties allow SQL to store and retrieve data from the database.
➢ SQL stands for Structured Query Language. It is used for storing and managing data in
relational database management systems (RDBMS).
➢ It is a standard language for Relational Database System. It enables a user to create, read,
update and delete relational databases and tables.
➢ All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server use SQL as their
standard database language.
➢ SQL allows users to query the database in a number of ways, using English-like statements.
Rules:
SQL follows the following rules:
➢ Structured Query Language is not case sensitive. Generally, keywords of SQL are written in
uppercase.
➢ SQL statements are not dependent on text lines; a single SQL statement can be written on one
or multiple text lines.
➢ Using SQL statements, you can perform most of the actions in a database.
➢ SQL depends on tuple relational calculus and relational algebra.
SQL process:
➢ When an SQL command is executed for any RDBMS, the system figures out the best
way to carry out the request, and the SQL engine determines how to interpret the task.
➢ In the process, various components are included. These components can be the optimization
engine, query engine, query dispatcher, classic query engine, etc.
➢ All the non-SQL queries are handled by the classic query engine, but the SQL query engine won't
handle logical files.
Characteristics of SQL
➢ SQL is easy to learn.
➢ SQL is used to access data from relational database management systems.
➢ SQL can execute queries against the database.
➢ SQL is used to describe the data.
➢ SQL is used to define the data in the database and manipulate it when needed.
➢ SQL is used to create and drop the database and table.
➢ SQL is used to create a view, stored procedure, function in a database.
➢ SQL allows users to set permissions on tables, procedures, and views.
SQL Datatype
➢ SQL Datatype is used to define the values that a column can contain.
➢ Every column is required to have a name and data type in the database table.
Datatype of SQL:
1. Binary Datatypes
varbinary: It contains variable-length binary data with a maximum length of 8000 bytes.
2. Exact Numeric Datatypes
int: It is used to specify an integer value.
3. Character String Datatypes
varchar: It contains variable-length non-Unicode characters with a maximum length of 8000 characters.
4. Date and Time Datatypes
timestamp: It stores the year, month, day, hour, minute, and second values.
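A small illustrative table definition (not from the text) that uses one datatype from each group; SQL Server-style type names are assumed:

CREATE TABLE DOCUMENT (
    DOC_ID   INT,              -- exact numeric value
    TITLE    VARCHAR(100),     -- variable-length character string
    CONTENT  VARBINARY(8000),  -- variable-length binary data
    CREATED  DATETIME          -- date and time value
);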
SQL INSERT Statement
1. Without specifying column names
If you want to specify values for all the columns, you can omit the column names in the query.
Syntax
1. INSERT INTO TABLE_NAME
2. VALUES (value1, value2, value3, ... valueN);
Query
1. INSERT INTO EMPLOYEE VALUES (6, 'Marry', 'Canada', 600000, 48);
Output: After executing this query, the EMPLOYEE table will look like:
2. By specifying column names
Syntax
1. INSERT INTO TABLE_NAME
2. [(col1, col2, col3, ... colN)]
3. VALUES (value1, value2, value3, ... valueN);
Query
1. INSERT INTO EMPLOYEE (EMP_ID, EMP_NAME, AGE) VALUES (7, 'Jack', 40);
Output: After executing this query, the table will look like:
EMP_ID EMP_NAME CITY SALARY AGE
Note: In SQL INSERT query, if you add values for all columns then there is no need to specify the
column names. But you must be sure that you are entering the values in the same order as the
columns exist in the table.
1 Angelina Chicago 200000 30
Query
1. UPDATE EMPLOYEE
2. SET EMP_NAME = 'Emma'
3. WHERE SALARY = 500000;
Output: After executing this query, the EMPLOYEE table will look like:
3 Christian Denver 100000 42
Query
1. UPDATE EMPLOYEE
2. SET EMP_NAME = 'Kevin', City = 'Boston'
3. WHERE EMP_ID = 5;
Output
5 Kevin Boston 200000 36
Syntax
1. UPDATE table_name
2. SET column_name = value1;
Query
1. UPDATE EMPLOYEE
2. SET EMP_NAME = 'Harry';
Output
Syntax
1. DELETE FROM table_name WHERE some_condition;
Sample Table
EMPLOYEE
3 Christian Denver 100000 42
Output: After executing this query, the EMPLOYEE table will look like:
1. DELETE FROM EMPLOYEE;
Output: After executing this query, the EMPLOYEE table will look like:
Note: Using the condition in the WHERE clause, we can delete single as well as multiple records. If
you want to delete all the records from the table, then you don't need to use the WHERE clause.
Views in SQL
o Views in SQL are considered as a virtual table. A view also contains rows and
columns.
o To create the view, we can select the fields from one or more tables present in
the database.
o A view can either have specific rows based on certain conditions or all the rows of
a table.
Sample table:
Student_Detail
STU_ID NAME ADDRESS
1 Stephan Delhi
2 Kathrin Noida
3 David Ghaziabad
4 Alina Gurugram
Student_Marks
ID NAME MARKS AGE
1 Stephan 97 19
2 Kathrin 86 21
3 David 74 18
4 Alina 90 20
5 John 96 18
1. Creating view
A view can be created using the CREATE VIEW statement. We can create a view from a single
table or multiple tables.
Syntax:
1. CREATE VIEW view_name AS
2. SELECT column1, column2....
3. FROM table_name
4. WHERE condition;
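The query producing the output below is not shown in the text. A statement consistent with that output, assuming the view is named DetailsView and that the first column of Student_Detail is STU_ID, would be:

CREATE VIEW DetailsView AS
SELECT NAME, ADDRESS
FROM Student_Detail
WHERE STU_ID < 4;

SELECT * FROM DetailsView;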
Output:
NAME ADDRESS
Stephan Delhi
Kathrin Noida
David Ghaziabad
2. Creating a view from multiple tables
A view (here called MarksView) can also be created from both tables by joining Student_Detail and
Student_Marks on the student name, combining each student's address with their marks. Its output is:
NAME ADDRESS MARKS
Stephan Delhi 97
Kathrin Noida 86
David Ghaziabad 74
Alina Gurugram 90
4. Deleting View
Syntax
1. DROP VIEW view_name;
Example:
If we want to delete the view MarksView, we can do this as:
1. DROP VIEW MarksView;
Triggers
A trigger is a stored program that is fired automatically when a specified database event occurs. The main
clauses of a trigger definition are:
• [OF col_name] − This specifies the column name that will be updated.
• [ON table_name] − This specifies the name of the table associated with the trigger.
• [REFERENCING OLD AS o NEW AS n] − This allows you to refer to new and old values for various
DML statements, such as INSERT, UPDATE, and DELETE.
• [FOR EACH ROW] − This specifies a row-level trigger, i.e., the trigger will be executed for each
row being affected. Otherwise, the trigger will execute just once when the SQL statement is
executed, which is called a table-level trigger.
• WHEN (condition) − This provides a condition for rows for which the trigger would fire. This
clause is valid only for row-level triggers.
Example
To start with, we will be using the CUSTOMERS table we had created and used in the previous
chapters −
SELECT * FROM CUSTOMERS;
The following program creates a row-level trigger for the customers table that would fire for
INSERT or UPDATE or DELETE operations performed on the CUSTOMERS table. This trigger will
display the salary difference between the old values and new values −
CREATE OR REPLACE TRIGGER display_salary_changes
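The trigger definition above is truncated after its first line in the source. A complete sketch of such a trigger, assuming Oracle PL/SQL syntax and the CUSTOMERS table used earlier, could look like this:

CREATE OR REPLACE TRIGGER display_salary_changes
BEFORE INSERT OR UPDATE OR DELETE ON CUSTOMERS
FOR EACH ROW
WHEN (NEW.ID > 0)
DECLARE
    sal_diff NUMBER;
BEGIN
    -- compute and print the difference between the old and new salary values
    sal_diff := :NEW.SALARY - :OLD.SALARY;
    dbms_output.put_line('Old salary: ' || :OLD.SALARY);
    dbms_output.put_line('New salary: ' || :NEW.SALARY);
    dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/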
Triggering a Trigger
Let us perform some DML operations on the CUSTOMERS table. Here is one INSERT statement,
which will create a new record in the table −
INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY) VALUES (7, 'Kriti', 22, 'HP', 7500.00);
Database-specific factors
Some core features of the SQL language are implemented in the same way across popular
database platforms, and so many ways of detecting and exploiting SQL injection vulnerabilities
work identically on different types of databases.
However, there are also many differences between common databases. These mean that some
techniques for detecting and exploiting SQL injection work differently on different platforms. For
example:
➢ Syntax for string concatenation.
➢ Comments.
➢ Batched (or stacked) queries.
➢ Platform-specific APIs.
➢ Error messages.
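A few well-known examples of such differences (illustrative, not exhaustive):

-- String concatenation
SELECT 'foo' || 'bar' FROM DUAL;   -- Oracle (PostgreSQL also uses ||)
SELECT 'foo' + 'bar';              -- Microsoft SQL Server
SELECT CONCAT('foo', 'bar');       -- MySQL

-- Comment syntax
-- double-dash line comment (most platforms)
# hash line comment (MySQL)
/* block comment (most platforms) */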
Functional Dependency
The functional dependency is a relationship that exists between two attributes. It typically exists
between the primary key and non-key attribute within a table.
It is denoted as X → Y, where the left side of the FD is known as the determinant and the right side
is known as the dependent.
For example:
Assume we have an employee table with attributes: Emp_Id, Emp_Name, Emp_Address.
Here the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee table, because if
we know the Emp_Id, we can tell the employee name associated with it.
Functional dependency can be written as:
Emp_Id → Emp_Name
We can say that Emp_Name is functionally dependent on Emp_Id.
Types of Functional dependency
1. Trivial functional dependency
A → B is a trivial functional dependency if B is a subset of A.
2. Non-trivial functional dependency
A → B is a non-trivial functional dependency if B is not a subset of A.
Example:
1. ID → Name,
2. Name → DOB
Normalization
➢ Normalization is the process of organizing the data in the database.
➢ Normalization is used to minimize the redundancy from a relation or set of relations. It is
also used to eliminate the undesirable characteristics like Insertion, Update and Deletion
Anomalies.
➢ Normalization divides larger tables into smaller tables and links them using
relationships.
➢ The normal form is used to reduce redundancy from the database table.
Normal Form Description
1NF A relation is in 1NF if it contains only atomic values.
2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally
dependent on the primary key.
3NF A relation will be in 3NF if it is in 2NF and no transitive dependency exists for non-prime
attributes.
BCNF A relation will be in BCNF if it is in 3NF and, for every functional dependency X → Y, X is a
super key of the table.
4NF A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
5NF A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining
should be lossless.
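A brief decomposition sketch with illustrative names: a table that stores the course fee alongside every enrolment repeats the fee for each enrolled student (a partial dependency on the key); splitting it removes the redundancy, in line with the normal forms above.

-- Not in 2NF: COURSE_FEE depends only on COURSE, not on the whole key (STUDENT, COURSE)
CREATE TABLE ENROLMENT_UNNORMALIZED (
    STUDENT    VARCHAR(30),
    COURSE     VARCHAR(30),
    COURSE_FEE DECIMAL(8,2),
    PRIMARY KEY (STUDENT, COURSE)
);

-- Normalized: the fee is stored once per course
CREATE TABLE COURSE (
    COURSE     VARCHAR(30) PRIMARY KEY,
    COURSE_FEE DECIMAL(8,2)
);

CREATE TABLE ENROLMENT (
    STUDENT VARCHAR(30),
    COURSE  VARCHAR(30) REFERENCES COURSE(COURSE),
    PRIMARY KEY (STUDENT, COURSE)
);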
Transaction
A transaction is a set of logically related operations performed as a single unit of work. For example,
suppose an amount of 800 is transferred from X's account to Y's account; this involves the following
operations on the two accounts.
X's Account
1. Open Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close Account(X)
Y's Account
1. Open Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close Account(Y)
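The same transfer can be written as a single SQL transaction, assuming a hypothetical ACCOUNTS table with columns ACC_ID and BALANCE (the exact transaction-start syntax varies slightly between platforms). Either both updates become permanent, or neither does:

BEGIN TRANSACTION;

UPDATE ACCOUNTS SET BALANCE = BALANCE - 800 WHERE ACC_ID = 'X';
UPDATE ACCOUNTS SET BALANCE = BALANCE + 800 WHERE ACC_ID = 'Y';

COMMIT;        -- make both changes permanent
-- ROLLBACK;   -- or undo both changes if something went wrong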
Operations of Transaction:
Following are the main operations of transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in a buffer
in main memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
An example of a debit transaction from an account consists of the following operations:
1. R(X);
2. X = X - 500;
3. W(X);
Assume the value of X before starting of the transaction is 4000.
➢ The first operation reads X's value from the database and stores it in a buffer.
➢ The second operation will decrease the value of X by 500. So, the buffer will contain 3500.
➢ The third operation will write the buffer's value to the database. So, X's final value will be 3500.
But it may be possible that, because of a hardware, software or power failure, the
transaction fails before finishing all the operations in the set.
For example: if in the above transaction the debit transaction fails after executing operation 2,
then X's value will remain 4000 in the database, which is not acceptable by the bank.
To solve this problem, we have two important operations:
Commit: It is used to save the work done permanently.
Rollback: It is used to undo the work done.
Transaction property
A transaction has four properties. These are used to maintain consistency in a database,
before and after the transaction.
Property of Transaction
1. Atomicity
2. Consistency
3. Isolation
4. Durability
Atomicity
• It states that all operations of the transaction take place at once; if not, the transaction is
aborted.
• There is no midway, i.e., the transaction cannot occur partially. Each transaction is treated
as one unit and either runs to completion or is not executed at all.
Consistency
• The integrity constraints are maintained so that the database is consistent before and after
the transaction.
• The execution of a transaction will leave a database in either its prior stable state or a new
stable state.
• The consistent property of database states that every transaction sees a consistent
database instance.
• The transaction is used to transform the database from one consistent state to another
consistent state.
Isolation
• It shows that the data which is used at the time of execution of a transaction cannot be
used by a second transaction until the first one is completed.
• In isolation, if the transaction T1 is being executed and using the data item X, then that data
item can't be accessed by any other transaction T2 until the transaction T1 ends.
• The concurrency control subsystem of the DBMS enforces the isolation property.
Durability
• The durability property is used to indicate the permanence of the database's consistent
state. It states that the changes made by a completed transaction are permanent.
• They cannot be lost by the erroneous operation of a faulty transaction or by the system
failure. When a transaction is completed, then the database reaches a state known as the
consistent state. That consistent state cannot be lost, even in the event of a system's
failure.
• The recovery subsystem of the DBMS has the responsibility of the durability property.
States of Transaction
In a database, the transaction can be in one of the following states -
Active state
• The active state is the first state of every transaction. In this state, the
transaction is being executed.
• For example: insertion or deletion or updating of a record is done here. But all the
records are still not saved to the database.
Partially committed
• In the partially committed state, a transaction executes its final operation, but the data is
still not saved to the database.
• In the total mark calculation example, a final display of the total marks step is executed in
this state.
Committed
A transaction is said to be in a committed state if it executes all its operations successfully. In this
state, all the effects are now permanently saved on the database system.
Failed state
➢ If any of the checks made by the database recovery system fails, then the transaction is
said to be in the failed state.
➢ In the example of total mark calculation, if the database is not able to fire a query to fetch
the marks, then the transaction will fail to execute.
Aborted
➢ If any of the checks fail and the transaction has reached a failed state then the database
recovery system will make sure that the database is in its previous consistent state. If not
then it will abort or roll back the transaction to bring the database into a consistent state.
➢ If the transaction fails in the middle of the transaction then before executing the
transaction, all the executed operations are rolled back to bring the database to its consistent state.
➢ After aborting the transaction, the database recovery module will select one of the two
operations:
1. Re-start the transaction
2. Kill the transaction
Types of Schedules
There are two types of schedules −
➢ Serial Schedules − In a serial schedule, at any point of time, only one transaction is active,
i.e., there is no overlapping of transactions.
➢ Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e., the transactions contain operations that overlap in time.
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active transactions
perform non-compatible operations. Two operations are said to be in conflict when all of the
following three conditions exist simultaneously −
• The two operations are parts of different transactions.
• Both the operations access the same data item.
• At least one of the operations is a write item () operation, i.e. it tries tomodify the data item.
Serializability
A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to a serial schedule comprising the same ‘n’ transactions. A serializable schedule retains the correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.
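A common way to test a schedule for conflict-serializability is to build a precedence graph and check it for cycles. The following is a minimal Python sketch of that idea; the sample schedule is illustrative, not taken from the text.

# A schedule is a list of (transaction, action, data_item) triples.
schedule = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]

def conflicts(op1, op2):
    # Two operations conflict if they belong to different transactions,
    # access the same data item, and at least one of them is a write.
    return op1[0] != op2[0] and op1[2] == op2[2] and "W" in (op1[1], op2[1])

# Add an edge Ti -> Tj whenever an operation of Ti precedes a conflicting operation of Tj.
edges = set()
for i in range(len(schedule)):
    for j in range(i + 1, len(schedule)):
        if conflicts(schedule[i], schedule[j]):
            edges.add((schedule[i][0], schedule[j][0]))

def has_cycle(edges):
    # Simple depth-first search for a cycle in the precedence graph.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, stack = set(), set()
    def dfs(node):
        visited.add(node); stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in stack or (nxt not in visited and dfs(nxt)):
                return True
        stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)

print("conflict-serializable" if not has_cycle(edges) else "not conflict-serializable")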
Equivalence of Schedules
Equivalence of two schedules can be of the following types − result equivalence, view equivalence, and conflict equivalence.
1. Lost Update Problem
Example:
Here,
➢ At time t2, transaction-X reads A's value.
➢ At time t3, Transaction-Y reads A's value.
➢ At time t4, Transaction-X writes A's value on the basis of the value seen at time t2.
➢ At time t5, Transaction-Y writes A's value on the basis of the value seen at time t3.
➢ So, at time t5, the update of Transaction-X is lost because Transaction-Y overwrites it without looking at its current value.
➢ Such a problem is known as the Lost Update Problem, as the update made by one transaction is lost here.
2. Dirty Read
➢ A dirty read occurs when one transaction updates an item of the database, and then the transaction fails for some reason. The updated database item is accessed by another transaction before it is changed back to the original value.
➢ A transaction T1 updates a record which is read by T2. If T1 aborts, then T2 now has values which have never formed part of the stable database.
Example:
➢ At time t2, transaction-Y writes A's value.
➢ At time t3, Transaction-X reads A's value.
➢ At time t4, Transaction-Y rolls back. So, it changes A's value back to that of prior to t1.
➢ So, Transaction-X now contains a value which has never become part of the stable
database.
➢ Such type of problem is known as Dirty Read Problem, as one transaction reads a dirty
value which has not been committed.
3. Inconsistent Retrievals Problem
1. σsalary>10000 (πsalary (Employee))
2. πsalary (σsalary>10000 (Employee))
After translating the given query, we can execute each relational algebra operation by using different algorithms. So, in this way, query processing begins its working.
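To make the translation step concrete, the following minimal Python sketch evaluates the two equivalent relational algebra expressions above over a small in-memory Employee relation; the sample rows are illustrative, not from the text.

# Illustrative Employee relation as a list of dictionaries.
employee = [
    {"EmpID": 1, "EName": "Arun Kumar", "salary": 9000},
    {"EmpID": 2, "EName": "Meera", "salary": 15000},
]

def select(rows, predicate):          # sigma: keep rows satisfying the predicate
    return [r for r in rows if predicate(r)]

def project(rows, attrs):             # pi: keep only the listed attributes
    return [{a: r[a] for a in attrs} for r in rows]

expr1 = select(project(employee, ["salary"]), lambda r: r["salary"] > 10000)
expr2 = project(select(employee, lambda r: r["salary"] > 10000), ["salary"])
assert expr1 == expr2                 # both orderings yield [{'salary': 15000}]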
Evaluation
For this, in addition to the relational algebra translation, it is required to annotate the translated relational algebra expression with the instructions used for specifying and evaluating each operation. Thus, after translating the user query, the system executes a query evaluation plan.
Optimization
➢ The cost of the query evaluation can vary for different types of queries. Although the system is responsible for constructing the evaluation plan, the user need not write the query efficiently.
➢ Usually, a database system generates an efficient query evaluation plan, which minimizes its cost. This task is performed by the database system and is known as Query Optimization.
➢ For optimizing a query, the query optimizer should have an estimated cost analysis of each
operation. It is because the overall operation cost depends on the memory allocations to
several operations, execution costs, and so on.
➢ Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
DEPARTMENT (DNo, DName, ...)
Example 1
Let us consider the query as the following.
$$\pi_{EmpID} (\sigma_{EName = \text{"Arun Kumar"}} (EMPLOYEE))$$
The corresponding query tree will be −
Example 2
Consider another query involving a join.
$$\pi_{EName, Salary} (\sigma_{DName = \text{"Marketing"}} (DEPARTMENT) \bowtie_{DNo=DeptNo} (EMPLOYEE))$$
Following is the query tree for the above query.
• Perform select and project operations before join operations. This is done by
moving the select and project operations down the query tree. This reduces the
number of tuples available for join.
• Perform the most restrictive select/project operations at first before the other
operations.
• Avoid cross-product operations since they result in very large-sized intermediate tables.
DBMS is a highly complex system with hundreds of transactions being executed every second.
The durability and robustness of a DBMS depends on its complex architecture and its underlying
hardware and system software. If it fails or crashes amid transactions, it is expected that the
system would follow some sort of algorithm or techniques to recover lost data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various categories, as follows
−
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it can’t go any further. This is called transaction failure where only a few transactions or processes are hurt.
System Crash
There are problems, external to the system, that may cause the system to stop abruptly and crash. For example, interruptions in the power supply may cause the failure of underlying hardware or software.
Examples may include operating system errors.
Disk Failure
In the early days of technology evolution, it was a common problem where hard-disk drives or storage drives used to fail frequently.
Disk failures include the formation of bad sectors, inaccessibility of the disk, a disk head crash, or any other failure which destroys all or a part of disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories −
• Volatile storage − As the name suggests, volatile storage cannot survive system crashes. Volatile storage devices are placed very close to the CPU; normally they are embedded onto the chipset itself. Main memory and cache memory are examples of volatile storage. They are fast but can store only a small amount of information.
• Non-volatile storage − These memories are made to survive system crashes. They are huge in
data storage capacity, but slower in accessibility. Examples may include hard-disks, magnetic
tapes, flash memory, and non-volatile (battery backed up) RAM.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed. When the DBMS recovers from a crash, it should do the following −
• It should check the states of all the transactions which were being executed.
• A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.
• It should check whether the transaction can be completed now, or it needs to be rolled back.
• No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques which can help a DBMS in recovering as well as maintaining the atomicity of a transaction −
• Maintaining the logs of each transaction and writing them onto some stable storage before actually modifying the database.
• Maintaining shadow paging, where the changes are done on a volatile memory, and later, the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe.
The database can be modified using two approaches −
• Deferred database modification − All logs are written on to the stable storage and the database is updated when a transaction commits.
• Immediate database modification − Each log follows an actual database modification. That is,
the database is modified immediately after every operation.
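The following is a minimal Python sketch, not taken from the text, of the write-ahead discipline behind immediate database modification: the log record describing a change is forced to stable storage before the data item itself is modified. The file name and record format are illustrative assumptions.

import json, os

LOG_FILE = "wal.log"

def write_log(record):
    # Append the log record and flush it to disk before any data change.
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())

def update_item(db, txn_id, item, new_value):
    old_value = db.get(item)
    write_log({"txn": txn_id, "item": item, "old": old_value, "new": new_value})
    db[item] = new_value            # the modification happens only after the log record

database = {"A": 100}
write_log({"txn": "T1", "type": "start"})
update_item(database, "T1", "A", 150)
write_log({"txn": "T1", "type": "commit"})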
When more than one transaction is being executed in parallel, the logs are interleaved. At the time
of recovery, it would become hard for the recovery system to backtrack all logs, and then start
recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the memory
space available in the system. As time passes, the log file may grow too big to be handled at all.
Checkpoint is a mechanism where all the previous logs are removed from the system and stored
permanently in a storage disk.
Checkpoint declares a point before which the DBMS was in consistent state, and all the
transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following manner −
• The recovery system reads the logs backwards from the end to the last checkpoint.
• It maintains two lists, an undo-list and a redo-list.
• If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just
<Tn, Commit>, it puts the transaction in the redo-list.
• If the recovery system sees a log with <Tn, Start> but no commit or abort log is found, it puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the redo-list are redone and their logs are saved.
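The following minimal Python sketch, not from the text, shows how a recovery system could build the undo-list and redo-list by scanning the log backwards to the last checkpoint; the log record format is an illustrative assumption.

# Each record is ("checkpoint",) or (transaction, action).
log = [
    ("checkpoint",),
    ("T1", "start"), ("T1", "commit"),
    ("T2", "start"),                      # no commit/abort seen: must be undone
]

undo_list, redo_list, finished = set(), set(), set()
for record in reversed(log):              # read backwards from the end
    if record[0] == "checkpoint":
        break
    txn, action = record
    if action in ("commit", "abort"):
        finished.add(txn)
        if action == "commit":
            redo_list.add(txn)            # committed transactions are redone
    elif action == "start" and txn not in finished:
        undo_list.add(txn)                # started but never finished: undo

print("undo:", undo_list, "redo:", redo_list)   # undo: {'T2'}  redo: {'T1'}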
Traditional RDBMS products concentrate on the efficient organization of data that is derived from
a limited set of datatypes. On the other hand, an ORDBMS has a feature that allows developers to build their own data types and methods, which can be added to the DBMS. With this, an ORDBMS intends to allow developers to raise the level of abstraction at which they view the problem area.
Database Security
DB2 database and functions can be managed by two different modes of security controls:
1. Authentication
2. Authorization
Authentication
Authentication is the process of confirming that a user logs in only in accordance with the rights to perform the activities he is authorized to perform. User authentication can be performed at the operating system level or the database level itself. Authentication tools for biometrics, such as retina and fingerprint scans, are in use to keep the database safe from hackers or malicious users.
The database security can be managed from outside the DB2 database system. Here are some types of security authentication processes:
➢ Based on operating system authentication.
➢ Lightweight Directory Access Protocol (LDAP)
For DB2, the security service is a part of the operating system as a separate product. For authentication, it requires two different credentials: a user ID or username, and a password.
Authorization
You can access the DB2 database and its functionality within the DB2 database system, which is managed by the DB2 database manager. Authorization is a process managed by the DB2 database manager. The manager obtains information about the current authenticated user, which indicates which database operations the user can perform or access.
Here are the different kinds of permissions available for authorization:
➢ Primary permission: Grants the authorization ID directly.
➢ Secondary permission: Grants to the groups and roles if the user is a member.
➢ Public permission: Grants to all users publicly.
➢ Context-sensitive permission: Grants to the trusted context role.
Authorization can be given to users in the following categories:
➢ System-level authorization
➢ System administrator [SYSADM]
➢ System Control [SYSCTRL]
➢ System maintenance [SYSMAINT]
➢ System monitor [SYSMON]
Authorities provide controls within the database. Other database authorities include LOAD and CONNECT.
➢ Object-Level Authorization: Object-Level authorization involves verifying privileges when an
operation is performed on an object.
➢ Content-based Authorization: Users can have read and write access to individual rows and columns on a particular table using Label-Based Access Control [LBAC].
DB2 tables and configuration files are used to record the permissions associated with authorization names. When a user tries to access the data, the recorded permissions are checked for the following:
➢ The authorization name of the user
➢ The groups to which the user belongs
➢ The roles granted directly to the user or indirectly to a group
➢ Permissions acquired through a trusted context.
While working with SQL statements, the DB2 authorization model considers the combination of the following permissions:
➢ Permissions granted to the primary authorization ID associated with the SQL statements.
➢ Secondary authorization IDs associated with the SQL statements.
➢ Granted to PUBLIC.
➢ Granted to PUBLIC
➢ Granted to the trusted context role.
Database authorities
Each database authority holds the authorization ID to perform some action on the database. These database authorities are different from privileges. Here is the list of some database authorities:
ACCESSCTRL: Allows to grant and revoke all object privileges and database authorities.
BINDADD: Allows to create a new package in the database.
CONNECT: Allows to connect to the database.
CREATETAB: Allows to create new tables in the database.
DBADM: Acts as a database administrator. It gives all other database authorities except ACCESSCTRL, DATAACCESS, and SECADM.
EXPLAIN: Allows to explain query plans without requiring them to hold the privileges to access the
data in the tables.
SQLADM: Allows to monitor and tune SQL statements.
WLMADM: Allows to act as a workload administrator.
Privileges
SETSESSIONUSER
Authorization ID privileges involve actions on authorization IDs. There is only one such privilege, called the SETSESSIONUSER privilege. It can be granted to a user or a group, and it allows the session user to switch identities to any of the authorization IDs on which the privilege was granted. This privilege is granted by the SECADM authority.
Schema privileges
These privileges involve actions on schema in the database. The owner of the schema has all the
permissions to manipulate the schema objects like tables, views, indexes, packages, data types,
functions, triggers, procedures and aliases. A user, a group, a role, or PUBLIC can be granted any of the following privileges:
➢ CREATEIN: allows to create objects within the schema
➢ ALTERIN: allows to modify objects within the schema.
DROPIN
This allows to delete the objects within the schema.
CONTROL
It provides all the privileges for a table or a view, including the ability to drop it and to grant or revoke individual table privileges.
ALTER
It allows user to modify a table.
DELETE
It allows the user to delete rows from the table or view.
INDEX
It allows the user to create an index on a table.
INSERT
It allows the user to insert a row into a table or view. It can also run the import utility.
REFERENCES
It allows the users to create and drop a foreign key.
SELECT
It allows the user to retrieve rows from a table or view.
UPDATE
It allows the user to change entries in a table or view.
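As a rough illustration of how such table privileges are granted and revoked in practice, the following sketch assumes a PEP 249 (DB-API) connection object `conn` to a DB2 database (for example from the ibm_db_dbi driver); the EMPLOYEE table and the user ALICE are illustrative assumptions, not from the text.

# A minimal sketch, assuming `conn` is a PEP 249 connection to a DB2 database.
# The EMPLOYEE table and the user ALICE are illustrative placeholders.
def grant_table_privileges(conn):
    cur = conn.cursor()
    # Give ALICE read and update access on the EMPLOYEE table.
    cur.execute("GRANT SELECT, UPDATE ON TABLE EMPLOYEE TO USER ALICE")
    # Revoke the update privilege again when it is no longer needed.
    cur.execute("REVOKE UPDATE ON TABLE EMPLOYEE FROM USER ALICE")
    conn.commit()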
Package privileges
A user must have CONNECT authority to the database. A package is a database object that contains the information needed by the database manager to access data in the most efficient way for a particular application.
CONTROL
It provides the user with privileges of rebinding, dropping or executing packages. A user with these privileges is automatically granted the BIND and EXECUTE privileges.
BIND
It allows the user to bind or rebind that package.
EXECUTE
Allows to execute a package.
Index privileges
The creator of an index automatically receives the CONTROL privilege on the index.
Sequence privileges
The creator of a sequence automatically receives the USAGE and ALTER privileges on the sequence.
Routine privileges
It involves the action of routines such as functions, procedures, and methods within a database.
The enhanced data model offers rich features but breaks backward compatibility.
The classic model is simple, well-understood, and has been around for a long time. The enhanced
data model offers many new features for structuring data. Data producers must choose which
data model to use.
Reasons to use the classic model:
➢ Data using the classic model can be read by all existing netCDF software.
➢ Writing programs for classic model data is easier.
➢ Most or all existing netCDF conventions are targeted at the classic model.
➢ Many great features, like compression, parallel I/O, large data sizes, etc., are available within the classic model.
Temporal Databases
Temporal data stored in a temporal database differs from the data stored in a non-temporal database in that a time period attached to the data expresses when it was valid or stored in the database. As mentioned above, conventional databases consider the data stored in them to be valid at the time instant now; they do not keep track of past or future database states. By attaching a time period to the data, it becomes possible to store different database states.
A first step towards a temporal database thus is to timestamp the data. This allows the
distinction of different database states. One approach is that a temporal database may
timestamp entities with time periods. Another approach is the timestamping of the property values of the entities. In the relational data model, tuples are timestamped, whereas in object-oriented data models, objects and/or attribute values may be timestamped.
What time period do we store in these timestamps? As we mentioned already, there are mainly
two different notions of time which are relevant for temporal databases. One is called the valid
time, the other one is the transaction time. Valid time denotes the time period during which a fact
is true with respect to the real world. Transaction time is the time period during which a fact is
stored in the database. Note that these two time periods do not have to be the same for a single
fact. Imagine that we come up with a temporal database storing data about the 18th century. The
valid time of these facts is somewhere between 1700 and 1799, whereas the transaction time starts when we insert the facts into the database, for example, January 21, 1998.
Assume we would like to store data about our employees with respect to the real world. Then, the
following table could result:
EmpID Name Department Salary Valid Time Start Valid Time End
The above valid-time table stores the history of the employees with respect to the real world. The
attributes Valid Time Start and Valid Time End actually represent a time interval which is closed
at its lower and open at its upper bound. Thus, we see that during the time period [1985 - 1990),
employee John was working in the
research department, having a salary of 11000. Then he changed to the sales department, still
earning 11000. In 1993, he got a salary raise to 12000. The upper bound INF denotes that the tuple
is valid until further notice. Note that it is now possible to store information about past states. We
see that Paul was employed from 1988 until 1995. In the corresponding non-temporal table, this
information was (physically) deleted when Paul left the company.
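A minimal Python sketch of querying such a valid-time table is shown below. John's rows follow the history described above; Paul's department and salary are illustrative placeholders, since the text only states his period of employment.

INF = float("inf")   # stands in for the INF upper bound ("valid until further notice")

# (Name, Department, Salary, ValidTimeStart, ValidTimeEnd); intervals are [start, end)
employee_vt = [
    ("John", "Research", 11000, 1985, 1990),
    ("John", "Sales",    11000, 1990, 1993),
    ("John", "Sales",    12000, 1993, INF),
    ("Paul", "Research", 11000, 1988, 1995),   # Paul's department/salary are placeholders
]

def valid_at(rows, year):
    # Return the tuples whose valid-time interval contains the given year.
    return [r for r in rows if r[3] <= year < r[4]]

print(valid_at(employee_vt, 1992))   # John in Sales earning 11000, plus Paul's row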
Multimedia Databases
The multimedia databases are used to store multimedia data such as images, animation, audio,
video along with text. This data is stored in the form of multiple file types like .txt(text),
.jpg(images), .swf(videos), .mp3(audio) etc.
Media data
This is the multimedia data that is stored in the database such as images, videos, audios,
animation etc.
Mobile Databases
Mobile databases are separate from the main database and can easily be transported to various places. Even though they are not connected to the main database, they can still communicate with the database to share and exchange data.
The mobile database includes the following components −
• The main system database that stores all the data and is linked to the mobile database.
• The mobile database that allows users to view information even while on the move. It
shares information with the main database.
• The device that uses the mobile database to access data. This device can be a mobile
phone, laptop etc.
• A communication link that allows the transfer of data between the mobile database and the
main database.
• The data in a database can be accessed from anywhere using a mobile database. It
provides wireless database access.
• The database systems are synchronized using mobile databases and multiple users can
access the data with seamless delivery process.
• Mobile databases require very little support and maintenance.
• The mobile database can be synchronized with multiple devices such as mobiles, computer
devices, laptops etc.
• The mobile data is less secure than data that is stored in a conventional stationary
database. This presents a security hazard.
• The mobile unit that houses a mobile database may frequently lose power because of
limited battery. This should not lead to loss of data in database.
Deductive Database
A deductive database is a database system that makes conclusions about
its data based on a set of well-defined rules and facts. This type of database was developed to
combine logic programming with relational database management systems. Usually, the language
used to define the rules and facts is the logic programming language Datalog.
A deductive database is a type of database that can make conclusions, or we can say deductions, using a set of well-defined rules and facts that are stored in the database. In today's world, as we deal with a large amount of data, a deductive database provides a lot of advantages. It helps to combine the RDBMS with logic programming. To design a deductive database, a purely declarative programming language called Datalog is used.
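The following minimal Python sketch, not from the text, mimics the naive bottom-up evaluation a Datalog engine performs: derived facts (ancestor) are computed from base facts (parent) by applying rules until nothing new can be inferred. The parent facts are illustrative.

parent = {("tom", "bob"), ("bob", "ann")}       # base facts

# Rule 1: ancestor(X, Y) :- parent(X, Y).
# Rule 2: ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
ancestor = set(parent)                           # Rule 1
changed = True
while changed:                                   # apply Rule 2 until a fixpoint is reached
    changed = False
    new_facts = set()
    for (x, y) in parent:
        for (y2, z) in ancestor:
            if y == y2 and (x, z) not in ancestor:
                new_facts.add((x, z))
    if new_facts:
        ancestor |= new_facts
        changed = True

print(ancestor)   # includes the derived fact ('tom', 'ann')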
The implementations of deductive databases can be seen in LDL (Logic Data Language), NAIL
(Not Another Implementation of Logic), CORAL, and VALIDITY. The use of LDL and VALIDITY in a
variety of business/industrial applications is as follows.
1. LDL Applications:
This system has been applied to the following application domains:
• Enterprise modeling:
Data related to an enterprise may result in an extended ER model containing hundreds of entities and relationships and thousands of attributes. This domain involves modeling the structure, processes, and constraints within an enterprise.
• Hypothesis testing or data dredging:
This domain involves formulating a hypothesis, translating it into an LDL rule set and a query, and then executing the query against given data to test the hypothesis. This has been applied to genome data analysis in the field of microbiology, where data dredging consists of identifying the DNA sequences from low-level digitized autoradiographs from experiments performed on E. coli bacteria.
• Software reuse:
A small fraction of the software for an application is rule-based and encoded in LDL (the bulk is developed in standard procedural code). The rules give rise to a knowledge base that contains a definition of each C module used in the system and a set of rules that define ways in which modules can export/import functions, constraints, and so on. The “knowledge base” can be used to make
decisions that pertain to the reuse of software subsets.
This is being experimented within banking software.
2. VALIDITY Applications:
VALIDITY combines deductive capabilities with the ability to manipulate complex objects (OIDs, inheritance, methods, etc.). It provides a DOOD data model and language called DEL (Datalog Extended Language), an engine working along a client-server model, and a set of tools for schema and rule editing, validation, and querying.
XML - Databases
An XML database is used to store a huge amount of information in the XML format. As the use of XML is increasing in every field, it is required to have a secure place to store the XML
documents. The data stored in the database can be queried using XQuery, serialized, and
exported into a desired format.
XML databases are of two types −
• XML-enabled
• Native XML (NXD)
Example
Following example demonstrates XML database −
<?xml version = "1.0"?>
<contact-info>
<contact1>
<name>Tanmay Patil</name>
<company>Tutorials Point</company>
<phone>(011) 123-4567</phone>
</contact1>
<contact2>
<name>Manisha Patil</name>
<company>Tutorials Point</company>
<phone> (011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2),
which in turn consists of three entities − name, company and phone.
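As a small illustration of querying such a document programmatically, the following sketch parses the contact-info XML shown above with Python's standard xml.etree.ElementTree module; in a native XML database the same kind of retrieval would typically be expressed in XQuery.

import xml.etree.ElementTree as ET

xml_data = """<?xml version="1.0"?>
<contact-info>
  <contact1><name>Tanmay Patil</name><company>Tutorials Point</company><phone>(011) 123-4567</phone></contact1>
  <contact2><name>Manisha Patil</name><company>Tutorials Point</company><phone>(011) 789-4567</phone></contact2>
</contact-info>"""

root = ET.fromstring(xml_data)
for contact in root:                       # iterates over contact1 and contact2
    print(contact.find("name").text, contact.find("phone").text)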
➢ Powerful and Scalable - Internet Database Applications are more robust, agile and able to
expand and scale up more easily.
Database servers that are built to serve Internet applications are designed to handle millions of
concurrent connections and complex SQL queries.
A good example is Facebook, which uses database servers that are able to handle millions of
inquiries and complex SQL queries.
Internet database applications use the same type of database server that is designed to run
Facebook. The database servers that are built to serve desktop applications usually can handle
only a limited number of connections and are not able to deal with complex SQL queries.
• Web Based - Internet Database Applications are web-based applications, therefore the data
can be accessed using a browser at any location.
• Security - Database servers have been fortified with preventive features and security
protocols have been implemented to combat today's cyber security threats and
vulnerabilities.
• Open Source, Better Licensing Terms and Cost Savings - There are many powerful
database servers that are open source. This means that there is no licensing cost. Many
large enterprise sites are using Open-Source Database Servers, such as Facebook, Yahoo,
YouTube, Flickr, Wikipedia, etc.
Open Source also creates less dependence on vendors, which is a big advantage because that
provides more product quality control and lower cost. Open source also offers easier
customization and is experiencing a fast-growing adoption rate, especially by the large and influential
enterprises.
➢ Abundant Features - There are many open-source programming languages (such as PHP, Python,
Ruby) and hundreds of powerful open-source libraries, tools and plug-ins specifically built to
interact with today's database servers.
2. Remote Sensing
3. Photogrammetry
4. Environmental Science
5. City Planning
6. Cognitive Science
As a result, GIS relies on progress made in fields such as computer science, databases, statistics, and artificial intelligence. All the different problems and questions that arise from the integration of multiple disciplines make GIS more than a simple tool.
Characteristics of Biological Data
• Biological data is highly complex compared with most other domains. Definitions of such data must be able to represent a complex substructure of data as well as relationships. An additional context is provided by the structure of the biological data for interpretation of the information.
• There is a rapid change in schemas of biological databases.
• There should be support for schema evolution and data object migration so that there can be an improved information flow between generations or releases of databases.
• The ability to extend the schema, a frequent occurrence in the biological setting, is not well supported by most relational database systems.
• Most biologists are not likely to have knowledge of internal structure of the database or
about schema design.
• Users need information which can be displayed in a manner applicable to the problem they are trying to address, and the data structure should be reflected in an easy and understandable manner. Relational schemas fail to provide information regarding the meaning of the schema to the user. Web interfaces provide preset search interfaces, which may limit access into the database.
• Users of biological data usually do not need write access to the database; they only require read access.
• Write access is limited to privileged users called curators. Only a small number of users require write access, but users generate a wide variety of read access patterns on the databases.
• Access to “old” values of the data is required by the users of biological data, most often while verifying previously reported results.
• Hence, the system of archives must support changes to the values of the data in the database. Access to both the most recent version of a data value and its previous version is important in the biological domain.
• Added meaning is given by the context of data for its use in biological applications. Whenever appropriate, context must be maintained and conveyed to the user. To maximize the interpretation of a biological data value, it should be possible to integrate as many contexts as possible.
Distributed databases
Distributed databases can be classified into homogeneous and heterogeneous databases having
further divisions.
Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user
requests.
• The database is accessed through a single interface as if it is a single database.
• Autonomous − Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network, hierarchical or
object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in processing
user requests.
Architectural Models
Some of the common architectural models are −
• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −
• Non-replicated and non-fragmented
• Fully replicated
• Partially replicated
• Fragmented
• Mixed
Fully Replicated
In this design alternative, at each site, one copy of all the database tables is stored. Since each
site has its own copy of the entire database, queries are very fast requiring negligible
communication cost. On the contrary, the massive redundancy in data requires huge cost during
update operations. Hence, this is suitable for systems where a large number of queries is required
to be handled whereas the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in accordance with the frequency of access. This takes into consideration the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how frequently the access queries execute and the sites which generate the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions, and
each fragment can be stored at different sites. This considers the fact that it seldom happens that
all data stored in a table is required at a given site. Moreover, fragmentation increases parallelism
and provides better disaster recovery. Here, there is only one copy of each fragment in the
system, i.e., no redundant data.
The three fragmentation techniques are −
• Vertical fragmentation
• Horizontal fragmentation
• Hybrid fragmentation
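The following minimal Python sketch, not from the text, illustrates horizontal and vertical fragmentation on a small in-memory EMPLOYEE relation; the attribute names and the fragmentation predicate are illustrative assumptions.

employee = [
    {"EmpID": 1, "EName": "Arun",  "DeptNo": 10, "Salary": 9000},
    {"EmpID": 2, "EName": "Meera", "DeptNo": 20, "Salary": 15000},
]

# Horizontal fragmentation: split rows by a predicate, e.g. the department served by each site.
site_a = [r for r in employee if r["DeptNo"] == 10]
site_b = [r for r in employee if r["DeptNo"] == 20]

# Vertical fragmentation: split columns, keeping the key in every fragment so the
# original relation can be reconstructed by joining the fragments on EmpID.
personal = [{"EmpID": r["EmpID"], "EName": r["EName"]} for r in employee]
payroll  = [{"EmpID": r["EmpID"], "Salary": r["Salary"]} for r in employee]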
Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the tables are initially
fragmented in any form (horizontal or vertical), and then these fragments are partially replicated
across the different sites according to the frequency of accessing the fragments.
DBMS Architecture
In client-server computing, a client requests a resource and the server provides that resource. A server may serve multiple clients at the same time while a client is in contact with only one server.
• The DBMS design depends upon its architecture. The basic client/server architecture is
used to deal with a large number of PCs, web servers, database servers and other
components that are connected with networks.
• The client/server architecture consists of many PCs and a workstation which are connected via the network.
• DBMS architecture depends upon how users are connected to the database to get their
request done.
Data Warehouse:
A Data Warehouse refers to a place where data can be stored for useful mining. It is like a quick
computer system with exceptionally huge data storage capacity.
Data from the organization's various systems is copied to the warehouse, where it can be fetched and conformed to remove errors. Here, advanced queries can be made against the warehouse's store of data.
A data warehouse combines data from numerous sources, which ensures data quality, accuracy, and consistency. A data warehouse boosts system performance by separating analytics processing from transactional databases. Data flows into a data warehouse from different databases. A data
warehouse works by sorting out data into a pattern that depicts the format and types of data.
Query tools examine the data tables using patterns.
Data warehouses and databases both are relative data systems, but both are made to serve
different purposes. A data warehouse is built to store a huge amount of historical data and
empowers fast requests over all the data, typically using Online Analytical Processing (OLAP). A
database is made to store current transactions and allow quick access to specific transactions
for ongoing businessprocesses, commonly known as Online Transaction Processing (OLTP).
Important Features of Data Warehouse
1. Subject-Oriented
A data warehouse is subject-oriented. It provides useful data about a subject instead of the company's ongoing operations, and these subjects can be customers, suppliers, marketing, product, promotion, etc. A data warehouse usually focuses on modeling and analysis of data that helps the business organization to make data-driven decisions.
2. Time-Variant:
The different data present in the data warehouse provides information for a specific period.
3. Integrated
A data warehouse is built by joining data from heterogeneous sources, such as social databases,
level documents, etc.
4. Non- Volatile
It means that once data is entered into the warehouse, it cannot be changed.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge
sets of data that have either been compiled by computer systems or have been downloaded into
the computer. In the data mining process, the computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among the data
sets.
Data mining aims to enable business organizations to view business behaviors, trends, and relationships that allow the business to make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and machine learning systems to discover the relationships between the data. Data mining tools can support business-related questions that are traditionally time-consuming to resolve.
i. Market Analysis:
Data Mining can predict the market that helps the business to make the decision. For example, it
predicts who is keen to purchase what type of products.
Data Mining: Data mining is the process of determining data patterns.
Data Warehouse: A data warehouse is a database system designed for analytics.

Data Mining: Business entrepreneurs carry out data mining with the help of engineers.
Data Warehouse: Data warehousing is entirely carried out by the engineers.

Data Mining: In data mining, data is analyzed repeatedly.
Data Warehouse: In data warehousing, data is stored periodically.

Data Mining: Data mining uses pattern recognition techniques to identify patterns.
Data Warehouse: Data warehousing is the process of extracting and storing data that allows easier reporting.

Data Mining: One of the most amazing data mining techniques is the detection and identification of the unwanted errors that occur in the system.
Data Warehouse: One of the advantages of the data warehouse is its ability to update frequently. That is the reason why it is ideal for business entrepreneurs who want to be up to date with the latest stuff.

Data Mining: Companies can benefit from this analytical tool by equipping suitable and accessible knowledge-based data.
Data Warehouse: A data warehouse stores a huge amount of historical data that helps users to analyze different periods and trends to make future predictions.
Data modeling in data warehouses is aimed at efficiently supporting complex queries on long-term information.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover,
data warehouses are designed for the customer with general information knowledge about the
enterprise, whereas operational database systems are more oriented toward use by software
specialists for creating distinct applications.
Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current, detailed level and usually is stored on disk storage. When building the data warehouse, one has to remember over what unit of time the summarization is done and also which components or attributes the summarized data will contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse.
Metadata is the final element of the data warehouse and is of a different dimension, in that it is not the same as data drawn from the operational environment, but it is used as −
• A directory to help the DSS investigator locate the items of the data warehouse.
• A guide to the mapping of records as the data is changed from the operational environment to the data warehouse environment.
• A guide to the method used for summarization between the current, accurate data and the lightly summarized information and the highly summarized data, etc.
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between the different entities.
Characteristics of the conceptual data model
• It contains the essential entities and the relationships among them.
• No attribute is specified.
• No primary key is specified.
We can see that the only data shown via the conceptual data model is the entities that define the data and the relationships between those entities. No other detail is shown through the conceptual data model.
The steps for designing the logical data model are as follows:
• Specify primary keys for all entities.
• List the relationships between different entities.
• List all attributes for each entity.
• Normalization.
• No data types are listed
Foreign keys are used to recognize relationships between tables. The steps for physical data model design are as follows:
• Convert entities to tables.
• Convert relationships to foreign keys.
• Convert attributes to columns.
Enterprise Warehouse
An Enterprise warehouse collects all the records about subjects spanning the entire organization.
It supports corporate-wide data integration, usually from one or more operational systems or
external data providers, and it's cross-functional in scope. It generally contains detailed
information as well as summarized information and can range in size from a few gigabytes
to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms. It requires extensive business modeling and may take years to develop and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of
users. The scope is confined to particular selected subjects. For example, a marketing data mart
may restrict its subjects to the customer, items, and sales. The data contained in the data marts
tend to be summarized.
Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouses
A virtual warehouse is a set of views over the operational database. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-
level, more general concepts. Consider a concept hierarchy for the dimension location. City values
for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be
mapped to the province or state to which it belongs. For example, Vancouver can be mapped to
British Columbia, and Chicago to Illinois. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries). This concept hierarchy is illustrated in Figure 4.9.
Figure 4.9. A concept hierarchy for location. Due to space limitations, not all of the hierarchy nodes are shown, indicated by ellipses between nodes.
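A concept hierarchy like the one in Figure 4.9 can be represented simply as a chain of mappings. The following minimal Python sketch rolls a city up to its province/state and country; the Vancouver and Chicago mappings come from the text, while the Toronto and New York mappings are the standard ones, filled in for completeness.

city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "United States", "Illinois": "United States"}

def roll_up(city):
    # Map a low-level concept (city) to its higher-level concepts (state, country).
    state = city_to_state[city]
    return city, state, state_to_country[state]

print(roll_up("Chicago"))   # ('Chicago', 'Illinois', 'United States')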
Many concept hierarchies are implicit within the database schema. For example, suppose that the
dimension location is described by the attributes number, street, city, province_or_state, zip code,
and country. These attributes are related by a total order, forming a concept hierarchy such as
“street < city < province_or_state < country.” This hierarchy is shown in Figure 4.10(a). Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice. An example of a partial order for the time dimension based on the attributes day, week, month, quarter, and year is “day < {month < quarter; week} < year.” This lattice structure is shown in Figure 4.10(b). A
concept hierarchy that is a total or partial order among attributes in a database schema is called a
schema hierarchy.
Concept hierarchies that are common to many applications (e.g., for time) may be predefined in
the data mining system. Data mining systems should provide users with the flexibility to tailor
predefined hierarchies according to their particular needs. For example, users may want to define
a fiscal year starting on April 1 or an academic year starting on September 1.
Figure 4.10. Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for location and (b) a lattice for time.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups of values. An example of a set-grouping hierarchy is shown in Figure 4.11 for the dimension price, where an interval ($X…$Y] denotes the range from $X (exclusive) to $Y (inclusive).
Figure 4.11. A concept hierarchy for price.
There may be more than one concept hierarchy for a given attribute or dimension, based on
different user viewpoints. For instance, a user may prefer to organize price by defining ranges for
inexpensive, moderately priced, and expensive.
Concept hierarchies may be provided manually by system users, domain experts, or knowledge
engineers, or may be automatically generated based on statistical analysis of the data
distribution. The automatic generation of concept hierarchies is discussed in Chapter 3 as a
preprocessing step in preparation for data mining.
OLTP and OLAP: The two terms look similar but refer to different kinds of systems. Online
transaction processing (OLTP) captures, stores, and processes data from transactions in real
time. Online analytical processing (OLAP) uses complex queries to analyze aggregated historical
data from OLTP systems.
OLTP
An OLTP system captures and maintains transaction data in a database. Each transaction
involves individual database records made up of multiple fields or columns. Examples include
banking and credit card activity or retail checkout scanning.
In OLTP, the emphasis is on fast processing, because OLTP databases are read, written, and
updated frequently. If a transaction fails, built-in system logic ensures data integrity.
OLAP
OLAP applies complex queries to large amounts of historical data, aggregated from OLTP
databases and other sources, for data mining, analytics, and business intelligence projects. In
OLAP, the emphasis is on response time to these complex queries. Each query involves one or
more columns of data aggregated from many rows. Examples include year-over-year financial
performance or marketing lead generation trends. OLAP databases and data warehouses give
analysts and decision-makers the ability to use custom reporting tools to turn data into
information. Query failure in OLAP does not interrupt or delay transaction processing for
customers, but it can delay or impact the accuracy of business intelligence insights.
OLTP vs. OLAP
Query types: OLTP uses simple standardized queries; OLAP uses complex queries over a large amount of data.
Backup and recovery: OLTP requires regular backups to meet legal and governance requirements; for OLAP, lost data can be reloaded from the OLTP source as needed in lieu of regular backups.
Productivity: OLTP increases the productivity of end users; OLAP increases the productivity of business managers, data analysts, and executives.
OLTP provides an immediate record of current business activity, while OLAP generates and
validates insights from that data as it’s compiled over time. That historical perspective empowers
accurate forecasting, but as with all business intelligence, the insights generated with OLAP are
only as good as the data pipeline from which they emanate.
Association rules
Association rules are if-then statements that help to show the probability of relationships between
data items within large data sets in various types of databases. Association rule mining has a
number of applications and is widely used to help discover sales correlations in transactional data
or in medical datasets.
Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is market basket analysis.
Market basket analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.
TID ITEMS
1 Bread, Milk
Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, as a fraction of all transactions.
Supp(X => Y) = (number of transactions containing X ∪ Y) / (total number of transactions)
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.
Lift(X => Y) = Conf(X => Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
Example – From the above table, consider the rule {Milk, Diaper} => {Beer}.
The association rule is very useful in analyzing datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records which list all items bought by a customer in a single purchase. So the manager can know whether certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on statistics.
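To make the metrics concrete, the following minimal Python sketch computes support, confidence, and lift for the rule {Milk, Diaper} => {Beer}. Only the first transaction (Bread, Milk) comes from the table above; the remaining transactions are illustrative additions.

transactions = [
    {"Bread", "Milk"},                       # the row shown in the table above
    {"Bread", "Diaper", "Beer", "Eggs"},     # illustrative additional transactions
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)                 # fraction of transactions containing both X and Y
conf = support(X | Y) / support(X)    # how often Y appears when X does
lift = conf / support(Y)              # >1 means X and Y occur together more than expected

print(round(supp, 2), round(conf, 2), round(lift, 2))   # 0.4 0.67 1.11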
Classification
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the
data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known. For
example, a classification model that predicts credit risk could be developed based on observed
data for many loan applicants over a period of time. In addition to the historical credit rating, the
data might track employment history, home ownership or rental, years of residence, number and
type of investments, and so on. Credit rating would be the target, the other attributes would be the
predictors, and the data for each customer would constitute a case.
Classifications are discrete and do not imply order. Continuous, floating-point values would
indicate a numerical, rather than a categorical, target. A predictive model with a numerical target
uses a regression algorithm, not a classification algorithm.
The simplest type of classification problem is binary classification. In binary classification, the
target attribute has only two possible values: for example, high credit rating or low credit rating.
Multiclass targets have more than two values: for example, low, medium, high, or unknown credit
rating.
In the model build (training) process, a classification algorithm finds relationships between the
values of the predictors and the values of the target. Different classification algorithms use
different techniques for finding relationships. These relationships are summarized in a model,
which can then be applied to a different data set in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a
set of test data. The historical data for a classification project is typically divided into two data
sets: one for building the model; the other for testing the model. See "Testing a Classification
Model".
Scoring a classification model results in class assignments and probabilities for each case. For
example, a model that classifies customers as low, medium, or high value would also predict the
probability of each classification for each customer.
Classification has many applications in customer segmentation, business modeling, marketing,
credit analysis, and biomedical and drug response modeling.
Note:
Oracle Data Miner displays the generalized case ID in the DMR$CASE_ID column of the apply
output table. A "1" is appended to the column name of each predictor that you choose to include
in the output. The predictions (affinity card usage in Figure 5-2) are displayed in the PREDICTION
column. The probability of each prediction is displayed in the PROBABILITY column. For decision
trees, the node is displayed in the NODE column.
Since this classification model uses the Decision Tree algorithm, rules are generated with the
predictions and probabilities. With the Oracle Data Miner Rule Viewer, you can see the rule that
produced a prediction for a given node in the tree. Figure 5-3 shows the rule for node 5. The rule
states that married customerswho have a college degree (Associates, Bachelor, Masters, Ph.D., or
professional) are likely to increase spending with an affinity card.
Accuracy
Accuracy refers to the percentage of correct predictions made by the model when compared with
the actual classifications in the test data. Figure 5-4 shows the accuracy of a binary classification
model in Oracle Data Miner.
Clustering
Clustering analysis finds clusters of data objects that are similar in some sense to one another.
The members of a cluster are more like each other than they are like members of other clusters.
The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity
is low, and the intra-cluster similarity is high.
Clustering, like classification, is used to segment the data. Unlike classification, clustering models
segment data into groups that were not previously defined. Classification models segment data
by assigning it to previously defined classes, which are specified in a target. Clustering models do
not use a target.
Clustering is useful for exploring data. If there are many cases and no obvious groupings,
clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful
data-preprocessing step to identify homogeneous groups on which to build supervised models.
Clustering can also be used for anomaly detection. Once the data has been segmented into
clusters, you might find that some cases do not fit well into any clusters. These cases are
anomalies or outliers.
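The following minimal sketch, not from the text, shows partitioning-style clustering with scikit-learn's k-means and then flags points far from their centroid as possible anomalies; the data and the outlier threshold are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)        # unlabeled data: no target column
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centroids = kmeans.cluster_centers_
distances = np.linalg.norm(X - centroids[kmeans.labels_], axis=1)   # distance to own centroid
outliers = X[distances > distances.mean() + 2 * distances.std()]    # cases that fit poorly
print(len(outliers), "possible anomalies")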
Interpreting Clusters
Since known classes are not used in clustering, the interpretation of clusters can present
difficulties. How do you know if the clusters can reliably be used for business decision making?
You can analyze clusters by examining information generated by the clustering algorithm. Oracle
Data Mining generates the following information about each cluster:
• Position in the cluster hierarchy, described in "Cluster Rules"
• Rule for the position in the hierarchy, described in "Cluster Rules"
• Attribute histograms, described in "Attribute Histograms"
• Cluster centroid, described in "Centroid of a Cluster"
As with other forms of data mining, the process of clustering may be iterative and may require the
creation of several models. The removal of irrelevant attributes or the introduction of new
attributes may improve the quality of the segments produced by a clustering model.
How are Clusters Computed?
There are several different approaches to the computation of clusters. Clustering algorithms may
be characterized as:
• Hierarchical — Groups data objects into a hierarchy of clusters. The hierarchy can be
formed top-down or bottom-up. Hierarchical methods rely on a distance function to
measure the similarity between clusters.
Note:
The clustering algorithms supported by Oracle Data Mining perform hierarchical clustering.
• Partitioning — Partitions data objects into a given number of clusters. The clusters are
formed in order to optimize an objective criterion such as distance.
• Locality-based — Groups neighboring data objects into clusters based on local conditions.
• Grid-based — Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
Cluster Rules
Oracle Data Mining performs hierarchical clustering. The leaf clusters are the final clusters
generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.
Rules describe the data in each cluster. A rule is a conditional statement that captures the logic
used to split a parent cluster into child clusters. A rule describes the conditions for a case to be
assigned with some probability to a cluster. For example, the following rule applies to cases that
are assigned to cluster 19:
Number of Clusters
The CLUS_NUM_CLUSTERS build setting specifies the maximum number of clusters that can be
generated by a clustering algorithm.
Attribute Histograms
In Oracle Data Miner, a histogram represents the distribution of the values of an attribute in a
cluster. Figure 7-1 shows a histogram for the distribution of occupations in a cluster of customer
data.
In this cluster, about 13% of the customers are craftsmen; about 13% are executives, 2% are
farmers, and so on. None of the customers in this cluster are in the armed forces or work in
housing sales.
Centroid of a Cluster
The centroid represents the most typical case in a cluster. For example, in a data set of customer
ages and incomes, the centroid of each cluster would be a customer of average age and average
income in that cluster. If the data set included gender, the centroid would have the gender most
frequently represented in the cluster. Figure 7-1 shows the centroid values for a cluster.
The centroid is a prototype. It does not necessarily describe any given case assigned to the
cluster. The attribute values for the centroid are the mean of the numerical attributes and the mode of the categorical attributes.
Scoring New Data
Oracle Data Mining supports the scoring operation for clustering. In addition to generating clusters from the build data, clustering models create a Bayesian probability model that can be used to score new data.
Sample Clustering Problems
These examples use the clustering model km_sh_clus_sample, created by one of the Oracle Data
Mining sample programs, to show how clustering might be used to find natural groupings in the
build data or to score new data.
Figure 7-2 shows six columns and ten rows from the case table used to build the model. Note that
no column is designated as a target.
Regression
Regression models are tested by computing various statistics that measure the difference
between the predicted values and the expected values. The historical data for a regression project
is typically divided into two data sets: one for building the model, the other for testing the model.
Regression modeling has many applications in trend analysis, business planning, marketing,
financial forecasting, time series prediction, biomedical and drug response modeling, and
environmental modeling.
Linear Regression
A linear regression technique can be used if the relationship between the predictors and the target
can be approximated with a straight line.
Regression with a single predictor is the easiest to visualize. Simple linear regression with a single
predictor is shown in Figure 4-1.
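To make the idea concrete, here is a short Python sketch (an illustration only, not part of Oracle Data Mining) that fits a simple linear regression with a single predictor using scikit-learn; the data values are made up for the example.

# Simple linear regression with a single predictor (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictor (years of experience) and target (salary in $1000s).
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([35, 42, 50, 55, 63, 68])

model = LinearRegression().fit(x, y)
print("intercept:", model.intercept_)   # value of y when x is 0
print("coefficient:", model.coef_[0])   # change in y per unit change in x
print("prediction for x=7:", model.predict([[7]])[0])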
Regression Coefficients
In multivariate linear regression, the regression parameters are often referred to as coefficients.
When you build a multivariate linear regression model, the algorithm computes a coefficient for
each of the predictors used by the model.
The coefficient is a measure of the impact of the predictor x on the target y. Numerous statistics
are available for analyzing the regression coefficients to evaluate how well the regression line fits
the data. ("Regression Statistics".)
Nonlinear Regression
Often the relationship between x and y cannot be approximated with a straight line. In this case, a
nonlinear regression technique may be used. Alternatively, the data could be preprocessed to
make the relationship linear.
Nonlinear regression models define y as a function of x using an equation that is more
complicated than the linear regression equation. In Figure 4-2, x and y have a nonlinear
relationship.
Confidence Bounds
A regression model predicts a numeric target value for each case in the scoring data. In addition
to the predictions, some regression algorithms can identify confidence bounds, which are the
upper and lower boundaries of an interval in which the predicted value is likely to lie.
When a model is built to make predictions with a given confidence, the confidence interval will be
produced along with the predictions. For example, a model might predict the value of a house to
be $500,000 with a 95% confidence that the value will be between $475,000 and $525,000.
Note:
Oracle Data Miner displays the generalized case ID in the DMR$CASE_ID column of the apply
output table. A "1" is appended to the column name of each predictor that you choose to include
in the output. The predictions (the predicted ages in Figure 4-4) are displayed in the PREDICTION
column.
Residual Plot
A residual plot is a scatter plot where the x-axis is the predicted value of the target and the y-axis is the residual. The residual is the difference between the actual target value and the predicted target value.
Figure 4-5 shows a residual plot for the regression results shown in Figure 4-4. Note that most of
the data points are clustered around 0, indicating small residuals. However, the distance between
the data points and 0 increases with the value of x, indicating that the model has greater error for
people of higher ages.
Regression Statistics
The Root Mean Squared Error and the Mean Absolute Error are commonly used statistics for
evaluating the overall quality of a regression model. Different statistics may also be available
depending on the regression methods used by the algorithm.
The Mean Absolute Error (MAE) is the average of the absolute residuals: MAE = (1/n) Σ |y_j − ŷ_j|, where the sum runs over the n scored cases, y_j is the actual target value for case j, and ŷ_j is the predicted value. The Root Mean Squared Error (RMSE) is the square root of the average squared residual: RMSE = sqrt((1/n) Σ (y_j − ŷ_j)²).
Test Metrics in Oracle Data Miner
Oracle Data Miner calculates the regression test metrics shown in Figure 4-6.
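As a simple illustration (not Oracle-specific), both metrics can be computed directly from actual and predicted values; the numbers below are made up:

import math

actual    = [25, 32, 41, 57, 63]   # hypothetical actual target values
predicted = [28, 30, 44, 55, 60]   # hypothetical predicted values

n = len(actual)
residuals = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(r) for r in residuals) / n
rmse = math.sqrt(sum(r * r for r in residuals) / n)
print("MAE:", mae)    # average absolute residual
print("RMSE:", rmse)  # penalizes large residuals more heavily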
Regression Algorithms
Oracle Data Mining supports two algorithms for regression. Both algorithms are particularly suited
for mining data sets that have very high dimensionality (many attributes), including transactional
and unstructured data.
Support Vector Machines (SVM) is a powerful, state-of-the-art algorithm with strong theoretical
foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties.
Regularization refers to the generalization of the model to new data.
Advantages of SVM
SVM models have similar functional form to neural networks and radial basis functions, both
popular data mining techniques. However, neither of these algorithms has the well-founded
theoretical approach to regularization that forms the basis of SVM. The quality of generalization
and ease of training of SVM is far beyond the capacities of these more traditional methods.
SVM can model complex, real-world problems such as text and image classification, handwriting recognition, and bioinformatics and biosequence analysis.
SVM performs well on data sets that have many attributes, even if there are very few cases on
which to train the model. There is no upper limit on the number of attributes; the only constraints
are those imposed by hardware. Traditional neural nets do not perform well under these
circumstances.
Usability
Usability is a major enhancement, because SVM has often been viewed as a tool for experts. The
algorithm typically requires data preparation, tuning, and optimization. Oracle Data Mining
minimizes these requirements. You do not need to be an expert to build a quality SVM model in
Oracle Data Mining. For example:
• Data preparation is not required in most cases.
• Default tuning parameters are generally adequate.
Scalability
When dealing with very large data sets, sampling is often required. However, sampling is not
required with Oracle Data Mining SVM, because the algorithm itself uses stratified sampling to
reduce the size of the training data as needed.
Oracle Data Mining SVM is highly optimized. It builds a model incrementally by optimizing small
working sets toward a global solution. The model is trained until convergence on the current
working set, then the model adapts to the new data. The process continues iteratively until the
convergence conditions are met. The Gaussian kernel uses caching techniques to manage the
working sets.
Oracle Data Mining SVM supports active learning, an optimization method that builds a smaller,
more compact model while reducing the time and memory resources required for training the
model. See "Active Learning".
Kernel-Based Learning
SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear. Oracle Data Mining supports linear and Gaussian (nonlinear) kernels.
In Oracle Data Mining, the linear kernel function reduces to a linear equation on the original
attributes in the training data. A linear kernel works well when there are many attributes in the
training data.
The Gaussian kernel transforms each case in the training data to a point in an n-dimensional space, where n is the number of cases. The algorithm attempts to separate the points into subsets with homogeneous target values. The Gaussian kernel uses nonlinear separators, but within the kernel space it constructs a linear equation.
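For illustration only, the following sketch contrasts a linear kernel with a Gaussian (RBF) kernel using scikit-learn; it is an analogy to the concepts above, not Oracle Data Mining's implementation, and the data is synthetic.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data where the two classes are not linearly separable.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)     # Gaussian kernel

print("linear kernel accuracy:  ", linear_svm.score(X, y))
print("Gaussian kernel accuracy:", rbf_svm.score(X, y))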
Active Learning
Active learning is an optimization method for controlling model growth and reducing model build
time. Without active learning, SVM models grow as the size of the build data set increases, which
effectively limits SVM models to small and medium size training sets (less than 100,000 cases).
Active learning provides a
way to overcome this restriction. With active learning, SVM models can be built on very large
training sets.
Active learning forces the SVM algorithm to restrict learning to the most informative training
examples and not to attempt to use the entire body of data. In most cases, the resulting models
have predictive accuracy comparable to that of a standard (exact) SVM model.
Active learning provides a significant improvement in both linear and Gaussian SVM models,
whether for classification, regression, or anomaly detection.
However, active learning is especially advantageous for the Gaussian kernel, because nonlinear
models can otherwise grow to be very large and can place considerable demands on memory and
other system resources.
The following build settings control the SVM algorithm:
• SVMS_KERNEL_FUNCTION (Kernel function) − Linear or Gaussian. The algorithm automatically uses the kernel function that is most appropriate to the data. SVM uses the linear kernel when there are many attributes (more than 100) in the training data; otherwise it uses the Gaussian kernel.
The number of attributes does not correspond to the number of columns in the training data. SVM explodes categorical attributes to binary, numeric attributes. In addition, Oracle Data Mining interprets each row in a nested column as a separate attribute.
• SVMS_KERNEL_CACHE_SIZE (Cache size for Gaussian kernel) − Amount of memory allocated to the Gaussian kernel cache maintained in memory to improve model build time. The default cache size is 50 MB.
• SVMS_ACTIVE_LEARNING (Active learning) − Controls whether active learning is enabled. By default, active learning is enabled.
• SVMS_COMPLEXITY_FACTOR (Complexity factor) − Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor.
Data Preparation for SVM
The SVM algorithm operates natively on numeric attributes. The algorithm automatically
"explodes" categorical data into a set of binary attributes, one per category value. For example, a
character column for marital status with
values married or single would be transformed to two numeric
attributes: married and single. The new attributes could have the value 1 (true) or 0 (false).
When there are missing values in columns with simple data types (not nested), SVM interprets
them as missing at random. The algorithm automatically replaces missing categorical values with
the mode and missing numerical values with the mean.
When there are missing values in nested columns, SVM interprets them as sparse. The algorithm
automatically replaces sparse numerical data with zeros and sparse categorical data with zero
vectors.
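As a rough illustration of this kind of preparation, done outside Oracle Data Mining with pandas and made-up data:

import pandas as pd

# Hypothetical training data with a categorical column and missing values.
df = pd.DataFrame({
    "marital_status": ["married", "single", None, "married"],
    "income": [52000, None, 48000, 61000],
})

# Replace missing categorical values with the mode and missing numeric values with the mean.
df["marital_status"] = df["marital_status"].fillna(df["marital_status"].mode()[0])
df["income"] = df["income"].fillna(df["income"].mean())

# "Explode" the categorical column into binary (0/1) attributes, one per category value.
exploded = pd.get_dummies(df, columns=["marital_status"])
print(exploded)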
Normalization
SVM requires the normalization of numeric input. Normalization places the values of numeric
attributes on the same scale and prevents attributes with a large original scale from biasing the
solution. Normalization also minimizes the likelihood of overflows and underflows. Furthermore,
normalization brings the numerical attributes to the same scale (0,1) as the exploded categorical
data.
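A minimal sketch of min-max normalization (illustrative only, not Oracle's internal routine):

def min_max_normalize(values):
    # Scale a list of numbers to the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # all values identical
    return [(v - lo) / (hi - lo) for v in values]

incomes = [28000, 52000, 61000, 140000]   # hypothetical values
print(min_max_normalize(incomes))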
Note:
Oracle Corporation recommends that you use Automatic Data Preparation with SVM. The
transformations performed by ADP are appropriate for most models.
SVM Classification
SVM classification is based on the concept of decision planes that define decision boundaries. A
decision plane is one that separates between a set of objects having different class
memberships. SVM finds the vectors ("support vectors") that define the separators giving the
widest separation of classes.
SVM classification supports both binary and multiclass targets.
Class Weights
In SVM classification, weights are a biasing mechanism for specifying the relative importance of
target values (classes).
SVM models are automatically initialized to achieve the best average prediction across all classes.
However, if the training data does not represent a realistic distribution, you can bias the model to
compensate for class values that are under-represented. If you increase the weight for a class, the percent of correct predictions for that class should increase.
The Oracle Data Mining APIs use priors to specify class weights for SVM. To use priors in training
a model, you create a priors table and specify its name as a build setting for the model.
Priors are associated with probabilistic models to correct for biased sampling procedures. SVM
uses priors as a weight vector that biases optimization and favors one class over another.
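For illustration, scikit-learn exposes a comparable biasing mechanism through class weights; this is an analogy to Oracle's priors, not the same API, and the data is synthetic:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data (about 5% positive cases).
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)

default_model = SVC(kernel="linear").fit(X, y)
# Bias the model toward the under-represented class by giving it a higher weight.
weighted_model = SVC(kernel="linear", class_weight={0: 1, 1: 10}).fit(X, y)

print("positives predicted (default): ", sum(default_model.predict(X)))
print("positives predicted (weighted):", sum(weighted_model.predict(X)))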
One-Class SVM
Oracle Data Mining uses SVM as the one-class classifier for anomaly detection. When SVM is
used for anomaly detection, it has the classification mining function but no target.
One-class SVM models, when applied, produce a prediction and a probability for each case in the
scoring data. If the prediction is 1, the case is considered typical. If the prediction is 0, the case is
considered anomalous. This behavior reflects the fact that the model is trained with normal data.
You can specify the percentage of the data that you expect to be anomalous with the
SVMS_OUTLIER_RATE build setting. If you have some knowledge that the number of
“suspicious” cases is a certain percentage of your population, then you can set the outlier rate to that percentage. The model will identify approximately that many “rare” cases when applied to
the general population. The default is 10%, which is probably high for many anomaly detection
problems.
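A small illustrative sketch with scikit-learn's one-class SVM; the nu parameter plays a role loosely comparable to the outlier rate described above, but it is not the Oracle setting itself, and the data is synthetic:

import numpy as np
from sklearn.svm import OneClassSVM

# Train on "normal" data only (hypothetical 2-D measurements).
rng = np.random.RandomState(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

model = OneClassSVM(kernel="rbf", nu=0.1).fit(normal_data)   # expect roughly 10% outliers

new_cases = np.array([[0.1, -0.2], [6.0, 6.0]])
print(model.predict(new_cases))   # scikit-learn reports +1 = typical, -1 = anomalous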
SVM Regression
SVM uses an epsilon-insensitive loss function to solve regression problems.
SVM regression tries to find a continuous function such that the maximum number of data points
lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true
target value are not interpreted as errors.
The epsilon factor is a regularization setting for SVM regression. It balances the margin of error
with model robustness to achieve the best generalization to new data.
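An illustrative epsilon-SVR sketch in scikit-learn (again an analogy, not the Oracle implementation); the data is made up:

import numpy as np
from sklearn.svm import SVR

# Hypothetical predictor and continuous target.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.2])

# Predictions within epsilon of the true value are not treated as errors.
model = SVR(kernel="rbf", epsilon=0.2, C=10.0).fit(X, y)
print(model.predict([[9]]))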
K Nearest Neighbors (KNN)
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (for example, a distance function). KNN has been used in statistical estimation and pattern recognition since the early 1970s as a non-parametric technique.
Algorithm
A case is classified by a majority vote of its neighbors, with the case being assigned to the class
most common amongst its K nearest neighbors, measured by a distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbor.
It should also be noted that the common distance measures (such as Euclidean, Manhattan, and Minkowski) are only valid for continuous variables. For categorical variables, the Hamming distance must be used. This also raises the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value by using an independent dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1NN.
Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target.
We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with Default=Y.
With K=3, there are two Default=Y and one Default=N out of the three closest neighbors. The prediction for the unknown case is again Default=Y.
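The original data table is not reproduced here, but the following sketch shows the same idea with a small, made-up training set; the ages, loan amounts, and default labels are illustrative only and are not the original example's values:

import math

# Hypothetical training cases: (age, loan, default?)
training = [
    (25, 40000, "N"), (35, 60000, "N"), (45, 80000, "N"),
    (52, 18000, "N"), (23, 95000, "Y"), (40, 62000, "Y"),
    (60, 100000, "Y"), (48, 220000, "Y"),
]

def knn_predict(age, loan, k):
    # Rank training cases by Euclidean distance to the query case.
    ranked = sorted(training, key=lambda c: math.dist((age, loan), (c[0], c[1])))
    votes = [label for _, _, label in ranked[:k]]
    return max(set(votes), key=votes.count)   # majority vote among the K neighbors

print(knn_predict(48, 142000, k=1))
print(knn_predict(48, 142000, k=3))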
Standardized Distance
One major drawback in calculating distance measures directly from the training set arises when variables have different measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is based on annual income in dollars and the other is based on age in years, then income will have a much higher influence on the distance calculated. One solution is to standardize the training set, for example by rescaling each numeric variable to the range [0, 1]: Xs = (X − Xmin) / (Xmax − Xmin).
Dependency Modeling
Link Analysis
Link analysis is literally about analyzing the links between objects, whether they are physical,
digital or relational. This requires diligent data gathering. For example, in the case of a website
where all of the links and backlinks that are present must be analyzed, a tool has to sift through all
of the HTML codes and various scripts in the page and then follow all the links it finds in order to
determine what sort of links are present and whether they are active or dead. This information can
be very important for search engine optimization, as it allows the analyst to determine whether the
search engine is actually able to find and index the website.
In networking, link analysis may involve determining the integrity of the connection between each
network node by analyzing the data that passes through the physical or virtual links. With the data,
analysts can find bottlenecks and possible fault areas and are able to patch them up more quickly
or even help with network optimization.
Link analysis has three primary purposes:
• Find new patterns of interest (for example, in social networking, marketing, and business intelligence).
• Find matches in the data for known patterns of interest.
• Find anomalies, where known patterns are violated.
Social network analysis (SNA) applies link analysis to social structures, combining human (node) and relational (tie) analysis. The tie value is social capital.
SNA is often diagrammed with points (nodes) and lines (ties) to present the intricacies related to
social networking. Professional researchers perform analysis using software and unique theories
and methodologies.
SNA research is conducted in either of the following ways: by studying the complete (whole) network, that is, all of the ties in a defined population, or by studying an egocentric (personal) network, that is, the ties that selected individuals have.
Sequence mining
Sequence mining has already proven to be quite beneficial in many domains such as marketing
analysis or Web click-stream analysis. A sequence s is defined as an ordered list of items, denoted by ⟨s1, s2, …, sn⟩. In activity recognition problems, the sequence is typically ordered using timestamps. The goal of sequence mining is to discover interesting patterns in data with respect to some subjective or objective measure of how interesting it is. Typically, this task involves discovering frequent sequential patterns with respect to a frequency support measure.
The task of discovering all the frequent sequences is not a trivial one. In fact, it can be quite
challenging due to the combinatorial and exponential search space. Over the past decade, a
number of sequence mining methods have
been proposed that handle the exponential search by using various heuristics. The first sequence
mining algorithm was called GSP, which was based on the Apriori approach for mining frequent itemsets. GSP makes several passes over the database to count the support of each sequence
and to generate candidates.
Then, it prunes the sequences with a support count below the minimum support.
Many other algorithms have been proposed to extend the GSP algorithm. One example is the PSP
algorithm, which uses a prefix-based tree to represent candidate patterns. FREESPAN and
PREFIXSPAN are among the first algorithms to consider a projection method for mining
sequential patterns, by recursively projecting sequence databases into smaller projected databases.
SPADE is another algorithm that needs only three passes over the database to discover
sequential patterns. SPAM was the first algorithm to use a vertical bitmap representation of a
database. Some other algorithms focus on discovering specific types of frequent patterns. For
example, BIDE is an efficient algorithm for mining frequent closed sequences without candidate
maintenance; there are also methods for constraint-based sequential pattern mining.
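To make the notion of frequent sequences concrete, here is a small, generic support-counting sketch; it is not an implementation of any of the specific algorithms above, and the click-stream sequences are made up:

from itertools import permutations

# Hypothetical click-stream database: each entry is an ordered sequence of pages.
database = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "reviews"],
    ["home", "search", "product"],
]

def is_subsequence(pattern, sequence):
    # True if pattern's items appear in sequence in the same order (not necessarily adjacent).
    it = iter(sequence)
    return all(item in it for item in pattern)

def frequent_pairs(db, min_support):
    items = sorted({item for seq in db for item in seq})
    result = {}
    for pattern in permutations(items, 2):             # candidate 2-item sequences
        support = sum(is_subsequence(pattern, seq) for seq in db)
        if support >= min_support:
            result[pattern] = support
    return result

print(frequent_pairs(database, min_support=3))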
Big Data
According to Gartner's definition, “Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and
large data sets that have to be processed and analyzed to uncover valuable information that can
benefit businesses and organizations.
However, there are certain basic tenets of Big Data that make it even simpler to answer what Big Data is:
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
• It includes data mining, data storage, data analysis, data sharing, and data visualization.
• The term is an all-comprehensive one including data, data frameworks, along with the tools
and techniques used to process and analyze the data.
Structured
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a database by simple search engine
algorithms. For instance, the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an organized manner.
Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.
Semi-structured
Semi-structured is the third type of big data. Semi-structured data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data. This concludes the types of Big Data; let us now discuss its characteristics.
1) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more. Variety is one of the important characteristics of big data.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader sense, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and activity bursts.
3) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge ‘volumes’ of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, and human interactions. Such large amounts of data are stored in data warehouses. This concludes the characteristics of big data.
Advantages of Big Data
o One of the biggest advantages of Big Data is predictive analysis. Big Data analytics
tools can predict outcomes accurately, thereby, allowing businesses and
organizations to make better decisions, while simultaneously optimizing their
operational efficiencies and reducing risks.
o By harnessing data from social media platforms using Big Data analytics tools,
businesses around the world are streamlining their digital marketing strategies to
enhance the overall consumer experience. Big Data provides insights into the
customer pain points and allows companies to improve upon their products and
services.
o Being accurate, Big Data combines relevant data from multiple sources to produce
highly actionable insights. Almost 43% of companies lack the necessary tools to filter
out irrelevant data, which eventually costs them millions of dollars to hash out useful
data from the bulk. Big Data tools can help reduce this, saving you both time and
money.
o Big Data analytics could help companies generate more sales leads which would
naturally mean a boost in revenue. Businesses are using Big Data analytics tools to
understand how well their products/services are doing in the market and how the
customers are responding to them. Thus, they can understand better where to invest
their time and money.
o With Big Data insights, you can always stay a step ahead of your competitors. You can
screen the market to know what kind of promotions and offers your rivals are providing,
and then you can come up with better offers for your customers. Also, Big Data insights
allow you to learn customer behavior to understand the customer trends and provide a
highly ‘personalized’ experience to them.
2) Academia
Big Data is also helping enhance education today. Education is no longer limited to the physical bounds of the classroom; there are numerous online educational courses to learn from.
Academic institutions are investing in digital courses powered by Big Data technologies to aid the
all-round development of budding learners.
3) Banking
The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect
fraudulent acts in real-time such as misuse of credit/debit cards, archival of inspection tracks,
faulty alteration in customer stats, etc.
4) Manufacturing
According to TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is
improving the supply strategies and product quality. In the manufacturing sector, Big data helps
create a transparent infrastructure, thereby, predicting uncertainties and incompetencies that can
affect the business adversely.
5) IT
One of the largest users of Big Data, IT companies around the world are using Big Data to optimize
their functioning, enhance employee productivity, and minimize risks in business operations. By
combining Big Data technologies with ML and AI, the IT sector is continually powering innovation
to find solutions even for the most complex of problems.
6. Retail
Big Data has changed the way of working in traditional brick and mortar retail stores. Over the
years, retailers have collected vast amounts of data from local demographic surveys, POS
scanners, RFID, customer loyalty cards, store inventory, and so on. Now, they’ve started to
leverage this data to create personalized customer experiences, boost sales, increase revenue,
and deliver outstanding customer service.
Retailers are even using smart sensors and Wi-Fi to track the movement of customers, the most
frequented aisles, for how long customers linger in the aisles, among other things. They also
gather social media data to understand what customers are saying about their brand, their
services, and tweak their product design and marketing strategies accordingly.
7. Transportation
Big Data Analytics holds immense value for the transportation industry. In countries across the
world, both private and government-run transportation companies use Big Data technologies to
optimize route planning, control traffic, manage road congestion and improve services.
Additionally, transportation services even use Big Data for revenue management, to drive technological innovation, to enhance logistics, and, of course, to gain the upper hand in the market.
1. Walmart
Walmart leverages Big Data and Data Mining to create personalized product recommendations
for its customers. With the help of these two emerging technologies, Walmart can uncover
valuable patterns showing the most frequently bought products, most popular products, and even
the most popular product bundles (products that complement each other and are usually
purchased together).
Based on these insights, Walmart creates attractive and customized recommendations for
individual users. By effectively implementing Data Mining techniques, the retail giant has
successfully increased the conversion rates and improved its customer service substantially.
Furthermore, Walmart
uses Hadoop and NoSQL technologies to allow customers to access real-time data accumulated
from disparate sources.
2. American Express
The credit card giant leverages enormous volumes of customer data to identify indicators that
could depict user loyalty. It also uses Big Data to build advanced predictive models for analyzing
historical transactions along with 115 different variables to predict potential customer churn.
Thanks to Big Data solutions and tools, American Express can identify 24% of the accounts that
are highly likely to close in the upcoming four to five months.
3. General Electric
In the words of Jeff Immelt, Chairman of General Electric, in the past few years, GE has been
successful in bringing together the best of both worlds – “the physical and analytical worlds.” GE
thoroughly utilizes Big Data. Every machine operating under General Electric generates data on
how they work. The GE analytics team then crunches these colossal amounts of data to extract
relevant insights from it and redesign the machines and their operations accordingly.
Today, the company has realized that even minor improvements, no matter how small, play a
crucial role in their company infrastructure. According to GE, Big Data has the potential to boost productivity by 1.5% in the US, which, compounded over a span of 20 years, could increase the average national income by a staggering 30%!
4. Uber
Uber is one of the major cab service providers in the world. It leverages customer data to track and
identify the most popular and most used services by the users.
Once this data is collected, Uber uses data analytics to analyze the usage patterns of customers
and determine which services should be given more emphasis and importance.
Apart from this, Uber uses Big Data in another unique way. Uber closely studies the demand and
supply of its services and changes the cab fares accordingly. This is the surge pricing mechanism, which works something like this: if you are in a hurry and have to book a cab from a crowded location, Uber will charge you double the normal amount!
5. Netflix
Netflix is one of the most popular on-demand online video content streaming platforms, used by people around the world. Netflix is a major proponent of the recommendation engine. It collects customer data to understand the specific needs, preferences, and taste patterns of users. Then it uses this data to predict what individual users will like and create personalized content recommendation lists for them.
Today, Netflix has become so vast that it is even creating unique content for users. Data is the
secret ingredient that fuels both its recommendation engines and new content decisions. The
most pivotal data points used by Netflix include titles that users watch, user ratings, genres
preferred, and how often users stop the playback, to name a few. Hadoop, Hive, and Pig are the
three core components of the data structure used by Netflix.
7. IRS
Yes, even government agencies are not shying away from using Big Data. The
US Internal Revenue Service actively uses Big Data to prevent identity theft, fraud, and untimely
payments (people who should pay taxes but don’t pay them in due time).
The IRS even harnesses the power of Big Data to ensure and enforce compliance with tax rules
and laws. As of now, the IRS has successfully averted fraud and scams involving billions of
dollars, especially in the case of identity theft. In the past three years, it has also recovered over
US$ 2 billion.
Introduction to MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed
algorithm on a cluster (source: Wikipedia). MapReduce, when coupled with HDFS, can be used to handle big data. This HDFS-MapReduce system is commonly referred to as Hadoop.
The basic unit of information used in MapReduce is a (key, value) pair. All types of structured and unstructured data need to be translated to this basic unit before feeding the data to the MapReduce model. As the name suggests, the MapReduce model consists of two separate routines, namely the Map function and the Reduce function. This article will help you understand the step-by-step functionality of the MapReduce model. The computation on an input (i.e., on a set of (key, value) pairs) in the MapReduce model occurs in three stages:
Step 1: The map stage
Step 2: The shuffle stage
Step 3: The reduce stage
Semantically, the map and shuffle phases distribute the data, and the reduce phase performs the computation. In this article we will discuss each of these stages in detail.
The Reduce Stage
In the reduce stage, the reducer takes all of the values associated with a single key k and outputs
any number of (key, value) pairs. This highlights one of the sequential aspects of MapReduce
computation: all of the maps need to finish before the reduce stage can begin. Since the reducer
has access to all the values with the same key, it can perform sequential computations on these
values. In the reduce step, the parallelism is exploited by observing that reducers operating on
different keys can be executed simultaneously. To summarize, for the reduce phase, the user
designs a function that takes in input a list of values associated with a single key and outputs any
number of pairs. Often the output keys of a reducer equal the input key (in fact, in the original
MapReduce paper the output key must equal the input key, but Hadoop relaxed this constraint).
Overall, a program in the MapReduce paradigm can consist of many rounds (usually called jobs) of different map and reduce functions, performed sequentially one after another.
Our objective is to count the frequency of each word in all the sentences. Imagine that each of these sentences occupies a huge amount of memory and hence is allotted to a different data node. The Mapper takes over this unstructured data and creates key-value pairs. In this case the key is the word and the value is the count of this word in the text available at that data node. For instance, the 1st Map node generates 4 key-value pairs: (the,1), (brown,1), (fox,1), (quick,1). The first 3 key-value pairs go to the first Reducer and the last key-value pair goes to the second Reducer.
Similarly, the 2nd and 3rd map functions do the mapping for the other two sentences. Through shuffling, all the similar words come to the same end. Once the key-value pairs are sorted, the reducer function operates on this structured data to come up with a summary.
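A minimal word-count sketch in plain Python, mimicking the map, shuffle, and reduce stages; this is a simulation for illustration, not Hadoop code, and the sentences are made up:

from collections import defaultdict

sentences = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map stage: emit a (word, 1) pair for every word in every sentence.
mapped = [(word, 1) for sentence in sentences for word in sentence.split()]

# Shuffle stage: group all values by key so each word's counts end up together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce stage: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}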
End Notes
The constraint of using the MapReduce function is that the user has to follow a particular logic format: generate key-value pairs using the Map function and then summarize them using the Reduce function. Luckily, most data manipulation operations can be translated into this format. In the next article we will take some examples, such as data-set merging, matrix multiplication, and matrix transpose, using MapReduce.
Introduction to Hadoop
Hadoop is a complete ecosystem of open-source projects that provides us with the framework to deal with big data. Let's start by brainstorming the possible challenges of dealing with big data (on traditional systems) and then look at the capability of the Hadoop solution.
Following are the challenges I can think of in dealing with big data :
1. High capital investment in procuring a server with high processing capacity.
2. Enormous time taken
3. In case of long query, imagine an error happens on the last step. You will waste so much
time making these iterations.
4. Difficulty in program query building
Here is how Hadoop solves all of these issues:
1. High capital investment in procuring a server with high processing capacity: Hadoop clusters
work on normal commodity hardware and keep
multiple copies to ensure reliability of data. A maximum of 4500 machines can be connected
together using Hadoop.
2. Enormous time taken: The process is broken down into pieces and executed in parallel, hence
saving time. A maximum of 25 Petabyte (1 PB = 1000 TB) data can be processed using Hadoop.
3. In case of long query, imagine an error happens on the last step. You will waste so much time
making these iterations: Hadoop builds back up datasets at every level. It also executes query on
duplicate datasets to avoid process loss in case of individual failure. These steps make Hadoop
processing more precise and accurate.
4. Difficulty in program query building: Queries in Hadoop are as simple as coding in any
language. You just need to change the way of thinking around building a query to enable parallel
processing.
Background of Hadoop
With the increase in internet penetration and usage, the data captured by
Google increased exponentially year on year. Just to give you an estimate of this number, in 2007
Google collected on an average 270 PB of data every month. The same number increased to
20000 PB every day in 2009.
Obviously, Google needed a better platform to process such enormous data. Google
implemented a programming model called MapReduce, which could process this 20000 PB per
day. Google ran these MapReduce operations on a special file system called Google File System
(GFS). Sadly, GFS is not open source.
Doug Cutting and Yahoo! reverse engineered GFS and built a parallel system, the Hadoop Distributed File System (HDFS). The software or framework that supports HDFS and MapReduce is known as Hadoop. Hadoop is open source and distributed by Apache.
Hadoop works in a similar format. On the bottom we have machines arranged in parallel. These
machines are analogous to individual contributor in our analogy. Every machine has a data node
and a task tracker. The data node is also known as HDFS (Hadoop Distributed File System) and the task tracker is also known as the map-reducer.
Data node contains the entire set of data and Task tracker does all the operations. You can
imagine the task tracker as your arms and legs, which enable you to do a task, and the data node as your
brain, which contains all the information which you want to process. These machines are working
in silos, and it is very essential to coordinate them. The Task trackers (Project manager in our
analogy) in different machines are coordinated by a Job Tracker. Job Tracker makes sure that
each operation is completed and if there is a process failure at any node, it needs to assign a
duplicate task to some task tracker. Job tracker also distributes the entire task to all the
machines.
A name node on the other hand coordinates all the data nodes. It governs the distribution of data
going to each machine. It also checks for any kind of purging that has happened on any machine. If such purging happens, it finds the duplicate data which was sent to another data node and duplicates it again. You can think of this name node as the people manager in our analogy, which is concerned more about the retention of the entire dataset.
Distributed File System (DFS)
One process involved in implementing a DFS is giving access control and storage management controls to the client system in a centralized way, managed by the servers. Transparency is one of the core properties of a DFS: files are accessed, stored, and managed on the local client machines while the process itself is actually held on the servers. This transparency brings
convenience to the end user on a client machine because the network file system efficiently
manages all the processes. Generally, a DFS is used in a LAN, but it can be used in a WAN or over
the Internet.
A DFS allows efficient and well-managed data and storage sharing options on a network
compared to other options. Another option for users in network-based computing is a shared disk
file system. A shared disk file system puts the access control on the client’s systems, so the data
is inaccessible when the client system goes offline. DFS is fault-tolerant, and the data is
accessible even if some of the network nodes are offline.
A DFS makes it possible to restrict access to the file system depending on access lists or
capabilities on both the servers and the clients, depending on how the protocol is designed.
HDFS
The Hadoop Distributed File System (HDFS) was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
o It is suitable for distributed storage and processing.
o Hadoop provides a command interface to interact with HDFS.
o The built-in servers of namenode and datanode help users to easily check the status of the cluster.
o Streaming access to file system data.
o HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture, and it has the following elements.
Name node
The name node is the commodity hardware that contains the GNU/Linux operating system and
the name node software. It is a software that can be run on commodity hardware. The system
having the name node acts as the master server and it does the following tasks −
o Manages the file system namespace.
o Regulates client’s access to files.
o It also executes file system operations such as renaming, closing, and opening files
and directories.
Data node
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client
request.
• They also perform operations such as block creation, deletion, and replication
according to the instructions of the name node.
Block
Generally, the user data is stored in the files of HDFS. The file in a file system will be divided into
one or more segments and/or stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be changed in the HDFS configuration as needed.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large amount of commodity hardware,
failure of components is frequent. Therefore, HDFS should have mechanisms for quick and
automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.
NoSQL
NoSQL databases (aka "not only SQL") are non-tabular, and store data differently than relational
tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph.
They provide flexible schemas and scale easily with large amounts of data and high user loads.
What is NoSQL?
When people use the term “NoSQL database”, they typically use it to refer to any non-relational
database. Some say the term “NoSQL” stands for “non-SQL” while others say it stands for “not
only SQL.” Either way, most agree that NoSQL databases are databases that store data in a format
other than relational tables.
A common misconception is that NoSQL databases or non-relational databases don’t store
relationship data well. NoSQL databases can store relationship data; they just store it differently than relational databases do. In fact, when compared with SQL databases, many find modeling
relationship data in NoSQL databases to be easier than in SQL databases, because related data
doesn’t have to be split between tables.
NoSQL data models allow related data to be nested within a single data structure.
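For example, a customer and its orders might be stored as a single nested document rather than split across tables. The sketch below uses plain Python/JSON with made-up field names:

import json

# One self-contained document: related order data is nested inside the customer.
customer_doc = {
    "customer_id": 101,
    "name": "A. Sharma",
    "orders": [
        {"order_id": 5001, "total": 1200.50, "items": ["keyboard", "mouse"]},
        {"order_id": 5002, "total": 350.00, "items": ["cable"]},
    ],
}

# In a relational design the same data would be split across a customers table,
# an orders table, and an order_items table, joined back together at query time.
print(json.dumps(customer_doc, indent=2))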
NoSQL databases emerged in the late 2000s as the cost of storage dramatically decreased. Gone
were the days of needing to create a complex, difficult-to-manage data model simply for the purposes of reducing data duplication.
Developers (rather than storage) were becoming the primary cost of software development, so
NoSQL databases optimized for developer productivity.
NoSQL databases often leverage data models more tailored to specific use cases, making them
better at supporting those workloads than relational databases. For example, key-value databases
support simple queries very efficiently while graph databases are the best for queries that involve
identifying complex relationships between separate pieces of data.
Performance
NoSQL databases can often perform better than SQL/relational databases for your use case. For
example, if you’re using a document database and are storing all the information about an object
in the same document (so that it matches the objects in your code), the database only needs to go
to one place for those queries. In a SQL database, the same query would likely involve joining
multiple tables and records, which can dramatically impact performance while also slowing down
how quickly developers write code.
Scalability
SQL/relational databases were originally designed to scale up and although there are ways to get
them to scale out, those solutions are often bolt-ons, complicated, expensive to manage, and hard
to evolve. Some core SQL functionality also only really works well when everything is on one
server. In contrast, NoSQL databases are designed from the ground up to scale out horizontally,
making it much easier to maintain performance as your workload grows beyond the limits of a
single server.
Data Distribution
Because NoSQL databases are designed from the ground up as distributed systems, they can
more easily support a variety of business requirements. For example, suppose the business needs
a globally distributed application that provides excellent performance to users all around the
world. NoSQL databases can allow you to deploy a single distributed cluster to support that
application and ensure low latency access to data from anywhere. This approach also makes it
much easier to comply with data sovereignty mandates required by modern privacy regulations.
Reliability
NoSQL databases ensure high availability and uptime with native replication and built-in failover for self-healing, resilient database clusters. Similar failover systems can be set up for SQL databases, but since the functionality is not native to the underlying database, this often means deploying and maintaining a separate clustering layer with more resources, which then takes longer to identify and recover from underlying system failures.
Flexibility
NoSQL databases are better at allowing users to test new ideas and update data structures. For
example, MongoDB, the leading document database, stores data in flexible, JSON-like documents,
meaning fields can vary from document to document and the data structures can be easily
changed over time, as application requirements evolve. This is a better fit for modern
microservices architectures where developers are continuously integrating and deploying new
application functionality.
Queries Optimization
Queries can be executed in many different ways, and all paths lead to the same query result. The query optimizer evaluates the possibilities and selects the most efficient plan. Efficiency is measured in latency and throughput, depending on the workload. In a cost-based optimizer, the cost of memory, CPU, and disk usage is added to the cost of a plan.
Now, most NoSQL databases have SQL-like query language support, so a good optimizer is mandatory. When you don't have a good optimizer, developers have to live with feature restrictions and DBAs have to live with performance issues.
Database Optimizer
A query optimizer chooses an optimal index and access paths to execute the query. At a very high
level, SQL optimizers decide the following before creating the execution tree:
1. Query rewrite based on heuristics, cost, or both.
2. Index selection:
• Selecting the optimal index(es) for each table (keyspaces in Couchbase N1QL, collections in MongoDB).
• Depending on the index selected, choosing the predicates to push down, determining whether the query is covered, and deciding on the sort and pagination strategy.
3. Join reordering
4. Join type
Queries Optimization
Query optimization is the science and the art of applying equivalence rules to rewrite the tree of operators invoked in a query and produce an optimal plan. A plan is optimal if it returns the answer in the least time or using the least space. There are well-known syntactic, logical, and semantic equivalence rules used during optimization. These rules can be used to select an optimal plan among semantically equivalent plans by associating a cost with each plan and selecting the plan with the lowest overall cost. The cost associated with each plan is generated using metrics such as the cardinality (the number of result tuples in the output of each operator), the cost of accessing a source and obtaining results from that source, and so on. One must also have a cost formula that can calculate the processing cost for each implementation of each operator. The overall cost is typically defined as the total time needed to evaluate the query and obtain all of the answers.
The characterization of an optimal, low-cost plan is a difficult task. The complexity of producing an optimal, low-cost plan for a relational query is NP-complete.
However, many efforts have produced reasonable heuristics to solve this problem. Both dynamic programming and randomized optimization based on simulated annealing provide good solutions.
A BIS could be improved significantly by exploiting the traditional database technology for
optimization extended to capture the complex metrics presented in Section 4.4.1. Many of the
systems presented in this book address optimization at different levels. K2 uses rewriting rules
and a cost model. P/FDM combines traditional optimization strategies, such as query rewriting
and selection of the best execution plan, with a query-shipping approach. DiscoveryLink performs
two types of optimizations: query rewriting followed by a cost-based optimization plan. KIND is
addressing the use of domain knowledge into executable meta-data. The knowledge of biological
resources can be used to identify the best plan for query execution. Each document retrieved from a source may generate further accesses to (other) sources. Web accesses are costly and should
be as limited as possible. A plan that limits the number of accesses is likely to have a lower cost.
Early selection is likely to limit the number of accesses. For example, the call to PubMed in the
plan illustrated in Figure 4.1 retrieves 81,840 citations, whereas the call to GenBank in the plan in
Figure 4.2 retrieves 1616 sequences. (Note that the statistics and results cited in this paper were
gathered between April 2001 and April 2002 and may no longer be up to date.) If each of the
retrieved documents (from PubMed or GenBank) generated an additional access to the second
source, clearly the second plan has the potential to be much less expensive when compared to
the first plan.
The size of the data sources involved in the query may also affect the cost of the evaluation plan.
As of May 4, 2001, Swiss-Prot contained 95,674 entries whereas PubMed contained more than 11
million citations; these are the values of cardinality for the corresponding relations. A query
submitted to PubMed (as used in the first plan) retrieves 727,545 references that mention brain,
whereas it retrieves 206,317 references that mention brain and were published since 1995.
This is the selectivity of the query. In contrast, the query submitted to Swiss-Prot in the second
plan returns 126 proteins annotated with calcium channel.
In addition to the previously mentioned characteristics of the resources, the order of accessing
sources and the use of different capabilities of sources also affects the total cost of the plan. The
first plan accesses PubMed and extracts values for identifiers of records in Swiss-Prot from the
results. It then passes these values to the query on Swiss-Prot via the join operator. To pass each
value, the plan may have to send multiple calls to the Swiss-Prot source, one for each value, and
this can be expensive. However, by passing these values of identifiers to Swiss-Prot, the Swiss-
Prot source has the potential to constrain the query, and this could reduce the number of results
returned from Swiss-Prot. On the other hand, the second plan submits queries in parallel to both
PubMed and Swiss-Prot. It does not pass values of identifiers of Swiss-Prot records to Swiss-Prot;
consequently, more results may be returned from Swiss-Prot. The results from both PubMed and
Swiss-Prot have to be processed (joined) locally, and this could be computationally expensive.
Recall that for this plan, 206,317 PubMed references and 126 proteins from Swiss-Prot are
processed locally. However, the advantage is that a single query has been submitted to Swiss-
Prot in the second plan. Also, both sources are accessed in parallel.
Although it has not been described previously, there is a third plan that should be considered for
this query. This plan would first retrieve those proteins annotated with calcium channel from
Swiss-Prot and extract MEDLINE identifiers from these records. It would then pass these
identifiers to PubMed and restrict the results to those matching the keyword brain. In this
particular case, this third plan has the potential to be the least costly. It submits one sub-query to
Swiss-Prot, and it will not download 206,317 PubMed references. Finally, it will not join 206,317
PubMed references and 126 proteins from Swiss-Prot locally.
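To make the comparison concrete, here is a toy cost sketch; the cost formula and weights are invented for illustration, and the statistics are only rough stand-ins for the row counts discussed above:

# Rough, illustrative statistics for the three candidate plans discussed above.
plans = {
    "plan1_pubmed_first":    {"source_calls": 206317 + 1, "rows_joined_locally": 0},
    "plan2_parallel":        {"source_calls": 2,          "rows_joined_locally": 206317 + 126},
    "plan3_swissprot_first": {"source_calls": 126 + 1,    "rows_joined_locally": 0},
}

def estimated_cost(stats, call_cost=1.0, row_cost=0.001):
    # Simple additive cost model: remote source accesses dominate, local join work is cheap.
    return stats["source_calls"] * call_cost + stats["rows_joined_locally"] * row_cost

for name, stats in plans.items():
    print(name, round(estimated_cost(stats), 1))
print("cheapest plan:", min(plans, key=lambda n: estimated_cost(plans[n])))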
Optimization has an immediate impact on the overall performance of the system. The consequences of a system's inefficiency in executing users' queries may affect the satisfaction of users as well as the capability of the system to return any output to the user.
NoSQL Database
Advantages of NoSQL
o It supports query language.
o It provides fast performance.
o It provides horizontal scalability.
Indexing in Databases
An index structure typically has two columns. The first column is the Search Key, which holds a copy of the primary key or a candidate key of the table. The second column is the Data Reference or Pointer, which contains a set of pointers holding the address of the disk block where that particular key value can be found.
The indexing has various attributes:
• Access Types: This refers to the type of access such as value based search, range access,
etc.
• Access Time: It refers to the time needed to find a particular data element or set of elements.
• Insertion Time: It refers to the time taken to find the appropriate space and insert new data.
• Deletion Time: Time taken to find an item and delete it as well as update the index
structure.
• Space Overhead: It refers to the additional space required by the index.
In general, there are two types of file organization mechanisms followed by the indexing methods to store the data:
1. Sequential File Organization or Ordered Index File: In this, the indices are based on a sorted ordering of the values. These are generally fast and a more traditional type of storing mechanism. These ordered or sequential file organizations might store the data in a dense or sparse format:
o Dense Index:
o For every search key value in the data file, there is an index record.
o This record contains the search key and also a reference to the first data record
with that search key value.
o Sparse Index:
o The index record appears only for a few items in the data file. Each item points to
a block as shown.
o To locate a record, we find the index record with the largest search key value less than or equal to the search key value we are looking for.
o We start at that record pointed to by the index record and proceed along with the
pointers in the file (that is, sequentially) until we find the desired record.
2. Hash File Organization: Indices are based on the values being distributed uniformly across a range of buckets. The bucket to which a value is assigned is determined by a function called a hash function.
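A minimal sketch of the idea behind a hash-based index (illustrative only; keys and block names are hypothetical):

NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]   # each bucket holds (key, disk_block) pairs

def bucket_for(key):
    # The hash function maps a search key to one of the buckets.
    return hash(key) % NUM_BUCKETS

def insert(key, disk_block):
    buckets[bucket_for(key)].append((key, disk_block))

def lookup(key):
    # Only the one bucket selected by the hash function needs to be scanned.
    return [block for k, block in buckets[bucket_for(key)] if k == key]

insert("emp_101", "block_7")
insert("emp_202", "block_3")
print(lookup("emp_101"))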
1. Clustered Indexing
When two or more records are stored in the same file, this type of storage is known as cluster indexing. By using cluster indexing we can reduce the cost of searching, since multiple records related to the same thing are stored in one place; it also benefits frequent joins of two or more tables (records).
A clustering index is defined on an ordered data file, where the data file is ordered on a non-key
field. In some cases, the index is created on non-primary-key columns, which may not be unique
for each record. In such cases, in order to identify the records faster, we group two or more
columns together to obtain unique values and create an index out of them. This method is known
as the clustering index. Basically, records with similar characteristics are grouped together and
indexes are created for these groups.
For example, students studying in each semester are grouped together: 1st semester students,
2nd semester students, 3rd semester students, and so on.
[Figure: a clustered index sorted according to first name (the search key).]
Primary Indexing:
This is a type of clustered indexing wherein the data is sorted according to the search key and the
primary key of the database table is used to create the index. It is the default format of indexing
and it induces a sequential file organization. As primary keys are unique and stored in sorted
order, the searching operation is quite efficient.
3. Multilevel Indexing
With the growth of the size of the database, indices also grow. Since the index is kept in main
memory, a single-level index might become too large to store there, forcing multiple disk
accesses. Multilevel indexing segregates the main index block into various smaller blocks so that
each can be stored in a single block. The outer blocks are divided into inner blocks, which in turn
point to the data blocks. The outer index can then easily be kept in main memory with little
overhead.
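The following is a small sketch of the idea, assuming a simplified two-level index in which every entry stores the highest key reachable through the child it points to; it illustrates the lookup path only, not a real index implementation:

import java.util.List;

public class MultilevelIndexSketch {

    // Each entry records the highest key reachable through the child it points to.
    record Entry(int maxKey, int childNo) {}

    static int findChild(List<Entry> level, int target) {
        for (Entry e : level) {                          // entries are sorted by maxKey
            if (target <= e.maxKey()) {
                return e.childNo();
            }
        }
        return level.get(level.size() - 1).childNo();    // larger than every key: last child
    }

    // Outer index block -> inner index block -> data block: two small lookups
    // instead of scanning one huge single-level index.
    static int locateDataBlock(List<Entry> outerIndex, List<List<Entry>> innerBlocks, int target) {
        int innerNo = findChild(outerIndex, target);
        return findChild(innerBlocks.get(innerNo), target);
    }
}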
NoSQL in the Cloud
With the current move to cloud computing, the need to scale applications presents itself as a
challenge for storing data. If you are using a traditional relational database, you may find yourself
working on a complex policy for distributing your database load across multiple database
instances. This solution will often present a lot of problems and probably won’t be great at
elastically scaling.
As an alternative you could consider a cloud-based NoSQL database. Over the past few weeks, I
have been analysing a few such offerings, each of which promises to scale as your application
grows, without requiring you to think about how you might distribute the data and load.
Specifically, I have been looking at Amazon’s DynamoDB, Google’s Cloud Datastore and Cloud
Bigtable. I chose to take a look into these 3 databases because we have existing applications
running in Google and Amazon’s clouds and I can see the advantage these databases can offer. In
this post I’ll report on what I’ve learnt.
All three databases also provide strongly consistent operations which guarantee that the latest
version of the data will always be returned.
DynamoDB achieves this by ensuring that writes are written out to the majority of nodes before a
success result is returned. Reads are done in a similar way: results are not returned until the
record has been read from more than half of the nodes. This ensures that the result is the latest
copy of the record.
All this comes at the expense of availability: if a node becomes inaccessible shortly after a write
operation, the consistency of the data cannot be verified and the read may be blocked. Google
achieves this behaviour in a slightly different way, using a locking mechanism whereby a read
cannot be completed on a node until that node has the latest copy of the data. This model is
required when you need to guarantee the consistency of your data. For example, you would not
want a financial transaction to be calculated on an old version of the data.
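As a minimal sketch of the majority rule described above (the general quorum idea, not a description of DynamoDB's or Google's actual internals): with N replicas, a write acknowledged by more than N/2 nodes and a read answered by more than N/2 nodes must overlap in at least one node, so the read sees the latest write.

public class QuorumSketch {

    // True when both the write set and the read set contain a strict majority of the
    // replicas, which forces them to share at least one up-to-date node.
    static boolean isStronglyConsistent(int replicas, int writeAcks, int readAcks) {
        return writeAcks > replicas / 2 && readAcks > replicas / 2;
    }

    public static void main(String[] args) {
        System.out.println(isStronglyConsistent(3, 2, 2)); // true: the two sets must overlap
        System.out.println(isStronglyConsistent(3, 1, 2)); // false: a read may miss the latest write
    }
}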
OK, now that we've got the hard stuff out of the way, let's move on to some of the more practical
questions that might come up when using a cloud-based database.
Local Development
Having a database in the cloud is cool, but how does it work if you’ve got a team of developers,
each of whom needs to run their own copy of the database locally? Fortunately, DynamoDB,
Bigtable and Cloud Datastore all have the option of downloading and running a local development
server. All three local development environments are really easy to download and get started with.
They are designed to provide you with an interface that matches the production environment.
Querying
An important thing to understand about all of these NoSQL databases is that they don’t provide a
full-blown query language.
Instead, you need to use their APIs and SDKs to access the database. By using simple query and
scan operations you can retrieve zero or more records from a given table. Since each of the three
databases I looked at provides a slightly different way of indexing its tables, the range of features
in this space varies.
DynamoDB for example provides multiple secondary indexes, meaning there is the ability to
efficiently scan any indexed column. This is not a feature in either of Google's NoSQL offerings.
Furthermore, unlike SQL databases, none of these NoSQL databases give you a means of doing
table joins, or even having foreign keys. Instead, this is something that your application has to
manage itself.
That said, in my opinion one of the main advantages of NoSQL is that there is no fixed schema.
As your needs change, you can dynamically add new attributes to records in your table.
For example, using Java and DynamoDB, you can do the following, which will return a list of users
that have the same username as a given user:
// Build a query whose hash key values are taken from the given user object.
User user = new User(username);
DynamoDBQueryExpression<User> queryExpression =
        new DynamoDBQueryExpression<User>().withHashKeyValues(user);
List<User> itemList = Properties.getMapper().query(User.class, queryExpression);
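Note that Properties.getMapper() is assumed here to be an application-specific helper that returns an already configured DynamoDBMapper, and the User class is assumed to be annotated as a DynamoDB table whose hash key is the username; neither of those pieces is shown in the snippet itself.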
Distributed Database Design
The main benefit of NoSQL databases is their ability to scale, and to do so in an almost seamless
way. But, just like a SQL database, a poorly designed NoSQL database can give you slow query
response times. This is why you need to consider your database design carefully.
In order to balance the load across multiple nodes, distributed databases spread the stored data
across those nodes. The flip side of this is that if frequently accessed data sits on a small subset
of nodes, you will not be making full use of the available capacity.
Consequently, you need to be careful about which columns you select as indexes. Ideally you want
to spread your load across the whole table, as opposed to accessing only a portion of your data.
A good design can be achieved by picking a hash key that is likely to be accessed at random. For
example, if you have a users table and choose the username as the hash key, the load will likely
be distributed across all of the nodes, because individual users tend to be accessed at random.
In contrast to this, it would, for example, be a poor design to use the date as the hash key for a
table that contains forum posts. Most of the requests would be for records from the current day,
so the node or nodes containing those records would form a small subset of all the nodes. This
scenario can cause your requests to be throttled or to hang; the toy sketch below illustrates the
difference.
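The sketch simply places a record on a node by taking the hash of its key modulo the number of nodes; this is a deliberate simplification, not any vendor's actual placement scheme, and the key values are made up. It shows why a widely varying key such as a username spreads load, while a key that most requests share, such as today's date, concentrates it:

public class HashKeyDistributionSketch {

    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int nodeFor(String hashKey, int numNodes) {
        return Math.floorMod(hashKey.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        int nodes = 4;
        for (String username : new String[] {"alice", "bob", "carol", "dave"}) {
            System.out.println(username + " -> node " + nodeFor(username, nodes));
        }
        // Every forum post keyed by today's date lands on the same node:
        System.out.println("2016-05-12 -> node " + nodeFor("2016-05-12", nodes));
    }
}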
Pricing
Since Google does not have a data centre in Australia, I will only be looking at pricing in the US.
DynamoDB is priced on storage and provisioned read/write capacity. In the Oregon region, storage
is charged at $0.25 per GB/month, and capacity at $0.0065 per hour for every 10 units of write
capacity and the same price for every 50 units of read capacity.
Google Cloud Datastore has a similar pricing model, with storage priced at $0.18 per GB of data
per month and $0.06 per 100,000 read operations; write operations are charged at the same rate.
Datastore also has a free quota of 50,000 read and 50,000 write operations per day. Since
Datastore is a beta product, it currently has a limit of 100 million operations per day, but you can
request that the limit be increased.
The pricing model for Google Bigtable is significantly different. With Bigtable you are charged at a
rate of $0.65 per instance/hour. With a minimum of 3 instances required, some basic arithmetic
(3 × $0.65 per hour × roughly 730 hours in a month) gives a starting price for Bigtable of about
$1,423.50 per month. You are then charged $0.17 per GB/month for SSD-backed storage. A cheaper
HDD-backed option priced at $0.026 per GB/month is yet to be released.
Finally you are charged for external network usage. This ranges between 8 and 23 cents per GB of
traffic depending on the location and amount of data transferred. Traffic to other Google Cloud
Platform services in the same region/zone is free.
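To make the comparison concrete, here is a back-of-the-envelope calculation that uses only the rates quoted above; the usage figures (20 GB of data, 50 write units and 100 read units for DynamoDB) are made up purely for illustration:

public class PricingSketch {

    public static void main(String[] args) {
        double hoursPerMonth = 730;                         // approximate hours in a month

        // DynamoDB: storage plus provisioned capacity.
        double dynamoStorage = 20 * 0.25;                   // 20 GB at $0.25 per GB/month
        double dynamoWrites  = (50 / 10.0) * 0.0065 * hoursPerMonth;  // 50 write units
        double dynamoReads   = (100 / 50.0) * 0.0065 * hoursPerMonth; // 100 read units
        System.out.printf("DynamoDB ~ $%.2f per month%n",
                dynamoStorage + dynamoWrites + dynamoReads);

        // Bigtable: 3 instances minimum at $0.65 per instance/hour, plus SSD storage.
        double bigtableNodes   = 3 * 0.65 * hoursPerMonth;
        double bigtableStorage = 20 * 0.17;                 // 20 GB SSD at $0.17 per GB/month
        System.out.printf("Bigtable ~ $%.2f per month%n",
                bigtableNodes + bigtableStorage);
    }
}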
5. For each attribute of a relation, there is a set of permitted values, called the ______ of that attribute.
a) Domain
b) Relation